Tuesday, May 12, 2015

My Top R Libraries

A couple of weeks ago I posted a list of my top five data science software tools, which received quite a few pageviews and shares across the internet.  As someone told me, people just freaking love numbered lists.

Later that week I saw Kirk Borne post on Twitter about the top downloaded R packages, which was interesting but a bit predictable. The top packages were ones that would be relevant across many fields that use R: packages like plyr, which has a lot of handy data-handling tools, and ggplot2, which is used to plot data. That list was interesting, but for people just starting in the field, I thought a post on data-science-specific tools would be useful. So, in a non-random order, here they are:


  1. e1071: Aside from having the weirdest name of any R package, this one is probably one of the most useful. Because of the industry I'm in, its Support Vector Machine (SVM) is especially valuable to me, and I seem to always have this package loaded. Outside of the SVM, there are a few other useful functions, including a Naive Bayes classifier, as well as a couple of clustering algorithms.
  2. randomForest: I've talked about the problems of decision trees before on this blog, so if I'm fitting any type of tree it's generally a random forest. If you don't know how these work, it essentially fits many trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and aggregates (bags) the results. It's a great library, but it does contain one piece of "functionality" that annoys me: it prints an informational message when the library is loaded, which is just annoying to those of us who run R on a production server. See image below, which shows RODBC loading without a message while randomForest prints an unnecessary one.
  3. RODBC: Speaking of RODBC... hey, wait a minute, didn't I just complain that Borne's list contained a lot of general, non-data-science-specific packages? Don't care. This one is just too useful. RODBC is the package I use to connect to databases for data pulling and pushing. Best part? Data comes into R with correct data types, which doesn't always happen when you import from flat files or CSVs. (A quick note though: I use RJDBC for similar functionality in production, because we use MSSQL and this allows me to use a proprietary *.jar driver.)
  4. topicmodels (RTextTools?): These are the two libraries I use for text mining. topicmodels provides a Latent Dirichlet Allocation (LDA) function that I use often. To be completely honest, the two packages are complementary, and I can't remember which functions are contained in each package (I generally load them both at the same time), but together they provide most of the tools I need to process text data and create Document-Term matrices.
  5. nnet: If I want to test whether a simple neural network might perform well, or out-perform another model specification, I turn to this package. While there are many other packages providing various types of neural networks, this is the standard for a neural network in its simplest form. I will use this as a first test before turning to the more complex and dynamic packages.
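
On the randomForest startup message from item 2: here is a minimal sketch of one way to keep a production log quiet. It assumes the package emits its banner through R's standard startup-message mechanism, which is what base R's suppressPackageStartupMessages() catches. The helper name load_quietly is my own invention for illustration and is not part of any package.

```r
# Hypothetical helper (not from any package): attach a package without
# printing its startup banner. Returns TRUE if the package was attached
# and FALSE if it is not installed.
load_quietly <- function(pkg) {
  suppressPackageStartupMessages(
    require(pkg, character.only = TRUE, quietly = TRUE)
  )
}

# Only fit the model if randomForest is actually installed.
if (load_quietly("randomForest")) {
  set.seed(42)  # results depend on the random bootstrap samples
  fit <- randomForest(Species ~ ., data = iris, ntree = 100)
  print(fit$confusion)  # out-of-bag confusion matrix
}
```

If a package prints its banner with plain message() calls instead, wrapping the call in suppressMessages() is the broader alternative, at the cost of hiding other messages too.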

Honorable Mention:

The list above contains the most valuable packages I use, but the packages below also make my work life much easier:

For data processing: plyr
For Twitter connectivity: twitteR (requires that you acquire a Twitter developer account)
For Geo processing: rgeos, geosphere
For visualization/GUI functionality: rattle, grDevices, ggplot2

