Later that week I saw Kirk Borne post on Twitter about the most-downloaded R packages, which was interesting, but a bit predictable. The top packages were general-purpose tools relevant across many fields that use R: packages like plyr, which has a lot of handy data-handling tools, and ggplot2, which is used to plot data. For people just starting out in the field, though, I thought a post on data-science-specific tools would be useful. So, in no particular order, here they are:
- e1071: Aside from having the weirdest name of any R package, this one is probably among the most useful. Because of the industry I'm in, I lean heavily on its Support Vector Machine implementation, and I seem to always have this package loaded. Outside of the SVM, there are a few other useful functions, including a Naive Bayes classifier and a couple of clustering algorithms.
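As a minimal sketch of what the e1071 interface looks like, here is an SVM and a Naive Bayes classifier fitted to the built-in iris data set (the data set and accuracy check are just for illustration):

```r
library(e1071)

# Fit an SVM to predict species from the four flower measurements
model <- svm(Species ~ ., data = iris)

# Predict back on the training data and check accuracy
preds <- predict(model, iris)
mean(preds == iris$Species)

# The Naive Bayes classifier uses the same formula interface
nb <- naiveBayes(Species ~ ., data = iris)
```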
- randomForest: I've talked about the problems of decision trees before on this blog, so if I'm fitting any type of tree it's generally a random forest. If you don't know how these work: the algorithm fits many trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and averages (bags) the results. It's a great library, but it does contain one piece of "functionality" that annoys me: an informational message printed when you load the library, which is a nuisance for those of us who run R on a production server. See the image below, where RODBC loads without a message but randomForest prints an unnecessary one.
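A quick sketch of fitting a forest (on iris again, purely for illustration), including the base-R workaround for the startup message mentioned above:

```r
# suppressMessages() silences the informational message printed on load
suppressMessages(library(randomForest))

set.seed(42)  # the forest is randomized, so fix the seed for reproducibility
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Out-of-bag confusion matrix and per-variable importance scores
rf$confusion
importance(rf)
```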
- RODBC: Speaking of RODBC: hey, wait a minute, didn't I just complain that Borne's list contained a lot of general, non-data-science-specific packages? Don't care. This one is just too useful. RODBC is the package I use to connect to databases for pulling and pushing data. Best part? Data comes into R with correct data types, which doesn't always happen when you import from flat files or CSVs. (A quick note though: I use RJDBC for similar functionality in production, because we use MSSQL and RJDBC lets me use a proprietary *.jar driver.)
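A typical RODBC round trip looks like the sketch below; note that "my_dsn", the table names, and the query are placeholders, not anything from a real setup — you would substitute a DSN configured on your own machine:

```r
library(RODBC)

ch <- odbcConnect("my_dsn")   # "my_dsn" is a placeholder ODBC data source name

# Pull: results arrive as a data frame with database-typed columns
df <- sqlQuery(ch, "SELECT * FROM sales WHERE yr = 2014")

# Push: write a data frame back out as a table
sqlSave(ch, df, tablename = "sales_copy", rownames = FALSE)

odbcClose(ch)
```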
- topicmodels (RTextTools?): These are the two libraries I use for text mining. topicmodels provides the Latent Dirichlet Allocation (LDA) function that I use often. To be completely honest, the two packages are complementary, and I can't remember which functions live in which package (I generally load them both at the same time), but together they provide most of the tools I need to process text data and create document-term matrices.
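A small sketch of the LDA workflow, building a document-term matrix with the tm package and then fitting topics; the four toy documents are stand-ins for real text data:

```r
library(tm)
library(topicmodels)

docs <- c("the cat sat on the mat", "dogs and cats are pets",
          "stock prices rose today", "the market closed higher")

# Build a document-term matrix from the raw text
corpus <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Fit LDA with two topics (k chosen purely for illustration)
lda <- LDA(dtm, k = 2)
terms(lda, 3)   # top three terms per topic
```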
- nnet: If I want to test whether a simple neural network might perform well, or out-perform another model specification, I turn to this package. While there are many other packages providing various types of neural network, this is the standard for a neural network in its simplest form: a single hidden layer. I use it as a first test before turning to more complex and flexible packages.
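The baseline fit is a one-liner; here is a sketch on iris, with a five-unit hidden layer chosen arbitrarily for the example:

```r
library(nnet)

set.seed(1)  # weights are randomly initialized, so fix the seed
net <- nnet(Species ~ ., data = iris, size = 5, maxit = 200, trace = FALSE)

# Classify and check training accuracy
preds <- predict(net, iris, type = "class")
mean(preds == iris$Species)
```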
The list above contains the packages I find most valuable, but the packages below also make my work life much easier:
- For Twitter connectivity: twitteR (requires a Twitter developer account)
- For geo processing: rgeos, geosphere
- For visualization/GUI functionality: rattle, grDevices, ggplot2
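As one small example from the geo-processing pair, geosphere makes great-circle distance a one-liner; the coordinates below are illustrative:

```r
library(geosphere)

# Points are given as c(longitude, latitude)
nyc    <- c(-74.0060, 40.7128)   # New York
london <- c(-0.1278, 51.5074)    # London

distHaversine(nyc, london)   # great-circle distance in metres
```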