Anyways, I decided to compile my top 5 software tool list for Data Science. This of course isn't an exhaustive list of tools I use, but it's a good start for any new data scientist. If a job applicant came to me with just this list of tools, as well as a great statistics background, I would likely hire them immediately.
- R: R is still my primary statistical and analytical tool. As I've mentioned before on this blog, we are running a server version that makes live decisions on transactions on the fly, processing and predicting on individual transactions in less than a second. I still find R to be a robust and diverse system for statistical algorithm programming, with a great array of machine learning algorithms and a large user base to consult when I have problems. Homepage.
- Python: While I don't use Python for statistical programming, there are still a number of non-statistical applications for Python. Generally speaking, when I feel like I'm breaking new ground doing something, I seem to be using Python somewhere in the process. Most recently I've used Python to spool up loops of large processes and to convert speech to text and dump results into a database for later analysis. Homepage.
- QGIS: Most of the data I encounter has a spatial element of some type, sometimes important and sometimes not. While both Python and R can process GIS data and display it, I prefer QGIS because it's designed specifically for GIS data and has a lot of out-of-the-box tools that help a GIS novice like myself. Also, it has Python internals, so if the functionality isn't there, I can script it out in a language I already know! It's like a smaller version of the industry standard ArcGIS, but it is free and open source. With ArcGIS's high cost, I have never been able to make a business case for it, especially with QGIS just a free download away. Homepage.
- SQL: Sure, NoSQL databases and Hadoop are all the rage right now, but 90% of the data that I have to analyze are in some kind of SQL database. Obviously, there are a lot of flavors; when I'm doing a project on my own, I tend towards PostgreSQL. I was also quite impressed with the new versions of MySQL Workbench, which is approaching the functionality of the exclusively commercial Microsoft SQL Server Management Studio.
- Notepad++: A text editor, seriously? In reality, I use vim quite a bit too when in Linux production, but Notepad++ is a tool I use on a daily basis. Why? Primarily because Notepad++ works across all the languages I use, and I can use it for my deploy scripts, SQL, R, Python, XML, and any other language I may encounter. Homepage.