Saturday, June 27, 2015

Anatomy of an Analysis: How your Toolkit comes together.

This is a follow-up to my original post on my Data Science Toolkit, which received a hugely positive response.  One of the questions I get from young analysts is "what tool should I use for this task?"  Generally, when they ask this question, they are weighing a couple of software products, both capable of completing the task, but one with a distinct advantage. So, the advice I give goes something like this:
Use whatever tool will get you to an accurate answer fastest, while putting you in the best position to follow up with the data in the future.
Rules like this are nice, but I think it's more interesting to show the process.  As it happens, my recent post on people driving to avoid potential sales tax involved every tool in my toolkit post.  So, let's go through my process as a demonstration for young analysts.

PROCESS

  1. I read the initial post by another analyst on Twitter. I was initially annoyed, but also curious.  I used Google Maps to calculate the distance from my home to Missouri, and then Excel for some "back of the napkin" calculations.  I realized: hey, I have something here, maybe I should write this one up?
  2. I needed a framework for the analysis. What tool is good for "what if" scenarios?  Excel.  So I ran some final numbers in Excel, based on miles from the border, driving costs, and time costs, and then created a chart of the cost curves. (A sketch of this break-even logic appears after this list.)
  3. After the framework was developed, I knew I needed to apply it, and this certainly had a geographic aspect.  I acquired a couple of Johnson County-specific shapefiles (Census Blocks; Major Highways) and imported them into QGIS (see the loading sketch below).
  4. From my framework analysis I knew my mileage cutoffs (4, 6, 9), but I needed a geographically defined way to implement them. For this, I used some custom R functions I developed previously that implement the haversine formula and allow for a backsolve (put in mileages, get coordinates back). A Python version of that backsolve is sketched below.
  5. Next I needed to code my shapefiles by the distance breaks.  For this, I used the Python engine internal to QGIS to create the maps and color-code them for data visualization. I also used this Python script to summarize census block data by distance to the border, so that I could say things like "70% of people in Joco live within nine miles of the border" (see the distance-band summary sketch below). I used Notepad++ for this code, and for all the other code created for this project.
  6. After completing my analysis, I wanted to quickly validate the data.  Luckily I have a haversine backsolve for SQL, which allowed me to validate this analysis.  Also: GIS shapefiles come with a companion *.dbf file that holds the attribute table, which contains the analytical information you need (lat/long, population, demographics, etc.) and can be easily imported into SQL. A sketch of this style of check, using SQLite, follows below. I validated, and everything checked out.
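
CODE SKETCHES

To make the steps above more concrete for new analysts, here are a few hedged code sketches, all in Python. None of this is the code I actually ran; the parameter values, file paths, and coordinates are placeholders, and the real work happened in Excel, R, QGIS, and SQL as described.

First, the step 2 framework. The original "what if" cost curves were built in Excel; this sketch reproduces the basic break-even logic, with an assumed sales-tax difference, mileage cost, and value of time that are illustrative only.

```python
# Minimal sketch of the step 2 "what if" framework (originally built in Excel).
# All parameter values below are illustrative assumptions, not the numbers
# used in the actual post.

TAX_RATE_DIFF = 0.0475   # assumed sales-tax difference between the two states


def trip_cost(miles_to_border, cost_per_mile=0.575, mph=30.0, hourly_value=15.0):
    """Round-trip cost of driving to the border: mileage cost plus time cost."""
    round_trip_miles = 2 * miles_to_border
    driving_cost = round_trip_miles * cost_per_mile
    time_cost = (round_trip_miles / mph) * hourly_value
    return driving_cost + time_cost


def tax_savings(purchase_price, tax_rate_diff=TAX_RATE_DIFF):
    """Sales tax avoided on a purchase of the given price."""
    return purchase_price * tax_rate_diff


def worth_the_drive(purchase_price, miles_to_border):
    """True if the avoided tax exceeds the cost of the trip."""
    return tax_savings(purchase_price) > trip_cost(miles_to_border)


if __name__ == "__main__":
    # Tabulate the break-even purchase price at a few distances,
    # mirroring the cost curves charted in Excel.
    for miles in (4, 6, 9):
        price = trip_cost(miles) / TAX_RATE_DIFF
        print(f"{miles} miles from the border: break-even purchase ~${price:,.0f}")

    # Example "what if": is a $2,000 purchase worth a 6-mile drive?
    print(worth_the_drive(2000, 6))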
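```

For step 3, loading the shapefiles into QGIS can also be scripted from its built-in Python console rather than done through the menus. The file paths and layer names below are placeholders.

```python
# Sketch: loading the county shapefiles from the QGIS Python console,
# where the `iface` object is already available. Paths are placeholders.
blocks = iface.addVectorLayer("C:/data/joco_census_blocks.shp", "Census Blocks", "ogr")
highways = iface.addVectorLayer("C:/data/joco_major_highways.shp", "Major Highways", "ogr")

if not blocks or not blocks.isValid():
    print("Census block layer failed to load; check the path.")
```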
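
The step 4 backsolve lives in custom R functions; here is a rough Python equivalent of the same idea, using the haversine formula for distance and the standard destination-point formula to go from a mileage and bearing back to coordinates. The border coordinates are illustrative, not the ones used in the analysis.

```python
import math

EARTH_RADIUS_MI = 3959.0  # mean Earth radius in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


def backsolve(lat, lon, miles, bearing_deg):
    """'Backsolve': given a starting point, a distance, and a bearing,
    return the lat/long that lies that many miles away."""
    d = miles / EARTH_RADIUS_MI           # angular distance
    theta = math.radians(bearing_deg)
    phi1, lam1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(math.sin(phi1) * math.cos(d) +
                     math.cos(phi1) * math.sin(d) * math.cos(theta))
    lam2 = lam1 + math.atan2(math.sin(theta) * math.sin(d) * math.cos(phi1),
                             math.cos(d) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


if __name__ == "__main__":
    # Points 4, 6, and 9 miles due west of an arbitrary point near the
    # Kansas/Missouri state line (illustrative coordinates only).
    for miles in (4, 6, 9):
        print(miles, backsolve(38.95, -94.61, miles, bearing_deg=270))
```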
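
The step 5 summary logic, stripped of the QGIS specifics: bin census blocks by the 4/6/9 mileage breaks and report the share of population in each band. The block records here are made up; in the real script they came from the shapefile attribute table.

```python
# Sketch of the step 5 summary: share of population by distance band.
# The records below are placeholders standing in for
# (block population, miles to the Missouri border).

blocks = [
    (1200, 2.1), (800, 3.8), (950, 5.2), (400, 7.4),
    (1500, 8.9), (300, 11.6), (650, 14.0),
]

cutoffs = [4, 6, 9]                      # mileage breaks from the framework analysis
labels = ["0-4 mi", "4-6 mi", "6-9 mi", "9+ mi"]

totals = {label: 0 for label in labels}
for pop, miles in blocks:
    for cutoff, label in zip(cutoffs, labels):
        if miles <= cutoff:
            totals[label] += pop
            break
    else:
        totals[labels[-1]] += pop        # beyond the last cutoff

population = sum(pop for pop, _ in blocks)
within_nine = sum(totals[label] for label in labels[:3])

for label in labels:
    print(f"{label}: {totals[label] / population:.0%} of population")
print(f"Within nine miles of the border: {within_nine / population:.0%}")
```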
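
Finally, step 6. The actual validation used my SQL environment's haversine backsolve against an imported *.dbf attribute table; this sketch shows the same style of check using Python's built-in SQLite, with a registered haversine function and placeholder rows.

```python
# Sketch of the step 6 validation, using SQLite in place of the SQL
# environment from the post. Idea: load the attribute-table rows into a
# database, register a haversine function, and re-check the distance coding.
# The inserted rows and reference coordinates are placeholders.
import math
import sqlite3


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (same formula as the earlier sketch)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 3959.0 * math.asin(math.sqrt(a))


conn = sqlite3.connect(":memory:")
conn.create_function("haversine_miles", 4, haversine_miles)

conn.execute("CREATE TABLE blocks (block_id TEXT, lat REAL, lon REAL, pop INTEGER, dist_band TEXT)")
conn.executemany(
    "INSERT INTO blocks VALUES (?, ?, ?, ?, ?)",
    [("A1", 38.95, -94.70, 1200, "0-4 mi"),
     ("B2", 38.90, -94.80, 800, "6-9 mi")],
)

# Reference point near the state line (illustrative only); recompute each
# block's distance in SQL and compare it against the coded band.
border_lat, border_lon = 38.95, -94.61
rows = conn.execute(
    """
    SELECT block_id,
           haversine_miles(lat, lon, ?, ?) AS recomputed_miles,
           dist_band
    FROM blocks
    ORDER BY recomputed_miles
    """,
    (border_lat, border_lon),
).fetchall()

for block_id, miles, band in rows:
    print(f"{block_id}: {miles:.1f} mi (coded as {band})")
```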
