Monday, April 27, 2015

My Data Science Toolkit

People often ask me what software, programming languages and tools they need to learn to enter the analytics/data science field.  Some focus on products currently creating a lot of buzz like Hadoop or Apache Drill, but I spend the majority of my work time in other products that I consider to be much more important (I work in SQL more than Hadoop, no really!).

Anyways, I decided to compile my top 5 software tool list for Data Science.  This of course isn't an exhaustive list of tools I use, but it's a good start for any new data scientist.  If a job applicant came to me with just this list of tools, as well as a great statistics background, I would likely hire them immediately.

  1. R: R is still my primary statistical and analytical tool.  As I've mentioned before on this blog, we are running a server version that live-decisions transactions on the fly, processing and predicting on individual transactions in less than a second.  I still find R to be a robust and diverse system for statistical algorithm programming, with a great array of machine learning algorithms, as well as a large user base to consult when I have problems.  Homepage.
  2. Python: While I don't use Python for statistical programming, there are still a number of non-statistical applications for Python.  Generally speaking, when I feel like I'm breaking new ground doing something, I seem to be using Python somewhere in the process.  Most recently I've used Python to spool up loops of large processes and to convert speech to text and dump results into a database for later analysis. Homepage.
  3. QGIS: Most of the data I encounter has a spatial element of some type, sometimes important, and sometimes not.  While both Python and R can process GIS data and display it, I prefer QGIS because it's designed specifically for GIS data, and has a lot of out of the box tools that help a GIS novice like myself. Also, it has Python internals, so if the functionality isn't there, I can script it out in a language I already know! It's like a smaller version of the industry standard ArcGIS, but is free and open source.  With ArcGIS's high cost, I have never been able to make a business case for it, especially with QGIS just a free download away.   Homepage.  
  4. SQL: Sure, NoSQL databases and Hadoop are all the rage right now, but 90% of the data that I have to analyze is in some kind of SQL database.  Obviously, there are a lot of flavors; when I'm doing a project on my own, I tend towards Postgres. I was also quite impressed with the new versions of MySQL Workbench, which is approaching the functionality of the exclusively commercial Microsoft SQL Server Management Studio.
  5. Notepad++: A text editor, seriously?  In reality, I use vim quite a bit too when in Linux production, but Notepad++ is a tool I use on a daily basis.  Why?  Primarily because Notepad++ works across all the languages I use, and I can use it for my deploy scripts, SQL, R, Python, XML, and any other language I may encounter.  Homepage.

Wednesday, April 22, 2015

Why simple decision trees can suck.

On occasion I get the question of why I don't like using simple decision tree algorithms for production level predictions.  I ran into another example this week, and was thankful we tested results before putting a decision tree into production.  I thought it would be good to document the issue we encountered, for future reference for my team.

As background, here are some of the issues I've encountered with decision trees:

  1. They can have massive over-fitting problems (this is well documented).  
  2. The simplest versions (think the default rpart() in R) tend to predict in big buckets of homogeneous probabilities that don't fit the real world.... and don't do a great job of discriminating outcomes in a meaningful way.
  3. This methodology doesn't do well with bucketizing continuous variables, especially if the probability implications of the continuous variable are also... continuous.
  4. It can perform poorly with newly encountered attribute combinations.
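
To make points 2 and 3 concrete, here's a toy Python sketch (simulated data, nothing to do with our production process, and a hand-rolled stump rather than a real rpart tree): a depth-1 tree collapses a smoothly increasing success probability into two flat buckets, so every record in a bucket gets an identical score.

```python
import random

random.seed(42)

# Simulate a continuous predictor x in [0, 1] whose true success
# probability rises smoothly with x: p(x) = x.
n = 2000
xs = [random.random() for _ in range(n)]
ys = [1 if random.random() < x else 0 for x in xs]

def stump_predict(xs, ys, split=0.5):
    """A depth-1 'decision tree': one split, two leaf probabilities."""
    left = [y for x, y in zip(xs, ys) if x < split]
    right = [y for x, y in zip(xs, ys) if x >= split]
    p_left = sum(left) / len(left)
    p_right = sum(right) / len(right)
    return [p_left if x < split else p_right for x in xs]

preds = stump_predict(xs, ys)

# The true probability takes a continuum of values; the stump emits
# exactly two, so every record inside a bucket gets the same score.
print(sorted(set(round(p, 3) for p in preds)))
```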

We saw a combination of these four things come together to create extremely bad predictions in one of our new processes.  Here's a description of our process:
  1. 500K + records to be evaluated nightly.
  2. Training set is 100 million+ records.
  3. Our dependent variable is "success".
  4. Most records (90%+) will have a very low probability of success (<1%).
  5. The business payoff is in evaluating a smallish set of records with a >4% chance of success.
So, here's a chart I put together to evaluate the process.  Horizontal axis is our scoring method (which is a tree + a couple of other methodologies) and vertical axis is actual outcome.  The vertical bars represent volume in each bucket. 

So, this appears to predict well given our assumptions.  Most predictions are in the lowest probability bucket, and that bucket succeeds at approximately 0%.  Then we see a gradual shift up as our scoring model increases until... CRAP WHAT HAPPENED AT .08?

I had a feeling I knew what was going on.  We dug in, and found that there was a somewhat major issue with one of the buckets of the decision tree... for a combination of my four reasons above.  Mainly, the new data encountered involved some attributes not accounted for by the original tree, causing some predictions to be scored much higher than their actual propensity towards success.

We are still working through this issue, but we think using a random forest methodology should overcome these issues.  In the worst case scenario, we can just remove the tree and use our other trained models.

Two takeaways:

  1. Test everything.  Always.
  2. Be wary of decision trees, especially the simple ones.  We are moving to a random forest methodology, which we believe will fix our problem.
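
As a rough sketch of why the forest should help (again simulated data and toy one-split stumps, not our production models): averaging many bootstrapped trees with varying split points turns two flat buckets into a much finer-grained score.

```python
import random

random.seed(7)

# Same toy setup: success probability rises smoothly with x.
n = 1000
xs = [random.random() for _ in range(n)]
ys = [1 if random.random() < x else 0 for x in xs]

def fit_stump(sample_x, sample_y, split):
    """Leaf probabilities for a single one-split 'tree'."""
    left = [y for x, y in zip(sample_x, sample_y) if x < split]
    right = [y for x, y in zip(sample_x, sample_y) if x >= split]
    p_left = sum(left) / max(len(left), 1)
    p_right = sum(right) / max(len(right), 1)
    return lambda x: p_left if x < split else p_right

# 'Forest': 200 stumps, each fit on a bootstrap sample with a random split.
stumps = []
for _ in range(200):
    idx = [random.randrange(n) for _ in range(n)]
    bx = [xs[i] for i in idx]
    by = [ys[i] for i in idx]
    stumps.append(fit_stump(bx, by, random.uniform(0.1, 0.9)))

forest_preds = [sum(s(x) for s in stumps) / len(stumps) for x in xs]
single_preds = [fit_stump(xs, ys, 0.5)(x) for x in xs]

print(len(set(single_preds)), "distinct scores from one tree")
print(len(set(forest_preds)), "distinct scores from the forest")
```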

Monday, April 20, 2015

Kansas Election Fraud: Pt. 3 ... The end?

My prior posts on Kansas election fraud continue to generate quite a bit of traffic, so I thought I should probably revisit the issue and bring some kind of conclusion.  My prior posts left quite a few open questions, and I had a curiosity about how well other metrics may predict voting outcome by precinct.

If you're new to this blog or my posts on election fraud, my two prior posts serve as a good primer, found here and here.

A quick thanks to one commenter on my blog, jimrtex.  This commenter provided great commentary on prior results, and a thick description of underlying precinct-design factors, descending from demographic/historical structures, that could cause our found correlation.  I recommend reading his comments as background for the new variables I defined below.

Summary Results

This post may get a little nerdy in a second, and while I recognize some of us live in the world of regression coefficients, I also understand that others just want to know what I found.  Here's a short synopsis:
  1. I changed data sets to the 2008 Presidential election.  This is valid because the correlation of interest (republicans doing better in larger precincts) still holds up.
  2. Nearly every demographic covariate I threw at the equation was statistically significant and more important than the "number of voters" variable.
  3. If I create a large predictive model, using other variables such as population density, county size, and relationships between local precincts, the number of voters voting in the precinct becomes statistically insignificant. 

Ok, those points may still seem a little technical, but here's what they mean: The original analysis was looking at a very small relationship in a world where much more important relationships exist.  And if we look at the data in a way where we simultaneously account for multiple factors, the correlation from Clarkson's original commentary is simply non-existent.
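
For intuition on how that works, here's a toy simulation (made-up data, not the Kansas file): x and y both depend on a hidden factor z, so they correlate marginally, but the correlation disappears once z is partialed out.

```python
import random

random.seed(1)

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def residuals(y, x):
    """Residuals of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# z is a hidden demographic factor driving both observed variables.
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 3) for zi in z]          # think "precinct size"
y = [zi + random.gauss(0, 3) for zi in z]          # think "vote share"

marginal = corr(x, y)                               # looks real...
partial = corr(residuals(x, z), residuals(y, z))    # ...and vanishes

print(f"marginal r = {marginal:.3f}, partial r = {partial:.3f}")
```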


And now on to the technical analysis.  I switched to the 2008 presidential election in Kansas because I found a file with additional demographic attributes--the base relationship still holds up.  I also added some attributes that compared individual precincts to their county of membership as a whole.  Here's a list of variables defined:

  • pres_perc_r: Percent voting Republican for President.
  • t_08: total votes in 2008 presidential elections
  • perc_voting: percent of voting age population voting in election
  • aland10: land area of the precinct
  • perc_vote_age: percent of precinct that is voting age
  • area_to_county: relationship of precinct size in area to average precinct in county
  • pop_county: population of the county of membership
  • change: population change of precinct between elections
  • perc_change: change as percent of initial population
  • pop_dense: density of population
  • perc_county: percent change growth in relation to other precincts in county
I won't delve deeply into a priori theory on each variable, but, generally speaking the variables were designed to measure underlying demographic factors, as well as precinct design concerns as brought up previously by commenters on this blog.

So, first things first, Clarkson's initial simple correlation, did it hold up?  Absolutely, and here's the evidence.

But what else correlates to the % that votes Republican in an election?  A lot of our variables, it turns out.  Here's a correlation matrix.  And notice that size of precinct is actually the lowest absolute correlation value.  Also of note, many of them are cross-correlated with size of precinct, pointing towards multicollinearity.

So, what happens if I throw some of these variables that a priori make sense at a model?  Number of voters in the precinct is no longer significant, but other variables end up highly significant. One variable of note is the percent voting: this one is important, because as the percent of voters voting increases, the percent Republican increases significantly.  This is partial verification of a prior concern I had regarding higher turnout in Republican districts being the underlying cause of Clarkson's correlation.


Statistics: I spent quite a bit of effort here demonstrating that the small correlation found by Clarkson and previous authors is most likely due to other correlated variables.  These variables generally measure demographic factors and precinct design concerns (and correlate conceptually with the ideas from commenters on this blog and elsewhere).

Politics: Given the statistics of this, it is still of concern to me that a statistician goes to the media with an anomaly that is almost completely untested, and (the way it was reported in the media) can lead to massive accusations of fraud.  Given the nature of our electoral democracy, this has a tendency to call the entire system into question, and is certainly a reminder for all statisticians to be very careful in reporting results.

Tuesday, April 14, 2015

Fit Data #2

Why do I need to track my activity and sleep levels so closely?  Well.. here's some insight from a conversation I had today:

Other director: "Hey Levi, we're having a meeting on the online project and would like your opinion.  Could you join us for a few minutes?"
Me:  "Um, well.. I'm eating cheese fries right now... can you hold.. ah hell, can I bring the cheese fries with me?"
Other director: "I guess...."

I mainly run and stay active because of my massive food habit. But for my activity level, I would weigh 300 lbs.  So, I need to nerdily track my activity level, at least for the near future.  I promise I won't post every week on this.

First the good news from this week:

  • I averaged 3,000 more steps than last week.
  • I slept, on average, 20 minutes fewer each night. I didn't feel tired though, so I think this metric may be more about consistency (low variance) than about mean sleep hours.

Now back to the nerdy part.  I only have 18 data points, so I really shouldn't be modelling this data.  But I will anyways, because I'm excited about it, and it's what I do.  

Anyways, down the road I want to build out a fuller, multi-factor model, but for now I wanted to look at two factors: prior night sleep and "weekend."  Here's what I learned:

  1. Prior night sleep:  positively correlated, meaning the more I sleep, the more I move the next day, though there isn't enough data for statistical significance, and the realized impact is only 500 additional steps per incremental hour of sleep.
  2. Weekend:  I move a lot more on weekend days.  Statistically significant.  I end up moving almost 5700 steps more on the weekend than on normal weekdays.  
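
For the curious, the weekend coefficient in a regression on a 0/1 dummy is exactly the difference in group means, so the model can be sketched in a few lines (these step counts are made up; my real numbers are in the charts).

```python
# Made-up daily step counts; 1 = weekend day, 0 = weekday.
days = [
    (0, 14200), (0, 13800), (0, 15100), (0, 14600), (0, 13900),
    (1, 19800), (1, 20400),
    (0, 14900), (0, 14100), (0, 15300), (0, 13700), (0, 14400),
    (1, 21100), (1, 19500),
]

weekend = [s for d, s in days if d == 1]
weekday = [s for d, s in days if d == 0]

# The OLS coefficient on a 0/1 dummy is exactly this mean difference.
effect = sum(weekend) / len(weekend) - sum(weekday) / len(weekday)
print(f"estimated weekend effect: {effect:.0f} extra steps")
```
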
So, this is my first post with any kind of fit tracker data analysis.  I'm closing in on my 20,000 step per day average, which will be good to have.  The initial results aren't surprising, but I'm still hoping to derive a counter-cyclical workout schedule from future data.

Model specification with parameter estimates.

Weekend partial dependence, (1) represents weekend, (0) represents weekdays.
Plot of hours slept versus steps, showing positive correlation.

Wednesday, April 8, 2015

Kansas Election Fraud pt. 2

Yesterday's post on election fraud issues in Kansas got quite a bit of response, so I thought I would follow up with an additional analysis. Also, R was still open, with my data frame loaded, when I got up this morning, so... what the hell.

Oh, and this analysis moves quickly into some fairly technical areas, but I think most of them can be understood in general terms.  If you have any methodological questions please post in the comments.

Yesterday's Analysis

My biggest contention from yesterday's model (which was really my implementation of Clarkson's model) is that underlying, unmeasured demographic terms were likely causing the correlation.  So, my goal here (and over a series of posts in the future, theoretically) is to systematically look at other possibilities.

Also, the R-squared metric (a common metric for how good a regression model is) was VERY low (about 0.02).  The model is still significant; it's just not very predictive.


I only had one additional variable that I could use in my data frame, which was the county each precinct was located in.  Because election results are highly variable by county, and counties are also not homogeneous in demographic factors, county can be used as a proxy for these demographic and regional variations.  

In this case, if demographic terms that vary significantly by county are responsible for the precinct size / Republican vote share correlation, we would expect that introducing county into the equation would decrease the importance of our precinct size variable.  That's not what happened. 

New Model

Methodology: I created the same model as yesterday, but added in a "fixed effect" for the county of each precinct.  That's it.  And by the way, I shouldn't use the term "fixed effect" because it's confusing and every statistician uses it differently.

Here's a summary of what happened in the models:

  1. The R-squared shot through the roof (this is expected) from 2% to 46%.
  2. The effect (parameter estimate) of the precinct size variable increased significantly (not expected).
  3. The statistical significance of the precinct size variable increased significantly (not expected).
R output:

Essentially, I expected the county-based control would decrease the correlation of interest, but it actually increased in importance.  What this generally means, in simple terms, is that the control variables cleared up some of the unexplained variance, and allowed a clearer view of the data, in which precinct size is even more important.
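
For reference, a county fixed effect amounts to giving each county its own intercept, which (for the size coefficient) is equivalent to demeaning both variables within county before regressing.  Here's a toy Python sketch on simulated data (nothing here is the real Kansas file):

```python
import random
from collections import defaultdict

random.seed(3)

# Simulate precincts in 20 counties, each county with its own baseline
# Republican share; precinct size has a true within-county effect of 0.02.
precincts = []
for c in range(20):
    base = random.uniform(0.3, 0.7)
    for _ in range(50):
        size = random.uniform(0, 10)          # precinct size (scaled)
        share = base + 0.02 * size + random.gauss(0, 0.05)
        precincts.append((c, size, share))

# Within transformation: subtract each county's means (this is what a
# county fixed effect does to the size coefficient).
sums = defaultdict(lambda: [0.0, 0.0, 0])
for c, size, share in precincts:
    sums[c][0] += size
    sums[c][1] += share
    sums[c][2] += 1

dx, dy = [], []
for c, size, share in precincts:
    sx, sy, cnt = sums[c]
    dx.append(size - sx / cnt)
    dy.append(share - sy / cnt)

slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
print(f"within-county size coefficient: {slope:.3f}")
```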

What does this mean?

The weird correlation is still there, and is stronger when we clear up some other exogenous factors. I'll need to dive into additional data to figure out the real root cause.

Also, I thought of an additional possibility, though it's still in the early phases of ideation.  Here are the basics: 
  • There's a potential endogenous relationship between % republican voters and number of voters. 
  • If precincts are formed based on census/geographical tracts, then there's a key intervening variable of turnout.
  • If conservative turnout is statistically "better," then having a more Republican district could cause our independent variable (voters in precinct) to rise significantly.
Still thinking through this one though, any input is appreciated.

Tuesday, April 7, 2015

Kansas Election Fraud

Statisticians need to be careful in the way we communicate with the public.

Last week, a Wichita State University statistician filed a lawsuit regarding the Kansas 2014 election.  She is trying to get access to vote machine tallies, to rule out the potential of voter fraud.  Here's a link to the article in the Wichita Eagle.

The statistician, who works as a QA engineer, has found some "voting anomalies"... essentially that republicans receive larger than expected vote shares in larger precincts.  Keep in mind that QA (Quality Assurance) engineers are trained to look for anomalies, things you wouldn't expect in data, and make a big deal out of them so that systems don't fail. 

Of note from the article:

“This is not just an anomaly that occurred in one place,” Clarkson said. “It is a pattern that has occurred repeatedly in elections across the United States.”

The pattern could be voter fraud or a demographic trend that has not been picked up by extensive polling, she said.
On its face, as a research statistician, this doesn't seem like that big of a deal.  Just a researcher looking into anomalies.  I did think that putting "voter fraud" out there as a possibility seemed a little aggressive at this point, but I didn't see this as an issue that would get a lot of attention.  But consider the political climate of Kansas:

  1. I've posted on this before, but the political climate in Kansas right now is really tense, largely due to a highly contested election.
  2. Same post from before, but progressives are especially upset because they just lost an election, which their leaders told them would be a fairly easy win.
  3. I've seen the article above posted by many progressive friends as fodder, evidence, and proof that, statistically speaking, Brownback was probably re-elected due to election fraud.
This seems to be heating up into a massive conspiracy theory.  But let's be calm and analyze:

What we know

First, from Clarkson's comments, she has no evidence of voter fraud.  She has found a small statistical anomaly that exists nationwide, and wants to use Kansas data to verify that it isn't due to fraud or voting machine issues.

But what is that anomaly? 

The anomaly is that after a certain size threshold (500), there is a positive correlation between precinct size and percent republican votes.  

Why is that an issue?

Clarkson is a QA engineer and in anomaly detection mode.  She's starting from an a priori premise that precinct size should not determine results, and thus, statistically significant correlations should not exist.

Is Clarkson's analysis of the data correct?

Though I disagree with her conclusions, I tend to think the mechanics of her analysis are correct.  In fact, I was able to replicate it using the 2010 Kansas gubernatorial election.  The relationship is weak, statistically speaking, but statistically significant, which indicates that something non-random is happening in the data. Regression stats and visual plot below.
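
For anyone who wants to try the replication, the mechanics are straightforward; here's a sketch with placeholder numbers (the real analysis used the actual precinct file, and these values are invented): drop the smallest precincts and regress Republican share on total votes.

```python
def simple_ols(x, y):
    """Slope and intercept of y on x, plus Pearson r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5

# Placeholder data: (total_votes, republican_share) per precinct.
precincts = [(400, 0.52), (650, 0.50), (900, 0.53), (1200, 0.55),
             (1500, 0.54), (2100, 0.57), (2600, 0.58), (3000, 0.57)]

# Clarkson drops the smallest precincts; apply the 500-vote threshold.
big = [(v, r) for v, r in precincts if v > 500]
slope, intercept, r = simple_ols([v for v, _ in big], [r for _, r in big])
print(f"slope = {slope:.6f}, r = {r:.2f}")
```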

And here's what a sample of Clarkson's work with an Ohio example looks like:

So the correlations exist, and therefore people must be acting nefariously in those large districts, right?

Here we go.  Absolutely not.  And here's why: covariates.  In the real world, multiple variables often correlate with one another, causing us to find relationships that are really measuring something else.  Clarkson's comments allude to this when she talks about potentially underlying and undetermined demographic factors.  

There are many what-if's here.  What if other variables also correlate with precinct size?  Age, Race, Wealth, Urbanity, etc, etc, etc.  In these cases we are latently measuring other factors through measuring precinct size.   Another specific issue, what if more conservative populaces somehow push for fewer, larger precincts and less division?  Keep in mind, this is a WEAK relationship we are trying to explain.

The authors of this analysis already drop the smallest precincts as a whole, because they tend to be more rural, and thus more Republican.  In essence, the authors tacitly admit underlying demographic factors can impact the correlation between precinct size and voting behavior.  We haven't gone through the steps to exclude all other demographic factors, so why are we making vague accusations of fraud and making a big deal of this in the press?  

This is partially speculation, but it is much more likely that the correlation found is due to underlying demographics and other covariates, rather than something nefarious going on with voting machines.


This is an interesting area of research, and I will likely post on this again when I get access to more data.  I think it is quite likely that this is due to underlying demographic drivers.

I absolutely think Clarkson should have access to the vote machine records, as well as the software itself for testing.  But the way this has been handled by Clarkson and the press at this point is pretty reckless.  In the political environment that is Kansas, where many believe the election was stolen or rigged, this "evidence" is being treated by many people as further proof of fraud by the administration, while we really have no evidence of that.

Monday, April 6, 2015

Fit Data #1

As I mentioned in a prior post I started using a fitness tracker to track activity, initially using my phone and Google Fit (which sucked, btw) and now using a Garmin VivoFit 2.  I started the Garmin a little over a week ago, so no fancy models yet.. BUT I do have some initial data to look at.

First A Review

The Garmin is awesome, and seems fairly accurate.  One bonus is that when I'm inactive in the office, it beeps and tells me to get up and move.  This is also annoying, and somewhat embarrassing as many of my coworkers have them as well, so they know that I've been lazy.

The Garmin only misses two major pieces of my activity: the exercise bike in the basement (I average 254 miles a week) and the stairclimber at the gym (I average 400 flights of stairs a week).  These misses aren't a huge issue, but I may figure out a way to add them back in at a later date.

Now To The Data

Two major metrics I'm analyzing are steps and sleep.

First sleep:  This is a huge area of interest to me. To be completely honest here, I have felt tired since early December, and have wondered if I'm getting older, not sleeping well lately, or if the flu I had around Thanksgiving is still impacting my body.

Anyways, here's my number 1 lesson: my sleep is a lot more variable than I would expect.  Basically anywhere from under 7 to almost 9 hours a night.  Not enough data to draw any conclusions, but maybe more regular sleep would help?

Now steps:  My steps are a little more variable than I thought, and my mid-week totals were disappointing.  I think I'll work on moving more during the work week, and then see if there's anything additional I can model.  Also, I want to set a goal of 20,000 steps a day.  

Note that this weekend's steps totals were far below last weekend's, because we were travelling for Easter this weekend. 

Finally, the data together:  Not enough to draw any correlations from the data yet, but of note:

I both move less and sleep less during the work week.  Is this due to being less tired from less activity, or does stress impact sleep negatively, which makes me move less?  Only time and a lot more data will tell.

I'm going to keep looking at the data, and see if I can develop an algorithm to predict a "next" day's activity from current data, and additionally to correct my mid-week cycle of sleeping poorly and not being active.

Right now, though, I'm a little worried I'll be sleeping at my desk this afternoon, as last night was the lowest sleep time recorded, combined with an above average activity day yesterday.