Friday, May 22, 2015

Reduction of Multinomial Data: Measuring Diversity with a Single Number

A decade ago I was confronted by three coworkers with a major data problem.  They were trying to make a meaningful statement on whether or not certain entities had become more racially diverse over time.  They were looking at numbers like this (these are 2010 total US census numbers):

The problem they faced was this: How do we show a change in racial statistics over time (which is generally expressed as a series of six or seven numbers) in a way that's accurate and easy to measure and communicate?

The analysts had a few good starts, but none of them really measured diversity:
  • Percentage of non-white people?
  • Ratio of minority to majority groups?
So I took a couple of days to think...


(skip if you're just interested in actual results)

After a few days of thinking, I came up with a metric, which I called the Effective Number of Races, or my diversity index, calculated as:

D = 1 / (p1² + p2² + ... + pn²)

Where each p is the proportion of the total population in a racial group.  Admittedly, this metric isn't perfect, but it has an appropriate reaction to diversity: homogeneous populations, where minority groups are few and small, get very low diversity scores; heterogeneous populations, with more and larger minority groups, get higher scores.
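The arithmetic is simple enough to sketch in a few lines.  Here's a minimal version (in Python rather than R, and with made-up populations, since the census table isn't reproduced here):

```python
def effective_number_of_groups(counts):
    """Inverse Simpson index: 1 / sum of squared proportions."""
    total = sum(counts)
    shares = [c / total for c in counts]  # convert raw counts to proportions
    return 1.0 / sum(s * s for s in shares)

# Made-up populations for illustration (NOT census figures):
homogeneous = [95, 3, 2]          # one dominant group, tiny minorities
heterogeneous = [40, 30, 20, 10]  # several sizable groups

print(round(effective_number_of_groups(homogeneous), 2))    # 1.11
print(round(effective_number_of_groups(heterogeneous), 2))  # 3.33
```

A perfectly even split among k groups scores exactly k, which is why I read the result as an "effective number" of races.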

This metric isn't necessarily a new idea; it's derived from other fields where it's necessary to measure heterogeneity in multinomial variables. Two examples: economics (effective number of firms: how many firms are REALLY active in this sector?) and political science (effective number of parties: how many parties are REALLY active in this parliamentary system?).


For comparative purposes, I can use this score to compare diversity among different populations and sub-populations.  For instance:

  • US Populations in general: 2.21
  • US House of Representatives: 1.53
  • US Senate: 1.13
  • US President: 1.0 (This is a joke, FYI)
This analysis clearly demonstrates, through a single number, something that we already know: the US Government is significantly less diverse than the nation as a whole.


This is good for demonstrative purposes, but what about predictive purposes?  Can diversity be predictive of political, social, or economic outcomes?

So let's start with politics.  Will a more diverse population lead to different political outcomes?  A priori theory would tell us that more diverse populations in the US would favor the Democratic Party (I won't go into the reasons why).

To prove this out I regressed the diversity index for each state against the percentage of Democrats in the state legislature.  I found that a 1 point change in the diversity index led to an 11 percentage point gain in Democratic votes. This is a significant correlation.

(BTW, to do this I had to calculate diversity values for each State, if you're interested in that data let me know)
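If you want to replicate this kind of regression yourself, the mechanics are simple.  Here's a minimal sketch (Python for portability; the six state rows are fabricated placeholders, not my real data):

```python
def ols(xs, ys):
    """Simple ordinary least squares: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Fabricated state-level placeholders (NOT the real values):
diversity = [1.2, 1.5, 1.8, 2.1, 2.4, 2.7]           # diversity index by state
pct_democrat = [30.0, 34.0, 37.0, 41.0, 44.0, 47.0]  # % Democrats in legislature

slope, intercept = ols(diversity, pct_democrat)
# slope reads as: percentage-point change in Democratic share
# per 1-point change in the diversity index
```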


One of the best uses for the diversity index involves its ability to measure diversity over time, and show how a population has changed in a single number.  I've plotted the diversity index at three points in time (1960, 2010, 2050) below, against a simple % White metric.  Two thoughts:
  • I'm personally shocked at how large this change is.
  • The absolute slope of the diversity index is much higher (and, here, more accurate) because it measures not just the reduction of the white proportion, but also the growth in multiple minority populations.


One of the cases I wanted to analyze has been covered heavily in the media: comparing the diversity of the population of Ferguson, MO to that of its police department.  Here are the values I calculated:

  • Ferguson MO: 1.85
  • Ferguson PD: 1.12
As we might expect, there is a huge differential in the diversity values.  Unfortunately, though, because Ferguson is overwhelmingly African American and the police department is majority white, the diversity index actually underestimates the effective differential here.

This doesn't occur often, but a good analogy would be South African apartheid.  The diversity index would not be a great metric in that case, because both values would read close to one: both the population and the leadership were homogeneous groups.  Only, in this case, one was homogeneously black and the other homogeneously white.  In essence: the diversity index is not a good proxy for measures of inequality.


The diversity index can serve as a powerful metric to measure differences between populations, subpopulations, change over time, and to predict the impact of diversity on various outcomes.  However, in instances of flipped majority groups (Apartheid South Africa, for instance) it is not a good proxy measure for inequality.

Wednesday, May 20, 2015

Halfway Book Review: The Signal and the Noise

This post should make my librarian wife quite happy, as it is a book review.  And actually, a book that she picked out for me to read, Nate Silver's The Signal and the Noise.

I'm currently only halfway through the book, but I have a lot of thoughts, and I'm not sure I'll get around to writing a blog entry when I finish.  To summarize, the book is an explainer for why some predictions are good and why some are bad.  The book is written from a simple, accessible perspective, such that even people with no background in statistics can understand the underlying concepts.

Silver walks the reader through several fields where predictions are made: predicting baseball player performance, predicting earthquakes, predicting the weather, predicting economic downturns, and predicting election results.  He takes us through the reasons that many fields are bad at predictions (earthquake research) and some are fairly good (weather predictions, by the government).

A big downside to this book, if you make predictions for a living and your job depends on it, is that there are a lot of stories of failed predictions: specifically, anecdotes about researchers who were confident in their predictions and confidence levels, but ended up being horribly wrong.  If you make predictions for a living, some of these stories are nightmare inducing, or at least sleep reducing.

A theme that runs through the first half of the book is that context matters a lot in predictions, and that forecasters need to understand things that aren't represented in data.  Examples might be how well a political candidate speaks in public, or whether a baseball player likes to party or takes care of his body.  These things may not be evident in the raw data, but will impact eventual outcomes.  (My guess is that this later turns into a pitch for Bayesian statistical methods, but we'll see.)

This take on context in political predictions takes me back to an earlier post on this blog, where I was critical of Nate Silver for making a potentially out-of-context prediction about the Kansas 2014 Governor's race.  My specific criticism was that a forecast giving Paul Davis a major edge seriously lacked Kansas social and historical context. Having read this book, I realize the political context was important (which Silver would undoubtedly agree with), and that I was simply the advantaged forecaster, with 30+ years of living in Kansas (a lot of contextual knowledge).

In summary, I definitely recommend this book to anyone who 1. won't be kept up at night by it and 2. wants to understand the way statistical predictions work.  My wife has read this book, and I think it gave her some great insight into the way I think about and approach problems.  But you don't have to take my word for it (SORRY, ALWAYS WANTED TO DO THAT).

Friday, May 15, 2015

Kansas Education Funding Analysis Part 1

Public education funding in Kansas is a huge mess.  The last ten years have seen multiple lawsuits, annual battles in the State Legislature on education spending, and massive changes in the State education funding formula.  

Why is this such a big deal in Kansas?  A couple of factors.  First, the Kansas Constitution has a section that says the State must adequately fund education.  Second, the Kansas legislature is largely made up of small-government fiscal conservatives, so it is relatively difficult to increase public spending for anything.  Twice in the last decade, citizens have filed lawsuits against the State to force an increase in public spending.  Twice they have (essentially) won.

At the heart of this issue are two main questions:
  • How much do we actually need to spend on schools to meet constitutional requirements?
  • If we fund schools more, will we get better results?


If you're just interested in whether or not spending matters to educational outcomes, skip to the conclusion section.

I have mentioned before that some of my first professional work, about a decade ago, was on a Kansas education "cost study" for the State auditor's office.  During that study I did quite a bit of work on relative teacher salaries, but also some work on the relationship between spending and education outcomes.  For that study we contracted with a couple of professors out of Syracuse; their study can be found here.  (That study goes into more detail, so if you're very interested in method, start there.)

Now, 10 years on in my career, I know I can do the spending-to-outcomes research on my own, specifically replicating their original research and answering some questions:
  • How have coefficient values and relationships evolved over the past 10 years?
  • Does increased spending continue to relate to better outcomes in Kansas education?


My methodology here is to replicate other education research on spending and outcomes; this is just part 1 of a potentially many-part series.  Please keep in mind:
  • I'm just one dude running this analysis while my other queries run.
  • The original study cost the State over a million dollars, and took up 6 full months of the State audit office's time (read: 25ish staff).
  • I'm going to build models slowly, as data comes available, so what I have for today is just a truncated model, but it's a start for a conversation.
I went to the State of Kansas Department of Education website looking for data, and found a nice data warehouse where I can run custom queries.  I pulled down data for the past three complete school years, did a bit of data cleaning, and pulled out what appeared to be the top variables.  I also bucketized district size, as had been done in the original study.

The model type here is a cost function, which estimates how much something will cost based on various input factors.  Certain measurable factors increase costs for a district: more kids in poverty, fewer kids overall (less efficiency of scale), and, theoretically, better performance on standardized tests should all cost more money.

Here's my variable list and an explanation:

  • PERPUPSEPDN: Our dependent variable, per pupil dollars spent.
  • AVGASS: Our most important independent variable, average assessment values for each district (how well do kids do on standardized assessments).
  • FREELUNCH: % of kids on free lunch.  Kids in poverty are more difficult to educate, so this increases cost.
  • TEACHSAL: Average salary of teachers in the district.
  • VALPUP: Per pupil property values.  This is part of the efficiency variables used in the original study.  Effectively, these efficiency variables measure factors that make it easier for a school district to spend money in inefficient ways.
  • ENROLLCAT: Categories for different school district sizes.
  • YEAR: Fixed effect for what year we are measuring.
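To make the setup concrete, here's a sketch of how a cost function like this gets fit (Python with numpy standing in for R; the district rows are fabricated placeholders, and I've left out ENROLLCAT and YEAR for brevity):

```python
import numpy as np

# Fabricated district rows: [AVGASS, FREELUNCH, TEACHSAL, VALPUP]
X = np.array([
    [72.0, 0.45, 48000.0,  61000.0],
    [80.0, 0.30, 52000.0,  90000.0],
    [68.0, 0.60, 46000.0,  43000.0],
    [85.0, 0.25, 55000.0, 120000.0],
    [75.0, 0.50, 49000.0,  70000.0],
    [78.0, 0.40, 51000.0,  82000.0],
])
y = np.array([9800.0, 10400.0, 9900.0, 11500.0, 10100.0, 10600.0])  # PERPUPSEPDN

# Add an intercept column and solve by ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# coef[1] is the AVGASS coefficient: the per-pupil dollars associated
# with a one-point gain in average assessment scores, all else equal.
```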


So, what did I find?  

The important AVGASS variable is positive, and approaching statistical significance, meaning that some kind of relationship likely exists. 

Percent of kids receiving free lunch (a proxy for poverty) shows that districts with a lot of kids in poverty still spend more to get the same results.  Also, "property rich" districts still outspend "property poor" districts.

Keeping in mind that this data is still a bit noisy and I'm not yet controlling for all of the factors of the original study, nor using as many years of data, this is quite promising.  I can generally conclude that spending is still significantly related to education outcomes.*

Next Steps:  I will try to acquire more years of data and more attributes, clean the current data set (from what I've seen, I likely have some data entry errors), and work on a better model.  Eventually, I may try to calculate actual spending levels required to hit specific outcome levels for different school districts.

*Quick footnote from above.  I'm purposely avoiding terms that insinuate causal linkages, largely because this analysis has not yet fleshed that issue out.  Do I think it's likely that spending more can create better results?  Yes.  But I also know how confounding some of these issues can be.  Specifically, intervening and co-linear variables mean that the relationship observed isn't as simple as spend more, get better.  It's likely that more affluent districts both tend to spend more money AND get better results for other reasons (less "unmeasurable" poverty, fewer other social problems, parents with de facto higher education levels, etc.).  My point: this doesn't prove cause, though through iterated analysis we should be able to move in that direction.

Tuesday, May 12, 2015

My Top R Libraries

A couple of weeks ago I posted a list of my top five data science software tools, which received quite a few pageviews and shares across the internet.  As someone told me, people just freaking love numbered lists.

Later that week I saw Kirk Borne post on Twitter regarding the top downloaded R packages, which was interesting, but a bit predictable.  The top packages contained elements that would be relevant across many fields that use R: packages like plyr, which has a lot of handy data-handling tools, and ggplot2, which is used to plot data.  That list was interesting, but I thought a post on data-science-specific tools would be useful for people just starting in the field.  So, in a non-random order, here they are:

  1. e1071:  Aside from having the weirdest name of the R packages, this one is probably among the most useful.  Because of the industry I'm in, the Support Vector Machine implementation gets a lot of use, and I seem to always have this package loaded.  Outside of the SVM, there are a few other useful functions, including a Naive Bayes classifier, as well as a couple of clustering algorithms.
  2. randomForest: I've talked about the problems of decision trees before on this blog, so if I'm fitting any type of tree it's generally a random forest.  If you don't know how these work: the algorithm fits several trees on randomized subsets of the data and features, and averages (bags) the results.  It's a great library, but it does contain one piece of "functionality" that annoys me: an informational message printed when the library loads, which is irritating to those of us who use R on a production server.  See image below, with RODBC loading without a message, but randomForest providing an unnecessary one.
  3. RODBC: Speaking of RODBC: hey, wait a minute, didn't I just complain that Borne's list contained a lot of general, non-data-science-specific packages? Don't care.  This one is just too useful. RODBC is the package I use to connect to databases for data pulling and pushing.  Best part?  Data comes into R with correct data types, which doesn't always happen when you import from flat files or CSVs.  (A quick note though: I use RJDBC for similar functionality in production, because we use MSSQL and RJDBC lets me use a proprietary *.jar driver.)
  4. topicmodels (RTextTools?): These are the two libraries I use for text mining.  Topicmodels provides a Latent Dirichlet Allocation (LDA) function that I use often.  To be completely honest, the two packages are complementary, and I can't remember which functions are contained in each (I generally load them both at the same time), but together they provide most of the tools I need to process text data, including creating document-term matrices.
  5. nnet: If I want to test whether a simple neural network might perform well, or out-perform another model specification, I turn to this package.  While there are many other packages providing various types of neural networks, this is the standard for a neural network in its simplest form.  I use it as a first test before turning to more complex and dynamic packages.

Honorable Mention:

The list above contains the packages I find most valuable, but the ones below also make my work life much easier:

For data processing: plyr
For Twitter connectivity: twitteR (requires that you acquire a twitter developer's account)
For Geo processing: rgeos,geosphere
For visualization/GUI functionality: rattle, grDevices, ggplot2

Friday, May 8, 2015

Have you tried logarithms?

So, cue XKCD reference.  

Randall Munroe is making fun of varying levels of technical knowledge and rigor in different fields.  It's a funny cartoon, and hopefully not too offensive to any readers of this blog involved with Sociology or Literary Criticism (I doubt there are many).

The irony here is in the first panel, playing off the seemingly ignorant question "Have you tried logarithms?"

In analytics, I actually say "have you tried logarithms?" quite a bit.  The reason is simple: to emulate different shapes of relationships that occur in nature, sometimes variable transformations are necessary.

 Background Information

Although logarithmic transformations are used across many modeling types, the most common is linear regression.  If you understand why we log variables in linear regression, it's easy to apply the idea to other models.  There are many ways to transform variables, but here's a quick primer focusing on logarithmic transformations.

  • linear-linear: Neither dependent nor independent variables are logged.  A one-unit change in X creates a change of (coefficient) units in Y.  This is your normal straight-line relationship.
  • linear-log: Only the independent variables are logged.  In this case, a one-unit change in log(X) creates a corresponding coefficient change in Y.  This looks like a normal log curve, and can be used in many diminishing-returns problems.
  • log-linear: Only the dependent variable is logged.  A one-unit change in X creates a coefficient change in log(Y).  This creates an exponential curve, and is appropriate for exponential-growth-type relationships.
  • log-log: Both dependent and independent variables are logged.  If using a natural logarithm, this can be interpreted as: a % change in X creates a coefficient % change in Y.  (Think calculus, the important relationship being that the derivative of ln(X) is 1/X.)  This is used often in econometrics to represent elastic relationships.
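A quick way to convince yourself of the log-log interpretation is to generate data from a known power law and recover the elasticity from the fitted slope.  A minimal sketch (Python):

```python
import math

# Generate data from a known power law: Y = 2 * X^0.7  (elasticity = 0.7)
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 0.7 for x in xs]

# Fit log(Y) = a + b*log(X) by ordinary least squares.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)

print(round(b, 3))  # 0.7 -- a 1% change in X produces a 0.7% change in Y
```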

Real Life Example

Remember my fitness tracker data from earlier in the week?  It struck me after that analysis that my activity didn't follow linear patterns.  For instance, it seemed like my increased activity on the weekend varied in a non-linear fashion, and that there was an elastic relationship between steps yesterday and steps today.

So I logged some variables and re-ran my regression.  Here's what I got:

Does it fit better? Well, R-squared as a metric has its issues, but we can use it to generally measure fit.  My original model had an R-squared value of 68%, whereas this model is at 71%.  A small improvement, but there's one problem: log transformation of a variable changes its variance, so it's inaccurate to compare the R-squared of the logged dependent variable to the unlogged one.  So what do we do?

I didn't document this step in detail, but you can create predictions using the logged equation in R, unlog those predictions, and then calculate an unlogged R-squared for the logged equation.  (If you have any questions on this or want code, ask in the comments.)  This unlogged R-squared for the logged equation sits at 70%.
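For anyone who wants the mechanics, here's a sketch of that back-transformation step (Python, with toy numbers standing in for my actual step counts and model output):

```python
import math

# Toy data in place of the real step counts and model predictions.
actual = [5200.0, 8100.0, 9400.0, 12800.0, 15100.0]
log_pred = [8.53, 9.02, 9.18, 9.45, 9.62]  # hypothetical predictions on the log scale

# Unlog the predictions, then compute R-squared on the original scale.
pred = [math.exp(p) for p in log_pred]
mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
# r_squared is now directly comparable to the unlogged model's R-squared
```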


In regards to Munroe's original joke, sometimes "Have you considered logarithms?" is the right question to ask.  In this case, I got a 2-point boost to R-squared, which isn't huge, but it's about 6% of the previously unexplained variance.  And I've seen entire news stories and lawsuits over lesser correlations than that.

Wednesday, May 6, 2015

Modeling Fitness Tracking and Messing Up Models

I've never met a metric that I couldn't screw up in some way after measuring, modeling, and focusing on it. 

This happens to me quite a bit at work, where someone will say "hey, we really need to work on our 'X' ratio."  Then everyone works on the X ratio for a few weeks which makes X go up, predictably.  This creates a couple of problems, though:
  • X gets better for not-normal, and often not-model-able reasons (maybe not a concern for the business, but certainly irritating to me).
  • X gets better sometimes at the expense of the business.  I should probably blog on this later, but I've seen many times businesses sacrificing long-term revenue to improve a single KPI.

But this is a fitness blog entry, so let's start with that.

So, good news on the fitness data front:
  • I now have over a month of data to model.
  • My activity level has increased over 30% since my first week of tracking.
Except that taken together, these two things don't necessarily lead to predictive data models.  But more on that later.  Here is a summary chart of progress over the weeks.

Data Modeling:

I'm trying to create a model that predicts a future day's activity level using the various data I'm collecting.  So far, here are the attributes I have for modeling:

  • steps: (how many steps I do today, dependent variable)
  • weekend: (is today a weekend?)
  • day_of_week: What day of the week is it?
  • week: What week is it?
  • steps_prior: How many steps did I take yesterday?
  • hours_three_prior: How many hours of sleep did I get over the last three nights?
  • sleep_hours: Hours of sleep last night.
  • travel: a binary for whether I was traveling that day or not.
  • sick: was I sick today?
Obviously with only 35 observations, I can't use all of these attributes, but I can start a small model. 
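With that caveat, the small model itself is easy to sketch (Python with numpy rather than R's lm(); the rows below are fabricated days, and I've dropped the week fixed effects for brevity):

```python
import numpy as np

# Fabricated days: [weekend (0/1), steps_prior, travel (0/1)]
X = np.array([
    [0,  6000, 0],
    [0,  7500, 0],
    [1,  7000, 0],
    [1, 15000, 0],
    [0, 14000, 1],
    [0,  8000, 0],
    [0,  9000, 0],
    [1,  8500, 0],
])
y = np.array([7500, 7000, 15000, 14000, 8000, 9000, 8500, 16000])  # steps today

# Intercept + least-squares fit; coef = [intercept, weekend, steps_prior, travel]
A = np.column_stack([np.ones(len(X)), X])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
```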

First I plotted some data just to get some ideas about trends.  A few interesting correlations found below.

I found that a simple correlation between steps yesterday and steps today was positive.  Does this imply that by simply moving more I can get an infinite feedback loop where I continuously up my activity level?  Probably not, but interesting.

Next I looked at my weekend variable (plot below).  I knew that on most weekend days I move more than on weekdays, but now I can quantify it.  Variation is also a lot larger on weekends.  I looked through some past data and realized that there are some relatively low-activity weekend days, generally when I'm traveling.  This is why I added the "travel" variable to my models.

So can I model this? Absolutely.  Here's a summary of the attributes in the model I created.

  • Weekend: I get (on average) 9,000 more steps on weekend days than weekdays.
  • Steps Prior: Ceteris paribus, for every step I take today, I will take .25 fewer steps tomorrow.  This doesn't match my correlation above, but it makes more a priori sense (if I move a lot today, I'll be sore or tired and move less tomorrow).  Why did the correlation flip?  Likely because the initial simple correlation was really measuring weekend days, or intra-week correlations (my general increase of activity over time).
  • Travel: On travel days I get 6,000 fewer steps.
  • Factor(week): Week-level fixed effects, generally increasing over time.  I probably need a better methodology here, but I had to add this factor to make any of the models really work.  Why?  Because, as the first graph in this post shows, I've greatly changed my behavior week-to-week in a way that makes models far less predictive if I don't control for it.
In common terms, weekends have always been better than weekdays, but my behavior over time has changed so much that current weekdays are almost as active as weekends were when I first started tracking.  This is similar to the effect we saw for the "steps prior" correlation mentioned above.

I think I have a reasonable start on a model, and some interesting insight into my activity level.  Tracking data this way is fun (to me) and also interesting.  Would love to get my hands on someone else's data to see if general relationships hold up.

On the other hand, much like a business KPI that gets distorted as a result of measurement and focus, my fitness data has changed so significantly since the beginning of this experiment, that it has become more difficult to model what my "natural state" activity level might be.