Monday, August 31, 2015

Kansas Election Fraud Part 5

I thought about this data far too much over the weekend.  On Friday I received data by precinct for the State of Kansas 2014 Governor's race, which had never been available before.  Here's my post on that. A reminder of where we were on this data on Friday:

  • We acquired precinct level data for all 105 counties.
    • 101 counties were in analyzable (it's a word) form.
    • 2 additional counties were parsed manually.
    • 2 other counties (Sedgwick, Shawnee) were still in progress, due to data availability issues.
  • We validated Clarkson's results, specifically for Johnson County.
  • We found that the remaining 101 counties have a low density of the type of precinct Clarkson analyzed (>500 voters).
Let's start with a deeper dive into Johnson County.

REVISITING JOHNSON COUNTY

In Friday's post, I validated Clarkson's correlation using the new data from the 2014 Governor's race, specifically for Johnson County.  From her prior analyses, Clarkson's correlation is as follows:

Counter-intuitively, after 500 voters, precinct size correlates positively with Republican share of the vote.

The idea behind this is that most precincts under 500 voters are rural, Republican-leaning districts, whereas beyond 500 voters she would expect the relationship between precinct size and Republican % to level out, or to turn more Democrat-leaning, because the largest precincts would be at the urban core.  I have explained the problems with this logic in several different blog posts.  Effectively, precinct creation is not a randomized process, so many covariates, demographic and otherwise, come into play.  I even demonstrated how Clarkson's analysis dries up when we expose the data to those covariates.

But is there a more visual way to demonstrate how this works?

Turns out, yes.

Focusing on Johnson County, I highlighted all of the precincts with 500 or more voters in the 2014 Governor's election, and then classified each one into larger size buckets.  Among those 500+ voter precincts, the smallest are the ones closest to the urban core, while the largest are in the outer suburbs.  This map demonstrates that relationship:



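(For reference, the bucketing step itself is simple.  Here's a minimal sketch in R, assuming a hypothetical data frame jc with one row per Johnson County precinct and a voters column; the column names, cut points, and numbers are mine, for illustration only.)

# Hypothetical precinct data: one row per precinct, with total voters
jc <- data.frame(precinct = paste0("P", 1:6),
                 voters   = c(310, 540, 760, 980, 1400, 2100))

# Keep the 500+ voter precincts Clarkson's analysis targets,
# then classify them into larger size buckets for mapping
jc_large <- subset(jc, voters >= 500)
jc_large$bucket <- cut(jc_large$voters,
                       breaks = c(500, 750, 1000, 1500, Inf),
                       labels = c("500-750", "750-1000", "1000-1500", "1500+"),
                       include.lowest = TRUE)
table(jc_large$bucket)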
Johnson County is weird though, so we don't necessarily know that the gentrified areas closest to the urban core are going to be the most liberal.  So I mapped this as well, validating that the areas closest to the urban core tend to be the most liberal, with the most conservative areas outside of the 435 loop. 

What does this mean?  It validates two things:
  • Precinct creation is not random, and the larger precincts within Johnson County do not lie closest to the "democrat" urban core, or randomly throughout the region, but instead in rural and near-urban suburbs, in direct opposition to Clarkson's hypothesis.
  • Those suburban areas (outside of the 435-loop) tend to also be more conservative.
In essence, the primary hypothesis of Clarkson's analysis is flawed, because these a priori relationships exist.

SEDGWICK COUNTY

I acquired one more county of data today, this time Sedgwick County, which has plenty of 500+ voter precincts.  If you're from outside of Kansas, Sedgwick County is where Wichita is located.  This county also validated Clarkson's correlation.  Here are our outputs, first graphed, then the R output.




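If you want to run this kind of check yourself, here's a minimal sketch of one way to do it in R.  It assumes a hypothetical per-precinct data frame; the column names and toy numbers are mine, not the actual Sedgwick County figures:

# Hypothetical per-precinct results for one county
precincts <- data.frame(voters    = c(520, 610, 700, 850, 900, 1100, 1300, 1500),
                        rep_share = c(0.48, 0.47, 0.50, 0.52, 0.51, 0.55, 0.56, 0.58))

# Clarkson's claim concerns precincts beyond 500 voters:
# regress Republican share on precinct size and look at the slope
big <- subset(precincts, voters > 500)
fit <- lm(rep_share ~ voters, data = big)
summary(fit)                          # a positive, significant slope reproduces the "anomaly"
cor.test(big$voters, big$rep_share)   # the same relationship as a simple correlation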
I haven't found a lot of additional data on Sedgwick County yet, but I will post a similar deep dive to the Johnson County one as I find additional data.

CONCLUSION

I will keep posting on this as more data comes in, but two big takeaways from today:
  • Sedgwick County fits the pattern developed by Clarkson.
  • Taking a deep dive into Johnson County, we validated my prior analysis: beyond 500 voters, larger precincts are actually LESS likely to be in the urban core, and more likely to be in the conservative outer suburbs.





Friday, August 28, 2015

Kansas Election Fraud, Part 4

I was minding my business yesterday afternoon, when I was approached by someone on the internet with data I might be interested in.  I was half-expecting Ashley Madison or other scandalous data, but what I was presented with was much, much juicier: election data.  Kansas Governor's Race 2014.  This is interesting data, given the current debate over election fraud in Kansas.  Also, for some reason, this data has not yet been posted to the Secretary of State's website (I would soon find out it was submitted by counties in poor formats, so maybe this is explainable).

If you're curious about the history of my analysis of Kansas election fraud, I've written on it a few times below.  The basics of it are this:

A WSU National Institute for Aviation Research statistician says she found an anomaly in voting records: precinct size correlates with Republican vote share.  This pattern has been found in several elections and states, going back several years, and she thinks it may be evidence that voting machines are unreliable (potential systematic election fraud).  I replicated her analysis fairly easily, and her statistics are "correct."  That said, I disagree with her, on the basis that precinct size is not the result of a stochastic-random process, that it correlates with demographic factors, and that it is endogenous to turnout differences in heavily Republican districts.  
And my prior posts for ongoing reading.

Part 1
Part 2
Part 3

THE DATA

The data came from an internet source, but was a forward from the Secretary of State's office.  I believe it is accurate; there were a couple of times I missed a control total here and there, by small amounts, and the misses were uncorrelated with any one candidate.  I was able to cross-reference it to data on a handful of county websites and to national aggregate data.

The data itself was in pretty poor form.   101 counties were aggregated in a good spreadsheet.  Four counties were in PDFs:  Shawnee, Sedgwick, Wyandotte, Johnson.  These are the most populated counties in Kansas, so this is an issue.  Two of the counties were easily recoverable, two were not:
  • Shawnee:  not an OCR'd PDF, so I ignored it for now.  Will seek a solution.
  • Sedgwick: wrong data? Not at a precinct level.  Will follow up (especially because Clarkson's request is in this county).
  • Wyandotte: OCR'd PDF.  I scraped the data and added it to the 101 counties.
  • Johnson: OCR'd PDF. I scraped the data and added it to the 101 counties.

INITIAL RESULTS

The methodology here has been well documented in my other posts, so I won't go into details.  Here are the basics of what I found.  

  • My first attempt to replicate Clarkson's prior methodology on the Governor's race did not show the correlation.  In this case I was looking at the other 101 counties.  There weren't many big precincts in those counties though, so the failure to replicate was somewhat unsurprising. Here's the output from that:


  • My second attempt at replication looked at only Johnson County, which has more precincts of 500 or more voters than the original 101 counties combined.  In this case, I confirmed Beth Clarkson's correlation, effectively confirming that the "anomaly" was also present in the 2014 Governor's race:

  • So at this point I've confirmed Clarkson's results in fairly nerdy R outputs.  What does that actually look like?  Here's a prettier chart of that correlation:

  • And more confirming evidence from Wyandotte County.


  • And one final piece of data I found interesting.  The Wyandotte County data had some general turnout-by-precinct information.  When I dug in, I found that the more Republican precincts had much better turnout.  This isn't hugely surprising, given conservative voter bases, but it does lend some credence to some of the theories surrounding the Brownback win, despite polls favoring Davis.



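That turnout observation is just a simple correlation check.  A minimal sketch of it, with a hypothetical data frame of Wyandotte precincts (the column names and numbers are made up for illustration):

# Hypothetical Wyandotte precinct data: Republican share and turnout rate
wy <- data.frame(rep_share   = c(0.25, 0.30, 0.40, 0.45, 0.55, 0.60),
                 turnout_pct = c(0.38, 0.41, 0.47, 0.49, 0.56, 0.58))

# Do more-Republican precincts turn out at higher rates?
cor.test(wy$rep_share, wy$turnout_pct)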

CONCLUSION

I'm missing some counties, and this was just an initial exploration so I'll post more later, but the takeaways for now:
  • Though I still disagree with Beth Clarkson's conclusion, the statistics also seem to hold up in the 2014 Kansas Governor's race.
  • Turnout seems to be higher in highly Republican precincts, and this deserves a closer look.
  • I will explore both of these issues, hopefully with fuller data, in a future post.

Tuesday, August 25, 2015

Comparing Sentiment: #ksleg v #ksed v #Royals

Yesterday's post was fairly popular (quite a few page views) so I thought I would extend it a bit.  What about comparing sentiments over various topics rather than over time?  It's easy to point my code at three separate topics on twitter, use my emotion/polarity identifying algorithm on each set, and compare the results.  I also pull out a couple of current events references below.

METHODOLOGY


For practical methodology, I pointed the algorithm at three separate hashtags that are used quite a bit in my region:

  • #ksleg: Hashtag used for Kansas government matters.
  • #ksed: Hashtag used for Kansas education matters.
  • #Royals: Hashtag used for the Kansas City Royals baseball team.
After compiling the tweets into a dataframe, I ran some simple statistics, and validated with statistical tests.  
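To give an idea of what "validated with statistical tests" means here, a minimal sketch, assuming a combined data frame with one row per tweet, a hashtag column, and the classifier's polarity label (the column names and counts are mine):

# Hypothetical combined output: one row per classified tweet
tweets <- data.frame(
  hashtag  = rep(c("#ksleg", "#ksed", "#Royals"), times = c(200, 200, 200)),
  polarity = sample(c("positive", "neutral", "negative"), 600, replace = TRUE)
)

# Share of each polarity within each hashtag
prop.table(table(tweets$hashtag, tweets$polarity), margin = 1)

# Is the polarity mix significantly different across hashtags?
chisq.test(table(tweets$hashtag, tweets$polarity))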

MAJOR FINDINGS

First, when we compare the overall polarity of the tweets, people are most negative about #ksleg, followed by #ksed and then #Royals.  This makes some sense: people are most negative about politics, and more positive about baseball (my conclusion yesterday gave a general hypothesis as to why).  I would expect people to be more positive about public education; however, the hashtag in question is one used for more politically charged, policy-related education issues.  More on that below.



In the emotion analysis, we see a little shift in negative emotion towards #ksed.  A lot of anger and sadness on that hashtag.  The Royals bring out about the same amount of anger as Kansas government, and slightly more sadness.  But education wins negative emotion (I dig into this below).

The #Royals are the most "surprised" overall, but you can see how people would have a "surprised" emotion, especially when live-tweeting a game.  


The anger at education was weird to me, so I wanted to dive in a little bit.  Here's what I found:
  • Prior to early yesterday afternoon, #ksed tweets were very positive.
  • Early afternoon yesterday the tweets turned more angry (50% increase in negativity in tweets using hashtag #ksed).
  • I checked reality, and this happened, which appears to have spurred the negative twitter storm.
So the negative tweets are at least explainable, and coincide directly with a real-life event.  


One final note on something I noticed in the data: a cluster of extremely negative tweets, all referencing the word "fraud."  I realized that these tweets were referencing an issue I've talked about a few times before, Kansas Election Fraud.

Anyways, I dove into tweets using the terms "election fraud" related to Kansas.  Findings were interesting.  First, this is the most negative collection of tweets I've ever analyzed.  Second, only two real emotions were popping up, Anger and Fear.  Here's what the polarity looks like (can be compared to polarity above for reference).

Monday, August 24, 2015

Sentiment Mining #Royals Over Time

I have been analyzing political/policy data over the past few weeks using my sentiment mining algorithm, and thought it would be interesting to look at some sports Twitter data.  Fortunately, the Royals are good again this year, and approaching the end of the season, so I have a good local target.  A few pre-findings of interest:

  • Royals tweets are much more positive than political tweets in general.
  • Royals fans are most negative DURING games, and more positive when games are not being played.
  • Royals fans are also less "sad" during games, though more likely to be angry or disgusted when tweeting about the team.

METHODOLOGY


I used the same methodology as before, downloading tweets from approximately the past week that used the #Royals hashtag.  Then came a bunch of preprocessing, including word removal, stemming, etc., as discussed in prior posts.  Finally, I classified each tweet by emotion and polarity using a naive Bayes algorithm, and analyzed the results with a focus on over-time trends.

INITIAL DATA

First, our initial descriptive data: the general polarity of sports tweets.  The data is generally pretty positive, though this may have more to do with the Royals having one of the best records in Major League Baseball right now.  It is certainly much more positive than political data (a future post on that).

What about the emotions of Royals tweets?  See chart below.  Joy is number one by far, but keep in mind it is the only *real* positive emotion, whereas there are four negative emotions to spread the negativity around on. 



A lot of sadness in our sample too, with a little anger, and a bit of surprise.  (I know given the last twenty years of Royals seasons, I am full of surprise every time they win.)

TIME SERIES ANALYSIS

The #Royals hashtag has quite a bit of volume on it throughout the day, so we can analyze the data-trend over time.  Here is the hourly data over the past four days, with the volume of tweets in blue, and the % positive tweets in orange.



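For the curious, here's roughly how an hourly roll-up like this can be built, assuming a data frame of classified tweets with a created timestamp and a polarity label (my names; the toy timestamps below are generated rather than real):

# Hypothetical classified tweets with timestamps spread over four days
royals <- data.frame(
  created  = as.POSIXct("2015-08-20 12:00", tz = "UTC") + sort(runif(500, 0, 4 * 24 * 3600)),
  polarity = sample(c("positive", "neutral", "negative"), 500, replace = TRUE)
)

# Roll up to hourly tweet volume and % positive
royals$hour <- format(royals$created, "%Y-%m-%d %H:00")
volume  <- tapply(royals$polarity, royals$hour, length)
pct_pos <- tapply(royals$polarity == "positive", royals$hour, mean)
hourly  <- data.frame(hour = names(volume), volume = as.integer(volume),
                      pct_positive = as.numeric(pct_pos))
head(hourly)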
A few things stick out of this graph:

  • Regular daily spikes of activity, not always at the same time.  Oh, yeah, because the Royals play games most days in summer, and that's when most people tweet about them.  I matched these up manually to the Royals schedule.
  • The % positive tweets is volatile, but appears to have a "low point" each day when the Royals play. I tested this, and found that fans are significantly more negative when the team is actually playing. Here are the actual results of that:


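For reference, the during-game comparison boils down to a two-sample proportion test.  A sketch with made-up counts (the real numbers come from flagging each tweet against the game schedule):

# Hypothetical counts: negative tweets vs. total tweets, during games and outside games
neg_during <- 180; total_during <- 600
neg_off    <- 120; total_off    <- 700

# Are fans significantly more negative while a game is on?
prop.test(c(neg_during, neg_off), c(total_during, total_off))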
What about emotion?  Looking at emotion hourly doesn't make a ton of sense, because of multiple categories (dimensionality).  We can, however, summarize the data by emotions during a game/not during a game.  During games, Royals fans are generally less sad, but much more likely to be angry or disgusted.  There is also a slight uptick in "joy."



 CLOSING THOUGHTS

A few moderately interesting findings in the data, and I think some of it makes sense.  

The spike in negativity while games are being played: This is likely because the Royals are having a great season, and so when responding to the Royals outside of game times, fans are likely to be talking about their record, or playoff chances, etc.  During the games, fans are more likely responding to an acute event, such as a bad at-bat, a missed throw, etc.

The spike in disgust and anger during games: If you look at the shift during games from sadness to disgust and anger, you're likely looking at a shift from a generalized feeling about the team ("sad" about a player being on the DL, last night's game, etc.) to another acute emotion (disgusted at a sloppy play, angry at Ned Yost for being no better than a random number generator as a manager).

Sports tweets more positive than political tweets: This one is fairly obvious.  Sports are what we do in our spare time in order to enjoy ourselves; by definition they exist for fun.  Politics is what we do to resolve disagreements about how the world should work; it is, by definition, about negativity and conflict.  

Saturday, August 22, 2015

#ksleg Friday Twitter Power Rankings

And it's that time of the week again for the formulaic #ksleg power rankings.  I think I can almost automate this completely at this point, but here are our three steps:

TOPIC MINING

What topics were discussed this week?  Here's a word cloud to get us started:

This was a relatively low-volume week on the #ksleg hashtag, and our topic mining only found a few distinct topics:

So, it was a slow week, but a few meaty topics.  I look forward to seeing how items 1 and 2 play out in the future.

POWER RANKINGS

And our somewhat sarcastic power rankings.  Of note: slow week for BryanLowry3.   My speculation is that he's actually tweeting less to avoid a higher ranking, allowing him to be a "sleeper pick" when the regular season comes around in January.



SENTIMENT MINING

There are so many fun things I can do with my sentiment mining algorithm.  First I can measure the general negativity/positivity on a specific set of tweets.  While this graph may make #ksleg look more positive than negative, that's generally how tweets look.  In fact when I compare these to sports related topics, they are about twice as negative as sports.  But I will cover that in a future post.

One additional fun thing we can do with sentiment mining is to statistically determine the Twitter screen name most associated with a specific emotion.  The methodology is to use a chi-square test to find the user whose tweets are most disproportionately weighted towards each emotion.  Here are the results:



These results generally make sense if you follow this hashtag regularly.  LOLKSGOP is angry with the State of Kansas, SpeakerTimJones is disgusted with what America is becoming and andymarso is disproportionately positive, compared to the rest of the traffic we see.
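If you're curious about the chi-square step itself, here's a minimal sketch with toy counts (the user names are placeholders, not the real accounts):

# Hypothetical emotion counts: rows = users, columns = emotions
emotions <- matrix(c(40,  5,  3,  2,    # userA: mostly anger
                      6, 30,  4,  5,    # userB: mostly joy
                     10,  9, 11, 12),   # userC: roughly even
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("userA", "userB", "userC"),
                                   c("anger", "joy", "sadness", "disgust")))

# Overall test: do users differ in their emotion mix?
chisq.test(emotions)

# Standardized residuals show which user is most disproportionately
# weighted towards each emotion (large positive = over-represented)
chisq.test(emotions)$stdres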



Thursday, August 20, 2015

Sentiment Mining: Do Tweets Correlate with Team Performance?

It's been a while since I blogged about football, and it's about time again.  I know the cool sport for statisticians to like is baseball, but keep in mind I grew up in Kansas during the 80's and 90's, and that the Royals sucked from roughly 1986-2013.

BACKGROUND AND GOALS

A couple of background posts.  First, I watch quite a bit of football and have an interest in predicting the outcomes of games.  Second, I have an interest in sentiment mining, or classifying text (tweets) by the general emotion expressed.  

The football models I made earlier in the summer generally outperform other "pre-season" models based simply on win-loss records, because they try to measure underlying performance beyond prior-year aggregate wins and losses.  One thing they don't account for is "soft" factors that are difficult to measure in data: for instance, injuries to important players, players under-performing but still winning, off-field issues, Tom Brady getting divorced, etc.

So my goal in sentiment mining is to enhance my predictions by also including general external sentiment about how a team is doing week to week.  That is, Twitter can serve as a generalized proxy for those soft factors.  Here are my research questions:
  1. Can we use twitter data to provide additional color to predictive algorithms?
  2. Does the sentiment of tweets about a particular team correlate to performance?
  3. Does that relationship predict forward, backward, or both?
Because the regular season hasn't started yet, I can't prove forward predictivity, but I can determine if tweets correlate backwards to last year's performance.  


METHOD AND DATA

For methodology I downloaded tweets associated with the hashtag used for each NFL team.  Then I used my sentiment mining algorithm (naive Bayes) to determine the emotion and polarity for each tweet.  The St. Louis Rams were enough of an outlier (3x more negative than other teams) that I removed them (they're going through some stuff right now).
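Here's a minimal sketch of how that team-level test looks in R, assuming a hypothetical summary data frame with each team's 2014 win total and the share of its tweets classified negative (toy numbers, Rams already excluded):

# Hypothetical team-level summary (Rams removed as an outlier)
teams <- data.frame(
  wins_2014    = c(12, 11, 10, 9, 7, 6, 4, 2),
  pct_negative = c(0.18, 0.19, 0.21, 0.26, 0.24, 0.27, 0.29, 0.31)
)

# Fewer wins last year ~ more negative tweets this year?
cor.test(teams$wins_2014, teams$pct_negative)   # a significant negative correlation fits the finding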

And the results.  There is a statistically significant negative correlation between wins last year and % of negative tweets this year.  That means the fewer games a team won last year, the more negative tweets this year.  Here's what that looks like:



The biggest outlier found was the Bills, who were one of our more negative teams even though they had nine wins last year.  They didn't make the playoffs, though, which led me to a second and more statistically significant conclusion:

Playoff teams generally had 5% fewer negative tweets than non-playoff teams.  Generally, about 1 in 5 tweets about playoff teams were negative, while 1 in 4 tweets about non-playoff teams were negative.
One last piece of not-statistically-significant but otherwise interesting evidence.  Our two lowest negative-tweet receiving teams?  The Seahawks and Patriots, the two teams that played in last year's Super Bowl.

CONCLUSION

Just a few points here:
  • The #Rams have real problems, as demonstrated by their true outlier status.
  • I have at least some evidence that Twitter sentiment follows team performance, and that we can model performance using our naive Bayes classifier.
  • More to come: I will be including this information in my Week 1 models, to determine if it can add value.




Monday, August 17, 2015

Paul Davis Won 85% of Kansas by Population

(Note to readers: I will get off this Kansas Political Data kick soon, and blog about football for the next six months.  So if you're tired of this topic, no worries it will end soon)

I've posted on the Kansas 2014 governor's election before.  So for a primer on my thoughts, please look at this post.  


BACKGROUND

Largely my thoughts, skip below for analysis of data.

I'm from central Kansas, but currently live in far eastern Kansas, making me acutely aware of the differences between the two regions.  Over the past year, I have heard many in eastern Kansas say things like "I don't understand how Brownback won the election" or "How is this guy still so popular?"  I catch myself thinking this from time to time as well, especially when I lived in Lawrence (2007-2014), the most liberal area of Kansas, where Paul Davis won 72% of the vote in 2014.

When people make these comments to me, I am tempted to ask, "Have you ever been west of Topeka (or Salina)?"  Brownback won the counties west of Salina by carrying 61% of the vote.  Though that region is sparsely populated, I knew this had an effect on the final election results. 

If you travel to this area and meet the people there, the results are not surprising.  Largely made up of rural farming communities, very white, very Christian, these are people who vote on things like farm tax rates, conservative values, and abortion: areas where Sam Brownback does very well.  

But what about eastern Kansas?  A couple of questions: Did Paul Davis win "eastern Kansas"?  And is it possible to divide Kansas into two states, one with Paul Davis as Governor, and the other with Sam Brownback?  

A PRIORI DATA

I'll spare you the long methodology here, but I used QGIS, publicly available election results by county, and census data, and went to cutting up Kansas.  A few observations to get me started:
  • Davis won a few densely populated eastern Kansas counties by large margins (Wyandotte, Douglas, Shawnee), giving him a cushion in the east.
  • Generally, races in sparsely populated eastern Kansas counties were closer, though Brownback won these races.  
  • Brownback won two densely populated counties (Sedgwick, Johnson), but by relatively tight margins.
  • Brownback won all western Kansas counties, generally in landslides.  
  • Brownback won the overall election by about 3.5%, or 33K votes.
Given these factors, you can see how it would be easy to carve out an eastern Kansas state. 

THE NEW STATES

Put simply, Paul Davis won the eastern Kansas counties in aggregate, including Salina and Wichita (a margin of only 1K votes, but still a win).  Here's what a map of Kansas would look like if we broke it into two states (intentionally avoiding the term Brownbackistan).  



And what do these new States look like demographically?
  • DavisLand: contains 85% of the population of Kansas, but only 43% of the land area. 
  • BrownbackLand: contains 15% of the population, but only 4.1% of the African American population of Kansas.
  • DavisLand: slightly younger, with a median age about 2.5 years less than BrownbackLand's.
  • BrownbackLand: Only 15% of total housing, more than 20% of total vacant housing (this is a weird statistic, largely due to urbanization).

So, I gerrymandered that last map to give Davis the most space I could.  What if I made the areas more neutral?  For this map, I broke the "states" down into two areas, each with the same population.  Davis would win the eastern side by about 5%, and lose the western side by about 12%.  





CONCLUSION

Obviously Kansas isn't going to break into two states (even though this was a movement when I was a kid).  But understanding that Davis did in fact win most of eastern Kansas (and 85% of a "grouped" population) is quite telling.  It's understandable how eastern Kansas residents who don't go west of Topeka could fail to understand the support for Brownback.  For those who don't understand how Brownback won the election, the simplest answer is "Western Kansas."  

Thursday, August 13, 2015

Sentiment Mining and #ksleg Power Rankings

Here we go again.  Once again mining one of my favorite Twitter hashtags for the last week (ish).  First, a wordcloud to know what people are talking about.


This is generally the same as we've seen before, people talking about the governor and education, but a few new words like "brother" show up too.
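If you're wondering how the cloud gets built, here's a minimal sketch, assuming a character vector holding the week's tweet text (the example tweets are invented):

library(tm)
library(wordcloud)

# Hypothetical input: raw text of this week's #ksleg tweets
tweet_text <- c("ksleg budget fight continues",
                "governor speaks on education funding",
                "education funding again in committee",
                "my brother says the budget is a mess")

# Standard cleanup: lowercase, strip punctuation, drop stopwords
corpus <- VCorpus(VectorSource(tweet_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Term frequencies, then the cloud itself
tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)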


TOPIC MINING AND POWER RANKINGS



And summary of the topics with relevant news links:



And once again the (somewhat humorous) by-user power rankings.  Not too much surprising here, still working on developing a one-metric reach index.


SENTIMENT MINING

(non-nerds skip to the RESULTS below)

So I've been playing with sentiment mining a little bit, and thought of a few applications here.  Sentiment mining is different from the text modeling I've used in prior analyses.  Those analyses focused on topic modeling, which uses an algorithm to determine the topics that exist in a set of documents (tweets) and categorizes those tweets by topic.  

Sentiment mining focuses on using an algorithm (in this case a Naive Bayes classifier) to determine the "emotion" communicated by a document (tweet).  For this purpose I used the now deprecated "sentiment" R library.  It includes a pre-trained Naive Bayes classifier that gives me two outputs when I run tweets through it:

  • Emotion: Categorizes tweets by the general emotion being communicated.  A lot of tweets fall out of this step, as they don't match a clear "emotion."
  • Polarity:  Determines if a tweet is generally positive, negative or neutral.  

Aside from installing a deprecated R library, the process for sentiment mining is fairly straightforward.  I used the same tweets from above, and ran them through the classifier.  (Ask me for code if you want it.)
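For those who do want code, here's a rough sketch of the whole run.  Treat it as an outline only: the sentiment package is deprecated (you'll likely need to install it from an archive), the twitteR OAuth setup is omitted, and I'm pulling the best-fit label out of the classifier output by position rather than by column name:

library(twitteR)    # tweet download
library(tm)         # text cleanup
library(sentiment)  # deprecated; install from an archive if needed

# Pull recent tweets (setup_twitter_oauth() credentials omitted here)
raw    <- searchTwitter("#ksleg", n = 1500)
tweets <- twListToDF(raw)

# Basic preprocessing: lowercase, strip punctuation, drop stopwords
corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
clean  <- sapply(corpus, as.character)

# Classify each tweet with the package's pre-trained naive Bayes models
emo <- classify_emotion(clean, algorithm = "bayes")
pol <- classify_polarity(clean, algorithm = "bayes")

# The last column of each result holds the best-fit label
tweets$emotion  <- emo[, ncol(emo)]
tweets$polarity <- pol[, ncol(pol)]
table(tweets$polarity)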


RESULTS

Mining text like this allows us to look at a few things.  First, the polarity (negative/positive/neutral) of all #ksleg tweets.  This graph shows that positive tweets have a slight advantage over negative  tweets.  

Side note: a "positive" tweet wouldn't necessarily be saying positive things about the State, but could be saying things of a positive nature.  Example: "We should pay teachers more!" is actually negative towards the State of Kansas, but a "positive" statement in algorithm terms.


What about the emotions in the tweets?  See the graph below.  Some may think that there's too much "joy" being categorized, based on current attitudes in Kansas.  On further analysis, "joy" is the only purely positive emotion, and it's possible to express joy over a negative news story.  Further reading.


APPLICATIONS

There are two predictive applications from sentiment mining: classifying people, and measuring emotions between people.  

First, when you look at the types of people that tweet on the hashtag #ksleg, there are two main types:
  • Pundits: People who tweet their opinions on policies they like or dislike.
  • Newsies: People who report the news.  Supposedly in a neutral way.
You would assume you can classify users by the polarity of their tweets into Newsies (more objective) and Pundits (more opinionated).

First I took our top 20 from the list above and classified them manually, based on my knowledge of whether they work for a newspaper, are a fake account, etc.  Then I calculated the percent of their tweets with negative polarity, and reclassified them using a 33% cutoff point (fewer negative tweets = newsie, more negative tweets = pundit).

Using this method, I was able to successfully parse out Newsies from Pundits 85% of the time.  And two of those failures were conservatives that are just more positive about the current administration.  The algorithm could be refined further, but generally works.
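Here's a minimal sketch of that rule, with hypothetical handles, hand labels, and polarity mixes (the real run used the actual top-20 accounts and their classified tweets):

# Hypothetical classified tweets for four accounts, 20 tweets each
user_tweets <- data.frame(
  screen_name = rep(c("newsA", "newsB", "punditA", "punditB"), each = 20),
  hand_label  = rep(c("newsie", "newsie", "pundit", "pundit"),  each = 20),
  polarity    = c(rep(c("negative", "neutral"),  c(4, 16)),   # newsA: 20% negative
                  rep(c("negative", "positive"), c(5, 15)),   # newsB: 25% negative
                  rep(c("negative", "neutral"),  c(10, 10)),  # punditA: 50% negative
                  rep(c("negative", "positive"), c(9, 11)))   # punditB: 45% negative
)

# Percent negative per user, then the 33% cutoff rule
pct_neg   <- tapply(user_tweets$polarity == "negative", user_tweets$screen_name, mean)
predicted <- ifelse(pct_neg < 1/3, "newsie", "pundit")

# Compare against the hand labels to get the hit rate
truth <- tapply(as.character(user_tweets$hand_label), user_tweets$screen_name, unique)
mean(predicted == truth)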



Second, what about emotional tweets: is there anything they can tell us?  For each tweet I know who was being "replied to," so what kinds of emotions are being directed towards which users?

From the emotional categories above, I can set a statistical (Bayesian) prior, and then determine which tweeters are obvious outliers in the emotions directed towards them.  There were two obvious accounts that create statistically different reactions (p < 0.01). 

First the most joyously received tweeter on the #ksleg hashtag (Bryan Lowry of the Wichita Eagle):



And the most angrily received (Michael Austin, works for KDOR):


(A quick editorial note.  I disagree with the above Twitter user quite often, but agree with him sometimes too.  He's in a rough spot in defending policies that aren't working out.)

CONCLUSION

You can see the Twitter power rankings and topic modelling above for an update of how things are going.  The biggest takeaway from a data perspective is that we can successfully sentiment model tweets.  More important, we can use the polarity and emotion of sentiment mining to both categorize users, and to measure the emotion directed towards users.

Saturday, August 8, 2015

Blog Review and "Best of"

Having had this blog for over half a year now, I thought it would be a good time for a quick review of the blog, its original intention, what it has become, and what is most popular.  Suffice to say, this has gone in a much different direction than I originally intended.

BLOG PROGRESSION

Here's the basic progression of the blog:

  • My first post set out my original intention for this blog in December 2014.  Essentially, a super nerdy place for me to talk about the challenges of managing a data science team, writing code, and dealing with an R production server. 
  • The blog proceeded this way, gaining a small following from January through March, with readership generally tripling each month.  With a higher readership in March, I bought this domain, and moved off of a Blogger subdomain.
  • About this time, I noticed that the analysis posts (where I analyze actual data) were doing much better than the conceptual posts, where I talk about data science in general.  So I moved in that direction, posting analyses of things that annoy me in the media.  This also led me to post more about data-related goings-on in my home state of Kansas.
  • Moving on to the present day: I've posted about 90 times on the blog, have a fairly large readership (still growing each month), and have set up a Twitter account specifically for the blog.

"BEST OF"

Here are the most read blog posts from this blog:

  1. My Data Science Toolkit.  This super-nerdy post was shared a lot when first released, and continues to be very popular among the data science community.  
  2. Kansas Election Fraud.  This post is my first take on Beth Clarkson's analysis of Kansas voting records.  This has seen some renewed interest, as it is making news again this week.
  3. On Weird Metrics.  This is the first post to get a large number of views on the site, largely from being picked up by John Durant, a major figure in the "paleo" movement.  This post also proved that showing a sense of humor is a great way to increase views. 
  4. Sales Tax.  This wasn't really data science, or even statistical analysis, but simple math.  Effectively, I was just looking at how sales tax impacts people differently than increases to income tax. 
  5. Teacher Salaries.  Annoyed at the bad data passed around by multiple Kansas entities, I found some legitimate data, then controlled it for cost of living so we could make legitimate comparisons to other States.

CONCLUSION

No real conclusion here, just two points.  First, this blog has gone in a different way than originally intended, but I am happy with the way it is ending up.  Second, if you have any data you would like me to look at on the blog, let me know.  I'm usually up for anything data-wise.   


Friday, August 7, 2015

GOP Primary Debate Number One: Summary And Text Mining

I had a great time watching last night's GOP debate with my wife.  I missed the crazy on twitter last night due to being 'in-between phones' right now.  But I still watched and enjoyed.  This morning, I couldn't resist an analysis.

SUBJECTIVE ANALYSIS


My take on the debate, in two sentences per candidate.  (Mostly intended to be humorous).
  1. Marco Rubio.  The US has not elected a man under 6'0" tall to the presidency since 1976.  Rubio is 5'10".
  2. Scott Walker.  Has the same kind of "derp" as Ross from Friends.  May be the best candidate in the field.
  3. Jeb Bush.  Like a smarter, more moderate version of his brother, with less "conservative mojo." Which of course makes him a MUCH worse GOP primary candidate.
  4. Chris Christie.  Makes very pragmatic moderate points for the most part, and seems electable but-for a bridge.  Has "Catholic Dad" syndrome, in that he reminds me of the dads at the Catholic school I attended.
  5. Ben Carson.  He needs to be better prepared. HE NEEDS TO BE BETTER PREPARED.
  6. John Kasich.  He should be glad he was in his home state last night.  Per this morning's mining, Carly Fiorina should have clearly had that spot.
  7. Ted Cruz.  Did his nose start looking that way because he talks to people like that?  Seems like a grown up debate kid with an attitude.
  8. Donald Trump.  Better at sound-bites than I thought he would be.  Still a troll candidate.
  9. Mike Huckabee.  I fell out of my chair when I heard his abortion argument.  Actually I was standing, so I fell from a standing position.
  10. Rand Paul.  Good attack dog, not presidential.  Belongs in the Senate protecting civil liberties.


TWITTER DATA

I downloaded tweets from this morning (approximately 7am - 11am) using the #GOPDebate hashtag and went to analyzing them.  First I wanted to know about dominant subjects.  Trump is still dominating; the only other candidate showing up is Carly Fiorina.  


Then I wanted to know which words were most associated with which candidates.  I've used this statistical method before, you can read about it on prior posts.  Here's Huckabee's associated terms, my favorite terms from this are "big government", "downright", "personhood",  "perv", and "jingoist".



Trump is also interesting, for the use of the words "butt", "plug", "fucktrump", "bimbo" and "donzilla".

I created a few more of these down the page, less amusing but still interesting.
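The association step itself comes straight out of a document-term matrix.  A minimal sketch, with a handful of invented tweets standing in for the full #GOPDebate pull and an arbitrary correlation cutoff:

library(tm)

# Hypothetical debate tweets; in practice this is the full #GOPDebate download
debate <- c("huckabee on personhood again",
            "trump the bimbo comment was something",
            "huckabee big government rant tonight",
            "trump donzilla dominates the stage")

corpus <- VCorpus(VectorSource(debate))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
dtm    <- DocumentTermMatrix(corpus)

# Terms most correlated with each candidate's name, above a chosen cutoff
findAssocs(dtm, terms = c("huckabee", "trump"), corlimit = 0.2)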

TOPIC MODEL

The tweets don't break well into set topics, partially because there is a large breadth of topics from last night, and partially because individuals are going off on their own little rants, largely divergent at that.  But I did put together a topic model.  There are some obvious topic breakdowns here.
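For the topic model itself, here's a minimal sketch using the topicmodels package, reusing the document-term matrix (dtm) from the term-association sketch above.  For real use the dtm should come from the full tweet pull, and the number of topics is a judgment call; k = 2 here just keeps the toy example small:

library(topicmodels)

# LDA can't handle empty documents, so drop tweets with no remaining terms
dtm_nonempty <- dtm[rowSums(as.matrix(dtm)) > 0, ]

lda <- LDA(dtm_nonempty, k = 2, control = list(seed = 2015))
terms(lda, 5)       # top terms per topic
table(topics(lda))  # how many tweets land in each topic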


Of note are: 
A lot of interesting topic areas, but those are the big three, in my opinion. 

CONCLUSION

A few bullet points from last night:
  • General twitter consensus: Marco Rubio won.
  • Trump is still dominating the conversation of the GOP primary.
  • Carly Fiorina should have been included in the main debate.
  • The interaction between Megyn Kelly and Donald Trump was a wide topic of conversation.

And, some more candidate term associations:

Chris Christie, who is associated with "blowing" something hot, and his interaction with Rand Paul (one of the best exchanges of the night).


And Scott Walker, with a bunch of Wisconsin and union references.


Wednesday, August 5, 2015

Structural Barriers to Efficiency: Target Density

If you can't tell by reading this blog, my inspiration often comes from Twitter.  This post is a result of what happens when I see something on Twitter, have free access to data, and decide to look into things.  The tweet that spurred this:


So, if you don't know these regional politics, Dave Trabert is the head of the Kansas Policy Institute, which is the local think tank sponsored by the Koch Brothers.  Yeah, those Kochs... they're from Kansas.

Anyways, if you go back to that tweet and read the responses around it, you see a lot of references to high administrative costs for Kansas schools.  This spurred two questions: 1. is that accurate? and 2. if so, is it explainable?

BACKGROUND

I've looked at school funding a couple of times before (a lot of times, actually; you can check some of my other posts on this blog), so I have a bit of knowledge on this subject.  I grabbed some national data, and found that Kansas does have higher administrative costs than other states, by about 1.5 percentage points.  

Whenever I see numbers that are different, I think of reasonable explanations, specifically explanations that would be structural and outside the control of school districts.  Is it possible that this is not a matter of Kansas educators being inefficient?  

What would create high administrative costs (specifically as a %)?  This is a business question: what makes business operations have higher administration costs?  Other than wasteful spending, the main threat to administrative efficiency is lack of economies of scale.  So in this case, what gets in the way of creating large "economy of scale" school districts?  Sparse population, something Kansas has plenty of.  So I created a quick bi-variate model, controlling for population density.

MODEL

My model was simple, and regressed percent administrative costs against population density. 
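Here's a minimal sketch of a bivariate model like that one, with a hypothetical state-level data frame standing in for the real national figures (the numbers below are invented):

# Hypothetical state-level data: admin cost share (%) and people per square mile
states <- data.frame(
  state     = c("Kansas", "Nebraska", "Missouri", "New Jersey", "Texas", "Montana"),
  admin_pct = c(11.5, 11.0, 10.2, 9.0, 9.8, 12.0),
  density   = c(35, 25, 88, 1210, 108, 7)
)

# Bivariate model: does sparse population predict a higher admin cost share?
fit <- lm(admin_pct ~ density, data = states)
summary(fit)

# How far above or below expectation is Kansas once density is accounted for?
resid(fit)[states$state == "Kansas"]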



The red dot is Kansas.  What I found was that Kansas is a sparsely populated state, and controlling for density, admin costs are only 0.3% over expectations.  Essentially, when you control for population density, Kansas education is more efficient than it initially looked.  

I had some other data on my desktop by individual Kansas school district, and the small, low-density districts have much higher administration costs.  The model holds up, meaning that there are a lot of small districts impacted by the structural differences.  See the chart below:






POTENTIAL SAVINGS?


So far, I've demonstrated that Kansas has higher administrative costs, but that most of that difference is due to exogenous structure (population density).  Specifically, sparse population density does not allow for larger school districts, which have low relative administrative costs.  But just because Kansas is in line with national averages after the control doesn't mean that there isn't potential for savings. 

In fact, I find Trabert's notion from his tweet of regional administrative centers to be interesting.  My second model shows the specific low density districts with those high costs, where maybe administrative centers would make sense. What he is trying to do is hack his way around the structural disadvantage of Kansas, and make those identified low density districts behave like high density ones. 

To put Trabert's idea in different terms, if this were the business world and I wanted to provide services to a sparsely populated area I would either:

  • Not create a physical presence.  Try to reach out and recruit customers in the online space, have them manage their account and purchases online.  (This sounds like something that could go terribly wrong if implemented in education).
  • Create small stores with minimal field management.  If you're from rural Kansas, think about a Shopko or Alco model.  Basic bare-bones small stores in small towns with regional admin centers/managers.  (This is effectively Trabert's idea; everyone keeps their schools; regional management centers.)
Education experts and a pilot test are the only things that could tell us whether option two would work without decreasing service levels.  One option would be to replicate the interlocal/coop model currently used in special education for other top-level administrative tasks.  I know that agreeing with Trabert won't be popular, but if we can direct money elsewhere by saving costs, why not try a pilot?





Additional plots from analysis above: