Friday, October 30, 2015

Name That Highlighted Map Contest

First, a Friday review post: our top three posts by pageview from this week:
  1. Public Library Peer Groups
  2. KC Area Job Growth Comparison
  3. Democratic Debate Summary
And now for something new on this website: a randomly highlighted map.  OK, not *random*, but actually highlighted according to a specific geographic attribute by county.  Green means the county has a higher amount of said attribute; red means less.  There will be a reward for whoever can name the highlighted attribute.  Comment your answer below!




Thursday, October 29, 2015

Kansas Historically Better Than Missouri, Jobs-wise

Am I mainly writing this to tout Kansas as historically superior to Missouri?  Yes.  Probably.  I may also be writing this because I work in the Kansas City area and don't want to have to drive to Missouri for work.

This is somewhat seriously related to my recent posts on the Kansas jobs report, maps, and arguments between pundits, but more nuanced, as it is specifically about the differences between the Missouri and Kansas sides of the Kansas City job market.  Here are the takeaways:

  • Historically: The Kansas side has beaten the Missouri side in 19 of the past 24 years; the last time Kansas lost in a non-recession year was 1999.
  • 2010-2014: Kansas beat the Missouri side in each of these years, following the recession-impacted years of 2008-2009.
  • 2015: Kansas is currently pacing behind Missouri, which would be its first non-recession loss since 1999.

BACKGROUND

The background of this post is rooted in an article in the KC Star regarding jobs reports, followed by a Twitter argument between conservatives and liberals.  The initial argument was exacerbated by a date issue in the original article, now corrected, but the argument is generally this:
  • Liberal: Kansas is losing the jobs growth war in 2015.
  • Conservative: Kansas is losing in 2015 only as a correction to high growth in 2013 and 2014.
Here are some tweets to give you a flavor of this:




Yikes!  Everyone gets so serious about this stuff.  One last good piece of background: while the Missouri side of Kansas City has the majority of the population, the growth center in KC over the past couple of decades has been the affluent Johnson County, Kansas suburbs.

DATA 

The problem with the above argument between pundits is that it doesn't look over a long enough time horizon to understand long-term trends and patterns. Here's the question at hand:
Should we really give current Kansas policy makers a win if they beat Missouri by 1%, when they have been beating Missouri by 1% for the past two decades?  
I dug in and analyzed BLS data from 1990-2014.  Here's a chart of annualized growth rate in average jobs per month:


Some takeaways:

  • Excluding recession years, Kansas generally beats Missouri by an average of 0.75%.  
  • Kansas has beaten Missouri in job growth in 19 of the past 24 years; the "loss" years were 1993, 1994, 1999, 2008, and 2009.
  • Controlling for the 0.75% historic growth rate differential, the actual win/loss record looks like this (a quick worked example follows this list):
    • 2013: KS (+1.23%)
    • 2014: KS (+0.59%)
    • 2015: MO (+1.31%) (pacing)
  • If the 2015 pattern holds, it would be significant for Kansas as it would be the first non-recession loss to Missouri since 1999.
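
To make that adjustment concrete, here is a tiny worked example in R.  The growth rates below are made up for illustration; only the 0.75% historical differential comes from the analysis above.

    # Hypothetical annual growth rates -- illustration only, not the actual BLS figures
    ks_growth     <- 0.021    # Kansas-side job growth (2.1%)
    mo_growth     <- 0.010    # Missouri-side job growth (1.0%)
    historic_edge <- 0.0075   # Kansas' long-run average advantage over Missouri

    raw_margin      <- ks_growth - mo_growth        # +1.1%, a "win" on its face
    adjusted_margin <- raw_margin - historic_edge   # +0.35%, a much smaller win

    winner <- ifelse(adjusted_margin > 0, "KS", "MO")
    sprintf("Adjusted margin: %+.2f%% (%s)", 100 * adjusted_margin, winner)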

CONCLUSION

The argument over Kansas versus Missouri jobs is informed by historical data showing Kansas winning most of the past 24 years.  If policy makers want to take credit for beating Missouri in job growth, they should probably show growth levels significantly above historical averages.  That said, the recent scorecard shows 2013 and 2014 as wins for Kansas and 2015 as a win for Missouri.  I will continue tracking these metrics; the end-of-year report should be interesting.


Friday, October 23, 2015

Peer Group Determination: Library Peer Groups

Libraries are an institution generally associated with books and reading, and not necessarily math and data.  But as an avid reader who is married to a librarian, this data guy has an interest in the data behind libraries.  When my wife made me aware of some library datasets that might be interesting, I dug in and looked at the numbers.

I realized this dataset gave me the opportunity to potentially help libraries and tackle a subject I have wanted to cover on this blog for a while: Peer Group Identification.  Specifically, the research question here is: can we use a data-driven methodology to identify peer groups for individual libraries?  If we can, librarians can use these peer groups for benchmarking, setting best practices, and networking with those in similar situations.

THE PEER GROUP

Mixing it up a bit by putting results before detailed methodology.  It's down below if you're interested, in both math-intensive and non-math-intensive formats.  For data, I used the publicly available 2013 IMLS dataset.

I used a "nearest neighbor" methodology to find peer libraries for my home library system, Johnson County Library (JCL).  The nearest neighbor method is widely used across many fields, here's an example from medical research.  The factors I matched on were population served, branches, funding per capita, and visits per population.  

The result is the peer group in the chart below: libraries with between 11 and 13 branches, similar funding levels, and similar populations and visits.  There is one extremely close neighbor, the Saint Charles City-County Library District.  This library is similar to JCL in the data, but also in serving affluent suburban areas near midwestern towns.  

I ran this list by my wife and she liked it.  So, success?  In the short term at least, but there may be room for refinement (see conclusion).






SIMPLE METHODOLOGY

The "nearest neighbor" methodology to determining peer groups is fairly easy to understand at a basic level.  If we wanted to determine a peer group for Johnson County Library without using advanced analytics, we might start by simply looking at all Libraries that serve between populations between 400K and 500K.  

That might give us a good start, but upon diving in we would learn that many of those libraries face different challenges and experiences.  Some would be less affluent with lower funding levels, while others may see far different use patterns.  So we would add in a second variable, let's say funding per population, which would look like this:  


In this case we would choose the libraries closest to JCL, roughly the circle in the above graph.  But once again, there's a lot more to the attributes of a library than funding and population served.  What about use patterns and number of branches?  

This is where I lose most people in the math.  Using this methodology, we can use as many dimensions as we want and simply calculate the nearest neighbors simultaneously on all variables.  The best way to imagine it is as an extension of the above graph into 3-dimensional, 4-dimensional, and eventually n-dimensional space.
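
For anyone who wants to see the idea in code rather than words, here is a minimal R sketch.  The numbers and names are made up, and it deliberately skips the normalization discussed in the nerd section below, but it shows that "nearest" is just a distance calculation that works the same whether you feed it two columns or ten.

    # Toy data: JCL plus three made-up libraries, described by two attributes
    libraries <- data.frame(
      name       = c("JCL", "Library A", "Library B", "Library C"),
      population = c(450000, 430000, 800000, 120000),
      funding_pc = c(55, 52, 40, 30)
    )

    attrs <- libraries[, c("population", "funding_pc")]

    # Euclidean distance from JCL (row 1) to every library; adding more columns
    # to `attrs` extends the same calculation into 3-, 4-, or n-dimensional space
    libraries$dist_to_jcl <- sqrt(rowSums((attrs - attrs[rep(1, nrow(attrs)), ])^2))

    libraries[order(libraries$dist_to_jcl), ]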

NERD METHODOLOGY

This is a methodology I have been using since early in my career, both to choose peer groups and in the form of the k-NN algorithm.  Computationally, this is similar to the k-NN algorithm, especially in the first phases.  Some generally nerdy notes:

  • Computation Method: This is easy, really; it's just a minimization of Euclidean distance in multi-dimensional space.  Effectively, a minimization of d in this equation, where p and q are two libraries described by n attributes:

    d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )
  • Computation Strategy: The k-NN machine learning algorithm is both computationally elegant and costly.  It is elegant because it is simple to write: we just compute Euclidean distance in n-dimensional space and find the "closest" k neighbors to the point we're interested in.  Simple.  I didn't even use an R package for this; I just rewrote the algorithm, in about 10 minutes of R coding, so I could have more control (a rough sketch of the full pipeline follows these notes).  It is costly because, in its predictive form, it requires distances calculated between every pair of points in a dataset (which, as you can imagine, can be slow on million+ row tables).  Luckily, in this case I'm only interested in the distances between Johnson County Library and the others, so it's computationally cheaper.
  • Variable Normalization: If you input raw data into the nearest neighbors algorithm, attributes will establish their importance in the equation by variance (because we're simply measuring the raw space).  I take three steps:
    • I conduct some attribute transformations.  Most importantly, I take logarithms of any variables showing a power-law distribution to reduce variance.
    • I Z-score each attribute, such that we are now dealing with equivalent variance units.
    • I (sometimes) multiply the Z-score by a weighting factor, when I want one factor to matter more than others.  In this case, I don't have a good a priori reason to weight factors, but I could reconsider if librarians think some factors matter more.
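
Putting those notes together, here is a rough R sketch of the pipeline as described: log-transform the skewed variables, Z-score everything, optionally weight, then take Euclidean distance to JCL.  The data frame and column names are placeholders, not the actual IMLS field names.

    # Assumed: `libs` has one row per library; column names are placeholders
    vars    <- c("population", "branches", "funding_per_capita", "visits_per_pop")
    weights <- c(1, 1, 1, 1)            # equal weights; raise one to make it matter more

    x <- libs[, vars]
    x$population <- log(x$population)   # log any power-law-ish variables
    x <- scale(x)                       # Z-score: equivalent variance units
    x <- sweep(x, 2, weights, `*`)      # optional per-attribute weighting

    # Euclidean distance from JCL to every other library, then the 10 nearest
    jcl <- which(libs$name == "Johnson County Library")
    d   <- sqrt(rowSums((x - x[rep(jcl, nrow(x)), ])^2))
    head(libs$name[order(d)][-1], 10)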

CONCLUSION

In this post I have covered a methodology for determining peer groups and created a peer group for the local Johnson County Library.  I hope that it is both demonstrative of a methodology that could be implemented in many fields and adds value to the field of librarianship.

If any librarians are interested in this analysis, or the details of it, feel free to reach out to me at datasciencenotes1@gmail.com.  I would be happy to provide you a custom peer group for your library.  I would also be interested in any thoughts on improving the peer groupings, either by:
  • Using additional factors or variables.
  • Weighting certain factors as more important than others (is funding or # of branches more important than # served?)
I'll leave you with a final view of the JCL data.  Below is a graph of libraries by visits and population, with the Johnson County peer group highlighted in blue.  Note that there are some orange dots intermixed with the peer group; these are libraries that were good matches on these two factors (visits/population) but not on our other two factors.
Johnson County Peer Group Versus All Libraries

Wednesday, October 21, 2015

Mapping Kansas Jobs

When researching my post from earlier this week, I downloaded some data that wasn't immediately helpful: by-county total employment statistics from 2001 to 2014.  The data wasn't initially useful, but yesterday I created some maps that are fairly interesting when looking at job patterns in Kansas from 2001 to 2014.  A few takeaways I found:
  • Johnson County (+) The biggest positive growth over this period, at +35,000, which is actually greater than the net total for all counties combined.  That means, if we remove Johnson County, the remaining counties see net-negative job growth from 2001 to 2014.
  • Sedgwick and Shawnee Counties (-) These two counties have seen the worst aggregate job reductions, losing between 3,500 and 5,000 jobs each.
  • Smaller Counties (+/-) Some good, some bad.  Positives for Southwest Kansas, near-suburban counties (e.g. Butler), and those along I-70 (e.g. Ellis); negatives for Southeast Kansas and those along the Nebraska border (e.g. Jewell).
  • Commuter Counties (?) Weird data thing here, but if I divide workers in a county by total population, I can get a sense of commuting patterns.  For instance, the ratio is .870 for Saline County and .396 for neighboring Ottawa County.  Huge difference, but having grown up there, I know there's a net inbound flow of commuters into Saline County and out of Ottawa County. 

TOTAL WORKERS BY COUNTY

To start off, I mapped total workers by county, to get a sense of where people worked.  First by 50K buckets.  




That's not as meaningful as I thought it might be: only four of 105 counties have more than 50,000 workers.  Johnson and Sedgwick are dominant.  Let's break that down into more meaningful buckets; here each bucket has an approximately equal number of counties in it.



That's a bit more meaningful, and the combination of the two maps gives us an idea of where most jobs are in Kansas: Johnson and Sedgwick Counties, then the rest of eastern Kansas, then a few hot spots in western Kansas (Ellis, Finney, Ford).

EMPLOYMENT CHANGE


First I let QGIS break down the counties as it saw them, in buckets of 10,000-worker change.  Below, dark red indicates a net loss of jobs, and orangish indicates a gain of between 0 and 10,000 jobs.  The bright green is Johnson County at +35,000.



This gives us an idea that Johnson County is an outlier in job growth, but by how much?  A summary chart below makes the point in numbers, but essentially: without Johnson County, Kansas counties are net-negative in jobs since 2001.

The above map shows aggregate job change, but a better measure of impact to individual counties is percent change.  Here's a map of job growth and loss mapped by percent change. There's also a chart below that shows the biggest winners and losers in jobs over this time period.



And below is the same information in chart form.




COMMUTER POPULATIONS

Because the data was available in my GIS layer, I calculated the ratio of total workers to population aged 18-65.   Here's what that looks like:


The variance range was larger than I expected, but two things seemed to pop out.

  1. Moderately low employment rates overall in *extremely rural*, older, agriculture-based populations (e.g. Decatur, Jewell, Hodgeman, and Edwards Counties).
  2. Near-urban "commuter" communities have extremely low employment rates.  These communities are fairly obvious if you're familiar with Kansas demographics. Some county examples:
    • Ottawa commutes to Saline
    • Butler commutes to Sedgwick
    • Miami commutes to Johnson

CONCLUSION

I just pulled these maps together from data already on my desktop, but here are a few summary points:
  • If you remove Johnson County from Kansas, there has been essentially no job growth over the past 14 years.
  • Sedgwick, Shawnee, and a few rural counties have seen especially negative results over this time frame.
  • By looking at ratios of jobs to population, we can detect net commuter and net destination counties, though there's probably a more direct way to detect this.


Monday, October 19, 2015

Kansas Jobs Report; LET'S ALL ARGUE

Last Friday, I was confronted with another Twitter argument regarding the Kansas economy, this one specifically related to created jobs.  It all stemmed from this tweet:

A little background is in order. As part of Governor Brownback's promises during the 2014 gubernatorial campaign, he said he would create 100,000 new jobs for Kansas.  Nine months into the 48-month term, the State of Kansas is not on pace to hit that target.  It caused a bit of a Twitter storm Friday morning; here's a follow-up from a Brownback administration official:


Generally, I couldn't care less about the political fight here.  I get the economic theory behind cutting taxes to attract employers, though I think the Brownback administration lacked a valid model for their short-term projections.  But that's for another post.

These types of arguments over Kansas jobs have been going on for a long time; is there any clarity we can pull from them?  I figured a post on the major arguments and the history of Kansas jobs could provide a deeper look.  Here are a few things to cover.

  • Year-over-Year (YoY) Month-to-Month Numbers: Each month, when new jobs numbers come out, reporters and pundits have this argument on Twitter.  And whoever is on the side that lost out in this month's numbers claims the metric is invalid.  Are they correct?  Is there a better metric?
  • Precedents in Job Growth: What are the precedents for job growth, what would we expect annually, and would that just naturally get us to 100,000?
  • How much control does the governor have?  Are the drivers of job growth differences largely external to the state, or can policy changes significantly impact job growth?

YoY NUMBERS

One of the arguments brought up with each month's jobs report is that you can't simply look at year-over-year monthly numbers, because there is too much volatility.  I've noticed that Yael Abouhalkah (KC Star reporter) and Michael Austin (economist with the Department of Revenue) tend to have this argument monthly-ish.  Generally, Michael is correct in that YoY statistics are too volatile and not as meaningful as long-run stats.  Here's a look at Kansas job growth over the past 17 months, on two axes against nationwide growth.  While the nationwide numbers are reasonably steady, the Kansas numbers show a lot of volatility month to month, ups and downs that aren't indicative of a long-term trend.


We need a better metric for a more reasonable discussion.  To control for volatility, for the rest of this analysis we'll look at average Kansas jobs on an annualized basis (the average of monthly non-farm BLS jobs numbers).  This is a better metric because it reduces volatility, though it also reduces our ability to have dumb arguments each month.
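
For anyone who wants to reproduce the metric, here's a minimal R sketch of the calculation.  The data frame and column names are assumptions; it's just the shape of the computation, not my exact code.

    # Assumed: `bls` holds monthly non-farm jobs numbers, e.g.
    #   bls <- data.frame(year = ..., month = ..., jobs = ...)

    # Average monthly jobs per year -- the annualized metric used here
    annual <- aggregate(jobs ~ year, data = bls, FUN = mean)

    # Year-over-year growth rate of that annual average
    annual$growth <- c(NA, diff(annual$jobs) / head(annual$jobs, -1))
    annual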

HISTORY OF JOB GROWTH

Going back to 1990, Kansas job growth is on a fairly stable upward trend, with two notable setbacks due to recessions in the early 2000s and 2008.  The second axis, tied to the orange line below, shows YoY changes, which are a bit more meaningful (all numbers in 1,000s).

Kansas hasn't seen steady job growth in excess of 20-25K jobs per year since the mid-1990s.  That means Brownback's promise of 100K new jobs by the end of his second term is extremely aggressive, and would require a return to 1990s-style growth.  If only someone could re-invent the internet.




But looking at Kansas growth historically, we know that it is impacted by national periods of growth and recession.  How much of the growth rate is due to national trends versus what actually happens in Kansas?  It turns out most of the variation in job growth (~75% historically) is related to what's going on in the national economy.  Here's a look at change in jobs, indexed to 1.0 (no change year over year).



Region: a group of similar near-neighbor states (Nebraska, Missouri, Oklahoma, Arkansas, Iowa).

The past two graphs on history bring out two problems with Brownback's campaign promise:
  • It's highly aggressive and highly unlikely to be attained.
  • It's more dependent (75% of variance) on the external economy than on internal policy actions.

TRACKING SUCCESS AND EXTERNAL INDEXING

Given the last two issues raised, Brownback's original promise (and the Kansas job market in general) should be judged on different standards than "100,000 more jobs by the end of Brownback's second term."  In my mind there are two ways that we can judge Brownback a success, rather than using month-to-month comparisons:

  • Approximation of a 100K increase.  Because we'll analyze things on an annualized basis, success will be measured as a 100K increase in average monthly jobs from 2014 to 2019.  Yes, we're giving the administration an extra year, but the measurement works better, it's still the end of his term, and I'm feeling generous.  This requires about 7.2% growth, from about 1.39 million to 1.49 million jobs (the annualized, de-compounded assumption is 1.39% growth; a quick sanity check on this arithmetic follows the list).
  • External Indexing.  Brownback's number of 100K new jobs didn't allow for the case of another recession, or really any external factors.  It is helpful to know how Kansas is faring against peer states and nationally.  If we are doing significantly better than peers in the case of a recession or other economic setback, this should be recognized.  So here's how the metric works: historical average growth for peer states has been about 1% for the past 25 years (and has settled around there for 2013-2015).  To add 100,000 jobs against that historical background average, Kansas jobs would have to grow at 1.4%.  We would then expect Kansas, under Brownback policies, to outperform our neighbors by 0.4%.  That means, from an indexed perspective, even if we miss the 100K target, Brownback should get credit if we outgrow our neighbors by 0.4%. 
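
As promised above, a quick sanity check in R on the arithmetic in the first bullet (the inputs are the rounded figures from that bullet):

    start  <- 1.39e6    # approximate 2014 average monthly jobs
    target <- 1.49e6    # 2019 target: start + 100K

    target / start - 1          # total growth required, ~7.2%
    (target / start)^(1/5) - 1  # de-compounded annual rate, ~1.4% per year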

CONCLUSION

We've looked at three things:
  1. the unfairness of month-to-month metrics
  2. historical job growth in Kansas versus externally
  3. a new metric going forward
I will try to update this monthly, or at least quarterly, but how is Kansas doing through August (eight months into 48 months)?  Per the metrics defined, Kansas is up 8,000 jobs and is growing at a rate about 0.4% slower than peer states.  In other words, by either metric Kansas is missing Brownback's job targets, but there's still a lot of time on the clock. 


Wednesday, October 14, 2015

Democratic Debate #1 Summary


After the Democratic debate last night, my wife asked me who I thought won.  I had no clue, actually.  I could name three people who didn't win, but the debate wasn't a clear, decisive victory for either front-runner, Hillary Clinton or Bernie Sanders.  Why not try to derive who won the debate from social media, like I did for the Republicans?

TWO SENTENCE SUMMARIES

I'll just start out with my take from the debate, somewhat sarcastic, and limited to two sentences each.
  • Hillary Clinton - She'll likely win the nomination, so a good part of her debate strategy is like the four corners offense in basketball or "prevent" defense in football.  She's just running out the clock and trying not to mess it up.
  • Bernie Sanders - He made some points indicating a willingness to work with people of differing opinions, which is weird for a debate (gun control).  Came off only half as crazy as portrayed in the media.
  • Jim Webb - Seems to be angling for an appointment as Secretary of Defense.  Here's his picture as Assistant Secretary of Defense, 30 years ago.
  • Martin O'Malley - Still had to google him this morning to figure out anything about him.  I guess he was governor of Maryland?
  • Lincoln Chafee - Is he the one that looks kind of like a bird?  Yes, both a real bird AND Larry Bird.

VOLUME AND POLARITY

Same methodology as normal: I downloaded a sample of tweets from after the debate and analyzed them.  Then I calculated the number of tweets related to each candidate and the positive percentage.  Clinton and Sanders had a similar number of tweets, but tweets mentioning Clinton were a bit more positive.  The rest of the field went Webb, O'Malley, and Chafee, with Chafee getting only slightly more attention than Barack Obama (not running, FYI).



Oh, and a wordcloud demonstrating talk after the debate.  Note that only Clinton, Sanders, and Webb make the cloud.


TOPIC MODEL

So what topics were talked about following the debate?  I ran a quick Correlated Topic Model to determine the topics.
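
For anyone who wants to try something similar: a correlated topic model can be fit in R with the topicmodels package.  The sketch below is the general shape of such a pipeline, not my exact code, and the object names are placeholders.

    library(tm)
    library(topicmodels)

    # `tweets` is assumed to be a character vector of post-debate tweet text
    corpus <- Corpus(VectorSource(tweets))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)

    dtm <- DocumentTermMatrix(corpus)
    dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]   # drop tweets with no remaining terms

    ctm <- CTM(dtm, k = 4)   # fit a 4-topic correlated topic model
    terms(ctm, 10)           # top ten terms per topic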




TOPIC 1: People talking about facts of the debate generally the morning after, focusing on who "won" the debate.  
TOPIC 2: Topic focused on people talking about Bernie Sanders saying he was tired of hearing about Hillary Clinton's emails.
TOPIC 3: This is a topic focused on what is being referred to as the "Warren Wing" (Elizabeth Warren) of the party (warrenw is the stemmed hashtag).  Bernie Sanders is largely seen as the most Warren-esque candidate.
TOPIC 4: This is the Hillary-focused topic, centered around Hillary's performance and CNN all but declaring her the winner of last night's debate.


WHO WON?


Then I looked to see whether any candidate names are disproportionately associated with the words "won" and "winner," to declare a winner in this debate (obviously this isn't a great predictive strategy, but it's at least an amusing way to see who is most associated with winning the debate on social media).  Here's what I got:



Senator Warren's name comes up here once again; many people are pointing out, using similar language, that the ideas presented were largely Elizabeth Warren's.  But she's not running, so do any actual candidates come up?  The name Bernie does, but the word "snore" precedes it in the ranking.  For the word "winner," "unclear" is a top word, which looks to be a fairly broad consensus in the media as well: it's unclear who the actual winner of last night's debate was.
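
A quick note on mechanics: given a document-term matrix of the debate tweets, the tm package's findAssocs() can do this kind of lookup, returning the terms most correlated with a given word.  A minimal sketch (not necessarily what I used, and the 0.1 cutoff is arbitrary):

    library(tm)

    # `dtm` is assumed to be a DocumentTermMatrix built from the cleaned debate tweets
    findAssocs(dtm, terms = c("won", "winner"), corlimit = 0.1)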

Tuesday, October 13, 2015

Royals Twitter Performance: Tracking A Comeback

A couple of months ago, I posed an open question about how well sports team performance correlates to Tweet outputs and sentiments.  I generally demonstrated low-level but significant correlations from last season to this season.  But what about in-game correlations and observations; do Twitter sentiments and volume shift in-game in relation to outcomes in the game?  Yesterday's playoff game between the Royals and the Astros (a game with a major shift of performance late in the game) provided a great test.

METHODOLOGY 

I downloaded all the tweets using the hashtags #Astros and #Royals yesterday between noon and 5pm Central time.  I cleaned the tweets (stemming, removing words, etc.) down to standardized data.  I also sentiment mined the tweets and categorized them into 20-minute interval buckets.
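
For the curious, here is roughly what the scoring and bucketing could look like in R.  The sentiment scorer (syuzhet's bing method here) and the column names are assumptions, not necessarily what I actually used.

    library(syuzhet)

    # Assumed: one row per tweet
    #   tweets <- data.frame(created_at = <POSIXct>, text = <character>, hashtag = <"#Royals" or "#Astros">)

    tweets$sentiment <- get_sentiment(tweets$text, method = "bing")  # >0 positive, <0 negative
    tweets$bucket    <- cut(tweets$created_at, breaks = "20 mins")   # 20-minute interval buckets

    # Tweet volume and average sentiment, per hashtag per bucket
    volume    <- aggregate(text      ~ hashtag + bucket, data = tweets, FUN = length)
    sentiment <- aggregate(sentiment ~ hashtag + bucket, data = tweets, FUN = mean)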

DATA

The first thing I looked at was the volume of tweets by time period throughout the afternoon, by each team's hashtag.


Note that the #Royals hashtag is generally "beating" the #Astros hashtag throughout the afternoon, with an explosion of #Astros tweets around 2:40 followed by an explosion of #Royals tweets later in the afternoon. Here's what was going on in the game.

12:00-2:40 Pregame, and the first six innings of the game.  The Royals and Astros play a tight first six innings, with the Royals leading in the second inning before a slight Astros comeback; the Astros lead 3-2 going into the bottom of the seventh.

2:40-3:00 The Astros score three runs in the bottom of the seventh to take a commanding lead.  At this point, teams with a four-run lead and two innings left win over 95% of the time.  Here's a flavor of #Astros tweets during this time period.




3:00-5:00 The Royals mount a massive comeback in the eighth inning, scoring five runs, add two more in the ninth, and win the game.

Looking at the data this way, we can bucketize it into three main periods of the game: prior to the Astros' "big lead," during the Astros' lead, and then during the Royals' comeback and afterwards.  Here's what that looks like by tweet volume.

This chart demonstrates the correlation between performance and tweet volume, with #Royals outperforming #Astros 2:1 in the later stages of the game.



There's no clear explanation for #Royals outperforming early in the game, though here are a couple of testable hypotheses:
  • Away teams see more tweet volume, because people can't be "at the game."
  • Some teams just have better Twitter presences.
For one last test, I sentiment mined the tweets for positivity versus negativity.  These results are less significant, but they do show that Royals fans were more negative when the team was losing, and that positivity spiked during the comeback.  



Friday, October 9, 2015

Kansas Tax Revenue Estimates: Are They Accurate?

One of the key arguments in my home state of Kansas right now is about tax revenues.  I've posted on this before, but there's a general argument about the correct tax strategy (more sales? income? consumption? property?).  There's also a more short-term argument: under the current tax plan, will we take in enough money for the budget created this year (FY 2016, July 2015-June 2016)?  This blog post will evaluate the accuracy of the revenue estimates, and following posts may seek to refine them.

BUDGETING BACKGROUND

To explore whether we'll have enough money for this year, we're actually asking another question, which is, are the consensus revenue estimates accurate?  Here's how the process works in a nutshell:
Legislators create a budget with a certain max spending amount plus a little safety margin (called an ending balance), determined by an estimate of tax revenue for the next year.  How are those revenue estimates determined?  In Kansas, we get a bunch of smart people together who look at tax policies and rates, the health of the economy, etc., and determine how much money the State will take in each year.
The question thus becomes: how right or wrong are the smart people?*

*As someone who makes projections for a living, I'm well aware of how wrong supposedly smart people can be about projections.


PERFORMANCE OF ESTIMATES

I compared the Kansas consensus revenue estimation group's initial estimates for each year (annualized, not monthly estimates) to the actual outcomes.  The data I used ended in FY 2014, so 2015 is not shown.  But I really just want a historical evaluation; if that data becomes available I'll add it to my charts.  A lot of data visualization here, but also some numbers; what do those numbers look like?


The numbers track together, as we would expect, but visually (read: ocular regression) we see greater variance in recent years.  But that greater variance could be just because we're dealing with larger numbers.  We should really look at this on a percentage basis, like this:



This is actually a fascinating chart, because it points out clearly three "low revenue" misses: 1983 (early 80's recession), 2002 (early 2000's recession), and 2009/2010 (2008 recession).  There are also policy changes tied up in this bit of history, and if you know more about the history of tax policy that might explain any of these swings, please comment below.

Another thing that stands out is that there are relatively few multi-year, non-recession-associated negative misses, with none occurring since the mid-1980s.  The misses in 2013 and 2014 (and 2015, not shown) are relative anomalies in that sense.  

Is there another way to look at how accurate revenue estimates are, smoothing for negative misses and bad years?  Here's a five-year moving average version that looks at absolute deviation.  


That last view is telling, because it shows that over time, the revenue estimates are off by an average of 5.1%, negative or positive.  The current amount of "slack" in the budget is $100 million.  By way of comparison, we would expect the average year to be off by about 5.1%, or $300 million on current budgetary numbers.  In essence, it's quite likely we could see an overrun of that slack.
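
For transparency, here is a rough R sketch of the two calculations behind these charts: the percent miss and the five-year moving average of its absolute value.  The column names are assumptions, and the dollar figure just re-derives the $300 million comparison from the numbers above.

    # Assumed: `rev` has one row per fiscal year
    #   rev <- data.frame(year = ..., estimate = ..., actual = ...)

    # Percent miss: positive means revenue came in above the initial estimate
    rev$pct_miss <- (rev$actual - rev$estimate) / rev$estimate

    # Five-year trailing moving average of the absolute deviation
    rev$abs_miss_5yr <- as.numeric(stats::filter(abs(rev$pct_miss), rep(1/5, 5), sides = 1))

    # A 5.1% average miss on a ~$5.9B general fund is roughly the $300M cited above
    0.051 * 5.9e9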

CONCLUSION

Some takeaway bullet points:
  • The revenue estimates historically are fairly accurate; however, they have an average "miss" of about 5% annually.  That 5%, if negative, would overrun the current budgetary slack.
  • The largest negative misses historically are associated with major recessions.  But we have seen three consecutive estimate misses, outside of recession conditions.
  • The misses in the last three years are uncharacteristic, as multi-year misses outside of a recession (the last one ended in 2009).  That's not good, and could point to something inherently flawed in the revenue estimates.


Thursday, October 8, 2015

Royals Playoff Power Rankings!

Hey you know that baseball team that I liked when I was a kid, but didn't make the playoffs from the time I was 5 until I was 33?  They're in the playoffs again!

In the past I have used sentiment and topic modeling technology to analyze text on the #ksleg hashtag on Twitter, and also to create "power rankings."  Why not do that for the Royals too?  Especially in the playoffs?  Let's kick the playoffs off in style.

POWER RANKINGS

So my power ranking system is based on the reach an individual account has through tweets, retweets, and favorites.  I don't disclose the system, because I know that being on my power rankings is highly coveted, and accounts would just start gaming my system.  I can say that I downloaded the last 48 hours of tweets using the hashtag #Royals.  Here are the top #Royals accounts over the past 48 hours:


The top-ranked accounts here actually make a lot of sense: the @Royals official account, some players, a few local media people.  I'll check in tomorrow to see who is best at live-tweeting the game.


TOPIC MODELING

So what topics are people talking about when using the #Royals hashtag?  Here's a wordcloud of common terms.
Quite a few things you would expect in there, including Astro (their opponent), baseball, MLB, takethecrown (obvious Royals pun), and... Emma Watson?  What?

Ok, so can we break this into topics using a simple low-n topic model algorithm?  Answer: Yes.  I used the correlated topic model algorithm described here.




Topic 1: A topic mainly about naked Emma Watson pictures and people selling stuff, some of it Royals related, like this.  (I've written about this before, but high-popularity hashtags get a lot of spam traffic, which is the nature of this hashtag.)

Topic 2: Topic about the other Royals.  You know, Kate, William, their babies.  Pictures of them.  Here's an example.


Topic 3: A topic cheering on the Royals.  This one is largely talking about the playoffs, and being positive towards the Royals.

Topic 4: A topic about tickets.  A lot of people looking for tickets, or regretting not having tickets.  Or gloating about having tickets.



SENTIMENT MINING

Sentiment mining is effectively mining text for feelings.  The model above was a topic model, which just separates tweets into broad correlated categories by "what" they are talking about.  Sentiment mining, however, looks at tweets and determines what emotion is being expressed (joy, sadness, anger, etc.) and how positive or negative they are.

A lot of different ways to go with this, but I used our above topics (renamed appropriately) and calculated the associated probability for two sentiments: joy and sadness. 

What did I find?  Cheering on the Royals was the most joyful, while not having tickets was the saddest.









2015 NFL Picks: Week 5

For all practical purposes this year looks like a dud year for my KC Chiefs.   Right now they are last in their division, behind the freaking Raiders.  THE RAIDERS.  Here's proof.



With low performance by the hometown team, and with the other hometown team (the Royals) advancing through the MLB playoffs, I doubt I'll follow the NFL closely for the rest of the season.  That said, my pre-season model (predictions for all games made in May) is still performing above most null or pre-existing heuristic-based models, at 37-26.  

THIS WEEK'S PICKS

And here are this week's picks; keep in mind they were made in May.  Given performance so far this year, we could make some changes to these picks, but it's almost more fun to see how May's picks play out.


Monday, October 5, 2015

KU Football: The Chase for a Winless Season

Last week I posted on the probability of the University of Kansas Football team losing every game this season.  Since then, the team lost again, thus increasing the probability of a winless season. I also found some great information on the topic of futility in college football.

BACKGROUND

Since I last posted, I scoured the internet and found some great information on horrible college football teams. Two great pieces:

  • List of Winless Seasons.  Winless seasons in college football are only pseudo-rare.  By that I mean it happens fairly often (a couple of teams each year), but it's still rare enough that we can keep a list of it.  That list is interesting, and even includes seven (SEVEN!) winless seasons by my undergraduate alma mater, Kansas State University.  Also, the Toilet Bowl is an interesting clickhole I found myself in.
  • The Bottom 25.  This is a CBS attempt to rank the worst teams in college football.  Guess who is considered the worst right now?  The University of Kansas.  An interesting insight from this analysis is that KU had a better chance to win its first Big 12 game this year than any other Big 12 game.  They already lost that game, so I need to change my methodology to maintain an accurate probability estimate.

METHOD CHANGE

My prior methodology gives a good estimate of how likely it is that KU will go winless based on historical probabilities.  This is especially true in the first week, when the entire Big 12 season is laid out in front of us.  But there is one bias to the estimate: KU is significantly more likely to win some games than others.  Complicating this is that KU's statistically easiest game was the first game of the Big 12 season; after losing it, the probability of going winless increases dramatically.

While a team that only wins 6.8% of its games doesn't see a drastic difference in probabilities from game to game (wins are essentially a "fluke" and not as tied to opposition talent as they are for competitive teams), we can still estimate the relative strengths of teams, using their individual probabilities of losing, and then weighting KU's probability by that.  This methodology looks at the historical performance of each team, and adjusts the 6.8% by that performance.  Here are the probabilities that each team will win their KU matchup this year:




The other worst team in the league (ISU) has only an 86.8 percent chance of beating KU, whereas the best team (Baylor) has a 98.3 percent chance.  Relatively speaking, this means KU is about 8 times more likely to beat ISU than Baylor.

Because KU played ISU first, their probability of a winless season increased from 53% to 61%.  My initial model had it at only 57% following a week 1 conference loss, but that flat model didn't account for Iowa State's poor play.
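
Mechanically, the winless-season estimate is just the product of every remaining opponent's probability of beating KU.  A small R sketch: only the ISU and Baylor numbers come from the analysis above; the other opponents are placeholders.

    # Probability each opponent beats KU (ISU and Baylor from above; rest are placeholders)
    p_loss <- c(ISU = 0.868, Baylor = 0.983, OpponentC = 0.95, OpponentD = 0.95)

    # Before the season: KU goes winless only if every opponent wins
    prod(p_loss)

    # After the actual loss to ISU, that game is certain, so only the
    # remaining opponents matter -- the estimate goes up
    prod(p_loss[names(p_loss) != "ISU"])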





Thursday, October 1, 2015

Voter Suspension List: What Community Factors Matter?

My post from earlier in the week on Voter Suspension Demographics has received quite a bit of traffic recently, probably because of the lawsuit filed by former gubernatorial candidate Paul Davis against Secretary of State Kris Kobach.  In that post I demonstrated that suspended voters are generally younger and less Republican than the general voting population.  I also promised a deeper dive into the racial and other demographics of suspended voters.

The suspension list doesn't identify the race or economic status of each person, so we can't measure these types of demographics directly.  What we can do is identify members by their communities (here represented by zip codes) and determine which community demographic attributes are most predictive of a high suspended-voter rate.
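
Concretely, that means building a table with one row per zip code and correlating (or regressing) the suspension rate against each census attribute.  A minimal R sketch with assumed column names:

    # Assumed: one row per Johnson County zip code
    #   zips <- data.frame(suspension_rate, median_age, pct_latino,
    #                      pct_african_american, pct_owner_occupied)

    # Simple correlations between suspension rate and each community attribute
    cor(zips$suspension_rate,
        zips[, c("median_age", "pct_latino", "pct_african_american", "pct_owner_occupied")])

    # Or all factors at once in a linear model
    summary(lm(suspension_rate ~ median_age + pct_latino + pct_african_american +
                 pct_owner_occupied, data = zips))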

I looked at census data by zip code in Johnson County, and also confirmed that the results are similar in Sedgwick County (the two largest counties in the state).  I plan on moving fairly quickly through a few data analyses, so I'll just state my findings here (if you have any questions, please comment):

  • Home Ownership (affluence) and African American % Matter.  My home ownership proxy (owner-occupied housing) is negatively related to voter suspension rate, meaning that affluence, measured through home ownership, leads to less voter suspension.  The % of African Americans living in a community was also highly related to suspension rate, meaning the more African Americans, the more people on the suspension list.
  • Median Age and % Latino Matter Less.  Given my earlier analysis, we expected a younger median age in a community to lead directly to more suspended voters.  This is true, but the relationship is relatively weak.  Because the reasons given for citizenship voter requirements often center around illegal Mexican immigration, we expected that relationship to be positive and highly significant.  It was positive, but not as important as the other factors.
First, a map of voter suspension rates in Johnson County:


AGE 

Based on our earlier analysis, we expected the age of a community to be highly related to the percent of suspended voters.  There was a relationship, and in the right direction, but it wasn't as strong as we expected.  Generally this means that, while younger voters are suspended at a higher rate, age isn't an underlying driver at a community level.  We also may need to measure this differently, as it may be more related to "young adults" rather than an overall age metric (median).

Not too relevant, but I mapped this for fun.  Fun stat: Gardner is the youngest community in Joco, Leawood the oldest.  


PERCENT LATINO

A lot of the rhetoric I hear about citizen registration laws is about illegal Mexican immigrants taking over the American political system by voting in our elections.  From an a priori point of view, I assumed that if this were true (either the fear of Mexican immigrants, or that this policy is effective against that), we would see a strong positive relationship here; we did not.  Here's the chart.  


And I had my mapping software here, so here's a map of Latino % across Joco.



PERCENT AFRICAN AMERICAN

Having the census data handy, I had a few more correlations I could try.  I tried everything I could and found two high correlations (absolute R value > .5).  The first one is African American percent in the community.  That doesn't necessarily mean that African Americans are more likely to be suspended voters (though it does strongly point that way); it does mean that in highly African American communities, more voters are suspended.


And a map of African Americans living in Johnson County, with the highest percentages inside the 435 loop.



HOME OWNERSHIP PERCENT

The final major correlation I found with suspension percent is a negative one, but it is the strongest significant correlation we found: the higher the home ownership rate, the lower the voter suspension rate in that community.  Home ownership is generally used as a proxy for community affluence.  We know that younger people are less likely to own their homes, so it is somewhat likely that a combination of age and affluence is at play.  Here's our graph.  


And a map of home ownership rates.  The reds are the highest, with greens being the lowest.







NFL Picks: Week 4

Call me a fair-weather fan, but I have the following rule about following NFL teams: once the team is two games under .500 for the year (1-3, 2-4, etc.), I stop watching until they are again only a single game out of even.  

Given the short NFL season it is fairly difficult to come back from two games under .500 and make the playoffs, so this generally works for me.  My Kansas City Chiefs are in danger of hitting that mark this week, but luckily, they're favored in my rankings.

LAST WEEK PERFORMANCE

Last week I went 11-5 in what was generally an easier week to predict in the NFL, with relatively few upsets and quite a few games that were "easy calls."  That puts me at 28-20 for the season, or once again in the middle of the pack of experts I am tracking, with the same record as Mike Ditka and one game ahead of Boomer Esiason.

THIS WEEK'S PICKS

This week, the picks weren't quite as easy as last week, but there were quite a few obvious picks like Colts over Jaguars and Seahawks over Lions.  I'm picking the Chiefs to win on the road at the Bengals, and if they don't, it could be the last Chiefs game I watch this year.