Friday, October 23, 2015

Peer Group Determination: Library Peer Groups

Libraries are institutions generally associated with books and reading, not necessarily math and data.  But as an avid reader who is married to a librarian, this data guy has an interest in the data behind libraries.  When my wife made me aware of some library datasets that might be interesting, I dug in and looked at the numbers.

I realized this dataset gave me the opportunity to potentially help libraries and tackle a subject I have wanted to cover on this blog for a while: Peer Group Identification.  Specifically, the research question here is: can we use a data-driven methodology to identify peer groups for individual libraries?  If we can, librarians can use these peer groups for benchmarking, setting best practices, and networking with those in similar situations.

THE PEER GROUP

I'm mixing it up a bit by putting results before detailed methodology.  It's down below if you're interested, in both math-intensive and non-math-intensive formats.  For data, I used the publicly available 2013 IMLS dataset.

I used a "nearest neighbor" methodology to find peer libraries for my home library system, Johnson County Library (JCL).  The nearest neighbor method is widely used across many fields, here's an example from medical research.  The factors I matched on were population served, branches, funding per capita, and visits per population.  

The result is the peer group in the chart below: libraries with between 11 and 13 branches and similar funding levels, populations, and visit counts.  There is one extremely close neighbor, the Saint Charles City-County Library District.  This library is similar to JCL not just in the data, but also in serving affluent suburban areas near midwestern cities.

I ran this list by my wife, and she liked it.  So, success?  In the short term at least, though there may be room for refinement (see the conclusion).






SIMPLE METHODOLOGY

The "nearest neighbor" methodology to determining peer groups is fairly easy to understand at a basic level.  If we wanted to determine a peer group for Johnson County Library without using advanced analytics, we might start by simply looking at all Libraries that serve between populations between 400K and 500K.  

That might give us a good start, but upon diving in we would learn that many of those libraries face different challenges and experiences.  Some would be less affluent with lower funding levels, while others may see far different use patterns.  So we would add a second variable, let's say funding per population, which would look like this:


In this case we would choose the libraries closest to JCL, roughly the circle in the graph above.  But once again, there's a lot more to a library than funding and population served.  What about use patterns, and the number of branches?

This is where I lose most people in the math.  Using this methodology, we can use as many dimensions as we want and simply calculate the nearest neighbors on all variables simultaneously.  The best way to imagine it is as an extension of the graph above into 3-dimensional, 4-dimensional, and eventually n-dimensional space.

NERD METHODOLOGY

This is a methodology I have been using since early in my career to choose peer groups, and it is computationally similar to the k-NN algorithm, especially in the first phases.  Some generally nerdy notes:

  • Computation Method: This is easy, really; it's just a minimization of Euclidean distance in multi-dimensional space.  Effectively, a minimization of d in this equation:

    d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )

  • Computation Strategy: The k-NN machine learning algorithm is both computationally elegant and costly.  It is elegant because it is simple to write: we just compute Euclidean distance in n-dimensional space and find the "closest" k neighbors to the point we're interested in.  Simple.  I didn't even use an R package for this; I rewrote the algorithm in about 10 minutes of R coding so I could have more control.  It is costly because, in its predictive form, it requires distances calculated between every pair of points in a dataset (which, as you can imagine, can be slow on million-plus-row tables).  Luckily, in this case I'm only interested in the distances between Johnson County Library and the others, so it's computationally cheaper.
  • Variable Normalization: If you feed raw data into the nearest neighbors algorithm, attributes establish their importance in the equation by their variance (because we're simply measuring the raw space).  I take three steps (sketched in code after this list):
    • I conduct some attribute transformations.  Most importantly, I take logarithms of any variables showing a power-law distribution to reduce variance.
    • I Z-score each attribute, such that we are now dealing with equivalent variance units.
    • I (sometimes) multiply the Z-score by a weighting factor when I want one factor to matter more than others.  In this case, I don't have a good a priori reason to weight factors, but I could reconsider if librarians think some factors matter more than others.
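
Putting those steps together with the distance calculation, here's a minimal sketch of the whole computation in R.  This is not my exact script; the data frame and column names are hypothetical stand-ins for the IMLS fields, but the transform / z-score / distance logic is the same.

    # libs: hypothetical data frame, one row per library system:
    #   name, population, branches, funding_pc, visits_pp

    # 1. Log-transform power-law variables (population, for example)
    libs$population <- log(libs$population)

    # 2. Z-score every attribute so each contributes equal variance
    feats <- scale(libs[, c("population", "branches", "funding_pc", "visits_pp")])

    # 3. Euclidean distance from JCL to every library in the dataset
    jcl <- feats[libs$name == "Johnson County Library", ]
    d   <- sqrt(rowSums(sweep(feats, 2, jcl)^2))

    # The nearest neighbors (JCL itself comes first, at distance 0)
    head(libs$name[order(d)], 11)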

CONCLUSION

In this post I covered a methodology for determining peer groups and created a peer group for my local Johnson County Library.  I hope it both demonstrates a methodology that could be implemented in many fields and adds value to the field of librarianship.

If any librarians are interested in this analysis or its details, feel free to reach out to me at datasciencenotes1@gmail.com.  I would be happy to provide a custom peer group for your library.  I would also be interested in any thoughts on improving the peer groupings, either by:
  • Using additional factors or variables.
  • Weighting certain factors as more important than others (is funding or the number of branches more important than population served?).
I'll leave you with a final view of the JCL data.  Below is a graph of libraries by visits and population, with the Johnson County peer group highlighted in blue.  Note that there are some orange dots mixed into the peer group; these are libraries that were good matches on these two factors (visits and population) but not on our other two factors.
Johnson County Peer Group Versus All Libraries

Wednesday, October 21, 2015

Mapping Kansas Jobs

When researching my post from earlier this week, I downloaded some data that wasn't immediately helpful: by-county total employment statistics from 2001 to 2014.  Yesterday, though, I turned that data into some maps that show fairly interesting job patterns in Kansas over 2001-2014.  A few takeaways I found:
  • Johnson County (+) The biggest positive growth over the period, at +35,000 jobs, which is actually greater than the net total across all Kansas counties combined.  That means that if we remove Johnson County, the remaining counties saw net-negative job growth from 2001 to 2014.
  • Sedgwick and Shawnee Counties (-) These two counties have seen the worst aggregate job reductions, losing between 3,500 and 5,000 jobs each.
  • Smaller Counties (+/-) Some good, some bad.  Positives for southwest Kansas, near-suburban counties (e.g. Butler), and those along I-70 (e.g. Ellis); negatives for southeast Kansas and counties along the Nebraska border (e.g. Jewell).
  • Commuter Counties (?) A weird data thing here, but if I divide the number of workers in a county by its total population, I get a sense of commuting patterns.  For instance, the ratio is .870 for Saline County and .396 for neighboring Ottawa County.  That's a huge difference, but having grown up there, I know there's a net-inbound commuter flow into Saline County and a net-outbound flow from Ottawa County.

TOTAL WORKERS BY COUNTY

To start off, I mapped total workers by county, to get a sense of where people worked.  First by 50K buckets.  




That's not as meaningful as I thought it might be; only four of 105 counties have more than 50,000 workers.  Johnson and Sedgwick are dominant.  Let's break that down into more meaningful buckets, where each bucket holds an approximately equal number of counties.



That's a bit more meaningful, and the combination of the two maps gives us an idea of where most jobs are in Kansas: Johnson and Sedgwick Counties, then the rest of eastern Kansas, then a few hot spots in western Kansas (Ellis, Finney, Ford).

EMPLOYMENT CHANGE


First I let QGIS break down the counties as it saw them, into 10,000-worker change buckets.  Below, dark red indicates a net loss of jobs, and orange is a gain of between 0 and 10,000.  The bright green is Johnson County at +35,000.



This gives us an idea that Johnson County is an outlier in job growth, but by how much?  The summary chart below makes the point in numbers, but essentially: without Johnson County, Kansas counties are net-negative in jobs since 2001.

The above map shows aggregate job change, but a better measure of impact to individual counties is percent change.  Here's a map of job growth and loss mapped by percent change. There's also a chart below that shows the biggest winners and losers in jobs over this time period.



And below is the same information in chart form.




COMMUTER POPULATIONS

Because the data was available in my GIS layer, I calculated the ratio of total workers to population aged 18-65.   Here's what that looks like:
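
For the record, the ratio itself is a one-liner; a quick sketch in R, with hypothetical column names:

    # counties: one row per county, with worker counts and population aged 18-65
    counties$work_ratio <- counties$workers / counties$pop_18_65

    # Low ratios suggest net-outbound commuting, high ratios net-inbound
    counties[order(counties$work_ratio), c("county", "work_ratio")]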


The variance range was larger than I expected, but two things seemed to pop out.

  1. Moderately low employment rates overall in *extremely rural*, older, agriculture-based populations (e.g. Decatur, Jewell, Hodgeman, and Edwards Counties).
  2. Near urban "commuter" communities have extremely low employment rates.  These communities are fairly obvious if you're familiar with Kansas demographics. Some county examples:
    • Ottawa commutes to Saline
    • Butler commutes to Sedgwick
    • Miami commutes to Johnson

CONCLUSION

I just pulled these maps together from data already on my desktop, but here are a few summary points:
  • If you remove Johnson County from Kansas, there has been essentially no job growth over the past 14 years.
  • Sedgwick, Shawnee, and a few rural counties have seen especially negative results over this time frame.
  • By looking at ratios of jobs to population, we can detect net commuter-origin and commuter-destination counties, though there's probably a more direct way to measure this.


Wednesday, October 14, 2015

Democratic Debate #1 Summary


After the Democratic debate last night, my wife asked me who I thought won.  I had no clue, actually.  I could name three people who didn't win, but the debate wasn't a clear, decisive victory for either front-runner, Hillary Clinton or Bernie Sanders.  So why not try to derive who won the debate from social media, like I did for the Republicans?

TWO SENTENCE SUMMARIES

I'll just start out with my take from the debate, somewhat sarcastic, and limited to two sentences each.
  • Hillary Clinton - She'll likely win the nomination, so a good part of her debate strategy is like the four corners offense in basketball or "prevent" defense in football.  She's just running out the clock and trying not to mess it up.
  • Bernie Sanders - He made some points indicating a willingness to work with people of differing opinions (gun control), which is weird for a debate.  Came off only half as crazy as portrayed in the media.
  • Jim Webb - Seems to be angling for an appointment as Secretary of Defense.  Here's his picture as Assistant Secretary of Defense, 30 years ago.
  • Martin O'Malley - Still had to google him this morning to figure out anything about him.  I guess he was governor of Maryland?
  • Lincoln Chafee - Is he the one that looks kind of like a bird?  Yes, both a real bird AND Larry Bird.

VOLUME AND POLARITY

Same methodology as usual: I downloaded a sample of tweets from after the debate and analyzed them.  I calculated the number of tweets related to each candidate and the percentage that were positive.  Clinton and Sanders had a similar number of tweets, but tweets mentioning Clinton were a bit more positive.  The rest of the field went Webb, O'Malley, and Chafee, with Chafee getting only slightly more attention than Barack Obama (who is not running, FYI).
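
For reference, a toy version of the volume-and-polarity counts in R.  The tweet vector and positive word list below are placeholders; the real analysis used a proper sentiment lexicon.

    # tweets: character vector of lower-cased tweet text (placeholder)
    candidates <- c("clinton", "sanders", "webb", "omalley", "chafee")
    pos_words  <- c("win", "won", "great", "strong", "good")  # toy lexicon

    for (cand in candidates) {
      hits      <- tweets[grepl(cand, tweets)]
      pos_share <- mean(sapply(hits, function(t) any(sapply(pos_words, grepl, x = t))))
      cat(cand, "- tweets:", length(hits), "positive share:", round(pos_share, 2), "\n")
    }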



Oh, and here's a wordcloud of the post-debate chatter.  Note that only Clinton, Sanders, and Webb make the cloud.


TOPIC MODEL

So what topics were people talking about following the debate?  I ran a quick Correlated Topic Model to find out.
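
A minimal version of that model fit in R, using the tm and topicmodels packages (the real preprocessing, stemming included, is more involved):

    library(tm)
    library(topicmodels)

    # tweets: character vector of post-debate tweet text
    corpus <- Corpus(VectorSource(tweets))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    dtm    <- DocumentTermMatrix(corpus)
    dtm    <- dtm[rowSums(as.matrix(dtm)) > 0, ]  # CTM can't fit empty documents

    ctm <- CTM(dtm, k = 4)  # four correlated topics
    terms(ctm, 8)           # top terms per topic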




TOPIC 1: People talking generally about the facts of the debate the morning after, focusing on who "won" it.
TOPIC 2: Topic focused on people talking about Bernie Sanders saying he was tired of hearing about Hillary Clinton's emails.
TOPIC 3: This is a topic focused on what is being referred to as the "Warren Wing" (Elizabeth Warren) of the party (warrenw is the stemmed hashtag).  Bernie Sanders is largely seen as the most Warren-esque candidate.
TOPIC 4: This is the Hillary-focused topic, centered on Hillary's performance and on CNN all but declaring her the winner of last night's debate.


WHO WON?


Then I looked to see whether any candidate names are disproportionately associated with the words "won" and "winner" to declare a winner in this debate (obviously this isn't a great predictive strategy, but it's at least an amusing way to see who is most associated with winning the debate on social media).  Here's what I got:
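
One hedged way to do this is with the tm package's findAssocs, which reports terms correlated with a given word across documents.  Something like:

    # dtm: the document-term matrix built from the post-debate tweets
    findAssocs(dtm, terms = c("won", "winner"), corlimit = 0.05)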



Senator Warren's name comes up here once again; many people are pointing out, in similar language, that the ideas presented were largely Elizabeth Warren's.  But she's not running, so do any actual candidates come up?  The name Bernie does, but the word "snore" precedes it in the ranking.  For the word "winner," "unclear" is a top word, which matches a fairly broad consensus in the media as well: it's unclear who the actual winner of last night's debate was.

Monday, October 5, 2015

KU Football: The Chase for a Winless Season

Last week I posted on the probability of the University of Kansas football team losing every game this season.  Since then, the team has lost again, increasing the probability of a winless season.  I also found some great information on the topic of futility in college football.

BACKGROUND

Since I last posted, I have been scouring the internet for information on horrible college football teams.  Two great pieces:

  • List of Winless Seasons.  Winless seasons in college football are only pseudo-rare.  By that I mean they happen fairly often (a couple of teams each year), but they're still rare enough that we can keep a list.  That list is interesting, and even includes seven (SEVEN!) winless seasons by my undergraduate alma mater, Kansas State University.  Also, the Toilet Bowl is an interesting clickhole I found myself in.
  • The Bottom 25.  This is a CBS attempt to rank the worst teams in college football.  Guess who is considered the worst right now?  The University of Kansas.  An interesting insight from this analysis is that KU had a better chance of winning its first Big 12 game this year than any other Big 12 game.  They have already lost that game, so I need to change my methodology to maintain an accurate probability estimate.

METHOD CHANGE

My prior methodology gives a good estimate of how likely it is that KU will go winless, based on historical probabilities.  This is especially true in the first week, when the entire Big 12 season is laid out in front of us.  But there is one bias in the estimate: KU is significantly more likely to win some games than others.  Complicating this is that KU's statistically easiest game was the first game of the Big 12 season, and after losing it, the probability of going winless increased dramatically.

While a team that only wins 6.8% of its games doesn't see a drastic difference in probabilities from game to game (its wins are essentially "flukes," not as tied to opposition talent as a competitive team's wins), we can still estimate the relative strengths of opponents using their individual probabilities of losing, and then weight KU's probability by that.  This methodology looks at the historical performance of each team and adjusts the 6.8% accordingly.  Here are the probabilities that each team will win their KU matchup this year:




The other worst team in the league (ISU) has only an 86.8 percent chance of beating KU, whereas the best team (Baylor) has a 98.3 percent chance.  Relatively speaking, this means KU is about 8 times more likely to beat ISU than Baylor (a 13.2% versus a 1.7% chance of a KU win).
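
Mechanically, the winless-season math is then just the product of each opponent's win probability across the games on the schedule.  A sketch in R, with placeholder values for every opponent except ISU and Baylor (which come from the chart above):

    # Probability each Big 12 opponent beats KU; ISU and Baylor are from
    # the chart above, the rest are placeholders
    p_opp_win <- c(ISU = 0.868, Baylor = 0.983, OppC = 0.93, OppD = 0.95,
                   OppE = 0.94, OppF = 0.96, OppG = 0.92, OppH = 0.94)

    prod(p_opp_win)  # probability KU loses every one of these games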

Because KU played ISU first, their probability of a winless season increased from 53% to 61%.  My initial model had it at only 57% following a conference week-one loss, but that flat model didn't account for Iowa State's poor play.





Thursday, October 1, 2015

Voter Suspension List: What Community Factors

My post from earlier in the week on voter suspension demographics has received quite a bit of traffic recently, probably because of the lawsuit filed by former gubernatorial candidate Paul Davis against Secretary of State Kris Kobach.  In that post I demonstrated that suspended voters are generally younger and less Republican than the general voting population.  I also promised a deeper dive into the racial and other demographics of suspended voters.

The suspension list doesn't identify the race or economic status of each person, so we can't measure these kinds of demographics directly.  What we can do is identify members by their communities (here represented by zip codes) and determine which community demographic attributes are most predictive of a high suspended-voter rate.

I looked at census data by zip code in Johnson County, and also confirmed that the results are similar in Sedgwick County (the two largest counties in the state).  I plan on moving fairly quickly through a few data analyses, so I'll just state my findings here (if you have any questions, please comment):

  • Home Ownership (affluence) and African American % Matter.  My home ownership proxy (owner-occupied housing) is negatively related to voter suspension rate, meaning that affluence, as measured through home ownership, leads to less voter suspension.  The % of African Americans living in a community was also highly related to suspension rate, meaning the more African Americans, the more people on the suspension list.
  • Median Age and % Latino Matter Less.  Given my earlier analysis, we expected a lower median age in a community to lead directly to more suspended voters.  This is true, but the relationship is relatively weak.  And because the reasons given for citizenship voting requirements often center on illegal Mexican immigration, we expected the % Latino relationship to be positive and highly significant.  It was positive, but not as important as the other factors.
First, a map of voter suspension rates in Johnson County:


AGE 

Based on our earlier analysis, we expected the age of a community to be highly related to its percentage of suspended voters.  There was a relationship, and in the right direction, but it wasn't as strong as we expected.  Generally this means that while younger voters are suspended at a higher rate, age isn't an underlying driver at the community level.  We also may need to measure this differently, as it may be more related to "young adults" than to an overall age metric like the median.

Not too relevant, but I mapped this for fun.  Fun stat: Gardner is the youngest community in Joco, Leawood the oldest.


PERCENT LATINO

A lot of the rhetoric I hear about citizen registration laws concerns illegal Mexican immigrants taking over the American political system by voting in our elections.  A priori, I assumed that if this were true (either the fear of Mexican immigrants, or the policy being an effective response to it), we would see a strong positive relationship here.  We did not.  Here's the chart.


And since I had my mapping software open, here's a map of Latino % across Joco.



PERCENT AFRICAN AMERICAN

Having the census data handy, I had a few more correlations I could try.  I tried everything I could and found two high correlations (absolute R value > .5).  The first is African American percentage in the community.  That doesn't necessarily mean that African Americans are more likely to be suspended voters (though it does strongly point that way), but it does mean that in highly African American communities, more voters are suspended.
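
The correlation screen behind this is simple; a sketch in R, with hypothetical zip-level field names:

    # zips: one row per zip code, with suspension rate and census attributes
    attrs <- c("median_age", "pct_latino", "pct_african_american", "pct_owner_occupied")
    cors  <- sapply(attrs, function(a) cor(zips$suspension_rate, zips[[a]]))
    cors[abs(cors) > 0.5]  # the 'high correlation' cutoff used above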


And a map of African Americans living in Johnson County, with the highest percentages inside the 435 Loop.



HOME OWNERSHIP PERCENT

The final major correlation I found with suspension percent was a negative one, but it was the strongest significant correlation we found: the higher the home ownership rate, the lower the voter suspension rate in that community.  Home ownership is generally used as a proxy for community affluence.  We know that younger people are less likely to own their homes, so it is likely a combination of age and affluence at play.  Here's our graph.


And a map of home ownership rates.  The reds are the highest, with greens being the lowest.