Friday, October 23, 2015

Peer Group Determination: Library Peer Groups

Libraries are an institution generally associated with books and reading, and not necessarily math and data.  But as an avid reader who is married to a librarian, this data guy has an interest in the data behind libraries.  When my wife made me aware of some library datasets that might be interesting, I dug in and looked at the numbers.

I realized this dataset gave me the opportunity to potentially help libraries and to tackle a subject I have wanted to cover on this blog for a while: peer group identification.  Specifically, the research question here is: can we use a data-driven methodology to identify peer groups for individual libraries?  If we can, librarians can use these peer groups for benchmarking, setting best practices, and networking with those in similar situations.

THE PEER GROUP

I'm mixing it up a bit by putting results before the detailed methodology.  The methodology is down below if you're interested, in both math-intensive and non-math-intensive formats.  For data, I used the publicly available 2013 IMLS dataset.

I used a "nearest neighbor" methodology to find peer libraries for my home library system, Johnson County Library (JCL).  The nearest neighbor method is widely used across many fields; here's an example from medical research.  The factors I matched on were population served, branches, funding per capita, and visits per population.

The result is the peer group in the chart below: libraries that have between 11 and 13 branches and similar funding levels, populations, and visits.  There is one extremely close neighbor, the Saint Charles City-County Library District.  This library is similar to JCL in the data, but also in serving affluent suburban areas near Midwestern towns.

I ran this list by my wife and she liked it.  So success?  In the short term at least, but there may be room for refinement (see conclusion).






SIMPLE METHODOLOGY

The "nearest neighbor" approach to determining peer groups is fairly easy to understand at a basic level.  If we wanted to determine a peer group for Johnson County Library without using advanced analytics, we might start by simply looking at all libraries that serve populations between 400K and 500K.

That might give us a good start, but upon diving in we would learn that many of those libraries face different challenges and experiences.  Some would be less affluent with lower funding levels, while others may see far different use patterns.  So we would add in a second variable, let's say funding per population, which would look like this:


In this case we would choose the libraries closest to JCL, roughly the circle in the above graph.  But once again, there's a lot more to the attributes of a library than funding and population served.  What about use patterns, and number of branches?  

This is where I lose most people in the math.  Using this methodology, we can use as many dimensions as we want, and simply calculate the nearest neighbors on all variables simultaneously.  The best way to imagine it is as the above graph extended into 3-dimensional, 4-dimensional, and n-dimensional space.
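
To make the jump from two dimensions to many a little more concrete, here is a minimal Python sketch (not the R code used for this post).  The library names and attribute values are invented for illustration and are assumed to be already rescaled to comparable units; the function names `euclidean` and `rank` are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented, already-rescaled attribute values (NOT real IMLS figures).
# Attribute order: population, funding per capita, visits per capita, branches.
target = (0.5, 1.0, 0.8, 0.2)
candidates = {
    "Library A": (0.6, 0.9, 0.7, 0.3),
    "Library B": (0.5, 1.0, -1.5, 2.0),
    "Library C": (2.5, 1.1, 0.8, 0.2),
}

def rank(target, candidates, dims):
    """Order candidate names from nearest to farthest using the first `dims` attributes."""
    return sorted(
        candidates,
        key=lambda name: euclidean(target[:dims], candidates[name][:dims]),
    )

two_d = rank(target, candidates, 2)   # population and funding only
four_d = rank(target, candidates, 4)  # all four attributes at once
```

With these made-up numbers, Library B is the closest match when only the first two attributes are considered, but drops to last once all four are included.  The same distance function handles both cases; only the number of coordinates changes.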

NERD METHODOLOGY

This is a methodology I have been using since early in my career, both to choose peer groups and in the form of the k-NN algorithm.  Computationally, peer group selection is similar to k-NN, especially in the first phases.  Some generally nerdy notes:

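  • Computation Method: This is easy, really; it's just a minimization of Euclidean distance in multi-dimensional space.  Effectively, a minimization of d in this equation, where p is the library of interest, q is a candidate peer, and n is the number of attributes:

    $$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$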

  • Computation Strategy: The k-NN machine learning algorithm is both computationally elegant and costly.  It is elegant because it is simple to write: basically, we just compute Euclidean distance in n-dimensional space and find the "k" closest neighbors to the point we're interested in.  Simple.  I didn't even use an R package for this; I just rewrote the algorithm in about 10 minutes of R coding so I could have more control.  It is costly because, in its predictive form, it requires calculating distances between every pair of points in a dataset (which, as you can imagine, can be slow in million+ row data tables).  Luckily, in this case I'm only interested in the distances between Johnson County Library and the others, so it's computationally cheaper.
  • Variable Normalization: If you input raw data into the nearest neighbors algorithm, each attribute's importance in the equation is set by its variance (because we're simply measuring distance in raw units), so high-variance attributes dominate.  To avoid that, I take three steps:
    • I conduct some attribute transformations.  Most importantly, I take logarithms of any variables showing a power-law distribution to reduce variance.
    • I Z-score each attribute, such that we are now dealing with equivalent variance units.
    • I (sometimes) multiply the Z-score by a weighting factor, when I want one factor to matter more than others.  In this case, I don't have a good a priori reason to weight factors, but could reconsider if librarians think some factors matter more.
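
The steps above can be sketched in a few lines of Python.  This is a rough illustration under stated assumptions, not the R code used for this post: the rows are invented (not real IMLS figures), the choice of which columns to log-transform is a guess, and the names `normalize`, `nearest_peers`, `LOG_COLUMNS`, and `WEIGHTS` are hypothetical.

```python
import math
from statistics import mean, pstdev

# Invented rows for illustration only (NOT real IMLS figures).
# Columns: population served, branches, funding per capita, visits per capita.
libraries = {
    "Target":    (450_000, 13, 55.0, 6.0),
    "Library A": (400_000, 12, 52.0, 5.8),
    "Library B": (480_000, 11, 58.0, 6.3),
    "Library C": (2_000_000, 40, 30.0, 3.0),
    "Library D": (20_000, 1, 25.0, 4.0),
    "Library E": (500_000, 25, 20.0, 2.5),
}
LOG_COLUMNS = {0, 1}             # assumption: population and branches are heavy-tailed
WEIGHTS = (1.0, 1.0, 1.0, 1.0)   # equal weights, i.e. no factor favored a priori

def normalize(rows, log_columns, weights):
    """Log-transform selected columns, Z-score every column, then apply weights."""
    names = list(rows)
    columns = list(zip(*(rows[name] for name in names)))
    scaled = []
    for i, col in enumerate(columns):
        values = [math.log(v) for v in col] if i in log_columns else list(col)
        mu, sigma = mean(values), pstdev(values)
        scaled.append([weights[i] * (v - mu) / sigma for v in values])
    return {name: tuple(col[j] for col in scaled) for j, name in enumerate(names)}

def nearest_peers(rows, target, k, log_columns=LOG_COLUMNS, weights=WEIGHTS):
    """Return the k rows closest to `target` by Euclidean distance on normalized data."""
    z = normalize(rows, log_columns, weights)
    # Only distances from the target to every other row are computed,
    # not the full pairwise distance matrix.
    distances = {name: math.dist(z[target], z[name]) for name in z if name != target}
    return sorted(distances, key=distances.get)[:k]

peers = nearest_peers(libraries, "Target", k=2)
```

Raising one entry of `WEIGHTS` above 1.0 stretches that dimension, so differences on that factor count for more in the distance.  Because only one row of distances is needed, the cost grows linearly with the number of libraries.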

CONCLUSION

In this post I have covered the methodology for determining peer groups, and created a peer group for the local Johnson County Library.  I hope that it both demonstrates a methodology that could be implemented in many fields and adds value to the field of librarianship.

If any librarians are interested in this analysis or its details, feel free to reach out to me at datasciencenotes1@gmail.com.  I would be happy to provide you a custom peer group for your library.  I would also be interested in any thoughts on improving the peer groupings, either by:
  • Using additional factors or variables.
  • Weighting certain factors as more important than others (is funding or # of branches more important than # served?)
I'll leave you with a final view of the JCL data.  Below is a graph of libraries by visits and population, with the Johnson County peer group highlighted in blue.  Note that there are some orange dots intermixed with the peer group; these are libraries that were good matches on these two factors (visits/population) but not on our other two factors.
Johnson County Peer Group Versus All Libraries

1 comment:

  1. If you add a variable for geography, I think you've got the start of something really interesting here!
