Monday, February 23, 2015

Day in the life

It struck me last week that most of my job as a Data Science manager is fairly mundane office or IT work, which is necessary but a little boring.  I thought it might be interesting to document my priorities for a fairly average day.

Notably, I spend very little time actually building models.  Most of my time is spent debugging business problems or communicating prior results to other directors and executives.

Anyway, here's what I did on a fairly typical Monday in February 2015.

  1. Approve timesheets: This is as boring as it sounds. It actually means approving timesheets that aren't filled out, but for some reason HR requires it so people get paid.
  2. Create new metric: It occurred to me late last week that we're missing an important part of the revenue conversion process.  In essence, some customers create negative revenue initially, but eventually turn positive.  From an underwriting perspective, we often need to understand this process, especially when other dimensions come into play (how does late-conversion vary by group A versus B, or age of the customer?).  So I'll be working on this metric today, or delegating it.
  3. Test Online Site:  IT had some.. issues.. last week that meant there were periodic outages with our online brand.  I need to have a team member test this so that we know that it is working correctly.  This is especially important to underwriting, because changes in user experience can massively change customer outcomes.
  4. Check on Test Environment:  The IT test environment has had issues for a couple of weeks.  I need to determine whether it's working again so I can deploy three sets of code.
  5. Provide Additional Analytics Info:  We found a major variation in customer performance based on a dimension we had never really analyzed.  I communicated these results to the executive team and a few field directors last week.  In these results, we have found some opportunities for quick gains.  I'll spend a good portion of today communicating and following up with our Exec team.
  6. Provide Info for Marketing: The director of marketing is a very data-hungry guy, and he is asking for some customer retention data.  I'll spend a good portion of the day putting this data together for him.  This type of work would generally be passed on to a Business Intelligence team, but without one of those, it falls to my team.  I think we can add value by explaining some root causes, but really this is just a lot of query work.
  7. Follow up on position to hire with HR:  I complained about the user experience of people applying for my job last week.  I need to follow up and see if any progress has been made towards improving UX, or if I should try to work around those issues.
  8. Review new Resume Submissions:  If any are good, maybe a phone interview?
  9. Build Model For New State Underwriting:  Building an initial model to serve as a proof of concept for the rollout of a nationwide model.  Not a total build out, just initial explorations.
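
The late-conversion metric from item 2 could be sketched roughly as follows. This is only an illustration in Python, not the metric as it was actually built: the function names, the per-customer list of monthly net revenue, and the group labels are all assumptions made up for the example.

```python
from itertools import accumulate
from statistics import mean


def months_to_positive(monthly_net_revenue):
    """Return the 1-based month in which cumulative net revenue first
    turns positive, or None if it never does (hypothetical helper)."""
    for month, cumulative in enumerate(accumulate(monthly_net_revenue), start=1):
        if cumulative > 0:
            return month
    return None


def late_conversion_by_group(customers):
    """Summarize conversion timing per group.

    `customers` is an iterable of (group, monthly_net_revenue) pairs;
    this input shape is assumed purely for illustration.
    """
    by_group = {}
    for group, series in customers:
        by_group.setdefault(group, []).append(months_to_positive(series))

    summary = {}
    for group, months in by_group.items():
        converted = [m for m in months if m is not None]
        summary[group] = {
            "customers": len(months),
            "share_converted": len(converted) / len(months),
            "mean_months_to_positive": mean(converted) if converted else None,
        }
    return summary


# Made-up example data, not real customer figures.
example = [
    ("A", [-50, 20, 20, 20]),  # cumulative -50, -30, -10, 10: positive in month 4
    ("A", [10, 10]),           # positive in month 1
    ("B", [-100, 30, 30]),     # never turns positive in the observed window
]
summary = late_conversion_by_group(example)
```

Swapping the group label for another dimension, such as a customer age band, would give the "group A versus B, or age of the customer" comparison described in item 2.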

Wednesday, February 18, 2015

Model to Revenue

An area I have struggled with is converting predictive algorithms into revenue-maximizing calculations.  Because I create models for a financial services company, I sometimes have a hard time finding an optimization point or showing the exact business impact of a model.

My boss is the CFO, so he always wants to know revenue impact, and often, my model only outputs predictions of some intervening variable.

Should the model increase revenue? Absolutely.  How much? Well...

There are a few factors that feed into this difficulty:

  1. Dependent variable: By the nature of the business, many times my initial predicted (dependent) variables are not "revenue", but early operational outcomes (payment default, returned checks, fraud).
  2. Exogenous costs: Back-end cost of capital and other varying and fixed overhead costs make calculations more complex when considering where break-evens should be set.
  3. Statistical optimization: Classic data science techniques for evaluating model effectiveness (AIC, Mean Squared Error, Area Under the Curve) are not necessarily revenue optimizing, especially considering reason (1) above.

Over time, I've developed strategies to deal with this issue, such as a variation on the model building process and a set of R functions that are specific to our business.

The R functions are very specific, largely because each business converts model outcomes to revenue in a way unique to its industry and business model.  As a result, releasing those functions back to the R community wouldn't be valuable.

My model building process, on the other hand, seems somewhat valuable (although really, it just adds an additional step):

  1. Collect data, try new variables, choose dependent variables that have a known or conceptual relationship with net revenue.
  2. Build and refine exploratory models, just as you would in any other model building project.
  3. Optimize the models to classic statistical algorithm measures (AIC, AUC, etc.).
  4. Validate model performance using a validation set.
  5. Calculate revenue breakpoints and net revenue changes by implementing the models.  If not satisfied, try to determine potential sources of bias and return to step 1.
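
As a rough illustration of step 5, here is a minimal Python sketch of finding a revenue breakpoint on a validation set. It is not the business-specific R code mentioned above. The approval rule (approve when the predicted default probability is below a cutoff), the flat gain per repaying customer, and the flat loss per default are simplifying assumptions, and all names and numbers are invented for the example.

```python
def net_revenue_at_cutoff(scores, defaulted, cutoff, gain_per_good, loss_per_default):
    """Net revenue if every applicant scoring below `cutoff` is approved.

    `scores` are predicted default probabilities; `defaulted` are the
    observed outcomes on the validation set (True means default).
    """
    total = 0
    for score, did_default in zip(scores, defaulted):
        if score < cutoff:  # approved
            total += -loss_per_default if did_default else gain_per_good
    return total


def best_cutoff(scores, defaulted, gain_per_good, loss_per_default):
    """Try each observed score (plus 'approve everyone') as a cutoff and
    return the (cutoff, net_revenue) pair with the highest net revenue."""
    candidates = sorted(set(scores)) + [float("inf")]
    results = [
        (c, net_revenue_at_cutoff(scores, defaulted, c, gain_per_good, loss_per_default))
        for c in candidates
    ]
    return max(results, key=lambda pair: pair[1])


# Made-up validation data: six applicants, two of whom defaulted.
scores = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
defaulted = [False, False, False, True, False, True]

cutoff, net = best_cutoff(scores, defaulted, gain_per_good=100, loss_per_default=300)
# With these numbers, approving only scores below 0.40 gives the best net revenue (300).
```

Note that the cutoff is driven by the gain and loss amounts, not by AUC: two models with the same AUC can land on different breakpoints and different net revenue once costs are applied.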

This process isn't novel, but the important part is to realize that optimized prediction is not the final step, and that the initial predictions are only the first step in increasing revenue for the business.

Thursday, February 5, 2015

R in Production Diaries: rJava made me slow (not really)

Well.  rJava wasn't really the cause.  Here's a basic architecture overview:

1. The web application (C#) makes a call to a service layer.
2. That service layer creates a unique connection per request to the Linux R server using Rserve.
3. Rserve makes a call into R, running code for the Decision Engine (calling libraries, making database connections, running prediction functions against multiple models, writing database logging).
4. R/Rserve return a response to the service layer... the Decision Engine's answer.

We've had three major rounds of optimization:

1. R service initially runs 10-12 seconds.  It turns out this was due to the monstrous time needed to load a GLM model.  I've blogged on this before... here.

2. R service now runs 2.5 seconds on average.  But occasionally 17 seconds.  This was an issue with the services layer (step 2 from above).  This wasn't my code, but was eventually fixed.  I've blogged about this before too... here.

3. R service now runs 2.5 seconds.  Always.  Except that when server volume is high, it sometimes falls behind.  Generally, in the business I'm in, 2.5 seconds is fine, and doesn't negatively impact customer experience.  However, with multiple requests and a backlog, it would be nice if it could run just a bit faster.  So I set out to debug.

I implemented verbose logging and found that most of my process ran in milliseconds, but the server was taking a full 2 seconds to load code libraries, spending most of that time on one large library, rJava.

I call rJava in the process so that I can utilize the RJDBC package, which gives me database connectivity.   It's a necessary library (long story on why I don't use RODBC).

So, it seemed like the solution was to not load rJava on each call, but the way the C# service layer is written gave little ability to do that.  So I figured that if rJava loaded when R itself did, rather than on every request, it would decrease that load time significantly.

Here's the solution... to the file:


add the line:


The entire process now runs in less than half a second, and I (as well as some IT people) am very happy.

Sunday, February 1, 2015

Data Science versus the librarians

OK, so maybe the title of this blog post is overly dramatic, but it describes where I am right now.  Currently, along with my librarian wife, I'm attending the American Library Association Midwinter conference in Chicago.

So far, the biggest fun of the conference is a blizzard warning and an 18-inch snow prediction, which could be expected when you hold a conference in Chicago in late January.  More relevant to my job and this blog, however, there seems to be a lot of interest in STEM and analytics in the library community.

On Friday night, we were invited to a small dinner put on by a vendor selling STEM education products.  The product was interesting, and would have likely appealed to the grade-school version of me.  The vendor pointed out the US was lagging behind in STEM fields, and that this should worry educators and librarians etc... 

Interestingly, in the presentation the vendor asserted that the failure of the United States in math and science fields was actually a simple failure of literacy.  I disagree with this notion; the causes of the US failure in these fields go much deeper into our society and the basics of our education system.  I wrote this off, however, as specialists diagnosing a systemic problem as the problem they know the most about.

It was good to know, though, that educators are taking steps to fix our issues in math and science.  Later in the evening, we were asked to go around the room and talk about our professional lives.  I mentioned a project I'm working on to run voice-to-text on all audio from phone calls my company makes, then use Latent Dirichlet Allocation to categorize the transcripts, and then predict future customer behavior.

The librarians seemed bored with my first anecdote.  So I mentioned that in my last search for candidates, of the 10 most qualified candidates for the position, only one was American.  There was an audible gasp.  Hey, if the stats from the earlier presentation didn't get their attention, maybe I did.

The rest of the conference is basically me chatting with and occasionally drinking with librarians.  There are a lot of vendors here selling "big data" products that are actually just repackaging census data and, if they're more sophisticated, adding a GIS overlay.  Most companies do this fairly poorly (though one, whose name I can't remember, impressed me last year).  As I mentioned in a prior post, I could probably put one of these products together in a day with a junior analyst. 

I have talked to a few librarians actually trying to make a difference in technology.  Some try with kids and have cool "maker spaces" in their libraries where kids can play with 3D printers, program robots, and learn Python.  Others were trying to use data to improve the operational effectiveness of their organizations... though most of these programs were just in their infancy.  It will be interesting to see where data science and library science meet in the next few years.