Saturday, June 13, 2015

R in production diaries: sqlsave slowness

A few weeks ago, I received a text from a former colleague. Here's what it read:

Do you do any bulk writes from R to a db?  Sqlsave is slow.  I'm considering writing a file and picking it up with another tool.

I knew exactly what he was talking about. sqlSave() is the function in the RODBC package for writing from R to SQL databases. I also knew that he was likely refactoring my code, or code he had partially copied from me at some point.

THE SOLUTION

Fortunately, a couple of years ago I had migrated off of sqlSave() to another solution. Here was my emailed response to my colleague:

Easier to type on here.
We don't have to do any bulk saves in production; everything there is a single transaction, so we're saving back 6-7 records at any one time (times 100 transactions per minute; yes, it's a lot in total, but not a lot at any one execution).

As for sqlSave(): I got fed up with the function and no longer use it. We have an insert stored procedure that we call from R. Much faster, much more efficient, and easier.

We basically call an ODBC execute-command wrapper, then just execute the proc with parameters to insert our data.

I don't know how this would work from your point of view, but you could, in theory (sans proc), create a text string in R that is essentially the bulk insert you want to do, and execute it AS SQL. Which I think is the key... send the command, and let the insert be handled by the ODBC driver as a native SQL command, not as a hybrid weirdo sqlSave().
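
To make that concrete, here's a minimal sketch of the proc-call approach using RODBC. The DSN, procedure name, and parameters below are all made up for illustration; the pattern is what matters: build an EXEC string in R and hand it to the driver as plain SQL.

library(RODBC)

ch <- odbcConnect("ProductionDSN")   # hypothetical DSN

# Hypothetical insert proc; one EXEC per record, run as native SQL
save_record <- function(ch, request_id, score) {
  qry <- sprintf("EXEC [dbo].[up_InsertDecisionResult] @RequestID = %d, @Score = %f",
                 request_id, score)
  sqlQuery(ch, qry)
}

save_record(ch, 12345, 0.87)
odbcClose(ch)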

CONCLUSION

  • sqlSave() runs slow in production jobs. I'm not sure exactly why, but I would guess it falls into a general class of problems I call "metadata shenanigans."
  • Using SQL directly to write back to the database in some way is generally a faster solution, as sketched below.
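
The bulk version of the same idea, the string-building approach from the email, looks roughly like this with RODBC (the table name and data frame are invented for illustration):

library(RODBC)

df <- data.frame(id = 1:3, amount = c(10.5, 20.0, 7.25))   # stand-in data

ch <- odbcConnect("ProductionDSN")   # hypothetical DSN

# Collapse the data frame into one VALUES list: "(1,10.5),(2,20),(3,7.25)"
values <- paste0("(", df$id, ",", df$amount, ")", collapse = ",")
qry <- paste0("INSERT INTO dbo.payments (id, amount) VALUES ", values)

sqlQuery(ch, qry)   # one round trip; the driver handles it as native SQL
odbcClose(ch)
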
I received a thank-you text yesterday from my former colleague; he refactored, and his job runs faster. All is right in the world again.

Comments:

  1. Hi Levi,

    Could you elaborate a little more on using SQL directly? I assume you'd use something like sqlQuery() in R with the SQL code as a character variable. How would one reference a dataframe (created in the R workspace) in the SQL code when connected to Teradata? Example:

    new_query_3 = "insert into teradata_container.teradata_table select * from df"

    Regards,

    Matt

    Replies
    1. Great question, Matt,

      I'm currently using RJDBC from Linux. Here's an example of how I pull data (the sourced-in file, connectionDynamic.r, holds connection info in a function called get_dbhandle()):

      library(DBI)
      library(rJava)
      library(RJDBC)
      source("connectionDynamic.r")                       # defines get_dbhandle()
      dbhandle <- get_dbhandle(ServerName, DatabaseName)  # open the JDBC connection
      # Build the EXEC statement as plain SQL, then run it
      quex <- paste0("EXECUTE [dbo].[up_QCGetQCDecisionEngineData] @ServiceRequestID = ", Request_ID)
      x <- as.data.frame(dbGetQuery(dbhandle, quex))

      Your question seemed specific to inserts, though. I do this in a couple of ways. Most of the time, live in production, I'm writing back a small number of records, in which case it looks something like this:


      # One proc call per row of dx; dx$...[i] implies a loop over i
      for (i in seq_len(nrow(dx))) {
        query <- paste0("exec achprocessing.[dbo].[up_InsertACHItemBlacklist] @CreateUser = 2, @RoutingNumber = ",
                        dx$Routing_Number[i],
                        ", @AccountNumber = ",
                        dx$Account_Number[i],
                        ", @StartDate = '",
                        toString(Sys.Date()),
                        "', @EndDate = '",
                        toString(Sys.Date()),
                        "'")
        dbSendUpdate(dbhandle, query)
      }

      But if I want to batch insert from a data frame, I push the data frame into a string, then create an insert query from there. Something like this:

      # Collapse the data frame into one "(precinct,bush,kerry)" tuple per row, comma-separated
      insert_data <- paste("(", df$x.precinct, ",", df$x.bush, ",", df$x.kerry, ")", sep = "", collapse = ",")
      query <- paste("insert into precincts values ", insert_data)
      dbSendUpdate(dbhandle, query)
