The Accelerate HR Blog

Super-fast bulk data imports in Rails with ar-extensions   (Sat Jan 19 2008)

Most Rails tutorials assume that when you import or move data it's going to be a record at a time. Yes, there are methods for importing bulk data. You can use migrations for initial data entry, and I've found FasterCSV useful for comma-delimited files.

But these are always painfully slow. Why? Because they iterate Create statements within a block. Like this:

FasterCSV.foreach("#{RAILS_ROOT}/lib/nationality.csv") do |row| Nationality.create(:nationality => row[1])
end

For each iteration, a separate SQL statement is being generated. And when you're importing thousands of records, that gives you plenty of time to go and get a coffee .. or two .. or three.

In the Accelerate HR database, we're going to be bulk-importing data frequently, notably when we capture and then interpret records from an attendance reader. Take a largish business - with a couple of thousand employees. They may clock in and out several times each day. We'll need to pick up the data once a day, and that could mean 10 - 20,000 records. Even if I ran the import in the background, that's quite a strain on the system - and delays of more than a second or two just aren't acceptable.

Well today, I solved the problem, thanks to a great piece of work from Zach Dennis. A process that was taking more than 10 minutes came down to less than 1 second.

Let me put you in the picture. I was working on the payroll setup. Each employee is assigned to a cost center, and in the Employee model we have the fields costcenter_id and costcenter_date, (this being the date when the cost center was assigned). But people may be moved from one cost center to another. The client I'm dealing with now supplies some of its staff to other businesses, so we use the cost center to determine how much to charge for these services. This means we also need a cost center history, showing all the assignments and the change-dates.

To get this started, I'd set the current cost centers for 1150 employees (using another automated process ... but that's another story.) When the user confirms that everyone's been assigned and the set-up is complete, I wanted to initialize the Costcenterhistory table sending the Employee id, costcenter_id and costcenter_date from Employee to Costcenterhistory. That's what was taking 10 minutes, using the normal looped Create method.

So I Googled around to find a better way, and came across Zach Dennis's ActiveRecord::Extensions - that's ar-extensions. (A quick heads-up here to Ilya Grigorik, who pointed me in the right direction.)

The way that Zach achieves exponential increases in speed is by adding methods to ActiveRecord that allow you to create a single SQL INSERT request instead of looping through a block. To go even faster, you can turn validation off - if you're sure that your import data has already been validated elsewhere of course.

I should add that ar-extensions isn't limited to the import feature. Zach also provides similar SQL native methods to accelerate your Finds, to benchmark your queries, and to create CSV files. You'll find all this and more in the docs.

But before I show you how it all worked for me, let me talk about the docs - and the hours of frustration trying to work it all out before I reached the wonderful Aha moment. I don't know if it's just me, but so much of the Rails documentation I come across takes me halfway there and then leaves me stranded. And Zach, I love your gem, but I wish you'd given me just a bit more help.

Installing the gem, no problem. And then right at the end of the documentation an intro to mass import/insertion of data. Open it up, and there's a description of what the gem does. I'm fine with that too. But then:

Basic Example

class Person < ActiveRecord::Base ; end

columns = [ :first_name, :last_name ]
values = [ [ 'Zach', 'Dennis' ], [ 'Mark', 'Van Holstyn' ], [ 'John', 'Doe' ] ]
Person.import column, values

And that's it.

Now, I'm new to extensions. I've never done this before. Do I add something like this at the bottom of my Costcenterhistory model, after the last end? So I try it.

class Costcenterhistory < ActiveRecord::Base
..
end

columns = [ :employee_id, :costcenter_id, :start_date ]
values = [ [ 1, 12, '2007-12-1' ], [2, 15, '2007-11-25' ] ]
Costcenterhistory.import column, values

And then I guess I need to call this in my Controller:

def setup_ok
Costcenterhistory.import
end

Well it doesn't work. Ah, there's a note in Ilya's post, telling me I need to require 'ar-extensions' in my controller. So I try that.

OK, it works. Or at least the 2 records are added to the database. But there's an error message: Invalid arguments! And a reference to the line number in the gem.

And I spend the next 4-5 hours trying adding arguments to the controller reference, reading anything useful I can find on ActiveRecord extensions, working my way through Zach's code. Nothing works. So I give up. And anyway there's a match I'm planning to watch on TV.

...With two hours away from it all, the answer is blindingly obvious. You don't need to add anything to the model at all! Zach's given us a new method - Model.import - that we can use immediately in the controller. I tried it and it worked immediately.

So my plea, not just to Zach but to everyone who writes all the wonderful Rails material, is to remember that accessibility is an issue for programmers: some of us are only partially sighted and we need our instructions bold and clear. Even if it's the obvious to you, state it for us.

Rant over. Let's see how it worked out.

A. Installed the gem.
B. Added require 'ar-extensions' to the end of /config/environment.rb
C. Model - no change
D. Controller:

def setup_ok

# find the session location - because I want to deal only with
# the employees in a single location.
@location = Location.find(session[:location])

#define the columns I want to import to
columns = [:employee_id, :costcenter_id, :date_from]

# Build the values statement with a Ruby array map
@employees = @location.employees
values = @employees.map { |e| [e.id, e.costcenter_id, e.costcenter_date.to_s] }

# Turn validation off to increase the speed.
# The original data has already been validated
options = { :validate => false }

# Call the ar-extensions method
Costcenterhistory.import columns, values, options

#return to the cost center menu page
redirect_to :controller => "cost_center"
end

Let's look at how I constructed the values statement.

Zach's Basic Example looks like this:
values = [ [ 'Zach', 'Dennis' ], [ 'Mark', 'Van Holstyn' ], [ 'John', 'Doe' ] ]

But clearly, with 1150 records to insert, I don't want to type in the values manually like this. Instead I use a Ruby array map to build the array of arrays. OK, it uses a block - it's iterative - but as we run through the loop we're not updating the database, only creating the SQL statement. So no performance hit.

Time to try it out. I was absolutely delighted - no flabbergasted would be a better word - when in an instant the process was completed. In fact, I didn't believe it had worked and wasn't convinced until I checked the database and found that all the records had been successfully imported. And then I had to try it again to convince myself.

Zach, thank you for a brilliant addition to my Rails tool-set. And I hope this little contribution helps to show others the way.


POSTSCRIPT

I later discovered that turning validation off during data imports - when possible - makes a huge difference to speed. Test details are here. In another post, I look at an outstanding piece of research from Nimble Method published in the last few days, helping to explain why some Rails processes can seem slow - and suggesting other fixes.

Filed under: Ruby on Rails




Comment 7 months ago from Mark Van Holstyn

Just thought I would throw out a refactoring for building the array of values...

@employees = Employee.find(:all, :conditions => ["location_id = ?", @location.id])
values = @employees.map { [e.id, e.costcenter_id, e.costcenter_date.to_s] }
Comment 7 months ago from alanmiles

Mark - thanks for pointing me towards array mapping. Great suggestion - and I've updated the post.

Just a small thing. Your values line is missing the block definition. It should read:

values = @employees.map { |e| [e.id, e.costcenter_id, e.costcenter_date.to_s] }
Comment 7 months ago from Ivan Vega

Nice! I didn't knew I needed this until this post. Only one question. Why are you querying Employee instead of just calling @location.employees ? Thanks!
Comment 7 months ago from alanmiles

Ivan - Because I didn't know any better. But I do now. Many thanks. I've updated the post, that was easy. Now to refactor the database ...
Comment 7 months ago from Timw

Lifesaver... thank you very much for this tutorial/blog entry I heard about this from the Rail Envy podcast on my way to work this morning. Knowing I was going to have to migrate a legacy database onto our new app in the next few weeks I was worried it was going to take forever. I wrote a rake task the standard way first, it was going to take the 62,000 rows about 4 or 5 hours. I then ported it to ar-extensions import feature and it took less then 20 minutes. Considering I have about 30 tables to do, some bigger some smaller, this is truly a lifesaver.
Comment 7 months ago from AllenH

Can someone show me what the raw SQL looks like that is generated by this batch extension? My understanding is Oracle doesn't support multiple-row inserts in a single INSERT statement. I'm wondering if this extension works with Oracle or only with MySQL and other RDBMs that support that INSERT syntax.
Comment 7 months ago from Anthony Eden

http://activewarehouse.rubyforge.org/adapter_extensions/ provides similar extensions that use native bulk loading.
Comment 6 months ago from btetampa

Much thanks to you and Zach for the library and instructions. I have an import process that was taking over an hour for 50K+ records which now takes less than 90 seconds. I'm using it from a rake task, not a controller, but it works just as well.


List recent Entries
List all blog entries filed under:

Employment Politics
HR
Implementation
Just thoughts
Ruby on Rails
Web 2.0

Can't find what you're looking for? Try this: -

Search blog for a word or phrase



 Subscribe to an RSS feed

Or get an email copy every time we post something new. Nothing new? Nothing mailed

Enter your email address:

Delivered by FeedBurner


If you're enjoying our blog, why not find out more, and maybe get involved?

ACCELERATE HR is a website built on Rails and designed for the enterprise. And we're building it live on the Web, right here.

Check out our home page HERE, or sign up for free HERE.


DESERT ISLAND BLOGS

Sharing a few of my favorites

HR Stimuli
McArthur's Rant

Jon Ingham's Strategic Human Capital Management Blog

The Rails Track
Railscasts

Web Power
The Technology Edge

Window on the Gulf
Mahmood's Den