The Accelerate HR Blog
More notes on super-fast data insertion in Rails (Sat Feb 02 2008)
In my last post I explained how I've been using Zach Dennis's ar-extensions to achieve dramatic speed increases when importing large amounts of data into my Rails HR database - Accelerate HR.
Now, thanks to the folks at Nimble Method, I understand better why bulk data imports and updates in Rails normally take so long to execute, and why ar-extensions works so well. It seems that we have a problem with garbage collection.
"Every time you allocate 8M of memory GC runs. Complex Rails requests can allocate hundreds or even thousands of megabytes of memory, making GC runs dozens of times. Each GC pass takes 50-150ms. You do the math."
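Doing the math is easy enough to sketch in a few lines of Ruby. This is only a back-of-the-envelope estimate using the figures in the quote; the 400 MB request and the gc_overhead helper are my own assumptions for illustration, not something from the Nimble Method post.

```ruby
# Rough estimate of time spent in GC for one request,
# using the figures quoted above.
GC_TRIGGER_MB = 8          # one GC run per 8 MB allocated
GC_PASS_MS    = (50..150)  # each pass takes 50-150 ms

# Hypothetical helper: returns [runs, min_seconds, max_seconds].
def gc_overhead(allocated_mb)
  runs = allocated_mb / GC_TRIGGER_MB
  [runs, runs * GC_PASS_MS.min / 1000.0, runs * GC_PASS_MS.max / 1000.0]
end

# Assumed example: a request that allocates 400 MB in total.
runs, low, high = gc_overhead(400)
puts "400 MB allocated: #{runs} GC runs, #{low}-#{high} s spent in GC"
```

So under these assumptions a single heavy request could spend several seconds doing nothing but collecting garbage.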
OK, you say, then, let's stop collecting the garbage. Or let's send the garbage collectors round less often. But that's not such a good idea if you want your Rails site smelling sweet. In fact if you left the garbage out there long enough, civilized society on your site would just break down.
No, instead Nimble Method are suggesting a green approach. Generate less garbage.
And the way you do that is to patch your Ruby code. Just go to the post and you'll see a number of smart suggestions, together with an impressive set of before-and-after performance measurements.
This is a great piece of work, and I'm glad to read that the guys have been submitting these patches as enhancements to core Rails and Ruby. But for any medium- or large-scale data import or update job, it now makes even more sense to take the ar-extensions approach and to avoid altogether the need to loop through ActiveRecord::Base#create or ActiveRecord::Base#save. By creating a single SQL statement for the whole collection of data, we have just one large task instead of several thousand small ones. And the result is much less garbage. And dramatically faster imports and updates.
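To make the "one statement instead of thousands" idea concrete, here is a minimal plain-Ruby sketch that turns a collection of rows into a single multi-row INSERT. It is not how ar-extensions is implemented internally; bulk_insert_sql is a hypothetical helper, the quoting is deliberately naive (every value is quoted as a string), and the multi-row VALUES syntax is assumed to be supported by your database.

```ruby
# Hypothetical helper: builds ONE multi-row INSERT for all rows,
# instead of issuing one INSERT per record.
# Quoting is naive (single quotes doubled, everything quoted as a string);
# a real import should rely on the database adapter's own quoting.
def bulk_insert_sql(table, columns, rows)
  quote = lambda do |v|
    v.nil? ? "NULL" : "'#{v.to_s.gsub("'", "''")}'"
  end
  tuples = rows.map { |row| "(#{row.map(&quote).join(', ')})" }
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
end

# Made-up sample data
columns = [:first_name, :last_name, :staff_number]
rows = [
  ["Ann", "O'Brien", 1001],
  ["Raj", "Patel", 1002]
]

sql = bulk_insert_sql("employees", columns, rows)
puts sql
```

However many rows you pass in, the database receives one statement, and Ruby never has to build a full ActiveRecord object per row.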
Don't get me wrong. The Nimble Method patches are important. Sometimes, ar-extensions is not going to do the job for you. For example, although you can use model validation to check the data, callbacks are not supported. So when in my database I wanted to import a set of several hundred employee records from an external source, and simultaneously update employee benefit entitlements depending on contract status, grade and length of service, I found myself going back to the standard ActiveRecord approach (which I described here). In circumstances like this, the patches are certainly going to be useful. And the Nimble Method patches aren't restricted only to data import and update issues.
* * * * *
Finally, following on from last time, you might be interested to see how I used ar-extensions to import from a CSV file using a rake task. I've installed two gems: ar-extensions and FasterCSV.
1. Create the rake task ... /lib/tasks/load_employees.rake
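    desc "Load new employees for location into database."
    task :load_employees => :environment do
      # Column order must match the order of the fields in the CSV file.
      columns = [:first_name, :last_name, :staff_number, :job_id, :date_of_hire]
      # Read the whole file into an array of rows (each row is an array of strings).
      values = FasterCSV.read("#{RAILS_ROOT}/lib/employees.csv")
      before_count = Employee.count
      # ar-extensions bulk import: all rows are inserted in one go.
      Employee.import columns, values
      puts "Loaded #{Employee.count - before_count} entries."
    end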
2. Save the required data as /lib/employees.csv - making sure the columns in the CSV file are in the correct order, and omitting column headers. Note the number of records in the file.
3. Make sure these lines are at the end of the /config/environment.rb file:
require 'fastercsv'
require 'ar-extensions'
4. Run the Rake task - and then check that the reported number of records imported is the same as in the original file.
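Since the CSV file has no headers and the column order matters, it can be worth checking the file before you import it. The sketch below uses Ruby's standard csv library (the successor to FasterCSV on Ruby 1.9 and later) rather than the FasterCSV gem used in the rake task; parse_employee_rows and the sample data are made up for illustration.

```ruby
require "csv" # standard-library successor to FasterCSV on Ruby 1.9+

COLUMNS = [:first_name, :last_name, :staff_number, :job_id, :date_of_hire]

# Hypothetical helper: parses header-less CSV text and returns the rows,
# raising if any row does not have exactly one field per column.
def parse_employee_rows(csv_text)
  rows = CSV.parse(csv_text)
  rows.each_with_index do |row, i|
    unless row.size == COLUMNS.size
      raise ArgumentError,
            "row #{i + 1} has #{row.size} fields, expected #{COLUMNS.size}"
    end
  end
  rows
end

# Made-up sample data in the same column order as COLUMNS.
sample = <<CSV_DATA
Ann,O'Brien,1001,3,2007-11-05
Raj,Patel,1002,5,2008-01-14
CSV_DATA

rows = parse_employee_rows(sample)
puts "Parsed #{rows.size} rows ready for import."
```

The row count printed here is the number you would expect the rake task to report after the import.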
(The next post in this series takes a closer look at validation when importing bulk data. It seems that whether validation is on or off makes a huge difference to your import speed.)