The Accelerate HR Blog

Bulk imports into Rails: validate if you must   (Sun Feb 03 2008)

I've been importing data from a legacy database, and noticing how much time Rails spends on validation.

With just over 900 employee absence records to bring into Accelerate HR, I started by converting the original table into a csv file and then used a standard looped Rake task.

desc "Load archived absences into database."
task :load_absences => ["#{RAILS_ROOT}/lib/absences.csv", :environment] do | t | before_count = Absence.count

File.read(t.prerequisites.first).each do | line |

employee_id, first_day, last_day, days_not_counted, total_days, absencetype_id,
percent_to_pay, docs_provided, payroll_month, payroll_year, note = line.split(/,/)

Absence.create(:employee_id => employee_id,
:first_day => first_day,
:last_day => last_day,
:days_not_counted => days_not_counted,
:total_days => total_days,
:absencetype_id => absencetype_id,
:percent_to_pay => percent_to_pay,
:docs_provided => docs_provided,
:payroll_month => payroll_month,
:payroll_year => payroll_year,
:note => note
)
end
puts "Loaded #{Absence.count - before_count} entries."
end


The result? 5 minutes 25 seconds to complete the download.


So then I tried the ar_extensions method that I've been talking about recently. This is how the rake file looked:

desc "Load archived absences into database."
task (:load_absences => :environment) do

columns = [:employee_id, :first_day, :last_day, :days_not_counted, :total_days,
:absencetype_id , :percent_to_pay, :docs_provided, :payroll_month, :payroll_year, :note]

values = FasterCSV.read("#{RAILS_ROOT}/lib/absences.csv")
#Comment out this line when validation is on:
#options = { :validate => false }

before_count = Absence.count

#Comment out this line when validation is off
Absence.import columns, values

#Comment out this line when validation is on
#Absence.import columns, values, options

puts "Loaded #{Absence.count - before_count} entries."

end


A little better at 4 minutes 40 seconds - but actually I was pretty disappointed. After the spectacular success I'd had using ar-extensions to create new tables within the database, I was expecting a better gain.

So then I tried using ar-extensions with its handy validation off option. This was safe: the records had already been validated in the original database. (Sometimes it helps if you're maintaining the legacy database too!)

Now there was a quite dramatic increase: just 9 seconds for the task to complete, 36 times faster than the original rake task, and over 30 times faster than the same ar-extensions method with validation on.

So it seems that while working with a native sql statement rather than an ActiveRecord loop certainly improves speed, it's validation that's slowing everything down. Why should this be? Well, it looks as though the research just published by Nimble Method could be pointing us in the right direction, and beginning to give the right answers.

I got a further improvement of a second and a half by turning off the record count - but the importance of the count outweighs the small gain. It's my guarantee that all the records have been successfully imported.


So on to my next import task - nearly 40,000 overtime records. That's the monthly figure coming in from my Middle East construction client, working to a very tight schedule on a major project. And it's why I need efficient data import methods. Lots more records ... but a simpler 10-field table table this time - and just 24 seconds for all 40,000 items..

Now this isn't always going to work. More often than not, the data we import will need validation. But if speed is important and you're really sure it's not necessary to validate your source data, then I'd suggest you just don't do it.

Filed under: Ruby on Rails




Comment 6 months ago from kjetilge

This looks great. Perfect for importing my CSV list of schools for mass mailing. However this code seems not to work the way it's supposed to. I keep getting the error: Array can't be coerced into Fixnum.

I've printed the columns and values to screen and they look fine. Something goes wrong when I call School.import columns , values.

Have I forgotten something ?

This is my code:

namespace :database do
require 'faster_csv'
require 'ar-extensions'

desc "Load schools into database."
task (:load_schools => :environment) do
columns = [:navn,:care_off,:postadresse,:postnr,:poststed,:kommunenr,:kommune,:telefon,
:telefon_2,:telefaks,:klasser,:elever,:maalform,:epost,:internett,:adressetype]
values = FasterCSV.read("#{RAILS_ROOT}/lib/skoler.csv")
#Comment out this line when validation is on:
#options = { :validate => false }
p values
before_count = School.count
#Comment out this line when validation is off
#School.import_data columns, values
School.import columns , values

#Comment out this line when validation is on
#School.import columns, values, options

#puts "Loaded #{School.count - before_count} entries."

end

end
Comment 5 months ago from alanmiles

Hi kjetilge

Sorry not to have responded sooner. I've been having so much designing that I've neglected to look after the shop. Sorry too that your comment came out looking so awful. I really must add text formatting to the comment box.

From the error message, it looks as if Ruby's moaning about .csv trying to send text into a number field. You've got validation ON in your code, and that gives Ruby license to moan.

My first guess is that you might have set up your phone or fax fields to numerics, but then tried to send a number with the phone code separated from the number - or with a hyphen between the numbers - which would make it a string. It's my first guess because that's the kind of mistake I might have made.

Or, if not that, then could the ' p values ' line be the culprit? - I can't really see how that fits anyway.

You could clean up the rest a bit too to make it easier to read. For example, if you're not going to output the number of records imported at the completion of your task (the line's commented out), then you also don't need the ' before_count = School.count ' line.

Hope this helps.
Comment about 1 month ago from Guest

Are you using sqlite3? There's a bug in there... http://rubyforge.org/tracker/index.php?func=detail&aid=13659&group_id=2113&atid=8253


List recent Entries
List all blog entries filed under:

Employment Politics
HR
Implementation
Just thoughts
Ruby on Rails
Web 2.0

Can't find what you're looking for? Try this: -

Search blog for a word or phrase



 Subscribe to an RSS feed

Or get an email copy every time we post something new. Nothing new? Nothing mailed

Enter your email address:

Delivered by FeedBurner


If you're enjoying our blog, why not find out more, and maybe get involved?

ACCELERATE HR is a website built on Rails and designed for the enterprise. And we're building it live on the Web, right here.

Check out our home page HERE, or sign up for free HERE.


DESERT ISLAND BLOGS

Sharing a few of my favorites

HR Stimuli
McArthur's Rant

Jon Ingham's Strategic Human Capital Management Blog

The Rails Track
Railscasts

Web Power
The Technology Edge

Window on the Gulf
Mahmood's Den