data migration tip
Courtenay : August 20th, 2008
I’m tracking all the failures that occur in a model, so the users can easily track and resolve them. The data looks something like this:
Failures table
id | video_id | description
----+----------+------------------------
1 | 5 | Transcoding error
2 | 23 | Bad file type
Videos table
id | name | creator_id
----+----------+------------------------
5 | Kitten | 23
6 | Monkey | 12
23 | Elephant | 23
If we want to search for all failures by creator, we have to do a join on Failures and Video. To make this a little faster, I will denormalize the data a little, by adding a creator_id to failures table, and a callback to the Failure model to set the creator_id field. This is one of the scaling tradeoffs you need to make: slower writes, slower updates, larger table disk size, faster reads and counts (with grouping).
class Failure < ActiveRecord::Base
before_update :denormalize_creator
def denormalize_creator
self.creator_id = video && video.creator_id
end
end
This might have some issues depending on if you’re using #build to generate your Failure object. Regardless..
The temptation (for me, anyways) is to create a migration that looks something like this:
class AddCreatorIdToFailure ActiveRecord::Migration
def self.up
add_column :failures, :creator_id, :integer
Failure.each do |fail|
fail.update_attribute :creator_id, fail.video.creator_id
end
end
def self.down
remove_column :failures, :creator_id
end
end
There are a few things bad with this method.
1. You’re loading all failure objects into memory, then performing a query on each one. 2. If you have thousands of failures, it’s going to take some time to run. If it gets stopped partway through, you’ll have to comment out that “add_column” line to get it to re-run.
So. Step one, move the update to its own migration. Then, you can re-run it as often as you like.
Step two, make the migration a bit smarter. You can do this either by rewriting it in SQL, or by using something like paginated_each (jfgi).
When you do that, it’s worth throwing some conditions and an include in there. For example,
Failure.paginated_each(:order => "id desc", :conditions => "creator_id IS NULL", :include => :video) do |fail|
fail.update_attribute :creator_id, fail.video.creator_id
end
You can run this migration as many times as you like (it will only query the records it hasn’t updated). Ultimately, though, unless you’re doing polymorphic associations (which makes the join nigh on impossible), it’s going to be 10 – 100x faster (wild guess) doing the update in raw sql. Any takers on the best SQL for this situation?
5 Responses to “data migration tip”
Sorry, comments are closed for this article.
August 20th, 2008 at 01:58 PM
Wrapping the migration in a transaction block solves the problem of stopping the migration part way through, I think.
August 20th, 2008 at 02:05 PM
nope. schema changes don't run in transactions.
The SQL looks like
And it's about 10000x faster.
August 20th, 2008 at 04:08 PM
If you can't do JOIN on UPDATE on your database, you can still do a correlated subquery. It'll take forever, but faster and less memory than any Ruby-only approach.
August 23rd, 2008 at 09:43 AM
Something like:
should work too
September 4th, 2008 at 10:02 AM
Shouldn't there be Failure.resetcolumninformation in there?