data migration tip

Courtenay : August 20th, 2008

I’m tracking all the failures that occur in a model, so the users can easily track and resolve them. The data looks something like this:


 Failures table

  id | video_id | description
 ----+----------+------------------------
  1  | 5        | Transcoding error
  2  | 23       | Bad file type

 Videos table

  id | name     | creator_id
 ----+----------+------------------------
  5  | Kitten   | 23
  6  | Monkey   | 12
  23 | Elephant | 23    

 

If we want to search for all failures by creator, we have to do a join on Failures and Video. To make this a little faster, I will denormalize the data a little, by adding a creator_id to failures table, and a callback to the Failure model to set the creator_id field. This is one of the scaling tradeoffs you need to make: slower writes, slower updates, larger table disk size, faster reads and counts (with grouping).

class Failure < ActiveRecord::Base
  before_update :denormalize_creator  
  def denormalize_creator
    self.creator_id = video && video.creator_id
  end
end

This might have some issues depending on if you’re using #build to generate your Failure object. Regardless..

The temptation (for me, anyways) is to create a migration that looks something like this:

class AddCreatorIdToFailure  ActiveRecord::Migration
  def self.up
    add_column :failures, :creator_id, :integer
    Failure.each do |fail|
      fail.update_attribute :creator_id, fail.video.creator_id
    end
  end
  def self.down
    remove_column :failures, :creator_id
  end
end

There are a few things bad with this method.

1. You’re loading all failure objects into memory, then performing a query on each one. 2. If you have thousands of failures, it’s going to take some time to run. If it gets stopped partway through, you’ll have to comment out that “add_column” line to get it to re-run.

So. Step one, move the update to its own migration. Then, you can re-run it as often as you like.

Step two, make the migration a bit smarter. You can do this either by rewriting it in SQL, or by using something like paginated_each (jfgi).

When you do that, it’s worth throwing some conditions and an include in there. For example,

Failure.paginated_each(:order => "id desc", :conditions => "creator_id IS NULL", :include => :video) do |fail|
  fail.update_attribute :creator_id, fail.video.creator_id
end

You can run this migration as many times as you like (it will only query the records it hasn’t updated). Ultimately, though, unless you’re doing polymorphic associations (which makes the join nigh on impossible), it’s going to be 10 – 100x faster (wild guess) doing the update in raw sql. Any takers on the best SQL for this situation?

5 Responses to “data migration tip”

  1. Joe Van Dyk Says:

    Wrapping the migration in a transaction block solves the problem of stopping the migration part way through, I think.

  2. court3nay Says:

    nope. schema changes don't run in transactions.

    The SQL looks like

    update failures left join videos on failures.video_id=videos.id set failures.creator_id = videos.creator_id.
    

    And it's about 10000x faster.

  3. Jordi Bunster Says:

    If you can't do JOIN on UPDATE on your database, you can still do a correlated subquery. It'll take forever, but faster and less memory than any Ruby-only approach.

  4. Paul Campbell Says:

    Something like:

    update failures set failures.creator_id = (select creator_id from videos where id = failures.video_id)
    

    should work too

  5. Dave Says:

    Shouldn't there be Failure.resetcolumninformation in there?

Sorry, comments are closed for this article.