weed v2 is coming.. statistical aggregates on rails
Courtenay : April 15th, 2007
If you're interested in keeping a lot of historical statistical data in rails, watch out! Weed is being rewritten by a crack team of monkeys.. I mean, a team of monkeys on crack. By which I mean me.
The storage of data is now automatically aggregated. This means, speed! For those in the know, I borrowed the idea from rrdtool.
an rrdtool primer
If you've never heard of rrdtool, here's how I decided it works. (note: this may be inaccurate)
Normally, you store a bunch of pageview data like this:
created_at: 01-jan 00:25
ip_address: 12.43.522.343
user_agent: Mozila blah blah
created_at: 01-jan 01:45
ip_address: 12.43.522.343
user_agent: Mozila blah blah
And so on. To build a pretty graph of this data, you have to do a sql grouping statement (in mysql) like
select count(id), concat(year(created_at), '-', dayofyear(created_at))
from hits
group by year(created_at), dayofyear(created_at)
or in rails like
Hit.find :all,
:select => "count(id), concat(year(created_at), '-', dayofyear(created_at))",
:group => "year(created_at), dayofyear(created_at)"
which builds up a table something like
+-----------+----------+
| count(id) | date |
+-----------+----------+
| 13 | 2005-256 |
| 313 | 2005-257 |
| 349 | 2005-258 |
| 260 | 2005-259 |
| 163 | 2005-260 |
Now, this is fine. Until you get to about 300,000 records (which is my test data), and that query takes 400ms. Imagine it now with a few million records and it becomes prohibitive to generate that data on a pageload.
If you think about it, the data doesn't actually change; we can't go back in time and hit those pages again. So we keep around a cached version in a new table, like so:
date: 01-jan-2007
hits: 20
date: 02-jan-2007
hits: 45
That way, the query is a straightforward select, no counts or groups which on my dataset runs at less than one millisecond. Perfect.
Now, let's abstract that out a little. What if, instead of storing daily values, we wanted to store weekly values?
date: 01-jan-2007
timespan: 86400 # one day in seconds
hits: 20
date: 01-jan-2007
timespan: 604800 # one week in seconds
hits: 190
So, provided you're regularly aggregating data, you can easily count a year's worth of data by summing the weeks, not the days (which is 1/52 less work) or months (1/12th as much work). This may not seem much but over millions of records, it matters; and even more so if you're performing work on the value, such as doing Math.log on the hitcount.
Automatically updating the aggregate table
Lets imagine we have an Aggregate model
class Rrd::Aggregate < ActiveRecord::Base
And we want to add a method to cache those counts. On mysql we can use the "insert into .. select ..." sql like this:
def self.update
connection.execute("
insert into aggregates (timespan, name, value, time_from)
select '86400', concat(...), count(hits.id), DATE_FORMAT(created_at, '%Y%m%d000000')
from hits
group by dayofyear(created_at), year(created_at)
")
end
This populates the aggregates table with the data we want. Now, wrap it up in a class method, and oh! we don't want to count all records every time, only records since the last update.
def self.last_update
if last = find(:first, :order => 'time_from desc')
last.time_from
end
end
The eagle-eyed amongst you may realise that we may have been part-way through a day before, so that just by updating the aggregate table, the stats get skewed. So, my hackish solution is to first extract that date_format string, and then delete all records matching the last_update's time_from. This way we clear out the last day's data and re-evaluate it.
def self.update(domain_id)
date_format = case @@precision
when 1.day; "%Y%m%d000000"
...
delete_all(["timespan = ? and name=DATE_FORMAT(?, '#{date_format}')", @@precision, last_update(domain_id)])
connection.execute(" ... ")
end
Finally, let's add in some more assumptions; the aggregate table is abstract and uses STI; and there's a domain for stats, so we can grab stats for more than one site.
For the STI, I make sure to define "aggregate_table" in the subclass (in this case, it would be 'hits') and have to do some magic to get the class name.
Note I changed the SQL a little to use a subquery, such that it's less work to define some of the functions. Probably a case of too much DRY, but it feels subjectively better.
For the domain, we just have to pass it around to the methods.
# Table name: aggregates
#
# id :integer(11) not null, primary key
# timespan :integer(11)
# name :string(255)
# value :integer(11)
# domain_id :integer(11)
# time_from :datetime
# type :string(255)
class Rrd::Aggregate < ActiveRecord::Base
belongs_to :domain
@@precision = 1.day
def self.last_update(domain_id)
if last = find(:first, :conditions => ['domain_id=?', domain_id], :order => 'time_from desc')
last.time_from
end
end
# Update the aggregate table with any new data.
def self.update(domain_id)
date_format = case @@precision
when 1.day; "%Y%m%d000000"
else
raise "Precision value not supported"
end
# delete the last day's records since they may have changed.
delete_all(["timespan = ? and name=DATE_FORMAT(?, '#{date_format}')", @@precision, last_update(domain_id)])
connection.execute("
insert into #{table_name} (timespan, name, value, domain_id, time_from, type)
select '#{@@precision}', name, count, #{domain_id}, name, '#{self.name.split(":")[-1]}'
from (
select DATE_FORMAT(created_at, '#{date_format}') as name, count(#{aggregate_table}.id) as count from #{aggregate_table}
where #{aggregate_table}.domain_id = #{domain_id} and #{aggregate_table}.created_at >= '#{last_update(domain_id)}'
group by name) #{aggregate_table}_2
")
end
end
class Rrd::AggregateUnique < Rrd::Aggregate
def self.aggregate_table # verb; this model aggregates ....
"uniques"
end
end
class Rrd::AggregatePageView < Rrd::Aggregate
def self.aggregate_table # verb; this model aggregates ....
"page_views"
end
end
how to use it
The domain model (through which everything is queried) looks something like this:
class Domain < ActiveRecord::Base
has_many :page_views do
def count(*args)
@owner.rrd_aggregate_page_views.sum :value, *args
end
end
has_many :uniques do
def count(*args)
@owner.rrd_aggregate_uniques.sum :value, *args
end
end
has_many :rrd_aggregate_page_views, :class_name => 'Rrd::AggregatePageView', :conditions => ["timespan = ?", 1.day]
has_many :rrd_aggregate_uniques, :class_name => 'Rrd::AggregateUnique', :conditions => ["timespan = ?", 1.day]
I can then create a graph with plotr by sending this data set:
@domain.rrd_aggregate_uniques.find(:all, { :conditions => ["time_from > ?", days.days.ago] })
Bonus points for anyone who works out what I'm missing from the data set. I'll post more on this in the coming days.
Final question: Do you have a better way to do this?
1 Response to “weed v2 is coming.. statistical aggregates on rails”
Sorry, comments are closed for this article.
April 16th, 2007 at 05:10 AM
Ah, aggregation of large data sets. We’ve been hacking away at similar issues for ActiveWarehouse, although we’re dealing with some other thorniness such as loading rather large amounts of slowly changing data from disparate sources who can’t agree on a single exchange format from month to month. Feel the joy. Anyhow, relational aggregation is just one way to aggregate and store data, you may be interested in looking at some of the research done for calculating cubes with very large data sets, such as MOLAP, Dwarf and more recently CURE (which is actually a ROLAP as well). Beware though, this rabbit hole is deep. ;-)