Sanitize your users' HTML input

Courtenay : August 25th, 2008

The built-in Rails sanitize helper is actually quite powerful. For example, you can pass it a whitelist of the tags and attributes you want to allow:

<%= sanitize @article.body, :tags => %w(table tr td), :attributes => %w(id class style) %>

However, as the docs say,

Please note that sanitizing user-provided text does not guarantee that the resulting markup is valid.

We were having an issue with users providing bad markup and leaving their tags unclosed.

This is <a href="http://foo.com">my dog<a/> and he’s super cool!
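
To make the problem concrete, here is a minimal plain-Ruby sketch that lists tags which were opened but never closed. The unclosed_tags helper and the VOID_TAGS list are made up for illustration (they are not part of Rails or Hpricot), and the regex scan is only a rough approximation of what a real HTML parser does.

```ruby
# Elements that never take a closing tag, so they are skipped.
VOID_TAGS = %w(br hr img input meta link).freeze

# Hypothetical helper: returns the names of tags that were opened
# but never closed, in the order they were opened.
def unclosed_tags(html)
  stack = []
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>}) do |closing, name, rest|
    name = name.downcase
    # Ignore void elements and self-closing tags such as <a/>.
    next if VOID_TAGS.include?(name) || rest.end_with?("/")

    if closing.empty?
      stack.push(name)                 # opening tag
    elsif (index = stack.rindex(name))
      stack.delete_at(index)           # closing tag matches the latest open one
    end
  end
  stack
end

sample = 'This is <a href="http://foo.com">my dog<a/> and he\'s super cool!'
p unclosed_tags(sample)  # => ["a"]
```

The mistyped <a/> reads as a self-closing tag, so the opening <a href="..."> is never closed and the link swallows the rest of the page.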

We solved it by running Hpricot over their input.

# Re-serialize the body through Hpricot every time the record is saved.
before_save :clean_html

def clean_html
  self.body = Hpricot(body).to_html
end

For performance reasons, you should probably run Hpricot and sanitize on the way into the database rather than when rendering your views: both are somewhat slow, and the result only needs to be calculated once.

In fact, instead of cleaning it in a callback, you could override the attribute writer like so:

def body=(new_body)
  write_attribute :body, Hpricot(new_body).to_html
end
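
As a rough illustration of why cleaning in the writer means the work happens once per write rather than once per render, here is a plain-Ruby sketch with no Rails or Hpricot involved. The Article class, the injected cleaner, and the clean_count counter are all hypothetical stand-ins; in the real model the cleaner would be the Hpricot(new_body).to_html call above.

```ruby
# Plain-Ruby stand-in for a model (hypothetical, not ActiveRecord).
class Article
  attr_reader :body, :clean_count

  # cleaner is any object that responds to #call and returns a String.
  def initialize(cleaner)
    @cleaner = cleaner
    @clean_count = 0
  end

  # Clean on assignment, so every later read returns the stored result.
  def body=(new_body)
    @clean_count += 1
    @body = @cleaner.call(new_body.to_s)
  end
end

# Placeholder cleaner that only trims whitespace (an assumption made to
# keep the sketch dependency-free; it does not repair markup).
strip_cleaner = ->(html) { html.strip }

article = Article.new(strip_cleaner)
article.body = "  <p>hello</p>  "
3.times { article.body }  # reads do not trigger any more cleaning
```

After the three reads, clean_count is still 1, which is the whole point of doing the work on the way in.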

You’ll want to include ActionView::Helpers::SanitizeHelper in your model to make ‘sanitize’ available there.

9 Responses to “Sanitize your users' HTML input”

  1. Luke Francl Says:

    Good idea!

    I built xss terminate (http://code.google.com/p/xssterminate/) to help do this sanitization automatically when records are saved. I included support for sanitizing with HTML5lib (http://code.google.com/p/html5lib/) which parses HTML like browsers do to try to fix the invalid HTML problem, but I didn't try just running it through hpricot.

  2. Matthijs Langenberg Says:

    Good solution, but the thing that I don't like is including a view helper in the model. There is a reason for not having view helpers, route generators and sessions available in the model. Otherwise you'll get a really fat model.

  3. Joe Van Dyk Says:

    http://gist.github.com/7086

    You could put hpricot in there as well, and do everything in the controller where it belongs.

  4. court3nay Says:

    technoweenie's whitelist helper made it into rails, that sanitize method IS whitelist. just so you know.

  5. court3nay Says:

    oh i hate markdown

  6. rick Says:

    One issue I have is this changes the user's original body. This is why I tend to save to an alternate field like formatted_body.

    Also, if you look at the sanitize_helper, the html sanitizers are classes from the html tokenizer library. There's no need to include helpers:

    sanitized = HTML::WhiteListSanitizer.new.sanitize(body)
    formatted = Hpricot(sanitized).to_html
    write_attribute :formatted_body, formatted
    
  7. Mina Says:

    If anyone misses perl's powerful HTML::Scrubber module, Michael Moen wrote HpricotScrub ( http://github.com/UnderpantsGnome/hpricot_scrub/tree/master )

    Example of scrubbing rules: http://github.com/UnderpantsGnome/hpricotscrub/tree/master/test/hpricotscrub_test.rb

  8. Dan Manges Says:

    I'm using Tidy on a project to make sure user input doesn't produce invalid html. Not sure how its performance compares to Hpricot, but I'm doing it on the way out of the database, and the performance hit is negligible.

  9. Kevin Olbrich Says:

    I've used hpricot_scrub before too, with a great deal of success. I actually do not store the scrubbed version. That way you don't have to worry about synchronizing the fields and pulling out twice as much data from the DB every time. If you properly cache your views, you won't actually be calling the scrub operation very often at all.
