A Ruby Helper To Cleanly Truncate HTML Text

by Andrea Singh | June 04, 2010

A problem that occasionally crops up, is how to best truncate text with HTML markup, e.g. in order to display the first lines of a blog post or the initial sentences of a product description. The Rails helper method #truncate does not take care of the problem caused by chopped HTML tags or missing end tags.

Things are relatively simple if the markup does not need to be taken into consideration. Then it is just a matter of efficiently stripping out all HTML tags and sending the remainder off to the truncate method.

One way to accomplish this is through regular expressions:

module TextHelper

  TAG_PATTERN = %r{(</?.*?>)}

  def truncate_html(text, max_length = 30, ellipsis = "...")
    tag_free = text.gsub(TAG_PATTERN, '')
    truncate(tag_free, :length => max_length, :omission => ellipsis)
  end

end

In certain circumstances, though, it is desirable to retain the HTML formatting. So we need an HTML-aware truncator, one that would only use the actual text to determine content length, make sure that all the HTML tags are properly closed and that HTML entities, such as &amp;, are left intact.

For example, if we wanted to truncate the following markup

  This text is <strong>bold</strong> and <i>beautiful</i>.

to 15 characters, then we expect this as a result:

  This text is <strong>bo</strong>

Googling for a solution, I found a blog post Rails truncate helper that handles HTML tags and entities by Henrik Nyh with a great looking code snippet, that uses Hpricot to accomplish just that. However I had some problems getting this code to work in some test cases, specifically in scenarios involving nested ul and li tags. In the end I could not discern any specific pattern for the failures, so I decided to try and port the same code to Nokogiri and lo and behold everything worked as it should.

There are actually some advantages to traversing the text fragment with Nokogiri: first, it is purportedly faster than Hpricot and secondly, it natively takes care of HTML entities, while Hpricot does not recognize them and thus requires the use of regular expressions. An added bonus with both HTML parsers is that they automatically fix malformed HTML.

So here's the final Nokogiri version:

require "rubygems"
require "nokogiri"

module TextHelper

  def truncate_html(text, max_length, ellipsis = "...")
    ellipsis_length = ellipsis.length     
    doc = Nokogiri::HTML::DocumentFragment.parse text
    content_length = doc.inner_text.length
    actual_length = max_length - ellipsis_length
    content_length > actual_length ? doc.truncate(actual_length).inner_html + ellipsis : text.to_s
  end

end

module NokogiriTruncator
  module NodeWithChildren
    def truncate(max_length)
      return self if inner_text.length <= max_length
      truncated_node = self.dup
      truncated_node.children.remove

      self.children.each do |node|
        remaining_length = max_length - truncated_node.inner_text.length
        break if remaining_length <= 0
        truncated_node.add_child node.truncate(remaining_length)
      end
      truncated_node
    end
  end

  module TextNode
    def truncate(max_length)
      Nokogiri::XML::Text.new(content[0..(max_length - 1)], parent)
    end
  end

end

Nokogiri::HTML::DocumentFragment.send(:include, NokogiriTruncator::NodeWithChildren)
Nokogiri::XML::Element.send(:include, NokogiriTruncator::NodeWithChildren)
Nokogiri::XML::Text.send(:include, NokogiriTruncator::TextNode)

You can add this code as a helper file in RAILS_ROOT/app/helpers/text_helper.rb