How to extract plain text from HTML with Nokogiri

While working on an upcoming blog post, I ran into the problem of extracting some plain text from HTML. If I had been interested in the whole plain text, I could have just run html2text in bash and fed it the HTML, but what I needed was only a specific part of the plain text, between two particular comments. As it was hard to find a simple solution for this online, I decided to share mine.

Imagine the following HTML:

<h2>Here goes some heading we are not interested in.</h2>
<!-- someComment -->
        Here it goes. We are interested in this text.
        But <strong>some</strong> words are wrapped with HTML-Tags we are <em>not</em>
        interested in.
        <a href="bar">Or links,..</a>
<table>
<tbody>
<tr>
<td>Or a Table,...</td>
</tr>
</tbody>
</table>
<!-- /someComment -->
 
      But we do NOT care about this.
 
      <!-- foo -->
        Even if it is wrapped in another comment.
      <!-- /foo -->

The text of interest is the whole plain text between the comments containing the texts someComment and /someComment. We will use the awesome Ruby XML library Nokogiri and its SAX parser to solve this problem. All we have to do is create a subclass of Nokogiri::XML::SAX::Document and override certain event callbacks with the desired behavior.

Let’s start with recognizing comments. Put the following code in a file called plain_text_extractor.rb:

require "rubygems"
require "nokogiri"
 
class PlainTextExtractor < Nokogiri::XML::SAX::Document
  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    puts string
  end
end
 
parser = Nokogiri::HTML::SAX::Parser.new(PlainTextExtractor.new)
parser.parse_file ARGV[0]

If you put the HTML source in a file called sample_page.html you can now invoke the parser with the command ruby plain_text_extractor.rb sample_page.html. You should see some output like:

 someComment
 /someComment
 foo
 /foo

Nice, it matches all comments. We now have to manage some sort of state to keep track of whether we are in the area of interest or not.

Note: All following methods are defined within the class PlainTextExtractor from the first code snippet; the rest of the class is left out for better readability.

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
  end
 
  # This method is called whenever a comment occurs and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip          # strip leading and trailing whitespace
    when /^someComment/        # match starting comment
      @interesting = true
      puts "Interesting!"      # debug output, remove once it works
    when /^\/someComment/      # match closing comment
      @interesting = false
      puts "Boring!"           # debug output, remove once it works
    end
  end

Invoking the script again leads to…

Interesting!
Boring!

… and therefore shows that our state of interest is handled correctly.

The final step is to collect the desired plain text. Luckily, there is another callback method for this:

  attr_reader :plaintext
 
  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end
 
  # This callback method is called with any text that
  # appears between tags.
  def characters(string)
    @plaintext << string if @interesting
  end

To see the result you have to modify the code following the class definition as well:

pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
puts pte.plaintext

Finally, here is what we should see when invoking the script again:

        Here it goes. We are interested in this text.
        But some words are wrapped with HTML-Tags we are not
        interested in.
        Or links,..

            Or a Table,...

This is exactly the plain text we wanted to extract. Yay!
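The extracted text still carries the indentation and blank lines of the HTML source. If that gets in the way, a small post-processing step can tidy it up. The helper name tidy_plain_text is made up for this sketch, and it assumes that the line breaks of the source are worth keeping while indentation and empty lines are not.

```ruby
# Hypothetical helper: removes indentation and trailing whitespace
# from every line and drops lines that end up empty.
def tidy_plain_text(text)
  text.lines.map(&:strip).reject(&:empty?).join("\n")
end

raw = "\n        Here it goes.\n        Or links,..\n\n\n            Or a Table,...\n"
puts tidy_plain_text(raw)
```

With the extractor above, the last line of the script could then become puts tidy_plain_text(pte.plaintext).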

It should be fairly easy to adapt this to whatever you would like to extract, or even to get the whole plain text.

The complete source code is available in this gist.