How to extract plain text from HTML with Nokogiri
While working on an upcoming blog post, I ran into the problem of extracting some plain text from HTML. If I had been interested in the whole plain text, I could simply have run html2text in bash and fed it the HTML, but what I needed was only a specific part of the plain text, between two particular comments. As it was hard to google a simple solution for this, I decided to share mine.
Imagine the following HTML:
    <h2>Here goes some heading we are not interested in.</h2>
    <!-- someComment -->
    Here it goes. We are interested in this text.
    But <strong>some</strong> words are wrapped with HTML-Tags we are <em>not</em> interested in.
    <a href="bar">Or links,..</a>
    <table>
      <tbody>
        <tr>
          <td>Or a Table,...</td>
        </tr>
      </tbody>
    </table>
    <!-- /someComment -->
    But we do NOT care about this.
    <!-- foo -->
    Even if it is wrapped in another comment.
    <!-- /foo -->
The text of interest is the whole plain text between the comments containing the texts someComment and /someComment. We will use the awesome Ruby XML library Nokogiri and its SAX parser to solve this problem. All we have to do is create a subclass of Nokogiri::XML::SAX::Document and override the relevant event callbacks with the desired behavior.
Let’s start with recognizing comments. Put the following code in a file called plain_text_extractor.rb:
    require "rubygems"
    require "nokogiri"

    class PlainTextExtractor < Nokogiri::XML::SAX::Document
      # This method is called whenever a comment occurs and
      # the comment's text is passed in as a string.
      def comment(string)
        puts string
      end
    end

    parser = Nokogiri::HTML::SAX::Parser.new(PlainTextExtractor.new)
    # ARGV is an array; the file name is its first element.
    parser.parse_file ARGV[0]
If you put the HTML source in a file called sample_page.html, you can now invoke the parser with the command ruby plain_text_extractor.rb sample_page.html. You should see output like:
    someComment
    /someComment
    foo
    /foo
Nice, it matches all comments. We now have to manage some sort of state to keep track of whether we are in the area of interest or not.
Note: All following methods are defined within the class PlainTextExtractor from the first code snippet; the rest of the class is left out for better readability.
    # Initialize the state of interest variable with false
    def initialize
      @interesting = false
    end

    # This method is called whenever a comment occurs and
    # the comment's text is passed in as a string.
    def comment(string)
      case string.strip        # strip leading and trailing whitespace
      when /^someComment/      # match starting comment
        @interesting = true
      when /^\/someComment/    # match closing comment
        @interesting = false
      end
    end
Invoking the script again leads to,…
… and therefore shows that our state of interest is handled correctly.
The final step is to collect the desired plain text. We are in luck: there is another callback method for this:
    attr_reader :plaintext

    # Initialize the state of interest variable with false
    # and start with an empty result string.
    def initialize
      @interesting = false
      @plaintext = ""
    end

    # This callback method is called with each chunk of text
    # found between tags.
    def characters(string)
      @plaintext << string if @interesting
    end
To see the result you have to modify the code following the class definition as well:
    pte = PlainTextExtractor.new
    parser = Nokogiri::HTML::SAX::Parser.new(pte)
    # ARGV is an array; the file name is its first element.
    parser.parse_file ARGV[0]
    puts pte.plaintext
Finally, here is what we should see when invoking the script again:
Here it goes. We are interested in this text. But some words are wrapped with HTML-Tags we are not interested in. Or links,.. Or a Table,...
This is exactly the plain text we wanted to extract. Yay!
It should be fairly easy to adapt this to whatever you would like to extract, or even to get the whole plain text.
The complete source code is available in this gist.