Looking for a Java HTML parser (or Groovy)
I’m currently looking for a Java library (or a Groovy one) to parse HTML. As you know, most web pages are not as clean or valid as they should be, so the ideal tool should be tolerant of poor HTML. After some initial homework, here is a list of potentially useful libraries:
- Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
- MozillaParser is a Java HTML parser based on Mozilla’s HTML parser. It acts as a bridge from Java classes to Mozilla’s classes and outputs a Java Document object from raw (and dirty) HTML input.
- HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans.
- NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
- JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.
- HtmlCleaner is an open-source HTML parser written in Java. From a (dirty) HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. The process is similar to the way browsers build the Document Object Model (DOM), but you can provide a custom tag and rule set for tag filtering and balancing.
- TagSoup is a SAX-compliant parser written in Java able to parse wild or nasty HTML as found on the web. A C++ port of the library is also available.
The last three items are more like cleaning tools, intended to output well-formed, balanced HTML. Still, since JTidy has to build a DOM of the document, you can use that DOM to access data from your raw or dirty HTML source.
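To make the "access data through the DOM" idea concrete, here is a minimal sketch using only the standard JDK XML APIs. It assumes the cleaning step has already happened: the well-formed markup string and the JDK's own DocumentBuilder stand in for whatever a cleaner such as JTidy would hand back, and the class and method names (DomLinkExtractor, extractLinks, parseWellFormed) are made up for illustration. It also assumes the DOM carries no XML namespace; with namespaced XHTML the XPath expression would need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomLinkExtractor {

    // Collects the href attribute of every <a> element in a DOM tree.
    static List<String> extractLinks(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            links.add(hrefs.item(i).getNodeValue());
        }
        return links;
    }

    // Assumption: stand-in for a cleaner's output. This only accepts
    // well-formed markup; it does NOT repair dirty HTML.
    static Document parseWellFormed(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page, already balanced and well-formed.
        String cleaned = "<html><body><p>"
                + "<a href=\"http://example.org/a\">A</a> and "
                + "<a href=\"http://example.org/b\">B</a>"
                + "</p></body></html>";
        System.out.println(extractLinks(parseWellFormed(cleaned)));
    }
}
```

Any library that returns an org.w3c.dom.Document should be able to feed extractLinks in the same way, which is the main attraction of the DOM-based tools in the list.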
So for the time being, I will probably focus on Jericho HTML Parser and HTML Parser, which seem to be the best candidates for the job. Moreover, they offer documentation and samples to get started quickly. If you have already tried any of these, I would be happy to hear your recommendations.
On the other hand, if you need a generic parser, I would recommend JavaCC or JParsec. And if you are still not satisfied, you can browse the java-source.net directory.
Now, let’s also talk a bit about Groovy. I only found two basic articles about parsing HTML the Groovy way:
- http://groovy.codehaus.org/Testing+Web+Applications
- http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/
… which seem quite limited, so I should probably import some of these POJAs (Plain Old Java APIs). Or maybe you can point me to other references?
Tom Copeland said,
Wrote on January 14, 2009 @ 15:11
Since it uses JavaCC, HtmlParser is very fast and flexible, but it fails to parse some HTML. For example, when presented with the HTML of this blog post, it raised an exception:
$ java com.quiotix.html.parser.HtmlParser < blog.html
Exception in thread "main" com.quiotix.html.parser.TokenMgrError: Lexical error at line 101, column 107. Encountered: "\u2020" (8224), after : ""
at com.quiotix.html.parser.HtmlParserTokenManager.getNextToken(HtmlParserTokenManager.java:2046)
at com.quiotix.html.parser.HtmlParser.jj_ntk(HtmlParser.java:571)
at com.quiotix.html.parser.HtmlParser.ElementSequence(HtmlParser.java:46)
at com.quiotix.html.parser.HtmlParser.HtmlDocument(HtmlParser.java:35)
at com.quiotix.html.parser.HtmlParser.main(HtmlParser.java:27)
Ah well.
Eric Rodriguez said,
Wrote on January 14, 2009 @ 15:25
At first guess, I would suspect a problem with UTF-8 encoding.
But I should definitely try it myself. Which page did you use, or could you send me the HTML file to test?
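For readers wondering what such an encoding problem looks like, here is a small sketch in plain Java. It is only an illustration of the general failure mode, not a diagnosis of this particular exception: I don't know how HtmlParser decodes its input, and ISO-8859-1 as the "wrong" charset, the sample string, and the class name WrongCharsetDemo are all assumptions.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {

    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which mimics a reader that assumes the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a line starting with a horizontal ellipsis (U+2026).
        String original = "\u2026 which";
        String garbled = misdecode(original);
        // Print each resulting character as a code point to make the damage visible.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The single ellipsis character is three bytes in UTF-8, so the wrongly decoded string starts with three unrelated characters (U+00E2, U+0080, U+00A6), one of them a control character that a strict lexer may well reject.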
Eric Rodriguez said,
Wrote on January 14, 2009 @ 16:06
I’m having difficulty downloading the jar version of your parser or accessing the Quiotix website.
Could you tell me which full line of the HTML source is causing the problem?
Tom Copeland said,
Wrote on January 14, 2009 @ 21:51
Sure, yup, I used this web page (i.e., http://wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/), and the line causing the problem seems to be this one:
… which seems quite limited, so I should probably import some of this POJA (Plain Old Java API). Or maybe you can point me other references?
Oddly, the tokenizing exception seems to be happening on the first space after the period. Hm. Weird. Must be some hidden character there or something.
If you want, email me at tom@infoether.com and I’ll email you the source to Brian’s HtmlParser; it comes with a build.xml so all you need to do is run Ant and it’s all set to go. It also comes with a sample HtmlScrubber class that normalizes element name cases, strips optional quotes, and so forth.
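One way to hunt for the hidden character suspected above is to list every non-ASCII code point in the saved page along with its position. The following is a rough sketch with a made-up class name (NonAsciiFinder); it assumes the file is UTF-8 and counts columns in code points starting at 1, so its positions may differ from the lexer's numbering, for instance in how tabs are counted.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NonAsciiFinder {

    // Returns one "line:column U+XXXX" entry per non-ASCII code point.
    static List<String> findNonAscii(String text) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int column = 1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp == '\n') {
                line++;
                column = 1;
            } else {
                if (cp > 0x7F) {
                    hits.add(String.format("%d:%d U+%04X", line, column, cp));
                }
                column++;
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java NonAsciiFinder <file>");
            return;
        }
        // Assumption: the saved page is encoded as UTF-8.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        findNonAscii(text).forEach(System.out::println);
    }
}
```

Running it as `java NonAsciiFinder blog.html` and looking at the entries near line 101 should show whether something other than a plain space sits after the period.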