Tag Archive for groovy

Groovy ecosystem

Among the Devoxx 2008 University slides, you can find a Groovy/Grails presentation. It contains some interesting references:

  • Griffon is a Grails-like application framework for developing desktop applications in Groovy. You can begin with the quick start guide.
  • Gradle is a build system like Ant, Maven or Ivy that tries to take the best from each. It supports multi-project builds and dependency management, and you can still use your old Ant tasks.
  • Easyb is a Behavior Driven Development (BDD) framework. It uses a specification-based Domain Specific Language (DSL). The main idea is to stay close to the business needs throughout the development process. With this tool you get readable documentation AND unit tests, all in one. You may want to start by reading this tutorial.
  • Compass is an open source project built on top of Lucene that simplifies the integration of search capabilities into your Java applications.

And of course you can still refer to the Groovy and Grails websites.

Looking for a java html parser (or groovy)

I’m currently looking for a Java library (or a Groovy one) to parse HTML. As you know, most of the time web pages are not as clean or valid as they should be, so the ideal tool should be somewhat tolerant of poor HTML code. After some initial homework, here is a list of potentially useful libraries:

  • Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
  • MozillaParser is a Java HTML parser based on Mozilla’s HTML parser. It acts as a bridge from Java classes to Mozilla’s classes and outputs a Java Document object from raw (and dirty) HTML input.
  • HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans.
  • NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
  • JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.
  • HtmlCleaner is an open-source HTML parser written in Java. From a (dirty) HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. The process is similar to the creation of the Document Object Model (DOM) in browsers, but you can provide custom tag and rule sets for tag filtering and balancing.
  • TagSoup is a SAX-compliant parser written in Java, able to parse wild or nasty HTML as found on the web. A C++ port of the library is also available.

The last 3 items are more like cleaning tools, intended to output well-formed, balanced HTML code. Still, since JTidy has to build a DOM object of the HTML, you may use this to elegantly access data from your raw/dirty HTML source.
So for the time being, I will probably focus on Jericho HTML Parser and HTML Parser, which seem to be the best candidates for the job. Moreover, they offer documentation and samples to get started quickly. If you have already tried any of these, I would be happy to hear your recommendations.
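
To illustrate the "clean first, then query" idea, here is a minimal sketch using only the JDK's standard XML APIs. It assumes the markup has already been turned into well-formed XML by one of the cleaning tools (that step is not shown), that the output carries no DOCTYPE or XML namespace, and the class and method names (CleanedHtmlQuery, extractLinks) are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: queries markup that a cleaning tool has already made well-formed.
public class CleanedHtmlQuery {

    /** Returns the href value of every <a> element, in document order. */
    public static List<String> extractLinks(String wellFormedHtml) throws Exception {
        // Standard JAXP DOM parser; it rejects anything that is not well-formed XML,
        // which is why the cleaning step has to happen first.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wellFormedHtml)));

        // Select all href attributes of <a> elements (assumes no XML namespace is in use).
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> links = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            links.add(hrefs.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the output of a cleaning tool.
        String cleaned = "<html><body><p>See <a href=\"a.html\">A</a> and "
                + "<a href=\"b.html\">B</a>.</p></body></html>";
        System.out.println(extractLinks(cleaned)); // prints [a.html, b.html]
    }
}
```

If the cleaner emits XHTML with a namespace declaration, the XPath expression would need a namespace-aware setup, so treat this only as a starting point.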

On the other hand, if you need a generic parser, I would recommend JavaCC or JParsec. And if you are still not satisfied, you can look through the java-source.net directory.

Now, let’s also talk a bit about Groovy. I have only found 2 basic articles about parsing HTML the Groovy way:

  • http://groovy.codehaus.org/Testing+Web+Applications
  • http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/

… which seem quite limited, so I should probably import some of these POJAs (Plain Old Java APIs). Or maybe you can point me to other references?