<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Looking for a java html parser (or groovy)</title>
	<atom:link href="http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/</link>
	<description>Wavyx blog - Eric Rodriguez website</description>
	<lastBuildDate>Mon, 23 Jan 2012 21:18:50 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Tom Copeland</title>
		<link>http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/comment-page-1/#comment-145</link>
		<dc:creator>Tom Copeland</dc:creator>
		<pubDate>Wed, 14 Jan 2009 20:51:09 +0000</pubDate>
		<guid isPermaLink="false">http://wavyx.net/?p=104#comment-145</guid>
		<description>Sure, yup, I used this web page (i.e., http://wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/), and the line causing the problem seems to be this one:

&#8230; which seems quite limited, so I should probably import some of this POJA (Plain Old Java API).  Or maybe you can point me other references?

Oddly, the tokenizing exception seems to be happening on the first space after the period.  Hm.  Weird.  Must be some hidden character there or something.  

If you want, email me at tom@infoether.com and I&#039;ll email you the source to Brian&#039;s HtmlParser; it comes with a build.xml so all you need to do is run Ant and it&#039;s all set to go.  It also comes with a sample HtmlScrubber class that normalizes element name cases, strips optional quotes, and so forth.</description>
		<content:encoded><![CDATA[<p>Sure, yup, I used this web page (i.e., <a href="http://wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/" rel="nofollow">http://wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/</a>), and the line causing the problem seems to be this one:</p>
<p>&#8230; which seems quite limited, so I should probably import some of this POJA (Plain Old Java API).  Or maybe you can point me other references?</p>
<p>Oddly, the tokenizing exception seems to be happening on the first space after the period.  Hm.  Weird.  Must be some hidden character there or something.  </p>
<p>If you want, email me at <a href="mailto:tom@infoether.com">tom@infoether.com</a> and I&#8217;ll email you the source to Brian&#8217;s HtmlParser; it comes with a build.xml so all you need to do is run Ant and it&#8217;s all set to go.  It also comes with a sample HtmlScrubber class that normalizes element name cases, strips optional quotes, and so forth.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Rodriguez</title>
		<link>http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/comment-page-1/#comment-143</link>
		<dc:creator>Eric Rodriguez</dc:creator>
		<pubDate>Wed, 14 Jan 2009 15:06:30 +0000</pubDate>
		<guid isPermaLink="false">http://wavyx.net/?p=104#comment-143</guid>
		<description>I&#039;m having trouble downloading the jar version of your parser and accessing the Quiotix website.
Could you tell me which full line of the HTML source is causing the problem?</description>
		<content:encoded><![CDATA[<p>I&#8217;m having trouble downloading the jar version of your parser and accessing the Quiotix website.<br />
Could you tell me which full line of the HTML source is causing the problem?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Eric Rodriguez</title>
		<link>http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/comment-page-1/#comment-141</link>
		<dc:creator>Eric Rodriguez</dc:creator>
		<pubDate>Wed, 14 Jan 2009 14:25:33 +0000</pubDate>
		<guid isPermaLink="false">http://wavyx.net/?p=104#comment-141</guid>
		<description>At first guess, I would suspect a problem with UTF-8 encoding.
But I should definitely try it myself. Which page did you use, or could you send me the HTML file to test?</description>
		<content:encoded><![CDATA[<p>At first guess, I would suspect a problem with UTF-8 encoding.<br />
But I should definitely try it myself. Which page did you use, or could you send me the HTML file to test?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom Copeland</title>
		<link>http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/comment-page-1/#comment-140</link>
		<dc:creator>Tom Copeland</dc:creator>
		<pubDate>Wed, 14 Jan 2009 14:11:32 +0000</pubDate>
		<guid isPermaLink="false">http://wavyx.net/?p=104#comment-140</guid>
		<description>Since it uses JavaCC, HtmlParser is very fast and flexible, but it fails to parse some HTML - for example, when presented with the HTML of this blog post, it raised an exception:


$ java com.quiotix.html.parser.HtmlParser &lt; blog.html 
Exception in thread &quot;main&quot; com.quiotix.html.parser.TokenMgrError: Lexical error at line 101, column 107.  Encountered: &quot;\u2020&quot; (8224), after : &quot;&quot;
	at com.quiotix.html.parser.HtmlParserTokenManager.getNextToken(HtmlParserTokenManager.java:2046)
	at com.quiotix.html.parser.HtmlParser.jj_ntk(HtmlParser.java:571)
	at com.quiotix.html.parser.HtmlParser.ElementSequence(HtmlParser.java:46)
	at com.quiotix.html.parser.HtmlParser.HtmlDocument(HtmlParser.java:35)
	at com.quiotix.html.parser.HtmlParser.main(HtmlParser.java:27)


Ah well.</description>
		<content:encoded><![CDATA[<p>Since it uses JavaCC, HtmlParser is very fast and flexible, but it fails to parse some HTML &#8211; for example, when presented with the HTML of this blog post, it raised an exception:</p>
<p>$ java com.quiotix.html.parser.HtmlParser &lt; blog.html<br />
Exception in thread &#8220;main&#8221; com.quiotix.html.parser.TokenMgrError: Lexical error at line 101, column 107.  Encountered: &#8220;\u2020&#8221; (8224), after : &#8220;&#8221;<br />
	at com.quiotix.html.parser.HtmlParserTokenManager.getNextToken(HtmlParserTokenManager.java:2046)<br />
	at com.quiotix.html.parser.HtmlParser.jj_ntk(HtmlParser.java:571)<br />
	at com.quiotix.html.parser.HtmlParser.ElementSequence(HtmlParser.java:46)<br />
	at com.quiotix.html.parser.HtmlParser.HtmlDocument(HtmlParser.java:35)<br />
	at com.quiotix.html.parser.HtmlParser.main(HtmlParser.java:27)</p>
<p>Ah well.</p>
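<p>A minimal sketch, not taken from HtmlParser itself, of one way to track down the character behind a lexical error like the one above: decode the file as UTF-8 and list every non-ASCII character with its position. The class name <code>NonAsciiFinder</code> and the method <code>findNonAscii</code> are hypothetical names chosen for illustration. Positions are 1-based and count each Java <code>char</code> as one column, so they may not line up exactly with the line and column a JavaCC token manager reports (for instance, if tabs are counted differently).</p>

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of HtmlParser: lists non-ASCII characters in a text.
public class NonAsciiFinder {

    /** Returns one "line L, column C: U+XXXX" entry per char above 127 (1-based positions). */
    public static List<String> findNonAscii(String text) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int column = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                // A newline starts the next line and resets the column counter.
                line++;
                column = 0;
                continue;
            }
            column++;
            if (c > 127) {
                hits.add(String.format(Locale.ROOT, "line %d, column %d: U+%04X", line, column, (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the file is UTF-8 encoded; a different encoding would yield different characters.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, StandardCharsets.UTF_8);
        for (String hit : findNonAscii(text)) {
            System.out.println(hit);
        }
    }
}
```

<p>Assuming the page was saved as blog.html, running <code>java NonAsciiFinder blog.html</code> prints one line per non-ASCII character, which should make it easier to see whether a character such as U+2020 sits near the reported position.</p>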
]]></content:encoded>
	</item>
</channel>
</rss>

