How to extract data using JTidy and XPath

Submitted by 不羁岁月 on 2019-12-25 01:08:02

Question


I have to extract the company name and face value from http://money.rediff.com/companies/20-microns-ltd/15110088

I noticed that this task can be accomplished with the XPath API. Since this is an HTML page, I am using the JTidy parser.

This is the XPath for the face value that I have to extract:

/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]

This is my code:

import java.io.InputStream;
import java.net.URL;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
// Open the page once and read it as a stream.
InputStream is = oracle.openStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
// JTidy repairs the HTML and builds a W3C DOM from it.
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
// NODESET yields a NodeList; printing it shows the list object, not the text of the matched nodes.
Object result = xPathExpression.evaluate(tidyDOM, XPathConstants.NODESET);
System.out.println(result.toString());

Please guide me further; I cannot find the right solution for the above.


Answer 1:


Try not to use "full" (absolute) XPaths.

//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]

is better than

/html/body/.../.../.../.../.../...

Most HTML pages are not valid or even well-formed, so the DOM structure may change when they are processed by real-world HTML parsers. For example, a <tbody> may be inserted under <table> if there isn't one. Worse, different HTML parsers generate different DOM trees, so an XPath may be valid for one parser but not for another. I would rather use "wildcards" like table//tr[4] instead of table/tbody/tr[4] or table/tr[4], so that I can forget about <tbody>. Such expressions are more robust against messy real-world HTML pages.
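
To illustrate the <tbody> point, here is a minimal sketch. It makes a few assumptions: the class name TbodyXPathDemo, the helper textAt, and the two sample tables are hypothetical and are not taken from the rediff.com page, and the JDK's built-in XML parser stands in for JTidy so that the example runs without extra libraries. It evaluates the same expressions against a table with and without <tbody>.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TbodyXPathDemo {
    // Hypothetical sample markup (made-up values), not the real page.
    static final String WITHOUT_TBODY =
        "<html><body><table>"
        + "<tr><td>Company</td><td>Example Co</td></tr>"
        + "<tr><td>Face Value</td><td>5.00</td></tr>"
        + "</table></body></html>";

    // Same table, but with the <tbody> that some parsers insert.
    static final String WITH_TBODY =
        "<html><body><table><tbody>"
        + "<tr><td>Company</td><td>Example Co</td></tr>"
        + "<tr><td>Face Value</td><td>5.00</td></tr>"
        + "</tbody></table></body></html>";

    // Parses well-formed markup and returns the text of the first node
    // matched by the expression, or an empty string if nothing matches.
    static String textAt(String markup, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // XPathConstants.STRING asks for the string value directly.
            String text = (String) xPath.evaluate(expression, dom, XPathConstants.STRING);
            return text.trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String robust = "//table//tr[2]/td[2]";
        String brittle = "//table/tr[2]/td[2]";
        // The descendant step matches whether or not <tbody> is present.
        System.out.println("[" + textAt(WITHOUT_TBODY, robust) + "]");
        System.out.println("[" + textAt(WITH_TBODY, robust) + "]");
        // The child step finds nothing once <tbody> sits between table and tr.
        System.out.println("[" + textAt(WITH_TBODY, brittle) + "]");
    }
}
```

With JTidy, the Document returned by tidy.parseDOM(is, null) would presumably take the place of the parsed string here; the XPath calls stay the same.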

You can use Firepath, a plugin for Firebug (which is itself a Firefox plugin), to debug XPath expressions.

P.S. You can try my JHQL (http://github.com/wks/jhql) project for exactly this task. You will like it if you have more pages to extract data from.



Source: https://stackoverflow.com/questions/7049150/how-to-extract-data-using-jtidy-and-xpath
