Extracting Information from websites

Question

Not every website exposes its data well through XML feeds, APIs, and so on.

How could I go about extracting information from a website? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a Java background and have worked with Apache XMLBeans. Is there anything similar for parsing HTML when I know the structure and the data sits inside a known tag?

Thanks

Answer 1:

There are several open-source HTML parsers out there for Java.

I have used JTidy in the past and have had good luck with it. It gives you a DOM of the HTML page, and you should be able to grab the tags you need from there.
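
To make the "grab the tags from the DOM" step concrete, here is a minimal sketch. It does not use JTidy itself: it assumes the markup is already well-formed (as the snippet in the question happens to be) and feeds it to the JDK's built-in XML parser. With JTidy, the org.w3c.dom.Document would come from its own parser instead (Tidy.parseDOM, if I remember the API correctly), and the lookup step should stay the same. The class and method names (DomExtract, parseWellFormed, textById) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class DomExtract {

    // Assumption: the input is well-formed markup. Real-world HTML usually is not,
    // which is why a tolerant parser such as JTidy is needed in practice.
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Returns the text of the first element whose id attribute equals the given id,
    // or an empty string if nothing matches.
    // Assumption: id contains no single quotes, since it is spliced into the XPath.
    static String textById(Document doc, String id) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//*[@id='" + id + "']", doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed", e);
        }
    }

    public static void main(String[] args) {
        String html =
                "<div><div><span id=\"important-data\">information here</span></div></div>";
        Document doc = parseWellFormed(html);
        System.out.println(textById(doc, "important-data")); // prints "information here"
    }
}
```

Once you have a Document, XPath keeps the lookup to a single expression, so you do not have to walk the nested div elements by hand.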

Answer 2:

Here's an article that covers a couple of screen-scraping tools written in Java.

In general, it sounds like you want to take a look at regular expressions, which do the pattern matching you're looking for.
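
As a rough illustration of the regular-expression route, here is a small sketch using java.util.regex against the snippet from the question. The pattern is an assumption tailored to that exact markup (a span with id="important-data" and no other attributes). It will break if attributes are reordered or spans are nested, so treat it as a quick hack rather than a general HTML parser. The class name RegexExtract is made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtract {

    // Matches <span id="important-data">...</span> and captures the text in between.
    // DOTALL lets the captured text span multiple lines; .*? keeps the match non-greedy.
    private static final Pattern SPAN = Pattern.compile(
            "<span\\s+id=\"important-data\"\\s*>(.*?)</span>", Pattern.DOTALL);

    // Returns the trimmed text inside the span, or an empty Optional if it is not found.
    static Optional<String> extract(String html) {
        Matcher m = SPAN.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div>\n  <div>\n"
                + "    <span id=\"important-data\">information here</span>\n"
                + "  </div>\n</div>";
        System.out.println(extract(html).orElse("(not found)")); // prints "information here"
    }
}
```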

Hope that helps!

Answer 3:

Java seems like a fairly difficult constraint for such a task. Is that a hard requirement? Scripting languages are ideal for building what is really lots of last-mile code.

If you're open to it, Ruby + Hpricot makes that completely trivial. You can use CSS or XPath selectors (or both) to find (and manipulate) the content in HTML. Grabbing the document, parsing it, and extracting the text in your example is literally one line of code.

Source: https://stackoverflow.com/questions/318564/extracting-information-from-websites

Tags

java

html

html-content-extraction
