Is there a library for extracting data from an HTML page? [closed]

Submitted by 自闭症网瘾萝莉.ら on 2019-12-12 02:53:34

Question


I would like to extract information from a web page. Unfortunately, the website (4chan) doesn't have a public API, as far as I know.

What is a good library to extract specific data from an HTML document? I prefer a free software library that works on UNIX systems.


Edit: basically, I want to get posts and images from 4chan. The page isn't valid HTML (and has no doctype), so the parser shouldn't be too strict.


Answer 1:


What you are looking for is an HTML DOM parser.

This previous question should help you out. Also check out this question.
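As a rough illustration of what a lenient parser can do with markup like this, here is a minimal sketch in Python using the standard library's `html.parser`, which does not require a doctype or well-formed input. The class name `PostImageExtractor`, the helper `extract`, and the sample markup are all made up for this example; the real 4chan page structure is not known here, so the sketch only collects every `<img>` URL and every piece of visible text.

```python
from html.parser import HTMLParser


class PostImageExtractor(HTMLParser):
    """Collect image URLs and visible text from loosely formed HTML.

    Hypothetical helper for illustration; it knows nothing about the
    actual layout of any particular site.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.image_urls = []
        self.text_chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.text_chunks.append(text)


def extract(html):
    """Return (image_urls, text_chunks) found in the given HTML string."""
    parser = PostImageExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.image_urls, parser.text_chunks


# Made-up sample: no doctype, unquoted attributes, unclosed <div> and <p> tags.
SAMPLE = (
    '<html><body><div class=post><p>First post<img src="/img/a.jpg">'
    '<div class=post><p>Second post<br><img src="/img/b.png"></body>'
)

if __name__ == "__main__":
    images, texts = extract(SAMPLE)
    print(images)
    print(texts)
```

This is an event-based parser rather than a full DOM, so it never fails on unbalanced tags; if you need a real tree to navigate, a dedicated DOM library would be the next step.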




Answer 2:


That's right, there are lots of libraries for parsing HTML. For example, if you use Perl, you can use HTML::Parse.

If you just want a quick result and don't mind calling a system command, you can use:

lynx -dump http://4chan.org

or

links -dump http://4chan.org
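If you want to call one of these commands from a program rather than a terminal, something along these lines might work. This is only a sketch in Python: the function names `build_dump_command` and `dump_page` are invented for the example, and it assumes `lynx` or `links` is installed and on your `PATH`.

```python
import subprocess


def build_dump_command(url, browser="lynx"):
    """Build the argument list for `<browser> -dump <url>`.

    Only the two text browsers mentioned above are accepted.
    """
    if browser not in ("lynx", "links"):
        raise ValueError("unsupported browser: " + browser)
    return [browser, "-dump", url]


def dump_page(url, browser="lynx", timeout=30):
    """Run the text browser and return the rendered page as plain text.

    Raises FileNotFoundError if the browser is not installed and
    subprocess.CalledProcessError if it exits with an error.
    """
    result = subprocess.run(
        build_dump_command(url, browser),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling `dump_page("http://4chan.org")` would then return the same text that the `lynx -dump` command prints. Passing the arguments as a list avoids going through a shell, so the URL is never interpreted as shell syntax.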


Source: https://stackoverflow.com/questions/8972013/is-there-a-library-for-extracting-data-from-an-html-page
