I would like to extract information from a web page. Unfortunately, the website (4chan) doesn't have a public API, as far as I know.
What is a good library to extract specific data from an HTML document? I prefer a free software library that works on UNIX systems.
Edit: basically I want to get posts and images from 4chan. The webpage isn't valid HTML (and doesn't have a doctype) so the parser shouldn't be too strict.
What you are looking for is an HTML DOM parser.
This previous question should help you out. Also check out this question.
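To give a rough idea of what a lenient parser looks like, here is a minimal sketch using Python's standard-library html.parser, which does not require a doctype and tolerates unclosed tags. The class name ImageAndTextExtractor and the sample markup are made up for illustration and do not reflect 4chan's actual page structure.

```python
from html.parser import HTMLParser


class ImageAndTextExtractor(HTMLParser):
    """Collect <img> sources and visible text from possibly malformed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.images = []       # values of every src attribute seen on <img>
        self.text_chunks = []  # non-empty text runs outside <script>/<style>
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.text_chunks.append(text)


# Hypothetical tag soup: no doctype, unclosed <p> and <b>, unquoted attribute.
sample = '<html><body><p>First post<img src="a.jpg"><p><b>Second post<img src=b.png></body>'

extractor = ImageAndTextExtractor()
extractor.feed(sample)
extractor.close()
print(extractor.images)
print(extractor.text_chunks)
```

For a real page you would fetch the HTML first (for example with urllib.request) and decide which tags or attributes mark a post. That part depends on the site's markup and is left out here.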
That's right, there are lots of libraries for parsing HTML. For example, if you use Perl, you can use HTML::Parse.
If you just want a quick result and are willing to call a system command, you can use:
lynx -dump http://4chan.org
links -dump http://4chan.org
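If you want to call one of these commands from a program, it could look like the sketch below. It is written in Python and assumes lynx or links is installed and on the PATH. The function name dump_page_text and the 30-second timeout are arbitrary choices for this example.

```python
import shutil
import subprocess


def dump_page_text(url, browser="lynx"):
    """Return the rendered text of `url`, or None if `browser` is not installed.

    Runs `<browser> -dump <url>`, the same invocation as the commands above.
    """
    if shutil.which(browser) is None:
        return None  # the text browser is not available on this system
    result = subprocess.run(
        [browser, "-dump", url],
        capture_output=True,
        text=True,
        timeout=30,   # arbitrary limit so a stalled download does not hang
        check=True,   # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling dump_page_text("http://4chan.org") returns the page as plain text, or None when lynx is missing. Pass browser="links" to use links instead. Note that a text dump loses the HTML structure, so it is only useful if you need the text and not the image URLs.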