How can I extract only the main textual content from an HTML page?

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with s

9 Answers

予麋鹿

2021-01-31 05:19
There appears to be a possible problem with Boilerpipe. Why? Well, it appears to be suited only to certain kinds of web pages, namely those that have a single body of content.

So one can crudely classify web pages into three kinds with respect to Boilerpipe:
1. a web page with a single article in it (Boilerpipe worthy!)
2. a web page with multiple articles in it, such as the front page of the New York Times
3. a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.

Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well: it might require an aggregate of related web pages to determine what is clutter and what isn't.
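
One rough way to at least spot case #3 is to measure how much of a page's visible text sits inside links. The sketch below is only an illustration of that idea, not something Boilerpipe exposes in this form: the class name, the regex-based tag handling, and the 0.5 threshold are all assumptions, and it does not try to tell case #1 from case #2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkDensitySketch {
    // Matches <a ...>...</a>, case-insensitive, across line breaks (crude, regex-based).
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Number of non-whitespace characters left after removing tags. */
    static int textLength(String htmlFragment) {
        String noTags = TAG.matcher(htmlFragment).replaceAll("");
        return noTags.replaceAll("\\s+", "").length();
    }

    /** Share of the visible text that sits inside <a> elements, from 0.0 to 1.0. */
    static double linkDensity(String html) {
        int total = textLength(html);
        if (total == 0) {
            return 0.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            linked += textLength(m.group(1));
        }
        return (double) linked / total;
    }

    /** Hypothetical rule of thumb: mostly link text suggests a link page (case #3). */
    static boolean looksLinkHeavy(String html, double threshold) {
        return linkDensity(html) > threshold;
    }

    public static void main(String[] args) {
        String linkPage = "<ul><li><a href='/x'>One</a></li><li><a href='/y'>Two</a></li></ul>";
        String articlePage = "<p>Long text here with <a href='#'>one</a> link</p>";
        System.out.println(looksLinkHeavy(linkPage, 0.5));
        System.out.println(looksLinkHeavy(articlePage, 0.5));
    }
}
```

A real page would need script and style content removed first, since this sketch counts it as visible text.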

挽巷

2021-01-31 05:21

You could use the Textracto API. It extracts the main 'article' text, and it can also extract all other textual content. By 'subtracting' one text from the other, you can separate the navigation texts, preview texts, etc. from the main textual content.
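
The 'subtracting' step could look roughly like the sketch below. It assumes you already have both texts as plain strings (how you fetch them from the API is not shown), and it compares them line by line, which is an assumption of this sketch rather than anything the API prescribes. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TextSubtraction {

    /** Returns the non-empty lines of fullText that do not occur in articleText. */
    static List<String> subtract(String fullText, String articleText) {
        // Collect the trimmed article lines for fast lookup.
        Set<String> articleLines = new HashSet<>();
        for (String line : articleText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                articleLines.add(trimmed);
            }
        }
        // Keep every remaining line of the full text, in its original order.
        List<String> rest = new ArrayList<>();
        for (String line : fullText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !articleLines.contains(trimmed)) {
                rest.add(trimmed);
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        String full = "Home | News | Contact\nBig headline\nFirst paragraph of the story.\nRead more: other story";
        String article = "Big headline\nFirst paragraph of the story.";
        // Prints the lines that are navigation or teaser text rather than article text.
        System.out.println(subtract(full, article));
    }
}
```

Exact line matching is fragile if the two extractions wrap or normalize text differently, so a fuzzier comparison may be needed in practice.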


轻奢々

2021-01-31 05:22

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

import java.io.Reader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);

// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);

// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}

TextBlock has some more exciting methods, feel free to play around!


醉话见心

2021-01-31 05:23

http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php

Proprietary software, but it makes it very easy to extract data from web pages and integrates well with Java.

You use a provided application to design XML files that the RoboServer API reads in order to parse web pages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.

If you'd rather not use such software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site

自闭症患者

2021-01-31 05:30

You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of options for you:

TagSoup

HtmlUnit

走了就别回头了

2021-01-31 05:31

You can filter out the HTML junk and then parse the details you need, or use the site's own API if it has one. See the link below for how to filter the HTML; I hope it helps. http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
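
As a rough idea of what "filtering the junk" can mean, here is a minimal sketch that drops script blocks, style blocks, comments, and tags using only java.util.regex. It is not the method from the linked post; the class name is hypothetical, and regexes are a crude stand-in for a real HTML parser (entities such as &amp;amp; are left undecoded, and malformed markup may slip through).

```java
import java.util.regex.Pattern;

public class HtmlJunkFilter {
    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes script/style blocks, comments, and tags, then collapses whitespace. */
    static String stripJunk(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
                + "<body><!-- ad --><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripJunk(html)); // prints "Title Hello world"
    }
}
```

Once the junk is gone, the remaining text is much easier to search for the details you actually want.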
