How can I extract only the main textual content from an HTML page?

front-end · 9 answers · 1415 views

Asked by 旧巷少年郎 on 2021-01-31 04:48

Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, but only links with s

9 Answers
  • 2021-01-31 05:19

    There appears to be a possible problem with Boilerpipe. Why? Well, it appears that it is suited only to certain kinds of web pages, such as pages that have a single body of content.

    So one can crudely classify web pages into three kinds in respect to Boilerpipe:

    1. a web page with a single article in it (Boilerpipe worthy!)
    2. a web page with multiple articles in it, such as the front page of the New York Times
    3. a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.

    Boilerpipe works on case #1. But if one is doing a lot of automated text processing, how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well: it might require an aggregate of related web pages to determine what is clutter and what isn't.
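    One crude way for software to "know" which bucket a page falls into is to look at the shape of the segmented output: how many content blocks there are and how large the biggest one is relative to the total. Here is a minimal sketch of that idea, assuming the per-block word counts have already been obtained (for example from boilerpipe's TextBlock list); the thresholds are illustrative guesses, not tuned values:

    ```java
    // Crude page-type heuristic: classify a page from the shape of its
    // segmented text blocks. Thresholds are illustrative, not tuned.
    public class PageTypeHeuristic {

        enum PageType { SINGLE_ARTICLE, MULTI_ARTICLE, LINK_PAGE }

        // wordsPerContentBlock: word counts of the blocks a segmenter
        // (e.g. boilerpipe) classified as "content"
        static PageType classify(int[] wordsPerContentBlock) {
            int total = 0, largest = 0, largeBlocks = 0;
            for (int w : wordsPerContentBlock) {
                total += w;
                largest = Math.max(largest, w);
                if (w >= 50) largeBlocks++;   // "article-sized" block (a guess)
            }
            if (total == 0 || largeBlocks == 0) return PageType.LINK_PAGE;
            if (largeBlocks == 1 || largest >= 0.6 * total) return PageType.SINGLE_ARTICLE;
            return PageType.MULTI_ARTICLE;
        }

        public static void main(String[] args) {
            System.out.println(classify(new int[]{420, 12, 8}));     // SINGLE_ARTICLE
            System.out.println(classify(new int[]{80, 95, 70, 60})); // MULTI_ARTICLE
            System.out.println(classify(new int[]{5, 7, 4, 6}));     // LINK_PAGE
        }
    }
    ```

    Real classification would of course want more signals (link density, markup structure), but even a cheap heuristic like this lets a pipeline decide whether Boilerpipe's case #1 extraction is safe to apply.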

  • 2021-01-31 05:21

    You could use the textracto api; it extracts the main 'article' text, and there is also the opportunity to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc., from the main textual content.
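    The 'subtracting' step can be done with plain string handling once you have both texts, whatever service produced them. A minimal stdlib-only sketch of the idea (the input strings here are made up for illustration):

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Subtract the main-article lines from the full page text, leaving the
    // "other" textual content (navigation, previews, footers, etc.).
    public class TextSubtract {

        static List<String> subtract(String fullText, String articleText) {
            Set<String> articleLines = new HashSet<>(Arrays.asList(articleText.split("\n")));
            List<String> rest = new ArrayList<>();
            for (String line : fullText.split("\n")) {
                if (!line.isEmpty() && !articleLines.contains(line)) {
                    rest.add(line);
                }
            }
            return rest;
        }

        public static void main(String[] args) {
            String full = "Home | News | Sports\nThe big story text.\nCopyright 2021";
            String article = "The big story text.";
            System.out.println(subtract(full, article));
            // prints [Home | News | Sports, Copyright 2021]
        }
    }
    ```

    Line-exact matching is crude (whitespace or encoding differences between the two texts would break it), but it shows the shape of the approach.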

  • 2021-01-31 05:22

    You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).

    Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:

    // required boilerpipe / SAX imports
    import org.xml.sax.InputSource;
    import de.l3s.boilerpipe.document.TextBlock;
    import de.l3s.boilerpipe.document.TextDocument;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;
    import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

    Reader reader = ...
    InputSource is = new InputSource(reader);

    // parse the document into boilerpipe's internal data structure
    TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();

    // perform the extraction/classification process on "doc"
    ArticleExtractor.INSTANCE.process(doc);

    // iterate over all blocks (= segments as "ArticleExtractor" sees them)
    for (TextBlock block : doc.getTextBlocks()) {
        // block.isContent() tells you if it's likely to be content or not
        // block.getText() gives you the block's text
    }
    

    TextBlock has some more exciting methods, feel free to play around!

  • 2021-01-31 05:23

    http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php

    Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.

    You use a provided application to design XML files that the RoboServer API reads to parse webpages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (websites generally follow consistent patterns). You can set up scheduling, running, and database integration using the provided Java API.

    If you're against using software and doing it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags, and then build per-site rules.
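    The per-site idea can be as simple as a lookup table from hostname to an extraction rule. A minimal sketch using regexes as stand-ins for real per-site selectors (production code would use an HTML parser with CSS selectors or XPath; the hostnames and patterns here are made up):

    ```java
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Dispatch extraction by hostname: each site gets its own rule instead
    // of one rule for all sites. Regexes stand in for real selectors.
    public class PerSiteRules {

        static final Map<String, Pattern> RULES = new HashMap<>();
        static {
            // hypothetical sites and patterns, for illustration only
            RULES.put("example.com", Pattern.compile("<div id=\"story\">(.*?)</div>", Pattern.DOTALL));
            RULES.put("news.example.org", Pattern.compile("<article>(.*?)</article>", Pattern.DOTALL));
        }

        static String extract(String host, String html) {
            Pattern rule = RULES.get(host);
            if (rule == null) return null;            // no rule yet for this site
            Matcher m = rule.matcher(html);
            return m.find() ? m.group(1).trim() : null;
        }

        public static void main(String[] args) {
            String html = "<body><article>Main story here.</article><nav>links</nav></body>";
            System.out.println(extract("news.example.org", html)); // Main story here.
        }
    }
    ```

    The payoff of this design is that adding support for a new site is one table entry, and a null result tells you the site still needs a rule.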

  • 2021-01-31 05:30

    You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:

    Tag Soup

    HTML Unit

  • 2021-01-31 05:31

    You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site. Refer to the link below on filtering the HTML; I hope it helps. http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
