Reading article content with goose retrieves nothing

南方客 2021-01-29 08:14

I am trying to use goose to read from .html files (a URL is given here for the sake of convenience in the examples)[1]. But at times it doesn't show any text. Please help me out here with th

1 answer
  •  时光取名叫无心
    2021-01-29 09:09

    Goose does use several predefined elements that are likely good starting points for finding the top node. If none of these "known" elements is found, it searches for the top_node, which in general is an element containing many p tags. See extractors/content.py for more details.

    The given article lacks the traits of a typical article, which is normally wrapped in an article tag or in a div whose class or id is something like 'post-content', 'story-body', 'article', etc. Here the content sits in a div with id = 'docText' that contains no paragraphs, so Goose has nothing to go on when picking the top node.
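    To make the lookup order concrete, here is a rough standard-library sketch of the idea. It is not Goose's actual code: KNOWN_TAGS, TreeBuilder, and find_top_node are hypothetical names, the selector list is a made-up subset, and the fallback simply counts direct p children, which is much cruder than Goose's real scoring. It also assumes well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a list of known content selectors (illustrative only).
KNOWN_TAGS = [
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]

# Elements that never have a closing tag, so the tree must not descend into them.
VOID_TAGS = {'br', 'img', 'hr', 'meta', 'link', 'input'}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; assumes well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node('root', [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent


def walk(node):
    """Yield every descendant of node in document order."""
    for child in node.children:
        yield child
        yield from walk(child)


def matches(node, rule):
    """A rule is either {'tag': name} or {'attr': name, 'value': token}."""
    if 'tag' in rule:
        return node.tag == rule['tag']
    return rule['value'] in (node.attrs.get(rule['attr']) or '').split()


def count_p(node):
    return sum(child.tag == 'p' for child in node.children)


def find_top_node(html, known_tags=KNOWN_TAGS):
    builder = TreeBuilder()
    builder.feed(html)
    nodes = list(walk(builder.root))
    # Step 1: try the known selectors in order.
    for rule in known_tags:
        for node in nodes:
            if matches(node, rule):
                return node
    # Step 2: fall back to the element with the most direct <p> children.
    best = max(nodes, key=count_p, default=None)
    if best is None or count_p(best) == 0:
        return None
    return best


# A page shaped like the one in the question: an id the list does not know, no <p> tags.
page = '<html><body><div id="docText">Chennai, Dec. 19 ...<br><br>...</div></body></html>'
print(find_top_node(page))  # None
```

    With neither a known selector nor any p tags to count, the sketch returns nothing, which mirrors the empty result described in the question.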

    What I can suggest is to add this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

    KNOWN_ARTICLE_CONTENT_TAGS = [
        {'attr': 'id', 'value': 'docText'},
        # ... other entries go here
    ]
    

    and here is the extracted body:

    Chennai, Dec. 19 -- The Tamil Nadu Government on Monday appointed a one-man judicial commission of inquiry to look into the reasons for Sunday's stampede in state capital Chennai, which claimed 42 lives and left another 37 injured.\n\nThe announcement of the formation of the commission came even as family members of those killed in a stampede agonised and agitated over the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.\n\nOfficials said over 5,000 people rushed in as the gates of the shelter opened, causing the stampede.\n\nChitra, family member of a victim, said it was mismanagement that led to the tragedy. \u2026
