Read article content using goose retrieving nothing

Submitted by 夏天 on 2019-12-20 07:45:20

Question


I am trying to use Goose to read article text from .html files (a URL is used here for convenience in the example) [1]. At times, however, it returns no text at all. Please help me figure out what is going wrong.

Goose version used: https://github.com/agolo/python-goose/. The present version gives some errors.

from goose import Goose
from requests import get

# Download the page and hand the raw HTML to Goose
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)

# cleaned_text comes back empty for this page
text = article.cleaned_text
print(text)

Answer 1:


Goose first checks several predefined ("known") elements, which are a good starting point for finding the top node. If none of these known elements is found, it falls back to searching for the top node heuristically, which in general is an element containing many p tags. You can read extractors/content.py for more details.

The given page lacks the usual traits of an article, which is normally wrapped in an article tag or in a div tag whose class or id is something like 'post-content', 'story-body', or 'article'. Here the content sits in a div tag with id = 'docText' and contains no paragraphs, so Goose has nothing to identify it as the article body.
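To make the "known elements first" idea concrete, here is a minimal standalone sketch. It is not Goose's actual code: it uses only the Python 3 standard library, the function name extract_known_content is hypothetical, and the rule format (dicts with 'attr'/'value' and an optional 'tag' key) is assumed to mirror the entries in KNOWN_ARTICLE_CONTENT_TAGS. It tries each rule in order and returns the text of the first matching element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _MatchCollector(HTMLParser):
    """Collect the text inside the first element that matches one rule.

    Assumes reasonably well-formed HTML (every non-void tag is closed).
    """

    def __init__(self, rule):
        super().__init__(convert_charrefs=True)
        self.rule = rule
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # True once the matched element has been closed
        self.chunks = []

    def _matches(self, tag, attrs):
        if "tag" in self.rule and tag != self.rule["tag"]:
            return False
        if "attr" in self.rule:
            return dict(attrs).get(self.rule["attr"]) == self.rule["value"]
        return True

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_known_content(html, known_tags):
    """Return the text of the first element matching any rule, else None."""
    for rule in known_tags:
        parser = _MatchCollector(rule)
        parser.feed(html)
        parser.close()
        text = "".join(parser.chunks).strip()
        if text:
            return text
    return None


# Illustrative input only; the real page is much larger.
sample = ('<div class="header">Site menu</div>'
          '<div id="docText">Chennai, Dec. 19 -- sample body text</div>')
rules = [{'attr': 'class', 'value': 'post-content'}]
print(extract_known_content(sample, rules))
print(extract_known_content(sample, [{'attr': 'id', 'value': 'docText'}] + rules))
```

With only the generic rule, nothing on the sample page matches; once a rule for id 'docText' is placed first, the body is found. Goose itself does more than this (scoring, cleaning, the paragraph-based fallback), so treat the sketch as an illustration of the lookup order only.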

My suggestion is to add this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'id', 'value': 'docText'},
    # ... other paths go here
]

With that entry in place, here is the extracted body:

Chennai, Dec. 19 -- The Tamil Nadu Government on Monday appointed a one-man judicial commission of inquiry to look into the reasons for Sunday's stampede in state capital Chennai, which claimed 42 lives and left another 37 injured.\n\nThe announcement of the formation of the commission came even as family members of those killed in a stampede agonised and agitated over the unexpected tragedy.\n\nThe 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.\n\nOfficials said over 5,000 people rushed in as the gates of the shelter opened, causing the stampede.\n\nChitra, family member of a victim, said it was mismanagement that led to the tragedy. \u2026



Source: https://stackoverflow.com/questions/30381944/read-article-content-using-goose-retrieving-nothing
