Question
I am trying to extract the text snippet that Google Scholar shows for each result of a particular query. By text snippet I mean the text below the title (in black letters). Currently I am trying to extract it from the HTML file using Python, but the result contains a lot of extra text such as
/div><div class="gs_fl"
...etc.
Is there an easy way, or some code, that can help me get the text without this redundant markup?
Answer 1:
You need an HTML parser:
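import lxml.html

doc = lxml.html.fromstring(html)  # html: the page source as a string
# xpath() returns a list of elements, so call text_content() on each match
# (swap "gs_fl" for the class of the div that actually holds the snippet)
texts = [div.text_content() for div in doc.xpath('//div[@class="gs_fl"]')]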
You can install lxml with "pip install lxml", but you'll need to build its dependencies, and the details will be different depending on what your platform is.
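If building lxml is inconvenient, the same idea can be sketched with the standard library's html.parser module. This is only an illustration under stated assumptions: the class name `gs_rs` for the snippet div, the helper names `DivTextExtractor` and `extract_div_text`, and the sample markup are all hypothetical and should be checked against the HTML you actually saved.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth of <div> tags inside the current match
        self.chunks = []    # text pieces of the match currently being read
        self.results = []   # one cleaned string per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # Nested <div> inside a match: track it so we stop at the right </div>.
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1
                self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.chunks).split()))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_div_text(html, target_class):
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample markup, not copied from a real results page.
sample = (
    '<div class="gs_ri"><h3>Some title</h3>'
    '<div class="gs_rs">A short <b>snippet</b> of the abstract ...</div>'
    '<div class="gs_fl"><a href="#">Cited by 12</a> <a href="#">Related articles</a></div>'
    '</div>'
)
snippets = extract_div_text(sample, "gs_rs")
print(snippets)
```

Tracking the div nesting depth keeps inline tags such as `<b>` inside the snippet while skipping the title and the link row, which is the "redundant text" the question describes.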
Source: https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar