Extract text from Google Scholar

Question

I am trying to extract the text snippet that Google Scholar shows for a particular query. By text snippet I mean the text below the title (in black letters). Currently I am trying to extract it from the HTML file using Python, but the result contains a lot of extra text such as

/div><div class="gs_fl"...etc.

Is there an easy way, or some code, that can help me get the snippet text without this redundant markup?

Answer 1:

You need an HTML parser:

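import lxml.html

# html holds the page source as a string
doc = lxml.html.fromstring(html)
# xpath() returns a list of elements, so call text_content() on each match.
# Swap "gs_fl" for the class of the div that wraps the snippet if it differs.
texts = [div.text_content() for div in doc.xpath('//div[@class="gs_fl"]')]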

You can install lxml with "pip install lxml". If no prebuilt package is available for your platform, you'll need to build its dependencies (libxml2 and libxslt), and the details will differ depending on your platform.
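If installing lxml turns out to be a problem, roughly the same idea can be sketched with only the standard library's html.parser. This is not the code from the answer above, just a minimal alternative under some assumptions: DivTextExtractor and extract_div_text are hypothetical names, the sample markup is made up, and "gs_rs" is assumed to be the class of the snippet div (check the actual page source, since Google Scholar's markup may differ or change).

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the visible text of every <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # <div> nesting level inside a matching div
        self.parts = []     # text fragments of the div currently being read
        self.results = []   # one cleaned string per matching div

    def handle_starttag(self, tag, attrs):
        if self.depth and tag == "br":
            self.parts.append(" ")  # treat a line break as whitespace
        if tag != "div":
            return
        if self.depth:
            self.depth += 1         # nested div inside a match
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1          # entering a matching div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.parts).split()))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_div_text(html, css_class):
    """Return the text of each <div class="css_class"> found in html."""
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup; "gs_rs" as the snippet class is an assumption.
sample = (
    '<div class="gs_rs">A <b>text</b> snippet<br>about parsing.</div>'
    '<div class="gs_fl"><a href="#">Cited by 12</a></div>'
)
print(extract_div_text(sample, "gs_rs"))
```

Only div nesting is tracked, so inline tags such as b and void tags such as br inside the snippet do not throw off the bookkeeping. For real pages lxml is likely the more robust choice; this version just avoids the install step.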

Source: https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar

Tags

python

text-mining

google-scholar
