Extract text from Google Scholar

Submitted by 强颜欢笑 on 2020-01-17 08:35:49

Question


I am trying to extract the text snippet that Google Scholar shows for a particular query. By text snippet I mean the text below the title (in black letters). Currently I am trying to extract it from the HTML file using Python, but the result contains a lot of extra text such as

/div><div class="gs_fl"...etc.

Is there an easy way, or some code, that can help me get the text without this redundant markup?


Answer 1:


You need an HTML parser:

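import lxml.html

# html is the page source as a string
doc = lxml.html.fromstring(html)
# xpath() returns a list of elements, so call text_content() on each one
texts = [div.text_content() for div in doc.xpath('//div[@class="gs_fl"]')]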

You can install lxml with "pip install lxml". If no prebuilt package is available for your platform, you'll need to build its dependencies (libxml2 and libxslt), and the details will differ depending on what your platform is.
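If installing lxml is inconvenient, the same idea can be sketched with the standard library alone. The example below swaps lxml for Python's built-in html.parser; it is a minimal sketch, not the method from the answer above. The helper names (ClassTextExtractor, extract_text) and the sample markup are hypothetical, and the class name "gs_rs" for the snippet is an assumption about Google Scholar's markup, which may differ or change.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # <div> nesting depth inside a match; 0 means outside
        self.parts = []     # text fragments of the match currently being read
        self.results = []   # one cleaned string per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A nested <div> inside a match: track it so we find the right close tag.
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1
                self.parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text(html, target_class="gs_rs"):
    # "gs_rs" is an assumed class name for the snippet; pass another if needed.
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup, loosely shaped like a search result entry.
sample = (
    '<div class="gs_ri"><h3 class="gs_rt">A title</h3>'
    '<div class="gs_rs">First line of the <b>snippet</b> text.</div>'
    '<div class="gs_fl"><a href="#">Cited by 10</a></div></div>'
)
print(extract_text(sample))  # ['First line of the snippet text.']
```

Passing a different class name, for example extract_text(sample, "gs_fl"), returns the text of those blocks instead, so it is worth inspecting the saved HTML to confirm which class actually wraps the snippet.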



Source: https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar
