Python requests API is not fetching data inside table bodies

Submitted by 橙三吉。 on 2020-01-02 18:06:49

Question


I am trying to scrape a web page to get table values from the text returned in a requests response. Part of the HTML looks like this:

</thead>
 <tbody class="stats"></tbody>
 <tbody class="annotation"></tbody>
 </table>
 </div>

There is data present inside these tbody elements in the browser, but I am unable to access that data using requests.

Here is my code

import requests

server = "http://www.ebi.ac.uk/QuickGO/GProtein"
header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
payloads = {'ac': 'Q9BRY0'}
response = requests.get(server, params=payloads, headers=header)

print(response.text)
# soup = BeautifulSoup(response.text, 'lxml')
# print(soup)

Answer 1:


Frankly, I'm beginning to lose interest in routine scraping that involves tools like selenium, and beyond that I wasn't sure it would work here. This approach does.

You would only do this, in this form at least, if you had more than a few files to download.

>>> import bs4
>>> form = '''<form method="POST" action="GAnnotation"><input name="a" value="" type="hidden"><input name="termUse" value="ancestor" type="hidden"><input name="relType" value="IPO=" type="hidden"><input name="customRelType" value="IPOR+-?=" type="hidden"><input name="protein" value="Q9BRY0" type="hidden"><input name="tax" value="" type="hidden"><input name="qualifier" value="" type="hidden"><input name="goid" value="" type="hidden"><input name="ref" value="" type="hidden"><input name="evidence" value="" type="hidden"><input name="with" value="" type="hidden"><input name="source" value="" type="hidden"><input name="q" value="" type="hidden"><input name="col" value="proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,aspect,evidence,ref,with,proteinTaxon,date,from,splice" type="hidden"><input name="select" value="normal" type="hidden"><input name="aspectSorter" value="" type="hidden"><input name="start" value="0" type="hidden"><input name="count" value="25" type="hidden"><input name="format" value="gaf" type="hidden"><input name="gz" value="false" type="hidden"><input name="limit" value="22" type="hidden"></form>'''
>>> soup = bs4.BeautifulSoup(form, 'lxml')
>>> action = soup.find('form').attrs['action']
>>> action 
'GAnnotation'
>>> inputs = soup.findAll('input')
>>> params = {}
>>> for input in inputs:
...     params[input.attrs['name']] = input.attrs['value']
...     
>>> import requests
>>> r = requests.post('http://www.ebi.ac.uk/QuickGO/GAnnotation', data=params)
>>> r
<Response [200]>
>>> open('temp.htm', 'w').write(r.text)
4082

The downloaded file is what you would receive if you simply clicked on the button.

Details for the Chrome browser:

  • Open the page in Chrome.
  • Right-click on the 'Download' link.
  • Select 'Inspect'.
  • Select 'Network' in the Chrome _Developer_ menu (near the top), and then 'All'.
  • Click on 'Download' in the page.
  • Click on 'Download' in the newly opened window.
  • 'quickgoUtil.js:36' will appear in the 'Initiator' column.
  • Click on it.
  • Now you can set the breakpoint on `form.submit();` by clicking on its line number.
  • Click on 'Download' again; execution will pause at breakpoint.
  • In the right-hand window notice 'Local'. One of its contents is `form`. You can expand it for the contents of the form.

You want the outerHTML property of this element for the information used in the code above, namely for its action and name-value pairs. (And the implied information that POST is used.)

Now use the requests module to submit a request to the website.

Here's a list of the items in params, in case you want to make other requests; a sketch of reusing them follows the list.

>>> for item in params.keys():
...     item, params[item]
... 
('qualifier', '')
('source', '')
('count', '25')
('protein', 'Q9BRY0')
('format', 'gaf')
('termUse', 'ancestor')
('gz', 'false')
('with', '')
('goid', '')
('start', '0')
('customRelType', 'IPOR+-?=')
('evidence', '')
('aspectSorter', '')
('tax', '')
('relType', 'IPO=')
('limit', '22')
('col', 'proteinDB,proteinID,proteinSymbol,qualifier,goID,goName,aspect,evidence,ref,with,proteinTaxon,date,from,splice')
('q', '')
('ref', '')
('select', 'normal')
('a', '')
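
For example, here is a minimal sketch of reusing those name-value pairs with different settings. The substituted values below (another UniProt accession, a larger limit) are only illustrative assumptions, not values taken from the page:

import requests

# A minimal sketch, assuming `params` is the dict built from the form inputs above.
custom = dict(params)
custom['protein'] = 'P12345'   # hypothetical: substitute another UniProt accession of interest
custom['limit'] = '100'        # hypothetical: ask for more rows than the captured value of 22

r = requests.post('http://www.ebi.ac.uk/QuickGO/GAnnotation', data=custom)
r.raise_for_status()

with open('temp2.htm', 'w') as f:   # save the response, as with temp.htm above
    f.write(r.text)

The same pattern works for any of the keys listed above; only the values you actually need to change have to differ from the captured form.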



Answer 2:


I gather from your comment above that you're dealing with JavaScript. To scrape and parse JavaScript-rendered content you could use selenium. Here is a snippet that could help in your case:

from selenium import webdriver
from bs4 import BeautifulSoup

url = ''  # put the page you want to scrape here

browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
print(soup.prettify())

You will have to install ChromeDriver and the Chrome browser, though. If you want, you could use a headless browser such as PhantomJS so you don't have to launch a full Chrome window every time you execute the script; a headless-Chrome sketch follows.
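
As an alternative to PhantomJS, here is a minimal sketch of running Chrome itself headlessly through selenium. It assumes a selenium version whose ChromeOptions accepts the '--headless' argument and whose Chrome constructor takes an options keyword (older releases used chrome_options instead); the URL is the one from the question:

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')      # run Chrome without opening a window

browser = webdriver.Chrome(options=options)
browser.get('http://www.ebi.ac.uk/QuickGO/GProtein?ac=Q9BRY0')  # URL from the question
soup = BeautifulSoup(browser.page_source, 'lxml')
print(soup.prettify())
browser.quit()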



Source: https://stackoverflow.com/questions/45241940/python-request-api-is-not-fetching-data-inside-table-bodies
