Crawling tables from webpage

我怕爱的太早我们不能终老 submitted on 2020-01-05 03:19:05

Question


I'm trying to extract CSU employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried both the urllib2 and requests libraries, but neither returned the actual table from the webpage. My guess is that the table is generated dynamically by JavaScript. Below is my code using requests.

from lxml import html
import requests

page = requests.get("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")
tree = html.fromstring(page.text)
name = tree.xpath('//table/tbody/tr/td[2]/text()')

Any help or comments would be highly appreciated.


Answer 1:


Here's my attempt at it, as per my comment. Note that I only pulled out one line of data. The rest is up to you.

Code:

import requests as rq

url = "http://api.sacbeelabs.com/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json"
data = "74XoegZ494trsvrus_As4B4handjZ494-Adl4B4olg494dnnk933pppAmWYXaaAYjh3mnWnakWq3-Ela-B-Oahkgjqaa07tw8tJmaWlYd07tw8tJiWha07tw8uH07tw8tJqaWl07tw8uHtrsu07tw8tJZakWlnhain07tw8uHGT-107tw8trTWYlWhainj4B4labalal494dnnk933mnWYfj-8albgjpAYjh3-Boamnejim3tt_v_rt_3YlWpgeic1nWXgam1bljh1paXkWca4B4nenga494TnWnaDVjlfalDTWgWlqDTaWlYdD1DUdaDTWYlWhainjDFaaBDTWYlWhainjBDGWgebjlieW4B4mYlV49sxzrB4mYlL49srwrB4peiV49sxzrB4peiL49_stB4oW4974Wcain494Oj-CeggW3wArD-I-6ss-MD-1Xoino-MDNeio-AD-Azx2xv-MDl-89tzAr-JDKaYfj3trsrrsrsDJelabj-A3tzAr4B4njoYd49bWgmaB4Zjh4954mnjlWca4B4WiehWneji4B4YWi-8WmtZ4B4paXmjYfan4B4pjlfal4B4WoZej4B4-8eZaj4B4m-8c4B4cajgjY46B4Ymm4954WiehWneji4B4nlWimbjlh468B4omal4974Woi494Koamn488"
headers = {
    'Host': 'api.sacbeelabs.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'X-SBAPI-Auth-Token': '0QNWbefXw6fQQcWXqK8vDw',
    'X-SBAPI-SID': '3gbRqglHXAVDy1vwdcVVMf',
    'X-SBAPI-CID': '2HuWho39ZcDUlTswYSWUd9',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Referer': 'http://www.sacbee.com/statepay/',
    'Content-Length': '684',
    'Origin': 'http://www.sacbee.com',
    'Cookie': 'sbapi-cid=2HuWho39ZcDUlTswYSWUd9; sbapi-sid=3gbRqglHXAVDy1vwdcVVMf',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}

r = rq.post(url, data=data, headers=headers)
json_data = r.json()

base = json_data["result"]["employees"][0] # First employee.

name = base["name"]
first_name = name["first"]
last_name = name["last"]

pay = base["pay"]["total"]

title = base["title"]
dept = base["department"]

print(first_name, last_name, pay, title, dept)
# Your turn here...

Result:

Clayton Abajian 9844 Lecturer - Academic Year CSU Sacramento
[Finished in 0.9s]
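
To go from one employee to the whole list, the same lookups can be wrapped in a loop. The following is only a sketch in Python 3: `employee_rows` is a name made up here, the `sample` dictionary is hand-built to mirror the single result shown above (it is not a real API response), and it assumes every entry in `employees` has the same keys as the first one.

```python
def employee_rows(json_data):
    """Yield (first, last, total_pay, title, department) per employee.

    Assumes each entry has the same shape as the first entry used above.
    """
    for emp in json_data["result"]["employees"]:
        name = emp["name"]
        yield (
            name["first"],
            name["last"],
            emp["pay"]["total"],
            emp["title"],
            emp["department"],
        )


# Hand-built stand-in for r.json(), mirroring the one record printed above.
sample = {
    "result": {
        "employees": [
            {
                "name": {"first": "Clayton", "last": "Abajian"},
                "pay": {"total": 9844},
                "title": "Lecturer - Academic Year",
                "department": "CSU Sacramento",
            }
        ]
    }
}

rows = list(employee_rows(sample))
for row in rows:
    print(*row)
```

In the real script you would pass `json_data` instead of `sample`.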



Answer 2:


I just took a quick look at the website you mentioned. The table is indeed loaded by JavaScript, so it is not actually part of the page your script requests.
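
A quick way to confirm this kind of problem is to count the table cells in the HTML your script actually receives. Here is a minimal Python 3 sketch using only the standard library. `CellCounter` and `count_table_cells` are names invented for this example, and both HTML strings are made-up stand-ins, not the real sacbee.com markup.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Counts <td> start tags in whatever HTML is fed to it."""

    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells += 1


def count_table_cells(html_text):
    counter = CellCounter()
    counter.feed(html_text)
    counter.close()
    return counter.cells


# Made-up example of a page whose table is filled in later by a script.
static_shell = (
    "<html><body><div id='results'></div>"
    "<script src='app.js'></script></body></html>"
)
# Made-up example of what the browser shows after the script has run.
rendered = "<table><tbody><tr><td>1</td><td>Clayton Abajian</td></tr></tbody></table>"

print(count_table_cells(static_shell))  # 0
print(count_table_cells(rendered))      # 2
```

If `count_table_cells(page.text)` comes back as 0 for the response in your script, no XPath expression will find the rows, because they were never in that response.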

To fix this, you'll probably have to look into the web requests made by the website and find the one that retrieves the table data. It is not hard to do, just a nuisance. Take a look here for a similar question. Hope it helps!
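
One hint about where to look is already in the URL from the question. Everything after the `#` is a fragment, which HTTP clients do not send to the server, so the request only ever reaches the bare `/statepay/` page. Decoding the fragment shows the query the page's script works with. A small Python 3 sketch (the variable names are mine):

```python
from urllib.parse import urldefrag, unquote

PAGE_URL = (
    "http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D"
    "%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento"
)

# Split off the fragment; only base_url is part of the HTTP request.
base_url, fragment = urldefrag(PAGE_URL)
decoded_fragment = unquote(fragment)

print(base_url)          # http://www.sacbee.com/statepay/
print(decoded_fragment)  # req=employee/search/name=/year=2013/department=CSU Sacramento
```

The decoded text is the sort of string to search for in the browser's network tab when hunting for the request that returns the table data.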



Source: https://stackoverflow.com/questions/22949029/crawling-tables-from-webpage
