unable to loop the last url with paging limits

坚强是说给别人听的谎言 · Submitted on 2019-12-11 08:00:00

Question


I'm a newbie to web scraping. I set up a loop to scrape 37,900 records. Because of the way the URL/server is set up, each URL displays at most 200 records. Each URL ends with 'skip=200', or a multiple of 200, to advance to the next page where the next 200 records are displayed. Eventually I want to loop through all the URLs and append the results into a single table.

I created the two loops shown below - one to build the URLs, incrementing skip= by 200 each time, and another to fetch the response for each URL and append it to a single dataframe.

However, I run into an error on the last URL and am unable to append these JSONs into a single dataframe:

"The query specified in the URI is not valid. Invalid value 'i37800' for $skip query option found. The $skip query option requires a non-negative integer value."
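The stray i in 'i37800' comes straight from the format string: .format() substitutes only the {} placeholder, so any literal character in front of it is kept verbatim. A minimal reproduction (the host below is a stand-in, since the real host is elided in the question):

```python
# .format() replaces only the {} placeholder; the literal "i" survives
endpoint = "https://example.invalid/Projects?&$skip=i{}".format(37800)
# endpoint now ends with "$skip=i37800", which the server rejects
```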

Edit: after removing the i after 'skip=' in my URL, the second loop threw this error:

TypeError: 'list' object is not callable
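That error is what Python raises whenever a list is called like a function - the () after the list name is the culprit, as a minimal reproduction shows:

```python
pages = ["url1", "url2"]
try:
    for p in pages():  # calling the list itself raises TypeError
        pass
except TypeError as err:
    message = str(err)  # "'list' object is not callable"
```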

When I open this URL https://~/Projects?&$skip=37800 directly, the records are displayed properly, so I'm not sure why Python throws this error. Please see my code below - I'd appreciate any suggestions to fix this error and these loops!

Thanks!

import pandas as pd
import requests
import json

records = range(37900)
skip = records[0::200]

Page = []
for i in skip:
    endpoint = "https://~/Projects?&$skip=i{}".format(i)
    Page.append(endpoint)

tbls = []
for j in Page():
    response = session.get(j) #session here refers to requests.Session() I had to set up to authenticate my access to these urls
    responsejs = response.json()
    responsepd = pd.DataFrame(responsejs['value']) #I only want to extract header called 'value' in each json
    tbls.append(responsepd)
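Both errors trace back to two one-character fixes: remove the literal i from the format string, and iterate over the list Page rather than calling it with (). A minimal sketch of the corrected code, assuming the same endpoint shape and an authenticated requests.Session (the example.invalid host stands in for the elided real host):

```python
import pandas as pd
import requests

# Build one URL per page of 200 records: skip = 0, 200, ..., 37800
skip = range(0, 37900, 200)
pages = ["https://example.invalid/Projects?&$skip={}".format(i) for i in skip]

def fetch_all(session, urls):
    """Fetch each page and stack the 'value' arrays into one dataframe."""
    tbls = []
    for url in urls:  # iterate over the list itself -- no parentheses
        responsejs = session.get(url).json()
        tbls.append(pd.DataFrame(responsejs["value"]))
    return pd.concat(tbls, ignore_index=True)

# Usage (requires a session authenticated for these URLs):
# session = requests.Session()
# df = fetch_all(session, pages)
```

Note that range(0, 37900, 200) replaces the range(37900)[0::200] slice with the same offsets, and pd.concat at the end collapses the per-page frames into the single table the question asks for.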

Source: https://stackoverflow.com/questions/58494208/unable-to-loop-the-last-url-with-paging-limits
