How to scrape multiple pages with an unchanging URL - Python 3

后端 未结 1 1349
清酒与你
清酒与你 2020-12-03 12:30

I recently got in touch with web scraping and tried to web scrape various pages. For now, I am trying to scrape the following site - http://www.pizzahut.com.cn/StoreList

相关标签:
1条回答
  • 2020-12-03 13:17

    Steps to get the data:

    Open the developer tools in your browser (for Google Chrome it's Ctrl+Shift+I). Now, go to the XHR tab which is located inside the Network tab.

    After doing that, click on the next page button. You'll see the following file.

    Click on that file. In the General block, you'll see these 2 things that we need.

    Scrolling down, in the Form Data tab, you can see the 3 variables as

    Here, you can see that changing the value of pageIndex will give all the pages required.

    Now, that we've got all the required data, we can write a POST method for the URL http://www.pizzahut.com.cn/StoreList/Index using the above data.

    Code:

    I'll show you the code to scrape first 2 pages, you can scrape any number of pages you want by changing the range().

    for page_no in range(1, 3):
        data = {
            'pageIndex': page_no,
            'pageSize': 10,
            'keyword': '输入餐厅地址或餐厅名称'
        }
        page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
        soup = BeautifulSoup(page.text, 'html.parser')
    
        print('PAGE', page_no)
        for row in soup.find_all('div',class_='re_RNew'):
            name = row.find('p',class_='re_NameNew').string
            info = row.find('input').get('value')
            location = info.split('|')
            location_data = location[0].split(',')
            longitude = location_data[0]
            latitude = location_data[1]
            print(longitude, latitude)
    

    Output:

    PAGE 1
    31.085877 121.399176
    31.271117 121.587577
    31.098122 121.413396
    31.331458 121.440183
    31.094581 121.503654
    31.270737000 121.481178000
    31.138214 121.386943
    30.915685 121.482079
    31.279029 121.529255
    31.168283 121.283322
    PAGE 2
    31.388674 121.35918
    31.231706 121.472644
    31.094857 121.219961
    31.228564 121.516609
    31.235717 121.478692
    31.288498 121.521882
    31.155139 121.428885
    31.235249 121.474639
    30.728829 121.341429
    31.260372 121.343066
    

    Note: You can change the results per page by changing the value of pageSize (currently it's 10).

    0 讨论(0)
提交回复
热议问题