Scrape tables with Python

没有蜡笔的小新 2021-01-14 23:23

I am trying to scrape tables of US election data and convert them into data tables in Python, but I have had little luck. This is the HTML of the data I want to scrape.

2 answers
  • 2021-01-14 23:58

    After some time I managed to scrape all the data from this website. The main problem was that the page content is rendered by JavaScript, so I could not scrape it with BeautifulSoup alone. I used selenium + beautifulsoup4: Selenium loads and renders the page, and BeautifulSoup parses the resulting HTML.

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Path to a local chromedriver binary; adjust for your machine.
    chrome_path = r"C:\Users\Desktop\chromedriver_win32\chromedriver.exe"
    # Selenium 3 style call; Selenium 4 passes the path via a Service object instead.
    driver = webdriver.Chrome(chrome_path)
    driver.get('http://www.politico.com/2016-election/primary/results/map/president/arizona/')
    time.sleep(80)  # the page loads very slowly
    # Scroll to the bottom of the page, as a user would.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    html = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html, 'html.parser')
    for table in soup.find_all('table', {'class': 'results-table'}):
        for tr in table.find_all('tr'):
            # Collect the whitespace-stripped text of every cell in the row.
            popular = list(tr.stripped_strings)
            print(popular)
    

    Because it is a dynamic web page, I needed to simulate some user actions with Selenium, such as scrolling the page down. The page loads really slowly, so I used a long time.sleep (80 seconds in the code above) to give it time to finish loading. Hope it helps someone.
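A fixed sleep either wastes time or is too short on a bad day. An alternative is to poll until the content actually appears. Selenium ships its own WebDriverWait for this; the helper below is only a library-independent sketch of the same idea. The name wait_until and its parameters are hypothetical, not part of Selenium or of the answer above.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll predicate() until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the driver from the answer above (not run here):
#   wait_until(
#       lambda: driver.find_elements("css selector", "table.results-table"),
#       timeout=80,
#   )
#   html = driver.page_source
```

With this approach the script continues as soon as at least one results table is present, and fails loudly instead of silently parsing an empty page.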

  • 2021-01-15 00:06
    import bs4
    import requests

    r = requests.get('http://www.politico.com/2016-election/results/map/president/alabama/')
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    contents = soup.find(class_='contrast-white')
    # Each 'results-group' block holds the results for one county.
    for table in contents.find_all(class_='results-group'):
        title = table.find(class_='title').text
        for tr in table.find_all('tr'):
            # Each row yields four strings; the first one is discarded.
            _, name, percentage, popular = tr.stripped_strings
            print(title, name, percentage, popular)
    

    Output:

    Autauga County D. Trump 73.4% 18,110
    Autauga County H. Clinton 24.0% 5,908
    Autauga County G. Johnson 2.2% 538
    Autauga County J. Stein 0.4% 105
    Baldwin County D. Trump 77.4% 72,780
    Baldwin County H. Clinton 19.6% 18,409
    Baldwin County G. Johnson 2.6% 2,448
    Baldwin County J. Stein 0.5% 453
    Barbour County D. Trump 52.3% 5,431
    Barbour County H. Clinton 46.7% 4,848
    Barbour County G. Johnson 0.9% 93
    Barbour County J. Stein 0.2% 18
    Bibb County D. Trump 77.0% 6,733
    Bibb County H. Clinton 21.4% 1,874
    Bibb County G. Johnson 1.4% 124
    Bibb County J. Stein 0.2% 17
    Blount County D. Trump 89.9% 22,808
    Blount County H. Clinton 8.5% 2,150
    Blount County G. Johnson 1.3% 337
    Blount County J. Stein 0.4% 89
    Bullock County H. Clinton 75.1% 3,530
    Bullock County D. Trump 24.2% 1,139
    Bullock County G. Johnson 0.5% 22
    Bullock County J. Stein 0.2% 10
    Butler County D. Trump 56.3% 4,891
    Butler County H. Clinton 42.8% 3,716
    Butler County G. Johnson 0.7% 65
    Butler County J. Stein 0.1% 13
    Calhoun County D. Trump 69.2% 32,803
    Calhoun County H. Clinton 27.9% 13,197
    Calhoun County G. Johnson 2.4% 1,114
    Calhoun County J. Stein 0.6% 262
    Chambers County D. Trump 56.6% 7,803
    Chambers County H. Clinton 41.8% 5,763
    Chambers County G. Johnson 1.2% 168
    Chambers County J. Stein 0.3% 44
    Cherokee County D. Trump 83.9% 8,809
    Cherokee County H. Clinton 14.5% 1,524
    Cherokee County G. Johnson 1.4% 145
    Cherokee County J. Stein 0.2% 25
    

    The rest is empty, nothing in there.
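The printed rows are still plain strings ("73.4%", "18,110"). Since the question asks for data tables, here is a minimal sketch of turning such rows into typed records and CSV text using only the standard library. The function names parse_row and to_csv and the column names are my own choices for illustration; the two sample rows are copied from the output above.

```python
import csv
import io

FIELDS = ["county", "candidate", "percentage", "votes"]


def parse_row(county, name, percentage, popular):
    """Convert one scraped row of strings into a typed record."""
    return {
        "county": county,
        "candidate": name,
        "percentage": float(percentage.rstrip("%")),  # "73.4%" -> 73.4
        "votes": int(popular.replace(",", "")),       # "18,110" -> 18110
    }


def to_csv(records):
    """Serialize a list of records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Two rows taken from the output above.
rows = [
    ("Autauga County", "D. Trump", "73.4%", "18,110"),
    ("Autauga County", "H. Clinton", "24.0%", "5,908"),
]
records = [parse_row(*row) for row in rows]
print(to_csv(records))
```

In the scraping loop, the print call could be replaced by appending parse_row(title, name, percentage, popular) to a list. The same list of dicts can also be handed to pandas.DataFrame if pandas is available.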
