Question
I am having trouble scraping a wiki table and hope someone who has done this before can give me advice.
From List_of_current_heads_of_state_and_government I need the countries (that part works with the code below) and then only the first mention of the head of state plus their name. I am not sure how to isolate the first mention, as they all come in one cell, and my attempt to pull the names gives me this error: IndexError: list index out of range. Will appreciate your help!
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, 'lxml')
my_table = soup.find('table', {'class': 'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]
    states.append(state_cell.text)
print(states)

for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles)

for row in my_table.find_all('td'):
    name_cell = row.find_all('a')[1]
    names.append(name_cell.text)
print(names)
The desired output would be a pandas DataFrame:
State | Title | Name
Answer 1:
If I understand your question correctly, the following should get you there:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th', 'td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
        name = data[1].a.find_next_sibling().text
    except IndexError:
        continue  # skip rows without the expected cells instead of printing stale values
    print("{}|{}|{}".format(country, title, name))
Output:
Afghanistan|President|Ashraf Ghani
Albania|President|Ilir Meta
Algeria|President|Abdelaziz Bouteflika
Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
Angola|President|João Lourenço
Antigua and Barbuda|Queen|Elizabeth II
Argentina|President|Mauricio Macri
...and so on.
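Since the question asked for a pandas DataFrame rather than printed lines, the same row-by-row extraction can collect dicts and hand them to the DataFrame constructor. A minimal sketch, using a small inline HTML sample in place of the live Wikipedia page (the real page would be fetched with requests as above; `html.parser` stands in for `lxml` so the snippet has no extra dependency):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in markup mimicking the structure of the Wikipedia table;
# in practice, pass requests.get(URL).text instead.
SAMPLE = """
<table class="wikitable plainrowheaders">
<tr><th>State</th><th>Head of state and government</th></tr>
<tr><th><a>Albania</a></th><td><a>President</a> <a>Ilir Meta</a></td></tr>
<tr><th><a>Argentina</a></th><td><a>President</a> <a>Mauricio Macri</a></td></tr>
</table>
"""

def parse_table(html):
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('table', class_='wikitable').find_all('tr')[1:]:
        cells = tr.find_all(['th', 'td'])
        try:
            rows.append({'State': cells[0].a.text,
                         'Title': cells[1].a.text,
                         'Name': cells[1].a.find_next_sibling('a').text})
        except (IndexError, AttributeError):
            continue  # skip rows that do not match the expected layout
    return pd.DataFrame(rows, columns=['State', 'Title', 'Name'])

df = parse_table(SAMPLE)
print(df)
```

Catching AttributeError as well covers cells with no link at all, where `.a` is None.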
Answer 2:
I appreciate this is an old thread; however, if anybody else is looking to do the same thing, I found a super easy and short way to do it: import the wikipedia Python module and then use pandas' read_html to put the table into a DataFrame. From there you can apply any amount of analysis you wish.
Here is my code, which is called from the command line:
Simply call it with python yourfile.py -p Wikipedia_Page_Article_Here
import pandas as pd
import argparse
import wikipedia as wp

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--wiki_page", help="Give a wiki page to get table", required=True)
args = parser.parse_args()
html = wp.page(args.wiki_page).html().encode("UTF-8")
try:
    df = pd.read_html(html)[1]  # Try 2nd table first as most pages contain contents table first
except IndexError:
    df = pd.read_html(html)[0]
print(df.to_string())
Hope this helps someone out there!
Or, without the command-line arguments:
import pandas as pd
import wikipedia as wp

html = wp.page("List_of_video_games_considered_the_best").html().encode("UTF-8")
try:
    df = pd.read_html(html)[1]  # Try 2nd table first as most pages contain contents table first
except IndexError:
    df = pd.read_html(html)[0]
print(df.to_string())
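Instead of guessing the table index and catching IndexError, pandas' read_html also accepts a `match` parameter that keeps only tables containing the given text. A small self-contained sketch with two inline tables (the markup and the "Game" match string are illustrative assumptions, not from the answer above):

```python
import pandas as pd
from io import StringIO

# Two tables: a contents-style table first, then the one we actually want.
html = """
<table><tr><th>Contents</th></tr><tr><td>1 Introduction</td></tr></table>
<table><tr><th>Game</th><th>Year</th></tr>
<tr><td>Tetris</td><td>1984</td></tr></table>
"""

# `match` filters out tables whose text does not contain "Game",
# so no index guessing or try/except is needed.
df = pd.read_html(StringIO(html), match="Game")[0]
print(df)
```

The same `match` argument works on the HTML returned by wp.page(...).html().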
Answer 3:
It is not perfect, but it almost works like this.
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, 'lxml')
my_table = soup.find('table', {'class': 'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
""" for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]
    states.append(state_cell.text)
print(states)
for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles) """
for row in my_table.find_all('td'):
    try:
        names.append(row.find_all('a')[1].text)
    except IndexError:
        names.append(row.find_all('a')[0].text)
print(names)
As far as I can see, there is just one mistake in this names list. The table is a bit tricky because of the exceptions you must handle: for example, some names are not links, and the code then catches just the first link it finds in that row. You would need to write a few more if clauses for such cases; at least that is how I would do it.
Source: https://stackoverflow.com/questions/50355577/scraping-wikipedia-tables-with-python-selectively