Question
I am trying to use Python 3 to scrape a chart from this website into a .csv file: 2016 NBA National TV Schedule
The chart starts out like:
Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN
I am using these packages:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
In the .csv file, I want one row per game, with the date repeated on every row for games played on that day. For the first six lines of the chart above, that means each date appears on two rows. How do I implement the scraper to get this output?
Answer 1:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'

# The schedule is one <p> whose lines are separated by <br>; split it into lines.
data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')

# Group consecutive lines into date headings (v is True) and game lines (v is False).
dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        # Each game line looks like "8:00 PM Knicks/Cavaliers TNT".
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])

# Flatten to one row per game, repeating the date on every row.
all_data = {'Date':[], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
                      Date      Time    Team 1     Team 2  Network
0      Tuesday, October 25   8:00 PM    Knicks  Cavaliers      TNT
1      Tuesday, October 25  10:30 PM     Spurs   Warriors      TNT
2    Wednesday, October 26   8:00 PM   Thunder     Sixers     ESPN
3    Wednesday, October 26  10:30 PM   Rockets     Lakers     ESPN
4     Thursday, October 27   8:00 PM   Celtics      Bulls      TNT
..                     ...       ...       ...        ...      ...
159      Saturday, April 8   8:30 PM  Clippers      Spurs      ABC
160       Monday, April 10   8:00 PM   Wizards    Pistons      TNT
161       Monday, April 10  10:30 PM   Rockets   Clippers      TNT
162    Wednesday, April 12   8:00 PM     Hawks     Pacers     ESPN
163    Wednesday, April 12  10:30 PM  Pelicans    Blazers     ESPN

[164 rows x 5 columns]
It also saves the same table to data.csv, which opens in a spreadsheet program such as LibreOffice.
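To try the parsing step without hitting the website, here is a minimal offline sketch that runs on the sample lines from the question. It swaps the groupby approach for a plain loop that remembers the most recent date heading. It uses the same regular expression as above, so it assumes one-word team names. The names parse_schedule, DAYS and GAME_RE are made up for this illustration, and the CSV is written to an in-memory buffer rather than to a file.

```python
import csv
import io
import re

DAYS = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
# Same pattern as in the answer: time, team 1, team 2, network.
GAME_RE = re.compile(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)')


def parse_schedule(lines):
    """Return one (date, time, team1, team2, network) tuple per game line."""
    rows = []
    current_date = ''
    for line in lines:
        if any(day in line for day in DAYS):
            # A date heading: remember it for the game lines that follow.
            current_date = line
            continue
        match = GAME_RE.match(line)
        if match:  # skip anything that is neither a date nor a game
            rows.append((current_date, *match.groups()))
    return rows


sample = [
    'Tuesday, October 25',
    '8:00 PM Knicks/Cavaliers TNT',
    '10:30 PM Spurs/Warriors TNT',
    'Wednesday, October 26',
    '8:00 PM Thunder/Sixers ESPN',
    '10:30 PM Rockets/Lakers ESPN',
]

rows = parse_schedule(sample)

# Write the rows as CSV into a string buffer (use open('data.csv', 'w', newline='') for a real file).
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['Date', 'Time', 'Team 1', 'Team 2', 'Network'])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas, the same rows can be saved without the extra index column by passing index=False, for example pd.DataFrame(rows, columns=['Date', 'Time', 'Team 1', 'Team 2', 'Network']).to_csv('data.csv', index=False).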
Source: https://stackoverflow.com/questions/61942119/how-to-web-scrape-a-chart-by-using-python