Question
I have scraped a table from a Wikipedia page and I am going to clean the data next. I have transformed the data into a Pandas DataFrame, but now I have some problems cleaning the data.
Here is the code I executed to scrape the table from the Wikipedia page:
import requests
import pandas as pd
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
PostalCode = []
for row in My_table.findAll('tr')[1:]:
    PostalCode_cell = row.findAll('td')[0]
    PostalCode.append(PostalCode_cell.text)
print(PostalCode)

Borough = []
for row in My_table.findAll('tr')[1:]:
    Borough_cell = row.findAll('td')[1]
    Borough.append(Borough_cell.text)
print(Borough)

Neighbourhood = []
for row in My_table.findAll('tr')[1:]:
    Neighbourhood_cell = row.findAll('td')[2]
    Neighbourhood_cell.text.rstrip('\n')
    Neighbourhood.append(Neighbourhood_cell.text)
print(Neighbourhood)
canada=pd.DataFrame({'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighbourhood})
canada.rename(columns = {'PostalCode':'PostalCode','Borough':'Borough','Neighborhood':'Neighborhood'}, inplace = True)
canada
I have tried the groupby function, hoping to get the 2nd desired outcome, but it did not work out:
canada.groupby(['PostalCode', 'Borough'])
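For what it's worth, groupby on its own only builds a lazy GroupBy object; an aggregation such as joining the strings is what actually collapses the rows. A minimal sketch with illustrative values:

```python
import pandas as pd

# Toy rows standing in for the scraped table (values are illustrative).
canada = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M9V'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Etobicoke'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Albion Gardens'],
})

# groupby alone returns a GroupBy object and shows nothing useful;
# agg(', '.join) is what merges the neighbourhood strings per group.
grouped = (canada.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
                 .agg(', '.join))
print(grouped)
```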
I have tried to drop the "Not assigned" values from Borough:
canada=canada.Borough.drop("Not assigned",axis=0)
but it showed: "['Not assigned'] not found in axis"
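That error happens because drop removes index labels, not cell values; boolean indexing is the value-based filter. A small sketch:

```python
import pandas as pd

# drop() looks for 'Not assigned' among the *index labels* (0, 1, 2 here),
# which is why pandas reports "['Not assigned'] not found in axis".
s = pd.Series(['North York', 'Not assigned', 'Etobicoke'])

# Boolean indexing filters by value instead:
filtered = s[s != 'Not assigned']
print(list(filtered))
```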
Here are the expected results for my cleaned data:
1. Ignore cells with the value "Not assigned" in Borough.
2. Neighborhoods with the same PostalCode and Borough should appear on the same line, separated by commas.
3. If a cell has a Borough but a "Not assigned" Neighborhood, the Neighborhood should be the same as the Borough.
Also, I noticed that the table I scraped contains "\n" at the end of each value in Neighborhood. Is there any code I should add in the scraping process to get rid of it?
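One likely culprit: in the Neighbourhood loop above, Neighbourhood_cell.text.rstrip('\n') computes a stripped copy and throws it away, because Python strings are immutable. Keeping the return value is enough:

```python
# str.rstrip returns a new string rather than modifying in place.
cell_text = 'Parkwoods\n'
cell_text.rstrip('\n')            # return value discarded; cell_text unchanged
print(repr(cell_text))

cleaned = cell_text.rstrip('\n')  # keep the return value instead
print(repr(cleaned))
```

So in the scraping loop, Neighbourhood.append(Neighbourhood_cell.text.rstrip('\n')) would drop the trailing newline at the source; BeautifulSoup's get_text(strip=True) on the cell is another option.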
Many thanks for your help in advance.
Answer 1:
This feels a little long-winded.
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
canada = tables[0]
canada.columns = canada.iloc[0]
canada = canada.iloc[1:]
canada = canada[canada.Borough != 'Not assigned']
canada.loc[canada['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = canada['Borough']
canada['Location'] = canada.Borough + ', ' + canada.Neighbourhood
canada.drop(['Borough', 'Neighbourhood'], axis=1, inplace = True)
canada.reset_index(drop=True)
References:
https://stackoverflow.com/a/49161313/6241235
Edit:
I think @bubble's point about a case-insensitive search is a good one, where they suggest canada = canada[~canada.loc[:, 'Borough'].str.contains('Not assigned', case=False)] (note the ~, which keeps the rows that do not match), but I didn't think of that.
Source: https://stackoverflow.com/questions/55566117/i-have-some-problems-with-data-cleaning