Based on the list below, I have to create a DataFrame with "state" and "region" columns:
Original data:
Alabama[edit]
Auburn (Auburn Universi
You can find an example of cleaning this dataset in the tutorial Pythonic Data Cleaning With NumPy and Pandas.
You can use a single for-loop over the lines of the file and build the list in O(n) time:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # Remember this `state` until the next one is found
            state = line[:edit_pos]
        else:
            # Otherwise, we have a town; keep `state` as last seen
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line.rstrip('\n')
            university_towns.append((state, town))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
Alternatively, you can do the string processing with Pandas' .str
accessor:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next one is found
            state = line
        else:
            # Otherwise, we have a town; keep `state` as last seen
            university_towns.append((state, line))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
towns_df['State'] = towns_df.State.str.replace(r'\[edit\]\n', '', regex=True)
towns_df['RegionName'] = (towns_df.RegionName
                          .str.strip()
                          .str.replace(r' \(.*', '', regex=True)
                          .str.replace(r'\[.*', '', regex=True))
Output:
>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
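If you want to avoid the Python-level loop entirely, the same reconstruction can be done with vectorized string operations: mark the state lines, forward-fill the state down to the town rows, then drop the state rows. A minimal sketch, with a few sample lines hard-coded for illustration; in practice you would read them from the file as above:

```python
import pandas as pd

lines = pd.Series([
    'Alabama[edit]',
    'Auburn (Auburn University)',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)',
])

# State rows are the ones carrying the '[edit]' marker
is_state = lines.str.contains(r'\[edit\]', regex=True)

towns_df = pd.DataFrame({
    # Keep the state name only on state rows, then forward-fill
    # it onto the town rows that follow
    'State': (lines.str.replace(r'\[edit\]', '', regex=True)
                   .where(is_state)
                   .ffill()),
    # Strip the parenthesized university list from town rows
    'RegionName': lines.str.replace(r' \(.*', '', regex=True),
})[~is_state].reset_index(drop=True)
```

This gives the same State/RegionName pairs as the loop versions; whether it is clearer is a matter of taste, but it keeps all the work inside pandas.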