Based on the list below, I have to create a DataFrame with "state" and "region" columns:
Original data:
Alabama[edit]
Auburn (Auburn Universi
Shortest version I could think of:
import pandas as pd

lst = []
with open('university_towns.txt', 'r', newline='\n') as infile:
    for line in infile:
        if '[edit]' in line:
            # State line, e.g. "Alabama[edit]"
            state = line.split('[')[0]
        else:
            # Town line: keep the text before " (" so multi-word town names survive
            lst.append([state, line.split(' (')[0].strip()])
df = pd.DataFrame(lst, columns=['State', 'RegionName'])
print(df)
Produces on my machine (Python 3.6):
      State    RegionName
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
You can find an example of cleaning this dataset in the tutorial Pythonic Data Cleaning With NumPy and Pandas.
You can make a single pass over the lines of the file with a for-loop, remembering the last-seen state, which loads the data in O(n) time:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        line = line.strip()  # drop the trailing newline
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # Remember this `state` until the next is found
            state = line[:edit_pos]
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line
            university_towns.append((state, town))
towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
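To try the loop above without the file on disk, here is a minimal sketch that wraps the same logic in a function and feeds it a few in-memory lines. The function name `parse_university_towns` and the sample lines (including the "Example University" names) are made up for illustration; only the `[edit]` and ` (` conventions come from the data in the question.

```python
import pandas as pd


def parse_university_towns(lines):
    """Pair each town line with the most recent state line (hypothetical helper)."""
    rows = []
    state = None  # last-seen state; stays None until the first "[edit]" line
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            state = line[:edit_pos]
        else:
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line
            rows.append((state, town))
    return pd.DataFrame(rows, columns=['State', 'RegionName'])


# Made-up sample in the same layout as university_towns.txt
sample = [
    'Alabama[edit]\n',
    'Auburn (Example University)[1]\n',
    'Alaska[edit]\n',
    'Fairbanks (Example University)[2]\n',
]
sample_df = parse_university_towns(sample)
print(sample_df)
```

Because the function accepts any iterable of strings, an open file object can be passed in directly as well.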
Alternatively, you can do the string processing with Pandas' .str accessor:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))
towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
# regex=True is explicit: pandas 2.0+ treats the pattern as a literal by default
towns_df['State'] = towns_df.State.str.replace(r'\[edit\]\n', '', regex=True)
towns_df['RegionName'] = towns_df.RegionName\
    .str.strip()\
    .str.replace(r' \(.*', '', regex=True)\
    .str.replace(r'\[.*', '', regex=True)
Output:
>>> towns_df.head()
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
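If you would rather avoid the Python-level loop altogether, one possible variant (not part of the answer above, just a sketch) marks the state rows with a boolean mask and forward-fills the state name down to the town rows. The sample lines and the names `raw`, `is_state` and `vectorized_df` are made up for illustration; in practice `raw` would be built from the lines of the file.

```python
import pandas as pd

# Made-up sample lines in the same layout as university_towns.txt
raw = pd.Series([
    'Alabama[edit]',
    'Auburn (Example University)[1]',
    'Florence (Example College)',
    'Alaska[edit]',
    'Fairbanks (Example University)[2]',
])

# True on state rows such as "Alabama[edit]"
is_state = raw.str.contains('[edit]', regex=False)

# Keep state names on state rows only (NaN elsewhere), then forward-fill
state = raw.where(is_state).str.replace('[edit]', '', regex=False).ffill()

# Drop everything from the first "(" or "[" onward, plus surrounding whitespace
town = raw.str.replace(r'\s*[\(\[].*', '', regex=True).str.strip()

vectorized_df = pd.DataFrame({'State': state, 'RegionName': town})
# Discard the state rows themselves, leaving one row per town
vectorized_df = vectorized_df[~is_state].reset_index(drop=True)
print(vectorized_df)
```

This keeps all string handling inside `.str` methods, at the cost of being a little less obvious to read than the explicit loop.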
If I understand your question and desired output correctly, you could do something like this:
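import pandas as pd

universitylist = []
with open('university_towns.txt', 'r') as file:
    for line in file:
        row = line.strip()  # drop the trailing newline
        if '[edit]' in row:
            # State line: remember it for the town lines that follow
            state = row
        else:
            universitylist.append([state, row])
df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])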
If you don't want the '[edit]' and '[1]' parts etc., then you could change the code to:
import pandas as pd

universitylist = []
with open('university_towns.txt', 'r') as file:
    for line in file:
        # Cut each line at the first '[' to drop "[edit]", "[1]", etc.
        row = line.split('[')[0].strip()
        if '[edit]' in line:
            state = row
        else:
            universitylist.append([state, row])
df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])