Python for loop with if/else and append function

别那么骄傲 2021-01-21 18:26

From the list below, I need to create a DataFrame with "state" and "region" columns:

Original data:

 Alabama[edit]
 Auburn (Auburn Universi         


        
3 Answers
  •  心在旅途
    2021-01-21 18:56

    You can find an example of cleaning this dataset in the tutorial Pythonic Data Cleaning With NumPy and Pandas.

    Option 1: Do String Processing in "Pure Python"

    You can make a single pass over the lines of the file, remembering the most recently seen state, and build the rows in O(n) time:

    import pandas as pd
    
    university_towns = []
    
    with open('input/university_towns.txt') as file:
        for line in file:
            # Drop the trailing newline so it does not end up in the town name
            line = line.strip()
            if not line:
                continue
            edit_pos = line.find('[edit]')
            if edit_pos != -1:
                # Remember this `state` until the next is found
                state = line[:edit_pos]
            else:
                # Otherwise, we have a city; keep `state` as last-seen
                parens = line.find(' (')
                town = line[:parens] if parens != -1 else line
                university_towns.append((state, town))
    
    towns_df = pd.DataFrame(university_towns,
                            columns=['State', 'RegionName'])
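
    To try the loop above without the input file, here is a minimal sketch that wraps the same logic in a function and feeds it a few in-memory lines. The function name `parse_university_towns` and the sample lines are made up for illustration; they are not taken from the real dataset.

```python
import io

import pandas as pd


def parse_university_towns(lines):
    """Return (state, town) pairs from an iterable of raw text lines."""
    rows = []
    state = None  # stays None until the first "[edit]" header is seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # A state header: remember it for the towns that follow
            state = line[:edit_pos]
        else:
            # A town line: cut everything from the first " (" onwards
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line
            rows.append((state, town))
    return rows


# Hypothetical sample in the same shape as the original file
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Example University)[1]\n"
    "Florence (Example College)\n"
    "Alaska[edit]\n"
    "Fairbanks\n"
)

towns_df = pd.DataFrame(parse_university_towns(sample),
                        columns=['State', 'RegionName'])
print(towns_df)
```

    Because the function accepts any iterable of strings, the same call should also work with an open file object.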
    

    Option 2: Do String Processing via Pandas API

    Alternatively, you can do the string processing with Pandas' .str accessor:

    import pandas as pd
    
    university_towns = []
    
    with open('input/university_towns.txt') as file:
        for line in file:
            if '[edit]' in line:
                # Remember this `state` until the next is found
                state = line
            else:
                # Otherwise, we have a city; keep `state` as last-seen
                university_towns.append((state, line))
    
    towns_df = pd.DataFrame(university_towns,
                            columns=['State', 'RegionName'])
    
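    # regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
    towns_df['State'] = towns_df.State.str.replace(r'\[edit\]\n', '', regex=True)
    towns_df['RegionName'] = towns_df.RegionName\
        .str.strip()\
        .str.replace(r' \(.*', '', regex=True)\
        .str.replace(r'\[.*', '', regex=True)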
    

    Output:

    >>> towns_df.head()
         State    RegionName
    0  Alabama        Auburn
    1  Alabama      Florence
    2  Alabama  Jacksonville
    3  Alabama    Livingston
    4  Alabama    Montevallo
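
    If you would rather avoid the Python-level loop entirely, one possible variation (not part of the two options above) is to mark the state rows with a boolean mask and forward-fill the state name with `ffill`. This is only a sketch under the assumption that every state line contains the literal text `[edit]`; the sample lines are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; in practice the lines would come from the input file
lines = pd.Series([
    "Alabama[edit]",
    "Auburn (Example University)[1]",
    "Florence (Example College)",
    "Alaska[edit]",
    "Fairbanks",
])

# True for state header rows, False for town rows
is_state = lines.str.contains('[edit]', regex=False)

# Keep the text only on state rows, strip the marker, then carry each
# state name forward onto the town rows below it
state = (lines.where(is_state)
              .str.replace('[edit]', '', regex=False)
              .ffill())

# Drop everything from the first "(" or "[" onwards to get the town name
town = lines.str.replace(r'\s*[\(\[].*', '', regex=True)

towns_df = (pd.DataFrame({'State': state, 'RegionName': town})[~is_state]
              .reset_index(drop=True))
print(towns_df)
```

    The mask does the job of the if/else branch, and `ffill` plays the role of the "last-seen state" variable from the loop versions.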
    
