Create Pandas DataFrame from txt file with specific pattern

北海茫月 2020-11-22 09:04

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Aubu         


        
6 Answers
  • 2020-11-22 09:16

    Assuming you have the following DF:

    In [73]: df
    Out[73]:
                                                     text
    0                                       Alabama[edit]
    1                       Auburn (Auburn University)[1]
    2              Florence (University of North Alabama)
    3     Jacksonville (Jacksonville State University)[2]
    4          Livingston (University of West Alabama)[2]
    5            Montevallo (University of Montevallo)[2]
    6                           Troy (Troy University)[2]
    7   Tuscaloosa (University of Alabama, Stillman Co...
    8                   Tuskegee (Tuskegee University)[5]
    9                                        Alaska[edit]
    10      Fairbanks (University of Alaska Fairbanks)[2]
    11                                      Arizona[edit]
    12         Flagstaff (Northern Arizona University)[6]
    13                   Tempe (Arizona State University)
    14                     Tucson (University of Arizona)
    15                                     Arkansas[edit]
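
    A frame like the one above can be built from the raw text. The following is only a sketch: the shortened sample string `raw` is a hypothetical stand-in for the question's file, and it assumes one entry per line. With a real file you would read its lines instead of calling splitlines on a string.

```python
import pandas as pd

# Shortened sample in the same layout as the question's file (assumption:
# one entry per line, state headers end with "[edit]").
raw = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]"""

# One row per line, in a single column called "text".
df = pd.DataFrame({"text": raw.splitlines()})
print(df)
```

    Building the frame from a list of lines avoids having to choose a CSV separator that never occurs in the data.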
    

    You can use the Series.str.extract() method:

    In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
    
    In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
    
    In [120]: df.State = df.State.ffill()
    
    In [121]: df
    Out[121]:
                                                     text     State   Region Name
    0                                       Alabama[edit]   Alabama           NaN
    1                       Auburn (Auburn University)[1]   Alabama        Auburn
    2              Florence (University of North Alabama)   Alabama      Florence
    3     Jacksonville (Jacksonville State University)[2]   Alabama  Jacksonville
    4          Livingston (University of West Alabama)[2]   Alabama    Livingston
    5            Montevallo (University of Montevallo)[2]   Alabama    Montevallo
    6                           Troy (Troy University)[2]   Alabama          Troy
    7   Tuscaloosa (University of Alabama, Stillman Co...   Alabama    Tuscaloosa
    8                   Tuskegee (Tuskegee University)[5]   Alabama      Tuskegee
    9                                        Alaska[edit]    Alaska           NaN
    10      Fairbanks (University of Alaska Fairbanks)[2]    Alaska     Fairbanks
    11                                      Arizona[edit]   Arizona           NaN
    12         Flagstaff (Northern Arizona University)[6]   Arizona     Flagstaff
    13                   Tempe (Arizona State University)   Arizona         Tempe
    14                     Tucson (University of Arizona)   Arizona        Tucson
    15                                     Arkansas[edit]  Arkansas           NaN
    
    In [122]: df = df.dropna()
    
    In [123]: df
    Out[123]:
                                                     text    State   Region Name
    1                       Auburn (Auburn University)[1]  Alabama        Auburn
    2              Florence (University of North Alabama)  Alabama      Florence
    3     Jacksonville (Jacksonville State University)[2]  Alabama  Jacksonville
    4          Livingston (University of West Alabama)[2]  Alabama    Livingston
    5            Montevallo (University of Montevallo)[2]  Alabama    Montevallo
    6                           Troy (Troy University)[2]  Alabama          Troy
    7   Tuscaloosa (University of Alabama, Stillman Co...  Alabama    Tuscaloosa
    8                   Tuskegee (Tuskegee University)[5]  Alabama      Tuskegee
    10      Fairbanks (University of Alaska Fairbanks)[2]   Alaska     Fairbanks
    12         Flagstaff (Northern Arizona University)[6]  Arizona     Flagstaff
    13                   Tempe (Arizona State University)  Arizona         Tempe
    14                     Tucson (University of Arizona)  Arizona        Tucson
    
  • 2020-11-22 09:17

    TL;DR
    s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]


    regex = r'(?P<State>.*?)\[edit\]'  # pattern to match (raw string, so "\[" is not treated as a string escape)
    print(s.groupby(
        # will get nulls where we don't have "[edit]"
        # forward fill fills in the most recent line
        # where we did have an "[edit]"
        s.str.extract(regex, expand=False).ffill()  
    ).apply(
        # I still have all the original values
        # If I group by the forward filled rows
        # I'll want to drop the first one within each group
        pd.Series.tail, n=-1
    ).reset_index(
        # munge the dataframe to get columns sorted
        name='Region_Name'
    )[['State', 'Region_Name']])
    
          State                                        Region_Name
    0   Alabama                      Auburn (Auburn University)[1]
    1   Alabama             Florence (University of North Alabama)
    2   Alabama    Jacksonville (Jacksonville State University)[2]
    3   Alabama         Livingston (University of West Alabama)[2]
    4   Alabama           Montevallo (University of Montevallo)[2]
    5   Alabama                          Troy (Troy University)[2]
    6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
    7   Alabama                  Tuskegee (Tuskegee University)[5]
    8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
    9   Arizona         Flagstaff (Northern Arizona University)[6]
    10  Arizona                   Tempe (Arizona State University)
    11  Arizona                     Tucson (University of Arizona)
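
    The step that drops each state's header row relies on tail with a negative n, which returns every row except the first |n|. A minimal illustration on a small hypothetical Series (the sample values are made up for the demo, not taken from the file):

```python
import pandas as pd

# Hypothetical mini-sample: two state headers followed by their towns.
s = pd.Series(["Alabama[edit]", "Auburn", "Florence", "Alaska[edit]", "Fairbanks"])

# tail with a negative n keeps everything except the first |n| rows.
print(s.tail(n=-1).tolist())

# Forward-filled state names serve as the group key; inside each group the
# first row is the header itself, so tail(n=-1) removes it.
key = s.str.extract(r"(?P<State>.*?)\[edit\]", expand=False).ffill()
towns = s.groupby(key).apply(pd.Series.tail, n=-1)
print(towns.tolist())
```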
    

    setup

    txt = """Alabama[edit]
    Auburn (Auburn University)[1]
    Florence (University of North Alabama)
    Jacksonville (Jacksonville State University)[2]
    Livingston (University of West Alabama)[2]
    Montevallo (University of Montevallo)[2]
    Troy (Troy University)[2]
    Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
    Tuskegee (Tuskegee University)[5]
    Alaska[edit]
    Fairbanks (University of Alaska Fairbanks)[2]
    Arizona[edit]
    Flagstaff (Northern Arizona University)[6]
    Tempe (Arizona State University)
    Tucson (University of Arizona)
    Arkansas[edit]"""
    
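    from io import StringIO
    import pandas as pd

    # read_csv's squeeze=True argument was removed in pandas 2.0; squeeze the single column afterwards
    s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')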
    
  • 2020-11-22 09:18

    You will probably need to perform some additional manipulation on the file before getting it into a dataframe.

    A starting point would be to split the file into lines, search each line for the string [edit], and, when it is present, use the name in front of it as the key of a dictionary...

    I do not think that Pandas has any built in methods that would handle a file in this format.
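
    The idea above could look roughly like the following. This is a sketch under assumptions, not a definitive implementation: the names `sample_lines` and `towns_by_state` are hypothetical, the sample stands in for the lines of the real file, and each state header is assumed to end with "[edit]".

```python
import pandas as pd

# Hypothetical sample standing in for the lines of the real file.
sample_lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
    "Arkansas[edit]",
]

towns_by_state = {}
current_state = None
for line in sample_lines:
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    if line.endswith("[edit]"):
        # A state header opens a new dictionary key.
        current_state = line[: -len("[edit]")]
        towns_by_state[current_state] = []
    elif current_state is not None:
        # Any other line belongs to the most recent state.
        towns_by_state[current_state].append(line)

# Flatten the dictionary into (state, town) rows.
pairs = [(st, town) for st, towns in towns_by_state.items() for town in towns]
result = pd.DataFrame(pairs, columns=["State", "Region Name"])
print(result)
```

    States without any towns, such as the last header in the sample, stay in the dictionary with an empty list but produce no rows in the frame.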

  • 2020-11-22 09:24

    You seem to be from Coursera's Introduction to Data Science course. I passed my test with this solution. I would advise against copying the whole solution; use it for reference only :)

    import pandas as pd

    rows = []
    state = None
    with open('university_towns.txt') as f:
        for line in f:
            line = line.rstrip('\n')
            if '[edit]' in line:
                # state header: keep the text before the "[edit]" marker
                state = line[:line.find('[edit]')]
            elif '(' in line:
                # town line: keep the text before the first "(", minus the space
                rows.append([state, line[:line.find('(')].rstrip()])
            elif line:
                # town line without parentheses: keep it whole
                rows.append([state, line])
    df = pd.DataFrame(rows, columns=["State", "RegionName"])
    
  • 2020-11-22 09:27

    You could parse the file into tuples first:

    import pandas as pd
    from collections import namedtuple
    
    Item = namedtuple('Item', 'state area')
    items = []
    
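    with open('unis.txt') as f:
        for line in f:
            l = line.rstrip('\n')
            if l.endswith('[edit]'):
                # slice the marker off; rstrip('[edit]') strips a set of characters
                # and would also eat trailing e/d/i/t letters of the state name
                state = l[:-len('[edit]')]
            else:
                # keep the text before " ("; partition also copes with lines
                # that contain no parenthesis
                area = l.partition(' (')[0]
                items.append(Item(state, area))

    df = pd.DataFrame.from_records(items, columns=['State', 'Area'])

    print(df)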
    

    output:

          State          Area
    0   Alabama        Auburn
    1   Alabama      Florence
    2   Alabama  Jacksonville
    3   Alabama    Livingston
    4   Alabama    Montevallo
    5   Alabama          Troy
    6   Alabama    Tuscaloosa
    7   Alabama      Tuskegee
    8    Alaska     Fairbanks
    9   Arizona     Flagstaff
    10  Arizona         Tempe
    11  Arizona        Tucson
    
  • 2020-11-22 09:30

    First, use read_csv with the names parameter to create a DataFrame with a single column, Region Name. As the separator, choose a character that does NOT occur in the data (such as ;):

    df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
    

    Then insert a new column State by extracting the text in front of [edit] from the header rows and forward-filling it, and in column Region Name remove everything from ( to the end of the string.

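    df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
    # regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
    df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)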
    

    Finally, remove the rows that contain [edit] by boolean indexing; the mask is created with str.contains:

    df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
    print (df)
          State   Region Name
    0   Alabama        Auburn
    1   Alabama      Florence
    2   Alabama  Jacksonville
    3   Alabama    Livingston
    4   Alabama    Montevallo
    5   Alabama          Troy
    6   Alabama    Tuscaloosa
    7   Alabama      Tuskegee
    8    Alaska     Fairbanks
    9   Arizona     Flagstaff
    10  Arizona         Tempe
    11  Arizona        Tucson
    

    If you need the full values, the solution is simpler:

    df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
    df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
    df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
    print (df)
          State                                        Region Name
    0   Alabama                      Auburn (Auburn University)[1]
    1   Alabama             Florence (University of North Alabama)
    2   Alabama    Jacksonville (Jacksonville State University)[2]
    3   Alabama         Livingston (University of West Alabama)[2]
    4   Alabama           Montevallo (University of Montevallo)[2]
    5   Alabama                          Troy (Troy University)[2]
    6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
    7   Alabama                  Tuskegee (Tuskegee University)[5]
    8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
    9   Arizona         Flagstaff (Northern Arizona University)[6]
    10  Arizona                   Tempe (Arizona State University)
    11  Arizona                     Tucson (University of Arizona)
    