Using defined strings for regex searching with python

执念已碎 2021-01-27 20:47

I am looking to enhance the script I have below. I am wondering if it is possible to use defined strings such as 'G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', '

1 answer
  • 2021-01-27 20:54

    We can update your regular expression so that a capitalised word is not matched when it directly follows another capitalised word.

    r"(?<![A-Z] )\b([A-Z]+) "
    

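    Note the added negative lookbehind, (?<![A-Z] ). A capitalised word is not matched if it is directly preceded by a capital letter and a space, so an all-caps first name such as "RJ" that follows a position label such as "PG" is skipped.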
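
To see the effect of the lookbehind in isolation, here is a minimal sketch that runs the pattern with and without it on a shortened lineup string. The sample string is cut down from the data used in the script; the variable names are only illustrative.

```python
import re

lineup = "G Coby White PG RJ Barrett C Larry Nance Jr."

# Without the lookbehind, the all-caps first name "RJ" is mistaken for a position.
without_lookbehind = re.findall(r"\b([A-Z]+) ", lineup)

# With the lookbehind, a capitalised word is skipped when it directly follows
# a capital letter plus a space (here "RJ" follows "PG ").
with_lookbehind = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", lineup)

print(without_lookbehind)  # ['G', 'PG', 'RJ', 'C']
print(with_lookbehind)     # ['G', 'PG', 'C']
```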

    A more in-depth explanation of the regex above is available here: https://regex101.com/r/j6RbSP/1

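    You can now update your code to use the new regex pattern. Remember to put the r prefix in front of the pattern string (a raw string), so that \b reaches the regex engine as a word boundary instead of being read by Python as a backspace character.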
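
As a quick illustration of why the r prefix matters, the sketch below runs the same pattern text once as a normal string and once as a raw string. The sample text and variable names are made up for the demonstration.

```python
import re

text = "PG RJ Barrett"

# In a normal string literal "\b" is the backspace character (\x08),
# so the regex engine never sees a word-boundary token.
plain = re.findall("\b([A-Z]+) ", text)

# With the r prefix the backslash reaches the regex engine untouched.
raw = re.findall(r"\b([A-Z]+) ", text)

print(plain)  # []
print(raw)    # ['PG', 'RJ']
```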

    import pandas as pd, numpy as np
    import re
    
    dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
    dk_cont_lineup_df.rename(columns={ dk_cont_lineup_df.columns[0]: 'Lineup' }, inplace = True)
    
    
    def calc_col(col):
        '''Take a lineup string,
        find the upper-case words used as delimiters,
        collect them in a list,
        and add a number to the list elements that recur.
        E.g. input list: ['W','W','W','D','D','G','C','C','UTIL']
        output list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
        '''
        col_list = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", col)
        col_list2 = []
        for i_pos in col_list:
            cnt = col_list.count(i_pos)
            if cnt == 1:
                col_list2.append(i_pos)
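            if cnt > 1:
                # Number a recurring label only once, on its first occurrence.
                # Checking for the exact element avoids a substring clash
                # (e.g. 'G' being found inside 'SG1').
                if i_pos + "1" in col_list2:
                    continue
                col_list2 += [i_pos + str(k) for k in range(1, cnt + 1)]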
        return col_list2
    
    
    # Replace each position label with a newline so that every row can later be split on "\n"
    extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace=r"(?<![A-Z] )\b([A-Z]+) ", value="\n", regex=True)
    df_final = pd.DataFrame(columns = sorted(calc_col(dk_cont_lineup_df['Lineup'].iloc[0])))
    
    for i_pos in range(len(extr_row)):  # traverse all rows of the original dataframe and add the formatted rows to df_final
        df_temp = pd.DataFrame((extr_row.values[i_pos].split("\n")[1:])).T
        df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
        df_temp = df_temp[sorted(df_temp)]
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df_final = pd.concat([df_final, df_temp])
    df_final.reset_index(drop = True, inplace = True)
    
    print(df_final.to_string())
    

    Produces the desired output:

                     C                  F             G                 PF              PG                 SF                 SG             UTIL
    0      Maxi Kleber   Larry Nance Jr.   CJ McCollum   Robert Covington   Collin Sexton   Bojan Bogdanovic   Donovan Mitchell       Trey Lyles 
    1  Larry Nance Jr.  Robert Covington    Coby White         Kevin Love      RJ Barrett   Bojan Bogdanovic      Collin Sexton   Nikola Vucevic 
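
Since the question asks about using defined strings, here is a rough sketch of an alternative that builds the pattern from an explicit list of position labels instead of matching any run of capitals. The list contents (in particular 'C', taken from the sample data), the names POSITIONS and POSITION_RE, and the helper split_lineup are assumptions made for illustration and are not part of the original script. Because only listed labels can match, this version does not need the lookbehind for names like "RJ", but a player name that is itself a listed label followed by a space would still be misread.

```python
import re

# Assumed list of position labels; adjust it to match your data.
POSITIONS = ['G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', 'C']

# re.escape guards against labels that contain regex metacharacters.
alternation = "|".join(re.escape(p) for p in POSITIONS)
POSITION_RE = re.compile(rf"\b({alternation}) ")


def split_lineup(lineup):
    """Return a list of (position, player) pairs for one lineup string."""
    # re.split with one capture group yields ['', label, name, label, name, ...]
    parts = POSITION_RE.split(lineup)
    labels, names = parts[1::2], parts[2::2]
    return [(label, name.strip()) for label, name in zip(labels, names)]


row = ("UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton "
       "SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.")
pairs = split_lineup(row)
print(pairs[0])            # ('UTIL', 'Nikola Vucevic')
print(dict(pairs)["PG"])   # RJ Barrett
```

Returning pairs rather than a dict keeps recurring labels intact; they could then be numbered in the same way calc_col does.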
    