I am looking to enhance the script I have below. I am wondering if it is possible to use defined strings such as 'G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', '
We can update your regular expression so that a capitalised word is not matched when it directly follows another capitalised word.
r"(?<![A-Z] )\b([A-Z]+) "
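Note that we have added a negative lookbehind, (?<![A-Z] ), so the pattern does not match when the text immediately before it is a capital letter followed by a space. This stops initials such as RJ in "PG RJ Barrett" from being treated as a position.
You can find a more in-depth explanation of the above regex here: https://regex101.com/r/j6RbSP/1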
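To see the effect of the lookbehind, here is a small sketch that runs the pattern with and without it on a fragment of the lineup text. The variable names are only for illustration.

```python
import re

# Fragment of the lineup text used in this answer.
text = "SG Collin Sexton PG RJ Barrett C Larry Nance Jr."

# Without the lookbehind, the initials "RJ" are picked up as a position.
without_lookbehind = re.findall(r"\b([A-Z]+) ", text)

# With the lookbehind, "RJ" is skipped because "G " comes right before it.
with_lookbehind = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", text)

print(without_lookbehind)  # ['SG', 'PG', 'RJ', 'C']
print(with_lookbehind)     # ['SG', 'PG', 'C']
```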
You can now update your code to use the new regex pattern. Remember to add the r prefix in front of the string, so that Python treats it as a raw string and leaves backslash sequences such as \b for the regex engine.
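As a quick illustration of why the r prefix matters, the sketch below compares the same pattern written as a normal string and as a raw string. The sample text is a fragment of the lineup data.

```python
import re

# In a normal string, "\b" is the backspace character (a single character).
plain = "\b([A-Z]+) "
# In a raw string the backslash is kept, so the regex engine sees the \b word boundary.
raw = r"\b([A-Z]+) "

text = "PF Kevin Love"
print(len("\b"), len(r"\b"))    # 1 2
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['PF']
```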
import pandas as pd, numpy as np
import re

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={dk_cont_lineup_df.columns[0]: 'Lineup'}, inplace=True)

def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimiter,
    converts it to a list,
    adds a number to the list elements if recurring.
    Eg. input list :['W','W','W','D','D','G','C','C','UTIL']
    o/p list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        if cnt > 1:
            # skip positions that are already numbered; compare whole labels,
            # because a substring test would find 'G' inside 'SG'
            if i_pos + '1' in col_list2:
                continue
            col_list2 += [i_pos + str(k) for k in range(1, cnt + 1)]
    return col_list2

# replace every position label with "\n" so each row can be split into player names
extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace=r"(?<![A-Z] )\b([A-Z]+) ", value="\n", regex=True)
rows = []
for i_pos in range(len(extr_row)):  # traverse all the rows in the original dataframe and collect the formatted rows
    df_temp = pd.DataFrame((extr_row.values[i_pos].split("\n")[1:])).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    df_temp = df_temp[sorted(df_temp)]
    rows.append(df_temp)
# DataFrame.append was removed in pandas 2.0, so the rows are combined with pd.concat
df_final = pd.concat(rows, ignore_index=True)
print(df_final.to_string())
Produces the desired output:
   C                F                 G            PF                PG             SF                SG                UTIL
0  Maxi Kleber      Larry Nance Jr.   CJ McCollum  Robert Covington  Collin Sexton  Bojan Bogdanovic  Donovan Mitchell  Trey Lyles
1  Larry Nance Jr.  Robert Covington  Coby White   Kevin Love        RJ Barrett     Bojan Bogdanovic  Collin Sexton     Nikola Vucevic