Question
Hi, I have a DataFrame that follows this format:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 'Apples 20pk ABC123', 4, 5],
                            [6, 7, 'Oranges 40pk XYZ123', 9, 0],
                            [5, 6, 'Bananas 20pk ABC123', 8, 9]]),
                  columns=['Serial #', 'Branch ID', 'Info', 'Value1', 'Value2'])
  Serial # Branch ID                 Info Value1 Value2
0        1         2   Apples 20pk ABC123      4      5
1        6         7  Oranges 40pk XYZ123      9      0
2        5         6  Bananas 20pk ABC123      8      9
I want to split the values in the "Info" column right after the "pk" substring. Essentially, I want to replace "Info" with two new columns, as in the DataFrame below:
  Serial # Branch ID       Package  Branch Value1 Value2
0        1         2   Apples 20pk  ABC123      4      5
1        6         7  Oranges 40pk  XYZ123      9      0
2        5         6  Bananas 20pk  ABC123      8      9
I tried using:
info = df["Info"].str.split("pk ", n=1, expand=True)
df['Package'] = info[0]
df['Branch'] = info[1]
del df['Info']
but the 'Package' column then contains only "Apples 20" instead of "Apples 20pk".
I also tried splitting on a space (" "), but then I get three values ('Apples', '20pk', 'ABC123').
Because there are n rows (not just 3), what is the most efficient way to go about this? Thanks!
Answer 1:
We can use a regular expression with a positive lookbehind here. In this case we split on a whitespace character (\s) that is preceded (?<=) by the string pk:
df['Info'].str.split(r'(?<=pk)\s', expand=True)
              0       1
0   Apples 20pk  ABC123
1  Oranges 40pk  XYZ123
2  Bananas 20pk  ABC123
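To see why the lookbehind keeps "pk", it may help to compare the two patterns on a single string with Python's built-in re module. This is only an illustrative sketch; the variable names are made up for the example.

```python
import re

s = 'Apples 20pk ABC123'

# Splitting on "pk" plus whitespace consumes the whole delimiter,
# so "pk" disappears from the first part.
plain = re.split(r'pk\s', s)

# The lookbehind only asserts that "pk" comes right before the
# whitespace; only the whitespace is consumed, so "pk" is kept.
lookbehind = re.split(r'(?<=pk)\s', s)

print(plain)       # ['Apples 20', 'ABC123']
print(lookbehind)  # ['Apples 20pk', 'ABC123']
```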
To get your expected output, we create the two columns in one go and drop Info afterwards:
df[['Package', 'Branch']] = df['Info'].str.split(r'(?<=pk)\s', expand=True)
df.drop('Info', axis=1, inplace=True)
  Serial # Branch ID Value1 Value2       Package  Branch
0        1         2      4      5   Apples 20pk  ABC123
1        6         7      9      0  Oranges 40pk  XYZ123
2        5         6      8      9  Bananas 20pk  ABC123
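As a possible alternative to splitting (not part of the answer above), str.extract with named groups might also work here. This is a sketch that assumes every value looks like "<something>pk <branch>"; the Series name info and the group names are chosen just for this example.

```python
import pandas as pd

info = pd.Series(['Apples 20pk ABC123',
                  'Oranges 40pk XYZ123',
                  'Bananas 20pk ABC123'])

# Each named group becomes a column of the result.
# Rows that do not match the pattern get NaN in both columns.
parts = info.str.extract(r'^(?P<Package>.*?pk)\s+(?P<Branch>.*)$')
print(parts)
```

One difference from split is that non-matching rows show up as NaN in both columns instead of being left unsplit in the first column, which can make malformed values easier to spot.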
Answer 2:
Could you append pk to the column afterward?
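A minimal sketch of what that could look like, assuming every value contains "pk " exactly where the split should happen (the names info, package and branch are just for illustration):

```python
import pandas as pd

info = pd.Series(['Apples 20pk ABC123',
                  'Oranges 40pk XYZ123',
                  'Bananas 20pk ABC123'])

# Split once on "pk "; the delimiter is dropped from both parts.
parts = info.str.split('pk ', n=1, expand=True)

package = parts[0] + 'pk'  # put the dropped "pk" back
branch = parts[1]

print(package.tolist())  # ['Apples 20pk', 'Oranges 40pk', 'Bananas 20pk']
print(branch.tolist())   # ['ABC123', 'XYZ123', 'ABC123']
```

Note that a value without "pk " would still get "pk" appended here, so this only seems safe when the format is guaranteed.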
Source: https://stackoverflow.com/questions/56961327/splitting-columns-values-in-pandas-by-delimiter-without-losing-delimiter