I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.
There's no need for a loop here. Looping over a dataframe is slow, and pandas and numpy have optimized methods for almost all problems of this kind.
In this case, for your first problem, you are looking for Series.str.extract:
dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")
           sentenceCol  other column country
0  this is from france            15  france
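One caveat with building the pattern by plain joining: if a country name contains a regex metacharacter (a dot or parenthesis, say), it is interpreted as regex syntax. A minimal sketch of a more defensive variant, assuming the same column names as above; the second sample row is made up to show what happens when nothing matches:

```python
import re

import pandas as pd

# Sample data; the second sentence is a hypothetical row with no country in it.
dfa = pd.DataFrame({'sentenceCol': ['this is from france', 'nothing to see here'],
                    'other column': [15, 20]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

# Escape each name so it is matched literally, then build one alternation group.
pattern = '(' + '|'.join(map(re.escape, dfb['country'])) + ')'

# expand=False returns a Series instead of a one-column DataFrame.
# Rows without any match get NaN.
dfa['country'] = dfa['sentenceCol'].str.extract(pattern, expand=False)
print(dfa)
```

Note that str.extract only returns the first match per row, which is why the second problem needs extractall.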
For your second problem, you need Series.str.extractall (which returns a DataFrame with one row per match) with DataFrame.drop_duplicates and to_numpy:
dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
    .drop_duplicates()
    .to_numpy()
)
                     sentenceCol  other column country
0  this is from france and spain            15  france
1  this is from france and spain            15   spain
Edit
Or, if your sentenceCol is not duplicated, we have to collect the extracted values into a single row per sentence. We use GroupBy.agg:
dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
    .groupby(level=0)
    .agg(', '.join)
    .to_numpy()
)
                     sentenceCol  other column        country
0  this is from france and spain            15  france, spain
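The to_numpy() call assigns the result by position, so it relies on every row having at least one match; if some sentence contains no country, the lengths no longer agree. A sketch of an index-aligned variant, assuming the same column names; the extra sample rows are made up for illustration:

```python
import pandas as pd

# Sample data; rows 1 and 2 are hypothetical additions to show the edge cases.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'no country here',
                                    'only spain'],
                    'other column': [15, 20, 25]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

pattern = f"({'|'.join(dfb['country'])})"

# extractall returns a DataFrame indexed by (original row, match number);
# column 0 holds the text captured by the single group.
matches = dfa['sentenceCol'].str.extractall(pattern)[0]

# Collapse the matches of each original row into one comma-separated string.
joined = matches.groupby(level=0).agg(', '.join)

# Assigning a Series aligns on the index, so rows without a match become NaN.
dfa['country'] = joined
print(dfa)
```

Dropping to_numpy() here is a deliberate choice: the grouped result keeps the original row labels, so pandas can line the values up on its own.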
Edit2
To duplicate the original rows, we join the dataframe back to our extraction:
extraction = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
    .rename(columns={0: 'country'})
)
dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)
  country                    sentenceCol  other column
0  france  this is from france and spain            15
1   spain  this is from france and spain            15
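As an alternative to the join (a different technique, not the one used above), the same one-row-per-country shape can be sketched with Series.str.findall followed by DataFrame.explode, which needs pandas 0.25 or later. The sample sentences are assumptions for illustration:

```python
import pandas as pd

# Sample data with two distinct sentences (hypothetical, for illustration).
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'only spain here'],
                    'other column': [15, 20]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

# No capture group here: findall then returns the full matches as a list per row.
pattern = '|'.join(dfb['country'])

out = (
    dfa.assign(country=dfa['sentenceCol'].str.findall(pattern))
       .explode('country')        # one output row per list element
       .reset_index(drop=True)
)
print(out)
```

This keeps the original column order, with country appended last, and a sentence with no match is kept as a single row with NaN in country rather than being dropped.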
Dataframes used:
import pandas as pd

dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain'] * 2,
                    'other column': [15] * 2})
dfb = pd.DataFrame({'country': ['france', 'spain']})