Find the target word and the word before it in col_a, and put the matched string in the col_b_PY and col_c_LG columns
This code i have tried to
You may use
df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")
Or, to extract all matches and join them with a space:
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
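To see what the extractall one-liner is doing, it may help to split it into its three steps. This is only a sketch on a two-row sample frame (the sample rows and the names matches, wide and joined are made up for illustration):

```python
import pandas as pd

# Small sample frame; the rows are illustrative, not the asker's full data
df = pd.DataFrame({
    "col_a": ["Python PY is a general purpose PY language LG",
              "Its easier LG to understand PY"]
})

pattern = r"([a-zA-Z'-]+\s+PY)\b"

# Step 1: one row per match, indexed by (original row, match number)
matches = df["col_a"].str.extractall(pattern)

# Step 2: pivot the match number into columns; rows with fewer matches get NaN
wide = matches.unstack()

# Step 3: drop the NaN padding and join the remaining matches with a space
joined = wide.apply(lambda row: " ".join(row.dropna()), axis=1)

print(joined.tolist())  # ['Python PY purpose PY', 'understand PY']
```

Rows of col_a with no match at all do not appear in the extractall result, so they end up as NaN once the joined Series is assigned back to the DataFrame.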
Note that you need a capturing group in the regex pattern so that extract can actually extract the text. From the documentation:
Extract capture groups in the regex pat as columns in a DataFrame.
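As a quick illustration of why the parentheses matter, here is a small sketch (the Series s and the variable names are made up for the example). As far as I know, pandas raises a ValueError when the pattern has no capture group:

```python
import pandas as pd

s = pd.Series(["Python PY is a language LG"])

# Without a capturing group, str.extract has nothing to return and raises
try:
    s.str.extract(r"[a-zA-Z'-]+\s+PY\b")
    raised = False
except ValueError:
    raised = True

# With a capturing group, the grouped text is returned
# (expand=False gives a Series instead of a one-column DataFrame)
first = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

print(raised)    # True
print(first[0])  # Python PY
```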
Note the \b word boundary is necessary to match PY / LG as a whole word.
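The effect of \b can be checked with the plain re module, which uses the same regex syntax as the pandas string methods. The sample sentence below is made up for the demonstration:

```python
import re

# "PYTHON" starts with "PY", so it is a good trap for the pattern without \b
text = "I like PYTHON and Python PY"

without_boundary = re.findall(r"([a-zA-Z'-]+\s+PY)", text)
with_boundary = re.findall(r"([a-zA-Z'-]+\s+PY)\b", text)

print(without_boundary)  # ['like PY', 'Python PY']
print(with_boundary)     # ['Python PY']
```

Without \b, the pattern also picks up the first two letters of PYTHON, which is probably not what you want.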
Also, if you want a match to start only with a letter, you may change the pattern to
r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
where the leading [a-zA-Z] will match a letter and [a-zA-Z'-]* will match 0 or more letters, apostrophes or hyphens.
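Here is a small comparison of the two patterns, again with plain re and a made-up sample string, in case the difference is not obvious. The names loose and strict are just labels for this sketch:

```python
import re

# "--" consists only of hyphens, which the character class [a-zA-Z'-] allows
text = "a -- PY b it's PY"

# Original pattern: the "word" before PY may start with an apostrophe or hyphen
loose = re.findall(r"([a-zA-Z'-]+\s+PY)\b", text)

# Stricter pattern: the "word" before PY must start with a letter
strict = re.findall(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b", text)

print(loose)   # ['-- PY', "it's PY"]
print(strict)  # ["it's PY"]
```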
Tested with Python 3.7 and pandas 0.24.2:
import pandas as pd

# Widen the console output so that all columns are shown
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)
df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
              'Programming language LG in Python PY',
              'Its easier LG to understand PY',
              'The syntax of the language LG is clean PY',
              'Python PY is a general purpose PY language LG']
})
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
Output:
   col_a                                          col_b_PY              col_c_LG
0  Python PY is a general-purpose language LG     Python PY             language LG
1  Programming language LG in Python PY           Python PY             language LG
2  Its easier LG to understand PY                 understand PY         easier LG
3  The syntax of the language LG is clean PY      clean PY              language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG
Check with
df['col_c_LG'] = df['col_a'].str.extract(r"(\w+\s+LG)")
df['col_c_PY'] = df['col_a'].str.extract(r"(\w+\s+PY)")
df
Out[474]:
col_a ... col_c_PY
0 Python PY is a general-purpose language LG ... Python PY
1 Programming language LG in Python PY ... Python PY
2 Its easier LG to understand PY ... understand PY
3 The syntax of the language LG is clean PY ... clean PY
[4 rows x 3 columns]