pandas: find a target word and the word before it in a string column, and add the matches as new columns

梦谈多话 2020-12-19 16:35

Find each target word (PY, LG) together with the word before it in col_a, and put the matched strings in the new columns col_b_PY and col_c_LG.

    This code i have tried to          


        
2 answers
  • 2020-12-19 16:49

    You may use

    df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
    df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")
    

    Or, to extract all matches and join them with a space:

    df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
    df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
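
As an alternative sketch (not part of the answer above), `Series.str.findall` followed by `Series.str.join` produces the same space-joined result in one chain. One difference: a row without any match gets an empty string here, whereas the `extractall` version leaves NaN after index alignment. The second row below is a made-up example to show that case.

```python
import pandas as pd

df = pd.DataFrame({"col_a": [
    "Python PY is a general purpose PY language LG",
    "no keyword in this row",  # hypothetical row without a match
]})

pattern_py = r"([a-zA-Z'-]+\s+PY)\b"

# findall returns, per row, a list of the captured strings;
# str.join then concatenates each list with a space.
df["col_b_PY"] = df["col_a"].str.findall(pattern_py).str.join(" ")

result = df["col_b_PY"].tolist()
print(result)  # ['Python PY purpose PY', '']
```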
    

    Note that the regex pattern must contain a capturing group so that extract has something to return. From the pandas documentation:

    Extract capture groups in the regex pat as columns in a DataFrame.
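
A minimal sketch of what this means in practice, assuming a recent pandas version: with a group, `extract` returns the captured text, and without one it raises a `ValueError`. The `expand=False` argument is used here only to get a Series back instead of a one-column DataFrame.

```python
import pandas as pd

# One row taken from the sample data further down.
s = pd.Series(["Programming language LG in Python PY"])

# With a capturing group, extract returns the captured text.
captured = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)[0]
print(captured)  # Python PY

# Without any group, pandas rejects the pattern.
try:
    s.str.extract(r"[a-zA-Z'-]+\s+PY\b")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```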

    Note the \b word boundary is necessary to match PY / LG as a whole word.
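
To see the effect of the trailing `\b`, here is a small sketch using Python's built-in `re` module, which pandas' string methods rely on for regex matching. The sample string is made up for illustration.

```python
import re

# Hypothetical text: "PYTHON" starts with "PY" but is a different word.
text = "uses PYTHON and some PY code"

with_boundary = re.search(r"([a-zA-Z'-]+\s+PY)\b", text)
without_boundary = re.search(r"([a-zA-Z'-]+\s+PY)", text)

print(with_boundary.group(1))     # some PY
print(without_boundary.group(1))  # uses PY  (matched the start of "PYTHON")
```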

    Also, if you want a match to start only at a letter, you can change the pattern to

    r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
    r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
       ^^^^^^^^          ^
    

    where [a-zA-Z] will match a letter and [a-zA-Z'-]* will match 0 or more letters, apostrophes or hyphens.
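
The difference between the two patterns only shows up when the "word" before PY begins with an apostrophe or hyphen. A short sketch with the built-in `re` module; the sample strings and the helper `first_match` are hypothetical, chosen only to illustrate this.

```python
import re

loose = re.compile(r"([a-zA-Z'-]+\s+PY)\b")            # may start with ' or -
strict = re.compile(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b")   # must start with a letter

def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(first_match(loose, "x --- PY"))    # --- PY
print(first_match(strict, "x --- PY"))   # None
print(first_match(loose, "'tis PY"))     # 'tis PY
print(first_match(strict, "'tis PY"))    # tis PY
```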

    Tested with Python 3.7 and pandas 0.24.2:

    import pandas as pd

    # Widen the console output so that all columns are printed
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_columns', 500)
    
    df = pd.DataFrame({
        'col_a': ['Python PY is a general-purpose language LG',
                 'Programming language LG in Python PY',
                 'Its easier LG to understand  PY',
                 'The syntax of the language LG is clean PY',
                 'Python PY is a general purpose PY language LG']
        })
    # Extract every "<word> PY" / "<word> LG" match per row and join them with a space
    df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
    df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
    print(df)
    

    Output:

                                               col_a              col_b_PY     col_c_LG
    0     Python PY is a general-purpose language LG             Python PY  language LG
    1           Programming language LG in Python PY             Python PY  language LG
    2                Its easier LG to understand  PY        understand  PY    easier LG
    3      The syntax of the language LG is clean PY              clean PY  language LG
    4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG
    
  • 2020-12-19 17:15

    You can also try:

    df['col_c_LG'] = df['col_a'].str.extract(r"(\w+\s+LG)")
    df['col_c_PY'] = df['col_a'].str.extract(r"(\w+\s+PY)")
    df
    Out[474]: 
                                            col_a       ...              col_c_PY
    0  Python PY is a general-purpose language LG       ...             Python PY
    1       Programming language LG in Python PY        ...             Python PY
    2             Its easier LG to understand  PY       ...        understand  PY
    3   The syntax of the language LG is clean PY       ...              clean PY
    [4 rows x 3 columns]
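
One thing to be aware of with `\w+`: it matches letters, digits and underscores, but not hyphens or apostrophes, so a hyphenated word directly before the keyword is captured only in part. A small sketch with the built-in `re` module, using a made-up sentence:

```python
import re

# Hypothetical text with a hyphenated word right before the keyword.
text = "Python PY is a general-purpose LG"

word_chars = re.search(r"(\w+\s+LG)", text).group(1)
letters_hyphens = re.search(r"([a-zA-Z'-]+\s+LG)\b", text).group(1)

print(word_chars)       # purpose LG
print(letters_hyphens)  # general-purpose LG
```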
    