Pattern matching on a website URL with Pandas DataFrame

逝去的感伤 2021-01-29 13:30

I am trying to solve a slightly complex pattern-matching problem on website URLs.

I have a particular column that contains URL with several informat

1 answer
  •  执笔经年
    2021-01-29 14:01

    Remove data=, then split on every delimiter that appears (space, /, ?, and .):

    import numpy as np
    import pandas as pd
    df_split = df['input'].str.replace('data=', '').str.split(r' |/|\?|\.', expand=True).replace('', np.nan).dropna(how='all', axis=1)
    Then you can rename the columns as you wish.
    

    Edit: I added the dropping of empty columns.
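As a minimal sketch of the single-split approach above: the question's real URLs are not shown, so the sample rows and the column names below are hypothetical and only illustrate the shape of the result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the actual URLs from the question are not available.
df = pd.DataFrame({
    "input": [
        "example.com/a/b?data=x.y z",
        "example.com/c/d?data=p.q r",
    ]
})

parts = (
    df["input"]
    .str.replace("data=", "", regex=False)   # drop the literal "data=" marker
    .str.split(r" |/|\?|\.", expand=True)    # split on space, "/", "?" and "."
    .replace("", np.nan)                     # treat empty pieces as missing
    .dropna(how="all", axis=1)               # drop columns that are entirely empty
)

# Hypothetical column names; choose whatever fits your data.
parts.columns = ["host", "tld", "section", "page", "key_a", "key_b", "key_c"]
print(parts)
```

Note that this only lines up cleanly when every row has the same number of pieces; rows with a missing component shift into the wrong columns.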

    Edit 2: to account for rows where the hostname is absent, split the two sides of ?data= separately:

    df_split1 = df['input'].str.split(r'\?data=', expand=True)
    df_left = df_split1.loc[:, 0].str.rsplit(r'/', n=5, expand=True)
    df_right = df_split1.loc[:, 1].str.split(r'\.| ', expand=True)
    
    df_left['option_a'] = df_left.iloc[:, 0].str.split(r'/', expand=True).iloc[:, -1].fillna(df_left.iloc[:, 0])
    # np.NaN was removed in NumPy 2.0, so use np.nan; a plain str.split avoids needing the re module
    df_left['sitename'] = df_left.iloc[:, 0].apply(lambda x: np.nan if '/' not in x else x.split('/')[0])
    

    Then concatenate the two halves:

    df = pd.concat([df_left, df_right], axis=1).iloc[:, 1:].replace('', np.nan).dropna(how='all', axis=1)
    

    Then rename the remaining columns.
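The two-stage split can be sketched end to end as follows. This is a variant of the code above, not the same code: it uses Series.where in place of the lambda and names the columns with add_prefix. The sample rows and all column names are hypothetical, and it assumes five path segments after the first one, matching n=5.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has no hostname.
df = pd.DataFrame({
    "input": [
        "site.com/a/b/c/d/e/f?data=x.y z",
        "a/b/c/d/e/f?data=p.q r",
    ]
})

# Split each URL into the part before and after "?data=".
halves = df["input"].str.split(r"\?data=", expand=True)

# Split the left side from the right so the last five segments always align;
# column 0 then holds either "host/first_segment" or just "first_segment".
left = halves[0].str.rsplit("/", n=5, expand=True)
head = left[0]
head_parts = head.str.split("/", expand=True)

# Last piece of the head is the first path segment; fall back to the head itself.
option_a = head_parts.iloc[:, -1].fillna(head).rename("option_a")
# The hostname exists only when the head contains a slash.
has_host = head.str.contains("/", regex=False, na=False)
sitename = head_parts[0].where(has_host).rename("sitename")

# Split the query payload on "." and space.
right = halves[1].str.split(r"\.| ", expand=True)

result = pd.concat(
    [sitename, option_a, left.iloc[:, 1:].add_prefix("path_"), right.add_prefix("data_")],
    axis=1,
)
print(result)
```

Prefixing the two halves before concatenating avoids duplicate integer column labels, which would otherwise make the later renaming ambiguous.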
