How can I remove a substring from a given String using Pandas

Question

I recently started analysing a data frame, and I want to filter its rows by whether they contain any of these strings:

('Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing')

But when I use this syntax:

df = df[~df["GrupoAssunto"].str.contains('Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing')]

I get this error:

TypeError: contains() takes from 2 to 6 positional arguments but 10 were given

Answer 1:

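Use the .isin() method instead. Note that it compares whole cell values for exact equality, so it only applies when the column holds exactly the listed strings rather than longer text that contains them.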

For example:

import pandas as pd

# vals1: the full column contents; vals2: the values to exclude
vals1 = ['good val1', 'good val2', 'good val3', 'Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']
vals2 = ['Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']

df = pd.DataFrame({'col1': vals1})

Negating .isin() with ~ selects only the rows whose values are not in the vals2 list:

df[~df['col1'].isin(vals2)]

Output:

        col1
0  good val1
1  good val2
2  good val3
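
One caveat worth illustrating: because .isin() compares whole values, a cell that merely contains one of the listed strings is not excluded. The following is a minimal sketch using made-up sample data (the column contents and the `unwanted` list are assumptions for illustration, not taken from the asker's data frame):

```python
import pandas as pd

# Hypothetical sample data: the second value only *contains* a listed string.
df = pd.DataFrame(
    {"GrupoAssunto": ["Telefonia Fixa", "Telefonia Fixa (Cobrança)", "Saúde"]}
)
unwanted = ["Telefonia Fixa", "Telemarketing"]

# .isin() tests exact equality, so only the first row is dropped.
exact = df[~df["GrupoAssunto"].isin(unwanted)]
print(exact["GrupoAssunto"].tolist())
# ['Telefonia Fixa (Cobrança)', 'Saúde']
```

If rows like the second one should also be dropped, a substring test with .str.contains() is the better fit.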

Answer 2:

Just separate the different strings with | and turn regex on. This is the proper syntax for searching for multiple strings with contains. Passing each string through re.escape takes care of the parentheses and any other special regex characters.

import re

bad_strings = ['Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']
# Escape regex metacharacters such as ( ) . / so each string matches literally
safe_bad_strings = [re.escape(s) for s in bad_strings]
# Join into one alternation pattern and keep the rows that match none of it
df = df[~df["GrupoAssunto"].str.contains('|'.join(safe_bad_strings), regex=True)]

Your error occurs because all nine strings are passed to contains as separate positional arguments (the tenth argument counted in the message is the object the method is called on). contains expects a single pattern, so it raises a TypeError.
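
To see the approach end to end, here is a small self-contained sketch. The data frame contents and the shortened `bad_strings` list are invented for illustration; `na=False` is an optional extra (not part of the answer above) that treats missing values as non-matches so the boolean mask can be negated safely.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real data frame.
df = pd.DataFrame(
    {
        "GrupoAssunto": [
            "Internet (Serviços e Produtos)",
            "Telefonia Fixa (Cobrança)",
            "Saúde",
            "Banco comercial",
        ]
    }
)
bad_strings = ["Internet (Serviços e Produtos)", "Telefonia Fixa"]

# Escape each string, then join with | to build a single alternation pattern.
pattern = "|".join(re.escape(s) for s in bad_strings)

# True where the cell contains any of the bad strings as a substring.
mask = df["GrupoAssunto"].str.contains(pattern, regex=True, na=False)

filtered = df[~mask]
print(filtered["GrupoAssunto"].tolist())
# ['Saúde', 'Banco comercial']
```

Without re.escape, the parentheses in the first string would be read as a regex group, and the pattern would no longer match the literal text.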

Source: https://stackoverflow.com/questions/65081257/how-can-i-remove-a-substring-from-a-given-string-using-pandas

Tags: python, pandas, string, substring, contains