Say I have a list BRANDS that contains brand names:
BRANDS = [\'Samsung\', \'Apple\', \'Nike\', .....]
Dataframe A has following structure
You can achieve the result you are after by writing a simple function. You can then use .apply() with a lambda function to generate your desired column.
def contains_any(s, arr):
for item in arr:
if item in s: return item
return np.nan
df['brand_name'] = df['product'].apply(lambda x: match_substring(x, product_map))
Vectorized solution:
In [168]: x = df.item_title.str.split(expand=True)
In [169]: df['brand_name'] = \
df['brand_name'].fillna(x[x.isin(BRANDS)]
.ffill(axis=1)
.bfill(axis=1)
.iloc[:, 0])
In [170]: df
Out[170]:
row item_title brand_name
0 1 Apple 6S Apple
1 2 Nike BB Shoes Nike
2 3 Samsung TV Samsung
3 4 Used bike NaN
One approach is to use apply()
:
import pandas as pd
BRANDS = ['Samsung', 'Apple', 'Nike']
def get_brand_name(row):
if ~pd.isnull(row['brand_name']):
# don't do anything if brand_name is not null
return row['brand_name']
item_title = row['item_title']
title_words = map(str.title, item_title.split())
for tw in title_words:
if tw in BRANDS:
# return first 'match'
return tw
# default return None
return None
df['brand_name'] = df.apply(lambda x: get_brand_name(x), axis=1)
print(df)
# row item_title brand_name
#0 1 Apple 6S Apple
#1 2 Nike BB Shoes Nike
#2 3 Samsung TV Samsung
#3 4 Used bike None
Notes
set
instead of a list
because lookups will be faster. However, this won't work if you care about order.