Question
The table looks like this:
ID |CITY
----------------------------------
1 |London|Paris|Tokyo
2 |Tokyo|Barcelona|Mumbai|London
3 |Vienna|Paris|Seattle
The CITY column contains around 1000+ values, which are | delimited.
I want to create one flag column per city of interest, indicating whether the person visited that city.
city_of_interest=['Paris','Seattle','Tokyo']
There are 20 such values in the list.
Output should look like this:
ID |Paris | Seattle | Tokyo
-------------------------------------------
1 |1 |0 |1
2 |0 |0 |1
3 |1 |1 |0
The solution can either be in pandas or pyspark.
Answer 1:
For pyspark, use split + array_contains:
from pyspark.sql.functions import split, array_contains
# split takes a regex, so the pipe must be escaped; a raw string keeps the backslash intact
df.withColumn('cities', split('CITY', r'\|')) \
  .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
  .show()
+---+-----+-------+-----+
| ID|Paris|Seattle|Tokyo|
+---+-----+-------+-----+
| 1| 1| 0| 1|
| 2| 0| 0| 1|
| 3| 1| 1| 0|
+---+-----+-------+-----+
For Pandas, use Series.str.get_dummies:
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
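The pandas one-liner above raises a KeyError if a city in city_of_interest never occurs in the data. A minimal self-contained sketch that tolerates this, assuming the sample table from the question; "Oslo" is a made-up entry added only to show the missing-city case:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "CITY": ["London|Paris|Tokyo",
             "Tokyo|Barcelona|Mumbai|London",
             "Vienna|Paris|Seattle"],
})
# "Oslo" is hypothetical: it is not in the data, so its column should be all zeros
city_of_interest = ["Paris", "Seattle", "Tokyo", "Oslo"]

# get_dummies splits on "|" and builds one 0/1 column per city;
# reindex keeps only the cities of interest and fills absent ones with 0
flags = (df["CITY"].str.get_dummies(sep="|")
                   .reindex(columns=city_of_interest, fill_value=0))
result = pd.concat([df[["ID"]], flags], axis=1)
print(result)
```

Using reindex instead of plain column selection is a design choice here: it trades a loud failure for a silent all-zero column, which may or may not be what you want for misspelled city names.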
Answer 2:
Pandas Solution
First split the CITY strings into lists so that DataFrame.explode can be used:
new_df=df.copy()
new_df['CITY']=new_df['CITY'].str.lstrip('|').str.split('|')
#print(new_df)
# ID CITY
#0 1 [London, Paris, Tokyo]
#1 2 [Tokyo, Barcelona, Mumbai, London]
#2 3 [Vienna, Paris, Seattle]
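Then use either of the following methods. Each one starts from the list-valued new_df built above, so run only one of them: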
Method 1: DataFrame.pivot_table
new_df = (new_df.explode('CITY')
                .pivot_table(columns='CITY', index='ID', aggfunc='size', fill_value=0)
                [city_of_interest]
                .reset_index()
                .rename_axis(columns=None)
         )
print(new_df)
Method 2: DataFrame.groupby + DataFrame.unstack
new_df = (new_df.explode('CITY')
                .groupby(['ID'])
                .CITY
                .value_counts()
                .unstack('CITY', fill_value=0)[city_of_interest]
                .reset_index()
                .rename_axis(columns=None)
         )
print(new_df)
Output new_df:
ID Paris Seattle Tokyo
0 1 1 0 1
1 2 0 0 1
2 3 1 1 0
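Both methods count occurrences, so a city listed twice in one row would produce 2 rather than a 0/1 flag. A possible variant, sketched here with pd.crosstab and clip, that caps the counts at 1; the two sample rows are hypothetical and chosen so that Paris repeats for ID 1:

```python
import pandas as pd

# Hypothetical data: "Paris" appears twice for ID 1
df = pd.DataFrame({
    "ID": [1, 2],
    "CITY": ["Paris|Tokyo|Paris", "London"],
})
city_of_interest = ["Paris", "Seattle", "Tokyo"]

# One row per (ID, city) pair
exploded = df.assign(CITY=df["CITY"].str.split("|")).explode("CITY")

# crosstab counts each city per ID; reindex selects the cities of interest
# (adding zero columns for absent ones) and clip turns counts into 0/1 flags
flags = (pd.crosstab(exploded["ID"], exploded["CITY"])
           .reindex(columns=city_of_interest, fill_value=0)
           .clip(upper=1)
           .reset_index()
           .rename_axis(columns=None))
print(flags)
```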
Answer 3:
Using a UDF to check if the city of interest value is in the delimited column.
from pyspark.sql.functions import udf, array, lit
from pyspark.sql.types import IntegerType
#Input list
city_of_interest=['Paris','Seattle','Tokyo']
#UDF definition
def city_present(city_name, city_list):
    # Returns 1 if city_name is one of the '|'-separated entries, else 0
    return len(set([city_name]) & set(city_list.split('|')))
city_present_udf = udf(city_present,IntegerType())
#Converting cities list to a column of array type for adding columns to the dataframe
city_array = array(*[lit(city) for city in city_of_interest])
l = len(city_of_interest)
col_names = df.columns + [city for city in city_of_interest]
result = df.select(df.columns + [city_present_udf(city_array[i], df['CITY']) for i in range(l)])
result = result.toDF(*col_names)
result.show()
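The per-row logic of the UDF can be checked in plain Python without a Spark session. A small sketch using the sample rows from the question; the rows dictionary and the 'New York' string are stand-ins introduced only for this check:

```python
def city_present(city_name, city_list):
    # Set intersection gives an exact match on whole entries, not a substring match
    return len({city_name} & set(city_list.split('|')))

city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# Stand-in for the DataFrame rows: ID -> '|'-delimited CITY string
rows = {
    1: 'London|Paris|Tokyo',
    2: 'Tokyo|Barcelona|Mumbai|London',
    3: 'Vienna|Paris|Seattle',
}

flags = {row_id: [city_present(c, cities) for c in city_of_interest]
         for row_id, cities in rows.items()}
print(flags)

# Hypothetical case: 'York' does not match the entry 'New York'
partial = city_present('York', 'New York|Paris')
print(partial)
```

Because the function intersects sets of whole entries, partial names do not produce false positives, which a plain substring test on the CITY string would.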
Source: https://stackoverflow.com/questions/59222061/how-to-create-multiple-flag-columns-based-on-list-values-found-in-the-dataframe