Question
I have a PySpark DataFrame with 5M rows, and I am going to apply fuzzy matching (the Levenshtein and Soundex functions) to find duplicates in the first name and last name columns.
Input data:
Before that, I want to reorder the first name and last name values so that I get the correct Levenshtein distance. Currently I concatenate them like this:
df = df.withColumn('full_name', f.concat(f.col('first'),f.lit('_'), f.col('last')))
Output I get:
To get a proper result from the fuzzy matching, I need something like the output below: when concatenating, the two column values should be placed in alphabetical order, regardless of which column they come from.
Output I want:
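Answer 1:
Try the following. Note that this is pandas syntax (row-wise apply with axis=1 and column assignment), so it works on a pandas DataFrame, not directly on a PySpark one:
df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)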
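Since the question is about a PySpark DataFrame, the same sorted-join idea can probably be expressed with Spark's built-in column functions instead of a row-wise apply. The following is only a minimal sketch, assuming Spark 2.4 or later (for array_sort) and the column names first and last from the question. The helper names ordered_name and add_full_name are hypothetical, introduced here for illustration.

```python
def ordered_name(first, last, sep="_"):
    """Plain-Python reference for the rule: join the two names in sorted order."""
    # None values are dropped, mirroring how concat_ws skips nulls in Spark.
    parts = sorted(p for p in (first, last) if p is not None)
    return sep.join(parts)


def add_full_name(df, first_col="first", last_col="last", sep="_"):
    """Add a 'full_name' column to a PySpark DataFrame (hypothetical helper)."""
    # Imported here so the reference helper above can be used without PySpark.
    from pyspark.sql import functions as f

    # array() packs both names into one array column, array_sort() orders the
    # array ascending (nulls last), and concat_ws() joins the elements with sep.
    sorted_names = f.array_sort(f.array(f.col(first_col), f.col(last_col)))
    return df.withColumn("full_name", f.concat_ws(sep, sorted_names))
```

Usage would be along the lines of df = add_full_name(df). Because this avoids a Python UDF, it should scale better on 5M rows. The comparison is case-sensitive, so uppercase letters sort before lowercase ones; if that matters, lowercasing both columns first (for example with f.lower) may be worth considering.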
Source: https://stackoverflow.com/questions/60968052/concatenating-two-columns-in-pyspark-data-frame-according-to-alphabets-order