Concatenating two columns in a PySpark DataFrame in alphabetical order [duplicate]

萝らか妹 submitted on 2020-04-17 22:54:55

Question


I have a PySpark DataFrame with 5M records, and I am going to apply fuzzy matching (Levenshtein and Soundex functions) to find duplicates in the first name and last name columns.

Input data

Before that, I want to reorder the first name and last name values so that I get the correct Levenshtein distance. This is how I concatenate them at the moment:

from pyspark.sql import functions as f  # the alias f used below
df = df.withColumn('full_name', f.concat(f.col('first'), f.lit('_'), f.col('last')))

Output I get

I need something like the output below to get proper results from the fuzzy matching. When concatenating, the two column values should be taken in alphabetical order.

Output I want
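To make the motivation concrete, here is a small pure-Python sketch of why the order matters for Levenshtein distance. The helper names `levenshtein` and `name_key` and the sample names are invented for illustration and are not part of the original question.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_key(first: str, last: str, sep: str = "_") -> str:
    """Join the two names in alphabetical order, whichever column they came from."""
    return sep.join(sorted((first, last)))


# Same person, with first and last name swapped between two records.
plain_a = "john" + "_" + "smith"
plain_b = "smith" + "_" + "john"

print(levenshtein(plain_a, plain_b))  # non-zero: the swap looks like many edits
print(levenshtein(name_key("john", "smith"), name_key("smith", "john")))  # 0
```

With the sorted key, swapped columns no longer inflate the distance, so only genuine spelling differences remain.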


Answer 1:


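Try the following (this is pandas syntax, so it applies to a pandas DataFrame rather than directly to a PySpark one):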

df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)
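Since the question is about a PySpark DataFrame, the same idea can be sketched with built-in Spark column functions instead of a row-wise `apply`. This is a minimal sketch, not the answerer's method: the function name `with_sorted_full_name` is hypothetical, and the column names `first` and `last` are taken from the question's code.

```python
def with_sorted_full_name(df, first_col="first", last_col="last", sep="_"):
    """Return df with a 'full_name' column holding the two names in sorted order."""
    # Imported here so the sketch can be loaded even where Spark is not installed.
    from pyspark.sql import functions as f

    # Put both values in an array and sort it ascending.
    ordered = f.sort_array(f.array(f.col(first_col), f.col(last_col)))
    # concat_ws joins the array elements with the separator and skips nulls.
    return df.withColumn("full_name", f.concat_ws(sep, ordered))


# Usage, assuming df is an existing PySpark DataFrame with those two columns:
# df = with_sorted_full_name(df)
```

Spark compares strings case-sensitively here, so lowercasing both columns first (for example with `f.lower`) may be worth considering if the data mixes cases. Staying with built-in functions also avoids a Python UDF, which is usually slower on a dataset of this size.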


Source: https://stackoverflow.com/questions/60968052/concatenating-two-columns-in-pyspark-data-frame-according-to-alphabets-order
