How can I get an inverted index?

别跟我提以往 2021-01-27 14:29

I'm new to Spark. How can I build an inverted index for a CSV file using Spark? I have a CSV file:

df.show()
+--------+--------------------+--------------------+---         


        
1 answer
  • 2021-01-27 14:45

    First, convert the tags column to an array, like this:

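    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    def read_tags_raw(tags_string):
        # '<a><b>' -> ['a', 'b']; null or empty input -> []
        return tags_string.strip('>').strip('<').split('><') if tags_string else []

    # Wrap the parser as a UDF that returns array<string>
    read_tags = udf(read_tags_raw, ArrayType(StringType()))
    df_valid_tags = df.withColumn('tags', read_tags('tags'))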
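
    To see what the parsing function returns before wrapping it in a UDF, it can be run as plain Python. This is only a sketch: the sample tag strings are made up and assume the tags are stored in the form '<tag1><tag2>', which is what the strip/split logic implies.

```python
def read_tags_raw(tags_string):
    # Drop the outer angle brackets, then split on the '><' between tags.
    return tags_string.strip('>').strip('<').split('><') if tags_string else []

# Hypothetical inputs, not taken from the question's CSV.
parsed = read_tags_raw('<python><apache-spark>')
single = read_tags_raw('<python>')
empty = read_tags_raw(None)

print(parsed)  # ['python', 'apache-spark']
print(single)  # ['python']
print(empty)   # []
```

    Returning an empty list for missing values keeps the column type consistent, so later steps do not have to handle nulls separately.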
    

    Then call explode and group by tag to create the inverted index:

    from pyspark.sql.functions import collect_list, explode

    df_valid_tags.select(explode('tags').alias('tag'), 'id') \
        .groupBy('tag').agg(collect_list('id').alias('ids'))
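
    If it helps to see what explode plus groupBy/collect_list computes, here is a rough plain-Python equivalent with no Spark involved. The function name build_inverted_index and the sample rows are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each tag to the list of ids whose tag list contains it."""
    index = defaultdict(list)
    for row_id, tags in rows:
        for tag in tags:                # one (tag, id) pair per tag, like explode
            index[tag].append(row_id)   # gather ids per tag, like collect_list
    return dict(index)

# Hypothetical (id, tags) rows, already parsed into lists.
rows = [
    (1, ['python', 'apache-spark']),
    (2, ['python']),
    (3, []),  # no tags: contributes nothing, as with explode on an empty array
]

inverted = build_inverted_index(rows)
print(inverted)  # {'python': [1, 2], 'apache-spark': [1]}
```

    One difference to keep in mind: in Spark the order of ids inside each collect_list result is not guaranteed, whereas this loop keeps input order.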
    

    Good luck working with the SO database ;-)
