I'm new to Spark. How can I build an inverted index for a CSV file using Spark? I have a CSV file:
df.show()
+--------+--------------------+--------------------+---
First you need to convert the tags column to an array, like this:

```python
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

def read_tags_raw(tags_string):
    # '<a><b><c>' -> ['a', 'b', 'c']; None or empty -> []
    return tags_string.strip('>').strip('<').split('><') if tags_string else []

read_tags = udf(read_tags_raw, ArrayType(StringType()))
df_valid_tags = df.withColumn('tags', read_tags('tags'))
```
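To see what the parser does on its own, here is a quick check with hypothetical tag strings in the Stack Overflow dump format (the function is plain Python, so you can test it without Spark):

```python
def read_tags_raw(tags_string):
    # '<a><b><c>' -> ['a', 'b', 'c']; None or empty -> []
    return tags_string.strip('>').strip('<').split('><') if tags_string else []

print(read_tags_raw('<python><apache-spark><pyspark>'))
# ['python', 'apache-spark', 'pyspark']
print(read_tags_raw(None))
# []
```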
Then you call explode and group by tag to create the inverted index:

```python
from pyspark.sql.functions import explode, collect_list

df_valid_tags.select(explode('tags').alias('tag'), 'id') \
    .groupBy('tag').agg(collect_list('id').alias('ids'))
```
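Conceptually, explode plus groupBy/collect_list maps each tag to the list of post ids that carry it. A plain-Python sketch of the same transformation, using hypothetical sample rows, shows the semantics without needing a Spark cluster:

```python
from collections import defaultdict

# Hypothetical (id, tags) rows, as they would look after the UDF above.
rows = [
    (1, ['python', 'apache-spark']),
    (2, ['python']),
    (3, ['apache-spark', 'csv']),
]

inverted = defaultdict(list)
for post_id, tags in rows:
    for tag in tags:                   # "explode": one (tag, id) pair per tag
        inverted[tag].append(post_id)  # "collect_list": gather ids per tag

print(dict(inverted))
# {'python': [1, 2], 'apache-spark': [1, 3], 'csv': [3]}
```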
Good luck working with SO database ;-)