Python - Pickle Spacy for PySpark

半阙折子戏 2021-02-06 09:29

The documentation for Spacy 2.0 mentions that the developers have added functionality to allow Spacy to be pickled so that it can be used by a Spark cluster interfaced by PySpark.

2 Answers
  •  一整个雨季
    2021-02-06 09:42

    Not really an answer, but the best workaround I've discovered:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType, ArrayType
    import spacy
    
    def get_entities_udf():
        def get_entities(text):
            global nlp
            try:
                doc = nlp(text)
            except NameError:
                # The model is loaded lazily on each executor the first time
                # the UDF runs there, so the nlp object itself is never
                # pickled and shipped from the driver.
                nlp = spacy.load('en')
                doc = nlp(text)
            return [t.label_ for t in doc.ents]
        # The UDF returns a list of entity labels, i.e. an array of strings.
        return udf(get_entities, ArrayType(StringType()))
    
    documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))
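    The trick here is lazy initialization: the UDF closure is pickled *without* the model, and each worker loads spaCy on its first call, then reuses the cached instance. The same pattern can be sketched without Spark or spaCy; `load_model` below is a hypothetical stand-in for an expensive, unpicklable load such as `spacy.load('en')`:

    ```python
    _model = None  # module-level cache, one per worker process
    
    def load_model():
        # Hypothetical stand-in for an expensive load like spacy.load('en').
        # Here the "model" is just an uppercasing function.
        return str.upper
    
    def process(text):
        global _model
        if _model is None:
            # First call in this process: load and cache the model.
            _model = load_model()
        return _model(text)
    
    print(process("hello"))  # triggers the load
    print(process("world"))  # reuses the cached model
    ```

    Because `process` only closes over plain names, it pickles cleanly; the heavyweight object is created on the receiving side instead of being serialized across the wire.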
    
