The documentation for Spacy 2.0 mentions that the developers have added functionality to allow for Spacy to be pickled so that it can be used by a Spark cluster interfaced by PySpark.
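In principle, that would let you load the model once on the driver and ship it to the executors as a broadcast variable. A minimal sketch of that idea, assuming sc is an existing SparkContext and that the pipeline actually pickles cleanly in your environment:

import spacy

nlp = spacy.load('en')
broadcast_nlp = sc.broadcast(nlp)  # relies on the spaCy 2.x pipeline being picklable

def get_entities_broadcast(text):
    # Runs on the executor; .value gives the deserialized pipeline.
    doc = broadcast_nlp.value(text)
    return [ent.label_ for ent in doc.ents]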
Not really an answer, but the best workaround I've discovered:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy

def get_entities_udf():
    def get_entities(text):
        # Load the spaCy model lazily on each executor: the first call fails
        # because nlp isn't defined yet, the except branch loads it, and
        # every later call on that worker reuses the already-loaded model.
        global nlp
        try:
            doc = nlp(unicode(text))  # on Python 3, use str(text) instead
        except:
            nlp = spacy.load('en')
            doc = nlp(unicode(text))
        return [t.label_ for t in doc.ents]
    # The UDF returns a list of entity labels, so the Spark return type
    # is ArrayType(StringType()).
    res_udf = udf(get_entities, ArrayType(StringType()))
    return res_udf
documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))
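The point of the trick is that the model is never pickled with the UDF closure; instead each executor pays for one spacy.load('en') the first time the UDF runs there and reuses the cached model afterwards. For context, a minimal usage sketch (the SparkSession setup, the sample sentence, and the expected labels are illustrative assumptions; the 'en' model has to be installed on every worker):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
documents_df = spark.createDataFrame(
    [("Apple is looking at buying a U.K. startup.",)], ["text"]
)
documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))
documents_df.show(truncate=False)
# The 'en' model typically tags "Apple" as ORG and "U.K." as GPE, so
# the entities column should come back roughly as [ORG, GPE].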