PySpark string syntax error on UDF that returns MapType(StringType(), StringType())


Question


I'm getting the following syntax error:

pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;

The error occurs when performing aspect-based sentiment classification on the text column of a Spark DataFrame df_text that looks more or less like the following:

index       id              text           
1995        ev0oyrq         [sign up](     
2014        eugwxff         No I am not.
2675        g9f914q         It’s hard for her to move around and even sit down, hard for her to walk and squeeze her hands. She hunches now.       
1310        echja0g         Thank you!     
2727        gc725t2         great post!    
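
For reference, a minimal frame of this shape can be built like so (illustrative only; the rows are copied from the sample above and the SparkSession setup is the standard one):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative reconstruction of df_text from the sample rows above
df_text = spark.createDataFrame(
    [
        (1995, "ev0oyrq", "[sign up]("),
        (2014, "eugwxff", "No I am not."),
        (1310, "echja0g", "Thank you!"),
        (2727, "gc725t2", "great post!"),
    ],
    ["index", "id", "text"],
)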

My classify_text(text) function returns a dictionary of the following format:

{"aspect1": "positive",
"aspect2": "positive",
"aspect3": "neutral",
"aspect4": "negative"
}
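
For anyone reproducing this without the model, a stub with the same return shape is enough (classify_text_stub is just an illustration, not my real function):

def classify_text_stub(text):
    # Fixed output standing in for the real classifier; same shape as above
    return {"aspect1": "positive", "aspect2": "positive",
            "aspect3": "neutral", "aspect4": "negative"}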

And my code is as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

udfClassifyText = udf(classify_text, MapType(StringType(), StringType()))

df_with_aspects = df_text.withColumn("aspects", udfClassifyText("text"))
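
For what it's worth, referencing the column explicitly should be equivalent here:

from pyspark.sql.functions import col

df_with_aspects = df_text.withColumn("aspects", udfClassifyText(col("text")))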

The expected output of df_with_aspects.show() in this case is:

index       id              text                 aspects
1995        ev0oyrq         [sign up](           ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
2014        eugwxff         No I am not.         ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
2675        g9f914q         It’s hard for her... ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
1310        echja0g         Thank you!           ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
2727        gc725t2         great post!          ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]

Looking at the function calls in the stacktrace, I see _create_column_from_name being invoked. It looks like Spark is trying to create a column whose name is the row's text value, hence the syntax error. I don't want to strip punctuation, because the model in classify_text actually splits the text into sentences. Does anyone know why I am getting this error and how I could avoid it? Thank you so much!
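
For what it's worth, the exact message can be reproduced in isolation by asking Spark for a column with that name, which supports this reading (a sketch; col and lit are the standard pyspark.sql.functions):

from pyspark.sql.functions import col, lit

# Raises "syntax error in attribute name: No I am not.;" because the
# trailing "." makes the name unparseable as a (possibly nested) column
# reference.
col("No I am not.")

# A literal string value, by contrast, has to be wrapped explicitly:
lit("No I am not.")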

----EDIT-----

My UDF is quite complex, but it simply returns a dictionary. It uses the Aspect-Based Sentiment Analysis library and BERT transformers. I took inspiration from an existing answer to avoid rebuilding the expensive model on every row.

from transformers import BertTokenizer
import aspect_based_sentiment_analysis as absa

model_name = 'absa/classifier-rest-0.2'
aspects = ['overall', 'effectiveness', 'side effects', 'dosage']

# Build the heavy ABSA pipeline lazily and cache it in a module-level
# global, so each Python worker constructs the model only once.
ABSA_PIPELINE = None

def get_absa_pipeline():
    global ABSA_PIPELINE
    if not ABSA_PIPELINE:
        model = absa.BertABSClassifier.from_pretrained(model_name)
        tokenizer = BertTokenizer.from_pretrained(model_name)
        professor = absa.Professor()
        text_splitter = absa.sentencizer()
        ABSA_PIPELINE = absa.Pipeline(model, tokenizer, professor, text_splitter)

    return ABSA_PIPELINE

@udf(returnType=MapType(StringType(), StringType()))
def classify_text(text):
    nlp = get_absa_pipeline()

    # Run the pipeline stages explicitly on a single text
    task = nlp.preprocess(text=text, aspects=aspects)
    tokenized_examples = nlp.tokenize(task.examples)
    input_batch = nlp.encode(tokenized_examples)
    output_batch = nlp.predict(input_batch)
    predictions = nlp.review(tokenized_examples, output_batch)
    completed_task = nlp.postprocess(task, predictions)

    # Map each aspect name to its predicted sentiment label
    return {k: v.sentiment.name for k, v in completed_task.subtasks.items()}
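
Calling the function locally (outside Spark) does return the expected dictionary. The wrapper produced by @udf exposes the original Python function as .func, so a driver-side sanity check looks like this (illustrative):

# .func bypasses the UDF machinery and runs the plain Python function;
# the second line also shows the lazily built pipeline being reused.
out = classify_text.func("No I am not.")
print(out)  # a plain dict mapping each aspect to a sentiment label
assert get_absa_pipeline() is get_absa_pipeline()  # cached singleton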

The stacktrace:

PythonException                           Traceback (most recent call last)
<ipython-input-19-420be959a305> in <module>()
----> 1 df_with_aspects.show()

3 frames
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
    438         """
    439         if isinstance(truncate, bool) and truncate:
--> 440             print(self._jdf.showString(n, 20, vertical))
    441         else:
    442             print(self._jdf.showString(n, int(truncate), vertical))

/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
    132                 # Hide where the exception came from that shows a non-Pythonic
    133                 # JVM exception message.
--> 134                 raise_from(converted)
    135             else:
    136                 raise

/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py in raise_from(e)

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/udf.py", line 197, in wrapper
    return self(*args)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 177, in __call__
    return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 68, in _to_seq
    cols = [converter(c) for c in cols]
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 68, in <listcomp>
    cols = [converter(c) for c in cols]
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 50, in _to_java_column
    jcol = _create_column_from_name(col)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 43, in _create_column_from_name
    return sc._jvm.functions.col(name)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;

df_with_aspects.explain() outputs:

== Physical Plan ==
*(2) Project [comment_id#0, subreddit#1, text#2, id#3, length#4L, pythonUDF0#139 AS aspects#132]
+- BatchEvalPython [classify_text(text#2)], [pythonUDF0#139]
   +- *(1) Scan ExistingRDD[comment_id#0,subreddit#1,text#2,id#3,length#4L]

Source: https://stackoverflow.com/questions/65565976/pyspark-string-syntax-error-on-udf-that-returns-maptypestringtype-stringtype
