Question
I'm getting the following syntax error:
pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;
This happens while I'm performing aspect-based sentiment classification on the text column of a Spark DataFrame df_text that looks more or less like the following:
index  id       text
1995   ev0oyrq  [sign up](
2014   eugwxff  No I am not.
2675   g9f914q  It’s hard for her to move around and even sit down, hard for her to walk and squeeze her hands. She hunches now.
1310   echja0g  Thank you!
2727   gc725t2  great post!
My classify_text(text) function returns a dictionary of the following format:
{"aspect1": "positive",
"aspect2": "positive",
"aspect3": "neutral",
"aspect4": "negative"
}
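(As a debugging aid, any stand-in with the same contract should behave identically from Spark's point of view; classify_text_stub below is a hypothetical name, not my real function.)

def classify_text_stub(text):
    # Hypothetical stub: same signature and return type as classify_text,
    # handy for separating Spark problems from model problems.
    return {"aspect1": "positive", "aspect2": "neutral"}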
And my code is as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

udfClassifyText = udf(classify_text, MapType(StringType(), StringType()))
df_with_aspects = df_text.withColumn("aspects", udfClassifyText("text"))
The expected output of df_with_aspects.show() in this case is:
index  id       text                    aspects
1995   ev0oyrq  [sign up](              ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
2014   eugwxff  No I am not.            ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
2675   g9f914q  It’s hard for her to...  ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
1310   echja0g  Thank you!              ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
2727   gc725t2  great post!             ["aspect1" -> "positive", "aspect2" -> "positive", "aspect3" -> "neutral", "aspect4" -> "negative"]
Looking at the function calls in the stack trace, I see _create_column_from_name being invoked, so it seems Spark is treating the cell's text as a column name, hence the syntax error. I don't want to strip punctuation, because the model in classify_text actually splits the text into sentences. Does anyone know why I am getting this error and how I can avoid it?
Thank you so much!
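To make my hypothesis concrete (my own minimal repro, not from the docs): anything PySpark receives as a plain string is resolved through functions.col, which parses it as an attribute name, and the trailing period in the cell text appears to be exactly what trips the parser:

from pyspark.sql.functions import col

# My own illustration: col() parses its argument as an attribute name,
# so a free-text value with a trailing period triggers the same error.
col("No I am not.")
# pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;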
-----EDIT-----
My UDF is fairly complex, but it simply returns a dictionary. It uses the Aspect-Based Sentiment Analysis library with BERT transformers. To avoid building the expensive model on every row, I lazily initialize the pipeline once per worker, a pattern I took from another answer:
from transformers import BertTokenizer
import aspect_based_sentiment_analysis as absa

model_name = 'absa/classifier-rest-0.2'
aspects = ['overall', 'effectiveness', 'side effects', 'dosage']

# Cache the heavy ABSA pipeline so it is built at most once per Python worker.
ABSA_PIPELINE = None

def get_absa_pipeline():
    global ABSA_PIPELINE
    if not ABSA_PIPELINE:
        model = absa.BertABSClassifier.from_pretrained(model_name)
        tokenizer = BertTokenizer.from_pretrained(model_name)
        professor = absa.Professor()
        text_splitter = absa.sentencizer()
        ABSA_PIPELINE = absa.Pipeline(model, tokenizer, professor, text_splitter)
    return ABSA_PIPELINE
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

@udf(returnType=MapType(StringType(), StringType()))
def classify_text(text):
    nlp = get_absa_pipeline()
    task = nlp.preprocess(text=text, aspects=aspects)
    tokenized_examples = nlp.tokenize(task.examples)
    input_batch = nlp.encode(tokenized_examples)
    output_batch = nlp.predict(input_batch)
    predictions = nlp.review(tokenized_examples, output_batch)
    completed_task = nlp.postprocess(task, predictions)
    # Map each aspect name to its predicted sentiment label.
    return {k: v.sentiment.name for k, v in completed_task.subtasks.items()}
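Note that with the decorator, classify_text is already a PySpark UserDefinedFunction object. As a driver-side sanity check (my own sketch, going through the UDF object's func attribute to reach the plain Python function underneath):

# Call the undecorated Python function directly, bypassing Spark.
# classify_text.func is the original function stored on the UDF object.
print(classify_text.func("No I am not."))
# Expected: a dict mapping each aspect to a sentiment label.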
The stacktrace:
PythonException Traceback (most recent call last)
<ipython-input-19-420be959a305> in <module>()
----> 1 df_with_aspects.show()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
438 """
439 if isinstance(truncate, bool) and truncate:
--> 440 print(self._jdf.showString(n, 20, vertical))
441 else:
442 print(self._jdf.showString(n, int(truncate), vertical))
/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
132 # Hide where the exception came from that shows a non-Pythonic
133 # JVM exception message.
--> 134 raise_from(converted)
135 else:
136 raise
/usr/local/lib/python3.6/dist-packages/pyspark/sql/utils.py in raise_from(e)
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pyspark/sql/udf.py", line 197, in wrapper
return self(*args)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 177, in __call__
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 68, in _to_seq
cols = [converter(c) for c in cols]
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 68, in <listcomp>
cols = [converter(c) for c in cols]
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 50, in _to_java_column
jcol = _create_column_from_name(col)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/column.py", line 43, in _create_column_from_name
return sc._jvm.functions.col(name)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.6/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;
df_with_aspects.explain() outputs:
== Physical Plan ==
*(2) Project [comment_id#0, subreddit#1, text#2, id#3, length#4L, pythonUDF0#139 AS aspects#132]
+- BatchEvalPython [classify_text(text#2)], [pythonUDF0#139]
+- *(1) Scan ExistingRDD[comment_id#0,subreddit#1,text#2,id#3,length#4L]
Source: https://stackoverflow.com/questions/65565976/pyspark-string-syntax-error-on-udf-that-returns-maptypestringtype-stringtype