Question
I created a data frame in PySpark by reading data from HDFS like this:
df = spark.read.parquet('path/to/parquet')
I expect the data frame to have two columns of strings:
+------------+------------------+
|my_column   |my_other_column   |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+
However, the my_column column contains strings starting with [Ljava.lang.Object;, looking like this:
>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column                    |my_other_column   |
+-----------------------------+------------------+
|[Ljava.lang.Object;@7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;@5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;@6be335ee |my_other_string_3 |
|[Ljava.lang.Object;@153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;@1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;@3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;@33846636 |my_other_string_7 |
|[Ljava.lang.Object;@521a0a3d |my_other_string_8 |
+-----------------------------+------------------+
>> df.printSchema()
root
|-- my_column: string (nullable = true)
|-- my_other_column: string (nullable = true)
As you can see, my_other_column looks as expected. Is there any way to convert the objects in my_column to human-readable strings?
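For context, a value such as [Ljava.lang.Object;@7abeeeb6 has the form that Java's default toString() produces for an Object[]: the type descriptor, an @, and a hexadecimal hash code. Because printSchema() reports my_column as a plain string, one possible explanation (an assumption, not confirmed anywhere in this thread) is that these strings were already stored in this form when the Parquet file was written; if so, the original array contents cannot be recovered from the string alone. The sketch below is plain Python rather than Spark code, and the helper name is_java_array_repr is hypothetical. It only shows how such values could be recognised:

```python
import re

# Default Java toString() output for an Object[]:
# "[Ljava.lang.Object;" followed by "@" and a hexadecimal hash code.
JAVA_OBJECT_ARRAY_REPR = re.compile(r"\[Ljava\.lang\.Object;@[0-9a-f]+")


def is_java_array_repr(value):
    """Return True if value is exactly a Java Object[] default toString() string.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if not isinstance(value, str):
        return False
    return JAVA_OBJECT_ARRAY_REPR.fullmatch(value.strip()) is not None


# Values taken from the tables in the question, plus a null.
samples = ["[Ljava.lang.Object;@7abeeeb6", "my_string_1", None]
flags = [is_java_array_repr(s) for s in samples]
print(flags)  # [True, False, False]
```

In PySpark, a comparable check would be df.filter(df.my_column.startswith('[Ljava.lang.Object;')).count(), which shows how many rows are affected before deciding whether the file needs to be regenerated upstream.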
Answer 1:
Jaroslav,
I tried the following code with a sample Parquet file from here and was able to get the desired output from the data frame. Could you please check your code against the snippet below, using the sample file referred to above, to see whether there is any other issue?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
df.show(10)
df.printSchema()
Replace the path with your HDFS location.
Dataframe output for your reference:
Source: https://stackoverflow.com/questions/57748029/pyspark-how-to-covert-column-with-ljava-lang-object