Question
I am trying to read a CSV file as a Spark DataFrame with inferSchema enabled, but I am then unable to access fv_df.columns. Below is the error message:
>>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True)
>>> fv_df.columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns
return [f.name for f in self.schema.fields]
File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 227, in schema
self._schema = _parse_datatype_json_string(self._jdf.schema().json())
File "/home/h212957/spark/python/pyspark/sql/types.py", line 894, in _parse_datatype_json_string
return _parse_datatype_json_value(json.loads(json_string))
File "/home/h212957/spark/python/pyspark/sql/types.py", line 911, in _parse_datatype_json_value
return _all_complex_types[tpe].fromJson(json_value)
File "/home/h212957/spark/python/pyspark/sql/types.py", line 562, in fromJson
return StructType([StructField.fromJson(f) for f in json["fields"]])
File "/home/h212957/spark/python/pyspark/sql/types.py", line 428, in fromJson
_parse_datatype_json_value(json["type"]),
File "/home/h212957/spark/python/pyspark/sql/types.py", line 907, in _parse_datatype_json_value
raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: decimal(7,-31)
However, if I don't infer the schema, I am able to fetch the columns and do further operations. I don't understand why it behaves this way. Can anyone please explain?
Answer 1:
I suggest you use the function '.load' rather than '.csv', something like this:
data = sc.read.load(path_to_file,
                    format='com.databricks.spark.csv',
                    header='true',
                    inferSchema='true').cache()
Of course, you can add more options. Then you can simply get what you want:
data.columns
Another way of doing this (to get the columns) is to use it this way:
# textFile is a method of the SparkContext, not of the SQLContext
data = spark.sparkContext.textFile(path_to_file)
And to get the headers (columns) just use
data.first()
Looks like you are trying to get your schema from your csv file without opening it! The above should help you to get them and hence manipulate whatever you like.
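The "read the first line to get the headers" idea above can also be sketched in plain Python, without Spark. This is only an illustration: the helper name read_header and the sample file contents are hypothetical, and the tab delimiter is taken from the question.

```python
import csv
import os
import tempfile


def read_header(path, delimiter="\t"):
    """Return the column names found on the first line of a delimited file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return next(reader)


# Hypothetical sample data, written to a temporary file for the demonstration.
sample = "id\tname\tvalue\n1\tpump\t3.5\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

header = read_header(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(header)
```

Only the header line is read here, so no type inference is involved at all.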
Note: to use '.columns' your 'sc' should be configured as:
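from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
    .master("yarn") \
    .appName("experiment-airbnb") \
    .enableHiveSupport() \
    .getOrCreate()
# SQLContext wraps a SparkContext, so pass spark.sparkContext rather than the session itself
sc = SQLContext(spark.sparkContext)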
Good luck!
Answer 2:
It would help if you provided some sample data next time; otherwise we cannot know what your CSV looks like. Regarding your question, it looks like your CSV column does not contain a decimal in every row. inferSchema assigns a data type from the values it reads, in your case a DecimalType, but a later row might contain text, which would cause the error.
If you don't infer the schema then, of course, it would work since everything will be cast as a StringType.
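As an aside on the type name in the error, decimal(7,-31): one possible way such a type could arise is a value written in scientific notation. This is an assumption, not something the question confirms. The sketch below uses Python's standard decimal module, not Spark, to show how a hypothetical value yields precision 7 and scale -31. The regular expression is an illustrative pattern that accepts only non-negative scales; it is not claimed to be PySpark's own.

```python
import re
from decimal import Decimal


def precision_and_scale(text):
    """Return (precision, scale) of a finite decimal literal.

    Precision is the number of stored digits; scale is the negated exponent.
    """
    _sign, digits, exponent = Decimal(text).as_tuple()
    return len(digits), -exponent


# Hypothetical value; the real file contents are not shown in the question.
sample_value = "1.234567E+37"
precision, scale = precision_and_scale(sample_value)
type_name = "decimal(%d,%d)" % (precision, scale)

# Illustrative pattern (an assumption): allows only non-negative precision and scale.
NON_NEGATIVE_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")
parses = NON_NEGATIVE_DECIMAL.fullmatch(type_name) is not None

print(type_name, parses)
```

If this is what happened, reading the column as a string, as described above, avoids the problematic type altogether.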
Answer 3:
Please try the code below, which infers the schema and reads the header:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('operation').getOrCreate()
# No trailing space in the path, otherwise the file will not be found
df = spark.read.csv("C:/LEARNING/Spark_DataFrames/stock.csv", inferSchema=True, header=True)
df.show()
Source: https://stackoverflow.com/questions/43628701/inferschema-in-spark-csv-package