I tried a simple example like:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/
Because the header contains spaces or tabs, remove them and try again.
1) My example script
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv(r'test.csv', header=True, sep='^')
print("#################################################################")
df.printSchema()  # printSchema() prints directly and returns None
df.createOrReplaceTempView("test")
re = spark.sql("select max_seq from test")
re.show()  # show() also prints directly and returns None
print("################################################################")
2) Input file: here 'max_seq ' contains a trailing space, so we get the exception below
Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date
Traceback (most recent call last):
File "D:/spark-2.1.0-bin-hadoop2.7/bin/test.py", line 14, in <module>
re=spark.sql("select max_seq from test")
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 541, in sql
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`max_seq`' given input columns: [Venue_City_Name, Trx_Type, Trx_Booking_Status_Committed, Payment_Reference1, Trx_Date, max_seq , Event_ItemVariable_Name, Amount_CurrentPrice, cinema_screen_count, Payment_IsMyPayment, r
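If you cannot change the file, one workaround (a sketch, assuming the temp view from the script above) is to backtick-quote the identifier so Spark SQL matches the column name verbatim, trailing space included:
# Backticks preserve the exact name, including the trailing space
re = spark.sql("select `max_seq ` from test")
re.show()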
3) Remove the space after the 'max_seq' column and it works fine
Trx_ID^max_seq^Trx_Type^Trx_Record_Type^Trx_Date
17/03/20 12:16:25 INFO DAGScheduler: Job 3 finished: showString at <unknown>:0, took 0.047602 s
17/03/20 12:16:25 INFO CodeGenerator: Code generated in 8.494073 ms
+-------+
|max_seq|
+-------+
|     10|
|     23|
|     22|
|     22|
|    ...|
+-------+
only showing top 20 rows
##############################################################
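Instead of editing the file, you could also rename the offending column right after the load (a sketch, assuming the df from the script above):
# Rename 'max_seq ' (trailing space) to a clean name, then re-register the view
df = df.withColumnRenamed("max_seq ", "max_seq")
df.createOrReplaceTempView("test")
spark.sql("select max_seq from test").show()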
I found the issue: some of the column names contain whitespace before the name itself. So
data = data.select(" timedelta", " shares").rdd.map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
worked (note: in Spark 2.x you need .rdd before .map, and LabeledPoint comes from pyspark.mllib.regression). I could catch the whitespace using
assert " " not in ''.join(df.columns)
Now I am thinking of a way to remove the white spaces. Any idea is much appreciated!
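One simple way (a sketch, assuming the df loaded above) is to rename every column to its stripped form with toDF:
# toDF takes the full list of new column names, so strip each one
df = df.toDF(*[c.strip() for c in df.columns])
assert " " not in ''.join(df.columns)  # the earlier check now passes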
My input file contained tabs, so removing the stray tabs and spaces from the header made the query return the expected answer.
My example:
saledf = spark.read.csv("SalesLTProduct.txt", header=True, inferSchema=True, sep='\t')
saledf.printSchema()
root
|-- ProductID: string (nullable = true)
|-- Name: string (nullable = true)
|-- ProductNumber: string (nullable = true)
saledf.describe('ProductNumber').show()
+-------+-------------+
|summary|ProductNumber|
+-------+-------------+
| count| 295|
| mean| null|
| stddev| null|
| min| BB-7421|
| max| WB-H098|
+-------+-------------+
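To verify that no column name kept a stray tab or space after the load, a quick check (assuming the saledf above):
# Any name that differs from its stripped form still carries whitespace
bad = [c for c in saledf.columns if c != c.strip()]
print(bad)  # expect [] once the header is clean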