AnalysisException: u"cannot resolve 'name' given input columns: [ list] in sqlContext in spark

前端 未结 3 928
庸人自扰
庸人自扰 2021-02-19 08:30

I tried a simple example like:

data = sqlContext.read.format(\"csv\").option(\"header\", \"true\").option(\"inferSchema\", \"true\").load(\"/databricks-datasets/         


        
相关标签:
3条回答
  • 2021-02-19 09:05

    Because header contains spaces or tabs,remove spaces or tabs and try

    1) My example script

    from pyspark.sql import SparkSession
    spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
    
    df=spark.read.csv(r'test.csv',header=True,sep='^')
    print("#################################################################")
    print df.printSchema()
    df.createOrReplaceTempView("test")
    re=spark.sql("select max_seq from test")
    print(re.show())
    print("################################################################")
    

    2) Input file,here 'max_seq ' contains space so we are getting bellow exception

    Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date
    
    Traceback (most recent call last):
      File "D:/spark-2.1.0-bin-hadoop2.7/bin/test.py", line 14, in <module>
        re=spark.sql("select max_seq from test")
      File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 541, in sql
      File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
      File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 69, in deco
    pyspark.sql.utils.AnalysisException: u"cannot resolve '`max_seq`' given input columns: [Venue_City_Name, Trx_Type, Trx_Booking_Status_Committed, Payment_Reference1, Trx_Date, max_seq , Event_ItemVariable_Name, Amount_CurrentPrice, cinema_screen_count, Payment_IsMyPayment, r
    

    2) Remove space after 'max_seq' column then it will work fine

    Trx_ID^max_seq^Trx_Type^Trx_Record_Type^Trx_Date
    
    
    17/03/20 12:16:25 INFO DAGScheduler: Job 3 finished: showString at <unknown>:0, took 0.047602 s
    17/03/20 12:16:25 INFO CodeGenerator: Code generated in 8.494073 ms
      max_seq
        10
        23
        22
        22
    only showing top 20 rows
    
    None
    ##############################################################
    
    0 讨论(0)
  • 2021-02-19 09:13

    I found the issue: some of the column names contain white spaces before the name itself. So

    data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
    

    worked. I could catch the white spaces using

    assert " " not in ''.join(df.columns)  
    

    Now I am thinking of a way to remove the white spaces. Any idea is much appreciated!

    0 讨论(0)
  • 2021-02-19 09:28
    As there were tabs in my input file, removing the tabs or spaces in the header helped display the answer.
    
    My example:
    
    saledf = spark.read.csv("SalesLTProduct.txt", header=True, inferSchema= True, sep='\t')
    
    
    saledf.printSchema()
    
    root
    |-- ProductID: string (nullable = true)
    |-- Name: string (nullable = true)
    |-- ProductNumber: string (nullable = true)
    
    saledf.describe('ProductNumber').show()
    
     +-------+-------------+
     |summary|ProductNumber|
     +-------+-------------+
     |  count|          295|
     |   mean|         null|
     | stddev|         null|
     |    min|      BB-7421|
     |    max|      WB-H098|
     +-------+-------------+
    
    0 讨论(0)
提交回复
热议问题