pyspark: TypeError: IntegerType can not accept object in type

爱一瞬间的悲伤 2021-02-20 17:11

I am programming with PySpark on a Spark cluster. The data is large and split into pieces, so it cannot be loaded into memory or easily sanity-checked.

basically it loo

2 Answers
  •  鱼传尺愫
    2021-02-20 17:53

    As noted by ccheneson, you pass the wrong types.

    Assuming your data looks like this:

    data = sc.parallelize(["af.b Current%20events 1 996"])
    

    After the first map you get an RDD[List[String]]:

    parts = data.map(lambda l: l.split())
    parts.first()
    ## ['af.b', 'Current%20events', '1', '996']
    

    The second map converts it to a tuple (String, String, String, String):

    wikis = parts.map(lambda p: (p[0], p[1], p[2],p[3]))
    wikis.first()
    ## ('af.b', 'Current%20events', '1', '996')
    

    Your schema states that the 3rd column is an integer:

    [f.dataType for f in schema.fields]
    ## [StringType, StringType, IntegerType, StringType]
    

    The schema is used mostly to avoid a full table scan to infer types; it doesn't perform any type casting.
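
    For reference, a minimal sketch of that setup and the failure it produces; the sqlContext handle and the field names other than count are assumptions:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Hypothetical reconstruction of the schema from the question;
    # only the "count" name and the order of the types are confirmed above.
    fields = [StructField("project", StringType(), True),
              StructField("page", StringType(), True),
              StructField("count", IntegerType(), True),
              StructField("views", StringType(), True)]
    schema = StructType(fields)

    # No casting is applied, so the string '1' is checked against IntegerType
    # and row verification raises the error from the title, roughly:
    # TypeError: IntegerType can not accept object '1' in type <class 'str'>
    # (it surfaces once an action forces evaluation; wording varies by version)
    sqlContext.createDataFrame(wikis, schema).collect()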

    You can either cast your data during the last map:

    wikis = parts.map(lambda p: (p[0], p[1], int(p[2]), p[3]))
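
    With that cast in place the schema applies cleanly; a quick check, reusing the hypothetical field names and sqlContext from the sketch above:

    wikis.toDF(schema).first()
    ## Row(project='af.b', page='Current%20events', count=1, views='996')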
    

    Or define count as a StringType and cast the column afterwards:

    from pyspark.sql.functions import col

    fields[2] = StructField("count", StringType(), True)
    schema = StructType(fields)

    wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
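
    A quick way to confirm the second approach worked, again with the hypothetical field names from the sketch above (and wikis holding the all-string tuples):

    df = wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
    df.dtypes
    ## [('project', 'string'), ('page', 'string'), ('views', 'string'), ('cnt', 'int')]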
    

    On a side note, count is a reserved word in SQL and shouldn't be used as a column name. In Spark it will work as expected in some contexts and fail in others.
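
    If the name has to stay, one common workaround is to quote it with backticks whenever it passes through the SQL parser; a sketch, with a made-up temporary table name and the StringType schema from the snippet above:

    df = wikis.toDF(schema)
    df.registerTempTable("wikis_table")
    # Backticks make the parser treat `count` as a column name rather than
    # the aggregate function; unquoted it may fail to parse in some versions.
    sqlContext.sql("SELECT `count` FROM wikis_table LIMIT 1").show()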
