pyspark: TypeError: IntegerType can not accept object in type

爱一瞬间的悲伤 2021-02-20 17:11

I am programming with PySpark on a Spark cluster. The data is large and split into pieces, so it cannot be loaded into memory or easily sanity-checked.

basically it loo

2 Answers
  •  鱼传尺愫
    2021-02-20 17:53

    As noted by ccheneson, you pass the wrong types.

    Assuming your data looks like this:

    data = sc.parallelize(["af.b Current%20events 1 996"])
    

    After the first map you get an RDD[List[String]]:

    parts = data.map(lambda l: l.split())
    parts.first()
    ## ['af.b', 'Current%20events', '1', '996']
    

    The second map converts it to a tuple (String, String, String, String):

    wikis = parts.map(lambda p: (p[0], p[1], p[2],p[3]))
    wikis.first()
    ## ('af.b', 'Current%20events', '1', '996')
    

    Your schema states that the 3rd column is an integer:

    [f.dataType for f in schema.fields]
    ## [StringType, StringType, IntegerType, StringType]
    

    The schema is used mostly to avoid a full table scan to infer types; it doesn't perform any type casting.
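
    For reference, a minimal sketch of that setup and the failure it produces; the sqlContext handle and the field names other than count are assumptions:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Hypothetical reconstruction of the schema from the question;
    # only the "count" name and the order of the types are confirmed above.
    fields = [StructField("project", StringType(), True),
              StructField("page", StringType(), True),
              StructField("count", IntegerType(), True),
              StructField("views", StringType(), True)]
    schema = StructType(fields)

    # No casting is applied, so the string '1' is checked against IntegerType
    # and row verification raises the error from the title, roughly:
    # TypeError: IntegerType can not accept object '1' in type <class 'str'>
    # (it surfaces once an action forces evaluation; wording varies by version)
    sqlContext.createDataFrame(wikis, schema).collect()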

    You can either cast your data during the last map:

    wikis = parts.map(lambda p: (p[0], p[1], int(p[2]), p[3]))
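
    With that cast in place the schema applies cleanly; a quick check, reusing the hypothetical field names and sqlContext from the sketch above:

    wikis.toDF(schema).first()
    ## Row(project='af.b', page='Current%20events', count=1, views='996')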
    

    Or define count as a StringType and cast the column afterwards:

    from pyspark.sql.functions import col

    fields[2] = StructField("count", StringType(), True)
    schema = StructType(fields)

    wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
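
    A quick way to confirm the second approach worked, again with the hypothetical field names from the sketch above (and wikis holding the all-string tuples):

    df = wikis.toDF(schema).withColumn("cnt", col("count").cast("integer")).drop("count")
    df.dtypes
    ## [('project', 'string'), ('page', 'string'), ('views', 'string'), ('cnt', 'int')]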
    

    On a side note, count is a reserved word in SQL and shouldn't be used as a column name. In Spark it will work as expected in some contexts and fail in others.
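
    If the name has to stay, one common workaround is to quote it with backticks whenever it passes through the SQL parser; a sketch, with a made-up temporary table name and the StringType schema from the snippet above:

    df = wikis.toDF(schema)
    df.registerTempTable("wikis_table")
    # Backticks make the parser treat `count` as a column name rather than
    # the aggregate function; unquoted it may fail to parse in some versions.
    sqlContext.sql("SELECT `count` FROM wikis_table LIMIT 1").show()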
