Question
I'm trying to learn machine learning with PySpark. I have a dataset with a couple of String columns whose values are either True or False, or Yes or No. I'm working with DecisionTree and I want to convert these String values to the corresponding Double values, i.e. True and Yes should change to 1.0, and False and No should change to 0.0. I saw a tutorial that did the same thing, and I came up with this code:
df = sqlContext.read.csv("C:/../churn-bigml-20.csv",inferSchema=True,header=True)
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction
binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
csv_data = df.drop('State').drop('Area code') \
    .withColumn('Churn', toNum(df['Churn'])) \
    .withColumn('International plan', toNum(df['International plan'])) \
    .withColumn('Voice mail plan', toNum(df['Voice mail plan'])).cache()
However, when I run this, I get many errors that look like this:
File "C:\..\spark-2.1.0\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
File "C:\..\workspace\PyML\src\ModelBuilding.py", line 20, in <lambda>
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
KeyError: False
Note: I'm working with PySpark on Spark 2.1 and Python 3.5, and I guess the tutorial I'm following uses Spark 1.6 and Python 2.7. So I don't know whether this is a Python syntax issue.
Answer 1:
I solved it by changing the mapping to:
binary_map = {'Yes':1.0, 'No':0.0, True : 1.0, False : 0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())
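I just removed the quotes from True and False. I thought that was weird, but when I checked the schema of the DataFrame using df.printSchema(), it showed that the field holding the True and False values is of type boolean. Because the CSV was read with inferSchema=True, the UDF receives Python booleans for that column, not the strings 'True' and 'False'.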
The Schema
root
|-- State: string (nullable = true)
|-- Account length: integer (nullable = true)
|-- Area code: integer (nullable = true)
|-- International plan: string (nullable = true)
|-- Voice mail plan: string (nullable = true)
.
.
.
|-- Customer service calls: integer (nullable = true)
|-- Churn: boolean (nullable = true)
So that's why I had to take the quotes off. Thank you.
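
For anyone who wants to see the failure without starting Spark, here is a minimal plain-Python sketch of the same lookup. It assumes the UDF is handed a Python bool for the boolean column, which is what the traceback suggests. The names string_only_map and to_num are hypothetical, and using .get() so that unexpected values (including nulls) return None instead of raising is my own addition, not part of the fix above.

```python
# Mapping from the question: every key is a string.
string_only_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

# Mapping from the answer: boolean keys for the boolean column.
binary_map = {'Yes': 1.0, 'No': 0.0, True: 1.0, False: 0.0}


def to_num(value):
    """Hypothetical helper: 1.0/0.0 for known values, None for anything else."""
    return binary_map.get(value)


# The boolean False is not the string 'False', so the original lookup fails.
try:
    string_only_map[False]
except KeyError as err:
    print("KeyError:", err)

# The corrected mapping handles both the bool column and the Yes/No columns.
print(to_num(False), to_num('Yes'), to_num(None))
```

If you go this way, the lambda passed to UserDefinedFunction could call binary_map.get(k) instead of binary_map[k]; whether a silent None is preferable to an error depends on how clean your data is.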
Source: https://stackoverflow.com/questions/43511085/pyspark-keyerror-when-converting-a-dataframe-column-of-string-type-to-double