I am working on a problem where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark.
Generally speaking, if you have data that can be processed using Pandas data frames and scikit-learn, using Spark is serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start by indexing your features:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.pipeline import Pipeline

label_col = "x3"  # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
      .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    # For classification problems
    # - if you want to use ML you should index the label as well
    # - if you want to use MLlib it is not necessary
    # For regression problems you should omit the label from indexing,
    # as shown below
    for x in df.columns if x not in {label_col}  # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)

pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)
The pipeline defined above will produce the following data frame:
indexed.printSchema()
## root
## |-- x0: string (nullable = true)
## |-- x1: string (nullable = true)
## |-- x2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- idx_x0: double (nullable = true)
## |-- idx_x1: double (nullable = true)
## |-- idx_x2: double (nullable = true)
## |-- features: vector (nullable = true)
where features
should be a valid input for mllib.tree.DecisionTree
(see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
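If you go the MLlib route you will also need categoricalFeaturesInfo. A minimal sketch, assuming Spark >= 1.5 (where a fitted StringIndexerModel exposes its labels) and relying on the fact that the indexer stages sit at the front of the fitted pipeline in the same order as the assembled columns:

feature_cols = [x for x in df.columns if x != label_col]

# Map each slot of the features vector to its number of categories
categorical_features_info = {
    i: len(model.stages[i].labels)
    for i in range(len(feature_cols))
}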
You can create labeled points from the indexed data frame as follows:
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    # The cast makes sure the label is numeric; x3 is a string column
    .select(col(label_col).cast("double").alias("label"), col("features"))
    .map(lambda row: LabeledPoint(row.label, row.features)))
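Finally, a minimal, untuned sketch of training a regression tree on top of that (the parameter values here are placeholders, not recommendations):

from pyspark.mllib.tree import DecisionTree

# maxBins must be at least as large as the biggest category count
# in categorical_features_info, otherwise training will fail
tree_model = DecisionTree.trainRegressor(
    label_points,
    categoricalFeaturesInfo=categorical_features_info,
    impurity="variance",
    maxDepth=5,
    maxBins=32
)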