Issue encoding non-numeric features as numeric in Spark and IPython

迷失自我 2021-01-28 03:16

I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Sp

1 answer
  • 2021-01-28 03:58

    Generally speaking, if your data can be processed with Pandas data frames and scikit-learn, using Spark is serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start by indexing your features:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    
    label_col = "x3"  # For example
    
    # I assume this comes from your previous question
    df = (rdd.map(lambda row: [row[i] for i in columns_num])
        .toDF(("x0", "x1", "x2", "x3")))
    
    # Indexers encode strings as doubles
    string_indexers = [
       StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    
       # For classification problems:
       #   - if you want to use ML, you should index the label as well
       #   - if you want to use MLlib, it is not necessary
       # For regression problems, omit the label from the indexing,
       # as shown below
       for x in df.columns if x not in {label_col}  # Exclude other columns if needed
    ]
    
    # Assembles multiple columns into a single vector
    assembler = VectorAssembler(
        inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
        outputCol="features"
    )
    
    
    pipeline = Pipeline(stages=string_indexers + [assembler])
    model = pipeline.fit(df)
    indexed = model.transform(df)
    

    The pipeline defined above will create the following data frame:

    indexed.printSchema()
    ## root
    ##  |-- x0: string (nullable = true)
    ##  |-- x1: string (nullable = true)
    ##  |-- x2: string (nullable = true)
    ##  |-- x3: string (nullable = true)
    ##  |-- idx_x0: double (nullable = true)
    ##  |-- idx_x1: double (nullable = true)
    ##  |-- idx_x2: double (nullable = true)
    ##  |-- features: vector (nullable = true)
    

    where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
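
    To make the link between the indexed columns and categoricalFeaturesInfo concrete, here is a minimal pure-Python sketch (no Spark required) of the two ideas involved. StringIndexer assigns indices by label frequency, with the most frequent label getting 0.0, and the decision tree needs to know how many categories each feature position has. The helper names, the toy data, and the alphabetical tie-breaking are assumptions made for illustration only; they are not part of the Spark API.

```python
from collections import Counter


def index_by_frequency(values):
    """Map each distinct string to a float index, most frequent first.

    Ties are broken alphabetically here; that is an assumption of this
    sketch, not a guarantee about every Spark version.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def categorical_features_info(columns, feature_names):
    """Return {feature position: number of categories}.

    `columns` maps a column name to its list of raw string values.
    Positions follow the order of `feature_names`, which must match the
    order of the assembler's inputCols.
    """
    return {
        position: len(set(columns[name]))
        for position, name in enumerate(feature_names)
    }


# Hypothetical toy data standing in for the x0..x2 string columns
columns = {
    "x0": ["a", "b", "a", "c"],
    "x1": ["u", "u", "v", "u"],
    "x2": ["p", "q", "r", "s"],
}
mapping = index_by_frequency(columns["x0"])
info = categorical_features_info(columns, ["x0", "x1", "x2"])
```

    A dictionary of this shape is what DecisionTree.trainRegressor takes as its categoricalFeaturesInfo argument. Note that maxBins has to be at least as large as the largest category count, so high-cardinality columns may need a larger value than the default.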

    You can create labeled points from the indexed data frame as follows:

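    from pyspark.mllib.regression import LabeledPoint
    from pyspark.sql.functions import col
    
    # DataFrame.map was removed in Spark 2.0, so go through the underlying RDD.
    # On Spark 2.0+, `features` holds pyspark.ml vectors; convert each one with
    # pyspark.mllib.linalg.Vectors.fromML before building the LabeledPoint.
    label_points = (indexed
        .select(col(label_col).cast("double").alias("label"), col("features"))
        .rdd
        .map(lambda row: LabeledPoint(row.label, row.features)))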
    