I am working on a problem where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark.
Generally speaking, if you have data that can be processed using Pandas data frames and scikit-learn, using Spark is serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start by indexing your features:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.pipeline import Pipeline

label_col = "x3"  # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
      .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    # For classification problems
    # - if you want to use ML you should index the label as well
    # - if you want to use MLlib it is not necessary
    # For regression problems you should omit the label from indexing,
    # as shown below
    for x in df.columns if x not in {label_col}  # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)

pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)
The pipeline defined above will produce the following data frame:
indexed.printSchema()
## root
## |-- x0: string (nullable = true)
## |-- x1: string (nullable = true)
## |-- x2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- idx_x0: double (nullable = true)
## |-- idx_x1: double (nullable = true)
## |-- idx_x2: double (nullable = true)
## |-- features: vector (nullable = true)
where features
should be a valid input for mllib.tree.DecisionTree
(see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
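If you go the MLlib route you will also need categoricalFeaturesInfo. A minimal sketch, assuming Spark >= 1.5 (where a fitted StringIndexerModel exposes its labels) and relying on the fact that the indexer stages sit at the front of the fitted pipeline in the same order as the assembled columns:

feature_cols = [x for x in df.columns if x != label_col]

# Map each slot of the features vector to its number of categories
categorical_features_info = {
    i: len(model.stages[i].labels)
    for i in range(len(feature_cols))
}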
You can create labeled points from the indexed data frame as follows:
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

label_points = (indexed
    # The cast makes sure the label is numeric; x3 is a string column
    .select(col(label_col).cast("double").alias("label"), col("features"))
    .map(lambda row: LabeledPoint(row.label, row.features)))
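Finally, a minimal, untuned sketch of training a regression tree on top of that (the parameter values here are placeholders, not recommendations):

from pyspark.mllib.tree import DecisionTree

# maxBins must be at least as large as the biggest category count
# in categorical_features_info, otherwise training will fail
tree_model = DecisionTree.trainRegressor(
    label_points,
    categoricalFeaturesInfo=categorical_features_info,
    impurity="variance",
    maxDepth=5,
    maxBins=32
)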