Oversampling or SMOTE in Pyspark

前端 未结 2 556
野趣味
野趣味 2021-01-13 00:10

I have 7 classes and the total number of records are 115 and I wanted to run Random Forest model over this data. But as the data is not enough to get a high accuracy. So i w

相关标签:
2条回答
  • 2021-01-13 00:38

    Here is another implementation of Pyspark and Scala smote that I have used in the past. I have copped the code across and referenced the source because its quite small:

    Pyspark:

    import random
    import numpy as np
    from pyspark.sql import Row
    from sklearn import neighbors
    from pyspark.ml.feature import VectorAssembler
    
    def vectorizerFunction(dataInput, TargetFieldName):
        if(dataInput.select(TargetFieldName).distinct().count() != 2):
            raise ValueError("Target field must have only 2 distinct classes")
        columnNames = list(dataInput.columns)
        columnNames.remove(TargetFieldName)
        dataInput = dataInput.select((','.join(columnNames)+','+TargetFieldName).split(','))
        assembler=VectorAssembler(inputCols = columnNames, outputCol = 'features')
        pos_vectorized = assembler.transform(dataInput)
        vectorized = pos_vectorized.select('features',TargetFieldName).withColumn('label',pos_vectorized[TargetFieldName]).drop(TargetFieldName)
        return vectorized
    
    def SmoteSampling(vectorized, k = 5, minorityClass = 1, majorityClass = 0, percentageOver = 200, percentageUnder = 100):
        if(percentageUnder > 100|percentageUnder < 10):
            raise ValueError("Percentage Under must be in range 10 - 100");
        if(percentageOver < 100):
            raise ValueError("Percentage Over must be in at least 100");
        dataInput_min = vectorized[vectorized['label'] == minorityClass]
        dataInput_maj = vectorized[vectorized['label'] == majorityClass]
        feature = dataInput_min.select('features')
        feature = feature.rdd
        feature = feature.map(lambda x: x[0])
        feature = feature.collect()
        feature = np.asarray(feature)
        nbrs = neighbors.NearestNeighbors(n_neighbors=k, algorithm='auto').fit(feature)
        neighbours =  nbrs.kneighbors(feature)
        gap = neighbours[0]
        neighbours = neighbours[1]
        min_rdd = dataInput_min.drop('label').rdd
        pos_rddArray = min_rdd.map(lambda x : list(x))
        pos_ListArray = pos_rddArray.collect()
        min_Array = list(pos_ListArray)
        newRows = []
        nt = len(min_Array)
        nexs = percentageOver/100
        for i in range(nt):
            for j in range(nexs):
                neigh = random.randint(1,k)
                difs = min_Array[neigh][0] - min_Array[i][0]
                newRec = (min_Array[i][0]+random.random()*difs)
                newRows.insert(0,(newRec))
        newData_rdd = sc.parallelize(newRows)
        newData_rdd_new = newData_rdd.map(lambda x: Row(features = x, label = 1))
        new_data = newData_rdd_new.toDF()
        new_data_minor = dataInput_min.unionAll(new_data)
        new_data_major = dataInput_maj.sample(False, (float(percentageUnder)/float(100)))
        return new_data_major.unionAll(new_data_minor)
    
    dataInput = spark.read.format('csv').options(header='true',inferSchema='true').load("sam.csv").dropna()
    SmoteSampling(vectorizerFunction(dataInput, 'Y'), k = 2, minorityClass = 1, majorityClass = 0, percentageOver = 90, percentageUnder = 5)
    

    Scala:

    // Import the necessary packages
    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.functions.rand
    import org.apache.spark.sql.functions._
    
    object smoteClass{
      def KNNCalculation(
        dataFinal:org.apache.spark.sql.DataFrame,
        feature:String,
        reqrows:Int,
        BucketLength:Int,
        NumHashTables:Int):org.apache.spark.sql.DataFrame = {
          val b1 = dataFinal.withColumn("index", row_number().over(Window.partitionBy("label").orderBy("label")))
          val brp = new BucketedRandomProjectionLSH().setBucketLength(BucketLength).setNumHashTables(NumHashTables).setInputCol(feature).setOutputCol("values")
          val model = brp.fit(b1)
          val transformedA = model.transform(b1)
          val transformedB = model.transform(b1)
          val b2 = model.approxSimilarityJoin(transformedA, transformedB, 2000000000.0)
          require(b2.count > reqrows, println("Change bucket lenght or reduce the percentageOver"))
          val b3 = b2.selectExpr("datasetA.index as id1",
            "datasetA.feature as k1",
            "datasetB.index as id2",
            "datasetB.feature as k2",
            "distCol").filter("distCol>0.0").orderBy("id1", "distCol").dropDuplicates().limit(reqrows)
          return b3
      }
    
      def smoteCalc(key1: org.apache.spark.ml.linalg.Vector, key2: org.apache.spark.ml.linalg.Vector)={
        val resArray = Array(key1, key2)
        val res = key1.toArray.zip(key2.toArray.zip(key1.toArray).map(x => x._1 - x._2).map(_*0.2)).map(x => x._1 + x._2)
        resArray :+ org.apache.spark.ml.linalg.Vectors.dense(res)}
    
      def Smote(
        inputFrame:org.apache.spark.sql.DataFrame,
        feature:String,
        label:String,
        percentOver:Int,
        BucketLength:Int,
        NumHashTables:Int):org.apache.spark.sql.DataFrame = {
          val groupedData = inputFrame.groupBy(label).count
          require(groupedData.count == 2, println("Only 2 labels allowed"))
          val classAll = groupedData.collect()
          val minorityclass = if (classAll(0)(1).toString.toInt > classAll(1)(1).toString.toInt) classAll(1)(0).toString else classAll(0)(0).toString
          val frame = inputFrame.select(feature,label).where(label + " == " + minorityclass)
          val rowCount = frame.count
          val reqrows = (rowCount * (percentOver/100)).toInt
          val md = udf(smoteCalc _)
          val b1 = KNNCalculation(frame, feature, reqrows, BucketLength, NumHashTables)
          val b2 = b1.withColumn("ndtata", md($"k1", $"k2")).select("ndtata")
          val b3 = b2.withColumn("AllFeatures", explode($"ndtata")).select("AllFeatures").dropDuplicates
          val b4 = b3.withColumn(label, lit(minorityclass).cast(frame.schema(1).dataType))
          return inputFrame.union(b4).dropDuplicates
      }
    }
    
    • Source
    0 讨论(0)
  • 2021-01-13 00:50

    Maybe this project can be useful for your goal: Spark SMOTE

    But I think that 115 records aren't enough for a random forest. You can use other simplest technique like decision trees

    You can check this answer:

    Is Random Forest suitable for very small data sets?

    0 讨论(0)
提交回复
热议问题