Anomaly detection with PCA in Spark


I read the following article:

Anomaly detection with Principal Component Analysis (PCA)

The article states the following:

• PCA algorithm basica…

1 Answer

执笔经年 · 2021-01-23 06:56

    Let's assume you have a dataset of 3-dimensional points. Each point has coordinates (x, y, z); those are its dimensions. A point is represented by three values, e.g. (8, 7, 4). This is called an input vector.
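
    For example, with Spark ML such a point is just a dense vector. A quick sketch (Vectors.dense is the standard factory in org.apache.spark.ml.linalg):

    import org.apache.spark.ml.linalg.Vectors

    // the input vector (8, 7, 4): one point whose dimensions are x, y and z
    val point = Vectors.dense(8.0, 7.0, 4.0)
    println(point.size) // 3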

    When you apply the PCA algorithm you essentially transform your input vector into a new, shorter vector. It can be represented as a function that maps (x, y, z) => (v, w).

    Example: (8, 7, 4) => (-4, 13)
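
    Under the hood the fitted PCAModel stores the principal components as a matrix, and the transformation is just a matrix-vector product. A minimal sketch, assuming a fitted model called pcaModel (the variable name is hypothetical; pc is the actual field on PCAModel):

    import breeze.linalg.{DenseMatrix, DenseVector}

    // pc holds the top-k principal components as a 3 x k column-major matrix W
    val pc = pcaModel.pc
    val components = new DenseMatrix(pc.numRows, pc.numCols, pc.values)

    // projecting the input vector x gives the reduced vector v = Wᵀ x
    val x = DenseVector(8.0, 7.0, 4.0)
    val reduced = components.t * x // 2 entries, like (-4, 13) in the example above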

    Now you have a shorter vector (you reduced the number of dimensions), but your point still has coordinates, namely (v, w). This means you can compute the distance between points using the Mahalanobis measure. Points that lie far from the mean coordinate are the anomalies.
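
    Concretely, the squared Mahalanobis distance of a point v from the mean μ, given a covariance matrix Σ, is

    D²(v) = (v − μ)ᵀ Σ⁻¹ (v − μ)

    In the code below the input is standardized before PCA, so the projected points are already centered (μ ≈ 0) and the score reduces to vᵀ Σ⁻¹ v.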

    Example solution:

    import breeze.linalg.{DenseVector, inv}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{PCA, StandardScaler, VectorAssembler}
    import org.apache.spark.ml.linalg.{Matrix, Vector}
    import org.apache.spark.ml.stat.Correlation
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.functions._
    
    object SparkApp extends App {
      val session = SparkSession.builder()
        .appName("spark-app").master("local[*]").getOrCreate()
      session.sparkContext.setLogLevel("ERROR")
      import session.implicits._
    
      val df = Seq(
        (1, 4, 0),
        (3, 4, 0),
        (1, 3, 0),
        (3, 3, 0),
        (67, 37, 0) // outlier
      ).toDF("x", "y", "z")
    
      // assemble the raw columns into a single feature vector
      val vectorAssembler = new VectorAssembler()
        .setInputCols(Array("x", "y", "z")).setOutputCol("vector")
    
      // center and scale every dimension so PCA is not dominated by units
      val standardScaler = new StandardScaler()
        .setInputCol("vector").setOutputCol("normalized-vector")
        .setWithMean(true).setWithStd(true)
    
      // project the 3-dimensional points onto the top 2 principal components
      val pca = new PCA()
        .setInputCol("normalized-vector").setOutputCol("pca-features").setK(2)
    
      val pipeline = new Pipeline().setStages(
        Array(vectorAssembler, standardScaler, pca)
      )
    
      val pcaDF = pipeline.fit(df).transform(df)
    
      def withMahalanobis(df: DataFrame, inputCol: String): DataFrame = {
        // correlation matrix of the projected features; a strict Mahalanobis
        // distance would invert the covariance matrix instead
        val Row(coeff1: Matrix) = Correlation.corr(df, inputCol).head
    
        val invCovariance = inv(
          new breeze.linalg.DenseMatrix(coeff1.numRows, coeff1.numCols, coeff1.toArray))
    
        // squared Mahalanobis distance vᵀ M⁻¹ v; the PCA scores are already
        // centered because the input was standardized with setWithMean(true)
        val mahalanobis = udf[Double, Vector] { v =>
          val vB = DenseVector(v.toArray)
          vB.t * invCovariance * vB
        }
    
        df.withColumn("mahalanobis", mahalanobis(df(inputCol)))
      }
    
      // the val needs a name different from the method, otherwise the object
      // declares `withMahalanobis` twice and fails to compile
      val scoredDF: DataFrame = withMahalanobis(pcaDF, "pca-features")
      scoredDF.select("x", "y", "z", "mahalanobis").show(truncate = false)
    
      session.close()
    }
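
    To actually flag the anomalies you can threshold the score. One common choice, an assumption on my part rather than something from the original article, is a quantile of the chi-squared distribution with k degrees of freedom, since squared Mahalanobis distances of roughly Gaussian data follow that distribution. Continuing inside the object above, where session.implicits._ is in scope:

    // 5.99 ≈ 95th percentile of the chi-squared distribution with k = 2
    val threshold = 5.99
    val anomalies = scoredDF.filter($"mahalanobis" > threshold)
    anomalies.show(truncate = false) // the (67, 37, 0) row gets by far the largest score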
    
