apache-spark-ml

Should we parallelize a DataFrame like we parallelize a Seq before training

Submitted by 不羁的心 on 2021-02-17 15:36:40
Question: Consider the code given here: https://spark.apache.org/docs/1.2.0/ml-guide.html

    import org.apache.spark.ml.classification.LogisticRegression

    val training = sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

    val lr = new LogisticRegression()
    lr.setMaxIter(10).setRegParam(0.01)
    val model1 = lr.fit(training)

Assuming we …
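Since the excerpt is cut off, here is a minimal hedged sketch (in PySpark, assuming an active SparkSession) of the point the newer DataFrame-based API makes: a DataFrame is already distributed across the cluster, so unlike a local Seq it needs no explicit parallelize() before training.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

    # createDataFrame already distributes the rows across the cluster;
    # no parallelize() call is needed before fit()
    training = spark.createDataFrame(
        [(1.0, Vectors.dense(0.0, 1.1, 0.1)),
         (0.0, Vectors.dense(2.0, 1.0, -1.0)),
         (0.0, Vectors.dense(2.0, 1.3, 1.0)),
         (1.0, Vectors.dense(0.0, 1.2, -0.5))],
        ["label", "features"])

    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(training)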

Training ML models on Spark per partition, so that there is one trained model per partition of the DataFrame

Submitted by 余生颓废 on 2021-02-11 15:41:37
Question: How do you do parallel model training per partition in Spark using Scala? The solution given here is in PySpark; I'm looking for a solution in Scala: How can you efficiently build one ML model per partition in Spark with foreachPartition?

Answer 1: Get the distinct partitions using the partition column, create a thread pool of, say, 100 threads, and create a future object for each thread. Sample code may be as follows:

    // Get an ExecutorService
    val threadPoolExecutorService = getExecutionContext("name", 100)
    // …
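The question asks for Scala and the snippet above is truncated; as a hedged sketch of the same driver-side thread-pool pattern in PySpark (df, partition_col, and the choice of LinearRegression are placeholders; each slice is assumed to carry the default label and features columns): each thread submits an independent fit() job, so Spark can schedule the per-partition training jobs concurrently.

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.ml.regression import LinearRegression

    def fit_one(value):
        # Filter to a single partition value and fit a model on that slice
        subset = df.where(df["partition_col"] == value)
        return value, LinearRegression().fit(subset)

    values = [r[0] for r in df.select("partition_col").distinct().collect()]
    with ThreadPoolExecutor(max_workers=100) as pool:
        models = dict(pool.map(fit_one, values))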

pyspark - Convert sparse vector obtained after one hot encoding into columns

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-07 18:43:41
Question: I am using Apache Spark ML lib to handle categorical features using one-hot encoding. After writing the code below I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert this vector into columns so that I get a new transformed DataFrame. Take this dataset for example:

    >>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
    >>> ss = StringIndexer…
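A hedged sketch of one way to expand the encoded vector into per-category columns (assumes Spark 3.0+ for vector_to_array and an active spark session; column names follow the question):

    from pyspark.ml.feature import StringIndexer, OneHotEncoder
    from pyspark.ml.functions import vector_to_array

    fd = spark.createDataFrame(
        [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])

    indexer = StringIndexer(inputCol="c", outputCol="c_idx").fit(fd)
    indexed = indexer.transform(fd)
    encoded = (OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"])
               .fit(indexed).transform(indexed))

    # Turn the sparse vector into an array, then one column per category.
    # The default dropLast=True drops the last category, hence len - 1.
    arr = encoded.withColumn("arr", vector_to_array("c_idx_vec"))
    labels = indexer.labels
    cols = [arr["arr"][i].alias("c_" + labels[i]) for i in range(len(labels) - 1)]
    result = arr.select("x", "c", *cols)
    result.show()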

PySpark: Getting output layer neuron values for Spark ML Multilayer Perceptron Classifier

Submitted by 北战南征 on 2021-02-07 09:11:16
Question: I am doing binary classification using the Spark ML Multilayer Perceptron Classifier.

    mlp = MultilayerPerceptronClassifier(labelCol="evt", featuresCol="features",
                                         layers=[inputneurons, (inputneurons*2)+1, 2])

The output layer has two neurons, as it is a binary classification problem. Now I would like to get the values of the two neurons for each row in the test set, instead of just getting the prediction column containing either 0 or 1. I could not find anything for that in the API documentation.

Answer 1: …
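A hedged sketch of the usual answer (assuming Spark 2.3+, where the fitted MLP model is probabilistic; train_df and test_df are placeholder names): the transformed output carries rawPrediction and probability columns with the per-class scores, alongside prediction.

    model = mlp.fit(train_df)
    preds = model.transform(test_df)
    # probability: per-class softmax scores; rawPrediction: pre-threshold margins
    preds.select("rawPrediction", "probability", "prediction").show(truncate=False)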

StandardScaler returns NaN

Submitted by 时间秒杀一切 on 2021-01-29 17:50:06
Question: Environment: spark-1.6.0 with scala-2.10.4. Usage:

    // row of df: DataFrame = (String, String, double, Vector) as (id1, id2, label, feature)
    val df = sqlContext.read.parquet("data/Labeled.parquet")

    val SC = new StandardScaler()
      .setInputCol("feature").setOutputCol("scaled")
      .setWithMean(false).setWithStd(true).fit(df)

    val scaled = SC.transform(df)
      .drop("feature").withColumnRenamed("scaled", "feature")

The code follows the example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler. NaN exists in …
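The excerpt cuts off mid-sentence; a common first diagnostic is to check whether NaNs already exist in the input features, since the scaler propagates them. A hedged PySpark sketch (assumes Spark 3.1+ for the higher-order exists function; df and the feature column follow the question):

    import pyspark.sql.functions as F
    from pyspark.ml.functions import vector_to_array

    # Count rows whose feature vector already contains a NaN before scaling
    n_bad = (df.withColumn("arr", vector_to_array("feature"))
               .where(F.exists("arr", lambda x: F.isnan(x)))
               .count())
    print("rows with NaN features:", n_bad)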

How to find the argmax of a vector in PySpark ML

Submitted by ∥☆過路亽.° on 2021-01-07 05:48:27
Question: My model has output a DenseVector column, and I'd like to find the argmax. This page suggests such a function should be available, but I'm not sure what the syntax should be. Is it df.select("mycolumn").argmax()?

Answer 1: I could not find documentation for an argmax operation in Python, but you can do it by converting the vectors to arrays. For PySpark 3.0.0:

    from pyspark.ml.functions import vector_to_array

    tst_arr = tst_df.withColumn("arr", vector_to_array(F.col("vector_column")))
    tst_max = tst_arr.withColumn…
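The answer above is truncated; a hedged completion of the same idea (assumes Spark 3.0+; array_position is 1-based, hence the - 1):

    import pyspark.sql.functions as F
    from pyspark.ml.functions import vector_to_array

    tst_arr = tst_df.withColumn("arr", vector_to_array(F.col("vector_column")))
    # argmax = 1-based position of the maximum element, shifted to 0-based
    tst_max = tst_arr.withColumn(
        "argmax", F.expr("array_position(arr, array_max(arr)) - 1"))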

Covid Death Predictions gone wrong [closed]

Submitted by 邮差的信 on 2020-11-30 02:01:04
Question: Closed. This question needs debugging details and is not currently accepting answers. Closed 3 days ago.

I'm attempting to write code that will predict fatalities in Toronto due to Covid19... with no luck. I'm sure this has an easy fix that I'm overlooking, but I'm too new to Spark to know what that is. Does anyone have any insight on making this code runnable? The data set is here …

How can you efficiently build one ML model per partition in Spark with foreachPartition?

Submitted by 不问归期 on 2020-06-17 05:34:29
Question: I am trying to fit one ML model for each partition of my dataset, and I do not know how to do it in Spark. My dataset basically looks like this and is partitioned by Company:

    Company | Features | Target
    A       | xxx      | 0.9
    A       | xxx      | 0.8
    A       | xxx      | 1.0
    B       | xxx      | 1.2
    B       | xxx      | 1.0
    B       | xxx      | 0.9
    C       | xxx      | 0.7
    C       | xxx      | 0.9
    C       | xxx      | 0.9

My goal is to train one regressor for each company, in a parallelised way (I have a few hundred million records, with 100k companies). My intuition is that I need to use foreachPartition to have …
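A hedged sketch of one common alternative to foreachPartition (assumes Spark 3.0+ with pandas and scikit-learn available on the workers; f1 and f2 are placeholder feature columns): group by the key and fit one local model per group with applyInPandas.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def train_per_company(pdf: pd.DataFrame) -> pd.DataFrame:
        # One pandas DataFrame per company arrives here; fit a local model
        X = pdf[["f1", "f2"]]
        y = pdf["Target"]
        model = LinearRegression().fit(X, y)
        return pd.DataFrame({"Company": [pdf["Company"].iloc[0]],
                             "r2": [model.score(X, y)]})

    results = (df.groupBy("Company")
                 .applyInPandas(train_per_company,
                                schema="Company string, r2 double"))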

Interpreting coefficientMatrix, interceptVector and Confusion matrix on multinomial logistic regression

Submitted by 大城市里の小女人 on 2020-06-13 08:11:53
Question: Can anyone explain how to interpret the coefficientMatrix, interceptVector, and confusion matrix of a multinomial logistic regression? According to the Spark documentation: "Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length-K vector of intercepts is available."
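A minimal hedged sketch of where those objects live on a fitted PySpark model (train_df is a placeholder DataFrame with label and features columns):

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(family="multinomial", maxIter=10)
    model = lr.fit(train_df)

    # K x J matrix: one row of J feature coefficients per outcome class
    print(model.coefficientMatrix)
    # length-K vector: one intercept per outcome class
    print(model.interceptVector)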