apache-spark-ml

Should we parallelize a DataFrame like we parallelize a Seq before training

Submitted by 不羁的心 on 2021-02-17 15:36:40
Question: Consider the code given here: https://spark.apache.org/docs/1.2.0/ml-guide.html

    import org.apache.spark.ml.classification.LogisticRegression

    val training = sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

    val lr = new LogisticRegression()
    lr.setMaxIter(10).setRegParam(0.01)
    val model1 = lr.fit(training)

Assuming we …
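Since the excerpt is cut off, here is a minimal hedged sketch (in PySpark, assuming an active SparkSession) of the point the newer DataFrame-based API makes: a DataFrame is already distributed across the cluster, so unlike a local Seq it needs no explicit parallelize() before training.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

    # createDataFrame already distributes the rows across the cluster;
    # no parallelize() call is needed before fit()
    training = spark.createDataFrame(
        [(1.0, Vectors.dense(0.0, 1.1, 0.1)),
         (0.0, Vectors.dense(2.0, 1.0, -1.0)),
         (0.0, Vectors.dense(2.0, 1.3, 1.0)),
         (1.0, Vectors.dense(0.0, 1.2, -0.5))],
        ["label", "features"])

    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(training)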

Training ML models on Spark per partition, so that there is one trained model per partition of the DataFrame

Submitted by 余生颓废 on 2021-02-11 15:41:37
Question: How do you do parallel model training per partition in Spark using Scala? The solution given here is in PySpark; I'm looking for a solution in Scala: How can you efficiently build one ML model per partition in Spark with foreachPartition?

Answer 1: Get the distinct partitions using the partition column, create a thread pool of, say, 100 threads, and create a future object for each thread. Sample code may be as follows:

    // Get an ExecutorService
    val threadPoolExecutorService = getExecutionContext("name", 100)
    // …
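The question asks for Scala and the snippet above is truncated; as a hedged sketch of the same driver-side thread-pool pattern in PySpark (df, partition_col, and the choice of LinearRegression are placeholders; each slice is assumed to carry the default label and features columns): each thread submits an independent fit() job, so Spark can schedule the per-partition training jobs concurrently.

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.ml.regression import LinearRegression

    def fit_one(value):
        # Filter to a single partition value and fit a model on that slice
        subset = df.where(df["partition_col"] == value)
        return value, LinearRegression().fit(subset)

    values = [r[0] for r in df.select("partition_col").distinct().collect()]
    with ThreadPoolExecutor(max_workers=100) as pool:
        models = dict(pool.map(fit_one, values))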

pyspark - Convert sparse vector obtained after one hot encoding into columns

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-07 18:43:41
Question: I am using Apache Spark ML lib to handle categorical features using one-hot encoding. After writing the code below I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert this vector into columns so that I get a new transformed DataFrame. Take this dataset for example:

    >>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
    >>> ss = StringIndexer…
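A hedged sketch of one way to expand the encoded vector into per-category columns (assumes Spark 3.0+ for vector_to_array and an active spark session; column names follow the question):

    from pyspark.ml.feature import StringIndexer, OneHotEncoder
    from pyspark.ml.functions import vector_to_array

    fd = spark.createDataFrame(
        [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])

    indexer = StringIndexer(inputCol="c", outputCol="c_idx").fit(fd)
    indexed = indexer.transform(fd)
    encoded = (OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"])
               .fit(indexed).transform(indexed))

    # Turn the sparse vector into an array, then one column per category.
    # The default dropLast=True drops the last category, hence len - 1.
    arr = encoded.withColumn("arr", vector_to_array("c_idx_vec"))
    labels = indexer.labels
    cols = [arr["arr"][i].alias("c_" + labels[i]) for i in range(len(labels) - 1)]
    result = arr.select("x", "c", *cols)
    result.show()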

PySpark: Getting output layer neuron values for Spark ML Multilayer Perceptron Classifier

Submitted by 北战南征 on 2021-02-07 09:11:16
Question: I am doing binary classification using the Spark ML Multilayer Perceptron Classifier.

    mlp = MultilayerPerceptronClassifier(labelCol="evt", featuresCol="features",
                                         layers=[inputneurons, (inputneurons*2)+1, 2])

The output layer has two neurons, as it is a binary classification problem. Now I would like to get the values of the two neurons for each row in the test set, instead of just getting the prediction column containing either 0 or 1. I could not find anything for that in the API documentation.

Answer 1: …
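A hedged sketch of the usual answer (assuming Spark 2.3+, where the fitted MLP model is probabilistic; train_df and test_df are placeholder names): the transformed output carries rawPrediction and probability columns with the per-class scores, alongside prediction.

    model = mlp.fit(train_df)
    preds = model.transform(test_df)
    # probability: per-class softmax scores; rawPrediction: pre-threshold margins
    preds.select("rawPrediction", "probability", "prediction").show(truncate=False)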

StandardScaler returns NaN

Submitted by 时间秒杀一切 on 2021-01-29 17:50:06
Question: Environment: spark-1.6.0 with scala-2.10.4. Usage:

    // row of df: DataFrame = (String, String, double, Vector) as (id1, id2, label, feature)
    val df = sqlContext.read.parquet("data/Labeled.parquet")

    val SC = new StandardScaler()
      .setInputCol("feature").setOutputCol("scaled")
      .setWithMean(false).setWithStd(true).fit(df)

    val scaled = SC.transform(df)
      .drop("feature").withColumnRenamed("scaled", "feature")

The code follows the example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler. NaN exists in …
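The excerpt cuts off mid-sentence; a common first diagnostic is to check whether NaNs already exist in the input features, since the scaler propagates them. A hedged PySpark sketch (assumes Spark 3.1+ for the higher-order exists function; df and the feature column follow the question):

    import pyspark.sql.functions as F
    from pyspark.ml.functions import vector_to_array

    # Count rows whose feature vector already contains a NaN before scaling
    n_bad = (df.withColumn("arr", vector_to_array("feature"))
               .where(F.exists("arr", lambda x: F.isnan(x)))
               .count())
    print("rows with NaN features:", n_bad)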

How to find the argmax of a vector in PySpark ML

Submitted by ∥☆過路亽.° on 2021-01-07 05:48:27
Question: My model has output a DenseVector column, and I'd like to find the argmax. This page suggests such a function should be available, but I'm not sure what the syntax should be. Is it df.select("mycolumn").argmax()?

Answer 1: I could not find documentation for an argmax operation in Python, but you can do it by converting the vectors to arrays. For PySpark 3.0.0:

    from pyspark.ml.functions import vector_to_array

    tst_arr = tst_df.withColumn("arr", vector_to_array(F.col("vector_column")))
    tst_max = tst_arr.withColumn…
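The answer above is truncated; a hedged completion of the same idea (assumes Spark 3.0+; array_position is 1-based, hence the - 1):

    import pyspark.sql.functions as F
    from pyspark.ml.functions import vector_to_array

    tst_arr = tst_df.withColumn("arr", vector_to_array(F.col("vector_column")))
    # argmax = 1-based position of the maximum element, shifted to 0-based
    tst_max = tst_arr.withColumn(
        "argmax", F.expr("array_position(arr, array_max(arr)) - 1"))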

Covid Death Predictions gone wrong [closed]

Submitted by 邮差的信 on 2020-11-30 02:01:04
Question: Closed. This question needs debugging details and is not currently accepting answers. Closed 3 days ago.

I'm attempting to write code that will predict fatalities in Toronto due to Covid19... with no luck. I'm sure this has an easy fix that I'm overlooking, but I'm too new to Spark to know what that is. Does anyone have any insight on making this code runnable? The data set is here …

How can you efficiently build one ML model per partition in Spark with foreachPartition?

Submitted by 不问归期 on 2020-06-17 05:34:29
Question: I am trying to fit one ML model for each partition of my dataset, and I do not know how to do it in Spark. My dataset basically looks like this and is partitioned by Company:

    Company | Features | Target
    A       | xxx      | 0.9
    A       | xxx      | 0.8
    A       | xxx      | 1.0
    B       | xxx      | 1.2
    B       | xxx      | 1.0
    B       | xxx      | 0.9
    C       | xxx      | 0.7
    C       | xxx      | 0.9
    C       | xxx      | 0.9

My goal is to train one regressor for each company, in a parallelised way (I have a few hundred million records, with 100k companies). My intuition is that I need to use foreachPartition to have …
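A hedged sketch of one common alternative to foreachPartition (assumes Spark 3.0+ with pandas and scikit-learn available on the workers; f1 and f2 are placeholder feature columns): group by the key and fit one local model per group with applyInPandas.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def train_per_company(pdf: pd.DataFrame) -> pd.DataFrame:
        # One pandas DataFrame per company arrives here; fit a local model
        X = pdf[["f1", "f2"]]
        y = pdf["Target"]
        model = LinearRegression().fit(X, y)
        return pd.DataFrame({"Company": [pdf["Company"].iloc[0]],
                             "r2": [model.score(X, y)]})

    results = (df.groupBy("Company")
                 .applyInPandas(train_per_company,
                                schema="Company string, r2 double"))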

Interpreting coefficientMatrix, interceptVector and Confusion matrix on multinomial logistic regression

Submitted by 大城市里の小女人 on 2020-06-13 08:11:53
Question: Can anyone explain how to interpret the coefficientMatrix, interceptVector, and confusion matrix of a multinomial logistic regression? According to the Spark documentation: "Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length-K vector of intercepts is available."
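A minimal hedged sketch of where those objects live on a fitted PySpark model (train_df is a placeholder DataFrame with label and features columns):

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(family="multinomial", maxIter=10)
    model = lr.fit(train_df)

    # K x J matrix: one row of J feature coefficients per outcome class
    print(model.coefficientMatrix)
    # length-K vector: one intercept per outcome class
    print(model.interceptVector)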