apache-spark-2.0

Spark submit job fails with no error in local mode

倖福魔咒の submitted on 2019-12-11 19:08:38
Question: I have a Spark application that I am trying to execute on Windows/Unix using the command below:

spark-submit --class com.myorg.dataquality.DataVerification --master local[*] C:\Users\workspace\dataQuality\target\data-quality-framework-0.0.1-SNAPSHOT.jar

The job terminates immediately after submission, yet it works perfectly when run from Eclipse.

C:\Users>spark-submit --class com.myorg.dataquality.DataVerification --master local[*] C:\Users\workspace\dataQuality\target\data-quality
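A frequent cause of a job that exits immediately in local mode is an entry point that never triggers any Spark work, or a main class declared as a class instead of an object. This is a minimal sketch of what the driver could look like, assuming only the class name from the command above (the body is hypothetical):

import org.apache.spark.sql.SparkSession

object DataVerification {
  def main(args: Array[String]): Unit = {
    // spark-submit supplies the master (local[*] here), so no setMaster in code
    val spark = SparkSession.builder()
      .appName("DataVerification")
      .getOrCreate()

    // Placeholder action so the job visibly does some work before exiting
    spark.range(10).show()

    spark.stop()
  }
}

If the real main already looks like this, running spark-submit with --verbose, or redirecting stderr to a file, usually surfaces the error that local mode is otherwise swallowing.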

com.mysql.jdbc.Driver not found in spark2 scala

此生再无相见时 submitted on 2019-12-11 11:46:56
Question: I am using a Jupyter Notebook with a Scala kernel; below is my code to import a MySQL table into a DataFrame:

val sql = """select * from customer"""
val df_customer = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/ccfd")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", s"( $sql ) t")
  .option("user", "root")
  .option("password", "xxxxxxx")
  .load()

Below is the error:

Name: java.lang.ClassNotFoundException Message: com.mysql.jdbc.Driver StackTrace: at scala.reflect
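The exception means the MySQL connector jar is not on the notebook JVM's classpath. One sketch of pulling it in from the session builder, assuming no SparkSession/SparkContext has been created in the kernel yet (if one is already running, this setting is ignored and the jar has to be supplied when the kernel or spark-submit is launched); the connector coordinate and version below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mysql-to-dataframe")
  // Ask Spark to resolve and ship the JDBC driver with the application
  .config("spark.jars.packages", "mysql:mysql-connector-java:5.1.47")
  .getOrCreate()

val df_customer = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/ccfd")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "customer")
  .option("user", "root")
  .option("password", "xxxxxxx")
  .load()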

Spark Cassandra NoClassDefFoundError guava/cache/CacheLoader

天涯浪子 submitted on 2019-12-11 07:35:49
Question: Running Cassandra 2.2.8, Win7, JDK8, Spark 2; I have these on the classpath: Cassandra core 3.12, spark-cassandra-2.11, spark-cassandra-java-2.11, Spark 2.11, spark-network-common_2.11, Guava-16.0.jar, scala-2.11.jar, etc. I am trying to run a basic example; it compiles fine, but when I try to run it, the very first line already throws an error:

SparkConf conf = new SparkConf();

java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheLoader

A missing spark-network-common is supposed to cause this error - but I
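org/spark_project/guava/... is Guava shaded inside Spark's own artifacts (spark-network-common), so every Spark jar on the classpath has to come from the same Spark release; assembling jars by hand makes such mismatches easy to hit. A sketch of letting a build tool resolve a consistent set instead, assuming sbt; the versions shown are illustrative, not prescriptive:

// build.sbt
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // spark-core transitively brings a matching spark-network-common
  "org.apache.spark"   %% "spark-core" % "2.0.2",
  "org.apache.spark"   %% "spark-sql"  % "2.0.2",
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.5"
)

The equivalent Maven dependencies work the same way; the point is to stop mixing individually downloaded Spark jars from different versions.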

Huge delays translating the DAG to tasks

ε祈祈猫儿з submitted on 2019-12-11 07:27:24
Question: These are my steps:

1. Submit the Spark app to an EMR cluster.
2. The driver starts and I can see the Spark UI (no stages have been created yet).
3. The driver reads an ORC file with ~3000 parts from S3, makes some transformations and saves it back to S3.
4. The save should create some stages in the Spark UI, but the stages take a really long time to appear.
5. The stages appear and start executing.

Why am I getting that huge delay in step 4? During this time the cluster is
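With roughly 3000 small parts on S3, much of the gap before the first stage appears is driver-side work: listing the files, reading ORC footers, and planning one task per small split. A sketch of two settings that pack small files into fewer, larger input partitions, assuming the default file-based ORC reader is used; the byte values and paths are illustrative, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-etl")
  // Target size of each input partition and the assumed cost of opening a file;
  // raising both reduces the number of tasks generated from many small parts
  .config("spark.sql.files.maxPartitionBytes", (256L * 1024 * 1024).toString)
  .config("spark.sql.files.openCostInBytes", (8L * 1024 * 1024).toString)
  .getOrCreate()

val df = spark.read.orc("s3://my-bucket/input/")
df.write.orc("s3://my-bucket/output/")

Compacting the source into fewer, larger ORC files upstream attacks the same problem more directly.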

Creating a unique grouping key from column-wise runs in a Spark DataFrame

六月ゝ 毕业季﹏ submitted on 2019-12-11 06:38:45
Question: I have something analogous to this, where spark is my SparkSession. I've imported spark.implicits._ so I can use the $ syntax:

val df = spark.createDataFrame(Seq(("a", 0L), ("b", 1L), ("c", 1L), ("d", 1L), ("e", 0L), ("f", 1L)))
  .toDF("id", "flag")
  .withColumn("index", monotonically_increasing_id)
  .withColumn("run_key", when($"flag" === 1, $"index").otherwise(0))

df.show

df: org.apache.spark.sql.DataFrame = [id: string, flag: bigint ... 2 more fields]
+---+----+-----+-------+
|
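One common way to turn consecutive runs of equal flag values into a single grouping key is to mark the rows where the flag changes and take a running sum of those markers. A sketch under that assumption; the unpartitioned window pulls all rows into one partition, which is fine for illustration but not for large data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byIndex = Window.orderBy("index")

val withRunKey = df
  // 1 whenever the flag differs from the previous row's flag, otherwise 0
  .withColumn("changed", when($"flag" =!= lag($"flag", 1, -1).over(byIndex), 1).otherwise(0))
  // running sum of the change markers yields one key per run
  .withColumn("run_key", sum($"changed").over(byIndex))

withRunKey.show()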

Jaro-Winkler score calculation in Apache Spark

北战南征 submitted on 2019-12-11 06:08:13
Question: We need to implement a Jaro-Winkler distance calculation across strings in an Apache Spark Dataset. We are new to Spark, and after searching the web we have not been able to find much; it would be great if you could guide us. We thought of using flatMap and then realized it won't help; we then tried a couple of foreach loops but could not figure out how to go forward, since each string has to be compared against all the others, as in the dataset below.

RowFactory.create(0, "Hi I heard about Spark"), RowFactory
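One way to score every pair is a self-join plus a UDF that wraps an existing Jaro-Winkler implementation. A sketch assuming Apache Commons Text (org.apache.commons:commons-text) is on the classpath and a DataFrame named sentences with id and text columns (both names are placeholders):

import org.apache.commons.text.similarity.JaroWinklerDistance
import org.apache.spark.sql.functions._
import spark.implicits._   // spark is the active SparkSession

// Needed if Spark flags the non-equi self-join as a cartesian product
spark.conf.set("spark.sql.crossJoin.enabled", "true")

val jaroWinkler = udf { (a: String, b: String) =>
  new JaroWinklerDistance().apply(a, b).doubleValue
}

val left  = sentences.select($"id".as("id_a"), $"text".as("text_a"))
val right = sentences.select($"id".as("id_b"), $"text".as("text_b"))

val scored = left
  .join(right, $"id_a" < $"id_b")   // each string compared against every other once
  .withColumn("score", jaroWinkler($"text_a", $"text_b"))

Note this is quadratic in the number of rows, so for large data a blocking or candidate-generation step is usually added first.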

How to load only first n files in pyspark spark.read.csv from a single directory

可紊 submitted on 2019-12-11 05:07:59
Question: I have a scenario where I am loading and processing 4 TB of data, which is about 15000 .csv files in a folder. Since I have limited resources, I am planning to process them in two batches and then union them. I am trying to understand if I can load only 50% of the files (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv. I cannot use a regular expression, as these files are generated from multiple sources and their number is uneven (from some sources they are few and from
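spark.read.csv also accepts an explicit list of paths, so one option is to list the directory yourself, split the file names, and hand each half to the reader. A sketch in Scala (the question is about PySpark, but the same list-then-read approach applies there); spark is an existing SparkSession and the directory is hypothetical:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val allFiles = fs.listStatus(new Path("/data/landing"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".csv"))
  .sorted

val (batch1, batch2) = allFiles.splitAt(allFiles.length / 2)

val dfBatch1 = spark.read.option("header", "true").csv(batch1: _*)
val dfBatch2 = spark.read.option("header", "true").csv(batch2: _*)
// process each batch separately, then combine
val full = dfBatch1.union(dfBatch2)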

Add a new fitted stage to an existing PipelineModel without fitting again

江枫思渺然 submitted on 2019-12-10 19:16:28
Question: I would like to concatenate several trained Pipelines into one, similar to "Spark add a new fitted stage to an existing PipelineModel without fitting again"; however, the solution below is for PySpark:

pipe_model_new = PipelineModel(stages = [pipe_model, pipe_model2])
final_df = pipe_model_new.transform(df1)

In Apache Spark 2.0 the constructor of PipelineModel is marked private, hence it cannot be called from outside. In the Pipeline class, only the fit method creates a PipelineModel
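Since the Scala constructor is private, one workaround is to wrap the fitted models in a new Pipeline: Pipeline.fit only trains Estimator stages, so when every stage is already a Transformer it simply bundles them without retraining. A sketch under that assumption, where pipeModel1 and pipeModel2 stand for the already-fitted PipelineModels:

import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}

val combined: PipelineModel = new Pipeline()
  .setStages(Array[PipelineStage](pipeModel1, pipeModel2))
  .fit(df1)   // no Estimator stages, so nothing is refitted

val finalDf = combined.transform(df1)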

value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]

房东的猫 submitted on 2019-12-10 18:22:07
Question: I am getting a compilation error converting the pre-LDA transformation to a DataFrame using Scala in Spark 2.0. The specific code that throws the error is below:

val documents = PreLDAmodel.transform(mp_listing_lda_df)
  .select("docId","features")
  .rdd
  .map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
  .toDF()

The complete compilation error is:

Error:(132, 8) value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
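This error usually means the SparkSession implicits are not in scope at the point of the conversion (or were imported from a SQLContext elsewhere). A minimal sketch of the same code with the import placed right next to it, reusing the names from the question and assuming spark is the active session:

import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().getOrCreate()
// The RDD-to-DataFrame conversions live on the SparkSession instance, so the
// import must reference a stable `val` and be visible where .toDF is called
import spark.implicits._

val documents = PreLDAmodel.transform(mp_listing_lda_df)
  .select("docId", "features")
  .rdd
  .map { case Row(row_num: Long, features: MLVector) => (row_num, features) }
  .toDF("docId", "features")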

How to build Spark from the sources from the Download Spark page?

让人想犯罪 __ submitted on 2019-12-10 13:37:34
Question: I tried to install and build Spark 2.0.0 on an Ubuntu 16.04 VM as follows.

Install Java:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Install Scala. Go to the Downloads tab on their site, scala-lang.org/download/all.html; I used Scala 2.11.8.

sudo mkdir /usr/local/src/scala
sudo tar -xvf scala-2.11.8.tgz -C /usr/local/src/scala/

Modify the .bashrc file and include the path for Scala:

export SCALA_HOME=/usr/local/src/scala