apache-spark

Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

笑着哭i submitted on 2021-02-17 05:33:34
Question: I have the following simple Scala class, which I will later modify to fit some machine learning models. I need to create a jar file out of this, as I am going to run these models in amazon-emr. I am a beginner in this process, so I first tested whether I can successfully import the following CSV file and write it to another file by creating a jar file using the Scala class mentioned below. The CSV file looks like this and includes a Date column as one of the variables. +-------------------+-------
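
The question is cut off above, but a minimal Scala sketch of the read-and-write step it describes (hypothetical S3 paths; header row and schema inference assumed) could look like this:

```scala
import org.apache.spark.sql.SparkSession

object CsvRoundTrip {
  def main(args: Array[String]): Unit = {
    // On EMR the master is supplied by the cluster, so it is not set here.
    val spark = SparkSession.builder().appName("CsvRoundTrip").getOrCreate()

    // Hypothetical paths; inferSchema lets Spark type the Date column automatically.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/input.csv")

    df.write.mode("overwrite").csv("s3://my-bucket/output")

    spark.stop()
  }
}
```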

Spark How to Specify Number of Resulting Files for DataFrame While/After Writing

蓝咒 submitted on 2021-02-17 05:25:06
Question: I saw several Q&As about writing a single file into HDFS; it seems using coalesce(1) is sufficient, e.g. df.coalesce(1).write.mode("overwrite").format(format).save(location) But how can I specify the "exact" number of files that will be written after the save operation? So my question is: if I have a dataframe consisting of 100 partitions, will a write operation write 100 files? If I have a dataframe consisting of 100 partitions and I write after calling repartition(50)/coalesce
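
As a rough sketch of how the file count relates to partitioning (paths and the session setup are assumptions, not from the question): the number of part files written generally equals the number of partitions at write time, so calling repartition(n) or coalesce(n) just before the write controls it.

```scala
import org.apache.spark.sql.SparkSession

object FileCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FileCountSketch").master("local[*]").getOrCreate()
    val df = spark.range(1000).toDF("id")

    // 10 partitions at write time -> 10 part files in the output directory.
    df.repartition(10).write.mode("overwrite").parquet("/tmp/ten_files")

    // A single partition -> a single part file.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/one_file")

    spark.stop()
  }
}
```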

Transposing table to given format in spark [duplicate]

六月ゝ 毕业季﹏ submitted on 2021-02-17 05:09:18
Question: This question already has answers here: How to pivot Spark DataFrame? (10 answers). Closed 4 days ago. I am using Spark v2.4.1 and have a scenario where I need to convert a table structured as below: val df = Seq( ("A", "2016-01-01", "2016-12-01", "0.044999408"), ("A", "2016-01-01", "2016-12-01", "0.0449999426"), ("A", "2016-01-01", "2016-12-01", "0.045999415"), ("B", "2016-01-01", "2016-12-01", "0.0787888909"), ("B", "2016-01-01", "2016-12-01", "0.079779426"), ("B", "2016-01-01", "2016-12
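
The target layout in the question is truncated, but pivoting in Spark is typically done with groupBy().pivot().agg(); a small sketch with assumed column names (not the asker's exact schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object PivotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PivotSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Column names are assumptions; the question's desired output is cut off above.
    val df = Seq(
      ("A", "2016-01-01", "2016-12-01", 0.044999408),
      ("A", "2016-01-01", "2016-12-01", 0.0449999426),
      ("B", "2016-01-01", "2016-12-01", 0.0787888909)
    ).toDF("key", "start_date", "end_date", "value")

    // Distinct values of "key" become columns, aggregated with avg.
    df.groupBy("start_date", "end_date")
      .pivot("key")
      .agg(avg("value"))
      .show()
  }
}
```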

Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

丶灬走出姿态 submitted on 2021-02-17 04:42:13
Question: I am using IntelliJ version 2016.3. import sbt.Keys._ import sbt._ object ApplicationBuild extends Build { object Versions { val spark = "1.6.3" } val projectName = "example-spark" val common = Seq( version := "1.0", scalaVersion := "2.11.7" ) val customLibraryDependencies = Seq( "org.apache.spark" %% "spark-core" % Versions.spark % "provided", "org.apache.spark" %% "spark-sql" % Versions.spark % "provided", "org.apache.spark" %% "spark-hive" % Versions.spark % "provided", "org.apache.spark"
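
The build above is truncated and uses the older project/Build.scala style; as a rough sketch (not the asker's exact build, and rewritten as a build.sbt), the NoClassDefFoundError at run time usually comes from the "provided" scope keeping the Spark jars off the runtime classpath when running from IntelliJ or sbt, so the dependencies can be left on the default compile scope for local runs:

```scala
// build.sbt -- minimal sketch; mark the Spark modules "provided" again only
// when packaging an assembly for a cluster that already ships Spark.
name := "example-spark"
version := "1.0"
scalaVersion := "2.11.7"

val sparkVersion = "1.6.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)
```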

Spark - Read and Write back to same S3 location

a 夏天 submitted on 2021-02-17 02:48:10
Question: I am reading datasets dataset1 and dataset2 from S3 locations. I then transform them and write back to the same location that dataset2 was read from. However, I get the error message below: An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet If I try to write to a new S3 location, e.g. s3://dataset_new_path.../, then the code works fine. my_df \ .write.mode('overwrite') \ .format('parquet') \
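
One commonly suggested workaround, sketched below with hypothetical paths (not necessarily the accepted answer): because Spark reads lazily, overwriting the same S3 prefix deletes the source files mid-job, so the result is staged elsewhere first and only then written back.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("same-path-overwrite").getOrCreate()

source_path = "s3://my-bucket/dataset2/"       # hypothetical path
staging_path = "s3://my-bucket/dataset2_tmp/"  # hypothetical path

df = spark.read.parquet(source_path)
transformed = df.dropDuplicates()  # stand-in for the real transformation

# 1) Materialize the result somewhere else first ...
transformed.write.mode("overwrite").parquet(staging_path)

# 2) ... then overwrite the original location from the staged copy.
spark.read.parquet(staging_path) \
    .write.mode("overwrite") \
    .parquet(source_path)
```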

Spark __getnewargs__ error … Method or([class java.lang.String]) does not exist

混江龙づ霸主 submitted on 2021-02-16 20:01:20
Question: I am trying to add a column to a DataFrame depending on whether the column value is in another column, as follows: df=df.withColumn('new_column',when(df['color']=='blue'|df['color']=='green','A').otherwise('WD')) After running the code I get the following error: Py4JError: An error occurred while calling o59.or. Trace: py4j.Py4JException: Method or([class java.lang.String]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine
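
The usual fix, as a sketch with a small assumed sample DataFrame: in PySpark, | binds more tightly than ==, so each comparison has to be parenthesized; otherwise Python ends up asking Py4J to call or() on a plain string, which is the error reported above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.appName("when-or-fix").getOrCreate()
df = spark.createDataFrame([("blue",), ("green",), ("red",)], ["color"])  # sample data, assumed

# Parentheses around each comparison keep | from binding to the raw strings.
df = df.withColumn(
    "new_column",
    when((df["color"] == "blue") | (df["color"] == "green"), "A").otherwise("WD"),
)
df.show()
```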

pyspark 'DataFrame' object has no attribute '_get_object_id'

百般思念 submitted on 2021-02-16 18:57:44
Question: I am trying to run some code but am getting the error: 'DataFrame' object has no attribute '_get_object_id' The code: items = [(1,12),(1,float('Nan')),(1,14),(1,10),(2,22),(2,20),(2,float('Nan')),(3,300), (3,float('Nan'))] sc = spark.sparkContext rdd = sc.parallelize(items) df = rdd.toDF(["id", "col1"]) import pyspark.sql.functions as func means = df.groupby("id").agg(func.mean("col1")) # The error is thrown at this line df = df.withColumn("col1", func.when((df["col1"].isNull()), means.where(func
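
A sketch of one common way around this (the join-based approach is an assumption rather than the question's accepted answer; column names are from the question): a DataFrame such as means cannot be used inside a column expression, which is what raises _get_object_id, so the per-id means are joined back onto the original frame instead.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.appName("fill-with-group-mean").getOrCreate()

items = [(1, 12.0), (1, float("nan")), (1, 14.0), (1, 10.0),
         (2, 22.0), (2, 20.0), (2, float("nan")),
         (3, 300.0), (3, float("nan"))]
df = spark.createDataFrame(items, ["id", "col1"])

# Turn NaN into null so that mean() and coalesce() behave as expected.
df = df.withColumn("col1", func.when(func.isnan("col1"), None).otherwise(func.col("col1")))

means = df.groupBy("id").agg(func.mean("col1").alias("mean_col1"))

# Join the aggregated means back and fill the nulls from them.
filled = (df.join(means, on="id", how="left")
            .withColumn("col1", func.coalesce(func.col("col1"), func.col("mean_col1")))
            .drop("mean_col1"))
filled.show()
```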

How to display (or operate on) objects encoded by Kryo in Spark Dataset?

旧城冷巷雨未停 submitted on 2021-02-16 13:55:08
Question: Say you have this: // assume we handle custom type class MyObj(val i: Int, val j: String) implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj] val ds = spark.createDataset(Seq(new MyObj(1, "a"),new MyObj(2, "b"),new MyObj(3, "c"))) When I do a ds.show , I get: +--------------------+ | value| +--------------------+ |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| +--------------------+ I understand that it's because the contents are encoded into internal
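
One way to make such a Dataset readable, sketched under the assumption that mapping back to plain fields is acceptable (this is not necessarily the question's answer): deserialize the Kryo-encoded objects with map and re-encode them with a product encoder that show() can render.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// The custom type from the question.
class MyObj(val i: Int, val j: String)

object KryoShowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KryoShowSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    implicit val myObjEncoder = Encoders.kryo[MyObj]
    val ds = spark.createDataset(Seq(new MyObj(1, "a"), new MyObj(2, "b"), new MyObj(3, "c")))

    // Map each object back to a readable (Int, String) tuple before showing it.
    ds.map(o => (o.i, o.j)).toDF("i", "j").show()
  }
}
```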

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

老子叫甜甜 submitted on 2021-02-16 08:43:54
Question: My input dataframe looks like the one below: from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Basics").getOrCreate() df=spark.createDataFrame(data=[('Alice',4.300,None),('Bob',float('nan'),897)],schema=['name','High','Low']) +-----+----+----+ | name|High| Low| +-----+----+----+ |Alice| 4.3|null| | Bob| NaN| 897| +-----+----+----+ Expected output if divided by 10.0: +-----+----+----+ | name|High| Low| +-----+----+----+ |Alice| 0.43|null| | Bob| NaN| 89.7| +-----+----+----+
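
A minimal sketch of one approach (an assumption, not the accepted answer): inspect df.dtypes and apply the division only to the non-string columns, leaving the rest untouched.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.createDataFrame(
    data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)],
    schema=['name', 'High', 'Low'],
)

divisor = 10.0

# df.dtypes yields (column name, type name) pairs; divide everything that is not a string.
df_scaled = df.select([
    (F.col(c) / divisor).alias(c) if t != 'string' else F.col(c)
    for c, t in df.dtypes
])
df_scaled.show()
```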