apache-spark

Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

笑着哭i submitted on 2021-02-17 05:33:34
Question: I have the following simple Scala class, which I will later modify to fit some machine learning models. I need to create a jar file out of this, as I am going to run these models in amazon-emr. I am a beginner in this process, so I first tested whether I can successfully import the following CSV file and write it to another file by creating a jar file using the Scala class mentioned below. The CSV file looks like this and includes a Date column as one of the variables. +-------------------+-------
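
The question is cut off above, but a minimal Scala sketch of the read-and-write step it describes (hypothetical S3 paths; header row and schema inference assumed) could look like this:

```scala
import org.apache.spark.sql.SparkSession

object CsvRoundTrip {
  def main(args: Array[String]): Unit = {
    // On EMR the master is supplied by the cluster, so it is not set here.
    val spark = SparkSession.builder().appName("CsvRoundTrip").getOrCreate()

    // Hypothetical paths; inferSchema lets Spark type the Date column automatically.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/input.csv")

    df.write.mode("overwrite").csv("s3://my-bucket/output")

    spark.stop()
  }
}
```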

Spark How to Specify Number of Resulting Files for DataFrame While/After Writing

蓝咒 submitted on 2021-02-17 05:25:06
Question: I saw several Q&As about writing a single file into HDFS; it seems using coalesce(1) is sufficient, e.g. df.coalesce(1).write.mode("overwrite").format(format).save(location) But how can I specify the "exact" number of files that will be written after the save operation? So my question is: if I have a dataframe consisting of 100 partitions, will a write operation write 100 files? If I have a dataframe consisting of 100 partitions and I write after calling repartition(50)/coalesce
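
As a rough sketch of how the file count relates to partitioning (paths and the session setup are assumptions, not from the question): the number of part files written generally equals the number of partitions at write time, so calling repartition(n) or coalesce(n) just before the write controls it.

```scala
import org.apache.spark.sql.SparkSession

object FileCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FileCountSketch").master("local[*]").getOrCreate()
    val df = spark.range(1000).toDF("id")

    // 10 partitions at write time -> 10 part files in the output directory.
    df.repartition(10).write.mode("overwrite").parquet("/tmp/ten_files")

    // A single partition -> a single part file.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/one_file")

    spark.stop()
  }
}
```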

Transposing table to given format in spark [duplicate]

六月ゝ 毕业季﹏ submitted on 2021-02-17 05:09:18
Question: This question already has answers here: How to pivot Spark DataFrame? (10 answers). Closed 4 days ago. I am using Spark v2.4.1 and have a scenario where I need to convert a table structured as below: val df = Seq( ("A", "2016-01-01", "2016-12-01", "0.044999408"), ("A", "2016-01-01", "2016-12-01", "0.0449999426"), ("A", "2016-01-01", "2016-12-01", "0.045999415"), ("B", "2016-01-01", "2016-12-01", "0.0787888909"), ("B", "2016-01-01", "2016-12-01", "0.079779426"), ("B", "2016-01-01", "2016-12
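
The target layout in the question is truncated, but pivoting in Spark is typically done with groupBy().pivot().agg(); a small sketch with assumed column names (not the asker's exact schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object PivotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PivotSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Column names are assumptions; the question's desired output is cut off above.
    val df = Seq(
      ("A", "2016-01-01", "2016-12-01", 0.044999408),
      ("A", "2016-01-01", "2016-12-01", 0.0449999426),
      ("B", "2016-01-01", "2016-12-01", 0.0787888909)
    ).toDF("key", "start_date", "end_date", "value")

    // Distinct values of "key" become columns, aggregated with avg.
    df.groupBy("start_date", "end_date")
      .pivot("key")
      .agg(avg("value"))
      .show()
  }
}
```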

Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

丶灬走出姿态 submitted on 2021-02-17 04:42:13
Question: I am using IntelliJ version 2016.3. import sbt.Keys._ import sbt._ object ApplicationBuild extends Build { object Versions { val spark = "1.6.3" } val projectName = "example-spark" val common = Seq( version := "1.0", scalaVersion := "2.11.7" ) val customLibraryDependencies = Seq( "org.apache.spark" %% "spark-core" % Versions.spark % "provided", "org.apache.spark" %% "spark-sql" % Versions.spark % "provided", "org.apache.spark" %% "spark-hive" % Versions.spark % "provided", "org.apache.spark"
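
The build above is truncated and uses the older project/Build.scala style; as a rough sketch (not the asker's exact build, and rewritten as a build.sbt), the NoClassDefFoundError at run time usually comes from the "provided" scope keeping the Spark jars off the runtime classpath when running from IntelliJ or sbt, so the dependencies can be left on the default compile scope for local runs:

```scala
// build.sbt -- minimal sketch; mark the Spark modules "provided" again only
// when packaging an assembly for a cluster that already ships Spark.
name := "example-spark"
version := "1.0"
scalaVersion := "2.11.7"

val sparkVersion = "1.6.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)
```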

Spark - Read and Write back to same S3 location

a 夏天 submitted on 2021-02-17 02:48:10
Question: I am reading datasets dataset1 and dataset2 from S3 locations. I then transform them and write back to the same location that dataset2 was read from. However, I get the error message below: An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet If I try to write to a new S3 location, e.g. s3://dataset_new_path.../, then the code works fine. my_df \ .write.mode('overwrite') \ .format('parquet') \
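
One commonly suggested workaround, sketched below with hypothetical paths (not necessarily the accepted answer): because Spark reads lazily, overwriting the same S3 prefix deletes the source files mid-job, so the result is staged elsewhere first and only then written back.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("same-path-overwrite").getOrCreate()

source_path = "s3://my-bucket/dataset2/"       # hypothetical path
staging_path = "s3://my-bucket/dataset2_tmp/"  # hypothetical path

df = spark.read.parquet(source_path)
transformed = df.dropDuplicates()  # stand-in for the real transformation

# 1) Materialize the result somewhere else first ...
transformed.write.mode("overwrite").parquet(staging_path)

# 2) ... then overwrite the original location from the staged copy.
spark.read.parquet(staging_path) \
    .write.mode("overwrite") \
    .parquet(source_path)
```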

Spark __getnewargs__ error … Method or([class java.lang.String]) does not exist

混江龙づ霸主 submitted on 2021-02-16 20:01:20
Question: I am trying to add a column to a DataFrame depending on whether the column value is in another column, as follows: df=df.withColumn('new_column',when(df['color']=='blue'|df['color']=='green','A').otherwise('WD')) After running the code I get the following error: Py4JError: An error occurred while calling o59.or. Trace: py4j.Py4JException: Method or([class java.lang.String]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine
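
The usual fix, as a sketch with a small assumed sample DataFrame: in PySpark, | binds more tightly than ==, so each comparison has to be parenthesized; otherwise Python ends up asking Py4J to call or() on a plain string, which is the error reported above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.appName("when-or-fix").getOrCreate()
df = spark.createDataFrame([("blue",), ("green",), ("red",)], ["color"])  # sample data, assumed

# Parentheses around each comparison keep | from binding to the raw strings.
df = df.withColumn(
    "new_column",
    when((df["color"] == "blue") | (df["color"] == "green"), "A").otherwise("WD"),
)
df.show()
```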

pyspark 'DataFrame' object has no attribute '_get_object_id'

百般思念 submitted on 2021-02-16 18:57:44
Question: I am trying to run some code but am getting the error: 'DataFrame' object has no attribute '_get_object_id' The code: items = [(1,12),(1,float('Nan')),(1,14),(1,10),(2,22),(2,20),(2,float('Nan')),(3,300), (3,float('Nan'))] sc = spark.sparkContext rdd = sc.parallelize(items) df = rdd.toDF(["id", "col1"]) import pyspark.sql.functions as func means = df.groupby("id").agg(func.mean("col1")) # The error is thrown at this line df = df.withColumn("col1", func.when((df["col1"].isNull()), means.where(func
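
A sketch of one common way around this (the join-based approach is an assumption rather than the question's accepted answer; column names are from the question): a DataFrame such as means cannot be used inside a column expression, which is what raises _get_object_id, so the per-id means are joined back onto the original frame instead.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.appName("fill-with-group-mean").getOrCreate()

items = [(1, 12.0), (1, float("nan")), (1, 14.0), (1, 10.0),
         (2, 22.0), (2, 20.0), (2, float("nan")),
         (3, 300.0), (3, float("nan"))]
df = spark.createDataFrame(items, ["id", "col1"])

# Turn NaN into null so that mean() and coalesce() behave as expected.
df = df.withColumn("col1", func.when(func.isnan("col1"), None).otherwise(func.col("col1")))

means = df.groupBy("id").agg(func.mean("col1").alias("mean_col1"))

# Join the aggregated means back and fill the nulls from them.
filled = (df.join(means, on="id", how="left")
            .withColumn("col1", func.coalesce(func.col("col1"), func.col("mean_col1")))
            .drop("mean_col1"))
filled.show()
```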

How to display (or operate on) objects encoded by Kryo in Spark Dataset?

旧城冷巷雨未停 submitted on 2021-02-16 13:55:08
Question: Say you have this: // assume we handle custom type class MyObj(val i: Int, val j: String) implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[MyObj] val ds = spark.createDataset(Seq(new MyObj(1, "a"),new MyObj(2, "b"),new MyObj(3, "c"))) When I do a ds.show , I get: +--------------------+ | value| +--------------------+ |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| |[01 00 24 6C 69 6...| +--------------------+ I understand that it's because the contents are encoded into internal
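
One way to make such a Dataset readable, sketched under the assumption that mapping back to plain fields is acceptable (this is not necessarily the question's answer): deserialize the Kryo-encoded objects with map and re-encode them with a product encoder that show() can render.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// The custom type from the question.
class MyObj(val i: Int, val j: String)

object KryoShowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KryoShowSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    implicit val myObjEncoder = Encoders.kryo[MyObj]
    val ds = spark.createDataset(Seq(new MyObj(1, "a"), new MyObj(2, "b"), new MyObj(3, "c")))

    // Map each object back to a readable (Int, String) tuple before showing it.
    ds.map(o => (o.i, o.j)).toDF("i", "j").show()
  }
}
```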

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

老子叫甜甜 submitted on 2021-02-16 08:43:54
Question: My input dataframe looks like the one below: from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Basics").getOrCreate() df=spark.createDataFrame(data=[('Alice',4.300,None),('Bob',float('nan'),897)],schema=['name','High','Low']) +-----+----+----+ | name|High| Low| +-----+----+----+ |Alice| 4.3|null| | Bob| NaN| 897| +-----+----+----+ Expected output if divided by 10.0: +-----+----+----+ | name|High| Low| +-----+----+----+ |Alice| 0.43|null| | Bob| NaN| 89.7| +-----+----+----+
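
A minimal sketch of one approach (an assumption, not the accepted answer): inspect df.dtypes and apply the division only to the non-string columns, leaving the rest untouched.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.createDataFrame(
    data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)],
    schema=['name', 'High', 'Low'],
)

divisor = 10.0

# df.dtypes yields (column name, type name) pairs; divide everything that is not a string.
df_scaled = df.select([
    (F.col(c) / divisor).alias(c) if t != 'string' else F.col(c)
    for c, t in df.dtypes
])
df_scaled.show()
```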