spark-dataframe

Combining csv files with mismatched columns

有些话、适合烂在心里 submitted on 2020-01-13 06:30:08
Question: I need to combine multiple CSV files into one object (a DataFrame, I assume), but they all have mismatched columns, like so:

CSV A: store_location_key | product_key | collector_key | trans_dt | sales | units | trans_key
CSV B: collector_key | trans_dt | store_location_key | product_key | sales | units | trans_key
CSV C: collector_key | trans_dt | store_location_key | product_key | sales | units | trans_id

On top of that, I need these to match with two additional csv files that have a matching
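
Below is a minimal Scala sketch of one way to combine such files, assuming Spark 2.3+ (for unionByName) and the column names listed above; the file paths are placeholders, not from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combine-csv").getOrCreate()

val dfA = spark.read.option("header", "true").csv("a.csv")
val dfB = spark.read.option("header", "true").csv("b.csv")
// CSV C uses trans_id where the others use trans_key, so rename it first.
val dfC = spark.read.option("header", "true").csv("c.csv")
  .withColumnRenamed("trans_id", "trans_key")

// unionByName matches columns by name rather than position, so the differing
// column order across the three files does not matter.
val combined = dfA.unionByName(dfB).unionByName(dfC)
combined.show()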

Spark append mode for partitioned text file fails with SaveMode.Append - IOException File already Exists

守給你的承諾、 submitted on 2020-01-13 04:36:10
Question: Something as simple as writing partitioned text files fails.

dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")

Exception:

16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
java.io.IOException: File already exists:s3://path/1839dd1ed38a.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:614)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at org.apache
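
A sketch of one commonly suggested mitigation, assuming the error comes from retried or speculative task attempts racing to create the same S3 output file; the input data and paths below are placeholders standing in for the question's dataDF.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("partitioned-append")
  // Avoid duplicate speculative attempts writing the same part file.
  .config("spark.speculation", "false")
  .getOrCreate()
import spark.implicits._

// Stand-in for dataDF: one string column plus the partition columns.
val dataDF = Seq(
  ("event-1", 2016, 7, 6),
  ("event-2", 2016, 7, 6)
).toDF("value", "year", "month", "date")

dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Append)
  .text("s3://data/test2/events/")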

Spark, add new Column with the same value in Scala

佐手、 submitted on 2020-01-12 04:54:08
Question: I have a problem with the withColumn function in a Spark-Scala environment. I would like to add a new column to my DataFrame, so that this:

+---+----+---+
|  A|   B|  C|
+---+----+---+
|  4|blah|  2|
|  2|    |  3|
| 56| foo|  3|
|100|null|  5|
+---+----+---+

becomes:

+---+----+---+-----+
|  A|   B|  C|   D |
+---+----+---+-----+
|  4|blah|  2|  750|
|  2|    |  3|  750|
| 56| foo|  3|  750|
|100|null|  5|  750|
+---+----+---+-----+

Column D is one value repeated N times, once for each row of my DataFrame. The code is this: var
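
A minimal Scala sketch of the usual way to do this with lit(); the sample data mirrors the table above and 750 stands in for the repeated value.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("constant-column").getOrCreate()
import spark.implicits._

val df = Seq(
  (4, "blah", 2),
  (2, "", 3),
  (56, "foo", 3),
  (100, null.asInstanceOf[String], 5)
).toDF("A", "B", "C")

// withColumn expects a Column, so the constant must be wrapped in lit()
// rather than passed as a bare value.
val withD = df.withColumn("D", lit(750))
withD.show()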

Pyspark: filter dataframe by regex with string formatting?

孤街醉人 submitted on 2020-01-12 01:47:09
Question: I've read several posts on using the "like" operator to filter a Spark DataFrame on the condition that a column contains a string/expression, but I was wondering whether the following is a best practice for using %s in the desired condition:

input_path = <s3_location_str>
my_expr = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx
# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
dk = dx.filter("keyword
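
For reference, the usual alternative to interpolating the pattern into a SQL "like" string is Column.rlike, which takes the regular expression directly; PySpark exposes the same rlike method on columns. Below is a Scala sketch of that approach, with a placeholder input path and the column name and regex taken from the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("regex-filter").getOrCreate()

val dx = spark.read.parquet("s3://bucket/path/") // placeholder path
val myExpr = "Arizona.*hot" // the regex from the question

// rlike applies the regex to the column value, so no %...% wrapping or
// manual string formatting of the SQL text is needed.
val dk = dx.filter(col("keyword").rlike(myExpr))
dk.show()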

Dataframe from List<String> in Java

扶醉桌前 submitted on 2020-01-11 07:12:10
Question: Spark version: 1.6.2, Java version: 7. I have a List<String> data, something like: [[dev, engg, 10000], [karthik, engg, 20000]..]. I know the schema for this data: name (String), degree (String), salary (Integer). I tried:

JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);

Output:

root
 |-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record              |
+-------------------------
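
The raw strings are not JSON, which is why read().json() yields only _corrupt_record. Below is a sketch of the Row-plus-explicit-schema approach, written in Scala (the Java API has the same shape via RowFactory.create and DataTypes); the sample strings stand in for the List<String> and assume comma-separated fields.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("list-to-df").getOrCreate()

// Stand-in for the List<String>; each element holds one record's fields.
val datas = Seq("dev,engg,10000", "karthik,engg,20000")

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("degree", StringType, nullable = true),
  StructField("salary", IntegerType, nullable = true)
))

// Split each string into fields and build typed Rows matching the schema.
val rows = spark.sparkContext
  .parallelize(datas)
  .map(_.split(","))
  .map(parts => Row(parts(0).trim, parts(1).trim, parts(2).trim.toInt))

val df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show(false)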

Spark SQL removing white spaces

守給你的承諾、 submitted on 2020-01-10 06:08:51
Question: I have a simple Spark program which reads a JSON file and emits a CSV file. In the JSON data the values contain leading and trailing white space; when I emit the CSV, the leading and trailing white space is gone. Is there a way I can retain the spaces? I tried many options like ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but no luck.

input.json
{"key" : "k1", "value1": "Good String", "value2": "Good String"}
{"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
{"key" :
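
A sketch of setting the two options on the CSV writer, assuming Spark 2.2+ where ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are also write-side CSV options that default to true (which is what strips the spaces); the paths are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("keep-whitespace").getOrCreate()

val input = spark.read.json("input.json")

input.write
  .option("ignoreLeadingWhiteSpace", "false")  // keep leading spaces in values
  .option("ignoreTrailingWhiteSpace", "false") // keep trailing spaces in values
  .option("header", "true")
  .csv("output-csv/")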

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

大城市里の小女人 submitted on 2020-01-10 02:46:15
Question: I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.

17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout
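
A sketch of two ways to approach the limit, assuming the underlying problem is a large result being pulled back to the driver; the values and paths are illustrative, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-results")
  // An explicit unit avoids any ambiguity about how a bare number like 2050
  // is interpreted. Raising the limit only helps if the driver also has
  // enough memory (spark.driver.memory, set via spark-submit) to hold the
  // collected result.
  .config("spark.driver.maxResultSize", "2g")
  .getOrCreate()

// Where possible, write large results out from the executors instead of
// collecting them on the driver, which sidesteps maxResultSize entirely.
val df = spark.range(0, 1000000).toDF("id")
df.write.mode("overwrite").parquet("output/ids/") // placeholder path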

Spark (Scala) execute dataframe within for loop

最后都变了- submitted on 2020-01-07 09:55:15
Question: I am using Spark version 1.6.1. I have a requirement to execute a DataFrame in a loop.

for (i <- List('a','b')) { val i = sqlContext.sql("SELECT i, col1, col2 FROM DF1") }

I want this DataFrame to be executed twice (i = a and i = b).

Answer 1: Your code is almost correct, except for two things: i is already used in your for loop, so don't reuse it in val i =; and if you want to use the value of i in a string, use string interpolation. So your code should look like: for (i <- List('a','b')) { val df = sqlContext
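
A sketch completing that loop with string interpolation; it assumes DF1 is a registered temp table/view containing columns a, b, col1 and col2, and uses a Spark 2.x SparkSession for self-containedness (the question's Spark 1.6 sqlContext.sql call takes the same interpolated string).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("loop-sql").getOrCreate()

for (i <- List('a', 'b')) {
  // $i is substituted into the SQL text before it is parsed, so each
  // iteration runs a different query against DF1.
  val df = spark.sql(s"SELECT $i, col1, col2 FROM DF1")
  df.show()
}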
