spark-dataframe

Combining csv files with mismatched columns

有些话、适合烂在心里 submitted on 2020-01-13 06:30:08
Question: I need to combine multiple CSV files into one object (a DataFrame, I assume), but they all have mismatched columns, like so:

CSV A: store_location_key | product_key | collector_key | trans_dt | sales | units | trans_key
CSV B: collector_key | trans_dt | store_location_key | product_key | sales | units | trans_key
CSV C: collector_key | trans_dt | store_location_key | product_key | sales | units | trans_id

On top of that, I need these to match with two additional csv files that have a matching
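
Below is a minimal Scala sketch of one way to combine such files, assuming Spark 2.3+ (for unionByName) and the column names listed above; the file paths are placeholders, not from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combine-csv").getOrCreate()

val dfA = spark.read.option("header", "true").csv("a.csv")
val dfB = spark.read.option("header", "true").csv("b.csv")
// CSV C uses trans_id where the others use trans_key, so rename it first.
val dfC = spark.read.option("header", "true").csv("c.csv")
  .withColumnRenamed("trans_id", "trans_key")

// unionByName matches columns by name rather than position, so the differing
// column order across the three files does not matter.
val combined = dfA.unionByName(dfB).unionByName(dfC)
combined.show()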

Spark append mode for partitioned text file fails with SaveMode.Append - IOException File already Exists

守給你的承諾、 submitted on 2020-01-13 04:36:10
Question: Something as simple as writing partitioned text files fails.

dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")

Exception:

16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
java.io.IOException: File already exists:s3://path/1839dd1ed38a.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:614)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at org.apache
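
A sketch of one commonly suggested mitigation, assuming the error comes from retried or speculative task attempts racing to create the same S3 output file; the input data and paths below are placeholders standing in for the question's dataDF.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("partitioned-append")
  // Avoid duplicate speculative attempts writing the same part file.
  .config("spark.speculation", "false")
  .getOrCreate()
import spark.implicits._

// Stand-in for dataDF: one string column plus the partition columns.
val dataDF = Seq(
  ("event-1", 2016, 7, 6),
  ("event-2", 2016, 7, 6)
).toDF("value", "year", "month", "date")

dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Append)
  .text("s3://data/test2/events/")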

Spark, add new Column with the same value in Scala

佐手、 submitted on 2020-01-12 04:54:08
Question: I have a problem with the withColumn function in a Spark-Scala environment. I would like to add a new column to my DataFrame, so that this:

+---+----+---+
|  A|   B|  C|
+---+----+---+
|  4|blah|  2|
|  2|    |  3|
| 56| foo|  3|
|100|null|  5|
+---+----+---+

becomes:

+---+----+---+-----+
|  A|   B|  C|   D |
+---+----+---+-----+
|  4|blah|  2|  750|
|  2|    |  3|  750|
| 56| foo|  3|  750|
|100|null|  5|  750|
+---+----+---+-----+

Column D is one value repeated N times, once for each row of my DataFrame. The code is this: var
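
A minimal Scala sketch of the usual way to do this with lit(); the sample data mirrors the table above and 750 stands in for the repeated value.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("constant-column").getOrCreate()
import spark.implicits._

val df = Seq(
  (4, "blah", 2),
  (2, "", 3),
  (56, "foo", 3),
  (100, null.asInstanceOf[String], 5)
).toDF("A", "B", "C")

// withColumn expects a Column, so the constant must be wrapped in lit()
// rather than passed as a bare value.
val withD = df.withColumn("D", lit(750))
withD.show()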

Pyspark: filter dataframe by regex with string formatting?

孤街醉人 submitted on 2020-01-12 01:47:09
Question: I've read several posts on using the "like" operator to filter a Spark DataFrame on the condition that a column contains a string/expression, but I was wondering whether the following is a best practice for using %s in the desired condition:

input_path = <s3_location_str>
my_expr = "Arizona.*hot"  # a regex expression
dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx
# is the following correct?
substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
dk = dx.filter("keyword
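
For reference, the usual alternative to interpolating the pattern into a SQL "like" string is Column.rlike, which takes the regular expression directly; PySpark exposes the same rlike method on columns. Below is a Scala sketch of that approach, with a placeholder input path and the column name and regex taken from the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("regex-filter").getOrCreate()

val dx = spark.read.parquet("s3://bucket/path/") // placeholder path
val myExpr = "Arizona.*hot" // the regex from the question

// rlike applies the regex to the column value, so no %...% wrapping or
// manual string formatting of the SQL text is needed.
val dk = dx.filter(col("keyword").rlike(myExpr))
dk.show()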

Dataframe from List<String> in Java

扶醉桌前 submitted on 2020-01-11 07:12:10
Question: Spark version: 1.6.2, Java version: 7. I have a List<String> data, something like: [[dev, engg, 10000], [karthik, engg, 20000]..]. I know the schema for this data: name (String), degree (String), salary (Integer). I tried:

JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);

Output:

root
 |-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record              |
+-------------------------
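
The raw strings are not JSON, which is why read().json() yields only _corrupt_record. Below is a sketch of the Row-plus-explicit-schema approach, written in Scala (the Java API has the same shape via RowFactory.create and DataTypes); the sample strings stand in for the List<String> and assume comma-separated fields.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("list-to-df").getOrCreate()

// Stand-in for the List<String>; each element holds one record's fields.
val datas = Seq("dev,engg,10000", "karthik,engg,20000")

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("degree", StringType, nullable = true),
  StructField("salary", IntegerType, nullable = true)
))

// Split each string into fields and build typed Rows matching the schema.
val rows = spark.sparkContext
  .parallelize(datas)
  .map(_.split(","))
  .map(parts => Row(parts(0).trim, parts(1).trim, parts(2).trim.toInt))

val df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show(false)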

Spark SQL removing white spaces

守給你的承諾、 submitted on 2020-01-10 06:08:51
Question: I have a simple Spark program which reads a JSON file and emits a CSV file. In the JSON data the values contain leading and trailing white space; when I emit the CSV, the leading and trailing white space is gone. Is there a way I can retain the spaces? I tried many options like ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but no luck.

input.json
{"key" : "k1", "value1": "Good String", "value2": "Good String"}
{"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
{"key" :
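
A sketch of setting the two options on the CSV writer, assuming Spark 2.2+ where ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are also write-side CSV options that default to true (which is what strips the spaces); the paths are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("keep-whitespace").getOrCreate()

val input = spark.read.json("input.json")

input.write
  .option("ignoreLeadingWhiteSpace", "false")  // keep leading spaces in values
  .option("ignoreTrailingWhiteSpace", "false") // keep trailing spaces in values
  .option("header", "true")
  .csv("output-csv/")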

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

大城市里の小女人 submitted on 2020-01-10 02:46:15
Question: I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.

17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout
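
A sketch of two ways to approach the limit, assuming the underlying problem is a large result being pulled back to the driver; the values and paths are illustrative, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-results")
  // An explicit unit avoids any ambiguity about how a bare number like 2050
  // is interpreted. Raising the limit only helps if the driver also has
  // enough memory (spark.driver.memory, set via spark-submit) to hold the
  // collected result.
  .config("spark.driver.maxResultSize", "2g")
  .getOrCreate()

// Where possible, write large results out from the executors instead of
// collecting them on the driver, which sidesteps maxResultSize entirely.
val df = spark.range(0, 1000000).toDF("id")
df.write.mode("overwrite").parquet("output/ids/") // placeholder path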

Spark (Scala) execute dataframe within for loop

最后都变了- submitted on 2020-01-07 09:55:15
Question: I am using Spark version 1.6.1. I have a requirement to execute a DataFrame in a loop.

for (i <- List('a','b')) { val i = sqlContext.sql("SELECT i, col1, col2 FROM DF1") }

I want this DataFrame to be executed twice (i = a and i = b).

Answer 1: Your code is almost correct, except for two things: i is already used in your for loop, so don't reuse it in val i =; and if you want to use the value of i in a string, use string interpolation. So your code should look like: for (i <- List('a','b')) { val df = sqlContext
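
A sketch completing that loop with string interpolation; it assumes DF1 is a registered temp table/view containing columns a, b, col1 and col2, and uses a Spark 2.x SparkSession for self-containedness (the question's Spark 1.6 sqlContext.sql call takes the same interpolated string).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("loop-sql").getOrCreate()

for (i <- List('a', 'b')) {
  // $i is substituted into the SQL text before it is parsed, so each
  // iteration runs a different query against DF1.
  val df = spark.sql(s"SELECT $i, col1, col2 FROM DF1")
  df.show()
}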
