spark-dataframe

How to read “.gz” compressed file using spark DF or DS?

人走茶凉 submitted on 2020-05-29 05:11:16
Question: I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset? Details: the file is a CSV with tab delimiters.

Answer 1: Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):

    val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

    df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration to take into …
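
As a concrete illustration of the PySpark variant above, here is a minimal, self-contained sketch; the file name data.csv.gz and the header option are assumptions for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-gz-csv").getOrCreate()

    # Spark picks the gzip codec from the .gz extension, so the read call is the
    # same as for an uncompressed file; only the tab delimiter needs to be set.
    df = spark.read.option("sep", "\t").option("header", "true").csv("data.csv.gz")  # assumed path

    df.printSchema()
    df.show(5)

One thing to keep in mind is that gzip is not a splittable codec, so a single .gz file is read by one task regardless of its size.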

How to write pyspark dataframe to HDFS and then how to read it back into dataframe?

一笑奈何 submitted on 2020-05-28 13:46:55
Question: I have a very big PySpark DataFrame, so I want to perform preprocessing on subsets of it and then store them to HDFS. Later I want to read all of them back and merge them together. Thanks.

Answer 1: Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append')  ## df is an existing DataFrame object

Some of the format options are csv, parquet, json, etc.

Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    …
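
For Spark 2.x and later the same round trip goes through SparkSession rather than SQLContext. A minimal sketch, where the HDFS path hdfs:///user/example/preprocessed/ and the toy subset DataFrame are assumptions for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-roundtrip").getOrCreate()

    # A small stand-in for one preprocessed subset of the big DataFrame.
    subset = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    path = "hdfs:///user/example/preprocessed/"  # assumed path; replace with your own

    # Append each subset to the same Parquet directory as it is processed.
    subset.write.mode("append").parquet(path)

    # Later: reading the directory back returns all appended subsets merged
    # into a single DataFrame (Parquet preserves the schema).
    merged = spark.read.parquet(path)
    merged.show()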

Spark colocated join between two partitioned dataframes

*爱你&永不变心* submitted on 2020-05-25 06:52:47
Question: For the following join between two DataFrames in Spark 1.6.0:

    val df0Rep = df0.repartition(32, col("a")).cache
    val df1Rep = df1.repartition(32, col("a")).cache
    val dfJoin = df0Rep.join(df1Rep, "a")
    println(dfJoin.count)

Is this join not only co-partitioned but also co-located? I know that for RDDs, if both use the same partitioner and are shuffled in the same operation, the join is co-located. But what about DataFrames? Thank you.

Answer 1: [https://medium.com/@achilleus/https-medium-com-joins-in …
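
The co-partitioning part of the question is easiest to investigate empirically by looking at the physical plan: if both inputs already carry the same hash partitioning on the join key, Spark does not insert another Exchange (shuffle) before the join. A PySpark sketch of that check (the toy DataFrames and the partition count of 32 are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("copartitioned-join").getOrCreate()

    df0 = spark.createDataFrame([(1, "x"), (2, "y")], ["a", "b"])
    df1 = spark.createDataFrame([(1, "p"), (2, "q")], ["a", "c"])

    # Repartition both sides on the join key and cache, mirroring the question.
    df0_rep = df0.repartition(32, "a").cache()
    df1_rep = df1.repartition(32, "a").cache()

    joined = df0_rep.join(df1_rep, "a")

    # Inspect the physical plan: an extra Exchange node under the join would
    # mean Spark still shuffles one or both sides; its absence means the
    # existing partitioning on "a" is reused.
    joined.explain()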

Spark - creating schema programmatically with different data types

≯℡__Kan透↙ submitted on 2020-05-15 06:31:24
Question: I have a dataset consisting of 7-8 fields of type String, Int and Float. I am trying to create the schema with the programmatic approach, using this:

    val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

and then mapping it to the Row type like this:

    val dataRdd = datafile.filter(x => x != header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4), col(5), col(6), col(7), col(8)))

But after creating the DataFrame, when I …
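
Note the visible mismatch in the snippet: the schema declares every column as StringType while the Row values are built as Int and Float, which commonly surfaces as an error once the DataFrame is materialized. A minimal PySpark sketch of declaring the types per column programmatically (the column names, types, and sample lines are assumptions for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

    spark = SparkSession.builder.appName("programmatic-schema").getOrCreate()

    # Give each column the type it actually holds instead of defaulting to StringType.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("score", FloatType(), True),
    ])

    # Parse each CSV line into values that match the declared types.
    lines = spark.sparkContext.parallelize(["alice,31,4.5", "bob,42,3.9"])
    rows = lines.map(lambda line: line.split(",")).map(
        lambda c: (c[0].strip(), int(c[1]), float(c[2]))
    )

    df = spark.createDataFrame(rows, schema)
    df.printSchema()
    df.show()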

Spark Reading Compressed with Special Format

生来就可爱ヽ(ⅴ<●) submitted on 2020-05-15 04:20:06
Question: I have a .gz file. I need to read this file and add the time and the file name to it, and I need your help recommending a way to handle these points. Because the file is compressed, the first line is not read in the proper format, which I think is due to an encoding problem. I tried the code below, but it is not working:

    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

The file has a special format and I need to …
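
No answer is included in the snippet above, but for the "add the time and file name" part specifically, the DataFrame reader exposes both as built-in functions. A hedged PySpark sketch of just that part (the path logs.csv.gz and the tab delimiter are assumptions, and the encoding/special-format issue is not addressed here):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name, current_timestamp

    spark = SparkSession.builder.appName("gz-with-filename").getOrCreate()

    # Read the compressed file; Spark picks the gzip codec from the extension.
    df = spark.read.option("sep", "\t").csv("logs.csv.gz")  # assumed path

    # Attach the source file name and the load time to every record.
    df_tagged = (
        df.withColumn("source_file", input_file_name())
          .withColumn("load_time", current_timestamp())
    )

    df_tagged.show(truncate=False)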

check if a row value is null in spark dataframe

泄露秘密 submitted on 2020-05-08 05:36:17
Question: I am using a custom function in PySpark to check a condition for each row of a Spark DataFrame and to add columns if the condition is true. The code is as below:

    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    from pyspark.sql import Row

    def customFunction(row):
        if (row.prod.isNull()):
            prod_1 = "new prod"
            return (row + Row(prod_1))
        else:
            prod_1 = row.prod
            return (row + Row(prod_1))

    sdf = sdf_temp.map(customFunction)
    sdf.show()

I get the error mentioned below:

    AttributeError: …
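
The error text is cut off above, but the usual column-level way to express this check is with when/isNull instead of a per-row Python function, since isNull() is a Column method rather than something available on a Row field. A minimal sketch (the toy DataFrame and its prod column are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when

    spark = SparkSession.builder.appName("null-check").getOrCreate()

    sdf_temp = spark.createDataFrame([(1, "widget"), (2, None)], ["id", "prod"])

    # Express the null check on the column and fill the new column accordingly.
    sdf = sdf_temp.withColumn(
        "prod_1",
        when(col("prod").isNull(), "new prod").otherwise(col("prod")),
    )

    sdf.show()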

How does createOrReplaceTempView work in Spark?

和自甴很熟 submitted on 2020-05-06 00:13:57
Question: I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?

Answer 1: createOrReplaceTempView creates (or replaces, if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.

    scala> val s = Seq(1,2,3).toDF("num")
    s: org.apache.spark.sql.DataFrame = [num: int]
    …
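
A short PySpark sketch of the behavior described in the answer: the view is only a name for the query, and nothing is held in memory until the backing DataFrame is cached explicitly (the view name people and the toy data are assumptions for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("temp-view").getOrCreate()

    df = spark.createDataFrame([(1, "ann"), (2, "bo")], ["id", "name"])

    # Registers a lazily evaluated view; no data is materialized at this point.
    df.createOrReplaceTempView("people")

    # Queries against the view re-evaluate the underlying DataFrame each time...
    spark.sql("SELECT count(*) AS n FROM people").show()

    # ...unless the backing data is cached explicitly.
    df.cache()
    df.count()  # the first action populates the cache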