spark-dataframe

How to read “.gz” compressed file using spark DF or DS?

人走茶凉 submitted on 2020-05-29 05:11:16
Question: I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset? Details: the file is a CSV with tab delimiters.

Answer 1: Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):

    val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

    df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration to take into …
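
As a concrete illustration of the PySpark variant above, here is a minimal, self-contained sketch; the file name data.csv.gz and the header option are assumptions for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-gz-csv").getOrCreate()

    # Spark picks the gzip codec from the .gz extension, so the read call is the
    # same as for an uncompressed file; only the tab delimiter needs to be set.
    df = spark.read.option("sep", "\t").option("header", "true").csv("data.csv.gz")  # assumed path

    df.printSchema()
    df.show(5)

One thing to keep in mind is that gzip is not a splittable codec, so a single .gz file is read by one task regardless of its size.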

How to write pyspark dataframe to HDFS and then how to read it back into dataframe?

一笑奈何 submitted on 2020-05-28 13:46:55
Question: I have a very big PySpark DataFrame, so I want to perform preprocessing on subsets of it and then store them to HDFS. Later I want to read all of them back and merge them together. Thanks.

Answer 1: Writing a DataFrame to HDFS (Spark 1.6):

    df.write.save('/target/path/', format='parquet', mode='append')  ## df is an existing DataFrame object

Some of the format options are csv, parquet, json, etc.

Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    …
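
For Spark 2.x and later the same round trip goes through SparkSession rather than SQLContext. A minimal sketch, where the HDFS path hdfs:///user/example/preprocessed/ and the toy subset DataFrame are assumptions for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-roundtrip").getOrCreate()

    # A small stand-in for one preprocessed subset of the big DataFrame.
    subset = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    path = "hdfs:///user/example/preprocessed/"  # assumed path; replace with your own

    # Append each subset to the same Parquet directory as it is processed.
    subset.write.mode("append").parquet(path)

    # Later: reading the directory back returns all appended subsets merged
    # into a single DataFrame (Parquet preserves the schema).
    merged = spark.read.parquet(path)
    merged.show()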

Spark colocated join between two partitioned dataframes

*爱你&永不变心* submitted on 2020-05-25 06:52:47
Question: For the following join between two DataFrames in Spark 1.6.0:

    val df0Rep = df0.repartition(32, col("a")).cache
    val df1Rep = df1.repartition(32, col("a")).cache
    val dfJoin = df0Rep.join(df1Rep, "a")
    println(dfJoin.count)

Is this join not only co-partitioned but also co-located? I know that for RDDs, if both use the same partitioner and are shuffled in the same operation, the join is co-located. But what about DataFrames? Thank you.

Answer 1: [https://medium.com/@achilleus/https-medium-com-joins-in …
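
The co-partitioning part of the question is easiest to investigate empirically by looking at the physical plan: if both inputs already carry the same hash partitioning on the join key, Spark does not insert another Exchange (shuffle) before the join. A PySpark sketch of that check (the toy DataFrames and the partition count of 32 are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("copartitioned-join").getOrCreate()

    df0 = spark.createDataFrame([(1, "x"), (2, "y")], ["a", "b"])
    df1 = spark.createDataFrame([(1, "p"), (2, "q")], ["a", "c"])

    # Repartition both sides on the join key and cache, mirroring the question.
    df0_rep = df0.repartition(32, "a").cache()
    df1_rep = df1.repartition(32, "a").cache()

    joined = df0_rep.join(df1_rep, "a")

    # Inspect the physical plan: an extra Exchange node under the join would
    # mean Spark still shuffles one or both sides; its absence means the
    # existing partitioning on "a" is reused.
    joined.explain()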

Spark - creating schema programmatically with different data types

≯℡__Kan透↙ submitted on 2020-05-15 06:31:24
Question: I have a dataset consisting of 7-8 fields of type String, Int and Float. I am trying to create the schema with the programmatic approach, using this:

    val schema = StructType(header.split(",").map(column => StructField(column, StringType, true)))

and then mapping it to the Row type like this:

    val dataRdd = datafile.filter(x => x != header).map(x => x.split(",")).map(col => Row(col(0).trim, col(1).toInt, col(2).toFloat, col(3), col(4), col(5), col(6), col(7), col(8)))

But after creating the DataFrame, when I …
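
Note the visible mismatch in the snippet: the schema declares every column as StringType while the Row values are built as Int and Float, which commonly surfaces as an error once the DataFrame is materialized. A minimal PySpark sketch of declaring the types per column programmatically (the column names, types, and sample lines are assumptions for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

    spark = SparkSession.builder.appName("programmatic-schema").getOrCreate()

    # Give each column the type it actually holds instead of defaulting to StringType.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("score", FloatType(), True),
    ])

    # Parse each CSV line into values that match the declared types.
    lines = spark.sparkContext.parallelize(["alice,31,4.5", "bob,42,3.9"])
    rows = lines.map(lambda line: line.split(",")).map(
        lambda c: (c[0].strip(), int(c[1]), float(c[2]))
    )

    df = spark.createDataFrame(rows, schema)
    df.printSchema()
    df.show()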

Spark Reading Compressed with Special Format

生来就可爱ヽ(ⅴ<●) submitted on 2020-05-15 04:20:06
Question: I have a .gz file. I need to read this file and add the time and the file name to it, and I need your help recommending a way to handle these points. Because the file is compressed, the first line is not read in the proper format, which I think is due to an encoding problem. I tried the code below, but it is not working:

    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

The file has a special format and I need to …
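
No answer is included in the snippet above, but for the "add the time and file name" part specifically, the DataFrame reader exposes both as built-in functions. A hedged PySpark sketch of just that part (the path logs.csv.gz and the tab delimiter are assumptions, and the encoding/special-format issue is not addressed here):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name, current_timestamp

    spark = SparkSession.builder.appName("gz-with-filename").getOrCreate()

    # Read the compressed file; Spark picks the gzip codec from the extension.
    df = spark.read.option("sep", "\t").csv("logs.csv.gz")  # assumed path

    # Attach the source file name and the load time to every record.
    df_tagged = (
        df.withColumn("source_file", input_file_name())
          .withColumn("load_time", current_timestamp())
    )

    df_tagged.show(truncate=False)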

check if a row value is null in spark dataframe

泄露秘密 submitted on 2020-05-08 05:36:17
Question: I am using a custom function in PySpark to check a condition for each row of a Spark DataFrame and to add columns if the condition is true. The code is as below:

    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    from pyspark.sql import Row

    def customFunction(row):
        if (row.prod.isNull()):
            prod_1 = "new prod"
            return (row + Row(prod_1))
        else:
            prod_1 = row.prod
            return (row + Row(prod_1))

    sdf = sdf_temp.map(customFunction)
    sdf.show()

I get the error mentioned below:

    AttributeError: …
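
The error text is cut off above, but the usual column-level way to express this check is with when/isNull instead of a per-row Python function, since isNull() is a Column method rather than something available on a Row field. A minimal sketch (the toy DataFrame and its prod column are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when

    spark = SparkSession.builder.appName("null-check").getOrCreate()

    sdf_temp = spark.createDataFrame([(1, "widget"), (2, None)], ["id", "prod"])

    # Express the null check on the column and fill the new column accordingly.
    sdf = sdf_temp.withColumn(
        "prod_1",
        when(col("prod").isNull(), "new prod").otherwise(col("prod")),
    )

    sdf.show()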

How does createOrReplaceTempView work in Spark?

和自甴很熟 submitted on 2020-05-06 00:13:57
Question: I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?

Answer 1: createOrReplaceTempView creates (or replaces, if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.

    scala> val s = Seq(1,2,3).toDF("num")
    s: org.apache.spark.sql.DataFrame = [num: int]
    …
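
A short PySpark sketch of the behavior described in the answer: the view is only a name for the query, and nothing is held in memory until the backing DataFrame is cached explicitly (the view name people and the toy data are assumptions for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("temp-view").getOrCreate()

    df = spark.createDataFrame([(1, "ann"), (2, "bo")], ["id", "name"])

    # Registers a lazily evaluated view; no data is materialized at this point.
    df.createOrReplaceTempView("people")

    # Queries against the view re-evaluate the underlying DataFrame each time...
    spark.sql("SELECT count(*) AS n FROM people").show()

    # ...unless the backing data is cached explicitly.
    df.cache()
    df.count()  # the first action populates the cache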