spark-dataframe

Spark: how to remove the last line in a CSV file

Submitted by 随声附和 on 2020-05-01 05:22:05

Question: I am new to Spark. I want to remove the header and the last lines from a CSV file that looks like this:

    Notes xyz
    "id","member_id"
    "60045257","63989975",
    "60981766","65023535",
    Total amount:4444228900
    Total amount: 133826689

I want to remove the lines Notes xyz, Total amount:4444228900 and Total amount: 133826689 from the file. I have already removed the first line from the file:

    val dfRetail = sc.textFile("file:////home/cloudera/Projects/Project3/test/test_3.csv");
    var header = dfRetail.first();
    var final_data = dfRetail.filter(row => row!…
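The excerpt above is cut off, so here is a minimal sketch of one way to drop the leading and trailing non-data lines. The question's own code is Scala; this sketch is PySpark, the path is copied from the question, and the index bounds are assumptions about the file layout:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Path copied from the question; adjust to the real file location.
    lines = sc.textFile("file:////home/cloudera/Projects/Project3/test/test_3.csv")
    total = lines.count()

    # zipWithIndex pairs each line with its position, which makes it easy to drop the
    # leading "Notes xyz" line and the two trailing "Total amount" summary lines by
    # index. Adjust the bounds if the header row should be dropped as well.
    cleaned = (lines.zipWithIndex()
                    .filter(lambda pair: 1 <= pair[1] < total - 2)
                    .map(lambda pair: pair[0]))

    cleaned.collect()

An alternative that avoids the extra count() pass is to filter by content instead, for example keeping only lines that start with a double quote.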

Spark 2.3.0 Read Text File With Header Option Not Working

Submitted by 心已入冬 on 2020-04-10 03:53:10

Question: The code below works and creates a Spark dataframe from a text file. However, I'm trying to use the header option to use the first row as the header, and for some reason it doesn't seem to be happening. I cannot understand why! It must be something stupid but I cannot solve this.

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.master("local").appName("Word Count")\
    ...     .config("spark.some.config.option", "some-value")\
    ...     .getOrCreate()
    >>> df = spark.read.option("header",…
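The snippet is truncated right at the header option. As a minimal sketch of the usual fix, assuming the file is actually delimited: spark.read.text always returns a single "value" column and does not apply the header option, so the option has to go through the csv reader. The path and extra options below are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.master("local").appName("Word Count")
             .config("spark.some.config.option", "some-value")
             .getOrCreate())

    # The header option is honoured by the csv reader; the text reader always yields
    # a single "value" column. Path and options here are placeholders.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/path/to/input.txt"))

    df.printSchema()
    df.show(5)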

Spark “replacing null with 0” performance comparison

Submitted by |▌冷眼眸甩不掉的悲伤 on 2020-04-08 09:45:07

Question: Spark 1.6.1, Scala API. For a dataframe, I need to replace all null values of a certain column with 0. I have two ways to do this:

    1. myDF.withColumn("pipConfidence", when($"mycol".isNull, 0).otherwise($"mycol"))
    2. myDF.na.fill(0, Seq("mycol"))

Are they essentially the same, or is one way preferred? Thank you!

Answer 1: They are not the same, but performance should be similar. na.fill uses coalesce, but it replaces NaN and NULLs, not only NULLs.

    val y = when($"x" === 0, $"x".cast("double")).when($"x"…
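The answer's Scala snippet is cut off above. Purely for illustration, here is the same comparison rendered in PySpark; the column name mycol comes from the question, while the sample data is made up:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local").appName("fill-null").getOrCreate()

    # Tiny made-up frame; note the NULL and the NaN in "mycol".
    myDF = spark.createDataFrame(
        [(1, 0.5), (2, None), (3, float("nan"))], ["id", "mycol"])

    # Option 1: conditional replacement, touches NULLs only (the NaN stays NaN).
    opt1 = myDF.withColumn(
        "mycol", F.when(F.col("mycol").isNull(), 0).otherwise(F.col("mycol")))

    # Option 2: na.fill / fillna, which as the answer notes replaces both NULL and NaN.
    opt2 = myDF.na.fill(0, subset=["mycol"])

    opt1.show()
    opt2.show()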

Getting last value of group in Spark

Submitted by 给你一囗甜甜゛ on 2020-04-07 03:44:12

Question: I have a SparkR DataFrame as shown below:

    # Create R data.frame
    custId <- c(rep(1001, 5), rep(1002, 3), 1003)
    date <- c('2013-08-01','2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-02-01','2014-03-01','2014-04-01','2014-04-01')
    desc <- c('New','New','Good','New','Bad','New','Good','Good','New')
    newcust <- c(1,1,0,1,0,1,0,0,1)
    df <- data.frame(custId, date, desc, newcust)

    # Create SparkR DataFrame
    df <- createDataFrame(df)
    display(df)

    custId | date | desc | newcust
    ------------------------------…
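The excerpt is cut off before the expected output, so the goal can only be inferred from the title. As a hedged sketch of the generic "last row per group" pattern, here it is in PySpark rather than SparkR, using the question's data and assuming the grouping key is custId and the ordering column is date:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.master("local").appName("last-per-group").getOrCreate()

    df = spark.createDataFrame(
        [(1001, "2013-08-01", "New", 1), (1001, "2014-01-01", "New", 1),
         (1001, "2014-02-01", "Good", 0), (1001, "2014-03-01", "New", 1),
         (1001, "2014-04-01", "Bad", 0), (1002, "2014-02-01", "New", 1),
         (1002, "2014-03-01", "Good", 0), (1002, "2014-04-01", "Good", 0),
         (1003, "2014-04-01", "New", 1)],
        ["custId", "date", "desc", "newcust"])

    # Rank rows within each customer by date, newest first, then keep the top row:
    # that yields the last (most recent) record per custId. ISO-formatted date
    # strings sort correctly even without casting to a date type.
    w = Window.partitionBy("custId").orderBy(F.col("date").desc())
    last_per_cust = (df.withColumn("rn", F.row_number().over(w))
                       .filter(F.col("rn") == 1)
                       .drop("rn"))

    last_per_cust.show()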

How to count unique IDs after groupBy in PySpark

Submitted by て烟熏妆下的殇ゞ on 2020-04-05 15:41:49

Question: I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.

    from pyspark.sql.functions import col
    import pyspark.sql.functions as fn

    gr = Df2.groupby(['Year'])
    df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is: [students by year][1]. The problem I discovered is that many IDs are repeated, so the result is wrong and huge. I want to aggregate the students by year, count the total…
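A minimal sketch of the usual fix, counting distinct IDs per year instead of all rows; the tiny frame below is a made-up stand-in for the question's Df2:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as fn
    from pyspark.sql.functions import col

    spark = SparkSession.builder.master("local").appName("distinct-students").getOrCreate()

    # Made-up stand-in for Df2; note the repeated Student_IDs within a year.
    Df2 = spark.createDataFrame(
        [(2015, "S1"), (2015, "S1"), (2015, "S2"), (2016, "S1"), (2016, "S3")],
        ["Year", "Student_ID"])

    # countDistinct counts each Student_ID once per year, so repeated IDs no longer
    # inflate the totals the way fn.count does.
    df_grouped = (Df2.groupby(["Year"])
                     .agg(fn.countDistinct(col("Student_ID")).alias("total_student_by_year")))

    df_grouped.show()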

TypeError: 'Column' object is not callable using WithColumn

Submitted by 落爺英雄遲暮 on 2020-02-21 11:22:54

Question: I would like to append a new column to dataframe "df" from the function get_distance:

    def get_distance(x, y):
        dfDistPerc = hiveContext.sql("select column3 as column3 \
                                      from tab \
                                      where column1 = '" + x + "' \
                                      and column2 = " + y + " \
                                      limit 1")
        result = dfDistPerc.select("column3").take(1)
        return result

    df = df.withColumn(
        "distance",
        lit(get_distance(df["column1"], df["column2"]))
    )

But I get this: TypeError: 'Column' object is not callable. I think it happens because x and y are Column objects…
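The truncated excerpt points at the core problem: get_distance receives Column objects rather than row values, and a per-row hiveContext.sql call could not run on the executors anyway. One common alternative is to express the lookup as a join; the frames below are made-up stand-ins for the question's df and the hive table tab, whose schemas are assumptions based on the query:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local").appName("distance-join").getOrCreate()

    # Stand-ins for df and "tab"; the real schemas are assumptions.
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["column1", "column2"])
    tab = spark.createDataFrame(
        [("a", 1, 0.25), ("b", 2, 0.75)], ["column1", "column2", "column3"])

    # Instead of calling a Python function per row, look up column3 with a join and
    # rename it to the desired "distance" column.
    df_with_distance = (df.join(tab, on=["column1", "column2"], how="left")
                          .withColumnRenamed("column3", "distance"))

    df_with_distance.show()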

Get schema of a Parquet file without loading the file into a Spark dataframe in Python?

Submitted by 冷暖自知 on 2020-02-20 11:40:19

Question: Is there any Python library that can be used to just get the schema of a Parquet file? Currently we load the Parquet file into a dataframe in Spark and get the schema from that dataframe to display in some UI of the application. But initializing a Spark context, loading the dataframe and getting the schema from it is a time-consuming activity, so I'm looking for an alternative way to just get the schema.

Answer 1: This is supported by using pyarrow (https://github.com/apache/arrow/).

    from pyarrow…
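The answer's snippet is cut off after the import. A minimal sketch along the same lines, with a placeholder path, reads only the Parquet footer via pyarrow and never starts Spark:

    import pyarrow.parquet as pq

    # read_schema inspects only the file footer, so no Spark context and no full
    # data load is needed. The path is a placeholder.
    schema = pq.read_schema("/path/to/file.parquet")
    print(schema)

    # The footer also exposes row-group and column metadata if the UI needs it.
    metadata = pq.ParquetFile("/path/to/file.parquet").metadata
    print(metadata)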
