spark-dataframe

How to get the last row from DataFrame?

岁酱吖の submitted on 2020-01-23 06:46:29

Question: I have a DataFrame with two columns, 'value' and 'timestamp', where 'timestamp' is ordered. I want to get the last row of the DataFrame. What should I do?

This is my input:

    +-----+---------+
    |value|timestamp|
    +-----+---------+
    |    1|        1|
    |    4|        2|
    |    3|        3|
    |    2|        4|
    |    5|        5|
    |    7|        6|
    |    3|        7|
    |    5|        8|
    |    4|        9|
    |   18|       10|
    +-----+---------+

This is my code:

    val arr = Array((1,1),(4,2),(3,3),(2,4),(5,5),(7,6),(3,7),(5,8),(4,9),(18,10))
    var df = m_sparkCtx.parallelize(arr).toDF("value", "timestamp")
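
A minimal sketch of one way to get the last row, assuming ordering by the timestamp column is acceptable (df and the column names come from the question):

    import org.apache.spark.sql.functions.desc

    // Sort by timestamp descending and take the first row of the result.
    val lastRow = df.orderBy(desc("timestamp")).first()

    // Or pull out just the value:
    val lastValue = df.orderBy(desc("timestamp")).first().getAs[Int]("value")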

Creating a simple 1-row Spark DataFrame with Java API

China☆狼群 submitted on 2020-01-22 07:33:05

Question: In Scala, I can create a single-row DataFrame from an in-memory string like so:

    val stringAsList = List("buzz")
    val df = sqlContext.sparkContext.parallelize(stringAsList).toDF("fizz")
    df.show()

When df.show() runs, it outputs:

    +-----+
    | fizz|
    +-----+
    | buzz|
    +-----+

Now I'm trying to do this from inside a Java class. Apparently JavaRDDs don't have a toDF(String) method. I've tried:

    List<String> stringAsList = new ArrayList<String>();
    stringAsList.add("buzz");
    SQLContext sqlContext = new
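
The snippet above is truncated. As a reference point, here is a sketch of the schema-explicit route that avoids toDF entirely (shown in Scala; the Java SQLContext exposes the same createDataFrame(rowRDD, schema) overload, with RowFactory.create building the Row on the Java side):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Build the single row and its schema explicitly, then hand both to createDataFrame.
    val schema = StructType(Seq(StructField("fizz", StringType, nullable = true)))
    val rows = sqlContext.sparkContext.parallelize(Seq(Row("buzz")))
    val df = sqlContext.createDataFrame(rows, schema)
    df.show()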

Datasets in Apache Spark

£可爱£侵袭症+ submitted on 2020-01-22 03:12:24

Question:

    Dataset<Tweet> ds = sc.read().json("path").as(Encoders.bean(Tweet.class));
    ds.show();
    JavaRDD<Tweet> dstry = ds.toJavaRDD();
    System.out.println(dstry.first().getClass());

This fails with:

    Caused by: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
    File 'generated.java', Line 50, Column 16: failed to compile:
    org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 50, Column 16:
    No applicable constructor/method found for actual parameters "org.apache
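
Errors from the generated code like this usually mean the encoder cannot map the inferred JSON schema onto the Tweet bean (for example a field type mismatch or a missing setter); that is only a guess here, since the question is truncated. For comparison, a minimal sketch of the same typed-Dataset pattern with a Scala case-class encoder, where the Tweet fields are assumptions:

    import org.apache.spark.sql.SparkSession

    // Field names and types here are assumptions, not taken from the question's Tweet class.
    case class Tweet(id: Long, text: String)

    val spark = SparkSession.builder().appName("datasets-example").getOrCreate()
    import spark.implicits._

    // The case-class encoder is derived from the fields; the JSON attribute
    // names and types must line up with them for as[Tweet] to succeed.
    val ds = spark.read.json("path").as[Tweet]
    ds.show()
    println(ds.rdd.first().getClass)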

How does Spark keep track of the splits in randomSplit?

蓝咒 submitted on 2020-01-21 12:46:09

Question: This question, "How does Sparks RDD.randomSplit actually split the RDD", explains how Spark's random split works, but I don't understand how Spark keeps track of what values went to one split so that those same values don't go to the second split. If we look at the implementation of randomSplit:

    def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
      // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
      // constituent partitions each time
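
The excerpt is cut off, but the observable contract is easy to check: for a fixed seed each split re-samples the same data with complementary probability ranges, so the splits come out deterministic and disjoint. A small sketch (names made up for the example):

    // Deterministic, non-overlapping splits for a fixed seed.
    val data = spark.range(0, 100).toDF("id")
    val Array(a, b) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    println(a.count() + b.count())   // 100 -- together the splits cover every row exactly once
    println(a.intersect(b).count())  // 0   -- no row lands in both splits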

Count on Spark Dataframe is extremely slow

落花浮王杯 submitted on 2020-01-21 03:26:32

Question: I'm creating a new DataFrame with a handful of records from a join.

    val joined_df = first_df.join(second_df,
      first_df.col("key") === second_df.col("key") && second_df.col("key").isNull,
      "left_outer")
    joined_df.repartition(1)
    joined_df.cache()
    joined_df.count()

Everything is fast (under one second) except the count operation. The RDD conversion kicks in and literally takes hours to complete. Is there any way to speed things up?

    INFO MemoryStore: Block rdd_63_140 stored as values in memory
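
One detail worth flagging in the snippet: repartition returns a new DataFrame rather than modifying joined_df in place, so that call is discarded, and count() is the first action, so it is the step that actually executes the join. A sketch of the usual chaining (same names and join condition as the question):

    // Keep the repartitioned, cached DataFrame and reuse it; the join itself
    // still runs on the first action, which is count() here.
    val joined_df = first_df
      .join(second_df,
        first_df.col("key") === second_df.col("key") && second_df.col("key").isNull,
        "left_outer")
      .repartition(1)
      .cache()

    joined_df.count()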

Spark scala - Nested StructType conversion to Map

匆匆过客 submitted on 2020-01-15 11:10:51

Question: I am using Spark 1.6 in Scala. I created an index in Elasticsearch with an object. The object "params" was created as a Map[String, Map[String, String]]. Example:

    val params: Map[String, Map[String, String]] = Map(
      "p1" -> Map("p1_detail" -> "table1"),
      "p2" -> Map("p2_detail" -> "table2", "p2_filter" -> "filter2"),
      "p3" -> Map("p3_detail" -> "table3"))

That gives me records that look like the following:

    {
      "_index": "x",
      "_type": "1",
      "_id": "xxxxxxxxxxxx",
      "_score": 1,
      "_timestamp": 1506537199650,
      "
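
The record is cut off above, but given the title, here is a minimal sketch of one way to turn a nested params struct back into a Map[String, Map[String, String]] once the index has been read into a DataFrame df (the leaf values are assumed to all be strings; everything except the params column name is an assumption):

    import org.apache.spark.sql.Row

    // Rebuild the two-level Map from each nested Row's field names and values;
    // null sub-structs are simply skipped by the partial function.
    val paramsMaps: Array[Map[String, Map[String, String]]] =
      df.select("params").collect().map { case Row(params: Row) =>
        params.getValuesMap[Row](params.schema.fieldNames).collect {
          case (key, inner: Row) =>
            key -> inner.getValuesMap[String](inner.schema.fieldNames)
        }
      }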

unzip list of tuples in pyspark dataframe

与世无争的帅哥 submitted on 2020-01-14 06:52:24

Question: I want to unzip a list of tuples in a column of a PySpark DataFrame. Let's say a column contains [(blue, 0.5), (red, 0.1), (green, 0.7)]; I want to split it into two columns, with the first column as [blue, red, green] and the second column as [0.5, 0.1, 0.7].

    +-----+-------------------------------------------+
    |Topic| Tokens                                    |
    +-----+-------------------------------------------+
    |    1| ('blue', 0.5),('red', 0.1),('green', 0.7)|
    |    2| ('red', 0.9),('cyan', 0.5),('white', 0.4)|
    +-----+-----------------------------------
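
A sketch of one way to do the split with the DataFrame API (shown in Scala), assuming Tokens is an array of structs with a string and a double field, here called label and weight, and Spark 2.4+ for the transform function; the field names are assumptions:

    import org.apache.spark.sql.functions.{col, expr}

    // Pull each struct field out of the array into its own array column.
    val split = df.select(
      col("Topic"),
      expr("transform(Tokens, t -> t.label)").as("labels"),
      expr("transform(Tokens, t -> t.weight)").as("weights"))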

DataFrame filtering based on second Dataframe

前提是你 submitted on 2020-01-13 14:05:04

Question: Using Spark SQL, I have two DataFrames, both created from the same one, such as:

    df = sqlContext.createDataFrame(...);
    df1 = df.filter("value = 'abc'"); //[path, value]
    df2 = df.filter("value = 'qwe'"); //[path, value]

I want to filter df1 so that a row is kept only if part of its 'path' is a path in df2. So if df1 has a row with path 'a/b/c/d/e', I would find out whether df2 has a row whose path is 'a/b/c'. In SQL it should be something like

    SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)

where udf is a user-defined function
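
A hedged sketch of one way to express this as a join instead of a UDF inside a subquery, assuming the relationship is a simple string-prefix test (DataFrame and column names come from the question; shown with the Scala API):

    import org.apache.spark.sql.functions.col

    // Keep rows of df1 whose path starts with some path from df2. A left-semi
    // join returns only df1's columns and drops rows without a match.
    // (A plain prefix test would also match 'a/b/cd' against 'a/b/c'; append a
    // trailing '/' to the prefix if exact segment boundaries matter.)
    val filtered = df1.join(
      df2.select(col("path").as("prefix")),
      df1.col("path").startsWith(col("prefix")),
      "leftsemi")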

Can I change the nullability of a column in my Spark dataframe?

筅森魡賤 submitted on 2020-01-13 11:22:53

Question: I have a StructField in a DataFrame that is not nullable. Simple example:

    import pyspark.sql.functions as F
    from pyspark.sql.types import *

    l = [('Alice', 1)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
    df.schema.fields

which returns:

    [StructField(name,StringType,true),
     StructField(age,LongType,true),
     StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. Problem is that (for reasons I
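
The question is truncated, but one common way to relax a column's nullability is to rebuild the DataFrame over the same rows with an edited schema. A sketch in Scala (the foo column name comes from the question; spark as the session variable is an assumption):

    import org.apache.spark.sql.types.{StructField, StructType}

    // Copy the schema, flipping nullable to true for the 'foo' column, then
    // recreate the DataFrame from the underlying rows with the new schema.
    val relaxedSchema = StructType(df.schema.map {
      case StructField(name, dataType, _, metadata) if name == "foo" =>
        StructField(name, dataType, nullable = true, metadata)
      case other => other
    })
    val relaxedDf = spark.createDataFrame(df.rdd, relaxedSchema)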