spark-dataframe

Apache Spark Dataset API: head(n: Int) vs take(n: Int)

Submitted by 老子叫甜甜 on 2020-08-23 03:45:46
Question: The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int). The Dataset.scala source contains

    def take(n: Int): Array[T] = head(n)

I couldn't find any difference in execution between these two functions. Why does the API have two different methods that yield the same result?

Answer 1: I have experimented and found that head(n) and take(n) give exactly the same output. Both produce their output as Row objects only.

    DF.head(2)
    [Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1
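As an illustration of the delegation, here is a minimal, self-contained sketch (the local session and the toy dataset are assumptions, not from the question) showing that both calls return the same Array:

    import org.apache.spark.sql.SparkSession

    object HeadVsTake {
      def main(args: Array[String]): Unit = {
        // Local session and toy data are assumptions for demonstration only.
        val spark = SparkSession.builder().appName("head-vs-take").master("local[*]").getOrCreate()
        import spark.implicits._

        val ds = Seq("a", "b", "c", "d").toDS()

        // Dataset.scala defines take(n) as head(n), so both return the
        // first n elements as an Array[String] here.
        val h = ds.head(2)
        val t = ds.take(2)
        println(h.sameElements(t)) // prints: true

        spark.stop()
      }
    }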

Spark: Read an InputStream instead of a file

Submitted by Deadly on 2020-08-22 09:27:20
Question: I'm using SparkSQL in a Java application to do some processing on CSV files, using the Databricks library for parsing. The data I am processing comes from different sources (remote URL, local file, Google Cloud Storage), and I'm in the habit of turning everything into an InputStream so that I can parse and process data without knowing where it came from. All the documentation I've seen on Spark reads files from a path, e.g.

    SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
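The question is cut off here, but one commonly used workaround is to drain the InputStream into lines on the driver and hand Spark a Dataset[String]; since Spark 2.2, DataFrameReader.csv accepts such a dataset directly. A minimal Scala sketch, assuming the stream fits in driver memory (the helper name is hypothetical):

    import java.io.InputStream
    import scala.io.Source
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object StreamCsv {
      // Hypothetical helper: parse a CSV InputStream into a DataFrame.
      // Caution: this materializes the entire stream on the driver.
      def csvFromStream(spark: SparkSession, in: InputStream): DataFrame = {
        import spark.implicits._
        val lines = Source.fromInputStream(in, "UTF-8").getLines().toSeq
        val ds = spark.createDataset(lines)          // Dataset[String] of raw CSV lines
        spark.read.option("header", "true").csv(ds)  // Spark 2.2+: csv(Dataset[String])
      }
    }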

Rename a written CSV file in Spark

Submitted by 偶尔善良 on 2020-06-27 03:52:09
Question: I'm running Spark 2.1 and I want to write a CSV with the results to Amazon S3. After repartitioning, the CSV file has a rather long, cryptic name, and I want to change that to a specific filename. I'm using the Databricks library for writing to S3:

    dataframe
      .repartition(1)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("folder/dataframe/")

Is there a way to rename the file afterwards, or even to save it directly under the correct name? I've already looked for solutions and
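The question is truncated, but the standard approach is worth noting: Spark always writes part-* files into a directory, so a specific filename has to be applied afterwards through the Hadoop FileSystem API. A minimal sketch, assuming a single output partition (the paths and the helper name are hypothetical):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    object RenameCsvOutput {
      // Hypothetical helper: rename the single part file Spark wrote into `dir`.
      def renamePartFile(spark: SparkSession, dir: String, target: String): Unit = {
        // Resolve the FileSystem that owns the output path (HDFS, s3a://, file://).
        val conf = spark.sparkContext.hadoopConfiguration
        val fs   = new Path(dir).getFileSystem(conf)

        // With .repartition(1) there is exactly one part file to find.
        val part = fs.globStatus(new Path(dir, "part-*"))(0).getPath

        // Move it to the desired filename.
        fs.rename(part, new Path(target))
      }
    }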

Trying to use map on a Spark DataFrame

Submitted by こ雲淡風輕ζ on 2020-06-24 22:24:07
Question: I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs, and everything went as expected. Now I am trying to implement my own example, but using DataFrames rather than RDDs. So I am reading a dataset from a file with

    DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

and then I try to select a specific column
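The question breaks off here, but the usual stumbling block with map on a DataFrame is that Dataset.map needs an Encoder for the result type. A minimal Scala sketch of the same idea (the file path, delimiter, and column name are assumptions):

    import org.apache.spark.sql.SparkSession

    object MapOverDataFrame {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("map-demo").master("local[*]").getOrCreate()
        import spark.implicits._ // provides Encoders for common result types

        // File path, delimiter, and column name below are hypothetical.
        val df = spark.read
          .option("inferSchema", "true")
          .option("delimiter", ";")
          .option("header", "true")
          .csv("input.csv")

        // map over Rows: getString(0) reads the first field of each selected Row;
        // the implicit Encoder[String] from spark.implicits._ makes this compile.
        val upper = df.select("name").map(row => row.getString(0).toUpperCase)

        upper.show()
        spark.stop()
      }
    }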

How to convert a generic RDD to a DataFrame?

Submitted by 不羁的心 on 2020-05-31 03:56:06
Question: I am writing a method that takes an RDD and saves it as an Avro file. The problem is that with a specific type I can call .toDF(), but I cannot call .toDF() on a generic RDD! Here is an example:

    case class Person(name: String)

    def f(x: RDD[Person]) = x.toDF()
    def g[T](x: RDD[T]) = x.toDF()

    f(p) // works
    g(p) // fails!!

Does anyone know why I can't call .toDF() on a generic RDD, and whether there is any way around it?

Answer 1: If you are using Spark 2,

    import org.apache.spark.sql.Encoder
    def g[T:
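The answer is cut off, but it appears to be heading toward constraining T with an Encoder context bound, which is the standard Spark 2 fix: toDF() on an RDD comes from an implicit conversion that itself demands an implicit Encoder[T], so an unconstrained T gives the compiler nothing to resolve. A minimal sketch of that pattern (the session setup and sample data are assumptions):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Encoder, SparkSession}

    object GenericToDf {
      case class Person(name: String)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("generic-todf").master("local[*]").getOrCreate()
        import spark.implicits._

        // The context bound [T: Encoder] supplies the implicit Encoder that the
        // conversion behind toDF() requires; with a bare T the compiler has no
        // Encoder to find, hence the failure in the question's g.
        def g[T: Encoder](x: RDD[T]): DataFrame = x.toDF()

        val p: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("Ann")))
        g(p).show()
        spark.stop()
      }
    }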