
Error while reading very large files with spark csv package

Question: We are trying to read a 3 GB file that has multiple newline characters in one of its columns, using spark-csv and the univocity 1.5.0 parser, but in some rows the record is split into multiple columns at the newline characters. This occurs with large files. We are using Spark 1.6.1 and Scala 2.10. I am using the following code to read the file: .format("com.databricks.spark.csv") .option("header", "true") .option("inferSchema", "true") .option("mode",

Spark DataFrame handling empty String in OneHotEncoder

Question: I am importing a CSV file (using spark-csv) into a DataFrame that has empty String values. When I apply the OneHotEncoder, the application crashes with the error: requirement failed: Cannot have an empty string for name. Is there a way to get around this? I can reproduce the error with the example provided on the Spark ML page: val df = sqlContext.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, ""), //<- original example has "a" here (4, "a"), (5, "c") )).toDF("id", "category") val

Scala: Spark SQL to_date(unix_timestamp) returning NULL

Question: Spark version: spark-2.0.1-bin-hadoop2.7. Scala: 2.11.8. I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format holds a string of the column names that need to be converted to yyyy-mm-dd format. In the following code, I first load the Date column of the CSV as StringType via the schema, and then I check whether date_format is non-empty, that is, whether there are columns that need to

How to force inferSchema for CSV to consider integers as dates (with “dateFormat” option)?

Question: I use Spark 2.2.0. I am reading a CSV file as follows: val dataFrame ="inferSchema", "true") .option("header", true) .option("dateFormat", "yyyyMMdd") .csv(pathToCSVFile) There is one date column in this file, and all records have a value equal to 20171001 for this particular column. The issue is that Spark infers the type of this column as integer rather than date. When I remove the "inferSchema" option, the type of that column is string. There is no null

Custom schema in spark-csv throwing error in spark 1.4.1

Question: I am trying to process a CSV file using the spark-csv package in spark-shell on Spark 1.4.1. scala> import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveContext scala> import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql.hive.orc._ scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}; import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType} scala> val hiveContext = new org.apache.spark.sql

Programmatically generate the schema AND the data for a dataframe in Apache Spark

Question: I would like to dynamically generate a DataFrame containing a header record for a report, that is, create a DataFrame from the value of the string below: val headerDescs : String = "Name,Age,Location" val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true))) However, I now want to do the same for the data (which is in effect the same data, i.e. the metadata). I create an RDD: val headerRDD = sc.parallelize(headerDescs.split(",")) I then

Parse Micro/Nano Seconds timestamp in spark-csv Dataframe reader : Inconsistent results

Question: I'm trying to read a CSV file that has timestamps with nanosecond precision. Spark 2.4.0, Scala 2.11.11. Sample content of the file TestTimestamp.csv: /** * TestTimestamp.csv - * 101,2019-SEP-23 AM * */ I tried to read it using timestampFormat = "yyyy-MMM-dd aaa" val dataSchema = StructType(Array(StructField("ID", DoubleType, true), StructField("Created_TS", TimestampType, true))) val data ="csv") .option("header", "false") .option("inferSchema",

add header and column to dataframe spark

Question: Hi all, I have a DataFrame to which I want to manually add a header and a first column. Here is the DataFrame: import org.apache.spark.sql.SparkSession val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate() val df = spark.read.option("header",true).option("inferSchema",true).csv("C:\\gg.csv").cache() The content of the DataFrame: 12,13,14 11,10,5 3,2,45 The expected output is: define,col1,col2,col3 c1,12,13,14 c2,11,10,5 c3,3,2,45 Any help would be appreciated.

How to programmatically create a custom org.apache.spark.sql.types.StructType schema object from a JSON file

Question: I have to create a custom org.apache.spark.sql.types.StructType schema object with the information from a JSON file. The JSON file can be anything, so I have parameterized it in a property file. This is what the property file looks like: // path to the schema of the output file (by default, the schema is inferred from the destination Parquet). If present, the schema will be in JSON format, applicable to a DataFrame (see StructType.fromJson) schema.parquet=/Users/XXXX/Desktop/generated_schema.json writing.mode=overwrite separator=; header=false The file generated_schema.json looks like: {"type" : "struct","fields" : [ {