spark-dataframe

Spark 2.0.0: SparkR CSV Import

左心房为你撑大大i submitted on 2021-01-27 06:44:00

Question: I am trying to read a CSV file into SparkR (running Spark 2.0.0) and to experiment with the newly added features. I am using RStudio here, and I am getting an error while reading the source file. My code:

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", appName = "SparkR")
df <- loadDF("F:/file.csv", "csv", header = "true")

I get an error at the loadDF function. The
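For reference, a minimal Scala sketch of the same CSV load through the DataFrameReader API underlying SparkR's loadDF; the path and the header option mirror the question and are only illustrative, and whether a file:/// URI is needed depends on the environment:

import org.apache.spark.sql.SparkSession

object CsvImportSketch {
  def main(args: Array[String]): Unit = {
    // Spark 2.0+ session, mirroring the sparkR.session() call in the question
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkR-equivalent")
      .getOrCreate()

    // Built-in csv source of the DataFrameReader (Spark 2.0+)
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("F:/file.csv") // illustrative path; "file:///F:/file.csv" may be needed on Windows

    df.printSchema()
    spark.stop()
  }
}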

Alternatives for Spark Dataframe's count() API

情到浓时终转凉″ submitted on 2020-12-12 11:39:28

Question: I'm using Spark with the Java connector to process my data. One of the essential operations I need is to count the number of records (rows) in a DataFrame. I tried df.count(), but the execution time is extremely slow (30-40 seconds for 2-3 million records). Also, due to the system's requirements, I don't want to use the df.rdd().countApprox() API because we need the exact count. Could somebody suggest any alternatives that return exactly the same result as df
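A suggestion that usually comes up: if the DataFrame is scanned more than once, pay the scan cost a single time by caching before counting, or fold the count into a regular aggregation; both return the exact number. A minimal Scala sketch under that assumption (method names are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object CountSketch {
  // Cache first, so the expensive scan is paid once and the count (and any
  // later actions) run against the cached data. Still an exact count.
  def cachedCount(df: DataFrame): Long = {
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()
  }

  // Express the count as an empty groupBy aggregation; this also returns the
  // exact count and can be combined with other aggregates in the same pass.
  def aggregatedCount(df: DataFrame): Long =
    df.groupBy().count().first().getLong(0)
}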

Spark DataFrame serialized as invalid json

一笑奈何 submitted on 2020-12-06 05:52:51

Question: TL;DR: When I dump a Spark DataFrame as JSON, I always end up with something like

{"key1": "v11", "key2": "v21"}
{"key1": "v12", "key2": "v22"}
{"key1": "v13", "key2": "v23"}

which is not a valid JSON document. I can manually edit the dumped file to get something I can parse:

[
{"key1": "v11", "key2": "v21"},
{"key1": "v12", "key2": "v22"},
{"key1": "v13", "key2": "v23"}
]

but I'm pretty sure I'm missing something that would let me avoid this manual edit; I just don't know what. More details: I have a
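For context, Spark's JSON writer emits JSON Lines (one object per line) by design, so this output is expected rather than a bug. A minimal Scala sketch of one common workaround, collecting the per-row JSON strings on the driver and wrapping them in brackets; this is only viable when the result is small enough to collect, and the output path is illustrative:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.DataFrame

object JsonArrayDump {
  // Turn each row into a JSON string, bring them to the driver, and join them
  // into a single top-level JSON array written as one file.
  def writeAsJsonArray(df: DataFrame, path: String): Unit = {
    val jsonArray = df.toJSON.collect().mkString("[\n", ",\n", "\n]")
    Files.write(Paths.get(path), jsonArray.getBytes(StandardCharsets.UTF_8))
  }
}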

Need to Know Partitioning Details in Dataframe Spark

喜夏-厌秋 submitted on 2020-12-06 04:37:49

Question: I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The partitioning of the DataFrame is done on an integer column. My question is: once the data is loaded, how can I check how many records were created per partition? Basically, I want to check whether data skew is happening. How can I check the record counts per partition?

Answer 1: You can for instance map over the partitions and determine their sizes: val rdd = sc
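A minimal Scala sketch of the approach the answer names (mapping over the partitions and reporting their sizes), together with an equivalent that stays in the DataFrame API; a large imbalance between the per-partition counts is what data skew would look like:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.spark_partition_id

object PartitionCounts {
  // RDD route: visit each partition with its index and emit (index, row count).
  def recordsPerPartition(df: DataFrame): Array[(Int, Int)] =
    df.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()

  // DataFrame route: group by the built-in spark_partition_id() and count.
  def recordsPerPartitionDF(df: DataFrame): DataFrame =
    df.groupBy(spark_partition_id().alias("partition_id")).count()
}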