spark-dataframe

Spark 2.0.0: SparkR CSV Import

左心房为你撑大大i submitted on 2021-01-27 06:44:00

Question: I am trying to read a CSV file into SparkR (running Spark 2.0.0) and to experiment with the newly added features. I am using RStudio here, and I am getting an error while reading the source file. My code:

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", appName = "SparkR")
df <- loadDF("F:/file.csv", "csv", header = "true")

I get an error at the loadDF function. The
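For reference, a minimal Scala sketch of the same CSV load through the DataFrameReader API underlying SparkR's loadDF; the path and the header option mirror the question and are only illustrative, and whether a file:/// URI is needed depends on the environment:

import org.apache.spark.sql.SparkSession

object CsvImportSketch {
  def main(args: Array[String]): Unit = {
    // Spark 2.0+ session, mirroring the sparkR.session() call in the question
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkR-equivalent")
      .getOrCreate()

    // Built-in csv source of the DataFrameReader (Spark 2.0+)
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("F:/file.csv") // illustrative path; "file:///F:/file.csv" may be needed on Windows

    df.printSchema()
    spark.stop()
  }
}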

Alternatives for Spark Dataframe's count() API

情到浓时终转凉″ submitted on 2020-12-12 11:39:28

Question: I'm using Spark with the Java connector to process my data. One of the essential operations I need is to count the number of records (rows) in a DataFrame. I tried df.count(), but the execution time is extremely slow (30-40 seconds for 2-3 million records). Also, due to the system's requirements, I don't want to use the df.rdd().countApprox() API because we need the exact count. Could somebody suggest any alternatives that return exactly the same result as df
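A suggestion that usually comes up: if the DataFrame is scanned more than once, pay the scan cost a single time by caching before counting, or fold the count into a regular aggregation; both return the exact number. A minimal Scala sketch under that assumption (method names are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object CountSketch {
  // Cache first, so the expensive scan is paid once and the count (and any
  // later actions) run against the cached data. Still an exact count.
  def cachedCount(df: DataFrame): Long = {
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()
  }

  // Express the count as an empty groupBy aggregation; this also returns the
  // exact count and can be combined with other aggregates in the same pass.
  def aggregatedCount(df: DataFrame): Long =
    df.groupBy().count().first().getLong(0)
}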

Spark DataFrame serialized as invalid json

一笑奈何 submitted on 2020-12-06 05:52:51

Question: TL;DR: When I dump a Spark DataFrame as JSON, I always end up with something like

{"key1": "v11", "key2": "v21"}
{"key1": "v12", "key2": "v22"}
{"key1": "v13", "key2": "v23"}

which is not a valid JSON document. I can manually edit the dumped file to get something I can parse:

[
{"key1": "v11", "key2": "v21"},
{"key1": "v12", "key2": "v22"},
{"key1": "v13", "key2": "v23"}
]

but I'm pretty sure I'm missing something that would let me avoid this manual edit; I just don't know what. More details: I have a
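For context, Spark's JSON writer emits JSON Lines (one object per line) by design, so this output is expected rather than a bug. A minimal Scala sketch of one common workaround, collecting the per-row JSON strings on the driver and wrapping them in brackets; this is only viable when the result is small enough to collect, and the output path is illustrative:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.DataFrame

object JsonArrayDump {
  // Turn each row into a JSON string, bring them to the driver, and join them
  // into a single top-level JSON array written as one file.
  def writeAsJsonArray(df: DataFrame, path: String): Unit = {
    val jsonArray = df.toJSON.collect().mkString("[\n", ",\n", "\n]")
    Files.write(Paths.get(path), jsonArray.getBytes(StandardCharsets.UTF_8))
  }
}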

Need to Know Partitioning Details in Dataframe Spark

喜夏-厌秋 submitted on 2020-12-06 04:37:49

Question: I am trying to read from a DB2 database based on a query. The result set of the query is about 20-40 million records. The partitioning of the DataFrame is done on an integer column. My question is: once the data is loaded, how can I check how many records were created per partition? Basically, I want to check whether data skew is happening. How can I check the record counts per partition?

Answer 1: You can for instance map over the partitions and determine their sizes: val rdd = sc
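A minimal Scala sketch of the approach the answer names (mapping over the partitions and reporting their sizes), together with an equivalent that stays in the DataFrame API; a large imbalance between the per-partition counts is what data skew would look like:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.spark_partition_id

object PartitionCounts {
  // RDD route: visit each partition with its index and emit (index, row count).
  def recordsPerPartition(df: DataFrame): Array[(Int, Int)] =
    df.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()

  // DataFrame route: group by the built-in spark_partition_id() and count.
  def recordsPerPartitionDF(df: DataFrame): DataFrame =
    df.groupBy(spark_partition_id().alias("partition_id")).count()
}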