Is there an API function to display “Fraction Cached” for an RDD?
Question: On the Storage tab of the PySparkShell application UI ([server]:8088) I can see information about an RDD I am using. One of the columns is "Fraction Cached". How can I retrieve this percentage programmatically? I can use getStorageLevel() to get some information about RDD caching, but not "Fraction Cached". Do I have to calculate it myself?

Answer 1: SparkContext.getRDDStorageInfo is probably what you're looking for. It returns an Array of RDDInfo objects, which provide information such as memory size.
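The "Fraction Cached" shown in the UI is the ratio of cached partitions to total partitions, both of which RDDInfo exposes (as numCachedPartitions and numPartitions on the Scala side). PySpark has no public wrapper for getRDDStorageInfo, so the commented usage below goes through the internal JVM gateway (sc._jsc), which is an assumption and not a stable API; the helper function itself is just the arithmetic:

```python
def fraction_cached(num_cached_partitions, num_partitions):
    """Fraction of an RDD's partitions currently cached (0.0 if empty)."""
    if num_partitions == 0:
        return 0.0
    return num_cached_partitions / num_partitions

# Hypothetical PySpark usage via the internal JVM gateway (unstable API):
# for info in sc._jsc.sc().getRDDStorageInfo():
#     print(info.name(),
#           fraction_cached(info.numCachedPartitions(), info.numPartitions()))

print(fraction_cached(3, 4))  # 0.75
```

In Scala, the same computation works directly on the objects returned by sc.getRDDStorageInfo, with no gateway workaround needed.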