View RDD contents in Python Spark?

后端未结

关注

 6  903

Running a simple app in pyspark.

f = sc.textFile(\"README.md\")
wc = f.flatMap(lambda x: x.split(\' \')).map(lambda x: (x, 1)).reduceByKey(add)

相关标签:

6条回答

遥遥无期

2020-11-29 04:19
If you want to see the contents of RDD then yes collect is one option, but it fetches all the data to driver so there can be a problem
```
<rdd.name>.take(<num of elements you want to fetch>)
```
Better if you want to see just a sample

Running foreach and trying to print, I dont recommend this because if you are running this on cluster then the print logs would be local to the executor and it would print for the data accessible to that executor. print statement is not changing the state hence it is not logically wrong. To get all the logs you will have to do something like
```
**Pseudocode**
collect
foreach print
```
But this may result in job failure as collecting all the data on driver may crash it. I would suggest using take command or if u want to analyze it then use sample collect on driver or write to file and then analyze it.
0 讨论(0)
发布评论:

提交评论
- 加载中...
孤独总比滥情好

2020-11-29 04:20

By latest document, you can use rdd.collect().foreach(println) on the driver to display all, but it may cause memory issues on the driver, best is to use rdd.take(desired_number)

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).

0 讨论(0)
发布评论:

提交评论
- 加载中...
无人及你

2020-11-29 04:22
Try this:
```
data = f.flatMap(lambda x: x.split(' '))
map = data.map(lambda x: (x, 1))
mapreduce = map.reduceByKey(lambda x,y: x+y)
result = mapreduce.collect()
```
Please note that when you run collect(), the RDD - which is a distributed data set is aggregated at the driver node and is essentially converted to a list. So obviously, it won't be a good idea to collect() a 2T data set. If all you need is a couple of samples from your RDD, use take(10).
0 讨论(0)
发布评论:

提交评论
- 加载中...
轻奢々

2020-11-29 04:25
In Spark 2.0 (I didn't tested with earlier versions). Simply:
```
print myRDD.take(n)
```
Where n is the number of lines and myRDD is wc in your case.
0 讨论(0)
发布评论:

提交评论
- 加载中...
一个人的身影

2020-11-29 04:26
You can simply collect the entire RDD (which will return a list of rows) and print said list:
```
print(wc.collect())
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
刺人心

2020-11-29 04:32
This error is because print isn't a function in Python 2.6.

You can either define a helper UDF that performs the print, or use the __future__ library to treat print as a function:
```
>>> from operator import add
>>> f = sc.textFile("README.md")
>>> def g(x):
...     print x
...
>>> wc.foreach(g)
```
or
```
>>> from __future__ import print_function
>>> wc.foreach(print)
```
However, I think it would be better to use collect() to bring the RDD contents back to the driver, because foreach executes on the worker nodes and the outputs may not necessarily appear in your driver / shell (it probably will in local mode, but not when running on a cluster).
```
>>> for x in wc.collect():
...     print x
```
0 讨论(0)
发布评论:

提交评论
- 加载中...