How to skip more than one line of header in an RDD in Spark


攒了一身酷 2021-01-14 16:15

Data in my first RDD is like

Now the first 3 integers are some counters that I need to bro

3 answers

悲&欢浪女 (original poster)

2021-01-14 16:27

Import for Python 2 (makes `print` a function, as in Python 3):
```
from __future__ import print_function
```

Prepare dummy data:

```
s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
with open("file.txt", "w") as fw:
    fw.write(s)
```

Read raw input:
```
raw = sc.textFile("file.txt")
```

Extract header:

```
header = raw.take(3)
print(header)
## [u'1253', u'545553', u'12344896']
```

Filter lines:

using zipWithIndex

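```
# Pair each line with its index, keep indexes past the 3 header lines,
# then drop the index again.
content = raw.zipWithIndex().filter(lambda kv: kv[1] > 2).keys()
print(content.first())
## 1 2 1
```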
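
To see what the `zipWithIndex` filter does without a Spark context, here is a minimal plain-Python sketch. It is only an analogy: `enumerate` stands in for `zipWithIndex`, and the list `lines` plays the role of the RDD, so none of this is Spark API.

```python
# Plain-Python stand-in for raw.zipWithIndex().filter(...).keys().
# `lines` plays the role of the RDD; no Spark is involved here.
s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
lines = s.split("\n")

# zipWithIndex yields (element, index) pairs; enumerate yields (index, element),
# so the order is swapped here to mirror the RDD layout.
indexed = [(line, idx) for idx, line in enumerate(lines)]

# Keep only pairs whose index is past the 3 header lines, then drop the index.
content = [kv[0] for kv in indexed if kv[1] > 2]
print(content[0])
```

The index is global across the whole data set, so this variant does not depend on which partition the header lines fall into.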

using mapPartitionsWithIndex

```
from itertools import islice

# Drop the first 3 lines of partition 0 only; other partitions pass through.
content = raw.mapPartitionsWithIndex(
    lambda i, it: islice(it, 3, None) if i == 0 else it)

print(content.first())
## 1 2 1
```
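
The per-partition logic can be sketched in plain Python as well. In the snippet below, `map_partitions_with_index` and `skip_header` are hypothetical helpers written for illustration (not Spark functions), and the way the lines are split into two partitions is an assumption. The sketch also shows the condition this approach relies on: all header lines must sit in partition 0.

```python
from itertools import islice

def map_partitions_with_index(partitions, f):
    """Hypothetical helper: call f(index, iterator) per partition, flatten results."""
    for i, part in enumerate(partitions):
        for item in f(i, iter(part)):
            yield item

# Assumed layout: the 3 header lines all land in the first partition.
partitions = [
    ["1253", "545553", "12344896", "1 2 1"],
    ["1 43 2", "1 46 1", "1 53 2"],
]

def skip_header(i, it):
    # Skip the first 3 items of partition 0; leave other partitions untouched.
    return islice(it, 3, None) if i == 0 else it

content = list(map_partitions_with_index(partitions, skip_header))
print(content[0])
```

Compared with the `zipWithIndex` variant, this avoids building an index for every line, at the cost of assuming the header never spills into a second partition.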

NOTE: All credit goes to pzecevic and Sean Owen (see linked sources).
