PySpark: read, map and reduce from a multiline record text file with newAPIHadoopFile


Personally I would:

  • extend the record delimiter with ::

    sheet = sc.newAPIHadoopFile(
        path,
        'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'org.apache.hadoop.io.Text',
        conf={'textinputformat.record.delimiter': 'Time\tMHist::'}
    )
    
  • drop keys:

    values = sheet.values()
    
  • filter out empty entries:

    non_empty = values.filter(lambda x: x)
    
  • split each record into lines:

    grouped_lines = non_empty.map(str.splitlines)
    
  • separate keys and values (see the short itemgetter demonstration after this list):

    from operator import itemgetter
    
    pairs = grouped_lines.map(itemgetter(0, slice(1, None)))
    
  • and finally split values:

    pairs.flatMapValues(lambda xs: [x.split("\t") for x in xs])
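
A quick aside on the itemgetter step above: when itemgetter is given several arguments it returns a tuple of lookups, and a slice object is a valid argument, so itemgetter(0, slice(1, None)) behaves like lambda xs: (xs[0], xs[1:]). A minimal local check (plain Python, no Spark required):

    from operator import itemgetter

    split_head = itemgetter(0, slice(1, None))

    # equivalent to: lambda xs: (xs[0], xs[1:])
    split_head(["channel", "a\t1", "b\t2"])
    # -> ('channel', ['a\t1', 'b\t2'])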
    

All of that can, of course, be done with a single function:

    import dateutil.parser

    def process(pair):
        # pair is (byte offset, record text) as produced by newAPIHadoopFile
        _, content = pair
        clean = [x.strip() for x in content.strip().splitlines()]
        if not clean:
            return  # empty record: nothing to emit
        # the first line is the key (channel name), the rest are data lines
        k, vs = clean[0], clean[1:]
        for v in vs:
            try:
                ds, x = v.split("\t")
                yield k, (dateutil.parser.parse(ds), float(x))  # or int(x)
            except ValueError:
                # skip lines that don't split into exactly two fields
                # or whose fields fail to parse
                pass

    sheet.flatMap(process)
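
To sanity-check process locally before running it on a cluster, you can feed it a single fabricated pair. The sample record below is only an assumption about the layout implied by the delimiter above (channel name on the first line, then tab-separated timestamp/value lines):

    # hypothetical record shaped like the input described above
    sample = (0, "CHANNEL_A\n2019-01-01 00:00:00\t1.5\n2019-01-01 00:01:00\t2.5\n")

    list(process(sample))
    # [('CHANNEL_A', (datetime.datetime(2019, 1, 1, 0, 0), 1.5)),
    #  ('CHANNEL_A', (datetime.datetime(2019, 1, 1, 0, 1), 2.5))]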