Spark reading Python 3 pickle as input

失恋的感觉 2021-01-11 17:11

My data are available as sets of Python 3 pickle files. Most of them are serialized Pandas DataFrames.

I'd like to start using Spark because I n…

1 Answer
  • 2021-01-11 17:48

    A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:

    import tempfile
    import pandas as pd
    import numpy as np
    
    # Directory that will hold the pickled frames
    outdir = tempfile.mkdtemp()
    
    # Write five small random DataFrames, each pickled to its own file
    for i in range(5):
        pd.DataFrame(
            np.random.randn(10, 2), columns=['foo', 'bar']
        ).to_pickle(tempfile.mkstemp(dir=outdir)[1])
    

    Next we can read it using the binaryFiles method:

    # binaryFiles returns an RDD of (path, content) pairs, one per file
    rdd = sc.binaryFiles(outdir)
    

    and deserialize individual objects:

    import pickle
    from io import BytesIO
    
    # Drop the paths and unpickle the raw bytes of each file
    dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
    dfs.first()[:3]
    
    ##         foo       bar
    ## 0 -0.162584 -2.179106
    ## 1  0.269399 -0.433037
    ## 2 -0.295244  0.119195
    

    One important note is that this typically requires significantly more memory than simpler methods like textFile, since each file is loaded into memory as a single record.

    Another approach is to parallelize only the paths and use libraries which can read directly from a distributed file system, such as hdfs3. This typically means lower memory requirements at the price of significantly worse data locality.
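
    A minimal sketch of that approach, assuming the pickles live on HDFS and hdfs3 is installed on every worker; the namenode address and file paths below are placeholders:

    import pickle
    
    def load_pickle(path):
        # Imported inside the function so it resolves on the workers
        from hdfs3 import HDFileSystem
        hdfs = HDFileSystem(host='namenode', port=8020)  # placeholder address
        with hdfs.open(path, 'rb') as f:
            return pickle.load(f)
    
    # Placeholder paths; in practice, list them from the file system
    paths = ['/data/df_0.pkl', '/data/df_1.pkl']
    dfs = sc.parallelize(paths).map(load_pickle)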

    Considering these two facts, it is typically better to serialize your data in a format that can be loaded with a finer granularity.
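
    For example, a minimal sketch of that idea, assuming pyarrow (or fastparquet) is installed so pandas can write Parquet, and that a SparkSession named spark is available; Spark then reads the directory natively and in parallel:

    import tempfile
    import pandas as pd
    import numpy as np
    
    parquet_dir = tempfile.mkdtemp()
    
    # Write each frame as a Parquet file instead of a pickle
    for i in range(5):
        pd.DataFrame(
            np.random.randn(10, 2), columns=['foo', 'bar']
        ).to_parquet('{}/part-{}.parquet'.format(parquet_dir, i))
    
    # The whole directory becomes a regular, splittable Spark DataFrame
    spark.read.parquet(parquet_dir).show(3)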

    Note:

    SparkContext provides a pickleFile method, but the name can be misleading. It reads SequenceFiles containing pickled objects, not plain Python pickle files.
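
    A short illustration of what pickleFile actually pairs with (the output path below reuses the dummy directory from above):

    # saveAsPickleFile writes a SequenceFile of batched, pickled objects;
    # pickleFile reads that format back -- not standalone .pkl files
    seq_dir = outdir + '/seq'
    sc.parallelize(range(10)).saveAsPickleFile(seq_dir)
    sc.pickleFile(seq_dir).collect()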
