cuDF error processing a large number of parquet files


I have 2000 parquet files in a directory. Each parquet file is roughly 20MB in size. The compression used is SNAPPY. Each parquet file has rows that look like the following:

1 Answer

    cuDF is a single GPU library. 2000 files of 20 MB each would be about 40 GB of data, which is more than you can fit in the memory of a single V100 GPU.
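
    To sanity-check that sizing before reaching for multiple GPUs, you can sum the on-disk file sizes yourself. The snippet below is a minimal sketch; 'parquet_dir' is a placeholder for your directory, and note that SNAPPY-compressed parquet usually expands when decompressed into GPU memory, so the on-disk total is only a lower bound.

    import glob
    import os

    parquet_dir = 'parquet_dir'  # placeholder for the directory holding the 2000 files
    files = glob.glob(os.path.join(parquet_dir, '*.parquet'))
    total_bytes = sum(os.path.getsize(f) for f in files)
    print(f'{len(files)} files, {total_bytes / 1e9:.1f} GB on disk')  # ~2000 * 20 MB ≈ 40 GB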

    For workflows that require more than a single GPU, cuDF relies on Dask. The following example illustrates how you could use cuDF + Dask to read data into distributed GPU memory with multiple GPUs in a single node. This doesn't answer your debugging question, but it should hopefully solve your problem.

    First, I use a few lines of code to create a Dask cluster of two GPUs.

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    import dask_cudf
    
    cluster = LocalCUDACluster()  # by default, this uses all GPUs in the node (I have two)
    client = Client(cluster)
    client
    # The print output of client:
    # 
    # Client
    # Scheduler: tcp://127.0.0.1:44764
    # Dashboard: http://127.0.0.1:8787/status
    
    # Cluster
    # Workers: 2
    # Cores: 2
    # Memory: 404.27 GB
    
    
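    If you want the cluster to use only some of the GPUs in the node rather than all of them, dask_cuda's LocalCUDACluster takes a CUDA_VISIBLE_DEVICES argument. The snippet below is only a sketch of that option; check the dask_cuda release you have installed for the exact behavior.

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Restrict the cluster to GPUs 0 and 1 (assumes this keyword is available
    # in your dask_cuda version).
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES='0,1')
    client = Client(cluster)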

    Next I'll create a couple of parquet files for this example.

    import os
    
    import cudf
    from cudf.datasets import randomdata
    
    if not os.path.exists('example_output'):
        os.mkdir('example_output')
    
    for x in range(2):
        df = randomdata(nrows=10000,
                        dtypes={'a': int, 'b': str, 'c': str, 'd': int},
                        seed=12)
        # write each DataFrame to its own file so the loop doesn't overwrite the previous one
        df.to_parquet(f'example_output/df_{x}.parquet')
    

    Let's look at the memory on each of my GPUs with nvidia-smi.

    nvidia-smi
    Thu Sep 26 19:13:46 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
    | N/A   51C    P0    29W /  70W |   6836MiB / 15079MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
    | N/A   47C    P0    28W /  70W |   5750MiB / 15079MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    

    Notice the two memory-usage values: 6836 MiB on GPU 0 and 5750 MiB on GPU 1 (I happen to have unrelated data already in memory on these GPUs). Now let's read our entire directory of two parquet files with Dask cuDF and then persist it. Persisting forces computation; Dask execution is lazy, so calling read_parquet only adds tasks to the task graph. ddf is a Dask DataFrame.

    ddf = dask_cudf.read_parquet('example_output/*.parquet')
    ddf = ddf.persist()
    

    Now let's look at nvidia-smi again.

    Thu Sep 26 19:13:52 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
    | N/A   51C    P0    29W /  70W |   6938MiB / 15079MiB |      2%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
    | N/A   47C    P0    28W /  70W |   5852MiB / 15079MiB |      2%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    

    Dask handles distributing our data across both GPUs for us.
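
    As a quick check that the partitions really ended up on both workers, and to show what running a computation on the distributed DataFrame looks like, here is a short sketch using the client and ddf objects from above (column 'a' comes from the random data generated earlier):

    # Each worker (one per GPU) should report holding some of ddf's keys
    # (one key per persisted partition).
    for worker, keys in client.has_what().items():
        print(worker, len(keys), 'keys')

    # Operations on ddf are lazy until you call .compute() (or .persist()).
    mean_a = ddf['a'].mean()   # builds a task graph; nothing runs yet
    print(mean_a.compute())    # executes across both GPUs and returns a single value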
