Question
How can I read all Parquet files stored in HDFS using the Apache Beam 2.13.0 Python SDK with the direct runner, when the directory structure is as follows:
data/
├── a
│   ├── file_1.parquet
│   └── file_2.parquet
└── b
    ├── file_3.parquet
    └── file_4.parquet
I tried beam.io.ReadFromParquet with the glob pattern hdfs://data/*/*:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

HDFS_HOSTNAME = 'my-hadoop-master-node.com'
HDFS_PORT = 50070
HDFS_USER = "my-user-name"

# Point Beam's HadoopFileSystem at the cluster.
pipeline_options = PipelineOptions(
    hdfs_host=HDFS_HOSTNAME, hdfs_port=HDFS_PORT, hdfs_user=HDFS_USER)

# Glob meant to match every file one level below data/.
input_file_hdfs_parquet = "hdfs://data/*/*"

p = beam.Pipeline(options=pipeline_options)
lines = p | 'ReadMyFile' >> beam.io.ReadFromParquet(input_file_hdfs_parquet)
_ = p.run()
I'm running into the following error:
IOError Traceback (most recent call last)
...
IOError: No files found based on the file pattern hdfs://data/*/*
Using input_file_hdfs_parquet = "hdfs://data/a/*", I'm able to read all files within the a directory.
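
For reference, a minimal sketch of a possible workaround, assuming the first-level subdirectory names (a and b here) are known or can be listed ahead of time: apply one ReadFromParquet per subdirectory using the single-level glob that is confirmed to work, then merge the results with beam.Flatten. The subdirs list and the Read_<dir>/MergeDirs labels are illustrative, not part of the original question:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(
    hdfs_host='my-hadoop-master-node.com',  # same placeholder values as above
    hdfs_port=50070,
    hdfs_user='my-user-name')

subdirs = ['a', 'b']  # assumed to be known up front

with beam.Pipeline(options=pipeline_options) as p:
    # One read per subdirectory, each using a single-level glob
    # of the form hdfs://data/<dir>/*.
    per_dir = [
        p | 'Read_%s' % d >> beam.io.ReadFromParquet('hdfs://data/%s/*' % d)
        for d in subdirs
    ]
    # Merge the per-directory PCollections into a single PCollection.
    records = tuple(per_dir) | 'MergeDirs' >> beam.Flatten()

If the subdirectories are not known ahead of time, they could be listed at pipeline-construction time, for example with the hdfs client library that Beam's HadoopFileSystem already builds on.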
Source: https://stackoverflow.com/questions/56834077/apache-beam-read-parquet-files-from-nested-hdfs-directories