Question
How can I read all Parquet files stored in HDFS using the Apache Beam 2.13.0 Python SDK with the direct runner, when the directory structure is as follows:
data/
├── a
│   ├── file_1.parquet
│   └── file_2.parquet
└── b
    ├── file_3.parquet
    └── file_4.parquet
I tried beam.io.ReadFromParquet with the glob pattern hdfs://data/*/*:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

HDFS_HOSTNAME = 'my-hadoop-master-node.com'
HDFS_PORT = 50070
HDFS_USER = "my-user-name"

# Point Beam's HadoopFileSystem at the cluster.
pipeline_options = PipelineOptions(
    hdfs_host=HDFS_HOSTNAME, hdfs_port=HDFS_PORT, hdfs_user=HDFS_USER)

# Glob meant to match every file one level below data/.
input_file_hdfs_parquet = "hdfs://data/*/*"

p = beam.Pipeline(options=pipeline_options)
lines = p | 'ReadMyFile' >> beam.io.ReadFromParquet(input_file_hdfs_parquet)
_ = p.run()
I'm running into the following error:
IOError Traceback (most recent call last)
...
IOError: No files found based on the file pattern hdfs://data/*/*
Using input_file_hdfs_parquet = "hdfs://data/a/*", I'm able to read all files within the a directory.
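
For reference, a minimal sketch of a possible workaround, assuming the first-level subdirectory names (a and b here) are known or can be listed ahead of time: apply one ReadFromParquet per subdirectory using the single-level glob that is confirmed to work, then merge the results with beam.Flatten. The subdirs list and the Read_<dir>/MergeDirs labels are illustrative, not part of the original question:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(
    hdfs_host='my-hadoop-master-node.com',  # same placeholder values as above
    hdfs_port=50070,
    hdfs_user='my-user-name')

subdirs = ['a', 'b']  # assumed to be known up front

with beam.Pipeline(options=pipeline_options) as p:
    # One read per subdirectory, each using a single-level glob
    # of the form hdfs://data/<dir>/*.
    per_dir = [
        p | 'Read_%s' % d >> beam.io.ReadFromParquet('hdfs://data/%s/*' % d)
        for d in subdirs
    ]
    # Merge the per-directory PCollections into a single PCollection.
    records = tuple(per_dir) | 'MergeDirs' >> beam.Flatten()

If the subdirectories are not known ahead of time, they could be listed at pipeline-construction time, for example with the hdfs client library that Beam's HadoopFileSystem already builds on.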
Source: https://stackoverflow.com/questions/56834077/apache-beam-read-parquet-files-from-nested-hdfs-directories