Accessing Google Cloud Storage using the Hadoop FileSystem API

天涯浪人 2021-01-22 19:54

From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run <

1 answer
  •  闹比i
    闹比i (original poster)
    2021-01-22 19:55

    As to your first question, whether this is "expected" is debatable, but I think I can at least explain it. FileSystem.get(conf) returns the default FileSystem, i.e. the one named by fs.defaultFS in the configuration, and on a typical Hadoop cluster that is HDFS. My guess is that the HDFS client (DistributedFileSystem) has code that automatically prepends scheme + authority to every path in the filesystem.

    Instead of using FileSystem.get(conf), try

    FileSystem gcsFs = new Path("gs://mybucket/").getFileSystem(conf);
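
To make the difference between the two lookups concrete, here is a minimal sketch of the selection rule, written against java.net.URI only so that it runs without Hadoop on the classpath. It is a simplified model, not Hadoop's implementation: the class name, the method names, and the hdfs://namenode:8020/ default are all hypothetical. In Hadoop itself, the "use the path's own scheme" branch corresponds to Path.getFileSystem(conf), and the fallback branch corresponds to FileSystem.get(conf).

```java
import java.net.URI;

// Simplified model of scheme-based filesystem selection (NOT Hadoop code).
class SchemeResolutionSketch {

    // Returns the scheme that would serve `path`: the path's own scheme
    // if it has one, otherwise the scheme of the default filesystem.
    static String schemeFor(URI defaultFs, String path) {
        URI uri = URI.create(path);
        return uri.getScheme() != null ? uri.getScheme() : defaultFs.getScheme();
    }

    // Fully qualifies `path` by prepending the default scheme + authority
    // when the path does not already carry a scheme.
    static URI qualify(URI defaultFs, String path) {
        URI uri = URI.create(path);
        return uri.getScheme() != null ? uri : defaultFs.resolve(uri);
    }

    public static void main(String[] args) {
        // Hypothetical default filesystem, standing in for fs.defaultFS.
        URI defaultFs = URI.create("hdfs://namenode:8020/");

        // No scheme on the path: the default filesystem wins.
        System.out.println(schemeFor(defaultFs, "/user/data/part-00000"));      // hdfs
        // Explicit gs:// scheme: the default is ignored.
        System.out.println(schemeFor(defaultFs, "gs://mybucket/dir/file.txt")); // gs
        // Qualifying a bare path adds scheme + authority from the default.
        System.out.println(qualify(defaultFs, "/user/data/part-00000"));
        // prints hdfs://namenode:8020/user/data/part-00000
    }
}
```

The point of the sketch is only that a bare path is resolved against whatever the default filesystem is, while a path that starts with gs:// selects the GCS connector regardless of the default.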
    

    As for disadvantages, I could argue that if you end up needing to access the object store directly, you will be writing code against the storage APIs anyway, and some things do not translate well to the Hadoop FS API (e.g., object composition, or complex object write preconditions beyond simple overwrite protection).

    I am admittedly biased (I work on the team), but if you intend to use GCS from Hadoop MapReduce, Spark, etc., the GCS connector for Hadoop should be a fairly safe bet.
