Is it possible to save files in Hadoop without saving them in local file system?

Asked 2021-01-15 17:30

Is it possible to save files in Hadoop without saving them in the local file system? I would like to do something like the snippet shown below, but save the file directly in

3 Answers
  • 2021-01-15 17:46

    Here's how to download a file directly to HDFS with Pydoop:

    import os
    import requests
    import pydoop.hdfs as hdfs
    
    
    def dl_to_hdfs(url, hdfs_path):
        # Stream the download so the whole file never has to fit in memory
        r = requests.get(url, stream=True)
        r.raise_for_status()
        # Write each chunk straight to HDFS as it arrives
        with hdfs.open(hdfs_path, 'w') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
    
    
    URL = "https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tar.xz"
    dl_to_hdfs(URL, os.path.basename(URL))
    

    The above snippet works for a generic URL. If you already have the file as a Django UploadedFile, you can probably use its .chunks() method to iterate through the data.

  • 2021-01-15 18:01

    Hadoop has REST APIs that allow you to create files via WebHDFS.

    So you could implement the create operation yourself on top of the REST API, using a Python library such as requests for the HTTP calls. However, there are also several Python libraries that support Hadoop/HDFS and already use the REST APIs, or that use the RPC mechanism via libhdfs.

    • pydoop
    • hadoopy
    • snakebite
    • pywebhdfs
    • hdfscli
    • pyarrow

    Just make sure you look for how to create a file, rather than having the Python library shell out to hdfs dfs -put or hadoop fs -put.
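    For reference, WebHDFS file creation is a two-step protocol: you PUT `op=CREATE` to the NameNode, which answers with a 307 redirect to a DataNode, and you then PUT the actual bytes to that redirect location. A minimal sketch with requests is below; the host name, user name, and port 9870 (the Hadoop 3 default; Hadoop 2 uses 50070) are placeholder assumptions for your cluster.

```python
import requests


def webhdfs_create_url(host, port, path, user):
    # Step-1 URL sent to the NameNode; it replies with a 307 redirect
    # pointing at the DataNode that will receive the data.
    return (f"http://{host}:{port}/webhdfs/v1{path}"
            f"?op=CREATE&user.name={user}&overwrite=true")


def webhdfs_put(host, port, path, user, data):
    # Step 1: ask the NameNode where to write (don't follow the redirect).
    r = requests.put(webhdfs_create_url(host, port, path, user),
                     allow_redirects=False)
    r.raise_for_status()
    # Step 2: send the bytes to the DataNode named in the Location header.
    r2 = requests.put(r.headers["Location"], data=data)
    r2.raise_for_status()
```

    This is essentially what libraries like pywebhdfs and hdfscli do for you, with authentication and error handling on top.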

    See the following for more information:

    • pydoop vs hadoopy - hadoop python client
    • List all files in HDFS Python without pydoop
    • A Guide to Python Frameworks for Hadoop
    • Native Hadoop file system (HDFS) connectivity in Python
    • PyArrow
    • https://github.com/pywebhdfs/pywebhdfs
    • https://github.com/spotify/snakebite
    • https://crs4.github.io/pydoop/api_docs/hdfs_api.html
    • https://hdfscli.readthedocs.io/en/latest/
    • WebHDFS REST API:Create and Write to a File
  • 2021-01-15 18:07

    Python runs on your local Linux machine and, on its own, can only read and write local files; it cannot directly access files in HDFS.

    In order to save/put files directly to HDFS, you need to use one of the options below:

    • Spark: use a DStream for streaming files

    • Kafka: a matter of setting up a configuration file. Best for streaming data.

    • Flume: set up a configuration file. Best for static files.
