Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. Is there a way to write to HDFS directly from an external client?

8 Answers
  • 2020-12-13 11:06

    For the problem of loading the data I needed into HDFS, I chose to turn the problem around.

    Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper reads the file from the file server (in this case over HTTPS) and writes it directly to HDFS (via the Java API).

    The list of files is read from the input. An external script populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), and then starts the map/reduce job with a decent number of mappers.

    This gives me excellent transfer performance, since multiple files are read/written at the same time.
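
    A minimal sketch of such a mapper, assuming each input line is an HTTPS URL and using the standard FileSystem API (the class name and the /ingest target directory are illustrative, not from the original answer):

        import java.io.IOException;
        import java.io.InputStream;
        import java.net.URL;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Each input line is one URL to fetch; the mapper streams the response
        // body straight into an HDFS file, so nothing touches local disk.
        public class FetchToHdfsMapper
                extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String url = value.toString().trim();
                Configuration conf = context.getConfiguration();
                FileSystem fs = FileSystem.get(conf);

                // Hypothetical layout: keep just the file name, under /ingest.
                Path target = new Path("/ingest", url.substring(url.lastIndexOf('/') + 1));

                InputStream in = new URL(url).openStream();
                FSDataOutputStream out = fs.create(target);
                try {
                    IOUtils.copyBytes(in, out, 4096, false);
                } finally {
                    IOUtils.closeStream(in);
                    IOUtils.closeStream(out);
                }
            }
        }

    Pairing this with something like NLineInputFormat, so each map task gets only a handful of URLs, is what spreads the transfer across the cluster.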

    Maybe not the answer you were looking for, but hopefully helpful anyway :-).

  • 2020-12-13 11:11

    You can now also try Talend, which includes components for Hadoop integration.

  • 2020-12-13 11:18

    About 2 years after my last answer, there are now two new alternatives: Hoop/HttpFS and WebHDFS.

    Hoop was first announced on Cloudera's blog and can be downloaded from a GitHub repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1, and it can probably talk to slightly older versions as well.

    If you're running Hadoop 0.23.1 (which at the time of writing is still unreleased), Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can act as a proxy not only to HDFS but also to other Hadoop-compatible filesystems such as Amazon S3.

    Hoop/HttpFS runs as its own standalone service.

    There's also WebHDFS, which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0, and one of its major features is data locality: when you make a read request, you are redirected to the WebHDFS component on the datanode where the data resides.
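
    For illustration, here is the documented two-step WebHDFS write: the first PUT to the NameNode is answered with a redirect naming a DataNode, and the second PUT carries the bytes. The host, the default 50070 NameNode HTTP port, the path, and the user name below are placeholders for your cluster:

        import java.io.OutputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;

        // Two-step WebHDFS CREATE: ask the NameNode where to write, then PUT
        // the data to the DataNode it redirects us to.
        public class WebHdfsPut {
            public static void main(String[] args) throws Exception {
                String createUrl = "http://namenode:50070/webhdfs/v1/tmp/hello.txt"
                        + "?op=CREATE&user.name=hdfs&overwrite=true";

                // Step 1: don't follow the redirect automatically; we need the
                // Location header, which names the DataNode to write to.
                HttpURLConnection nn = (HttpURLConnection) new URL(createUrl).openConnection();
                nn.setRequestMethod("PUT");
                nn.setInstanceFollowRedirects(false);
                String dataNodeUrl = nn.getHeaderField("Location");
                nn.disconnect();

                // Step 2: PUT the actual file contents to the DataNode.
                HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
                dn.setRequestMethod("PUT");
                dn.setDoOutput(true);
                OutputStream out = dn.getOutputStream();
                out.write("hello from a windows box\n".getBytes("UTF-8"));
                out.close();
                System.out.println("DataNode response: " + dn.getResponseCode()); // expect 201
                dn.disconnect();
            }
        }

    Since the HttpFS API is compatible, the same requests should work against a Hoop/HttpFS proxy, just pointed at its port instead of the NameNode's.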

    Which component to choose depends a bit on your current setup and on your needs. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the GitHub repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that, unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limiting, etc.).

  • 2020-12-13 11:20

    There is now a dedicated wiki page for this at http://wiki.apache.org/hadoop/MountableHDFS:

    These projects (enumerated below) allow HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of hdfs using standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard Posix libraries like open, write, read, close from C, C++, Python, Ruby, Perl, Java, bash, etc.

    Later it describes these projects:

    • contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
    • fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
    • hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
    • webdav - hdfs exposed as a webdav resource
    • mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
    • HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.

    I haven't tried any of these yet, but I will update the answer soon, as I have the same need as the OP.
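
    If one of these mounts is in place (say fuse-dfs mounted at /mnt/hdfs, a hypothetical mount point), writing really is just ordinary file I/O; a minimal sketch:

        import java.io.FileWriter;
        import java.io.IOException;

        // With HDFS FUSE-mounted at /mnt/hdfs (hypothetical mount point),
        // plain file I/O goes through the mount and lands in the cluster.
        public class MountedHdfsWrite {
            public static void main(String[] args) throws IOException {
                FileWriter writer = new FileWriter("/mnt/hdfs/tmp/hello.txt");
                writer.write("written through the FUSE mount\n");
                writer.close();
            }
        }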

  • 2020-12-13 11:21

    You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.

  • 2020-12-13 11:22

    You can try mounting HDFS on the machine where you are executing your code (call it machine_X); machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS
