How to do CopyMerge in Hadoop 3.0?

半阙折子戏 2020-12-29 09:05

I know Hadoop version 2.7's FileUtil has the copyMerge function that merges multiple files into a new one.

But it appears to have been removed in Hadoop 3.0. Is there another way to do this?

4 answers
  • 2020-12-29 09:33

    I had the same question and had to re-implement copyMerge myself (in PySpark, though using the same API calls as the original copyMerge).

    I have no idea why there is no equivalent functionality in Hadoop 3. We have to merge files from an HDFS directory into a single HDFS file very often.

    Here's the PySpark implementation I referenced above: https://github.com/Tagar/stuff/blob/master/copyMerge.py
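    The core merge logic itself is simple. As a sketch of the same semantics in plain Python (local filesystem only, no Hadoop dependency; the function name and behaviour mirror copyMerge, but a real HDFS version has to go through the Hadoop FileSystem API as the linked gist does):

```python
import os
import shutil

def copy_merge(src_dir, dst_file, delete_source=False):
    """Concatenate every regular file in src_dir into dst_file,
    in lexicographic order of file names (mirroring copyMerge)."""
    if os.path.exists(dst_file):
        raise IOError("Target %s already exists" % dst_file)
    with open(dst_file, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):  # skip subdirectories, as copyMerge did
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, out)
    if delete_source:
        shutil.rmtree(src_dir)
    return True
```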

  • 2020-12-29 09:42

    As FileUtil.copyMerge() has been deprecated and removed from the API starting with version 3, a simple solution is to re-implement it ourselves.

    Here is the original Java implementation from previous versions.

    Here is a Scala rewrite:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils
    import java.io.IOException
    
    def copyMerge(
        srcFS: FileSystem, srcDir: Path,
        dstFS: FileSystem, dstFile: Path,
        deleteSource: Boolean, conf: Configuration
    ): Boolean = {
    
      if (dstFS.exists(dstFile))
        throw new IOException(s"Target $dstFile already exists")
    
      // Source path is expected to be a directory:
      if (!srcFS.getFileStatus(srcDir).isDirectory) return false
    
      val outputFile = dstFS.create(dstFile)
      try {
        // Merge in lexicographic order of file names, as the original did;
        // errors are propagated instead of silently swallowed:
        srcFS
          .listStatus(srcDir)
          .sortBy(_.getPath.getName)
          .filter(_.isFile)
          .foreach { status =>
            val inputFile = srcFS.open(status.getPath)
            try IOUtils.copyBytes(inputFile, outputFile, conf, false)
            finally inputFile.close()
          }
      } finally {
        outputFile.close()
      }
    
      if (deleteSource) srcFS.delete(srcDir, true) else true
    }
    
  • 2020-12-29 09:46

    The FileUtil#copyMerge method has been removed. See these issues for details on the change:

    https://issues.apache.org/jira/browse/HADOOP-12967

    https://issues.apache.org/jira/browse/HADOOP-11392

    You can use getmerge instead.

    Usage: hadoop fs -getmerge [-nl] <src> <localdst>

    Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.

    Examples:

    hadoop fs -getmerge -nl /src /opt/output.txt
    hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt
    

    Exit Code: Returns 0 on success and non-zero on error.

    https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge
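    The -nl and -skip-empty-file behaviour is easy to mis-remember, so here is a small local-filesystem sketch of what getmerge does with those flags (plain Python, no Hadoop needed; the function name and signature are illustrative only):

```python
import os

def getmerge(src_dir, local_dst, nl=False, skip_empty_file=False):
    """Concatenate files in src_dir into local_dst; with nl=True append
    a newline after each file, optionally skipping empty files so that
    they do not contribute a stray newline."""
    with open(local_dst, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue
            data = open(path, "rb").read()
            if skip_empty_file and not data:
                continue  # empty file: no content and no newline
            out.write(data)
            if nl:
                out.write(b"\n")
```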

  • 2020-12-29 09:46

    This should work; it is essentially the pre-3.0 FileUtil implementation, with the private checkDest helper included.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    
    /** Copy all files in a directory to one output file (merge). */
    public static boolean copyMerge(FileSystem srcFS, Path srcDir,
                                    FileSystem dstFS, Path dstFile,
                                    boolean deleteSource,
                                    Configuration conf, String addString) throws IOException {
        dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);
    
        if (!srcFS.getFileStatus(srcDir).isDirectory())
            return false;
    
        OutputStream out = dstFS.create(dstFile);
    
        try {
            FileStatus[] contents = srcFS.listStatus(srcDir);
            Arrays.sort(contents);
            for (FileStatus content : contents) {
                if (content.isFile()) {
                    InputStream in = srcFS.open(content.getPath());
                    try {
                        IOUtils.copyBytes(in, out, conf, false);
                        // Optionally write a separator (e.g. "\n") after each file:
                        if (addString != null)
                            out.write(addString.getBytes(StandardCharsets.UTF_8));
                    } finally {
                        in.close();
                    }
                }
            }
        } finally {
            out.close();
        }
    
        if (deleteSource) {
            return srcFS.delete(srcDir, true);
        } else {
            return true;
        }
    }
    
    private static Path checkDest(String srcName, FileSystem dstFS, Path dst,
                                  boolean overwrite) throws IOException {
        if (dstFS.exists(dst)) {
            FileStatus sdst = dstFS.getFileStatus(dst);
            if (sdst.isDirectory()) {
                if (null == srcName) {
                    throw new IOException("Target " + dst + " is a directory");
                }
                // Merge into a file named after the source dir inside the target dir:
                return checkDest(null, dstFS, new Path(dst, srcName), overwrite);
            } else if (!overwrite) {
                throw new IOException("Target " + dst + " already exists");
            }
        }
        return dst;
    }
    