Encoding of filenames containing non-latin characters while extracting from .tar.gz packed by Ant tar task

Question

I'm building a tar.gz archive using Ant:

<tar destfile="${linux86.zip.file}" compression="gzip" longfile="gnu">
    <tarfileset dir="${work.dir}/data" dirmode="755" filemode="755"  
                prefix="${app.folder}/data"/>
</tar>

The archive is built on Windows. After it is extracted on Ubuntu 12, files whose names contain non-Latin (for example, Cyrillic) characters have broken names.

Is there any way to fix or work around that?

Answer 1:

No. The classic tar format does not record an encoding for filenames, so only ASCII names are portable. See this question: Creating tar archive with national characters in Java. I think you need another format or tool with a more modern design.

Note that the zip task has an encoding attribute; maybe that format will work?
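
To illustrate the point about zip, here is a minimal sketch in Python rather than Ant (the entry name and contents are made up for the example). It only shows that the zip format can mark an entry name as UTF-8, which Python's zipfile does for non-ASCII names; whether Ant's zip task with its encoding attribute produces the same result is not verified here.

```python
import io
import zipfile

# Hypothetical entry name containing Cyrillic characters.
entry_name = 'app/data/данные.txt'

# Write a zip archive into memory; zipfile stores non-ASCII names as UTF-8
# and sets the "language encoding" flag (bit 11, 0x800) on the entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr(entry_name, b'hello')

# Read it back: the name survives because the flag tells the reader
# which encoding was used.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    flag_bits = zf.infolist()[0].flag_bits

print(names, bool(flag_bits & 0x800))
```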

Answer 2:

I found a solution there (HUGE thanks to Jarekczek), but the decoded names did not come out right for me. I fixed the script as follows:

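#!/usr/bin/env python3

# Huge thanks to https://superuser.com/questions/60379/how-can-i-create-a-zip-tgz-in-linux-such-that-windows-has-proper-filenames#190786
# and http://stackoverflow.com/questions/12456560/encoding-of-filenames-containing-non-latin-characters-while-extracting-from-tar
import tarfile
import codecs
import sys

def recover(name):
    # Python 3's tarfile has already decoded the name as UTF-8 with
    # surrogateescape. Turn it back into the raw bytes stored in the
    # archive, then decode those bytes as cp1251 (Windows Cyrillic).
    raw = name.encode('utf-8', 'surrogateescape')
    return codecs.decode(raw, 'cp1251')

for tar_filename in sys.argv[1:]:
    # Pin the encoding explicitly so recover() can undo it on any platform.
    with tarfile.open(name=tar_filename, mode='r', bufsize=16*1024,
                      encoding='utf-8', errors='surrogateescape') as tar:
        updated = []
        for m in tar.getmembers():
            m.name = recover(m.name)
            updated.append(m)
        tar.extractall(members=updated)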

What I did was decode the names from the Windows encoding (cp1251) to Unicode using codecs from Python's standard library, and add a command-line interface that takes the name(s) of the archives.
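
The decoding step can be tried without an archive. The following is a minimal sketch assuming Python 3 and assuming the names were stored as cp1251 bytes; the helper recover_name and the sample filename are hypothetical and only serve the illustration.

```python
import codecs

def recover_name(name, source_encoding='cp1251'):
    # Undo a UTF-8/surrogateescape decode to get the raw bytes back,
    # then decode them with the encoding that was really used.
    raw = name.encode('utf-8', 'surrogateescape')
    return codecs.decode(raw, source_encoding)

# Hypothetical filename as it existed on the Windows machine.
original = 'данные.txt'

# Bytes as they would be stored if the name was written in cp1251.
stored = original.encode('cp1251')

# What a UTF-8 reader makes of those bytes: the Cyrillic bytes are not
# valid UTF-8, so the name comes out broken.
misread = stored.decode('utf-8', 'surrogateescape')

# Applying the fix restores the original name.
fixed = recover_name(misread)
print(fixed == original)
```

If the archive was built on a machine with a different Windows code page, source_encoding would have to be changed to match.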

Answer 3:

I found some interesting information on Ant's developer mailing list (30 Jun 2009, 01 Jul 2009) and in ASF Bugzilla (36851, 53811). The problem is old and well known; it has not been fixed, mainly for ideological reasons, because not all untar implementations support non-ASCII entry names.

The patch mentioned in the Bugzilla issue was applied in revision 1350857. There is now a constructor that takes the name of the encoding to use for tar entry names:

public TarOutputStream(OutputStream os, String encoding) { ... }

However, it is never used in the Tar task. So I added an encoding attribute to the Tar task, rebuilt Ant from the modified sources, and used UTF-8 as the encoding for entry names.

Extraction was tested under Ubuntu 11/12 and Mandriva.
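
The effect of writing entry names in UTF-8 can be sketched outside Ant. The example below uses Python's tarfile, not the patched Tar task, and the helpers build_tar_gz and list_names as well as the entry name are hypothetical. It assumes GNU format to mirror longfile="gnu" from the question and shows that a reader expecting UTF-8 gets the names back intact.

```python
import io
import tarfile

def build_tar_gz(files, encoding='utf-8'):
    # Build a gzip-compressed GNU-format tar in memory, writing entry
    # names in the given encoding.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz',
                      format=tarfile.GNU_FORMAT, encoding=encoding) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mode = 0o755  # same mode as filemode="755" in the question
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_names(archive_bytes, encoding='utf-8'):
    # Read the archive back, decoding entry names with the given encoding.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:gz',
                      encoding=encoding) as tar:
        return tar.getnames()

archive = build_tar_gz({'app/data/данные.txt': b'hello'})
names = list_names(archive)
print(names)
```

Reading the same archive with a different encoding, for example list_names(archive, 'cp1251'), yields garbled names, which is the situation described in the question with the roles of writer and reader swapped.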

Source: https://stackoverflow.com/questions/12456560/encoding-of-filenames-containing-non-latin-characters-while-extracting-from-tar

Tags

Linux

ant

encoding

tar