Best way to convert text files between character sets?

再見小時候 2020-11-22 04:42

What is the fastest, easiest tool or method to convert text files between character sets?

Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.

21 answers
  • 2020-11-22 05:01

    Stand-alone utility approach

    iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
    
    -f ENCODING  the encoding of the input
    -t ENCODING  the encoding of the output
    

    Neither option is required. An omitted encoding defaults to that of your current locale, which is usually UTF-8.
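
    For the conversion asked about in the question (UTF-8 to ISO-8859-15 and back), the same idea of reading in one encoding and writing in another can be sketched in Python. This is a minimal sketch under stated assumptions, not a replacement for iconv: the function name recode_file and its errors parameter are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def recode_file(src, dst, from_enc, to_enc, errors="strict"):
    """Read src as from_enc and write the same text to dst as to_enc.

    With errors="strict", a character that cannot be represented in
    to_enc raises UnicodeEncodeError instead of being silently dropped.
    Bytes are read and written directly so line endings stay untouched.
    """
    text = Path(src).read_bytes().decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc, errors=errors))
```

    Calling recode_file("in.txt", "out.txt", "utf-8", "iso-8859-15") would convert in one direction, and swapping the two encoding names converts back. Strict error handling is the default here so that characters missing from ISO-8859-15 are reported rather than lost.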

  • 2020-11-22 05:01

    Oneliner using find, with automatic character set detection

    The character encoding of each matching text file is detected automatically, and the file is then converted to UTF-8:

    $ find . -type f -iname '*.txt' -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
    

    To perform these steps, -exec starts a subshell (sh) that runs the one-liner given to the -c flag. In -- {}, the -- fills $0 and the filename substituted for {} becomes the positional argument "$1". The UTF-8 output is written to a temporary file named converted, which then replaces the original file.

    The options in file -bi mean:

    • -b, --brief Do not prepend filenames to output lines (brief mode).

    • -i, --mime Causes the file command to output MIME type strings rather than the more traditional human-readable ones. For example, it may say text/plain; charset=us-ascii rather than ASCII text. The sed command reduces this to just us-ascii, which is the form iconv requires.
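
    What the sed expression does to the output of file -bi can be sketched in Python. This only illustrates the string handling; the function name charset_from_mime is hypothetical, and the sample input follows the text/plain; charset=us-ascii format quoted above.

```python
import re


def charset_from_mime(mime_line):
    """Return the charset part of a 'file -bi' style line, or None.

    Mirrors sed -e "s/.*[ ]charset=//": everything up to and including
    " charset=" is dropped and the remainder is kept. Unlike sed, which
    leaves a non-matching line unchanged, this returns None in that case.
    """
    match = re.match(r".*[ ]charset=(.*)", mime_line.strip())
    return match.group(1) if match else None
```

    For data it cannot classify, file may report values such as binary or unknown-8bit, which are not encodings iconv can convert from, so a more careful script would probably skip those files.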

    The find command is very useful for this kind of file management automation.

  • 2020-11-22 05:02

    Try Notepad++

    On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".

  • 2020-11-22 05:04

    Try VIM

    If you have vim you can use the command below. It has not been tested with every encoding.

    The nice part is that you don't have to know the source encoding, because vim tries to detect it when it opens the file:

    vim +"set nobomb | set fenc=utf8 | x" filename.txt
    

    Be aware that this command modifies the file in place.


    Explanation part!

    1. + : Tells vim to run a command right after opening the file. It is usually used to open a file at a specific line: vim +14 file.txt
    2. | : Separates multiple commands (like ; in bash)
    3. set nobomb : Do not write a UTF-8 BOM
    4. set fenc=utf8 : Set the file encoding to UTF-8
    5. x : Save and close the file
    6. filename.txt : Path to the file
    7. " : The quotes are needed because of the | characters (otherwise bash would treat them as shell pipes)
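
    The combined effect of set nobomb and set fenc=utf8 can be illustrated with a small Python sketch. Unlike vim, this sketch does not detect the source encoding, so the caller has to pass it in. The function name to_utf8_no_bom is hypothetical.

```python
def to_utf8_no_bom(data, source_encoding):
    """Decode bytes from source_encoding and re-encode as UTF-8 without a BOM."""
    text = data.decode(source_encoding)
    # Once decoded, a byte order mark shows up as a leading U+FEFF; drop it.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")
```

    The function works on bytes rather than on a path, so writing the result back over the original file (as vim does with x) is left to the caller.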
  • 2020-11-22 05:04

    In PowerShell:

    function Recode($InCharset, $InFile, $OutCharset, $OutFile)  {
        # Read input file in the source encoding
        $Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
        $Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
        
        # Write output file in the destination encoding
        $Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)    
        [System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
    }
    
    Recode Windows-1252 "$pwd\in.txt" utf-8 "$pwd\out.txt"
    

    For a list of supported encoding names:

    https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding
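
    Encoding names differ slightly between tools, so it can help to check a name before relying on it. As a sketch in Python (not .NET, whose names are listed at the link above), codecs.lookup from the standard library resolves an alias or raises LookupError. The helper name is_known_encoding is hypothetical.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry recognises this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True
```

    A name accepted here is not guaranteed to be accepted by .NET or iconv, so this is only a sanity check for the Python side.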

  • 2020-11-22 05:05

    DOS/Windows: use the code page

    chcp 65001>NUL
    type ascii.txt > unicode.txt
    

    The chcp command changes the active code page. Code page 65001 is Microsoft's identifier for UTF-8. Once the code page is set, the output generated by the commands that follow uses that code page.
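
    One detail worth keeping in mind with this example: a file that contains only ASCII characters is already valid UTF-8, byte for byte, so a conversion only changes bytes when non-ASCII characters are present. A short Python check shows the difference. It is illustrative only, and the helper name same_bytes_in_utf8 is hypothetical.

```python
def same_bytes_in_utf8(text, legacy_encoding):
    """Return True if text encodes to identical bytes in legacy_encoding and UTF-8."""
    return text.encode(legacy_encoding) == text.encode("utf-8")
```

    Pure ASCII text passes this check, while a string such as "café" encoded as cp1252 does not, which is the case where the choice of code page actually matters.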
