Batch convert latin-1 files to utf-8 using iconv

Posted by 夙愿已清 on 2019-11-27 04:11:07

Question


I have a PHP project on OS X that is encoded in Latin-1, and I need to convert its files to UTF-8. I'm not much of a shell coder, so I tried something I found on the internet:

mkdir new  
for a in `ls -R *`; do iconv -f iso-8859-1 -t utf-8 <"$a" >new/"$a" ; done

But that does not create the directory structure, and it gives me a heck of a lot of errors when run. Can anyone come up with a neat solution?


Answer 1:


You shouldn't use ls like that and a for loop is not appropriate either. Also, the destination directory should be outside the source directory.

mkdir /path/to/destination
find . -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o /path/to/destination/"{}" \;

No need for a loop. The -type f option includes files and excludes directories.

Edit:

The OS X version of iconv doesn't have the -o option. Try this:

find . -type f -exec bash -c 'iconv -f iso-8859-1 -t utf-8 "$1" > /path/to/destination/"$1"' _ {} \;



Answer 2:


Some good answers, but I found this a lot easier in my case with a nested directory of hundreds of files to convert:

WARNING: This will write the files in place, so make a backup

$ vim $(find . -type f)

# in vim, go into command mode (:)
:set nomore
:bufdo set fileencoding=utf8 | w



Answer 3:


This converts all files with the .php filename extension - in the current directory and its subdirectories - preserving the directory structure:

    find . -name "*.php" -exec sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.utf8"' sh {} \; -exec mv "{}".utf8 "{}" \;

Notes:

To get a list of files that will be targeted beforehand, just run the command without the -exec flags (like this: find . -name "*.php"). Making a backup is a good idea.

Using sh like this allows piping and redirecting with -exec, which is necessary because not all versions of iconv support the -o flag.

Adding .utf8 to the filename of the output and then removing it might seem strange but it is necessary. Using the same name for output and input files can cause the following problems:

  • For large files (around 30 KB in my experience) it causes a core dump (or termination by signal 7).

  • Some versions of iconv seem to create the output-file before they read the input file, which means that if the input and output files have the same name, the input file is overwritten with an empty file before it is read.
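
The temporary-file approach described above can also be wrapped in a small shell function. This is only a sketch, not part of the original answer: the function name to_utf8_in_place is made up for illustration, and it assumes every input file really is ISO-8859-1. It writes the converted text to a temporary file first and copies it back over the original only if iconv succeeded, so a failed conversion never truncates the source.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): convert one file from
# ISO-8859-1 to UTF-8 without ever using the same file as input and output.
to_utf8_in_place() {
    src=$1
    # Create the temporary file next to the source file.
    tmp=$(mktemp "${src}.XXXXXX") || return 1
    # Copy the result back only if iconv succeeded. Writing with cat
    # (instead of mv) keeps the original file's permissions.
    if iconv -f ISO-8859-1 -t UTF-8 "$src" > "$tmp" && cat "$tmp" > "$src"; then
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Inside the same script it could be called once per file, for example to_utf8_in_place "some dir/file.php". As with the one-liner, make a backup first.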




Answer 4:


I needed to convert a complete directory tree recursively from iso-8859-1 to utf-8, including the creation of subdirectories. None of the short solutions above worked for me, because the directory structure was not created in the target. Based on Dennis Williamson's answer, I came up with the following solution:

find . -type f -exec bash -c 't="/tmp/dest"; mkdir -p "$t/$(dirname "$1")"; iconv -f iso-8859-1 -t utf-8 "$1" > "$t/$1"' _ {} \;

It creates a clone of the current directory subtree in /tmp/dest (adjust to your needs), including all subdirectories, with all iso-8859-1 files converted to utf-8. Tested on Mac OS X.

By the way, you can check your file encodings with:

file -I file.php

to get the encoding information. (On Linux, the equivalent option is the lowercase -i.)

Hope this helps.




Answer 5:


I created the following script, which (i) checks the encoding of every tex file in the current directory, (ii) backs up each file it is about to convert into the directory "converted", and (iii) converts to UTF-8 only the tex files that are in the ISO-8859-1 encoding.

for f in *.tex
do
  filename="${f%.*}"
  printf '%s' "$f"
  # file -I prints the MIME type and charset (file -i on Linux)
  if file -I "$f" | grep -wq "iso-8859-1"
  then
    # keep a backup of the original before converting
    mkdir -p converted
    cp "$f" ./converted
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "${filename}_utf8.tex"
    mv "${filename}_utf8.tex" "$f"
    echo ": CONVERTED TO UTF-8."
  else
    echo ": UTF-8 ALREADY."
  fi
done



Answer 6:


If all the files you have to convert are .php you could use the following, which is recursive by default:

for a in $(find . -name "*.php"); do iconv -f iso-8859-1 -t utf-8 <"$a" >new/"$a" ; done

I believe your errors were due to the fact that ls -R also produces an output that might not be recognized by iconv as a valid filename, something like ./my/dir/structure:




Answer 7:


On unix.stackexchange.com a similar question was asked, and user manatwork suggested recode which does the trick very nicely.

I've been using it to convert UCS-2 to UTF-8 in place:

recode ucs-2..utf-8 *.txt



Answer 8:


Use mkdir -p "${a%/*}"; before iconv.

Note that you are using a potentially dangerous for construct when there are spaces in filenames, see http://porkmail.org/era/unix/award.html.
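
To make both points concrete, here is a rough sketch that combines the mkdir -p "${a%/*}" idea with a loop that does not split file names on whitespace. It is not from the original answer: the function name convert_tree and its two arguments are invented for illustration, and it assumes all matching files are ISO-8859-1 and that the destination lies outside the source tree. Instead of looping over the output of find, it lets find hand the file names to an inner shell as arguments.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are illustrative): convert every
# *.php file under src_dir from ISO-8859-1 to UTF-8, writing the results
# into a mirrored directory tree under dest_dir.
convert_tree() {
    src_dir=$1
    dest_dir=$2
    # Resolve dest_dir to an absolute path before changing directory.
    mkdir -p "$dest_dir" || return 1
    dest_abs=$(cd "$dest_dir" && pwd) || return 1
    (
        cd "$src_dir" || exit 1
        # find passes the names directly as arguments to the inner shell,
        # so spaces in file or directory names are preserved.
        find . -type f -name '*.php' -exec sh -c '
            dest=$1; shift
            for a in "$@"; do
                out=$dest/${a#./}
                mkdir -p "${out%/*}"    # create the target subdirectory
                iconv -f iso-8859-1 -t utf-8 < "$a" > "$out"
            done
        ' sh "$dest_abs" {} +
    )
}
```

A call might look like convert_tree . /path/to/destination, run from the project root.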




Answer 9:


The answers above are fine, but if this is a "mixed" project, i.e. some files are already UTF-8, converting everything blindly will cause trouble. My solution therefore checks each file's encoding first.

#!/bin/bash
# file name: to_utf8

# current encoding:
encoding=$(file -i "$1" | sed "s/.*charset=\(.*\)$/\1/")

if [  "${encoding}" = "iso-8859-1" ] || [ "${encoding}" = "iso-8859-2" ]; 
then
echo "recoding from ${encoding} to UTF-8 file : $1"
recode "${encoding}..UTF-8" "$1"
fi

#example:
#find . -name "*.php" -exec to_utf8 {} \;
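
If neither file nor recode is at hand, a similar check can be sketched with iconv alone. This is an assumption-laden alternative rather than the method above: the function names is_valid_utf8 and convert_if_needed are made up, it relies on iconv exiting with a non-zero status on invalid input (as GNU iconv does), and it treats anything that is not valid UTF-8 as ISO-8859-1. It is a heuristic, since a Latin-1 file can occasionally happen to be valid UTF-8 as well.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
# Pure ASCII files also pass, because ASCII is a subset of UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: convert the file from ISO-8859-1 to UTF-8 only
# when it is not already valid UTF-8, so that UTF-8 files are not
# encoded a second time.
convert_if_needed() {
    if is_valid_utf8 "$1"; then
        echo "skipped (already valid UTF-8): $1"
        return 0
    fi
    tmp=$(mktemp "$1.XXXXXX") || return 1
    if iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmp" && cat "$tmp" > "$1"; then
        rm -f "$tmp"
        echo "converted: $1"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Within the same script, convert_if_needed path/to/file.php can then be called for each file of a mixed project.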



Answer 10:


Using the answers of Dennis Williamson and Alberto Zaccagni, I came up with the following script, which converts all files of the specified type in all subdirectories. The output is collected in a single folder, /path/to/destination, so files that share a name in different subdirectories will overwrite one another.

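mkdir -p /path/to/destination
# requires bash (read -d); -print0 keeps file names with spaces intact
find . -name "*.php" -print0 | while IFS= read -r -d '' a
do
        filename=$(basename "$a")
        echo "$filename"
        iconv -f iso-8859-1 -t utf-8 <"$a" >"/path/to/destination/$filename"
done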

The basename command returns the file name without its directory path.

Alternative (interactive): I also wrote an interactive script that lets you decide whether to overwrite the old files or keep the converted copies under a new name. Additional thanks go to tbsalling.

# requires bash (read -d); -print0 keeps file names with spaces intact
find . -name "*.tex" -print0 | while IFS= read -r -d '' a
do
        iconv -f iso-8859-1 -t utf-8 <"$a" >"$a".utf8
done
echo "Should the original files be replaced (Y/N)?"
read -r replace
if [ "$replace" = "Y" ]; then
    find . -name "*.tex.utf8" -print0 | while IFS= read -r -d '' a
    do
        # strip the .utf8 suffix, overwriting the original .tex file
        mv "$a" "${a%.utf8}"
    done
    echo "Original files have been replaced."
else
    echo "Original files have been converted and converted files were saved with suffix '.utf8'"
fi

Have fun with this and I would be grateful for any comments to improve it, thanks!




Answer 11:


find . -iname "*.php" | xargs -I {} echo "iconv -f ISO-8859-1 -t UTF-8 \"{}\" > \"{}-utf8.php\""


Source: https://stackoverflow.com/questions/4544669/batch-convert-latin-1-files-to-utf-8-using-iconv
