Is there an iconv with //TRANSLIT equivalent in Java?

Question

Is there a way to transliterate characters between charsets in Java? Something similar to the Unix command (or the equivalent PHP function):

iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt  > new_doc.txt

Preferably it would operate on strings and have nothing to do with files.

I know you can change encodings with the String constructor, but that doesn't handle transliteration of characters that aren't in the resulting charset.

Answer 1:

I'm not aware of any libraries that do exactly what iconv purports to do (which doesn't seem very well defined). However, you can use "normalization" in Java to do things like remove accents from characters. This process is well defined by Unicode standards.

I think NFKD (compatibility decomposition) followed by a filtering of non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.

/* Decompose original "accented" string to basic characters. */
String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);
/* Build a new String with only ASCII characters. */
StringBuilder buf = new StringBuilder();
for (int idx = 0; idx < decomposed.length(); ++idx) {
  char ch = decomposed.charAt(idx);
  if (ch < 128)
    buf.append(ch);
}
String filtered = buf.toString();

With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely because none of them have an ASCII representation (this is more like iconv's //IGNORE).
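To make that limitation concrete, here is a minimal sketch that wraps the snippet above in a small class and runs it on three inputs. The class name NfkdFilterDemo and the method name filter are illustrative choices, not part of any library.

```java
import java.text.Normalizer;

class NfkdFilterDemo {

    /** NFKD-decomposes the input, then keeps only chars below 128 (ASCII). */
    static String filter(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder buf = new StringBuilder();
        for (int idx = 0; idx < decomposed.length(); ++idx) {
            char ch = decomposed.charAt(idx);
            if (ch < 128) {
                buf.append(ch);
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // "Michele" with e-grave: the letter decomposes to e + combining grave,
        // and only the combining mark is dropped.
        System.out.println(filter("Mich\u00e8le"));
        // "Strasse" spelled with sharp s (U+00DF): sharp s has no decomposition,
        // so the whole letter disappears.
        System.out.println(filter("Stra\u00dfe"));
        // Two CJK characters: nothing survives the filter.
        System.out.println("[" + filter("\u4e2d\u6587") + "]");
    }
}
```

The first input comes out as "Michele", the second loses a letter and becomes "Strae", and the third is reduced to an empty string, which is the //IGNORE-like behaviour described above.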

Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and things) that are safe to strip. The best solution depends on the range of input characters you expect to handle.
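A rough sketch of that lookup-table idea follows. Everything in it is an assumption made for illustration: the class name LookupTranslit, the three table entries, and the choice to write '?' for anything that cannot be handled so the loss stays visible. Only combining marks are stripped silently.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class LookupTranslit {

    // Hypothetical, deliberately tiny table; a real one depends on the expected input.
    private static final Map<Integer, String> SUBSTITUTIONS = new HashMap<>();
    static {
        SUBSTITUTIONS.put(0x00DF, "ss");   // sharp s
        SUBSTITUTIONS.put(0x00E6, "ae");   // ae ligature
        SUBSTITUTIONS.put(0x20AC, "EUR");  // euro sign
    }

    /** True for accents and similar marks that are considered safe to strip. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);

            // 1. Explicit substitutions win over everything else.
            String sub = SUBSTITUTIONS.get(cp);
            if (sub != null) {
                out.append(sub);
                continue;
            }
            // 2. Otherwise decompose canonically (NFD) and inspect the pieces.
            String piece = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFD);
            for (int j = 0; j < piece.length(); ) {
                int d = piece.codePointAt(j);
                j += Character.charCount(d);
                if (d < 128) {
                    out.appendCodePoint(d);   // plain ASCII: keep
                } else if (!isCombiningMark(d)) {
                    out.append('?');          // unknown: mark the loss
                }                             // combining mark: drop silently
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Stra\u00dfe caf\u00e9 \u20ac5 \u4e2d"));
    }
}
```

With this table the sample line prints "Strasse cafe EUR5 ?": the sharp s and the euro sign are substituted, the accent is stripped, and the CJK character is flagged rather than silently removed.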

Answer 2:

One solution is to execute iconv as an external process. It will certainly offend purists, and it depends on iconv being present on the system, but it works and does exactly what you want:

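import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public static String utfToAscii(String input) throws IOException {
    // Assumes an iconv binary that supports //TRANSLIT is on the PATH.
    Process p = new ProcessBuilder("iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT").start();
    // Encode explicitly as UTF-8; the platform default charset may differ.
    // Note: all input is written before any output is read, so very large
    // inputs could block once the pipe buffers fill up.
    try (Writer out = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
        out.write(input);
    }
    StringBuilder stringBuilder = new StringBuilder();
    String ls = System.lineSeparator();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(p.getInputStream(), StandardCharsets.US_ASCII))) {
        String line;
        while ((line = in.readLine()) != null) {
            stringBuilder.append(line);
            stringBuilder.append(ls);
        }
    }
    try {
        p.waitFor();
    } catch (InterruptedException e) {
        // Preserve the interrupt status instead of swallowing it.
        Thread.currentThread().interrupt();
    }
    return stringBuilder.toString();
}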

Answer 3:

Let's start with a slight variation of Ericson's answer and build more //TRANSLIT features on it:

Decompose chars to gain ASCII-`String`

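import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;

public class Translit {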

    private static final Charset US_ASCII = Charset.forName("US-ASCII");
    private static String toAscii(final String input) {
        final CharsetEncoder charsetEncoder = US_ASCII.newEncoder();
        final char[] decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD).toCharArray();
        final StringBuilder sb = new StringBuilder(decomposed.length);

        for (int i = 0; i < decomposed.length; ) {
            final int codePoint = Character.codePointAt(decomposed, i);
            final int charCount = Character.charCount(codePoint);

            if(charsetEncoder.canEncode(CharBuffer.wrap(decomposed, i, charCount))) {
                sb.append(decomposed, i, charCount);
            }

            i += charCount;
        }
        return sb.toString();
    }


    public static void main(String[] args) {
        final String a = "Michèleäöüß";
        System.out.println(a + " => " + toAscii(a));
        System.out.println(a.toUpperCase() + " => " + toAscii(a.toUpperCase()));
    }
}

While this should behave the same for US-ASCII, this solution is easier to adapt to different target encodings. (Because characters are decomposed first, this does not necessarily yield better results for other encodings, though.)

The function is safe for supplementary code points (a bit of overkill with ASCII as the target, but it may reduce headaches if another target encoding is chosen).

Also note that a regular Java String is returned; if you need an ASCII byte[] you still need to convert it (but since we ensured there are no offending characters, that conversion is lossless).
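As a small illustration of that last conversion step, the sketch below turns an already-filtered string into bytes and contrasts it with an unfiltered one. The class name AsciiBytesDemo and the helper toAsciiBytes are made up for this example; String.getBytes(Charset) and StandardCharsets.US_ASCII are standard JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiBytesDemo {

    /** Encodes a string as US-ASCII bytes; intended for input that is already ASCII-only. */
    static byte[] toAsciiBytes(String asciiOnly) {
        return asciiOnly.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Already filtered input: one byte per character, nothing is lost.
        System.out.println(Arrays.toString(toAsciiBytes("Michele")));
        // Unfiltered input: getBytes silently substitutes '?' (byte 63)
        // for the e-grave, which is the behaviour the question wants to avoid.
        System.out.println(Arrays.toString(toAsciiBytes("Mich\u00e8le")));
    }
}
```

Running the filter first is what makes the byte conversion safe: once every character is known to be encodable, getBytes has nothing left to replace.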

And this is how you could extend it to more character-sets:

Replace or decompose characters to gain a `String` encodable in supplied `Charset`

import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Created for http://stackoverflow.com/a/22841035/1266906
 */
public class Translit {
    public static final Charset                  US_ASCII     = Charset.forName("US-ASCII");
    public static final Charset                  ISO_8859_1   = Charset.forName("ISO-8859-1");
    public static final Charset                  UTF_8        = Charset.forName("UTF-8");
    public static final HashMap<Integer, String> REPLACEMENTS = new ReplacementBuilder().put('„', '"')
                                                                                              .put('“', '"')
                                                                                              .put('”', '"')
                                                                                              .put('″', '"')
                                                                                              .put('€', "EUR")
                                                                                              .put('ß', "ss")
                                                                                              .put('•', '*')
                                                                                              .getMap();

    private static String toCharset(final String input, Charset charset) {
        return toCharset(input, charset, Collections.<Integer, String>emptyMap());
    }

    private static String toCharset(final String input,
                                    Charset charset,
                                    Map<? super Integer, ? extends String> replacements) {
        final CharsetEncoder charsetEncoder = charset.newEncoder();
        return toCharset(input, charsetEncoder, replacements);
    }

    private static String toCharset(String input,
                                    CharsetEncoder charsetEncoder,
                                    Map<? super Integer, ? extends String> replacements) {
        char[] data = input.toCharArray();
        final StringBuilder sb = new StringBuilder(data.length);

        for (int i = 0; i < data.length; ) {
            final int codePoint = Character.codePointAt(data, i);
            final int charCount = Character.charCount(codePoint);

            CharBuffer charBuffer = CharBuffer.wrap(data, i, charCount);
            if (charsetEncoder.canEncode(charBuffer)) {
                sb.append(data, i, charCount);
            } else if (replacements.containsKey(codePoint)) {
                sb.append(toCharset(replacements.get(codePoint), charsetEncoder, replacements));
            } else {
                // Only perform NFKD normalization after confirming the original character cannot be encoded, as this is an irreversible process
                final char[] decomposed = Normalizer.normalize(charBuffer, Normalizer.Form.NFKD).toCharArray();
                for (int j = 0; j < decomposed.length; ) {
                    int decomposedCodePoint = Character.codePointAt(decomposed, j);
                    int decomposedCharCount = Character.charCount(decomposedCodePoint);

                    if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, j, decomposedCharCount))) {
                        sb.append(decomposed, j, decomposedCharCount);
                    } else if (replacements.containsKey(decomposedCodePoint)) {
                        sb.append(toCharset(replacements.get(decomposedCodePoint), charsetEncoder, replacements));
                    }

                    j += decomposedCharCount;
                }
            }

            i += charCount;
        }
        return sb.toString();
    }


    public static void main(String[] args) {
        final String a = "Michèleäöüß€„“”″•";
        System.out.println(a + " => " + toCharset(a, US_ASCII));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1));
        System.out.println(a + " => " + toCharset(a, UTF_8));

        System.out.println(a + " => " + toCharset(a, US_ASCII, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, ISO_8859_1, REPLACEMENTS));
        System.out.println(a + " => " + toCharset(a, UTF_8, REPLACEMENTS));
    }

    public static class MapBuilder<K, V> {

        private final HashMap<K, V> map;

        public MapBuilder() {
            map = new HashMap<K, V>();
        }

        public MapBuilder<K, V> put(K key, V value) {
            map.put(key, value);
            return this;
        }

        public HashMap<K, V> getMap() {
            return map;
        }
    }

    public static class ReplacementBuilder extends MapBuilder<Integer, String> {
        public ReplacementBuilder() {
            super();
        }

        @Override
        public ReplacementBuilder put(Integer input, String replacement) {
            super.put(input, replacement);
            return this;
        }

        public ReplacementBuilder put(Integer input, char replacement) {
            return this.put(input, String.valueOf(replacement));
        }

        public ReplacementBuilder put(char input, String replacement) {
            return this.put((int) input, replacement);
        }

        public ReplacementBuilder put(char input, char replacement) {
            return this.put((int) input, String.valueOf(replacement));
        }
    }
}

I would strongly recommend building an extensive replacement table, as the simple example already shows how you might otherwise lose desired information such as €. For ASCII, this implementation is of course a bit slower, as decomposition is only done on demand and the StringBuilder may now need to grow to hold the replacements.

GNU's iconv uses the replacements listed in translit.def to perform a //TRANSLIT conversion, and you can use a method like this if you want to use that file as the replacement map:

Import original `//TRANSLIT`-replacements

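import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Expects translit.def at the root of the classpath.
private static Map<Integer, String> readReplacements() {
    HashMap<Integer, String> map = new HashMap<>();
    // Data lines: hex code point, TAB, replacement, TAB, '#' followed by a comment.
    Pattern pattern = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");
    // try-with-resources closes the reader and the underlying stream.
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(
            Translit.class.getResourceAsStream("/translit.def"), UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Skip blank lines (charAt(0) would throw) and comment lines.
            if (!line.isEmpty() && line.charAt(0) != '#') {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return map;
}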
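
To see what the pattern in readReplacements() extracts, here is a small self-contained sketch that applies the same regular expression to a single line. The sample line is made up in the shape the pattern expects (hex code point, tab, replacement, tab, comment) and is not copied from the real translit.def; the class name TranslitLineDemo and the method parse are likewise illustrative.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TranslitLineDemo {

    // Same pattern as in readReplacements().
    private static final Pattern LINE =
            Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");

    /** Returns (code point, replacement) for a data line, or null if the line does not match. */
    static Map.Entry<Integer, String> parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        // Group 1 is the hexadecimal code point, group 2 the replacement text.
        int codePoint = Integer.parseInt(matcher.group(1), 16);
        return new AbstractMap.SimpleEntry<>(codePoint, matcher.group(2));
    }

    public static void main(String[] args) {
        // Hypothetical sample line, assumed to resemble the file's format.
        Map.Entry<Integer, String> entry = parse("00DF\tss\t# LATIN SMALL LETTER SHARP S");
        System.out.println(entry.getKey() + " -> " + entry.getValue());
        // Comment lines do not match the pattern at all.
        System.out.println(parse("# just a comment"));
    }
}
```

The resulting map can be passed as the replacements argument of toCharset in place of the hand-built REPLACEMENTS table.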

Source: https://stackoverflow.com/questions/5806690/is-there-an-iconv-with-translit-equivalent-in-java

Tags

java

iconv