Transliteration script for linux shell

邮差的信 提交于 2019-12-07 04:38:00

问题


I have multiple .txt files containing text in an alphabet; I want to transliterate the text into an other alphabet; some characters of alphabet1 are 1:1 with those of alphabet2 (i.e. a becomes e), whereas others are 1:2 (i.e. x becomes ch).

I would like to do this using a simple script for the Linux shell.

With tr or sed I can convert 1:1 characters:

sed -f y/abcdefghijklmnopqrstuvwxyz/nopqrstuvwxyzabcdefghijklm/

a will become n, b will become o et cetera (a Caesar's cipher, I think)

But how can I deal with 1:2 characters?


回答1:


Not an answer, just to show a briefer, idiomatic way to populate the table[] array from @konsolebox's answer as discussed in the related comments:

BEGIN {
    split("a  e b", old)
    split("x ch o", new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

so the mapping of old to new chars is clearly shown in that the char in the first split() is mapped to the char(s) below it and for any other mapping you want you just need to change the string(s) in the split(), not change 26-ish explicit assignments to table[].

You can even create a general script to do mappings and just pass in the old and new strings as variables:

BEGIN {
    split(o, old)
    split(n, new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

then in shell anything like this:

old="a  e b"
new="x ch o"
awk -v o="$old" -v b="$new" -f script.awk file

and you can protect yourself from your own mistakes populating the strings, e.g.:

BEGIN {
    numOld = split(o, old)
    numNew = split(n, new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&1"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        table[old[i]] = new[i]
    }
}

Wouldn't it be good to know if you wrote that b maps to x and then later mistakenly wrote that b maps to y? The above really is the best way to do this but your call of course.

Here's one complete solution as discussed in the comments below

BEGIN {
    numOld = split("a  e b", old)
    numNew = split("x ch o", new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&1"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        map[old[i]] = new[i]
    }

    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}

I renamed the table array as map just because iMHO that better represents the purpose of the array.

save the above in a file script.awk and run it as awk -f script.awk inputfile




回答2:


Using Awk:

#!/usr/bin/awk -f
BEGIN {
    FS = OFS = ""
    table["a"] = "e"
    table["x"] = "ch"
    # and so on...
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in table) {
            $i = table[$i]
        }
    }
}
1

Usage:

awk -f script.awk file

Test:

# echo "the quick brown fox jumps over the lazy dog" | awk -f script.awk
the quick brown foch jumps over the lezy dog



回答3:


This can be done quite concisely using a Perl one-liner:

perl -pe '%h=(a=>"xy",c=>"z"); s/(.)/defined $h{$1} ? $h{$1} : $1/eg'

or equivalently (thanks jaypal):

perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg'

%h is a hash containing the characters (keys) and their substitutions (values). s is the substitution command (as in sed). The g modifier means that the substitution is global and the e means that the replacement part is evaluated as an expression. It captures each character one by one and substitutes them with the value in the hash if it exists, otherwise keeps the original value. The -p switch means that each line in the input is automatically printed.

Testing it out:

$ perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg' <<<"abc"
xybz



回答4:


Using sed.

Write a file transliterate.sed containing:

s/a/e/g
s/x/ch/g

and then run from your command line to get the transliterated output.txt from input.txt:

sed -f transliterate.sed input.txt > output.txt

If you need this more often consider adding #!/bin/sed -f as first line and making your file executable with chmod 744 transliterate.sed as described at the Wikipedia page for sed.



来源:https://stackoverflow.com/questions/25338470/transliteration-script-for-linux-shell

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!