I've got two word lists, for example:
list 1    list 2
foot      fuut
barj      kijo
foio      fuau
fuim      fuami
kwim      kwami
lnun      lnun
kizm      kazm
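My final solution was to use the Moses decoder (mosesdecoder). I split every word into single characters, used the two character-split files as a parallel corpus, trained a model on it, and then counted the most frequent differing pairs in the extraction output. I compared Sursilvan and Vallader.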
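To make the character-splitting step concrete, here is a minimal sketch of what the two sed substitutions in the script do to a single line. The helper name `split_chars` is hypothetical and not part of the original script, and the second input line is just two words from list 1 joined by a space to show how `@` marks a real space.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): applies the same two
# substitutions as the script, reading lines from stdin.
split_chars() {
    # 1) mark real spaces with @, 2) append a space to every character
    sed -e 's/ /@/g' -e 's/./& /g'
}

# Each character becomes its own "word"; note the trailing space per line.
printf '%s\n' 'foot' 'foio fuim' | split_chars
```

After this step Moses sees each character as a token, so the word alignment it learns is effectively a character alignment.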
export IRSTLM=$HOME/rumantsch/mosesdecoder/tools/irstlm
export PATH=$PATH:$IRSTLM/bin
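# Start from a clean state.
rm -rf corpus giza.* model
array=("sur" "val")
for i in "${array[@]}"; do
    cp "raw.$i" "splitted.$i"
    # Mark real spaces with @, then put a space after every character.
    sed -i 's/ /@/g' "splitted.$i"
    sed -i 's/./& /g' "splitted.$i"
    # Build a character-level language model and convert it to ARPA text format.
    add-start-end.sh < "splitted.$i" > "compiled.$i"
    build-lm.sh -i "compiled.$i" -t ./tmp -p -o "compiled.lm.$i"
    compile-lm --text yes "compiled.lm.$i.gz" "compiled.arpa.$i"
done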
../scripts/training/train-model.perl --first-step 1 --last-step 5 \
    -root-dir . -corpus splitted -f sur -e val \
    -lm "0:3:$PWD/compiled.arpa.sur" \
    -extract-options "--SentenceId" -external-bin-dir ../tools/bin/
"$HOME/rumantsch/mosesdecoder/scripts/../bin/extract" \
    "$HOME/rumantsch/mosesdecoder/rumantsch/splitted.val" \
    "$HOME/rumantsch/mosesdecoder/rumantsch/splitted.sur" \
    "$HOME/rumantsch/mosesdecoder/rumantsch/model/aligned.grow-diag-final" \
    "$HOME/rumantsch/mosesdecoder/rumantsch/model/extract" 7 --SentenceId --GZOutput
zcat model/extract.sid.gz \
    | awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' \
    | sort | uniq -c | sort -nr | head -n 10 > results
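The last command keeps only the pairs whose source and target sides differ and ranks them by frequency. The sketch below runs the same awk/sort/uniq chain on a few made-up lines, so the effect is visible without a trained model. The wrapper name `count_changes` and the sample lines are assumptions for illustration; the real file may carry more fields after the second `|||`, which the filter ignores.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the filter-and-count chain from the script.
# Reads "source ||| target ||| ..." lines on stdin.
count_changes() {
    # Split on " ||| ", keep pairs where the two sides differ,
    # then count identical pairs and list the most frequent first.
    awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' \
        | sort | uniq -c | sort -nr | head -n 10
}

# Made-up sample input, NOT real Moses output.
sample='u ||| u ||| 0-0
o ||| u ||| 0-0
o o ||| u u ||| 0-0 1-1
o ||| u ||| 0-0
i ||| a ||| 0-0'

printf '%s\n' "$sample" | count_changes
```

With this sample, the identical pair `u ||| u` is dropped and the top line reports `o | u` with a count of 2; the order of the remaining count-1 pairs depends on the locale's sort order.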