I\'ve got two word lists, an example:
list 1 list 2
foot fuut
barj kijo
foio fuau
fuim fuami
kwim kwami
lnun lnun
kizm kazm
My final solution is to use the mosesdecoder. I split the words into single characters and used them as parallel corpus and used the extracted model. I compared Sursilvan and Vallader.
export IRSTLM=$HOME/rumantsch/mosesdecoder/tools/irstlm
export PATH=$PATH:$IRSTLM/bin
rm -rf corpus giza.* model
array=("sur" "val")
for i in "${array[@]}"; do
cp "raw.$i" "splitted.$i"
sed -i 's/ /@/g' "splitted.$i"
sed -i 's/./& /g' "splitted.$i"
add-start-end.sh < "splitted.$i" > "compiled.$i"
build-lm.sh -i "compiled.$i" -t ./tmp -p -o "compiled.lm.$i"
compile-lm --text yes "compiled.lm.$i.gz" "compiled.arpa.$i"
done
../scripts/training/train-model.perl --first-step 1 --last-step 5 -root-dir . -corpus splitted -f sur -e val -lm 0:3:$PWD/compiled.arpa.sur -extract-options "--SentenceId" -external-bin-dir ../tools/bin/
$HOME/rumantsch/mosesdecoder/scripts/../bin/extract $HOME/rumantsch/mosesdecoder/rumantsch/splitted.val $HOME/rumantsch/mosesdecoder/rumantsch/splitted.sur $HOME/rumantsch/mosesdecoder/rumantsch/model/aligned.grow-diag-final $HOME/rumantsch/mosesdecoder/rumantsch/model/extract 7 --SentenceId --GZOutput
zcat model/extract.sid.gz | awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' | sort | uniq -c | sort -nr | head -n 10 > results
You can use edit distance algorithms, for example, Levenshtein distance. It may be necessary to make some minor changes in algorithms to register exact modifications.
Edit distance algorithms and Levenshtein distance like as LCS method are Beneficial. But they are used to change one word to another one, by these methods you can find out how to change one word to another word with minimum amount of changes. But they aren't useful to find minimum amount of changes in two dictionaries.
The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences (often just two). Note that subsequence is different from a substring, see substring vs. subsequence. It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics.
By using LCS or any other methods,for each word in list1, find that how that word changes to another one in list 2. for example,between foot & feet: LCS=FT ,difference = oo=>ee. You should make a bipartite graph that the first part makes from difference words, and the second part makes from list1. Each node in the second part is connected to its own related difference in the first part.
I'll suppose length and total part of words are limited.
We can model this problem with graphs. Assign each change to one node. e.g e → i (which determines one of a changes) is label for one node.
for example,If all of the operation of the form p→q is set to one part in bipartite graph and total difference for each pair of word's is equal one and define Edge collection's that Edge uv
connects vertex U
to V
if word(u) (in second part) for changing to correct word needs V's changes.You have a maximum 25*26 node in first part (for this example).
if In your case length>=1 ,so you need set a limit. I'll assume maximum part of change to each word is equal 3.and so we have 3*35K maximum node in first part. Now you can solve problem by choose set of node in first part that can be covered maximum word in second part(your chose should occur word change to correct word).
Find minimum vertex cut in this graph, to find set of node that they can supply your request.and repeat cut vertex algorithm's until get good answer.
you can use some sort of bounds to reduce the size of graph, e.g remove all nodes in the first part which has degree 1
, because this nodes are connected to exactly one word, so they seems to be useless.
This graph simulation can solve your problem. With current problem statement, this bounds of algorithm works properly.
for example:
It is example for bounds in this graph(remove all node in operation part that they have 1 edge):
1-remove node 4 because it is only connected to (nod),then remove node(nod) 2-remove node 2 because it is only connected to (sghoul),then remove node(sghoul) 3-now remove node 3 because it is only connected to (goud)[sghoul is removed last step],then remove node(goud)
and now you have one operation oo=>ee and you can choose this.
I'll think more and try to edit this text.