Awk: What is wrong with CJK characters? #Korean

Question

Given a .txt file with space-separated words such as:

But where is Esope the holly Bastard
But where is 생 지 옥 이 군
지 옥 이
지 옥
지
我 是 你 的 爸 爸 ！
爸 爸 ！ ！ ！
你 不 會 的 ！

And the Awk command:

cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2" "$1}'

I get the following output in my console, which is wrong for the Korean words (but correct for the English and the space-separated Chinese words):

생 16
Bastard 1
But 2
Esope 1
holly 1
is 2
the 1
where 2
不 1
你 2
我 1
是 1
會 1
爸 4
的 2

How can I get this to work for Korean words? Note: I actually have 300,000 lines and nearly 2 million words.

EDIT: Used answer:

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt

Answer 1:

A single awk script can handle this easily and will be far more efficient than your current pipeline:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file 
옥 3
Bastard 1
！ 5
爸 4
군 1
지 4
But 2
會 1
你 2
the 1
是 1
不 1
이 2
Esope 1
的 2
holly 1
where 2
생 1
我 1
is 2
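
The one-liner above sets `RS` to a regular expression. That works in gawk and mawk, but POSIX leaves a multi-character `RS` unspecified, so other awks may behave differently. As a minimal sketch of a more portable variant, the script below keeps the default record separator and counts every field on each line instead. The function name `count_words` and the array name `count` are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: count whitespace-separated words in the given file(s).
# Assumes nothing beyond a POSIX awk; does not rely on a regex RS.
count_words() {
  awk '
    # Tally each field (word) on the current line.
    { for (i = 1; i <= NF; i++) count[$i]++ }
    # Print "word count" pairs; the iteration order is unspecified.
    END { for (w in count) print w, count[w] }
  ' "$@"
}
```

It would be called as `count_words file`, and the output can be piped through `sort` if a stable order is needed.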

If you want to store the results into another file you can use redirection like:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file > outfile
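
As for why the original pipeline failed: the count of 16 reported for 생 equals the eleven Hangul tokens (옥 3, 군 1, 지 4, 이 2, 생 1) plus the five ！ tokens in the output above. That suggests `sort` and `uniq` treated all of those characters as equal, which is what can happen when the active locale defines no collation order for them. This explanation is an assumption and is not confirmed in the question. Under that assumption, a sketch of the original pipeline with byte-wise comparison forced through `LC_ALL=C` might look like this (the function name `count_words_pipeline` is hypothetical):

```shell
# Hypothetical helper: the pipeline from the question, with sort and uniq
# forced to compare raw bytes (C locale) instead of using locale collation.
count_words_pipeline() {
  tr ' ' '\n' < "$1" |
    LC_ALL=C sort |
    LC_ALL=C uniq -c |
    awk '{print $2" "$1}'   # swap the columns to "word count"
}
```

The single awk script is still likely to be the better choice for a file with about 2 million words, since it avoids sorting the whole input.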

Source: https://stackoverflow.com/questions/15599781/awk-what-wrong-with-cjk-characters-korean

Tags

awk

cjk

word-frequency
