Making a list of traditional Chinese characters from a string

喜你入骨 posted on 2019-12-25 06:21:27

Question


I am trying to count how many times each character is used in a large sample of traditional Chinese text. I am interested in characters, not words. The file also includes punctuation and Western characters.

I am reading in an example file that contains a large sample of traditional Chinese characters. Here is a small subset:

首映鼓掌10分鐘 評語指不及《花樣年華》 該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本 增減20處 趙本山香港戲分被刪 在柏林影展放映的《一代宗師》版本 教李小龍武功 葉問決戰散打王

另一增加的戲分是開場時葉問(梁朝偉飾)

My strategy is to read each line, split each line into a list of characters, and check each character to see whether it already exists in a list or a dictionary of characters. If the character does not yet exist there, I will add it. If it does exist, I will increase the counter for that specific character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will take more processing, but should be much easier to code.
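
The counting strategy described above can be sketched with a single dictionary-like counter instead of two parallel lists. This is a minimal sketch, assuming Python 3 and lines that have already been decoded to text. The function name count_characters and the sample lines are illustrative only. It counts every non-whitespace character, so punctuation and Western characters would still need to be filtered out separately.

```python
from collections import Counter


def count_characters(lines):
    """Count every non-whitespace character in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Iterating over a decoded string yields one character at a time.
        counts.update(ch for ch in line if not ch.isspace())
    return counts


# Two short lines taken from the sample text in the question.
sample = ["首映鼓掌10分鐘", "完場後獲全場鼓掌10分鐘"]
counts = count_characters(sample)

# Show the most frequent characters first.
for character, count in counts.most_common(5):
    print(character, count)
```

A Counter returns 0 for characters it has never seen, so no explicit "does it exist yet" check is needed.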

I have not gotten anywhere near this point yet.

I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.

However, I run into trouble when I try to make a list of each character on a particular line.

I've read through the following question: "How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?" I understood many of the comments but, unfortunately, not enough to solve my problem.

My code looks like the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs

wordfile = open('Chinese_example.txt', 'r')

output = open('Chinese_output_python.txt', 'w')

LINES = wordfile.readlines()

Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.

A_LINE = list(LINES[0])

output.write(A_LINE[0])

Answer 1:


I think you want to use this, from the answerer 'flow' at "How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?":

from re import compile as _Re

_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split

def split_unicode_chrs( text ):
  return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
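
The surrogate-pair branch in the pattern above matters when a string is stored as UTF-16 code units, as on narrow Python 2 builds, where a character outside the Basic Multilingual Plane occupies two units. As a minimal sketch, assuming Python 3.3 or later (where a str is indexed by code point), a plain list() call already keeps such a character whole. The sample string is illustrative only.

```python
# Assumption: Python 3.3+, where str is a sequence of Unicode code points.
# U+20000 is a CJK Extension B character that lies outside the BMP.
text = "葉問\U00020000"

# list() yields one element per code point, including the non-BMP character.
chars = list(text)

print(len(chars))
```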



Answer 2:


To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded text. Pretty basic: decode the UTF-8 bytes into a unicode string first, then split.

my_new_list = list(LINES[0].decode('utf8'))
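
The same idea can be sketched in Python 3, which is an assumption here, since the code in the question targets Python 2. Splitting undecoded bytes gives one item per byte, while decoding first gives one item per character. The helper read_characters is hypothetical and simply lets open() do the decoding.

```python
# Bytes as they would come from a file opened in binary mode.
raw = "首映".encode("utf-8")

# Splitting the bytes gives individual byte values, not characters.
byte_items = list(raw)

# Decoding first gives one list element per character.
chars = list(raw.decode("utf-8"))


def read_characters(path, encoding="utf-8"):
    """Return the characters of the first line of a text file (hypothetical helper)."""
    # Passing an encoding makes open() decode the bytes while reading.
    with open(path, encoding=encoding) as handle:
        return list(handle.readline().rstrip("\n"))


print(len(byte_items), len(chars))
```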



Source: https://stackoverflow.com/questions/14803834/making-a-list-of-traditional-chinese-characters-from-a-string
