NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected
Question: I am trying to tokenize text using RegexpTokenizer.

Code:

    from nltk.tokenize import RegexpTokenizer
    #from nltk.tokenize import word_tokenize

    line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
    pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
    tokenizer = RegexpTokenizer(pattern)

    print(tokenizer.tokenize(line))
    #print(word_tokenize(line))

Output:

    ['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name',
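Two details of the pattern are likely causing this, assuming the goal is to keep acronyms like "U.S.A." as single tokens. First, the pattern is not a raw string, so `\b` is interpreted by Python as a literal backspace character (`\x08`) rather than a regex word boundary, which makes the acronym branch never match. Second, `|` inside a character class such as `[\d|\.|\,]` is a literal pipe, not alternation. A minimal sketch of the cleaned-up pattern, tested here with plain `re.findall` (which matches the same tokens as `RegexpTokenizer` with its default settings, as far as I know):

```python
import re

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"

# Raw string so \b stays a word boundary; no stray '|' inside the
# character classes (inside [...] a '|' is a literal pipe, not alternation).
pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'

tokens = re.findall(pattern, line)
print(tokens)  # starts with ['U.S.A', 'Count', 'U.S.A.', ...]
```

Note that "Sec.of" still splits into `'Sec'`, `'.'`, `'of'` with this pattern, because the acronym branch only allows uppercase letters and dots after the first character; handling abbreviations with lowercase letters would need a further change.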