tokenize

NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

Posted by 佐手、 on 2019-12-22 07:09:07
Question: I am trying to tokenize text using RegexpTokenizer. Code:

from nltk.tokenize import RegexpTokenizer
#from nltk.tokenize import word_tokenize
line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokenizer = RegexpTokenizer(pattern)
print tokenizer.tokenize(line)
#print word_tokenize(line)

Output: ['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name',
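
A likely culprit, sketched below, is that the pattern is written as an ordinary (non-raw) string, so \b is read as a backspace character instead of a word boundary, and the | characters inside the character classes are literal pipes rather than alternation. A minimal sketch of a cleaned-up pattern, assuming that is the problem:

from nltk.tokenize import RegexpTokenizer

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
# Raw string keeps \b as a word boundary; inside [...] no | separators are needed.
pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
tokenizer = RegexpTokenizer(pattern)
print(tokenizer.tokenize(line))
# Abbreviations such as 'U.S.A' and 'U.S.A.' and numbers such as '1.11' and '1,000'
# now come out as single tokens.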

C++ reading a text file with conditional statements

Posted by ≯℡__Kan透↙ on 2019-12-21 21:37:12
Question: I am trying to read lines from a text file, tokenize each line, and then do the same to the next line inside a switch/break block, but after my program reaches the first break it exits the loop and ignores the rest of the file.

ifstream in("test.txt");
string line,buffer;
unsigned int firstLetter = 0;
//istringstream iss;
if( in.is_open() ) {
istringstream iss;
while(getline(in,line)) {
iss.str(line);
char firstChar = line.at( firstLetter );
switch ( firstChar ) {
case 'D': while

Python re.split() vs nltk word_tokenize and sent_tokenize

Posted by 我只是一个虾纸丫 on 2019-12-20 12:36:02
Question: I was going through this question. I am just wondering whether NLTK would be faster than regex in word/sentence tokenization.

Answer 1: The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer of the Penn Treebank. Do note that str.split() doesn't achieve tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize
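
The answer is cut off above; the contrast it draws, plus a rough timing, can be sketched like this (assuming NLTK is installed and its 'punkt' data has been downloaded; the numbers are machine-dependent):

import re
import timeit
from nltk import word_tokenize  # may need nltk.download('punkt') beforehand

sent = "This is a foo, bar sentence."

# A plain regex split throws the punctuation away...
print(re.split(r"\W+", sent))   # ['This', 'is', 'a', 'foo', 'bar', 'sentence', '']
# ...while the Treebank-style tokenizer keeps ',' and '.' as separate tokens.
print(word_tokenize(sent))      # ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']

# Crude speed comparison; the regex is usually faster, but it also does less work.
print(timeit.timeit(lambda: re.split(r"\W+", sent), number=10000))
print(timeit.timeit(lambda: word_tokenize(sent), number=10000))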

Solr: exact phrase query with a EdgeNGramFilterFactory

Posted by 淺唱寂寞╮ on 2019-12-20 10:47:02
Question: In Solr (3.3), is it possible to make a field letter-by-letter searchable through an EdgeNGramFilterFactory and also sensitive to phrase queries? For example, I'm looking for a field that, if it contains "contrat informatique", will be found if the user types:

contrat informatique
contr informa
"contrat informatique"
"contrat info"

Currently, I made something like this:

<fieldtype name="terms" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"

tokenizing and converting to pig latin

Posted by 六眼飞鱼酱① on 2019-12-20 07:28:33
Question: This looks like homework but please be assured that it isn't. It's just an exercise in the book we use in our C++ course; I'm trying to read ahead on pointers. The exercise tells me to split a sentence into tokens, convert each of them into pig Latin, and then display them. Pig Latin here works like this: take the first letter out, put it at the end, then add "ay", so "ball" becomes "allbay" and "boy" becomes "oybay". So far this is what I have:

#include
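
The C++ source is cut off above at #include, so purely to illustrate the transformation the exercise describes, here is a minimal sketch in Python (hypothetical helper name, ignoring punctuation and capitalisation):

def to_pig_latin(word):
    # Move the first letter to the end and append "ay": "ball" -> "allbay", "boy" -> "oybay".
    return word[1:] + word[0] + "ay"

sentence = "ball boy token"
print(" ".join(to_pig_latin(w) for w in sentence.split()))
# allbay oybay okentay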

Python Tokenization

Posted by 自闭症网瘾萝莉.ら on 2019-12-20 03:52:24
Question: I am new to Python and I have a tokenization assignment. The input is a .txt file with sentences and the output is a .txt file with tokens; by "token" I mean a simple word or one of ',', '!', '?', '.', '"'. I have this function, where Elemnt is a word with or without punctuation (it could be a word like Hi, or said:, or said"), StrForCheck is an array of the punctuation marks I want to separate from the words, and TokenFile is my output file:

def CheckIfSEmanExist(Elemnt,StrForCheck, TokenFile):
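
The function body is cut off above; as a rough sketch of the same idea (with hypothetical file names, not the asker's CheckIfSEmanExist), a single regular expression can separate words from the listed punctuation marks in one pass:

import re

def tokenize_line(line):
    # \w+ grabs whole words; the character class grabs each listed
    # punctuation mark (, ! ? . ") as a token of its own.
    return re.findall(r'\w+|[,!?."]', line)

with open("input.txt", encoding="utf-8") as src, open("tokens.txt", "w", encoding="utf-8") as out:
    for line in src:
        for token in tokenize_line(line):
            out.write(token + "\n")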

Is SQLite on Android built with the ICU tokenizer enabled for FTS?

Posted by 血红的双手。 on 2019-12-19 06:39:16
Question: Like the title says: can we use ...USING fts3(tokenizer icu th_TH, ...)? If we can, does anyone know what locales are supported, and whether it varies by platform version?

Answer 1: No, only tokenizer=porter. When I specify tokenizer=icu, I get "android.database.sqlite.SQLiteException: unknown tokenizer: icu". Also, this link hints that if Android didn't compile it in by default, it will not be available: http://sqlite.phxsoftware.com/forums/t/2349.aspx

Answer 2: For API Level 21 or up, I tested and found

How can I split a string of a mathematical expressions in python?

Posted by £可爱£侵袭症+ on 2019-12-19 05:22:00
Question: I made a program which converts infix to postfix in Python. The problem is when I introduce the arguments. If I introduce something like this (it will be a string):

( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) )

it will split it with .split() and the program will work correctly. But I want the user to be able to introduce something like this:

((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )

As you can see, I want the blank spaces to be trivial, but the program
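
The question is cut off above; one hedged way to make the spacing irrelevant is to tokenize with a regular expression instead of .split(), matching whole numbers and single operator/parenthesis characters (this sketch does not handle decimals or negative numbers):

import re

expr = "((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )"

# \d+ keeps multi-digit numbers together; [^\s\d] picks up each
# parenthesis or operator as its own token, whatever the spacing.
tokens = re.findall(r"\d+|[^\s\d]", expr)
print(tokens)
# ['(', '(', '73', '+', '(', '(', '34', '-', '72', ')', '/', '(', '33', '-', '3', ')', ')', ')', '+', '(', '56', '+', '(', '95', '-', '28', ')', ')', ')']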

NLTK regexp tokenizer not playing nice with decimal point in regex

Posted by 自作多情 on 2019-12-19 03:22:42
Question: I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 into "three point one four" or "three point fourteen". I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment, something like $23.50 is handled perfectly (it parses to ['$23.50']), but 3.14 is parsing to ['3', '14'] - the decimal point is being dropped. I've
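
The question is cut off above, but a commonly suggested tweak (a sketch, not necessarily the accepted answer to that question) is to make the group non-capturing, since the tokenizer is findall-based and a capturing group changes what findall returns:

from nltk.tokenize import RegexpTokenizer

# (?:...) is a non-capturing group, so the whole number/currency/percentage
# match is returned instead of just the captured fraction.
tokenizer = RegexpTokenizer(r"\$?\d+(?:\.\d+)?%?")
print(tokenizer.tokenize("pi is 3.14 and lunch cost $23.50"))
# ['3.14', '$23.50']  (this pattern only extracts number-like tokens)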