tokenize

NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

Posted by 佐手、 on 2019-12-22 07:09:07
Question: I am trying to tokenize text using RegexpTokenizer. Code:

from nltk.tokenize import RegexpTokenizer
#from nltk.tokenize import word_tokenize
line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokenizer = RegexpTokenizer(pattern)
print tokenizer.tokenize(line)
#print word_tokenize(line)

Output: ['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name',
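
A likely culprit, sketched below, is that the pattern is written as an ordinary (non-raw) string, so \b is read as a backspace character instead of a word boundary, and the | characters inside the character classes are literal pipes rather than alternation. A minimal sketch of a cleaned-up pattern, assuming that is the problem:

from nltk.tokenize import RegexpTokenizer

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
# Raw string keeps \b as a word boundary; inside [...] no | separators are needed.
pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
tokenizer = RegexpTokenizer(pattern)
print(tokenizer.tokenize(line))
# Abbreviations such as 'U.S.A' and 'U.S.A.' and numbers such as '1.11' and '1,000'
# now come out as single tokens.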

C++ reading a text file with conditional statements

Posted by ≯℡__Kan透↙ on 2019-12-21 21:37:12
Question: I am trying to read lines from a text file, tokenize each line, and then do the same to the next line inside a switch/break block, but after my program reaches the first break it exits the loop and ignores the rest of the file.

ifstream in("test.txt");
string line,buffer;
unsigned int firstLetter = 0;
//istringstream iss;
if( in.is_open() ) {
istringstream iss;
while(getline(in,line)) {
iss.str(line);
char firstChar = line.at( firstLetter );
switch ( firstChar ) {
case 'D': while

Python re.split() vs nltk word_tokenize and sent_tokenize

Posted by 我只是一个虾纸丫 on 2019-12-20 12:36:02
Question: I was going through this question. I am just wondering whether NLTK would be faster than regex in word/sentence tokenization.

Answer 1: The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer of the Penn Treebank. Do note that str.split() doesn't achieve tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize
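
The answer is cut off above; the contrast it draws, plus a rough timing, can be sketched like this (assuming NLTK is installed and its 'punkt' data has been downloaded; the numbers are machine-dependent):

import re
import timeit
from nltk import word_tokenize  # may need nltk.download('punkt') beforehand

sent = "This is a foo, bar sentence."

# A plain regex split throws the punctuation away...
print(re.split(r"\W+", sent))   # ['This', 'is', 'a', 'foo', 'bar', 'sentence', '']
# ...while the Treebank-style tokenizer keeps ',' and '.' as separate tokens.
print(word_tokenize(sent))      # ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']

# Crude speed comparison; the regex is usually faster, but it also does less work.
print(timeit.timeit(lambda: re.split(r"\W+", sent), number=10000))
print(timeit.timeit(lambda: word_tokenize(sent), number=10000))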

Solr: exact phrase query with a EdgeNGramFilterFactory

Posted by 淺唱寂寞╮ on 2019-12-20 10:47:02
Question: In Solr (3.3), is it possible to make a field letter-by-letter searchable through an EdgeNGramFilterFactory and also sensitive to phrase queries? For example, I'm looking for a field that, if it contains "contrat informatique", will be found if the user types:

contrat informatique
contr informa
"contrat informatique"
"contrat info"

Currently, I made something like this:

<fieldtype name="terms" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"

tokenizing and converting to pig latin

Posted by 六眼飞鱼酱① on 2019-12-20 07:28:33
Question: This looks like homework but please be assured that it isn't. It's just an exercise in the book we use in our C++ course; I'm trying to read ahead on pointers. The exercise tells me to split a sentence into tokens, convert each of them into pig Latin, and then display them. Pig Latin here works like this: take the first letter out, put it at the end, then add "ay", so "ball" becomes "allbay" and "boy" becomes "oybay". So far this is what I have:

#include
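
The C++ source is cut off above at #include, so purely to illustrate the transformation the exercise describes, here is a minimal sketch in Python (hypothetical helper name, ignoring punctuation and capitalisation):

def to_pig_latin(word):
    # Move the first letter to the end and append "ay": "ball" -> "allbay", "boy" -> "oybay".
    return word[1:] + word[0] + "ay"

sentence = "ball boy token"
print(" ".join(to_pig_latin(w) for w in sentence.split()))
# allbay oybay okentay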

Python Tokenization

Posted by 自闭症网瘾萝莉.ら on 2019-12-20 03:52:24
Question: I am new to Python and I have a tokenization assignment. The input is a .txt file with sentences and the output is a .txt file with tokens; by "token" I mean a simple word or one of ',', '!', '?', '.', '"'. I have this function, where Elemnt is a word with or without punctuation (it could be a word like Hi, or said:, or said"), StrForCheck is an array of the punctuation marks I want to separate from the words, and TokenFile is my output file:

def CheckIfSEmanExist(Elemnt,StrForCheck, TokenFile):
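
The function body is cut off above; as a rough sketch of the same idea (with hypothetical file names, not the asker's CheckIfSEmanExist), a single regular expression can separate words from the listed punctuation marks in one pass:

import re

def tokenize_line(line):
    # \w+ grabs whole words; the character class grabs each listed
    # punctuation mark (, ! ? . ") as a token of its own.
    return re.findall(r'\w+|[,!?."]', line)

with open("input.txt", encoding="utf-8") as src, open("tokens.txt", "w", encoding="utf-8") as out:
    for line in src:
        for token in tokenize_line(line):
            out.write(token + "\n")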

Is SQLite on Android built with the ICU tokenizer enabled for FTS?

Posted by 血红的双手。 on 2019-12-19 06:39:16
Question: Like the title says: can we use ...USING fts3(tokenizer icu th_TH, ...)? If we can, does anyone know what locales are supported, and whether it varies by platform version?

Answer 1: No, only tokenizer=porter. When I specify tokenizer=icu, I get "android.database.sqlite.SQLiteException: unknown tokenizer: icu". Also, this link hints that if Android didn't compile it in by default, it will not be available: http://sqlite.phxsoftware.com/forums/t/2349.aspx

Answer 2: For API Level 21 or up, I tested and found

How can I split a string of a mathematical expressions in python?

Posted by £可爱£侵袭症+ on 2019-12-19 05:22:00
Question: I made a program which converts infix to postfix in Python. The problem is when I introduce the arguments. If I introduce something like this (it will be a string):

( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) )

it will split it with .split() and the program will work correctly. But I want the user to be able to introduce something like this:

((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )

As you can see, I want the blank spaces to be trivial, but the program
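
The question is cut off above; one hedged way to make the spacing irrelevant is to tokenize with a regular expression instead of .split(), matching whole numbers and single operator/parenthesis characters (this sketch does not handle decimals or negative numbers):

import re

expr = "((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )"

# \d+ keeps multi-digit numbers together; [^\s\d] picks up each
# parenthesis or operator as its own token, whatever the spacing.
tokens = re.findall(r"\d+|[^\s\d]", expr)
print(tokens)
# ['(', '(', '73', '+', '(', '(', '34', '-', '72', ')', '/', '(', '33', '-', '3', ')', ')', ')', '+', '(', '56', '+', '(', '95', '-', '28', ')', ')', ')']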

NLTK regexp tokenizer not playing nice with decimal point in regex

Posted by 自作多情 on 2019-12-19 03:22:42
Question: I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 into "three point one four" or "three point fourteen". I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment, something like $23.50 is handled perfectly (it parses to ['$23.50']), but 3.14 is parsing to ['3', '14'] - the decimal point is being dropped. I've
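
The question is cut off above, but a commonly suggested tweak (a sketch, not necessarily the accepted answer to that question) is to make the group non-capturing, since the tokenizer is findall-based and a capturing group changes what findall returns:

from nltk.tokenize import RegexpTokenizer

# (?:...) is a non-capturing group, so the whole number/currency/percentage
# match is returned instead of just the captured fraction.
tokenizer = RegexpTokenizer(r"\$?\d+(?:\.\d+)?%?")
print(tokenizer.tokenize("pi is 3.14 and lunch cost $23.50"))
# ['3.14', '$23.50']  (this pattern only extracts number-like tokens)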