tokenize

Division/RegExp conflict while tokenizing Javascript [duplicate]

∥☆過路亽.° submitted on 2019-11-29 06:53:05
This question already has an answer here: When parsing JavaScript, what determines the meaning of a slash? I'm writing a simple JavaScript tokenizer which detects basic types: Word, Number, String, RegExp, Operator, Comment and Newline. Everything is going fine, but I can't work out how to detect whether the current character is a RegExp delimiter or the division operator. I'm not using regular expressions because they are too slow. Does anybody know the mechanism for detecting it? Thanks. You can tell by what the preceding token in the stream is. Go through each token that your lexer emits
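
A minimal sketch of that previous-token heuristic follows (the token shape and the helper name are illustrative assumptions, not part of the original answer):

```javascript
// Sketch: decide whether a '/' starts a RegExp literal or is the division
// operator by looking at the previous token. Token objects are assumed to
// have {type, value} fields matching the question's categories.
function slashStartsRegExp(prevToken) {
  if (prevToken === null) return true;            // start of input: /.../ is a regex
  switch (prevToken.type) {
    case "Word":
      // After a plain identifier a slash is division (a / b), but after
      // keywords such as return it begins a regex (return /x/).
      return ["return", "typeof", "instanceof", "in", "new", "delete",
              "void", "throw", "case", "do", "else"].includes(prevToken.value);
    case "Number":
    case "String":
    case "RegExp":
      return false;                               // e.g. 10 / 2
    case "Operator":
      // ')' and ']' usually end an expression, so a slash after them is division.
      return prevToken.value !== ")" && prevToken.value !== "]";
    default:
      return true;                                // after '(', ',', '=', '{', ...
  }
}
```

This is only a heuristic; the corner cases (for example a slash right after "if (x)") are the kind of thing the linked duplicate goes into.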

Replacing all tokens based on properties file with ANT

心已入冬 submitted on 2019-11-29 05:59:45
I'm pretty sure this is a simple question to answer, and I've seen it asked before, just with no solid answers. I have several properties files that are used for different environments, e.g. xxxx-dev, xxxx-test, xxxx-live. The properties files contain something like: server.name=dummy_server_name server.ip=127.0.0.1 The template files I'm using look something like: <...> <server name="@server.name@" ip="@server.ip@"/> </...> The above is a really primitive example, but I'm wondering if there is a way to just tell ANT to replace all tokens based on the properties file, rather than having to hardcode a
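
One common approach, sketched below under the assumption that the properties file sits next to the build file and the templates live in a templates/ directory, is to let a copy task's filterset substitute every @token@ placeholder straight from the environment's properties file:

```xml
<!-- Sketch: file names, paths and the target name are illustrative. -->
<target name="expand-templates">
  <copy todir="build/config" overwrite="true">
    <fileset dir="templates" includes="**/*.xml"/>
    <filterset>
      <!-- filterset's default token delimiters are '@' and '@',
           so server.name=... becomes the replacement for @server.name@ -->
      <filtersfile file="xxxx-dev.properties"/>
    </filterset>
  </copy>
</target>
```

Switching environments then only means pointing filtersfile at xxxx-test.properties or xxxx-live.properties.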

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

时光总嘲笑我的痴心妄想 submitted on 2019-11-29 04:48:45
I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size)) vect.fit(string) print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size)) outputs: 4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it'] The punctuation is removed: how can I include it as separate tokens? You should
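
One way to keep punctuation as separate tokens is to relax the vectorizer's token_pattern; the pattern below is an assumption for illustration, not taken from the cut-off answer:

```python
# Sketch: token_pattern keeps runs of word characters as one token and every
# other non-space character (, . ! ? ...) as its own single-character token.
import sklearn.feature_extraction.text

ngram_size = 4
string = ["I really like python, it's pretty awesome."]

vect = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(ngram_size, ngram_size),
    token_pattern=r"\w+|[^\w\s]",
)
vect.fit(string)
# get_feature_names() as in the question; newer sklearn versions use
# get_feature_names_out() instead.
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
```

With this pattern the comma and the full stop survive tokenization, so they show up inside the resulting 4-grams.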

How to avoid NLTK's sentence tokenizer splitting on abbreviations?

喜你入骨 submitted on 2019-11-29 03:58:33
I'm currently using NLTK for language processing, but I have encountered a problem with sentence tokenization. Here's the problem: assume I have the sentence: "Fig. 2 shows a U.S.A. map." When I use the punkt tokenizer, my code looks like this: from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters punkt_param = PunktParameters() abbreviation = ['U.S.A', 'fig'] punkt_param.abbrev_types = set(abbreviation) tokenizer = PunktSentenceTokenizer(punkt_param) tokenizer.tokenize('Fig. 2 shows a U.S.A. map.') It returns this: ['Fig. 2 shows a U.S.A.', 'map.'] The tokenizer can't detect the
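
One detail that commonly causes this (stated here as an assumption, since the excerpt is cut off before the answer): PunktParameters expects abbreviations in lowercase and without the trailing period, so 'U.S.A' should be given as 'u.s.a':

```python
# Sketch: same code as above, but with the abbreviations lowercased,
# which is the form PunktParameters.abbrev_types matches against.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['u.s.a', 'fig'])  # lowercase, no trailing '.'
tokenizer = PunktSentenceTokenizer(punkt_param)

print(tokenizer.tokenize('Fig. 2 shows a U.S.A. map.'))
# Expected: ['Fig. 2 shows a U.S.A. map.']
```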

Split a string into an array in C++ [duplicate]

一世执手 submitted on 2019-11-29 02:48:10
Question: This question already has answers here: Possible Duplicate: How to split a string in C++? I have an input file of data and each line is an entry. In each line, each "field" is separated by a whitespace " ", so I need to split the line by space. Other languages have a function called split (C#, PHP etc.) but I can't find one for C++. How can I achieve this? Here is my code that gets the lines: string line; ifstream in(file); while(getline(in, line)){ // Here I would like to
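
A minimal whitespace-splitting sketch with std::istringstream, fitted into the question's read loop (the file name and vector name are illustrative):

```cpp
// Sketch: split each line on whitespace using operator>>, which skips any
// run of spaces or tabs between fields.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("input.txt");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::vector<std::string> fields;
        std::string field;
        while (iss >> field) {
            fields.push_back(field);
        }
        // fields now holds the whitespace-separated entries of this line.
        std::cout << fields.size() << " fields\n";
    }
}
```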

C++ tokenize a string using a regular expression

一个人想着一个人 submitted on 2019-11-29 01:57:26
I'm trying to teach myself some C++ from scratch at the moment. I'm well-versed in Python, Perl and JavaScript but have only encountered C++ briefly, in a classroom setting in the past. Please excuse the naivete of my question. I would like to split a string using a regular expression but have not had much luck finding a clear, definitive, efficient and complete example of how to do this in C++. In Perl this action is common and can thus be accomplished in a trivial manner: /home/me$ cat test.txt this is aXstringYwith, some problems and anotherXY line with similar issues /home/me$ cat test
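
For the record, a short C++11 sketch of regex-based splitting with std::sregex_token_iterator (the sample string and the [XY] delimiter pattern mirror the test file above; everything else is illustrative):

```cpp
// Sketch: the -1 selector asks the iterator for the substrings *between*
// matches of the delimiter pattern, which is exactly a regex split.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    std::string text = "this is aXstringYwith, some problems";
    std::regex delim("[XY]");

    std::sregex_token_iterator it(text.begin(), text.end(), delim, -1), end;
    std::vector<std::string> tokens(it, end);

    for (const auto& t : tokens)
        std::cout << t << '\n';   // "this is a", "string", "with, some problems"
}
```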

Tokenization of Arabic words using NLTK

↘锁芯ラ submitted on 2019-11-29 00:51:28
Question: I'm using NLTK's word_tokenize to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The code I'm writing is: import re import nltk lex = u" في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء" wordsArray = nltk.word_tokenize(lex) print " ".join(wordsArray) The problem is that the word_tokenize function doesn't split by words. Instead, it splits by letters
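
A hedged note rather than a definitive fix: per-letter splitting is usually a symptom of handing the tokenizer a byte string instead of Unicode text. Under Python 3, where str is Unicode by default, the same call behaves as expected; a minimal sketch assuming Python 3 and the punkt data installed:

```python
# Sketch (Python 3): word_tokenize on a Unicode string returns whole Arabic
# words rather than individual letters.
import nltk

lex = "في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي"
words = nltk.word_tokenize(lex)
print(" ".join(words))
```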

What is more efficient: a switch case or an std::map?

跟風遠走 submitted on 2019-11-28 23:58:32
Question: I'm thinking about the tokenizer here. Each token calls a different function inside the parser. Which is more efficient: a map of std::functions/boost::functions, or a switch case? Answer 1: The STL map that comes with Visual Studio 2008 will give you O(log(n)) for each function call, since it hides a tree structure beneath. With a modern compiler (depending on the implementation), a switch statement will give you O(1); the compiler translates it into some kind of lookup table. So in general, switch is faster.
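
A small sketch contrasting the two dispatch styles discussed above (token names and handler functions are illustrative):

```cpp
// Sketch: map-based dispatch (tree lookup + std::function indirection)
// versus a switch, which the compiler can lower to a jump table.
#include <functional>
#include <iostream>
#include <map>

enum class Tok { Number, Ident, Plus };

void onNumber() { std::cout << "number\n"; }
void onIdent()  { std::cout << "ident\n";  }
void onPlus()   { std::cout << "plus\n";   }

// std::map: O(log n) lookup per call, but handlers can be added or swapped
// at run time without touching the dispatch code.
const std::map<Tok, std::function<void()>> handlers = {
    {Tok::Number, onNumber}, {Tok::Ident, onIdent}, {Tok::Plus, onPlus}};

// switch: fixed at compile time, typically O(1) via a lookup/jump table.
void dispatch(Tok t) {
    switch (t) {
        case Tok::Number: onNumber(); break;
        case Tok::Ident:  onIdent();  break;
        case Tok::Plus:   onPlus();   break;
    }
}

int main() {
    handlers.at(Tok::Ident)();  // map dispatch
    dispatch(Tok::Plus);        // switch dispatch
}
```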

Tokenize a string and include delimiters in C++

非 Y 不嫁゛ submitted on 2019-11-28 17:56:07
I'm tokenizing with the following, but I'm unsure how to include the delimiters with it. void Tokenize(const string str, vector<string>& tokens, const string& delimiters) { int startpos = 0; int pos = str.find_first_of(delimiters, startpos); string strTemp; while (string::npos != pos || string::npos != startpos) { strTemp = str.substr(startpos, pos - startpos); tokens.push_back(strTemp.substr(0, strTemp.length())); startpos = str.find_first_not_of(delimiters, pos); pos = str.find_first_of(delimiters, startpos); } } The C++ String Toolkit Library (StrTk) has the following solution: std::string str =
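
Separately from the StrTk excerpt above, one way to keep the delimiters themselves as tokens is sketched below; it follows the question's Tokenize signature but is otherwise illustrative, not the library's solution:

```cpp
// Sketch: push both the text between delimiters and each single-character
// delimiter into the token vector.
#include <iostream>
#include <string>
#include <vector>

void Tokenize(const std::string& str, std::vector<std::string>& tokens,
              const std::string& delimiters) {
    std::string::size_type start = 0;
    while (start < str.size()) {
        std::string::size_type pos = str.find_first_of(delimiters, start);
        if (pos == std::string::npos) {        // trailing text, no more delimiters
            tokens.push_back(str.substr(start));
            break;
        }
        if (pos > start)                       // text before the delimiter
            tokens.push_back(str.substr(start, pos - start));
        tokens.push_back(str.substr(pos, 1));  // the delimiter itself
        start = pos + 1;
    }
}

int main() {
    std::vector<std::string> tokens;
    Tokenize("a+b*c", tokens, "+*");
    for (const auto& t : tokens) std::cout << t << '\n';  // a + b * c
}
```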

ElasticSearch Analyzer and Tokenizer for Emails

≯℡__Kan透↙ submitted on 2019-11-28 17:53:47
I could not find a perfect solution on either Google or ES for the following situation; I hope someone can help here. Suppose there are five email addresses stored under the field "email":
1. {"email": "john.doe@gmail.com"}
2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}
3. {"email": "hello-john.doe@outlook.com"}
4. {"email": "john.doe@outlook.com"}
5. {"email": "john@yahoo.com"}
I want to fulfill the following search scenarios: [Search -> Receive]
"john.doe@gmail.com" -> 1,2
"john.doe@outlook.com" -> 2,4
"john@yahoo.com" -> 5
"john.doe" -> 1,2,3,4
"john" -> 1,2,3,4,5
"gmail.com" -> 1,2
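
One possible direction, sketched here as an assumption rather than the answer the excerpt cuts off: index the field with a custom analyzer whose pattern tokenizer splits addresses on non-alphanumeric characters, so that john.doe@gmail.com is indexed as the terms john, doe, gmail, com. Analyzer and tokenizer names below are illustrative:

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "email_parts": { "type": "pattern", "pattern": "[^a-zA-Z0-9]+" }
      },
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "email_parts",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": { "type": "text", "analyzer": "email_analyzer" }
    }
  }
}
```

With the same analyzer applied at query time, a match_phrase query for "john.doe" or "gmail.com" matches the partial forms, while a full address like "john.doe@gmail.com" still only matches documents that contain all of its parts in sequence.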