tokenize

How to match a regex expression and get the preceding words

淺唱寂寞╮ submitted on 2019-12-11 07:53:58
Question: I use a regex to match certain expressions within a text. Assume I want to match a number, or numbers separated by commas (with or without spaces), all within parentheses in a text (in reality the matches are more complex, including spaces etc.). I do the following:

import re

pattern = re.compile(r"(\()([0-9]+(,)?( )?)+(\))")
matches = pattern.findall(content)

matches is a list with the matches:

for i, match in enumerate(matches):
    print(i, match)

Example text: Lorem ipsum dolor sit amet (12,16) ,
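A minimal sketch of one way to also capture the word immediately preceding each parenthesized group, using finditer with an extra capture group (the sample text and variable names are assumptions, not the asker's real data):

import re

text = "Lorem ipsum dolor sit amet (12,16) and more words (3, 4,5) here"

# Group 1: the word before the parenthesis; group 2: the number list inside.
pattern = re.compile(r"(\w+)?\s*\(\s*([0-9]+\s*(?:,\s*[0-9]+\s*)*)\)")

for i, m in enumerate(pattern.finditer(text)):
    print(i, m.group(1), "->", m.group(2))
# 0 amet -> 12,16
# 1 words -> 3, 4,5

Unlike findall, finditer yields match objects, so the preceding word and the number list stay paired per match.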

Tokenize a Thai sentence with ICUTokenizer in Java

徘徊边缘 submitted on 2019-12-11 07:18:02
Question: I am trying the code below to get all the tokens from a Thai sentence. It throws an exception. Can anyone point me to a way to tokenize Thai in Java?

import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
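A minimal sketch of driving ICUTokenizer by hand, assuming a recent Lucene analysis-icu module on the classpath (the sample sentence is an assumption); one common cause of an exception here is calling incrementToken() without reset() first:

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ThaiTokenizeDemo {
    public static void main(String[] args) throws Exception {
        // The default ICU config segments Thai with a dictionary-based break iterator.
        ICUTokenizer tokenizer = new ICUTokenizer();
        tokenizer.setReader(new StringReader("สวัสดีครับ ผมชื่อโจ"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();                 // mandatory before the first incrementToken()
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}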

ValueError: cannot reshape array of size 3800 into shape (1,200)

狂风中的少年 submitted on 2019-12-11 06:47:41
Question: I am trying to apply word embeddings to tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet, as follows:

def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

Next, when I try to prepare word2vec
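A minimal sketch of stacking the per-tweet vectors into one array; size must equal the embedding model's actual dimensionality, or the reshape((1, size)) calls above raise exactly this kind of ValueError (the tweet list and the gensim-style model_w2v access are assumptions):

import numpy as np

# size should come from the model rather than be hard-coded, e.g.:
# size = model_w2v.vector_size
size = 200

tokenized_tweets = [["good", "morning"], ["hello", "world"]]  # assumed input

# word_vector(...) returns shape (1, size); collect rows into an (n, size) array.
wordvec_array = np.zeros((len(tokenized_tweets), size))
for i, tokens in enumerate(tokenized_tweets):
    wordvec_array[i, :] = word_vector(tokens, size)

print(wordvec_array.shape)  # (2, 200)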

Flesch-Kincaid readability test in Python

£可爱£侵袭症+ submitted on 2019-12-11 06:00:01
Question: I need help with this problem I'm having. I need to write a function that returns a FRES (Flesch reading-ease score) for a text, given the formula:

FRES = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)

In other words, my task is to turn this formula into a Python function. This is the code from the previous question I had:

import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
import re

VC = re.compile('[aeiou]+[^aeiou]+')
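A minimal sketch of the function, using NLTK's punkt tokenizers and the VC regex above as a rough syllable counter (the helper names and the sample sentence are assumptions):

def syllables(word):
    # Approximate: count vowel-consonant groups; at least one syllable per word.
    return max(1, len(VC.findall(word.lower())))

def fres(text):
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    total_syllables = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (total_syllables / len(words)))

print(fres("The cat sat on the mat. It was a sunny day."))  # about 116.7: very easy text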

C tokenize polynomial coefficients

我与影子孤独终老i submitted on 2019-12-11 05:47:23
Question: I'm trying to put the coefficients of polynomials from a char array into an int array. I have this:

char string[] = "-4x^0 + x^1 + 4x^3 - 3x^4";

and can tokenize it by the space into:

-4x^0 x^1 4x^3 3x^4

So I am trying to get -4, 1, 4, 3 into an int array:

int *coefficient;
coefficient = new int[counter];
p = strtok(copy, " +");
int a;
while (p) {
    int z = 0;
    while (p[z] != 'x')
        z++;
    char temp[z];
    strncpy(temp[z], p, z);
    coefficient[a] = atoi(temp);
    p = strtok(NULL, " +");
    a++;
}

However, I'm
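A minimal corrected sketch in C (the snippet above passes temp[z], a single char, where strncpy needs a char*, never terminates temp, and never initializes a; this rewrite is a suggestion, not the accepted answer):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char copy[] = "-4x^0 + x^1 + 4x^3 - 3x^4";
    int coefficient[16];                  /* assumed fixed capacity */
    int a = 0;                            /* must start at 0 */

    char *p = strtok(copy, " +");
    while (p != NULL && a < 16) {
        char *x = strchr(p, 'x');         /* bounds-safe, unlike the raw z loop */
        if (x != NULL) {                  /* skips stray tokens such as a lone "-" */
            size_t z = (size_t)(x - p);   /* length of the part before 'x' */
            char temp[32];
            strncpy(temp, p, z);          /* copy into temp itself, not temp[z] */
            temp[z] = '\0';               /* atoi needs a terminated string */
            coefficient[a++] = (z == 0) ? 1 : atoi(temp);  /* "x^1" -> 1 */
        }
        p = strtok(NULL, " +");
    }

    for (int i = 0; i < a; i++)
        printf("%d ", coefficient[i]);    /* prints: -4 1 4 3 */
    printf("\n");
    return 0;
}

Note that " - 3x^4" splits so the minus becomes its own token and is dropped, matching the 3 (not -3) the asker lists; preserving the sign would require handling '-' outside of strtok.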

How to tokenize, scan or split this string of email addresses

随声附和 submitted on 2019-12-11 04:42:44
Question: For Simple Java Mail I'm trying to deal with a somewhat free format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid. Here is an example of a valid input:

"name@domain.com,Sixpack, Joe 1 <name@domain.com>, Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;"

So there are two basic
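A minimal sketch that pulls out just the bare addresses with a regex, assuming (as the question states) the input is already valid; it deliberately ignores the personal names that the <...> form carries:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressExtractor {
    public static void main(String[] args) {
        String input = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, "
                + "Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , "
                + "nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;";

        // Anything of the form local@domain, stopping at whitespace and delimiters.
        Pattern address = Pattern.compile("[^\\s<>,;]+@[^\\s<>,;]+");
        List<String> found = new ArrayList<>();
        Matcher m = address.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        System.out.println(found);  // six name*@domain.com entries
    }
}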

Drop a table originally created with 'unknown tokenizer'?

馋奶兔 submitted on 2019-12-11 04:37:59
Question: I have a SQLite3 database. A single table inside this DB can't be dropped; the error message says unknown tokenizer: mm. I tried it directly with the command DROP TABLE tablename; inside the newest SQLiteSpy v1.9.11 and also from .NET code with the official sqlite NuGet package v1.0.103. How can I drop a table whose tokenizer is unknown?

Answer 1: The documentation says: For each FTS virtual table in a database, three to five real (non-virtual) tables are created to store the underlying
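A sketch of one commonly suggested workaround built on that fact: drop the real shadow tables, then delete the virtual table's schema entry by hand (back up the database first; writable_schema edits bypass all safety checks, and the exact set of shadow tables depends on the FTS version and options):

-- Shadow tables backing an FTS3/FTS4 table named "tablename":
DROP TABLE IF EXISTS tablename_content;
DROP TABLE IF EXISTS tablename_segments;
DROP TABLE IF EXISTS tablename_segdir;
DROP TABLE IF EXISTS tablename_docsize;
DROP TABLE IF EXISTS tablename_stat;

-- The virtual table's own entry is what DROP TABLE cannot process
-- without the missing "mm" tokenizer, so remove it directly:
PRAGMA writable_schema = 1;
DELETE FROM sqlite_master WHERE type = 'table' AND name = 'tablename';
PRAGMA writable_schema = 0;
VACUUM;  -- rebuild the file so the edited schema is fully applied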

How to stop . being treated as a separator in SQLite FTS4

戏子无情 submitted on 2019-12-11 03:36:24
Question: I want to be able to search for numbers like 2.3 using FTS4 in SQLite, but the . is being treated as a token boundary. Short of writing a full bespoke tokenizer, is there any other way of excluding the . from the list of token-boundary characters? Being able to search for decimal numbers seems like a common use case, but I can't find anything relevant on SO / Google. My best solution at present is to replace all . chars in the text with a known (long) string of letters and substitute
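One possibility worth checking (a sketch, not necessarily this question's accepted answer): FTS4's unicode61 tokenizer, available in SQLite 3.7.13+, takes a tokenchars option that promotes characters to token characters:

-- Treat '.' as part of a token instead of as a separator.
CREATE VIRTUAL TABLE docs USING fts4(
    content,
    tokenize=unicode61 "tokenchars=."
);

INSERT INTO docs(content) VALUES ('version 2.3 released');

-- "2.3" is now indexed as a single token and matches directly.
SELECT * FROM docs WHERE docs MATCH '2.3';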

String tokenizer in C

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 03:13:06
Question: The following code will break down the string command using a space, i.e. " ", or a full stop, i.e. ".". What if I want to break down command on the occurrence of both the space and the full stop together (as one sequence), and not each by themselves? E.g. a command like 'hello .how are you' would be broken into the pieces (ignoring the quotes) [hello] [how are you]

char *token2 = strtok(command, " .");

Answer 1: You can do it pretty easily with strstr:

char *strstrtok(char *str, char *delim) {
    static char
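The answer's code is cut off above; a complete sketch of the same strstr idea (my reconstruction, not the original answer's exact code) could look like this:

#include <stdio.h>
#include <string.h>

/* Like strtok, but splits on the whole multi-character sequence delim
   rather than on each delimiter character individually. */
char *strstrtok(char *str, const char *delim) {
    static char *next;              /* where the previous call left off */
    if (str != NULL)
        next = str;
    if (next == NULL)
        return NULL;
    char *start = next;
    char *hit = strstr(next, delim);
    if (hit != NULL) {
        *hit = '\0';                /* terminate the current token */
        next = hit + strlen(delim); /* resume after the full delimiter */
    } else {
        next = NULL;                /* last token reached */
    }
    return start;
}

int main(void) {
    char command[] = "hello .how are you";
    for (char *t = strstrtok(command, " ."); t != NULL; t = strstrtok(NULL, " ."))
        printf("[%s]\n", t);        /* prints [hello] then [how are you] */
    return 0;
}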

Boost split not traversing inside of parentheses or braces

拟墨画扇 submitted on 2019-12-11 03:05:39
Question: I try to split the following text:

std::string text = "1,2,3,max(4,5,6,7),array[8,9],10,page{11,12},13";

I have the following code:

std::vector<std::string> found_list;
boost::split(found_list, text, boost::is_any_of(","));

But my desired output is:

1
2
3
max(4,5,6,7)
array[8,9]
10
page{11,12}
13

Regarding parentheses and braces, how do I implement this?

Answer 1: You want to parse a grammar. Since you tagged with boost, let me show you using Boost Spirit: Live On Coliru

#include <boost/spirit/include/qi
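The Spirit answer is cut off above; as a simpler alternative sketch (a hand-rolled depth counter, plainly not the answer's Spirit grammar), splitting only on commas at nesting depth zero gives the desired output:

#include <iostream>
#include <string>
#include <vector>

// Split on commas that are not inside (), [] or {}.
std::vector<std::string> split_top_level(const std::string& text) {
    std::vector<std::string> out;
    std::string current;
    int depth = 0;
    for (char c : text) {
        if (c == '(' || c == '[' || c == '{') ++depth;
        else if (c == ')' || c == ']' || c == '}') --depth;
        if (c == ',' && depth == 0) {
            out.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    out.push_back(current);  // final token after the last top-level comma
    return out;
}

int main() {
    std::string text = "1,2,3,max(4,5,6,7),array[8,9],10,page{11,12},13";
    for (const auto& tok : split_top_level(text))
        std::cout << tok << '\n';
}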