tokenize

How to match a regex expression and get the preceding words

淺唱寂寞╮ submitted on 2019-12-11 07:53:58
Question: I use a regex to match certain expressions within a text. Assume I want to match a number, or numbers separated by commas (with or without spaces), all within parentheses in a text (in reality the matches are more complex, including spaces etc.). I do the following:

import re

pattern = re.compile(r"(\()([0-9]+(,)?( )?)+(\))")
matches = pattern.findall(content)

matches is a list with the matches:

for i, match in enumerate(matches):
    print(i, match)

Example text: Lorem ipsum dolor sit amet (12,16) ,
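A minimal sketch of one way to also capture the word immediately preceding each parenthesized group, using finditer with an extra capture group (the sample text and variable names are assumptions, not the asker's real data):

import re

text = "Lorem ipsum dolor sit amet (12,16) and more words (3, 4,5) here"

# Group 1: the word before the parenthesis; group 2: the number list inside.
pattern = re.compile(r"(\w+)?\s*\(\s*([0-9]+\s*(?:,\s*[0-9]+\s*)*)\)")

for i, m in enumerate(pattern.finditer(text)):
    print(i, m.group(1), "->", m.group(2))
# 0 amet -> 12,16
# 1 words -> 3, 4,5

Unlike findall, finditer yields match objects, so the preceding word and the number list stay paired per match.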

Tokenize a Thai sentence with ICUTokenizer in Java

徘徊边缘 submitted on 2019-12-11 07:18:02
Question: I am trying the code below to get all the tokens from a Thai sentence. It throws an exception. Can anyone point me to a way to tokenize Thai in Java?

import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
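A minimal sketch of driving ICUTokenizer by hand, assuming a recent Lucene analysis-icu module on the classpath (the sample sentence is an assumption); one common cause of an exception here is calling incrementToken() without reset() first:

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ThaiTokenizeDemo {
    public static void main(String[] args) throws Exception {
        // The default ICU config segments Thai with a dictionary-based break iterator.
        ICUTokenizer tokenizer = new ICUTokenizer();
        tokenizer.setReader(new StringReader("สวัสดีครับ ผมชื่อโจ"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();                 // mandatory before the first incrementToken()
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}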

ValueError: cannot reshape array of size 3800 into shape (1,200)

狂风中的少年 submitted on 2019-12-11 06:47:41
Question: I am trying to apply word embeddings to tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet, as follows:

def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

Next, when I try to prepare word2vec
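A minimal sketch of stacking the per-tweet vectors into one array; size must equal the embedding model's actual dimensionality, or the reshape((1, size)) calls above raise exactly this kind of ValueError (the tweet list and the gensim-style model_w2v access are assumptions):

import numpy as np

# size should come from the model rather than be hard-coded, e.g.:
# size = model_w2v.vector_size
size = 200

tokenized_tweets = [["good", "morning"], ["hello", "world"]]  # assumed input

# word_vector(...) returns shape (1, size); collect rows into an (n, size) array.
wordvec_array = np.zeros((len(tokenized_tweets), size))
for i, tokens in enumerate(tokenized_tweets):
    wordvec_array[i, :] = word_vector(tokens, size)

print(wordvec_array.shape)  # (2, 200)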

Flesch-Kincaid readability test in Python

£可爱£侵袭症+ submitted on 2019-12-11 06:00:01
Question: I need help with this problem I'm having. I need to write a function that returns a FRES (Flesch reading-ease score) for a text, given the formula:

FRES = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)

In other words, my task is to turn this formula into a Python function. This is the code from the previous question I had:

import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
import re

VC = re.compile('[aeiou]+[^aeiou]+')
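A minimal sketch of the function, using NLTK's punkt tokenizers and the VC regex above as a rough syllable counter (the helper names and the sample sentence are assumptions):

def syllables(word):
    # Approximate: count vowel-consonant groups; at least one syllable per word.
    return max(1, len(VC.findall(word.lower())))

def fres(text):
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    total_syllables = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (total_syllables / len(words)))

print(fres("The cat sat on the mat. It was a sunny day."))  # about 116.7: very easy text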

C tokenize polynomial coefficients

我与影子孤独终老i submitted on 2019-12-11 05:47:23
Question: I'm trying to put the coefficients of polynomials from a char array into an int array. I have this:

char string[] = "-4x^0 + x^1 + 4x^3 - 3x^4";

and can tokenize it by the space into:

-4x^0 x^1 4x^3 3x^4

So I am trying to get -4, 1, 4, 3 into an int array:

int *coefficient;
coefficient = new int[counter];
p = strtok(copy, " +");
int a;
while (p) {
    int z = 0;
    while (p[z] != 'x')
        z++;
    char temp[z];
    strncpy(temp[z], p, z);
    coefficient[a] = atoi(temp);
    p = strtok(NULL, " +");
    a++;
}

However, I'm
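A minimal corrected sketch in C (the snippet above passes temp[z], a single char, where strncpy needs a char*, never terminates temp, and never initializes a; this rewrite is a suggestion, not the accepted answer):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char copy[] = "-4x^0 + x^1 + 4x^3 - 3x^4";
    int coefficient[16];                  /* assumed fixed capacity */
    int a = 0;                            /* must start at 0 */

    char *p = strtok(copy, " +");
    while (p != NULL && a < 16) {
        char *x = strchr(p, 'x');         /* bounds-safe, unlike the raw z loop */
        if (x != NULL) {                  /* skips stray tokens such as a lone "-" */
            size_t z = (size_t)(x - p);   /* length of the part before 'x' */
            char temp[32];
            strncpy(temp, p, z);          /* copy into temp itself, not temp[z] */
            temp[z] = '\0';               /* atoi needs a terminated string */
            coefficient[a++] = (z == 0) ? 1 : atoi(temp);  /* "x^1" -> 1 */
        }
        p = strtok(NULL, " +");
    }

    for (int i = 0; i < a; i++)
        printf("%d ", coefficient[i]);    /* prints: -4 1 4 3 */
    printf("\n");
    return 0;
}

Note that " - 3x^4" splits so the minus becomes its own token and is dropped, matching the 3 (not -3) the asker lists; preserving the sign would require handling '-' outside of strtok.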

How to tokenize, scan or split this string of email addresses

随声附和 submitted on 2019-12-11 04:42:44
Question: For Simple Java Mail I'm trying to deal with a somewhat free format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid. Here is an example of a valid input:

"name@domain.com,Sixpack, Joe 1 <name@domain.com>, Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;"

So there are two basic
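A minimal sketch that pulls out just the bare addresses with a regex, assuming (as the question states) the input is already valid; it deliberately ignores the personal names that the <...> form carries:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressExtractor {
    public static void main(String[] args) {
        String input = "name@domain.com,Sixpack, Joe 1 <name@domain.com>, "
                + "Sixpack, Joe 2 <name@domain.com> ;Sixpack, Joe, 3<name@domain.com> , "
                + "nameFoo@domain.com,nameBar@domain.com;nameBaz@domain.com;";

        // Anything of the form local@domain, stopping at whitespace and delimiters.
        Pattern address = Pattern.compile("[^\\s<>,;]+@[^\\s<>,;]+");
        List<String> found = new ArrayList<>();
        Matcher m = address.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        System.out.println(found);  // six name*@domain.com entries
    }
}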

Drop a table originally created with 'unknown tokenizer'?

馋奶兔 submitted on 2019-12-11 04:37:59
Question: I have a SQLite3 database. A single table inside this DB can't be dropped; the error message says unknown tokenizer: mm. I tried it directly with the command DROP TABLE tablename; inside the newest SQLiteSpy v1.9.11 and also from .NET code with the official sqlite NuGet package v1.0.103. How can I drop a table whose tokenizer is unknown?

Answer 1: The documentation says: For each FTS virtual table in a database, three to five real (non-virtual) tables are created to store the underlying
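A sketch of one commonly suggested workaround built on that fact: drop the real shadow tables, then delete the virtual table's schema entry by hand (back up the database first; writable_schema edits bypass all safety checks, and the exact set of shadow tables depends on the FTS version and options):

-- Shadow tables backing an FTS3/FTS4 table named "tablename":
DROP TABLE IF EXISTS tablename_content;
DROP TABLE IF EXISTS tablename_segments;
DROP TABLE IF EXISTS tablename_segdir;
DROP TABLE IF EXISTS tablename_docsize;
DROP TABLE IF EXISTS tablename_stat;

-- The virtual table's own entry is what DROP TABLE cannot process
-- without the missing "mm" tokenizer, so remove it directly:
PRAGMA writable_schema = 1;
DELETE FROM sqlite_master WHERE type = 'table' AND name = 'tablename';
PRAGMA writable_schema = 0;
VACUUM;  -- rebuild the file so the edited schema is fully applied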

How to stop . being treated as a separator in SQLite FTS4

戏子无情 submitted on 2019-12-11 03:36:24
Question: I want to be able to search for numbers like 2.3 using FTS4 in SQLite, but the . is being treated as a token boundary. Short of writing a full bespoke tokenizer, is there any other way of excluding the . from the list of token-boundary characters? Being able to search for decimal numbers seems like a common use case, but I can't find anything relevant on SO / Google. My best solution at present is to replace all . chars in the text with a known (long) string of letters and substitute
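One possibility worth checking (a sketch, not necessarily this question's accepted answer): FTS4's unicode61 tokenizer, available in SQLite 3.7.13+, takes a tokenchars option that promotes characters to token characters:

-- Treat '.' as part of a token instead of as a separator.
CREATE VIRTUAL TABLE docs USING fts4(
    content,
    tokenize=unicode61 "tokenchars=."
);

INSERT INTO docs(content) VALUES ('version 2.3 released');

-- "2.3" is now indexed as a single token and matches directly.
SELECT * FROM docs WHERE docs MATCH '2.3';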

String tokenizer in C

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 03:13:06
Question: The following code will break down the string command using a space, i.e. " ", or a full stop, i.e. ".". What if I want to break down command on the occurrence of both the space and the full stop together (as one sequence), and not each by themselves? E.g. a command like 'hello .how are you' would be broken into the pieces (ignoring the quotes) [hello] [how are you]

char *token2 = strtok(command, " .");

Answer 1: You can do it pretty easily with strstr:

char *strstrtok(char *str, char *delim) {
    static char
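The answer's code is cut off above; a complete sketch of the same strstr idea (my reconstruction, not the original answer's exact code) could look like this:

#include <stdio.h>
#include <string.h>

/* Like strtok, but splits on the whole multi-character sequence delim
   rather than on each delimiter character individually. */
char *strstrtok(char *str, const char *delim) {
    static char *next;              /* where the previous call left off */
    if (str != NULL)
        next = str;
    if (next == NULL)
        return NULL;
    char *start = next;
    char *hit = strstr(next, delim);
    if (hit != NULL) {
        *hit = '\0';                /* terminate the current token */
        next = hit + strlen(delim); /* resume after the full delimiter */
    } else {
        next = NULL;                /* last token reached */
    }
    return start;
}

int main(void) {
    char command[] = "hello .how are you";
    for (char *t = strstrtok(command, " ."); t != NULL; t = strstrtok(NULL, " ."))
        printf("[%s]\n", t);        /* prints [hello] then [how are you] */
    return 0;
}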

Boost split not traversing inside of parentheses or braces

拟墨画扇 submitted on 2019-12-11 03:05:39
Question: I try to split the following text:

std::string text = "1,2,3,max(4,5,6,7),array[8,9],10,page{11,12},13";

I have the following code:

std::vector<std::string> found_list;
boost::split(found_list, text, boost::is_any_of(","));

But my desired output is:

1
2
3
max(4,5,6,7)
array[8,9]
10
page{11,12}
13

Regarding parentheses and braces, how do I implement this?

Answer 1: You want to parse a grammar. Since you tagged with boost, let me show you using Boost Spirit: Live On Coliru

#include <boost/spirit/include/qi
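The Spirit answer is cut off above; as a simpler alternative sketch (a hand-rolled depth counter, plainly not the answer's Spirit grammar), splitting only on commas at nesting depth zero gives the desired output:

#include <iostream>
#include <string>
#include <vector>

// Split on commas that are not inside (), [] or {}.
std::vector<std::string> split_top_level(const std::string& text) {
    std::vector<std::string> out;
    std::string current;
    int depth = 0;
    for (char c : text) {
        if (c == '(' || c == '[' || c == '{') ++depth;
        else if (c == ')' || c == ']' || c == '}') --depth;
        if (c == ',' && depth == 0) {
            out.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    out.push_back(current);  // final token after the last top-level comma
    return out;
}

int main() {
    std::string text = "1,2,3,max(4,5,6,7),array[8,9],10,page{11,12},13";
    for (const auto& tok : split_top_level(text))
        std::cout << tok << '\n';
}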