tokenize

Is SQLite on Android built with the ICU tokenizer enabled for FTS?

吃可爱长大的小学妹 submitted on 2019-12-01 04:01:08
Like the title says: can we use ...USING fts3(tokenizer icu th_TH, ...)? If we can, does anyone know which locales are supported, and whether it varies by platform version? No, only tokenizer=porter. When I specify tokenizer=icu, I get "android.database.sqlite.SQLiteException: unknown tokenizer: icu". Also, this link hints that if Android didn't compile it in by default, it will not be available: http://sqlite.phxsoftware.com/forums/t/2349.aspx Gordon Liang For API Level 21 or up, I tested and found that the ICU tokenizer is already available. However, to support 90%+ of devices, some work-around can be
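
A minimal sketch of the kind of runtime fallback the excerpt alludes to, assuming an open SQLiteDatabase named db; the table and column names (notes_fts, body) are hypothetical:

import android.database.sqlite.SQLiteDatabase;
import android.database.sqlite.SQLiteException;

public class FtsTableHelper {
    // Try the ICU tokenizer first (present on many API 21+ devices) and fall back
    // to the default tokenizer when SQLite was built without ICU support.
    static void createFtsTable(SQLiteDatabase db) {
        try {
            db.execSQL("CREATE VIRTUAL TABLE notes_fts USING fts3(body, tokenize=icu th_TH)");
        } catch (SQLiteException e) {
            // "unknown tokenizer: icu" -> ICU is not compiled into this device's SQLite
            db.execSQL("CREATE VIRTUAL TABLE notes_fts USING fts3(body)");
        }
    }
}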

Elasticsearch “pattern_replace”, replacing whitespaces while analyzing

岁酱吖の submitted on 2019-12-01 00:11:58
Basically I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram on top of that later on.) These are my index settings: "settings": { "index": { "analysis": { "filter": { "whitespace_remove": { "type": "pattern_replace", "pattern": " ", "replacement": "" } }, "analyzer": { "meliuz_analyzer": { "filter": [ "lowercase", "whitespace_remove" ], "type": "custom", "tokenizer": "standard" } } } } Instead of "pattern": " ", I tried "pattern": "\\u0020" and \\s, too. But when I analyze the text "beleza na web", it still creates three separate tokens: "beleza

NLTK regexp tokenizer not playing nice with decimal point in regex

泄露秘密 submitted on 2019-11-30 23:08:56
I'm trying to write a text normalizer, and one of the basic cases that needs to be handled is turning something like 3.14 into three point one four or three point fourteen. I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, at the moment, something like $23.50 is handled perfectly (it parses to ['$23.50']), but 3.14 is parsed as ['3', '14'] - the decimal point is being dropped. I've tried adding a separate pattern \d+.\d+ to my regexp, but that didn't help (and shouldn't my current

Java Lucene NGramTokenizer

◇◆丶佛笑我妖孽 submitted on 2019-11-30 17:39:59
I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact I only see two methods in the NGramTokenizer class that return String objects. Here is the code that I have: Reader reader = new StringReader("This is a test string"); NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3); Where are the ngrams that were tokenized? How can I get the output as Strings/Words? I want my output to be like: This, is, a, test, string, This is, is a, a test, test string, This
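
The tokens come out of the TokenStream API rather than a getter: you attach a CharTermAttribute and pull tokens with incrementToken(). Below is a sketch against the older Lucene 3.x-style constructor used in the excerpt; note that NGramTokenizer emits character n-grams, so word n-grams ("This is", "is a", ...) would normally come from a ShingleFilter instead.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramDemo {
    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("This is a test string");
        NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

        // The term text of each token is exposed through this attribute.
        CharTermAttribute term = gramTokenizer.addAttribute(CharTermAttribute.class);

        gramTokenizer.reset();                    // consumers call reset() before incrementToken()
        while (gramTokenizer.incrementToken()) {
            System.out.println(term.toString());  // character n-grams: "T", "Th", "Thi", "h", ...
        }
        gramTokenizer.end();
        gramTokenizer.close();
    }
}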

Tokenizing an infix string in Java

醉酒当歌 submitted on 2019-11-30 17:20:23
Question: I'm implementing the Shunting Yard Algorithm in Java, as a side project to my AP Computer Science class. I've implemented a simple one in JavaScript, with only basic arithmetic expressions (addition, subtraction, multiplication, division, exponentiation). To split that into an array, what I did was find each of the operators ( +-*/^ ), as well as numbers and parentheses, put a space around them, and then split it into an array. For example, the infix string 4+(3+2) would be made into
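
A minimal sketch of that space-and-split approach in Java; it handles the operators listed in the excerpt, but not unary minus.

import java.util.Arrays;

public class InfixTokenizer {
    // Pad every operator and parenthesis with spaces, then split on whitespace.
    static String[] tokenize(String expr) {
        String spaced = expr.replaceAll("([+\\-*/^()])", " $1 ");
        return spaced.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("4+(3+2)")));
        // prints: [4, +, (, 3, +, 2, )]
    }
}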

Can not use ICUTokenizerFactory in Solr

自闭症网瘾萝莉.ら submitted on 2019-11-30 15:42:50
I am trying to use ICUTokenizerFactory in a Solr schema. This is how I have defined the field and fieldType: <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory"/> </analyzer> </fieldType> <field name="fld_icu" type="text_icu" indexed="true" stored="true"/> And when I start Solr, I get this error: Plugin init failure for [schema.xml] fieldType "text_icu": Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.ICUTokenizerFactory' I have searched for that with no success. I don't know if

Using boost::tokenizer with string delimiters

心不动则不痛 submitted on 2019-11-30 13:56:19
Question: I've been looking at boost::tokenizer, and I've found that the documentation is very thin. Is it possible to make it tokenize a string such as "dolphin--monkey--baboon" so that every word is a token, as well as every double dash? From the examples I've only seen single-character delimiters being allowed. Is the library not advanced enough for more complicated delimiters? Answer 1: It looks like you will need to write your own TokenizerFunction to do what you want. Answer 2: using iter_split allows

Why is n+++n valid while n++++n is not?

淺唱寂寞╮ submitted on 2019-11-30 11:41:52
In Java, the expression n+++n appears to evaluate as equivalent to n++ + n, despite the fact that +n is a valid unary operator with higher precedence than the arithmetic + operator in n + n. So the compiler appears to be assuming that the operator cannot be the unary operator and resolving the expression. However, the expression n++++n does not compile, even though there is a single valid possibility for it to be resolved: n++ + +n. ++n and +n are specified as having the same precedence, so why does the compiler resolve the seeming ambiguity in n+++n in favour of the arithmetic + but does
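
A small sketch illustrating the behaviour being asked about: the lexer takes the longest possible token at each step ("maximal munch"), so n++++n is tokenized as n ++ ++ n, which cannot be parsed, while explicit spacing restores the intended reading.

public class MaximalMunchDemo {
    public static void main(String[] args) {
        int n = 5;

        int a = n+++n;      // tokenized as n ++ + n : 5 + 6 = 11, and n is now 6
        // int b = n++++n;  // does not compile: tokenized as n ++ ++ n,
                            // and the result of n++ cannot be incremented again
        int c = n++ + +n;   // explicit spaces force the intended grouping: 6 + 7 = 13

        System.out.println(a + " " + c);    // prints: 11 13
    }
}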

Java StringTokenizer.nextToken() skips over empty fields

戏子无情 submitted on 2019-11-30 11:25:19
I am using a tab (\t) as the delimiter and I know there are some empty fields in my data, e.g.: one->two->->three where -> equals the tab. As you can see, an empty field is still correctly surrounded by tabs. Data is collected using a loop: while ((strLine = br.readLine()) != null) { StringTokenizer st = new StringTokenizer(strLine, "\t"); String test = st.nextToken(); ... } Yet Java ignores this "empty string" and skips the field. Is there a way to circumvent this behaviour and force Java to read in empty fields anyway? There is an RFE in Sun's bug database about this StringTokenizer issue with
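
A minimal sketch of the usual work-around: String.split keeps interior empty fields, and a negative limit also keeps trailing ones (the sample line mirrors the one->two->->three example above).

import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String strLine = "one\ttwo\t\tthree";   // "one->two->->three" with real tabs

        // StringTokenizer would silently skip the empty field between "two" and "three";
        // String.split preserves it, and limit -1 also preserves trailing empty fields.
        String[] fields = strLine.split("\t", -1);

        System.out.println(Arrays.toString(fields));   // [one, two, , three]
    }
}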