tokenize

Tokenizer vs token filters

£可爱£侵袭症+ submitted on 2019-11-29 21:49:45
I'm trying to implement autocomplete using Elasticsearch, thinking that I understand how to do it... I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing crawled data. What is the difference between a tokenizer and a token_filter? I've read the docs on these but still don't fully understand them. For instance, is a token_filter what ES uses to search against user input? Is a tokenizer what ES uses to make tokens? What is a token? Is it possible for ES to create multi-word suggestions using any of these things? A tokenizer will split the whole
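
As a rough sketch of how the two pieces fit together (the analyzer name "autocomplete", tokenizer name "edge_tok", index name and gram settings below are made up for illustration, not taken from the question): the tokenizer is what cuts incoming text into tokens, for example an edge_ngram tokenizer for autocomplete, and token filters such as lowercase then transform each token the tokenizer emits before it is indexed.

autocomplete_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tok": {                 # the tokenizer: turns raw text into tokens
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "edge_tok",
                    "filter": ["lowercase"],  # token filters: transform the tokens the tokenizer emits
                }
            },
        }
    }
}
# e.g. es.indices.create(index="suggestions", body=autocomplete_settings)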

Why is n+++n valid while n++++n is not?

丶灬走出姿态 submitted on 2019-11-29 17:38:48
Question: In Java, the expression n+++n appears to evaluate as equivalent to n++ + n, despite the fact that +n is a valid unary operator with higher precedence than the arithmetic + operator in n + n. So the compiler appears to be assuming that the operator cannot be the unary operator and resolving the expression. However, the expression n++++n does not compile, even though there is a single valid possibility for it to be resolved as n++ + +n. ++n and +n are specified as having the same precedence,
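
The mechanism behind this is the lexer's "maximal munch" rule: it always grabs the longest operator it can before the parser ever sees the expression. A small Python sketch of that rule (the two-entry operator table is illustrative, not Java's real lexer):

OPERATORS = ["++", "+"]          # longest operator is tried first (maximal munch)

def lex(expr):
    tokens, i = [], 0
    while i < len(expr):
        if expr[i].isalnum():                    # a bare identifier character such as n
            tokens.append(expr[i])
            i += 1
            continue
        for op in OPERATORS:
            if expr.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character {expr[i]!r}")
    return tokens

print(lex("n+++n"))    # ['n', '++', '+', 'n']  -> parses as (n++) + n
print(lex("n++++n"))   # ['n', '++', '++', 'n'] -> n++ ++ n fits no grammar rule, so it won't compile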

Java StringTokenizer.nextToken() skips over empty fields

↘锁芯ラ submitted on 2019-11-29 17:32:37
Question: I am using a tab (\t) as the delimiter and I know there are some empty fields in my data, e.g.: one->two->->three, where -> represents the tab. As you can see, an empty field is still correctly surrounded by tabs. Data is collected using a loop: while ((strLine = br.readLine()) != null) { StringTokenizer st = new StringTokenizer(strLine, "\t"); String test = st.nextToken(); ... } Yet Java ignores this "empty string" and skips the field. Is there a way to circumvent this behaviour and force Java to read
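
The behaviour in question is that StringTokenizer treats consecutive delimiters as one and never returns an empty token, whereas Java's String.split (with a negative limit, e.g. split("\t", -1), to also keep trailing empties) preserves them. A quick Python sketch of the two behaviours, using the row from the question:

import re

line = "one\ttwo\t\tthree"            # the example row: the third field is empty

# StringTokenizer-style behaviour: runs of delimiters collapse, the empty field vanishes
print(re.split(r"\t+", line))          # ['one', 'two', 'three']

# What the question wants (what String.split("\t", -1) gives in Java): empties preserved
print(line.split("\t"))                # ['one', 'two', '', 'three']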

How to prevent splitting specific words or phrases and numbers in NLTK?

99封情书 submitted on 2019-11-29 16:15:06
I have a problem in text matching when I tokenize text in a way that splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK? They should not result in: ['runs','in','my','family','4x','a','day'] For example: Yes 20-30 minutes a day on my bike, it works great!! gives: ['yes','20-30','minutes','a','day','on','my','bike',',','it','works','great'] I want '20-30 minutes' to be treated as a single word. How can I get this behavior? You will be hard-pressed to preserve n-grams of
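
One way to keep such phrases intact (a sketch, with the phrase list simply copied from the examples in the question) is NLTK's MWETokenizer, which re-joins multi-word expressions after the ordinary word tokenization:

from nltk.tokenize import MWETokenizer, word_tokenize   # assumes nltk and its 'punkt' model are installed

# Multi-word expressions that should survive as single tokens (illustrative list)
mwe = MWETokenizer([("20-30", "minutes"), ("run", "in", "my", "family"),
                    ("30", "minute", "walk"), ("4x", "a", "day")], separator=" ")

text = "Yes 20-30 minutes a day on my bike, it works great!!"
print(mwe.tokenize(word_tokenize(text.lower())))
# ['yes', '20-30 minutes', 'a', 'day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!', '!']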

Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr?

跟風遠走 submitted on 2019-11-29 13:29:43
I am new to Solr. I want to know when to use StandardTokenizerFactory and when to use KeywordTokenizerFactory. I read the docs on the Apache Wiki, but I am not getting it. Can anybody explain the difference between StandardTokenizerFactory and KeywordTokenizerFactory? Jayendra: StandardTokenizerFactory: It tokenizes on whitespace, as well as stripping characters. Documentation: Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token. In that case, the whole token
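
As a loose Python approximation of the contrast (not the actual Lucene implementations): StandardTokenizerFactory breaks the field value into many word tokens, which suits full-text search fields, while KeywordTokenizerFactory emits the entire field value as a single token, which is why it is typically used for fields matched or sorted on as a whole, such as IDs or facet values.

import re

def standard_like(value):
    # Very rough stand-in for StandardTokenizerFactory: split into word-ish tokens
    return re.findall(r"[\w.]+", value)

def keyword_like(value):
    # KeywordTokenizerFactory: the entire field value becomes exactly one token
    return [value]

field = "Wi-Fi setup v2.1 ready"
print(standard_like(field))   # ['Wi', 'Fi', 'setup', 'v2.1', 'ready']
print(keyword_like(field))    # ['Wi-Fi setup v2.1 ready']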

How do I tokenize this string in Ruby?

混江龙づ霸主 submitted on 2019-11-29 12:21:19
Question: I have this string: %{Children^10 Health "sanitation management"^5} And I want to tokenize it into an array of hashes: [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}] I'm aware of StringScanner and the Syntax gem but I can't find enough code examples for either. Any pointers? Answer 1: For a real language, a lexer's the way to go, like Guss said. But if the full language is only as complicated as your example
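
For a grammar this small, a single regular expression over the string is usually enough. The answer points at StringScanner or a lexer in Ruby; the sketch below shows the same regex idea in Python purely to illustrate the shape of the tokenizer, with the pattern invented for this example:

import re

query = 'Children^10 Health "sanitation management"^5'   # the contents of the %{...} literal
pattern = re.compile(r'(?:"([^"]+)"|(\S+?))(?:\^(\d+))?(?=\s|$)')

tokens = [{"keywords": (quoted or bare).lower(),
           "boost": int(boost) if boost else None}
          for quoted, bare, boost in pattern.findall(query)]
print(tokens)
# [{'keywords': 'children', 'boost': 10}, {'keywords': 'health', 'boost': None},
#  {'keywords': 'sanitation management', 'boost': 5}]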

Why does SSIS TOKEN function fail to count adjacent column delimiters?

醉酒当歌 submitted on 2019-11-29 11:32:14
I ran into a problem with SQL Server Integration Services 2012's new string function in the Expression Editor called TOKEN(). This is supposed to help you parse a delimited record. If the record comes out of a flat file, you can do this with the Flat File Source. In this case, I am dealing with old delimited import records that were stored as strings in a database VARCHAR field. Now they need to be extracted, massaged, and re-exported as delimited strings. For example: 1^Apple^0001^01/01/2010^Anteater^A1 2^Banana^0002^03/15/2010^Bear^B2 3^Cranberry^0003^4/15/2010^Crow^C3 If these strings are
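
The behaviour the title describes is that TOKEN()/TOKENCOUNT() treat a run of adjacent delimiters as a single delimiter, so an empty column shifts every later field over. A Python sketch of the effect (the row with an empty third column is made up for illustration, and token_like only mimics the described behaviour, not the real SSIS function):

record = "1^Apple^^01/01/2010^Anteater^A1"     # hypothetical row with an empty third column

def token_like(value, delimiter, n):
    # Mimics the described TOKEN() behaviour: contiguous delimiters count as one (1-based index)
    parts = [p for p in value.split(delimiter) if p != ""]
    return parts[n - 1] if n <= len(parts) else ""

def positional(value, delimiter, n):
    # Field extraction that respects empty columns
    parts = value.split(delimiter)
    return parts[n - 1] if n <= len(parts) else ""

print(token_like(record, "^", 4))    # 'Anteater'   - everything after the empty column shifts
print(positional(record, "^", 4))    # '01/01/2010' - the date stays in column 4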

Does PL/SQL have an equivalent StringTokenizer to Java's?

冷暖自知 submitted on 2019-11-29 11:24:34
I use java.util.StringTokenizer for simple parsing of delimited strings in Java. I need the same type of mechanism in PL/SQL. I could write it, but if it already exists, I would prefer to use that. Does anyone know of a PL/SQL implementation? Some useful alternative? PL/SQL does include a basic one for comma-separated lists (DBMS_UTILITY.COMMA_TO_TABLE). Example: DECLARE lv_tab_length BINARY_INTEGER; lt_array DBMS_UTILITY.lname_array; BEGIN DBMS_UTILITY.COMMA_TO_TABLE( list => 'one,two,three,four', tablen => lv_tab_length, tab => lt_array ); DBMS_OUTPUT.PUT_LINE( 'lv_tab_length = ['|

Tokenizing unicode using nltk

冷暖自知 submitted on 2019-11-29 11:20:00
Question: I have text files that use UTF-8 encoding and contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. If I use the standard NLTK tokenizer: f = open('C:\Python26\text.txt', 'r') # text = 'müsli pöök rääk' text = f.read() f.close items = text.decode('utf8') a = nltk.word_tokenize(items) Output: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k'] The Punkt tokenizer seems to do
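
A hedged sketch of how the same file would typically be read and tokenized today, written for Python 3 (the snippet above is Python 2): open the file with an explicit encoding so word_tokenize receives real text rather than bytes, using 'utf-8-sig' to also strip the BOM that shows up as u'\ufeff' in the output. The path is the one from the question; whether this resolves the exact splitting seen above depends on the NLTK version.

import nltk   # assumes nltk and its 'punkt' model are installed

with open(r"C:\Python26\text.txt", encoding="utf-8-sig") as f:   # strips the BOM
    text = f.read()

print(nltk.word_tokenize(text))   # expected something like ['müsli', 'pöök', 'rääk']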

Pythonic way to implement a tokenizer

僤鯓⒐⒋嵵緔 submitted on 2019-11-29 07:37:21
Question: I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice. I've implemented a tokenizer before in C and in Java, so I'm fine with the theory; I'd just like to ensure I'm following Pythonic style and best practices. Listing Token Types: In Java, for example, I would have a list of fields like so: public static final int TOKEN_INTEGER = 0 But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this
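
One common Pythonic shape for this (a sketch along the lines of the regex-based scanner shown in the Python re module docs, not the answer from the thread): an Enum instead of integer constants for the token types, one master regex built from named groups, and a generator yielding tokens.

import re
from enum import Enum
from typing import Iterator, NamedTuple

class TokenType(Enum):               # replaces public static final int TOKEN_INTEGER = 0
    INTEGER = r"\d+"
    IDENT   = r"[A-Za-z_]\w*"
    OP      = r"[+\-*/=()]"
    SKIP    = r"\s+"

class Token(NamedTuple):
    type: TokenType
    value: str

MASTER_RE = re.compile("|".join(f"(?P<{t.name}>{t.value})" for t in TokenType))

def tokenize(text: str) -> Iterator[Token]:
    for match in MASTER_RE.finditer(text):
        kind = TokenType[match.lastgroup]
        if kind is not TokenType.SKIP:           # whitespace is matched but not emitted
            yield Token(kind, match.group())

print(list(tokenize("x = 42 + y")))
# [Token(IDENT, 'x'), Token(OP, '='), Token(INTEGER, '42'), Token(OP, '+'), Token(IDENT, 'y')]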