tokenize

Tokenizer vs token filters

时光毁灭记忆、已成空白 submitted on 2019-11-28 17:39:33
Question: I'm trying to implement autocomplete using Elasticsearch, thinking that I understand how to do it... I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing crawled data. What is the difference between a tokenizer and a token_filter? I've read the docs on these but still need a better understanding of them. For instance, is a token_filter what ES uses to search against user input? Is a tokenizer what ES uses to make tokens? What is a token? Is it possible
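
To make the distinction concrete, here is a minimal plain-Java sketch of the concepts (not Elasticsearch code): the "tokenizer" turns raw text into tokens, and the "token filter" transforms tokens that already exist, here an edge n-gram expansion like the one typically used for autocomplete. The helper names and the min/max gram lengths are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: a tokenizer creates tokens, a token filter transforms them.
public class AnalyzerSketch {

    // "Tokenizer": split raw text into tokens (here simply on whitespace).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Token filter": expand each existing token into edge n-grams (prefixes),
    // roughly what an edge_ngram filter produces for autocomplete matching.
    static List<String> edgeNGramFilter(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            for (int len = minGram; len <= Math.min(maxGram, t.length()); len++) {
                out.add(t.substring(0, len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Quick Brown Fox");
        System.out.println(tokens);                        // [quick, brown, fox]
        System.out.println(edgeNGramFilter(tokens, 1, 5)); // [q, qu, qui, quic, quick, b, br, ...]
    }
}
```

In Elasticsearch terms, a token is one unit in the stream the analyzer produces: the tokenizer runs first and creates that stream, and token filters (lowercase, stop, edge_ngram, ...) then modify it; the same machinery is applied to indexed text and, depending on the analyzer configured for the field, to the user's query text as well.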

Tokenizer, Stop Word Removal, Stemming in Java

时间秒杀一切 submitted on 2019-11-28 17:05:20
I am looking for a class or method that takes a long string of many hundreds of words and tokenizes it, removes the stop words, and stems the rest for use in an IR system. For example: "The big fat cat, said 'your funniest guy i know' to the kangaroo..." The tokenizer would remove the punctuation and return an ArrayList of words; the stop word remover would remove words like "the", "to", etc.; the stemmer would reduce each word to its 'root', for example 'funniest' would become 'funny'. Many thanks in advance. jitter: AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop
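
As a sketch of the Lucene route hinted at above (assuming a reasonably recent Lucene on the classpath; exact package names shift between versions), EnglishAnalyzer chains tokenization, stop-word removal, and stemming in a single pass:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TermExtractor {

    // Tokenize, drop stop words, and stem in one pass with Lucene's EnglishAnalyzer.
    public static List<String> extractTerms(String text) throws Exception {
        List<String> terms = new ArrayList<>();
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        String input = "The big fat cat, said 'your funniest guy i know' to the kangaroo...";
        // Punctuation is dropped, stop words such as "the" and "to" are removed,
        // and the remaining words are reduced to their stems.
        System.out.println(extractTerms(input));
    }
}
```

If you need finer control (a custom stop list, a different stemmer), the same read loop works with an explicit chain such as StandardTokenizer + StopFilter + PorterStemFilter instead of the bundled analyzer.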

How does a parser (for example, HTML) work?

倾然丶 夕夏残阳落幕 submitted on 2019-11-28 16:21:55
For argument's sake let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read every character one by one, building up a multidimensional array to store the structure? For example, does it read a < and then begin to capture the element, and then once it meets a closing > (outside of an attribute) it is pushed onto an array stack somewhere? I'm interested for the sake of knowing (I'm curious). If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed? First
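
As a rough illustration of the tokenizing step (a toy sketch, not how production HTML parsers work; real ones implement the HTML5 tokenizer state machine and handle many more cases), the lexer walks the characters once and emits a flat list of tag and text tokens; a separate parse step then builds the tree, typically with a stack of open elements rather than a multidimensional array:

```java
import java.util.ArrayList;
import java.util.List;

// Toy HTML lexer: turns "<p>Hi <b>there</b></p>" into a flat token list.
// Assumes well-formed input and ignores attributes containing '>'.
public class ToyHtmlLexer {

    static final class Token {
        final String type;   // "tag" or "text"
        final String value;
        Token(String type, String value) { this.type = type; this.value = value; }
        @Override public String toString() { return type + ":" + value; }
    }

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);          // naive: no quoted-attribute handling
                tokens.add(new Token("tag", html.substring(i + 1, close)));
                i = close + 1;
            } else {
                int next = html.indexOf('<', i);
                if (next == -1) next = html.length();
                tokens.add(new Token("text", html.substring(i, next)));
                i = next;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [tag:p, text:Hi , tag:b, text:there, tag:/b, tag:/p]
        System.out.println(tokenize("<p>Hi <b>there</b></p>"));
    }
}
```

The parse step would then walk this token list, pushing an element when it sees an opening tag and popping when it sees the matching closing tag; that stack of open elements is what produces the nested tree.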

Parsing pipe delimited string into columns?

℡╲_俬逩灬. submitted on 2019-11-28 12:26:38
I have a column with pipe-separated values such as: '23|12.1| 450|30|9|78|82.5|92.1|120|185|52|11' I want to parse this column to fill a table with 12 corresponding columns: month1, month2, month3...month12. So month1 will have the value 23, month2 the value 12.1, etc... Is there a way to parse it with a loop or by delimiter instead of having to separate one value at a time using substr? Thanks. You can use regexp_substr (10g+): SQL> SELECT regexp_substr('23|12.1| 450|30|9|', '[^|]+', 1, 1) c1, regexp_substr('23|12.1| 450|30|9|', '[^|]+', 1, 2) c2, regexp_substr('23|12.1| 450|30|9|', '[^|]+', 1,

Document-term matrix in R - bigram tokenizer not working

混江龙づ霸主 submitted on 2019-11-28 08:21:30
Question: I am trying to make two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently just identical to the unigram matrix, and I'm not sure why. The code: docs<-Corpus(DirSource("data", recursive=TRUE)) # Get the document term matrices BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)) dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize="words", removePunctuation = TRUE, stopwords = stopwords(
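
For intuition about what the bigram matrix should contain, here is a plain-Java illustration of n-gram tokenization (an illustration of the concept, not a fix for the tm/RWeka code): a bigram tokenizer emits every pair of adjacent words, so its term list can only match the unigram term list if the custom tokenizer is never actually invoked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Unigram vs. bigram tokenization of the same text.
public class NGramDemo {

    static List<String> ngrams(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox";
        System.out.println(ngrams(doc, 1)); // [the, quick, brown, fox]
        System.out.println(ngrams(doc, 2)); // [the quick, quick brown, brown fox]
    }
}
```

If the two matrices come out identical, the usual suspicion is that the custom tokenizer is not being called at all, for example because of how it is passed in the control list or because the corpus class in use ignores custom tokenizers; checking that BigramTokenizer applied to a small test string returns bigrams on its own is a quick way to narrow it down.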

What are all the Japanese whitespace characters?

帅比萌擦擦* submitted on 2019-11-28 07:13:29
Question: I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character sets are supersets of US-ASCII.) So the set of characters I need to use to split my string includes normal ASCII space and tab. But, in Japanese, there is another space character, commonly called a 'full-width space'. According to my Mac's
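
For illustration in Java (the question doesn't name a language, so this is just one way to do it): the 'full-width space' is U+3000 IDEOGRAPHIC SPACE, and you can either list it explicitly next to \s or switch the regex engine to Unicode-aware character classes, which then treat U+3000 as whitespace too.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class JapaneseSplit {

    public static void main(String[] args) {
        // Mixed ASCII space, tab, and U+3000 (ideographic / full-width space).
        String text = "hello\tworld\u3000こんにちは 世界";

        // Option 1: add the full-width space to the character class explicitly.
        String[] a = text.split("[\\s\u3000]+");

        // Option 2: make \s Unicode-aware so it covers U+3000 and other
        // Unicode space separators as well.
        Pattern unicodeWs = Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);
        String[] b = unicodeWs.split(text);

        System.out.println(Arrays.toString(a)); // [hello, world, こんにちは, 世界]
        System.out.println(Arrays.toString(b)); // same tokens
    }
}
```

Note that spaces only get you so far with Japanese: running text is usually not space-delimited at all, so real word segmentation needs a morphological analyzer (MeCab, Kuromoji, etc.); the split above only handles input where words are already separated by some kind of space.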

Difference between StandardTokenizerFactory and KeywordTokenizerFactory in Solr?

十年热恋 submitted on 2019-11-28 07:02:56
Question: I am new to Solr. I want to know when to use StandardTokenizerFactory and when to use KeywordTokenizerFactory. I read the docs on the Apache Wiki, but I am not getting it. Can anybody explain the difference between StandardTokenizerFactory and KeywordTokenizerFactory? Answer 1: StandardTokenizerFactory: It tokenizes on whitespace and also strips certain characters. Documentation: Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a
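
A rough plain-Java illustration of the behavioral difference (this mimics the output shape only; it is not Solr's actual implementation): the standard tokenizer breaks a field value into multiple word tokens, while the keyword tokenizer emits the entire value as a single token, which is why KeywordTokenizerFactory is the usual choice for IDs, codes, and other exact-match or sort fields.

```java
import java.util.Arrays;
import java.util.List;

// Rough illustration of the output difference, not Solr's real tokenizers.
public class TokenizerComparison {

    // Roughly what StandardTokenizerFactory does: many word-like tokens.
    static List<String> standardLike(String input) {
        return Arrays.asList(input.toLowerCase().split("[\\s\\p{Punct}]+"));
    }

    // KeywordTokenizerFactory: the entire input becomes exactly one token.
    static List<String> keywordLike(String input) {
        return Arrays.asList(input);
    }

    public static void main(String[] args) {
        String field = "Please email john.doe@foo.com by 03-09";
        System.out.println(standardLike(field)); // [please, email, john, doe, foo, com, by, 03, 09]
        System.out.println(keywordLike(field));  // [Please email john.doe@foo.com by 03-09]
    }
}
```

So a field analyzed with KeywordTokenizerFactory only matches queries against the whole stored value (modulo any token filters applied afterwards), whereas StandardTokenizerFactory lets individual words inside the value match.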

How to handle multi-line comments in a live syntax highlighter?

与世无争的帅哥 submitted on 2019-11-28 05:58:35
Question: I'm writing my own text editor with syntax highlighting in Java, and at the moment it simply parses and highlights the current line every time the user enters a single character. While presumably not the most efficient way, it's good enough and doesn't cause any noticeable performance issues. In pseudo-Java, this would be the core concept of my code: public void textUpdated(String wholeText, int updateOffset, int updateLength) { int lineStart = getFirstLineStart(wholeText, updateOffset); int
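
A common way to extend this to block comments without re-parsing the whole document (a sketch of the general technique, not code from the question; the class and method names are made up) is to remember, for every line, whether it ends inside a block comment. When a line changes, re-highlight from that line using the previous line's end state, and keep going down only while a line's stored end state actually changes:

```java
// Sketch: per-line "ends inside a block comment" flags let the highlighter
// re-scan only the lines whose state is actually affected by an edit.
// String literals and // comments are ignored here for brevity.
public class BlockCommentTracker {

    private boolean[] endsInComment = new boolean[0];   // state after each line

    /** Re-highlight starting at changedLine; stop once the carried state stabilizes. */
    void update(String[] lines, int changedLine) {
        if (endsInComment.length != lines.length) {      // simplified resize on insert/delete
            boolean[] grown = new boolean[lines.length];
            System.arraycopy(endsInComment, 0, grown, 0,
                    Math.min(endsInComment.length, lines.length));
            endsInComment = grown;
        }
        boolean state = changedLine > 0 && endsInComment[changedLine - 1];
        for (int i = changedLine; i < lines.length; i++) {
            // In a real editor, token coloring for line i would happen during this scan.
            boolean newEnd = scanLine(lines[i], state);
            boolean stable = (newEnd == endsInComment[i]);
            endsInComment[i] = newEnd;
            if (stable) break;                           // later lines start in an unchanged state
            state = newEnd;
        }
    }

    /** Returns whether the line ends inside a block comment, given how it started. */
    private boolean scanLine(String line, boolean inComment) {
        int i = 0;
        while (i < line.length()) {
            if (inComment) {
                int close = line.indexOf("*/", i);
                if (close == -1) return true;            // still inside the comment at end of line
                inComment = false;
                i = close + 2;
            } else {
                int open = line.indexOf("/*", i);
                if (open == -1) return false;
                inComment = true;
                i = open + 2;
            }
        }
        return inComment;
    }
}
```

In the textUpdated handler above, that means re-highlighting not just the edited line but every following line until the stored end-of-line comment state stops changing, which for typical edits is only a handful of lines and only degenerates to the rest of the file when an edit opens or closes a comment that spans it.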

Split string with PowerShell and do something with each token

六眼飞鱼酱① submitted on 2019-11-28 05:38:53
I want to split each line of a pipe on spaces, and then print each token on its own line. I realise that I can get this result using: (cat someFileInsteadOfAPipe).split(" ") But I want more flexibility. I want to be able to do just about anything with each token. (I used to use AWK on Unix, and I'm trying to get the same functionality.) I currently have: echo "Once upon a time there were three little pigs" | %{$data = $_.split(" "); Write-Output "$($data[0]) and whatever I want to output with it"} Which, obviously, only prints the first token. Is there a way for me to for-each over the tokens,

Why does SSIS TOKEN function fail to count adjacent column delimiters?

前提是你 submitted on 2019-11-28 05:01:36
Question: I ran into a problem with SQL Server Integration Services 2012's new string function in the Expression Editor called TOKEN(). This is supposed to help you parse a delimited record. If the record comes out of a flat file, you can do this with the Flat File Source. In this case, I am dealing with old delimited import records that were stored as strings in a database VARCHAR field. Now they need to be extracted, massaged, and re-exported as delimited strings. For example: 1^Apple^0001^01/01/2010
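
For comparison outside SSIS (a plain-Java illustration of the underlying issue, not an SSIS fix, and the sample record with empty columns is made up): the question hinges on whether adjacent delimiters count as marking an empty field. Java's String.split keeps those empty fields when given a negative limit, which is the behavior you want when columns are identified by position.

```java
import java.util.Arrays;

public class DelimiterDemo {

    public static void main(String[] args) {
        // The 3rd, 5th, and 6th columns are empty (adjacent/trailing '^' delimiters).
        String record = "1^Apple^^01/01/2010^^";

        String[] dropTrailing = record.split("\\^");      // limit 0: trailing empty fields dropped
        String[] keepAll      = record.split("\\^", -1);  // negative limit: every field kept

        System.out.println(dropTrailing.length);          // 4 -> positional mapping breaks
        System.out.println(keepAll.length);               // 6 -> column positions preserved
        System.out.println(Arrays.toString(keepAll));     // [1, Apple, , 01/01/2010, , ]
    }
}
```

The analogous question for the SSIS expression is whether TOKEN() assigns those empty slots a position at all; if it collapses adjacent delimiters into one, every field after the first empty one shifts position, which is the symptom the question is about.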