tokenize

What does “regular” in regex/“regular expression” mean?

Submitted by 独自空忆成欢 on 2019-12-05 02:48:29
What does the "regular" in the phrase "regular expression" mean? I have heard that regexes were regular at one time, but are no more.

The "regular" in regular expression comes from the fact that it matches a regular language. The concept of regular expressions used in formal language theory is quite different from what engines like PCRE call regular expressions. PCRE and other similar engines have features like lookahead, conditionals and recursion, which make them able to match non-regular languages.

It comes from "regular language". This is part of formal language theory; check out the Chomsky hierarchy.
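
As a small illustration (not from the original answers): backreferences are another such non-regular feature, and Python's re module supports them. The classic non-regular language { ww : w is a non-empty string over {a, b} } can be matched with a single backreference:

    import re

    # { w + w } is a textbook example of a non-regular language,
    # yet one backreference matches it.
    doubled = re.compile(r'^([ab]+)\1$')

    print(bool(doubled.match('abab')))   # True  ('ab' followed by 'ab')
    print(bool(doubled.match('aba')))    # False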

bash parse filename

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-05 02:27:57
Question: Is there any way in bash to parse this filename: $file = dos1-20120514104538.csv.3310686 into variables like $date = 2012-05-14 10:45:38 and $id = 3310686? Thank you.

Answer 1: All of this can be done with Parameter Expansion. Please read about it in the bash manpage.

    $ file='dos1-20120514104538.csv.3310686'
    $ date="${file#*-}"    # Use Parameter Expansion to strip off the part before '-'
    $ date="${date%%.*}"   # Use PE again to strip everything from the first '.' onward
    $ id="${file##*.}"     # Use PE to get the id as the part after the last '.'
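
For comparison only (this is not part of the original answer), a hypothetical Python sketch of the same parse, including the reformatting into the date layout the question asks for:

    # The filename comes from the question; the variable names are illustrative.
    file = 'dos1-20120514104538.csv.3310686'

    stamp = file.split('-', 1)[1].split('.', 1)[0]   # '20120514104538'
    file_id = file.rsplit('.', 1)[1]                 # '3310686'

    date = (f"{stamp[0:4]}-{stamp[4:6]}-{stamp[6:8]} "
            f"{stamp[8:10]}:{stamp[10:12]}:{stamp[12:14]}")
    print(date, file_id)   # 2012-05-14 10:45:38 3310686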

Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

Submitted by 廉价感情. on 2019-12-05 01:33:43
I am new to Solr. Reading Solr's wiki, I don't understand the difference between WhitespaceTokenizerFactory and StandardTokenizerFactory. What is their real difference?

They differ in how they split the analyzed text into tokens. The StandardTokenizer does this based on the following (taken from the Lucene javadoc):

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
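
A rough, hand-rolled illustration of the difference (an approximation in Python, not the actual Lucene tokenizers): whitespace-style tokenization keeps punctuation attached to tokens, while standard-style tokenization splits at punctuation but keeps interior dots.

    import re

    text = "john.doe@example.com says the wi-fi setup is done."

    # WhitespaceTokenizer-style: split on whitespace only; punctuation stays attached.
    print(text.split())
    # ['john.doe@example.com', 'says', 'the', 'wi-fi', 'setup', 'is', 'done.']

    # Very rough StandardTokenizer-style approximation: split at punctuation,
    # but keep a dot that sits between word characters inside the token.
    print(re.findall(r"\w+(?:\.\w+)*", text))
    # ['john.doe', 'example.com', 'says', 'the', 'wi', 'fi', 'setup', 'is', 'done']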

Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors

Submitted by 萝らか妹 on 2019-12-05 01:10:13
Question: First of all, I am new to Python/NLTK, so my apologies if the question is too basic. I have a large file that I am trying to tokenize, and I get memory errors. One solution I've read about is to read the file one line at a time, which makes sense; however, when doing that, I get the error cannot concatenate 'str' and 'list' objects. I am not sure why that error is displayed, since after reading the file I check its type and it is in fact a string. I have tried to split the 7MB files into 4
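
The question is cut off here and no answer survives, but the error itself has a common cause: word_tokenize returns a list, so its result cannot be added to a string with +. A minimal line-by-line sketch (the filename and tokenizer choice are assumptions):

    from nltk.tokenize import word_tokenize  # needs the Punkt model: nltk.download('punkt')

    tokens = []                               # collect tokens in a list, not a string
    with open('large_file.txt', encoding='utf-8') as fh:   # hypothetical filename
        for line in fh:                       # one line at a time keeps memory use low
            # word_tokenize returns a list; list.extend avoids the
            # "cannot concatenate 'str' and 'list' objects" error raised
            # when a list is added to a string with +
            tokens.extend(word_tokenize(line))

    print(len(tokens))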

How to index a postgres table by name, when the name can be in any language?

Submitted by 冷暖自知 on 2019-12-05 00:32:05
Question: I have a large Postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to search for the name of a place, the system currently does (assuming the search is on cafe): lower(location_name) LIKE '%cafe%' as part of the query. This is hugely inefficient, prohibitively so, and it is essential that I make this faster. I've tried indexing the table on gin(to_tsvector('simple', location_name)) and searching with (to_tsvector('simple',location

C - Determining which delimiter used - strtok()

Submitted by 巧了我就是萌 on 2019-12-04 19:47:38
Question: Let's say I'm using strtok() like this: char *token = strtok(input, ";-/"); Is there a way to figure out which delimiter actually gets used? For instance, if the input was something like: Hello there; How are you? / I'm good - End. Can I figure out which delimiter was used for each token? I need to be able to output a specific message depending on the delimiter that followed the token.

Answer 1: Important: strtok is not re-entrant; you should use strtok_r instead of it. You can do it by saving a
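
The answer is truncated above. As a language-neutral illustration of the goal (knowing which delimiter ended each token), here is a Python sketch using re.split with a capturing group; it is not the strtok_r-based C approach the answer starts to describe.

    import re

    line = "Hello there; How are you? / I'm good - End"

    # A capturing group in the pattern makes re.split keep the delimiters,
    # so each token can be paired with the delimiter that followed it.
    parts = re.split(r'\s*([;/-])\s*', line)
    tokens = parts[0::2]            # the text pieces
    delims = parts[1::2] + [None]   # delimiter after each piece; None after the last
    print(list(zip(tokens, delims)))
    # [('Hello there', ';'), ('How are you?', '/'), ("I'm good", '-'), ('End', None)]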

Split delimited strings into distinct columns in R dataframe

Submitted by 妖精的绣舞 on 2019-12-04 17:39:15
I need a fast and concise way to split string literals in a data frame into a set of columns. Let's say I have this data frame:

    data <- data.frame(id=c(1,2,3),
                       tok1=c("a, b, c", "a, a, d", "b, d, e"),
                       tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))

(Please note the different delimiters among columns.) The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative). I need two data frames like these:

tok1.occurrences:

    +----+---+---+---+---+---+
    | id | a | b | c | d | e |
    +----+---+---+---+---+---+
    |  1 | 1 | 1 |
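
The question is cut off here, and no R answer survives. For comparison only, a hypothetical pandas sketch of the requested reshaping for the tok1 column (the data and delimiter come from the question):

    import pandas as pd

    data = pd.DataFrame({'id':   [1, 2, 3],
                         'tok1': ["a, b, c", "a, a, d", "b, d, e"]})

    # Split each cell on ', ', give every token its own row, then count
    # occurrences per id and pivot the tokens into columns.
    tok1_occurrences = (data.set_index('id')['tok1']
                            .str.split(', ')
                            .explode()
                            .groupby(level=0)
                            .value_counts()
                            .unstack(fill_value=0))

    print(tok1_occurrences)
    # id=1 -> a:1 b:1 c:1; id=2 -> a:2 d:1; id=3 -> b:1 d:1 e:1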

nltk sentence tokenizer, consider new lines as sentence boundary

Submitted by 安稳与你 on 2019-12-04 15:56:18
Question: I am using nltk's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to consider a new paragraph or new lines as a new sentence.

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer
    >>> tokenizer = PunktSentenceTokenizer()
    >>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
    ['Sentence 1 \n Sentence 2.', 'Sentence 3.']
    >>> tokenizer.span_tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
    [(0, 24), (25, 36)]

I would like it to
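
The question is cut off and no answer is included. One common workaround (a sketch, not taken from the original thread) is to split the text on newlines first and let Punkt tokenize each chunk, so a line break always acts as a sentence boundary:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    text = 'Sentence 1 \n Sentence 2. Sentence 3.'
    tokenizer = PunktSentenceTokenizer()

    # Split on newlines first, then let Punkt split inside each chunk.
    sentences = [sent
                 for chunk in text.split('\n')
                 for sent in tokenizer.tokenize(chunk.strip())
                 if sent]

    print(sentences)
    # ['Sentence 1', 'Sentence 2.', 'Sentence 3.']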

Elasticsearch custom analyzer for hyphens, underscores, and numbers

Submitted by 柔情痞子 on 2019-12-04 13:34:56
Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:

    {
      "mappings": {
        "event": {
          "properties": {
            "ipaddress": { "type": "string" },
            "hostname": {
              "type": "string",
              "analyzer": "my_analyzer",
              "fields": {
                "raw": { "type": "string", "index": "not_analyzed" }
              }
            }
          }
        }
      },
      "settings": {
        "analysis": {
          "filter": {
            "my_filter": {
              "type": "word_delimiter",
              "preserve_original": true
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": ["lowercase", "my_filter"]
            }
          }
        }
      }
    }

You can see that I've attempted to use a custom analyzer for

How to create a Tokenizing Control for UWP as known from Outlook when using To, Cc and Bcc

Submitted by 感情迁移 on 2019-12-04 13:06:22
There is a great article about how to write a tokenizing control for WPF here: Tokenizing control – convert text to tokens. But how is this accomplished in a UWP app? The Windows 10 UWP Mail client does this just fine, so I know that it is possible. But how?

Tokenizing is super useful for To/CC/BCC input areas, as we know it from Outlook and lately from the Windows 10 UWP Mail client. I suspect that RichTextBlock, or maybe RichEditBox combined with AutoSuggestBox, could be part of the answer, but in the WPF example above FlowDocument is used, and FlowDocument is not supported in UWP. I haven't