tokenize

StreamTokenizer splits up 001_to_003 into two tokens; how can I prevent it from doing so?

守給你的承諾、 submitted on 2020-01-05 04:06:30

Question: Java's StreamTokenizer seems to be too greedy in identifying numbers. It is relatively light on configuration options, and I haven't found a way to make it do what I want. The following test passes, which in my opinion shows a bug in the implementation; what I'd really like is for the second token to be identified as the word "20001_to_30000". Any ideas?

```java
public void testBrokenTokenizer() throws Exception {
    final String query = "foo_bah 20001_to_30000";
    StreamTokenizer tok = new StreamTokenizer(new …
```
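The test above is cut off, but StreamTokenizer's built-in number parsing can be switched off by reclassifying the digit characters. A minimal sketch of that idea (the TokenizerDemo class and the exact character setup are illustrative assumptions, not the asker's code):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenizerDemo { // hypothetical demo class, not from the question
    public static void main(String[] args) throws IOException {
        StreamTokenizer tok = new StreamTokenizer(new StringReader("foo_bah 20001_to_30000"));
        // Undo the constructor's parseNumbers() setup, then fold digits back in as word characters
        tok.ordinaryChars('0', '9');
        tok.ordinaryChar('.');
        tok.ordinaryChar('-');
        tok.wordChars('0', '9');
        tok.wordChars('_', '_'); // underscores are not word characters by default
        while (tok.nextToken() != StreamTokenizer.TT_EOF) {
            if (tok.ttype == StreamTokenizer.TT_WORD) {
                System.out.println(tok.sval); // prints foo_bah, then 20001_to_30000
            }
        }
    }
}
```

The order matters: ordinaryChars() first clears the numeric classification installed by the constructor, and wordChars() then re-adds digits as plain word characters, so "20001_to_30000" comes back as a single TT_WORD token.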

Superpower: match a string with tokenizer only if it begins a line

匆匆过客 submitted on 2020-01-04 06:25:10

Question: When tokenizing in Superpower, how do you match a string only if it is the first thing in a line (note: this is a different question than this one)? For example, assume I have a language with only the following four characters (' ', ':', 'X', 'Y'), each of which is a token. There is also a 'Header' token to capture cases of the following regex pattern /^[XY]+:/ (any number of Xs and Ys followed by a colon, only if they start the line). Here is a quick class for testing (the 4th test case fails): …

Rails plugin for generating unique links?

拈花ヽ惹草 submitted on 2020-01-02 12:01:40

Question: There are many places in my application where I need to generate links with unique tokens (foo.com/g6Ce7sDygw or whatever). Each link may be associated with some session data and would take the user to some specific controller/action. Does anyone know of a gem/plugin that does this? It's easy enough to implement, but it would be cleaner not to have to write it from scratch for each app.

Answer 1: I needed the same thing and implemented it myself. I don't know about any plugin that …
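The answer is truncated, but as a rough illustration of rolling this yourself: Rails 5+ ships has_secure_token, and SecureRandom covers older versions. A minimal sketch only; the Link model, its token column, and the route are assumptions for illustration, not the answerer's code:

```ruby
# Assumes a `links` table with a unique, indexed `token` string column (hypothetical schema).
class Link < ApplicationRecord
  has_secure_token :token # Rails 5+: fills `token` with a URL-safe random value on create

  # Pre-Rails-5 alternative: generate the token by hand.
  # before_validation(on: :create) { self.token ||= SecureRandom.urlsafe_base64(8) }
end

# config/routes.rb -- resolve foo.com/g6Ce7sDygw to a controller action:
# get "/:token", to: "links#show"

# app/controllers/links_controller.rb
class LinksController < ApplicationController
  def show
    @link = Link.find_by!(token: params[:token])
    # ...redirect or render based on the data attached to @link...
  end
end
```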

'IDENTIFIER' rule also consumes keywords in ANTLR lexer grammar

扶醉桌前 submitted on 2020-01-02 05:24:07

Question: While working on an ANTLR 3.5 grammar for Java parsing, I noticed that the 'IDENTIFIER' rule consumes a few keywords. The lexer grammar is:

```
lexer grammar JavaLexer;
options {
  //k=8;
  language=Java;
  filter=true;
  //backtrack=true;
}
@lexer::header { package java; }
@lexer::members {
  public ArrayList<String> keywordsList = new ArrayList<String>();
}
V_DECLARATION
  : ( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* ) {...};
fragment V_DECLARATOR
  : ( tok=IDENTIFIER …
```
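The grammar is cut off, but the usual cause of this symptom is that a bare IDENTIFIER rule happily matches keyword spellings. The conventional ANTLR fix is to declare one lexer rule per keyword before IDENTIFIER, since the lexer prefers the earlier rule when two rules match the same text. A minimal sketch of that layout (the rule names and keyword set are illustrative, not taken from the question):

```
// Keyword rules must come before IDENTIFIER: when two rules match
// the same stretch of input, the lexer picks the one declared first.
IF      : 'if';
ELSE    : 'else';
WHILE   : 'while';
RETURN  : 'return';

IDENTIFIER
  : LETTER (LETTER | DIGIT)*
  ;

fragment LETTER : 'a'..'z' | 'A'..'Z' | '_' | '$';
fragment DIGIT  : '0'..'9';
```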

What does "regular" in regex/"regular expression" mean?

删除回忆录丶 submitted on 2020-01-02 01:25:27

Question: What does the "regular" in the phrase "regular expression" mean? I have heard that regexes were regular at one time, but are no longer.

Answer 1: The "regular" in "regular expression" comes from the fact that it matches a regular language. The concept of regular expressions used in formal language theory is quite different from what engines like PCRE call regular expressions. PCRE and other similar engines have features like lookahead, conditionals and recursion, which make them able to match non-regular languages.
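A concrete illustration of that last point, as a small Python sketch: a backreference recognizes the "copy" language of strings of the form ww, a textbook example of a language that is provably not regular, so no classical regular expression can match it.

```python
import re

# ^(\w+)\1$ accepts exactly the strings of the form w + w -- non-regular.
copy_language = re.compile(r"^(\w+)\1$")

print(bool(copy_language.match("abcabc")))  # True  ("abc" repeated)
print(bool(copy_language.match("abcabd")))  # False (second half differs)
```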

How do I use NLTK's default tokenizer to get spans instead of strings?

北城以北 submitted on 2020-01-02 00:56:09

Question: NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on sentences. It does a pretty good job out of the box.

```
>>> nltk.word_tokenize("(Dr. Edwards is my friend.)")
['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']
```

I'd like to use this same algorithm, except have it return tuples of offsets into the original string instead of string tokens. By offset I mean 2-tuples that can serve as indexes into the original …
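The question is cut short, but NLTK's tokenizer classes expose span_tokenize() for exactly this. A sketch that chains sentence spans and word spans, assuming a recent NLTK where TreebankWordTokenizer implements span_tokenize() and the Punkt model has been downloaded:

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer

# Requires the Punkt sentence model: nltk.download('punkt')
sent_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

text = "(Dr. Edwards is my friend.)"
spans = []
for sent_start, sent_end in sent_tokenizer.span_tokenize(text):
    sentence = text[sent_start:sent_end]
    # Word spans come back relative to the sentence; shift them to full-string offsets.
    for start, end in TreebankWordTokenizer().span_tokenize(sentence):
        spans.append((sent_start + start, sent_start + end))

print(spans)
print([text[s:e] for s, e in spans])  # slicing with the spans recovers the tokens
```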

Split delimited strings into distinct columns in R dataframe

拜拜、爱过 submitted on 2020-01-01 19:25:14

Question: I need a fast and concise way to split string literals in a data frame into a set of columns. Let's say I have this data frame:

```
data <- data.frame(id=c(1,2,3),
                   tok1=c("a, b, c", "a, a, d", "b, d, e"),
                   tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))
```

(Please note the different delimiters among the columns.) The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative). I need two data frames like these: …
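The expected output is cut off, but one common way to get per-token columns in R is tidyr::separate(), one call per delimited column. A rough sketch under the assumption that three tokens per cell is the maximum; the tok1_/tok2_ column names are made up for illustration:

```r
library(tidyr)

data <- data.frame(id   = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
                   stringsAsFactors = FALSE)

# One data frame per source column; sep is a regex, fill pads short rows with NA.
tok1_wide <- separate(data[, c("id", "tok1")], tok1,
                      into = paste0("tok1_", 1:3), sep = ",\\s*", fill = "right")
tok2_wide <- separate(data[, c("id", "tok2")], tok2,
                      into = paste0("tok2_", 1:3), sep = "\\|", fill = "right")
```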

How to create a Tokenizing Control for UWP as known from Outlook when using To, Cc and Bcc

此生再无相见时 submitted on 2020-01-01 17:01:07

Question: There is a great article about how to write a tokenizing control for WPF here: Tokenizing control – convert text to tokens. But how is this accomplished in a UWP app? The Windows 10 UWP Mail client does this just fine, so I know that it is possible. But how? Tokenizing is super useful for To/CC/BCC input areas, as we know them from Outlook and lately from the Windows 10 UWP Mail client. I suspect that RichTextBlock, or maybe RichEditBox combined with AutoSuggestBox, could be part of the answer …

Splitting chinese document into sentences [closed]

99封情书 submitted on 2020-01-01 11:50:32

Question: Closed. This question is off-topic and is not currently accepting answers. Closed 2 years ago.

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese. Can you please let me know of any good sentence splitters for Chinese, preferably in Java or Python?

Answer 1: Using some regex tricks in Python (cf. a modified regex of …
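The answer is cut off before the regex itself, but the trick it alludes to is usually splitting after the full-width Chinese sentence terminators. A minimal sketch of that idea (the exact punctuation set is an assumption; the original answer's regex may differ), using Python 3.7+, where re.split() accepts zero-width patterns:

```python
import re

def split_chinese_sentences(text):
    # Split *after* a full-width terminator (。！？；…) so the punctuation
    # stays attached to the sentence it ends.
    pieces = re.split(r"(?<=[。！？；…])", text)
    return [p.strip() for p in pieces if p.strip()]

print(split_chinese_sentences("你好。今天天气很好！我们走吧？"))
# ['你好。', '今天天气很好！', '我们走吧？']
```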

Tokenizing strings using regular expression in Javascript

时光怂恿深爱的人放手 submitted on 2020-01-01 11:48:34

Question: Suppose I have a long string containing newlines and tabs, such as:

```
var x = "This is a long string.\n\t This is another one on next line.";
```

So how can we split this string into tokens using a regular expression? I don't want to use .split(' ') because I want to learn JavaScript's regexes. A more complicated string could be this:

```
var y = "This @is a #long $string. Alright, lets split this.";
```

Now I want to extract only the valid words out of this string, without special characters and punctuation, i.e. …
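The question trails off, but for the strings shown, match() with a global \w+ pattern is the usual one-liner, and split() on a non-word class is the complementary view. A small sketch (variable names follow the question; the printed output is for these inputs):

```javascript
var y = "This @is a #long $string. Alright, lets split this.";

// match() with /g returns every substring the pattern accepts;
// \w+ grabs runs of word characters and skips punctuation and whitespace.
var words = y.match(/\w+/g);
console.log(words);
// ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"]

// The split() view of the same idea: cut on runs of non-word characters,
// then drop the empty strings that appear at the edges.
var x = "This is a long string.\n\t This is another one on next line.";
console.log(x.split(/\W+/).filter(Boolean));
```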