tokenize | 易学教程

Tokenizing math expression with functions in C#

Question: I figured this would be easy to find, but I haven't been successful. I need to be able to tokenize the following expression (4 + 5) + myfunc('two words', 3, 5) into ( 4 + 5 ) + myfunc ( 'two words' , 3 , 5 ) It seems like this is probably a common need; however, I haven't been able to find any good documentation on it. Is this something I could do using regex? Does anybody know of an existing way to do this? I'm using C#, but if you have the answer in another language, don't be shy.
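
One possible regex-based approach to the question above, sketched in Python rather than C# since the asker accepts other languages. The pattern and the name `tokenize_expression` are illustrative assumptions, not taken from any answer; the sketch assumes single-quoted strings without embedded quotes.

```python
import re

# Assumed token classes, tried in order at each position.
TOKEN_RE = re.compile(r"""
    '[^']*'            # single-quoted string, kept as one token
  | \d+(?:\.\d+)?      # integer or decimal number
  | [A-Za-z_]\w*       # identifier such as a function name
  | \S                 # any other single non-whitespace character
""", re.VERBOSE)

def tokenize_expression(expression):
    """Return the tokens of `expression` in order, skipping whitespace."""
    return TOKEN_RE.findall(expression)

print(tokenize_expression("(4 + 5) + myfunc('two words', 3, 5)"))
# ['(', '4', '+', '5', ')', '+', 'myfunc', '(', "'two words'", ',', '3', ',', '5', ')']
```

Because the quoted-string alternative comes first, 'two words' survives as one token even though it contains a space.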

How to make the tokenizer detect empty spaces while using strtok()

Question: I am designing a C++ program, and somewhere in the program I need to detect whether there is a blank (empty token) next to the token being used now, e.g. if(token1==start) { token2=strtok(NULL," "); if(token2==NULL) {LCCTR=0;} else {LCCTR=atoi(token2);} So in the previous piece, token1 is pointing to start, and I want to check whether there is a number next to start, so I used token2=strtok(NULL," ") to point to the next token, but unfortunately the strtok function cannot detect empty spaces so it gives me an
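
For context, strtok treats a run of delimiters as a single separator, so it never returns an empty token. The same distinction can be shown in Python as a rough sketch: `split()` with no argument collapses whitespace runs the way strtok does, while `split(" ")` keeps empty fields. The helper `parse_start_line` is a hypothetical stand-in for the asker's logic, defaulting the counter to 0 when no number follows the label.

```python
def parse_start_line(line):
    """Return (label, counter); counter falls back to 0 when nothing follows the label."""
    fields = line.split()  # collapses runs of whitespace, similar to strtok
    label = fields[0] if fields else ""
    counter = int(fields[1]) if len(fields) > 1 else 0
    return label, counter

print(parse_start_line("start 100"))  # ('start', 100)
print(parse_start_line("start"))      # ('start', 0)

# Splitting on an explicit single space preserves empty fields instead.
print("start  100".split(" "))        # ['start', '', '100']
```

Unlike atoi, `int()` raises ValueError on non-numeric text, so real input would need its own error handling.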

Reading input from a file in Python 3.x

Question: Say you are reading input from a file structured like so P3 400 200 255 255 255 255 255 0 0 255 0 0 etc... But you want to account for any mistakes that may come from the input file, as in P3 400 200 255 255 255 255 255 0 0 255 0 0 etc... I want to read in the first token 'P3', then the next two, '400' '200' (height/width), then the '255', and from here on I want to read every token in and account for how they should be in groups of 3. I have the correct code to read this information but I can't seem
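
A minimal sketch of one way to make the reading independent of line breaks: split the whole text on whitespace, then consume the tokens in order. The function name `parse_p3` and the error messages are assumptions for illustration; the two size values are returned in file order without deciding which is width and which is height.

```python
def parse_p3(text):
    """Parse whitespace-separated P3 data into (size_a, size_b, max_value, pixels)."""
    tokens = text.split()  # any mix of spaces and newlines is accepted
    if len(tokens) < 4 or tokens[0] != "P3":
        raise ValueError("expected a P3 header followed by three numbers")
    size_a, size_b, max_value = (int(t) for t in tokens[1:4])
    values = [int(t) for t in tokens[4:]]
    if len(values) % 3 != 0:
        raise ValueError("pixel data is not a multiple of three values")
    # Group the remaining values into triples.
    pixels = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return size_a, size_b, max_value, pixels

messy = "P3 400\n200 255 255\n255 255 255 0\n0 255 0 0"
print(parse_p3(messy))
# (400, 200, 255, [(255, 255, 255), (255, 0, 0), (255, 0, 0)])
```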

String tokenization in Java (LARGE text)

Question: I have this large text (read LARGE). I need to tokenize every word, delimiting on every non-letter. I used StringTokenizer to read one word at a time. However, as I was researching how to write the delimiter string ("every non-letter") instead of doing something like: new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]"); I found that everyone basically hates StringTokenizer (why?). So, what can I use instead? Don't suggest String.split as it will duplicate my large text. I
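
The general idea of matching letter runs lazily, instead of splitting the whole text up front, can be sketched as follows. This is written in Python as an illustration, not as the Java answer; `iter_words` is a hypothetical name, and the pattern assumes "letter" means any Unicode letter.

```python
import re

# \w minus digits and underscore leaves letters only.
WORD_RE = re.compile(r"[^\W\d_]+")

def iter_words(text):
    """Yield each run of letters in `text`, one at a time."""
    for match in WORD_RE.finditer(text):
        yield match.group()

for word in iter_words('Hello, world! 123 "quoted" (text)'):
    print(word)
# Hello
# world
# quoted
# text
```

Since the generator yields one word at a time, no list of all words is built unless the caller asks for one.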

Preventing tokens from containing a space in Stanford CoreNLP

Question: Is there an option in Stanford CoreNLP's tokenizer to prevent tokens from containing a space? E.g. if the sentence is "my phone is 617 1555-6644", the substring "617 1555" should be split into two different tokens. I am aware of the option normalizeSpace: normalizeSpace: Whether any spaces in tokens (phone numbers, fractions) get turned into U+00A0 (non-breaking space). It's dangerous to turn this off for most of our Stanford NLP software, which assumes no spaces in tokens. but I don't want tokens

Why is this function not breaking up this input string?

Question: I'm trying to break up a string into "symbols" with C++ for further work. I haven't written anything in C++ for a long while, so forgive me if there is something inherently wrong with this code. The purpose of the symbolize() function below is to break up a string, such as "5+5", into a vector of strings, e.g. {"5","+","5"}. It's not working. If you think the code is too messy, please suggest a way to simplify it. Here's my code so far: #include <iostream> #include <string> #include <vector>

Regex tokenize issue

Question: I have strings input by the user and want to tokenize them. For that, I want to use regex and now have a problem with a special case. An example string is Test + "Hello" + "Good\"more" + "Escape\"This\"Test" or the C# equivalent @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""" I am able to match the Test and + tokens, but not the ones enclosed in ". I use the " to let the user specify that this is literally a string and not a special token. Now if the user wants to use
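
A common way to match a quoted string that may contain backslash-escaped quotes is to accept, inside the quotes, either any character other than a quote or backslash, or a backslash followed by any character. The sketch below uses Python rather than C#, and `tokenize_quoted` is a hypothetical name.

```python
import re

TOKEN_RE = re.compile(
    r'"(?:[^"\\]|\\.)*"'  # quoted string; \" and other escapes stay inside it
    r'|[^\s"]+'           # any other run of non-space, non-quote characters
)

def tokenize_quoted(text):
    """Split `text` into bare tokens and double-quoted string tokens."""
    return TOKEN_RE.findall(text)

sample = r'Test + "Hello" + "Good\"more" + "Escape\"This\"Test"'
for token in tokenize_quoted(sample):
    print(token)
# Test
# +
# "Hello"
# +
# "Good\"more"
# +
# "Escape\"This\"Test"
```

The quoted tokens keep their surrounding quotes and escapes; unescaping would be a separate step.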

Parse/tokenize Objective-C with Objective-C (iPhone)

Question: What options are available for parsing and/or tokenizing Objective-C on iPhone? Essentially I'm thinking of parsing/tokenizing enough to power syntax highlighting and autocompletion at roughly the same level as Xcode does. Answer 1: I know the topic is old, but this might help someone else. Apple already provides the (very nice) CFStringTokenizer, with support for multiple languages. Here's a good presentation on that, including sample code. In case tokenization is enough, that should do it.

How to identify the end of a sentence

Question: String x=" i am going to the party at 6.00 in the evening. are you coming with me?"; If I have the above string, I need it to be broken into sentences using sentence boundary punctuation (like . and ?), but it should not split the sentence at 6 just because there is a point there. Is there a way to identify the correct sentence boundary in Java? I have tried using StringTokenizer from the java.util package, but it always breaks the sentence whenever it finds a point. Can someone suggest
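
One simple heuristic for this case is to treat a period, question mark, or exclamation mark as a boundary only when whitespace follows it, so the point in 6.00 is left alone. The sketch below is in Python rather than Java, and `split_sentences` is a hypothetical name. It is only a heuristic: abbreviations such as "Dr. Smith" would still be split incorrectly.

```python
import re

# Split on whitespace that directly follows ., ? or !
BOUNDARY_RE = re.compile(r"(?<=[.?!])\s+")

def split_sentences(text):
    """Return the sentences in `text`, splitting after ., ? or ! when whitespace follows."""
    return [part for part in BOUNDARY_RE.split(text.strip()) if part]

x = " i am going to the party at 6.00 in the evening. are you coming with me?"
for sentence in split_sentences(x):
    print(sentence)
# i am going to the party at 6.00 in the evening.
# are you coming with me?
```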

How to separate tokens in line using Unix? [duplicate]

Question: This question already has answers here: How split a file in words in unix command line? (11 answers) Closed 5 years ago. How to separate tokens in line using Unix? [in]: some sentences are like this. some sentences foo bar that [out:] some sentences are like this. some sentences foo bar that I could have done this in Python as below, but is there any Unix way to achieve the same output? >>> import codecs >>> outfile = codecs.open('outfile.txt','w','utf8') >>> intext = "some sentences are