tokenize | 易学教程

RegEx Tokenizer to split a text into words, digits and punctuation marks

Question: I want to split a text into its smallest elements. For example:

    from nltk.tokenize import *
    txt = "A sample sentences with digits like 2.119,99 or 2,99 are awesome."
    regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
    ['A','sample','sentences','with','digits','like','2.119,99','or','2,99','are','awesome','.']

As you can see, it works fine. My problem is: what happens if the digit is at the end of a text?

    txt = "Today it's 07.May 2011. Or 2.999."
    regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\S+')
    ['Today', 'it', "'s", '07.May', '2011.', 'Or', '2.999.']

The result should be: ['Today', 'it

Elasticsearch custom analyzer with ngram and without word delimiter on hyphens

Question: I am trying to index strings that contain hyphens but no spaces, periods, or other punctuation. I do not want the words to be split on hyphens; instead, I would like the hyphens to be part of the indexed text. For example, my 6 text strings would be: magazineplayon, magazineofhorses, online-magazine, best-magazine, friend-of-magazines, magazineplaygames. I would like to be able to search these strings for text containing "play" or for text starting with "magazine". I have been able to use ngram to make the search for text containing "play" work properly. However, the hyphen is

Defined C token file for flex?

Question: I want to split a C file into tokens, not for compiling but for analysis. I feel this should be fairly straightforward, and I looked online for a ready-made tokens.l file (or something similar) for flex with all of the C grammar already defined, but couldn't find anything. Are there any predefined grammars floating around, or am I going about this all wrong?

Answer 1: Yes, there's at least one around. Edit: Since there are a few issues that one doesn't handle, perhaps it's worth looking at some (hand-written) lexing code I wrote several years ago. This basically only

Split string by a substring

Question: I have the following string:

    char str[] = "A/USING=B)";

I want to split it to get the separate A and B values, with /USING= as the delimiter. How can I do that? I know strtok(), but it only splits on a single-character delimiter.

Answer 1: "I know strtok(), but it only splits on a single-character delimiter." Nope, that's not the case. As per the man page for strtok() (emphasis mine):

    char *strtok(char *str, const char *delim);

[...] The delim argument specifies a set of bytes that delimit the tokens in the parsed string. [...]
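The man page's wording, that delim is a set of bytes rather than one substring, can be checked with a small sketch. The helper name `split_on_any` is hypothetical and not part of the original answer; it only wraps the standard `std::strtok` call and collects the tokens.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: collects the tokens strtok() yields for `input`
// when `delims` is passed as the delimiter argument.
std::vector<std::string> split_on_any(const std::string& input, const char* delims) {
    // strtok() writes '\0' bytes into its argument, so work on a writable copy.
    std::vector<char> buf(input.begin(), input.end());
    buf.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buf.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

With the delimiter argument "/USING=", the input "A/USING=B)" happens to give the tokens A and B), but only because neither value contains any of the bytes /, U, S, I, N, G or =. An input such as "SIGN/USING=B" yields just B, since every byte of SIGN is itself in the delimiter set.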

split char string with multi-character delimiter in C

Question: I want to split a char * string on a multi-character delimiter. I know that strtok() can be used to split a string, but it works with single-character delimiters. I want to split the string on a substring such as "abc" or any other substring. How can that be achieved?

Answer 1: Finding the point at which the desired sequence occurs is pretty easy; strstr supports that:

    char str[] = "this is abc a big abc input string abc to split up";
    char *pos = strstr(str, "abc");

So, at that point, pos
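Since the excerpt is cut off after the first strstr call, here is one possible way to continue from it, as a sketch rather than the answerer's actual code: call strstr repeatedly and step over the delimiter after each hit. The function name `split_on_substring` is an assumption introduced for illustration.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits `str` at every occurrence of the substring `delim`.
std::vector<std::string> split_on_substring(const char* str, const char* delim) {
    std::vector<std::string> parts;
    const std::size_t delim_len = std::strlen(delim);
    if (delim_len == 0) {
        // An empty delimiter would never advance, so return the input unsplit.
        parts.emplace_back(str);
        return parts;
    }

    const char* start = str;
    // strstr() returns a pointer to the next occurrence, or a null pointer if none is left.
    while (const char* pos = std::strstr(start, delim)) {
        parts.emplace_back(start, pos);  // text between the previous delimiter and this one
        start = pos + delim_len;         // continue just past the delimiter
    }
    parts.emplace_back(start);           // remainder after the last delimiter
    return parts;
}
```

Unlike strtok, this version leaves the input untouched and keeps empty pieces when two delimiters are adjacent.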

Boost::tokenizer point separated, but also keeping empty fields

Question: I have seen this question, and mine is very similar but not the same, so please do not mark it as a duplicate. My question is: how do I get the empty fields from a string? I have a string like

    std::string s = "This.is..a.test";

and I want to get the fields <This> <is> <> <a> <test>. I have also tried

    typedef boost::char_separator<char> ChSep;
    typedef boost::tokenizer<ChSep> TknChSep;
    ChSep sep(".", ".", boost::keep_empty_tokens);
    TknChSep tok(s, sep);
    for (TknChSep::iterator beg = tok
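For comparison, the desired field list can also be produced without Boost. The following is a standard-library sketch, not the Boost-based solution the question asks about; the name `split_keep_empty` is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `s` on the single character `sep`,
// keeping empty fields that occur between two adjacent separators.
std::vector<std::string> split_keep_empty(const std::string& s, char sep) {
    std::vector<std::string> fields;
    std::istringstream in(s);
    std::string field;
    // std::getline stops at each separator and yields an empty string
    // when two separators are adjacent.
    while (std::getline(in, field, sep)) {
        fields.push_back(field);
    }
    // Caveat: a trailing separator does not produce a final empty field here.
    return fields;
}
```

Called as `split_keep_empty("This.is..a.test", '.')`, this gives the five fields from the question, including the empty one between "is" and "a".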

C++/Boost split a string on more than one character

Question: This is probably really simple once I see an example, but how do I generalize boost::tokenizer or boost::split to deal with separators consisting of more than one character? For example, with "____", neither of these standard splitting solutions seems to work:

    boost::tokenizer<boost::escaped_list_separator<string> > tk(myString, boost::escaped_list_separator<string>("", "____", "\""));
    std::vector<string> result;
    for (string tmpString : tk) { result.push_back(tmpString); }

or

    boost::split(result, myString, "___");

Answer 1:

    boost::algorithm::split_regex( result, myString, regex( "___" ) ) ;

you have to
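The same idea as split_regex, treating the whole separator as a pattern, can be sketched with the standard <regex> header instead of Boost. This is an assumed alternative, not the code from the answer, and `split_by_regex` is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` wherever the regular expression `pattern` matches.
std::vector<std::string> split_by_regex(const std::string& text, const std::string& pattern) {
    const std::regex separator(pattern);
    // Submatch index -1 selects the text between matches instead of the matches themselves.
    std::sregex_token_iterator first(text.begin(), text.end(), separator, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Because the separator is interpreted as a regular expression, characters with a special meaning (such as `.` or `|`) would need escaping; plain underscores do not.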