tokenize

Superpower: match a string with a parser only if it begins a line

大兔子大兔子 submitted on 2019-12-04 06:22:17
Question: When parsing with Superpower, how do I match a string only if it is the first thing on a line? For example, I need to match the "A:" in "A: Hello Goodbye\n" but not in "Goodbye A: Hello\n". Answer 1: Using your example here, I would change your ActorParser and NodeParser definitions to this: public readonly static TokenListParser<Tokens, Node> ActorParser = from name in NameParser from colon in Token.EqualTo(Tokens.Colon) from text in TextParser select new Node { Actor = name + colon.ToStringValue() …
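(For readers outside the Superpower ecosystem, the "only at the start of a line" constraint is easiest to see with a multiline-anchored regular expression. The Python sketch below only illustrates that idea; it is not the Superpower API, and the pattern name is made up for the example.)

    import re

    text = "A: Hello Goodbye\nGoodbye A: Hello\n"

    # ^ with re.MULTILINE matches only at the beginning of a line,
    # so the "A:" inside the second line is deliberately not matched.
    line_start_actor = re.compile(r"^(?P<actor>[A-Za-z]+):", re.MULTILINE)

    for m in line_start_actor.finditer(text):
        print(m.group("actor"), "at offset", m.start())
    # prints: A at offset 0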

Generating PHP code (from Parser Tokens)

流过昼夜 submitted on 2019-12-03 17:37:11
Question: Is there any available solution for (re-)generating PHP code from the parser tokens returned by token_get_all? Other solutions for generating PHP code are welcome as well, preferably with the associated lexer/parser (if any). Answer 1: If I'm not mistaken, http://pear.php.net/package/PHP_Beautifier uses token_get_all() and then rewrites the stream. It uses heaps of methods like t_else and t_close_brace to output each token. Maybe you can hijack this for simplicity. Answer 2: From my comment: Does anyone …
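(Python ships a direct analogue of this token-stream round trip, which may help make the idea concrete; it says nothing about PHP's token_get_all itself and is only a sketch of the general technique.)

    import io
    import tokenize

    src = "x = 1  # a comment\nprint(x)\n"

    # Tokenize the source, then rebuild the text from the token stream.
    tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
    rebuilt = tokenize.untokenize(tokens)

    print(rebuilt == src)  # True: with full position info the round trip is lossless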

NLTK tokenize - faster way?

旧巷老猫 submitted on 2019-12-03 17:13:09
Question: I have a method that takes in a String parameter and uses NLTK to break the String down into sentences, then into words. Afterwards, it converts each word to lowercase and finally creates a dictionary of the frequency of each word. import nltk from collections import Counter def freq(string): f = Counter() sentence_list = nltk.tokenize.sent_tokenize(string) for sentence in sentence_list: words = nltk.word_tokenize(sentence) words = [word.lower() for word in words] for word in words: f[word] …
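(The excerpt above is cut off; a complete, runnable version of the function it describes might look like the following. It assumes NLTK and its punkt tokenizer data are installed; feeding Counter.update a generator is just one idiomatic way to drop the innermost loop.)

    import nltk
    from collections import Counter

    def freq(string):
        f = Counter()
        for sentence in nltk.tokenize.sent_tokenize(string):
            words = nltk.word_tokenize(sentence)
            # update() counts a whole iterable at once instead of
            # incrementing the counter word by word in an inner loop
            f.update(word.lower() for word in words)
        return f

    print(freq("This is a test. This is only a test."))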

Boost::tokenizer comma separated (C++)

半腔热情 submitted on 2019-12-03 14:19:34
Should be an easy one for you guys... I'm playing around with tokenizers using Boost and I want to create a tokenizer that splits on commas. Here is my code: string s = "this is, , , a test"; boost::char_delimiters_separator<char> sep(","); boost::tokenizer<boost::char_delimiters_separator<char>> tok(s, sep); for(boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg) { cout << *beg << "\n"; } The output that I want is: This is a test What I am getting is: This is , , , a test UPDATED. CapelliC: You must give the separator to the tokenizer! boost::tokenizer<boost::char_delimiters_separator …

Best way to parse Space Separated Text

谁说胖子不能爱 submitted on 2019-12-03 13:13:37
I have a string like this: /c SomeText\MoreText "Some Text\More Text\Lol" SomeText I want to tokenize it, but I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design. This is in C#, by the way. EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx. private string[] tokenize(string input) { string[] tokens = input.Split(' '); List<String> output = new List<String>(); for (int i = 0; i < tokens.Length; i++) { if (tokens[i].StartsWith("\"")) { string temp = tokens[i]; int k = 0; for (k = …
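(A compact alternative to the hand-rolled loop, shown as a Python sketch because the idea is language-neutral: treat a double-quoted run or a run of non-spaces as one token. Whether the surrounding quotes should be stripped afterwards is an assumption about the desired output.)

    import re

    s = r'/c SomeText\MoreText "Some Text\More Text\Lol" SomeText'

    # A token is either a double-quoted run (quotes kept) or a run of non-spaces.
    tokens = re.findall(r'"[^"]*"|\S+', s)
    print(tokens)
    # ['/c', 'SomeText\\MoreText', '"Some Text\\More Text\\Lol"', 'SomeText']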

Basic NLP in CoffeeScript or JavaScript — Punkt tokenization, simple trained Bayes models — where to start? [closed]

拜拜、爱过 submitted on 2019-12-03 13:03:33
Question (closed as needing more focus): My current web-app project calls for a little NLP: tokenizing text into sentences, via Punkt and similar; breaking down the longer sentences by subordinate clause (often it's on commas, except when it's not); a Bayesian model fit for chunking paragraphs with an even feel, no …

A string tokenizer in C++ that allows multiple separators

落花浮王杯 submitted on 2019-12-03 13:01:25
Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done: string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmptyEntries); Use boost::tokenizer. It supports multiple separators. In fact, you don't really even need boost::tokenizer. If all you want is a split, use boost::split. The documentation has an example: http://www.boost.org/doc/libs/1_42_0/doc/html/string_algo/usage.html#id1718906 Something like this will do: void tokenize_string(const std::string &original_string, const std::string &delimiters, std::vector<std: …
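(If only the splitting behaviour matters, the same multi-separator split is a one-liner in most languages; this Python line is just a reference for the expected output, not a C++ answer.)

    import re

    s = "adsl, dkks; dk"
    # Split on commas, semicolons and spaces, dropping empty pieces,
    # mirroring StringSplitOptions.RemoveEmptyEntries in the C# example.
    tokens = [t for t in re.split(r"[,; ]+", s) if t]
    print(tokens)  # ['adsl', 'dkks', 'dk']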

Is there a way to boost the original term more while using Solr synonyms?

烂漫一生 submitted on 2019-12-03 12:16:48
Question: For example, I have the synonyms laptop,netbook,notebook in index_synonyms.txt. When a user searches for netbook, I want to boost the original text more than the terms expanded from synonyms. Is there a way to specify this in SynonymFilterFactory? For example, use the original term twice so its TF will be bigger. Answer 1: As far as I know, there is no way to do this with the existing SynonymFilterFactory. But the following is a trick you can use to get this behavior. Let's say your field is called title. Create another field which is …

C - Determining which delimiter was used - strtok()

放肆的年华 submitted on 2019-12-03 12:03:53
Let's say I'm using strtok() like this: char *token = strtok(input, ";-/"); Is there a way to figure out which delimiter actually got matched? For instance, if the input was something like: Hello there; How are you? / I'm good - End Can I figure out which delimiter was used for each token? I need to be able to output a specific message depending on the delimiter that followed the token. Important: strtok is not re-entrant; you should use strtok_r instead. You can do it by saving a copy of the original string and looking at the offsets of the current token into that copy: char str[] = "Hello …
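(The "which delimiter followed this token?" question is easy to see in any language once the delimiters are captured alongside the tokens. This Python sketch only makes the idea concrete; the C answer above still needs the saved-copy and offset trick.)

    import re

    line = "Hello there; How are you? / I'm good - End"

    # Capturing the delimiter class keeps the delimiters in the result list,
    # so tokens and the delimiters that follow them alternate.
    parts = re.split(r"\s*([;/-])\s*", line)
    tokens, delims = parts[0::2], parts[1::2]
    for token, delim in zip(tokens, delims):
        print(repr(token), "was followed by", repr(delim))
    print(repr(tokens[-1]), "ended the input")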

tokenize a string keeping delimiters in Python

泪湿孤枕 submitted on 2019-12-03 11:21:53
Question: Is there any equivalent to str.split in Python that also returns the delimiters? I need to preserve the whitespace layout for my output after processing some of the tokens. Example: >>> s = "\tthis is an example" >>> print s.split() ['this', 'is', 'an', 'example'] >>> print what_I_want(s) ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example'] Thanks! Answer 1: How about import re splitter = re.compile(r'(\s+|\S+)') splitter.findall(s) Answer 2: >>> re.compile(r'(\s+)').split("\tthis is an example") ['', ' …
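(Both answers are cut off above; run end to end they look roughly like this. Note that re.split with a capturing group can add empty strings at the edges, which is why the filter is there; whether the asker wants those empties kept is an assumption.)

    import re

    s = "\tthis is an example"

    # Answer 1: findall with "whitespace run OR non-whitespace run"
    print(re.findall(r"\s+|\S+", s))
    # ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']

    # Answer 2: split on a captured whitespace group, dropping the empty
    # piece re.split produces when the string starts with a delimiter
    print([p for p in re.split(r"(\s+)", s) if p])
    # ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']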