tokenize

Elasticsearch aggregate on URL hostname

让人想犯罪 __ Submitted on 2019-12-06 05:48:56
Question: I am indexing documents with a field containing a URL: [ 'myUrlField' => 'http://google.com/foo/bar' ]. What I'd like to get out of Elasticsearch is an aggregation on the URL field:
curl -XGET 'http://localhost:9200/myIndex/_search?pretty' -d '{ "facets": { "groupByMyUrlField": { "terms": { "field": "myUrlField" } } } }'
This is all well and good, but the default analyzer tokenizes the field so that each part of the URL becomes its own token, and I get hits for http, google.com, foo and bar. But
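The usual fix is to index the field unanalyzed so the terms aggregation (or facet, on the older versions that still had facets) sees the whole URL as one term. A sketch of such a mapping for a modern Elasticsearch version — on the 0.90/1.x era of the question, the equivalent was "index": "not_analyzed" on a string field; the sub-field name "raw" here is my own choice:

```json
{
  "mappings": {
    "properties": {
      "myUrlField": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
```

The aggregation then targets "myUrlField.raw" instead of "myUrlField", so each bucket key is a complete URL.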

Using escaped_list_separator with boost split

感情迁移 Submitted on 2019-12-06 05:26:03
Question: I am playing around with the Boost string library and have just come across the awesome simplicity of the split method.
string delimiters = ",";
string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\"";
// If we didn't care about delimiter characters within a quoted section we could use:
vector<string> tokens;
boost::split(tokens, str, boost::is_any_of(delimiters));
// gives the wrong result: tokens = {"string", " with", " comma", " delimited", " tokens", "\
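boost::tokenizer with boost::escaped_list_separator<char> is the Boost component built for exactly this case (delimiters inside quoted sections). As a dependency-free sketch of the behavior it provides, here is a hand-rolled quote-aware splitter in standard C++ (the helper name is mine, and unlike escaped_list_separator it does not handle backslash escapes):

```cpp
#include <string>
#include <vector>

// Split on the delimiter, but ignore delimiters inside double-quoted
// sections and drop the quote characters themselves.
std::vector<std::string> split_quoted(const std::string& s, char delim = ',') {
    std::vector<std::string> tokens;
    std::string current;
    bool in_quotes = false;
    for (char c : s) {
        if (c == '"') {
            in_quotes = !in_quotes;        // toggle quoted mode
        } else if (c == delim && !in_quotes) {
            tokens.push_back(current);     // token ends at an unquoted delimiter
            current.clear();
        } else {
            current += c;
        }
    }
    tokens.push_back(current);             // trailing token
    return tokens;
}
```

On the question's input this yields three tokens, with the quoted section (commas included) kept intact as the last one.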

How to convert fields to rows in Pig?

社会主义新天地 Submitted on 2019-12-06 04:28:04
Question: I want to convert fields to rows in Pig. From input.txt
1 2 3 4 5 6 7 8 9
where the delimiter between fields is '\t', to output.txt
1
2
3
4
...
But I must not use TOKENIZE, because the content of a field might be a whole sentence. Please help me. Many thanks.
Answer 1: I think alexeipab's answer is the right direction. Here is a simple example:
> A = load 'input.txt';
> dump A
(0,1,2,3,4,5,6,7,8,9)
> B = foreach A generate FLATTEN(TOBAG(*));
> dump B
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
Answer 2: I ran into a very
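What FLATTEN(TOBAG(*)) does can be stated precisely: every field of the input tuple becomes its own one-field output row, with the field content untouched (no word-splitting, which is why TOKENIZE is avoided). As a cross-language illustration of that reshaping — Python here, not Pig:

```python
# Each tab-separated field of a row becomes its own output row,
# mirroring Pig's FLATTEN(TOBAG(*)). Field content is preserved
# verbatim, even when a field is a whole sentence.
def fields_to_rows(line: str) -> list[list[str]]:
    return [[field] for field in line.split("\t")]

print(fields_to_rows("1\t2\t3"))  # [['1'], ['2'], ['3']]
```

Note how a field containing spaces stays a single row, which is exactly the property the question needs.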

Parsing URL string in Ruby

生来就可爱ヽ(ⅴ<●) Submitted on 2019-12-06 02:47:43
Question: I have a pretty simple string I want to parse in Ruby, and I'm trying to find the most elegant solution. The string has the format /xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla. What I would like to have is:
string1: /xyz/mov/exdaf/daeed.mov
string2: arg1=blabla&arg2=3bla3bla
so basically tokenise on the ?, but I can't find a good example. Any help would be appreciated.
Answer 1: Split the initial string on question marks.
str.split("?")
=> ["/xyz/mov/exdaf/daeed.mov", "arg1=blabla&arg2=3bla3bla"]
Answer 2: I
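Besides split("?"), Ruby's stdlib URI class parses the two parts out directly, which also copes with edge cases like a missing query string. Shown on the exact string from the question:

```ruby
require 'uri'

uri = URI.parse("/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla")
path  = uri.path   # the part before the "?"
query = uri.query  # the part after the "?" (nil if there is none)

puts path
puts query
```

URI.parse accepts a relative reference like this one; for full URLs it additionally exposes host, port, scheme and so on.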

Parsing Classes, Functions and Arguments in PHP

一笑奈何 Submitted on 2019-12-06 02:23:46
Question: I want to create a function which receives a single argument holding the path to a PHP file, parses the given file, and returns something like this:
class NameOfTheClass
  function Method1($arg1, $arg2, $arg2)
  private function Method2($arg1, $arg2, $arg2)
  public function Method2($arg1, $arg2, $arg2)
abstract class AnotherClass
  function Method1($arg1, $arg2, $arg2)
  private function Method2($arg1, $arg2, $arg2)
  public function Method2($arg1, $arg2, $arg2)
function SomeFunction($arg1,
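In PHP itself the building blocks for this are the tokenizer extension (token_get_all()) and the Reflection classes. As a cross-language sketch of the outline-building idea — this is Python's stdlib ast doing the equivalent extraction over Python source, not a PHP implementation:

```python
import ast

def outline(source: str) -> list[str]:
    """Return a flat outline of classes, their methods, and top-level functions."""
    lines = []
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"  def {item.name}({args})")
        elif isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
    return lines
```

The same walk over token_get_all() output would watch for T_CLASS, T_FUNCTION and visibility tokens instead of AST node types.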

c++ tokenize std string [duplicate]

﹥>﹥吖頭↗ Submitted on 2019-12-06 01:59:34
Question: This question already has answers here (closed 7 years ago). Possible duplicate: How do I tokenize a string in C++? Hello, I was wondering how I would tokenize a std::string with strtok:
string line = "hello, world, bye";
char * pch = strtok(line.c_str(), ",");
I get the following errors:
error: invalid conversion from 'const char*' to 'char*'
error: initializing argument 1 of 'char* strtok(char*, const char*)'
I'm looking for a quick and easy approach to this, as I don't think it requires much
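The error is exactly what the compiler says: strtok() writes into its argument, so it needs a mutable char buffer, and c_str() returns a const char*. Rather than copying the string into a writable buffer, the idiomatic C++ route is std::getline with a delimiter on a stringstream — a minimal sketch:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a string on a single-character delimiter without strtok's
// mutable-buffer requirement (and without its hidden global state).
std::vector<std::string> tokenize(const std::string& line, char delim) {
    std::vector<std::string> tokens;
    std::istringstream stream(line);
    std::string token;
    while (std::getline(stream, token, delim)) {
        tokens.push_back(token);   // note: keeps any leading spaces
    }
    return tokens;
}
```

For "hello, world, bye" with ',' this produces "hello", " world", " bye"; trim the tokens afterwards if the spaces matter.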

Advanced tokenizer for a complex math expression

[亡魂溺海] Submitted on 2019-12-06 00:52:30
I would like to tokenize a string that consists of integers, floats, operators, functions, variables and parentheses. The following example should illustrate the essence of the problem. Current state:
String infix = 4*x+5.2024*(Log(x,y)^z)-300.12
Desired state:
String tokBuf[0]=4
String tokBuf[1]=*
String tokBuf[2]=x
String tokBuf[3]=+
String tokBuf[4]=5.2024
String tokBuf[5]=*
String tokBuf[6]=(
String tokBuf[7]=Log
String tokBuf[8]=(
String tokBuf[9]=x
String tokBuf[10]=,
String tokBuf[11]=y
String tokBuf[12]=)
String tokBuf[13]=^
String tokBuf[14]=z
String tokBuf[15]=)
String tokBuf[16]=-
String
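A grammar this regular (floats, integers, identifiers, single-character operators and punctuation) fits a single alternation regex, with the longer patterns listed first so "5.2024" is not split into "5", ".", "2024". A sketch in Python — the question's own code looks like Java, where the same pattern works with Pattern/Matcher:

```python
import re

# Order matters: float before int, so the float alternative wins.
# Identifiers cover both variables (x) and function names (Log).
# Caveat: re.findall silently skips characters no alternative matches.
TOKEN_RE = re.compile(r"\d+\.\d+|\d+|[A-Za-z_]\w*|[-+*/^(),]")

def tokenize_expr(infix: str) -> list[str]:
    return TOKEN_RE.findall(infix)

print(tokenize_expr("4*x+5.2024*(Log(x,y)^z)-300.12"))
```

Note that this stage treats "-" purely as a token; deciding whether it is binary minus or unary negation belongs to the parser, not the tokenizer.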

Tokenize .htaccess files

走远了吗. Submitted on 2019-12-05 16:31:44
Bet you didn't see this coming? ;) So, a project of mine requires that I specifically read and make sense of .htaccess files. Sadly, searching on Google only yields the infinite woes of people trying to get their own .htaccess to work (sorry, couldn't resist the comment). Anyway, I'm a bit wary of trying to extract this from the open-source projects that use it. See, in the past few weeks I ended up wasting a lot of time on that strategy, only to find out that I'd have done better to read the RFCs and specs and build the thing my way. So, if you know about a library, or any
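At the line level, .htaccess syntax is close to shell syntax: a directive name followed by whitespace-separated arguments, with double-quoted arguments and "#" comments. A sketch of that first tokenizing pass using Python's stdlib shlex (block directives such as <IfModule>...</IfModule> would still need their own handling on top of this):

```python
import shlex

# Split one .htaccess directive line into words, respecting
# double-quoted arguments and stripping "#" comments.
def tokenize_directive(line: str) -> list[str]:
    return shlex.split(line, comments=True)

print(tokenize_directive('RewriteRule "^/old path" /new [R=301]  # redirect'))
```

This is only the lexer half of the problem; interpreting the directives still requires the per-module documentation, as the question's RFCs-and-specs experience suggests.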

NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

梦想的初衷 Submitted on 2019-12-05 10:44:14
I am trying to tokenize text using RegexpTokenizer. Code:
from nltk.tokenize import RegexpTokenizer
#from nltk.tokenize import word_tokenize
line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokenizer = RegexpTokenizer(pattern)
print tokenizer.tokenize(line)
#print word_tokenize(line)
Output:
['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J', '.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']
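The culprit is the pattern being a plain string: Python turns the "\b" into a backspace character (\x08) before the regex engine ever sees it, so the word-boundary assertion never fires and the abbreviation branch fails. Writing the pattern as a raw string fixes it. Demonstrated here with re.findall, which is what RegexpTokenizer uses under the hood for a non-gap pattern, so the same raw string plugged into RegexpTokenizer behaves identically:

```python
import re

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"

# Raw string: r'...' keeps "\b" as a word-boundary assertion instead of
# letting Python decode it to a backspace character.
pattern = r'[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokens = re.findall(pattern, line)
print(tokens)
```

With the raw string, abbreviations like "U.S.A" and "U.S.A." come out as single tokens instead of letter-by-letter.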

Android and the CommaTokenizer

两盒软妹~` Submitted on 2019-12-05 05:43:34
Question: I need a Tokenizer (for the AutoCompleteTextView) which can do the following:
1) Two words must be recognized as such when separated by a blank character.
2) Two words must also be recognized as such when separated by a newline ("Enter" pressed).
Point 1 is working, but how can I accomplish point 2?
public class SpaceTokenizer implements Tokenizer {
    @Override
    public int findTokenStart(CharSequence text, int cursor) {
        int i = cursor;
        while (i > 0 && (text.charAt(i - 1) != ' ')) {
            i--;
        }
        while (i < cursor &&
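The usual fix is to treat '\n' exactly like ' ' wherever the Tokenizer currently checks only for a space — in findTokenStart, findTokenEnd and terminateToken alike. Sketched below as a plain-Java helper without the Android Tokenizer interface (class and method placement are mine) so the boundary rule itself is easy to exercise:

```java
// Stand-in for the findTokenStart logic of an Android
// MultiAutoCompleteTextView.Tokenizer: a token starts after either
// a space or a newline, not only after a space.
public class SpaceOrNewlineTokenizer {
    public static int findTokenStart(CharSequence text, int cursor) {
        int i = cursor;
        while (i > 0 && text.charAt(i - 1) != ' ' && text.charAt(i - 1) != '\n') {
            i--;
        }
        return i;
    }
}
```

Inside the real SpaceTokenizer, the same two-character condition replaces each `!= ' '` test.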