tokenize

Using escaped_list_separator with boost split

Submitted by 夙愿已清 on 2019-12-04 10:54:11
I am playing around with the Boost string algorithms library and have just come across the awesome simplicity of the split method.

    string delimiters = ",";
    string str = "string, with, comma, delimited, tokens, \"and delimiters, inside a quote\"";

    // If we didn't care about delimiter characters within a quoted section we could use:
    vector<string> tokens;
    boost::split(tokens, str, boost::is_any_of(delimiters));

    // gives the wrong result:
    // tokens = {"string", " with", " comma", " delimited", " tokens", "\"and delimiters", " inside a quote\""}

Which would be nice and concise... however it doesn't seem to work when the delimiter appears inside a quoted section.
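The usual answer here is boost::tokenizer with escaped_list_separator, which respects quoting, rather than boost::split. As a quick illustration of the quoted-delimiter behaviour being asked for, here is a minimal Python sketch using the standard csv module (an analogue of the idea, not the Boost API):

    import csv
    import io

    s = 'string, with, comma, delimited, tokens, "and delimiters, inside a quote"'
    # csv.reader honours quoting, so the comma inside the quoted section
    # does not split the field; skipinitialspace trims the leading blanks
    tokens = next(csv.reader(io.StringIO(s), skipinitialspace=True))
    print(tokens)
    # ['string', 'with', 'comma', 'delimited', 'tokens', 'and delimiters, inside a quote']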

How to convert fields to rows in Pig?

Submitted by 只愿长相守 on 2019-12-04 10:45:10
I want to convert fields to rows in Pig. From input.txt:

    1 2 3 4 5 6 7 8 9

The delimiter between fields is '\t'. To output.txt:

    1
    2
    3
    4
    ...

But I must not use TOKENIZE, because the content of a field might be a sentence. Please help me. Many thanks.

I think alexeipab's answer is the right direction. Here is a simple example:

    > A = load 'input.txt';
    > dump A
    (0,1,2,3,4,5,6,7,8,9)
    > B = foreach A generate FLATTEN(TOBAG(*));
    > dump B
    (0)
    (1)
    (2)
    (3)
    (4)
    (5)
    (6)
    (7)
    (8)
    (9)

mfunaro: I ran into a very similar issue using Pig. What I ended up doing was writing a UDF that would simply iterate through the
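For readers without a Pig runtime handy, here is a rough Python sketch of what FLATTEN(TOBAG(*)) does to each record (an illustrative analogue, not Pig itself; input.txt is the tab-delimited file from the question):

    # One tab-separated input record becomes one output row per field,
    # with no tokenization of the field contents themselves.
    with open('input.txt') as f:
        for line in f:
            for field in line.rstrip('\n').split('\t'):
                print(field)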

Tokenizing strings using regular expression in Javascript

Submitted by 試著忘記壹切 on 2019-12-04 10:36:59
Suppose I have a long string containing newlines and tabs, such as:

    var x = "This is a long string.\n\t This is another one on next line.";

How can we split this string into tokens using a regular expression? I don't want to use .split(' ') because I want to learn JavaScript's regex. A more complicated string could be this:

    var y = "This @is a #long $string. Alright, lets split this.";

Now I want to extract only the valid words out of this string, without special characters and punctuation, i.e. I want these:

    var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on",
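Since the goal is regex-based word extraction, here is a minimal sketch of that step, written in Python for brevity (the JavaScript version would use String.prototype.match with the g flag, e.g. y.match(/[a-zA-Z]+/g)):

    import re

    y = "This @is a #long $string. Alright, lets split this."
    # [A-Za-z]+ matches runs of letters only, dropping @, #, $ and punctuation;
    # \w+ would also keep digits and underscores
    words = re.findall(r'[A-Za-z]+', y)
    print(words)
    # ['This', 'is', 'a', 'long', 'string', 'Alright', 'lets', 'split', 'this']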

Is there a bi gram or tri gram feature in Spacy?

Submitted by 此生再无相见时 on 2019-12-04 10:08:50
The code below breaks the sentence into individual tokens, and the output is as shown:

    "cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"

    import en_core_web_sm
    nlp = en_core_web_sm.load()
    doc = nlp("Cloud computing is benefiting major manufacturing companies")
    for token in doc:
        print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one word. Basically I am looking for a bigram. Is there any feature in spaCy that allows bigrams or trigrams?

spaCy allows detection of noun chunks. So to parse your noun phrases as single
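A minimal sketch of the noun-chunk route (assuming spaCy v2+, where doc.noun_chunks and the retokenizer are available):

    import en_core_web_sm

    nlp = en_core_web_sm.load()
    doc = nlp("Cloud computing is benefiting major manufacturing companies")

    # noun_chunks yields multi-word noun phrases such as "Cloud computing"
    print([chunk.text for chunk in doc.noun_chunks])

    # merge each noun phrase back into a single token
    with doc.retokenize() as retokenizer:
        for chunk in list(doc.noun_chunks):
            retokenizer.merge(chunk)
    print([token.text for token in doc])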

Tokenize, remove stop words using Lucene with Java

Submitted by こ雲淡風輕ζ on 2019-12-04 09:39:46
Question: I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:

    public String removeStopWords(String string) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("an");
        stopWords.add("I");
        stopWords.add("the");
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
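Note that the snippet passes a plain HashSet to StopFilter; in Lucene 4.3 the stop set is normally a CharArraySet, and the stream must be reset before it is consumed. Setting the Lucene API aside, the logic being attempted is simple; here is a tiny Python sketch of the same tokenize-and-filter idea (a hypothetical analogue, not the Lucene API):

    # hypothetical helper: whitespace tokenization plus stop-word filtering
    def remove_stop_words(text):
        stop_words = {"a", "an", "i", "the"}
        return " ".join(t for t in text.split() if t.lower() not in stop_words)

    print(remove_stop_words("I saw the cat"))  # -> "saw cat"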

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

Submitted by 女生的网名这么多〃 on 2019-12-04 09:37:07
Question: I want to include hyphenated words, for example long-term, self-esteem, etc., as a single token in spaCy. After looking at some similar posts on Stack Overflow, GitHub, its documentation and elsewhere, I also wrote a custom tokenizer as below:

    import re
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''^[\[\("']''')
    suffix_re = re.compile(r'''[\]\)"']$''')
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab, prefix_search
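For reference, a sketch of how the truncated custom_tokenizer above is usually completed, wiring the three regexes into spaCy's Tokenizer (the key point being that the infix class contains no hyphen, so intra-word hyphens never trigger a split; details vary by spaCy version):

    import re
    import spacy
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''^[\[\("']''')
    suffix_re = re.compile(r'''[\]\)"']$''')
    infix_re = re.compile(r'''[.,?:;…‘’`“”"'~]''')  # note: no hyphen in this class

    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab,
                         prefix_search=prefix_re.search,
                         suffix_search=suffix_re.search,
                         infix_finditer=infix_re.finditer)

    nlp = spacy.load('en_core_web_sm')
    nlp.tokenizer = custom_tokenizer(nlp)
    print([t.text for t in nlp('long-term self-esteem')])
    # ['long-term', 'self-esteem']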

Splitting chinese document into sentences [closed]

Submitted by 雨燕双飞 on 2019-12-04 09:18:27
I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreprocessor. It worked quite well for English but not for Chinese. Please can you let me know any good sentence splitters for Chinese, preferably in Java or Python?

Using some regex tricks in Python (cf. a modified regex from Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf ):

    import re
    paragraph = u'\u70ed
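The regex trick boils down to splitting on Chinese sentence-final punctuation such as 。！？. A minimal sketch of that idea (my simplification, not the paper's exact regex):

    # -*- coding: utf-8 -*-
    import re

    paragraph = u'你好。今天天气不错！我们走吗？'
    # keep each run of non-terminal characters plus its trailing terminator
    sentences = re.findall(u'[^。！？]+[。！？]?', paragraph)
    for s in sentences:
        print(s)
    # 你好。 / 今天天气不错！ / 我们走吗？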

Parsing URL string in Ruby

Submitted by 我怕爱的太早我们不能终老 on 2019-12-04 08:13:31
I have a pretty simple string I want to parse in Ruby, and I am trying to find the most elegant solution. The string is of the format:

    /xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla

What I would like to have is:

    string1: /xyz/mov/exdaf/daeed.mov
    string2: arg1=blabla&arg2=3bla3bla

So basically tokenise on the ?, but I can't find a good example. Any help would be appreciated.

Split the initial string on question marks:

    str.split("?")
    => ["/xyz/mov/exdaf/daeed.mov", "arg1=blabla&arg2=3bla3bla"]

I think the best solution would be to use the URI module. (You can do things like URI.parse('your_uri_string').query
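For comparison, the same path/query split in Python's standard library (an analogue of the Ruby URI approach, not Ruby code):

    from urllib.parse import urlsplit

    s = '/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla'
    parts = urlsplit(s)
    print(parts.path)   # /xyz/mov/exdaf/daeed.mov
    print(parts.query)  # arg1=blabla&arg2=3bla3bla

The advantage over a bare split on '?' is that the URL grammar, not string position, decides where the query begins.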

bad zip file error in POS tagging in NLTK in python

Submitted by 。_饼干妹妹 on 2019-12-04 06:38:42
Question: I am new to Python and NLTK. I want to do word tokenization and POS tagging. I installed NLTK 3.0 on Ubuntu 14.04 with the default Python 2.7.6. First I tried to tokenize a simple sentence, but I get an error telling me "BadZipfile: File is not a zip file". How do I solve this? One more doubt: I gave the path "/usr/share/nltk_data" when I installed the NLTK data (using the command line). Some of the packages couldn't be installed due to some errors, but it shows other
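A common cause of BadZipfile is a partially downloaded package under nltk_data, and re-downloading it typically fixes the error. A minimal sketch (package names vary across NLTK versions; the download_dir matches the path from the question):

    import nltk

    # re-fetch the tokenizer and tagger models over the corrupt copies
    nltk.download('punkt', download_dir='/usr/share/nltk_data')
    nltk.download('maxent_treebank_pos_tagger', download_dir='/usr/share/nltk_data')

    tokens = nltk.word_tokenize('This is a simple sentence')
    print(nltk.pos_tag(tokens))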

Parsing Classes, Functions and Arguments in PHP

Submitted by 懵懂的女人 on 2019-12-04 06:32:15
I want to create a function which receives a single argument that holds the path to a PHP file, then parses the given file and returns something like this:

    class NameOfTheClass
        function Method1($arg1, $arg2, $arg2)
        private function Method2($arg1, $arg2, $arg2)
        public function Method2($arg1, $arg2, $arg2)

    abstract class AnotherClass
        function Method1($arg1, $arg2, $arg2)
        private function Method2($arg1, $arg2, $arg2)
        public function Method2($arg1, $arg2, $arg2)

    function SomeFunction($arg1, $arg2, $arg3)

This function should return all the classes, methods and functions that exist in the given
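In PHP the standard building block for this is the built-in tokenizer, token_get_all(), walking the token stream for T_CLASS and T_FUNCTION. As a compact illustration of the same walk-the-parse-tree idea, here is a Python sketch that lists the classes, methods and functions of a Python file via the ast module (an analogue, not a PHP parser):

    import ast

    def list_definitions(path):
        with open(path) as f:
            tree = ast.parse(f.read())
        # walk every node; report class and function definitions with their args
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                print('class', node.name)
            elif isinstance(node, ast.FunctionDef):
                args = ', '.join(a.arg for a in node.args.args)
                print('function %s(%s)' % (node.name, args))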