tokenize

How to split a file into words on the Unix command line?

江枫思渺然 submitted on 2019-11-30 10:59:08
I'm running quick tests for a naive Boolean information retrieval system, and I would like to use awk, grep, egrep, sed, or something similar, plus pipes, to split a text file into words and save them to another file with one word per line. For example, my file contains:

Hola mundo, hablo español y no sé si escribí bien la pregunta, ojalá me puedan entender y ayudar Adiós.

Then the output file should contain:

Hola
mundo
hablo
español
...

Thanks!

Using tr:

tr -s '[[:punct:][:space:]]' '\n' < file

The simplest tool is fmt:

fmt -1 <your-file

fmt is designed to break lines to fit the specified width, and if you provide
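
For comparison (a sketch of my own, not from the thread), the same one-word-per-line split in Python; \w+ keeps accented letters such as 'ñ' together because Python 3 regular expressions are Unicode-aware by default:

import re

# print every run of word characters on its own line
with open('file', encoding='utf-8') as f:
    for word in re.findall(r'\w+', f.read()):
        print(word)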

How to prevent splitting specific words or phrases and numbers in NLTK?

久未见 submitted on 2019-11-30 09:43:02
Question: I have a problem with text matching when I tokenize text: the tokenizer splits specific words, dates, and numbers. How can I prevent phrases like "runs in my family", "30 minute walk", or "4x a day" from being split when tokenizing words in NLTK? They should not result in: ['runs','in','my','family','4x','a','day'] For example: Yes 20-30 minutes a day on my bike, it works great!! gives: ['yes','20-30','minutes','a','day','on','my','bike',',','it','works','great'] I want '20-30 minutes' to be
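
The entry is cut off above, but NLTK's MWETokenizer is the standard tool for keeping multiword expressions together: tokenize normally, then re-join the listed token sequences. A minimal sketch (the phrase list is illustrative):

from nltk.tokenize import MWETokenizer, word_tokenize

# each tuple lists the word_tokenize output that should be glued back together
mwe = MWETokenizer([('20-30', 'minutes'), ('runs', 'in', 'my', 'family')],
                   separator=' ')
tokens = mwe.tokenize(word_tokenize("Yes 20-30 minutes a day on my bike, it works great!!"))
# tokens now contains '20-30 minutes' as a single item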

Tokenizing unicode using nltk

删除回忆录丶 submitted on 2019-11-30 08:14:46
I have text files in UTF-8 encoding that contain characters like 'ö', 'ü', etc. I would like to parse the text from these files, but I can't get the tokenizer to work properly. With the standard NLTK tokenizer:

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()  # was f.close, which never actually calls the method
items = text.decode('utf8')
a = nltk.word_tokenize(items)

the output is: [u'\ufeff', u'm', u'\xfc', u'sli', u'p', u'\xf6', u'\xf6', u'k', u'r', u'\xe4', u'\xe4', u'k'] The Punkt tokenizer seems to do better:

f = open(r'C:\Python26\text.txt', 'r')  # text = 'müsli pöök rääk'
text = f.read()
f.close()
items =
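
The u'\ufeff' at the front of the output is the file's byte-order mark, and letter-by-letter splits are typical of decoding problems. A minimal sketch that sidesteps both (assuming the file really is UTF-8; io.open works on Python 2 and 3):

import io
import nltk

# 'utf-8-sig' decodes the file and strips the BOM (the u'\ufeff' above)
with io.open(r'C:\Python26\text.txt', encoding='utf-8-sig') as f:
    text = f.read()
tokens = nltk.word_tokenize(text)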

How to use sklearn's CountVectorizer() to get n-grams that include any punctuation as separate tokens?

久未见 submitted on 2019-11-30 07:34:53
Question: I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

This outputs: 4-grams: [u'like python it pretty', u'python it pretty awesome', u
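
The default token_pattern, (?u)\b\w\w+\b, only keeps runs of two or more word characters, which is why the comma and apostrophe vanish above. One way to keep punctuation marks as separate tokens is a custom token_pattern (a sketch; the regex is mine, not from the thread):

import sklearn.feature_extraction.text

vect = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(4, 4),
    token_pattern=r"\w+|[^\w\s]",  # a word, or any single punctuation character
)
vect.fit(["I really like python, it's pretty awesome."])
print(vect.get_feature_names())  # the n-grams now include ',' and "'" as tokens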

Pythonic way to implement a tokenizer

为君一笑 submitted on 2019-11-30 06:54:55
I'm going to implement a tokenizer in Python, and I was wondering if you could offer some style advice. I've implemented a tokenizer before in C and in Java, so I'm fine with the theory; I'd just like to ensure I'm following Pythonic style and best practices. Listing token types: in Java, for example, I would have a list of fields like so: public static final int TOKEN_INTEGER = 0 But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this with normal variable declarations, but that doesn't strike me as a great solution, since the declarations
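
The entry is truncated, but the usual answer is a generator driven by one combined regex whose named groups replace the Java-style integer constants. A sketch (the token set is illustrative):

import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

TOKEN_SPEC = [
    ('INTEGER', r'\d+'),
    ('IDENT',   r'[A-Za-z_]\w*'),
    ('OP',      r'[+\-*/=]'),
    ('SKIP',    r'\s+'),
]
MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    # yield tokens lazily; the matching group's name doubles as the token type
    for m in MASTER_RE.finditer(text):
        if m.lastgroup != 'SKIP':
            yield Token(m.lastgroup, m.group())

print(list(tokenize('x = 40 + 2')))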

Split a string into an array in C++ [duplicate]

独自空忆成欢 submitted on 2019-11-30 04:59:52
Possible duplicate: How to split a string in C++? I have an input file of data, and each line is an entry. Within each line, each "field" is separated by a whitespace " ", so I need to split the line by spaces. Other languages have a function called split (C#, PHP, etc.), but I can't find one for C++. How can I achieve this? Here is my code that gets the lines:

string line;
ifstream in(file);
while (getline(in, line)) {
    // here I would like to split each line and put the fields into an array
}

Nawaz (the answer is cut off in the source; its includes point at the standard istream_iterator idiom, roughly):

#include <sstream>   // for std::istringstream
#include <iterator>  // for std::istream_iterator
#include <vector>    // for std::vector

std::istringstream iss(line);
std::vector<std::string> tokens((std::istream_iterator<std::string>(iss)),
                                std::istream_iterator<std::string>());

Tokenization of Arabic words using NLTK

微笑、不失礼 submitted on 2019-11-30 03:08:57
I'm using NLTK's word_tokenize to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The code I'm writing is:

import re
import nltk

lex = u" في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"
wordsArray = nltk.word_tokenize(lex)
print " ".join(wordsArray)

The problem is that word_tokenize doesn't split on words. Instead, it splits into letters, so the output is: "ف ي _ ب ي ت ن ا ك ل ش ي ل م ا ت ح ت ا ج ه ي ض ي ع ... ا د و ر ع ل ى ش ا ح ن ف
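
Letter-by-letter output like this usually means the tokenizer saw encoded bytes rather than a Unicode string (the print statement and u'' literal above indicate Python 2). A minimal check, assuming Python 3, where source strings are Unicode by default:

import nltk

lex = "في_بيتنا كل شي لما تحتاجه يضيع"
print(nltk.word_tokenize(lex))  # whole words such as 'في_بيتنا', not single letters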

PHP namespace removal / mapping and rewriting identifiers

心不动则不痛 submitted on 2019-11-29 23:52:31
I'm attempting to automate the removal of namespaces from a PHP class collection to make them PHP 5.2 compatible. (Shared hosting providers do not fancy rogue PHP 5.3 installations. No idea why. Also, the code in question doesn't use any 5.3 feature additions, just the namespace syntax. Autoconversion seems easier than doing it by hand or reimplementing the codebase.) To rewrite the *.php scripts, I'm basically iterating over a token list. The identifier searching and merging is already complete, but I'm a bit confused now about how to accomplish the actual rewriting. function rewrite($name, $namespace, $use)
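
The entry breaks off at the rewrite function's signature. The core of such a rewrite is resolving each identifier against the current namespace and the use imports, then flattening the separators into a 5.2-safe name. A rough sketch of that resolution logic (written in Python for illustration; the rules are simplified relative to real PHP name-resolution semantics):

def flatten(name, namespace, uses):
    # map a namespaced PHP name onto a flat, PHP 5.2-safe identifier
    if name.startswith('\\'):                 # fully qualified: \Foo\Bar
        resolved = name.lstrip('\\')
    else:
        first = name.split('\\', 1)[0]
        if first in uses:                     # alias imported via a `use` statement
            resolved = uses[first] + name[len(first):]
        elif namespace:                       # relative to the current namespace
            resolved = namespace + '\\' + name
        else:
            resolved = name
    return resolved.replace('\\', '_')

# e.g. flatten('Bar\\Baz', 'App', {'Bar': 'Vendor\\Bar'}) -> 'Vendor_Bar_Baz'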

Cannot use ICUTokenizerFactory in Solr

守給你的承諾、 submitted on 2019-11-29 23:47:30
Question: I am trying to use ICUTokenizerFactory in a Solr schema. This is how I have defined the field and fieldType:

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="fld_icu" type="text_icu" indexed="true" stored="true"/>

And when I start Solr, I get this error: Plugin init failure for [schema.xml] fieldType "text_icu": Plugin init failure for [schema.xml] analyzer/tokenizer:
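
The error message is cut off, but a plugin init failure on solr.ICUTokenizerFactory usually means the ICU analysis jars are not on Solr's classpath: the factory ships in the analysis-extras contrib, not in the core. A sketch of the <lib> directives for solrconfig.xml (the dir paths depend on your Solr version and layout, so adjust them):

<!-- load lucene-analyzers-icu and icu4j from the analysis-extras contrib -->
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />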