tokenize

nltk sentence tokenizer, consider new lines as sentence boundary

江枫思渺然 submitted on 2019-12-03 11:00:01
I am using nltk's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat a new paragraph or new line as a new sentence.

    >>> from nltk.tokenize.punkt import PunktSentenceTokenizer
    >>> tokenizer = PunktSentenceTokenizer()
    >>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
    ['Sentence 1 \n Sentence 2.', 'Sentence 3.']
    >>> tokenizer.span_tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
    [(0, 24), (25, 36)]

I would like it to treat new lines as sentence boundaries as well. Is there any way to do this? (I need to save the offsets too
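
One workaround (a sketch, not a built-in NLTK option): split the text on newlines first, run Punkt inside each line, and shift the per-line spans back into offsets of the full text so the character positions are preserved.

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    def sentence_spans(text):
        tokenizer = PunktSentenceTokenizer()
        spans = []
        offset = 0
        for line in text.split('\n'):
            for start, end in tokenizer.span_tokenize(line):
                # trim surrounding whitespace so each span hugs the sentence text
                while start < end and line[start].isspace():
                    start += 1
                while end > start and line[end - 1].isspace():
                    end -= 1
                if start < end:
                    spans.append((offset + start, offset + end))
            offset += len(line) + 1  # +1 for the '\n' removed by split()
        return spans

    text = 'Sentence 1 \n Sentence 2. Sentence 3.'
    spans = sentence_spans(text)
    print(spans)                          # [(0, 10), (13, 24), (25, 36)]
    print([text[s:e] for s, e in spans])  # ['Sentence 1', 'Sentence 2.', 'Sentence 3.']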

Generating PHP code (from Parser Tokens)

二次信任 submitted on 2019-12-03 07:26:59
Is there any available solution for (re-)generating PHP code from the parser tokens returned by token_get_all? Other solutions for generating PHP code are welcome as well, preferably with the associated lexer/parser (if any).

If I'm not mistaken, http://pear.php.net/package/PHP_Beautifier uses token_get_all() and then rewrites the stream. It uses heaps of methods like t_else and t_close_brace to output each token. Maybe you can hijack this for simplicity.

From my comment: Does anyone see a potential problem if I simply write a large switch statement to convert tokens back to their string
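
For what it's worth, the round-trip idea is easy to demonstrate outside PHP: a lexer whose tokens keep their original text can be rewritten and re-emitted as source. A sketch using Python's own tokenize module as a stand-in (not token_get_all; the PHP analogue would be concatenating each token's text):

    import io
    import tokenize

    source = "def add(a, b):\n    return a + b\n"
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

    # Rewrite one token, then regenerate source from the (type, text) pairs.
    # Note: with 2-tuples, untokenize() only guarantees the result tokenizes
    # back to the same stream; spacing may differ from the original.
    rewritten = [
        (tok.type, "plus" if tok.string == "add" else tok.string)
        for tok in tokens
    ]
    print(tokenize.untokenize(rewritten))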

How to parse / tokenize an SQL statement in Node.js [closed]

≡放荡痞女 submitted on 2019-12-03 05:51:00
I'm looking for a way to parse / tokenize an SQL statement within a Node.js application, in order to:

- Tokenize all the "basic" SQL keywords defined in the ISO/IEC 9075 standard or here.
- Validate the SQL syntax.
- Find out what the query is going to do (e.g. read or write?).

Do you have any
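
The question asks for Node.js, so treat the following only as an illustration of the general approach, in Python with the sqlparse library (which is non-validating, so the syntax-validation point still needs a real parser): tokenize the statement, then classify what it does.

    import sqlparse  # third-party: pip install sqlparse

    sql = "SELECT id, name FROM users WHERE age > 21;"
    statement = sqlparse.parse(sql)[0]

    # Tokenize: every leaf token with its token type
    for tok in statement.flatten():
        print(tok.ttype, repr(tok.value))

    # Classify: is the statement going to read or write?
    print(statement.get_type())  # 'SELECT', i.e. a read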

NLTK tokenize - faster way?

与世无争的帅哥 submitted on 2019-12-03 05:29:39
I have a method that takes in a String parameter and uses NLTK to break the String down into sentences, then into words. Afterwards, it converts each word to lowercase and finally creates a dictionary of the frequency of each word.

    import nltk
    from collections import Counter

    def freq(string):
        f = Counter()
        sentence_list = nltk.tokenize.sent_tokenize(string)
        for sentence in sentence_list:
            words = nltk.word_tokenize(sentence)
            words = [word.lower() for word in words]
            for word in words:
                f[word] += 1
        return f

I'm supposed to optimize the above code further to result in faster preprocessing time,
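
A sketch of the usual speedups (tokens may differ slightly from the original per-sentence run, so benchmark on your own data): lowercase once up front, let word_tokenize handle sentence splitting internally, and hand Counter the whole token list in one call; if Treebank-exact tokens are not required, a regex tokenizer avoids the Punkt/Treebank overhead entirely.

    import nltk
    from collections import Counter
    from nltk.tokenize import RegexpTokenizer

    def freq_fast(string):
        # one lowercase pass, one tokenize call, one Counter call
        return Counter(nltk.word_tokenize(string.lower()))

    _word_re = RegexpTokenizer(r"\w+|[^\w\s]+")

    def freq_regex(string):
        # approximate tokens, but much less work per character
        return Counter(_word_re.tokenize(string.lower()))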

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

三世轮回 submitted on 2019-12-03 04:42:46
I recently added source file parsing to an existing tool that generated output files from complex command line arguments. The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it were a very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax. I used flex 2.5.4 for Windows to generate the tokenizer for this custom source file format, and it worked.

Tokenize, remove stop words using Lucene with Java

北慕城南 submitted on 2019-12-03 04:41:16
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:

    public String removeStopWords(String string) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("an");
        stopWords.add("I");
        stopWords.add("the");
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream
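
Not a Lucene answer, but a plain-Python stand-in for the same pipeline (tokenize, case-insensitive stop-word filter, rejoin) can be handy for checking the output you expect; the stop list mirrors the one in the question.

    import re

    STOP_WORDS = {"a", "an", "i", "the"}

    def remove_stop_words(text):
        tokens = re.findall(r"\w+", text.lower())
        return " ".join(t for t in tokens if t not in STOP_WORDS)

    print(remove_stop_words("I have an apple and the book"))
    # 'have apple and book'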

Is there a tokenizer for a cpp file

强颜欢笑 submitted on 2019-12-03 04:04:39
I have a cpp file with a huge class implementation. Now I have to modify the source file itself. For this, is there a library/API/tool that will tokenize this file for me and give me one token each time I request? My requirement is as below:

    OpenCPPFile()
    While (!EOF)
        token = GetNextToken();
        process something based on this token
    EndWhile

I am happy now. Regards, AJ

Boost.Wave offers a standard C++ lexer, among many other tools like a standard preprocessor, which are built on top of Boost.Spirit. Check the following sample in the boost directory: C:\boost\libs\wave\samples\lexed_tokens For
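
If pulling in Boost.Wave is too heavy, one lighter stand-in (an illustration, not what the answer above recommends) is driving the token loop from Python with Pygments' C++ lexer; the token categories are Pygments' own rather than the C++ standard's, and the file name below is just a placeholder.

    from pygments.lexers import CppLexer  # third-party: pip install pygments

    def cpp_tokens(path):
        with open(path, encoding="utf-8", errors="replace") as fh:
            source = fh.read()
        for token_type, value in CppLexer().get_tokens(source):
            if not value.isspace():  # skip pure-whitespace tokens
                yield token_type, value

    # usage (hypothetical file name):
    # for tok_type, text in cpp_tokens("BigClass.cpp"):
    #     print(tok_type, repr(text))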

Word break in languages without spaces between words (e.g., Asian)?

假如想象 submitted on 2019-12-03 03:53:11
I'd like to make MySQL full-text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages, and probably others, do not normally have white space between words. Search is not useful when you must type the same sentence as appears in the text. I cannot just put a space between every character, because English must work too. I would like to solve this problem with PHP or MySQL. Can I configure MySQL to recognize characters which should be their own
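
A common workaround, sketched below rather than given as a MySQL configuration: break runs of CJK characters into overlapping character bigrams before indexing and before searching, while leaving space-delimited words alone, so a whitespace-based full-text index can still match them. The character ranges here cover common kana and CJK ideographs only.

    import re

    CJK = "\u3040-\u30ff\u4e00-\u9fff"
    CHUNKS = re.compile(f"[{CJK}]+|[^\\s{CJK}]+")

    def segment(text):
        tokens = []
        for chunk in CHUNKS.findall(text):
            if re.match(f"[{CJK}]", chunk):
                # overlapping bigrams: "东京都" -> "东京", "京都"
                tokens.extend(chunk[i:i + 2] for i in range(max(len(chunk) - 1, 1)))
            else:
                tokens.append(chunk)
        return tokens

    print(segment("I live in 东京都 now"))
    # ['I', 'live', 'in', '东京', '京都', 'now']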

Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

风流意气都作罢 submitted on 2019-12-03 03:28:30
I want to include hyphenated words, for example long-term, self-esteem, etc., as a single token in spaCy. After looking at some similar posts on Stack Overflow, GitHub, its documentation and elsewhere, I also wrote a custom tokenizer as below:

    import re
    from spacy.tokenizer import Tokenizer

    prefix_re = re.compile(r'''^[\[\("']''')
    suffix_re = re.compile(r'''[\]\)"']$''')
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

    def custom_tokenizer(nlp):
        return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                         suffix_search=suffix_re.search,
                         infix_finditer=infix_re.finditer,
                         token_match=None
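
The fix that usually works goes the other way around: instead of writing the infix regex from scratch, rebuild spaCy's default infix patterns and leave out the one rule that splits on hyphens between letters. A sketch adapted from spaCy's documentation on modifying tokenizer rule sets, using a blank English pipeline (a loaded model works the same way):

    import spacy
    from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
    from spacy.util import compile_infix_regex

    nlp = spacy.blank("en")

    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            # The default rule that splits on hyphens between letters is
            # deliberately omitted here, so "long-term" stays one token.
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )
    infix_re = compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_re.finditer

    print([t.text for t in nlp("long-term self-esteem matters")])
    # ['long-term', 'self-esteem', 'matters']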

Basic NLP in CoffeeScript or JavaScript — Punkt tokenization, simple trained Bayes models — where to start? [closed]

若如初见. submitted on 2019-12-03 03:15:16
My current web-app project calls for a little NLP:

- Tokenizing text into sentences, via Punkt and similar;
- Breaking down the longer sentences by subordinate clause (often it's on commas, except when it's not);
- A Bayesian model fit for chunking paragraphs with an even feel, no orphans or widows and minimal awkward splits (maybe).

... much of which is a childishly easy task if you've got NLTK, which I do, sort of: the app backend is Django on Tornado; you'd think doing these things would be a non-issue. However, I've got to interactively provide the user feedback for which the tokenizers are