tokenize

How can lexing efficiency be improved?

旧城冷巷雨未停 submitted on 2020-01-01 11:25:24
Question: When parsing a large 3 gigabyte file with a DCG, efficiency is important. The current version of my lexer mostly uses the "or" predicate ;/2, but I read that indexing can help. Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all …
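A minimal sketch (not the asker's lexer, and with hypothetical token names) of what first-argument indexing buys here: dispatch on the current character code through a helper predicate whose first argument is that code, one clause per code, instead of walking a ;/2 chain. With an integer in the first argument, the system can hash straight to the matching clause.

char_token(0'+, plus).
char_token(0'-, minus).
char_token(0'*, star).
char_token(0'=, equals).

token(T) --> [C], { char_token(C, T) }.

% ?- char_token(0'*, T.)  is resolved by indexing, without trying the other clauses.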

Elasticsearch wildcard search on not_analyzed field

余生长醉 submitted on 2020-01-01 09:24:15
Question: I have an index with the following settings and mapping:

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "keyword",
            "filter": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "analyzer": "analyzer_keyword",
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

I am struggling with an implementation of wildcard search on the name field. My example data looks like this:

[
  {"name": "SVF-123"},
  {"name": "SVF-234"}
]

When I perform the following query …
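The query in the question is cut off; for reference, this is the usual shape of a wildcard query against that field (a sketch; the index name is hypothetical). Note that wildcard queries are not analyzed, so against a not_analyzed field the pattern must match the stored term's exact case (SVF-1*), whereas a field indexed with the analyzer_keyword analyzer is lowercased and would need svf-1*.

POST /my_index/_search
{
  "query": {
    "wildcard": {
      "name": "SVF-1*"
    }
  }
}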

Padding multiple characters with spaces - Python

三世轮回 submitted on 2019-12-31 04:22:48
Question: In Perl, I can do the following, which will pad my punctuation symbols with spaces:

s/([،;؛¿!"\])}»›”؟%٪°±©®।॥…])/ $1 /g;

In Python, I've tried this:

>>> p = u'،;؛¿!"\])}»›”؟%٪°±©®।॥…'
>>> text = u"this, is a sentence with weird» symbols… appearing everywhere¿"
>>> for i in p:
...     text = text.replace(i, ' '+i+' ')
...
>>> text
u'this, is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf '
>>> print text
this, is a sentence with weird » symbols … appearing everywhere ¿

But …
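For comparison, here is a sketch of the same substitution done with re.sub rather than a replace() loop (Python 2 syntax to match the question; punct and pad_re are names I've introduced). re.escape keeps the ] and \ in the set from breaking the character class, and the last line collapses the doubled spaces that padding can introduce:

import re

punct = u'،;؛¿!"\\])}»›”؟%٪°±©®।॥…'
pad_re = re.compile(u'([' + re.escape(punct) + u'])')

text = u"this, is a sentence with weird» symbols… appearing everywhere¿"
padded = pad_re.sub(u' \\1 ', text)   # pad each matched symbol with spaces
padded = u' '.join(padded.split())    # collapse repeated whitespace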

How do I tokenize input using Java's Scanner class and regular expressions?

穿精又带淫゛_ submitted on 2019-12-30 03:34:08
Question: Just for my own purposes, I'm trying to build a tokenizer in Java where I can define a regular grammar and have it tokenize input based on that. The StringTokenizer class is deprecated, and I've found a couple of functions in Scanner that hint toward what I want to do, but no luck yet. Does anyone know a good way of going about this?

Answer 1: The name "Scanner" is a bit misleading, because the word is often used to mean a lexical analyzer, and that's not what Scanner is for. All it is is a substitute …
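The answer is truncated above; as a separate illustration (not from the answer), one common way to get "define a grammar, tokenize against it" in plain Java is a single alternation of named groups driven by java.util.regex. A sketch with a hypothetical three-token grammar:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    // one named-group alternative per token type
    private static final Pattern TOKENS = Pattern.compile(
        "(?<NUMBER>\\d+)|(?<IDENT>[A-Za-z_]\\w*)|(?<OP>[+\\-*/=])|(?<WS>\\s+)");

    public static void main(String[] args) {
        Matcher m = TOKENS.matcher("x1 = 42 + y");
        while (m.find()) {
            if (m.group("WS") != null) continue;   // skip whitespace
            String type = m.group("NUMBER") != null ? "NUMBER"
                        : m.group("IDENT")  != null ? "IDENT" : "OP";
            System.out.println(type + " " + m.group());
        }
    }
}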

Int tokenizer

谁说我不能喝 submitted on 2019-12-28 11:53:19
Question: I know there are string tokenizers, but is there an "int tokenizer"? For example, I want to split the string "12 34 46" and get:

list[0]=12
list[1]=34
list[2]=46

In particular, I'm wondering whether Boost::Tokenizer does this, although I couldn't find any examples that didn't use strings.

Answer 1: Yes there is: use a stream, e.g. a stringstream:

stringstream sstr("12 34 46");
int i;
while (sstr >> i)
    list.push_back(i);

Alternatively, you can also use STL algorithms and/or iterator adapters combined …
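The answer is cut off at "combined"; a common combination of STL algorithms and iterator adapters for this job is std::copy with istream_iterator and back_inserter, which may be what it goes on to show. A self-contained sketch of both versions:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::string input = "12 34 46";

    // loop version from the answer
    std::vector<int> list;
    std::stringstream sstr(input);
    int i;
    while (sstr >> i)
        list.push_back(i);

    // iterator-adapter version (my guess at the truncated alternative)
    std::vector<int> list2;
    std::istringstream sstr2(input);
    std::copy(std::istream_iterator<int>(sstr2),
              std::istream_iterator<int>(),
              std::back_inserter(list2));

    for (std::size_t k = 0; k < list.size(); ++k)
        std::cout << "list[" << k << "]=" << list[k] << '\n';
}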

Tokenizing Error: java.util.regex.PatternSyntaxException, dangling metacharacter '*'

ε祈祈猫儿з submitted on 2019-12-27 23:36:56
Question: I am using split() to tokenize a String separated with * following this format:

name*lastName*ID*school*age % name*lastName*ID*school*age % name*lastName*ID*school*age

I'm reading this from a file named "entrada.al" using this code:

static void leer() {
    try {
        String ruta = "entrada.al";
        File myFile = new File(ruta);
        FileReader fileReader = new FileReader(myFile);
        BufferedReader reader = new BufferedReader(fileReader);
        String line = null;
        while ((line = reader.readLine()) != null) {
            if (!(line …
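The code in the question is cut off before the split call, but the exception in the title is what split("*") throws: String.split takes a regular expression, and a bare * is a dangling metacharacter. A minimal sketch of the usual fix:

public class SplitFix {
    public static void main(String[] args) {
        String line = "name*lastName*ID*school*age";
        String[] fields = line.split("\\*");    // escape the metacharacter
        // equivalently: line.split(java.util.regex.Pattern.quote("*"));
        System.out.println(fields[1]);          // lastName
    }
}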

Tokenise text and create more rows for each row in dataframe

吃可爱长大的小学妹 submitted on 2019-12-25 17:37:34
Question: I want to do this with Python and pandas. Let's suppose that I have the following:

file_id  text
1        I am the first document. I am a nice document.
2        I am the second document. I am an even nicer document.

and I finally want to have the following:

file_id  text
1        I am the first document
1        I am a nice document
2        I am the second document
2        I am an even nicer document

So I want the text of each file to be split at every full stop, creating a new row for each of the resulting tokens. What …
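The question is cut off, but the reshaping it describes can be sketched with str.split plus DataFrame.explode (available from pandas 0.25; column names taken from the example):

import pandas as pd

df = pd.DataFrame({
    "file_id": [1, 2],
    "text": ["I am the first document. I am a nice document.",
             "I am the second document. I am an even nicer document."],
})

# split each text on "." and emit one row per sentence
out = df.assign(text=df["text"].str.split(".")).explode("text")
out["text"] = out["text"].str.strip()
out = out[out["text"] != ""].reset_index(drop=True)
print(out)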

How to set delimiters for PTB tokenizer?

落爺英雄遲暮 submitted on 2019-12-25 07:39:41
Question: I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement that goes like this:

go to room no. #2145 or go to room no. *2145

the tokenizer splits #2145 into two tokens: # and 2145. Is there any way to set up the tokenizer so that it doesn't treat # and * as delimiters?

Answer 1: A quick solution is to use this option:

(command line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");

This will cause the tokenizer to …
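The answer is truncated; for context, this is roughly where that property goes when building a pipeline with the standard StanfordCoreNLP API (a sketch, with only the tokenize annotator enabled):

import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class WhitespaceTokens {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.whitespace", "true");  // split on whitespace only
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("go to room no. #2145 or go to room no. *2145");
        pipeline.annotate(doc);
        List<CoreLabel> tokens = doc.get(CoreAnnotations.TokensAnnotation.class);
        for (CoreLabel t : tokens) {
            System.out.println(t.word());   // #2145 and *2145 stay whole
        }
    }
}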

Mapping and indexing path hierarchy in Elastic NEST to search within directory paths

耗尽温柔 submitted on 2019-12-25 04:27:43
Question: I need to search for files and folders within specific directories. To do that, Elastic asks us to create an analyzer with the tokenizer set to path_hierarchy:

PUT /fs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}

Then, create the mapping as illustrated below with two properties: name (holding the name of the file) and path (storing the directory path):

PUT /fs/_mapping/file
{
  "properties": {
    "name": {
      "type": "string",
      "index": "not …
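The mapping in the question is cut off; for the analyzer that is shown, the effect of path_hierarchy can be checked with the _analyze API (a sketch with a hypothetical path; older Elasticsearch versions may expect analyzer and text as query-string parameters rather than a JSON body). It emits every prefix of the path (/var, /var/log, /var/log/nginx, /var/log/nginx/access.log), which is what lets a term query on a field analyzed this way match everything under a given directory:

POST /fs/_analyze
{
  "analyzer": "paths",
  "text": "/var/log/nginx/access.log"
}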

Stanford PTBTokenizer token's split delimiter

陌路散爱 submitted on 2019-12-25 02:07:37
Question: Is there a way to provide the PTBTokenizer with a set of delimiter characters on which to split tokens? I was testing the behaviour of this tokenizer and I've realized that there are some characters, like the vertical bar '|', for which the tokenizer divides a substring into two tokens, and others, like the slash or the hyphen, for which the tokenizer returns a single token.

Answer 1: There's not any simple way to do this with the PTBTokenizer, no. You can do some pre-processing and post-processing to get what …
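The answer is cut off at "to get what"; as one illustration of the pre-processing idea (not necessarily what the answer goes on to describe), the characters you want treated as delimiters can be padded with spaces before the text reaches the tokenizer. A sketch with a hypothetical helper and example sentence:

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class SplitOnSlashAndHyphen {
    // pad / and - with spaces so the tokenizer sees them as separate tokens
    static String padDelimiters(String s) {
        return s.replaceAll("([/\\-])", " $1 ");
    }

    public static void main(String[] args) {
        String text = padDelimiters("black/white photos of a well-known park");
        PTBTokenizer<CoreLabel> tok = new PTBTokenizer<>(
                new StringReader(text), new CoreLabelTokenFactory(), "");
        List<CoreLabel> tokens = tok.tokenize();
        for (CoreLabel t : tokens) System.out.println(t.word());
    }
}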