tokenize

Find corresponding open/close brackets

余生长醉 submitted on 2019-12-25 01:59:40
Question: A follow-up to a previous question: PHP how to best edit an RTF File. I believe I have a solution, but need some more help. I found that if I use merge fields in my template builder, my PHP code can find/replace fields that match the pattern "{\field}". The problem, though, is that I would need to find the whole string, remove all RTF tags, and compare the text left behind. The first step is to find the full markup, and this is where I am stuck. I would need to be able to find the …
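
Editor's sketch (the excerpt is cut off above, so this is an assumption about the matching step, not the poster's actual code): in PHP, one-level-deep {\field ...} groups can be pulled out with a regex, and RTF control words stripped afterwards to leave the plain text for comparison. The file name and both patterns are illustrative only.

    <?php
    $rtf = file_get_contents('template.rtf');  // hypothetical input file

    // Match "{\field" followed by plain characters or one nested brace
    // group, up to the closing brace (assumes at most one nesting level).
    preg_match_all('/\{\\\\field(?:[^{}]|\{[^{}]*\})*\}/', $rtf, $matches);

    foreach ($matches[0] as $field) {
        // Strip RTF control words (e.g. \fldinst, \f0) and braces,
        // leaving only the human-readable text behind.
        $plain = preg_replace('/\\\\[a-z]+-?\d*\s?|[{}]/i', '', $field);
        echo trim($plain), "\n";
    }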

Elasticsearch - combine terms into one

丶灬走出姿态 submitted on 2019-12-25 01:17:31
Question: For example, Elasticsearch tokenizes The Whole Foods Market into the terms [the, whole, foods, market]. Is there a tokenizer or token filter that can combine all the terms into one? In this example, I want the single term thewholefoodsmarket.

Source: https://stackoverflow.com/questions/58716832/elasticsearch-combine-terms-into-one
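
No answer survives in this excerpt. One common approach (an assumption, not taken from the thread) is a custom analyzer that removes whitespace with a pattern_replace char filter before a keyword tokenizer, so the entire input becomes one lowercased term; the index and analyzer names here are invented:

    PUT my-index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "strip_spaces": {
              "type": "pattern_replace",
              "pattern": "\\s+",
              "replacement": ""
            }
          },
          "analyzer": {
            "single_term": {
              "type": "custom",
              "char_filter": ["strip_spaces"],
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }

Analyzing "The Whole Foods Market" with single_term then yields exactly one token, thewholefoodsmarket.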

Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy

倾然丶 夕夏残阳落幕 submitted on 2019-12-24 19:39:01
Question: I am trying to tag and parse text that has already been split into sentences and has already been tokenized. As an example:

    sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

The fastest approach to processing batches of text is .pipe(). However, it is not clear to me how I can use that with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but that threw an error:

    docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents]
    nlp.tagger …
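
The code stops mid-statement. A sketch of one workable pattern under the spaCy v2-era API the question uses: build Doc objects straight from the word lists and apply the pipeline components to them, bypassing the tokenizer entirely (in spaCy v3, nlp.pipe() also accepts pre-built Doc objects directly):

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

    # One Doc per pre-segmented sentence, using the given tokens as-is.
    docs = [Doc(nlp.vocab, words=words) for words in sents]

    # Run the remaining components (tagger, parser, ...) on each Doc.
    for i, doc in enumerate(docs):
        for name, proc in nlp.pipeline:
            doc = proc(doc)
        docs[i] = doc

    for doc in docs:
        print([(t.text, t.pos_, t.dep_) for t in doc])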

XSLT tokenize - capturing the separators

荒凉一梦 submitted on 2019-12-24 17:58:25
Question: Here is a piece of XSL code that tokenizes a text into fragments separated by punctuation and similar characters. I'd like to ask whether it is possible to somehow capture the strings by which the text was tokenized, for example the comma or dot.

    <xsl:stylesheet version="2.0" exclude-result-prefixes="xs xdt err fn"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:err="http://www.w3 …
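
The stylesheet is truncated, but since it is XSLT 2.0, the usual answer is xsl:analyze-string, which, unlike the tokenize() function, hands you the separators via xsl:matching-substring. A minimal sketch; the $text variable, the element names, and the separator character class are placeholders:

    <xsl:analyze-string select="$text" regex="[,.;:!?]+">
      <xsl:matching-substring>
        <!-- the separator the text was split on, e.g. "," or "." -->
        <sep><xsl:value-of select="."/></sep>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- the fragment between separators -->
        <frag><xsl:value-of select="."/></frag>
      </xsl:non-matching-substring>
    </xsl:analyze-string>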

Ignore parentheses with string tokenizer?

孤街醉人 submitted on 2019-12-24 15:52:06
Question: I have an input that looks like (0 0 0). I would like to ignore the parentheses and only add the numbers, in this case 0, to an ArrayList. I am using Scanner to read from a file, and this is what I have so far:

    transitionInput = data.nextLine();
    st = new StringTokenizer(transitionInput, "()", true);
    while (st.hasMoreTokens()) {
        transition.add(st.nextToken(","));
    }

However, the output looks like [(0 0 0)]; I would like to ignore the parentheses.

Answer 1: You are first using () as delimiters, then …
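
The answer is cut off mid-sentence, but its likely direction, sketched here as an assumption: make the parentheses and the space all delimiters, and drop the true (returnDelims) flag so delimiters are never handed back as tokens. Note the original code also switches the delimiter set to "," inside nextToken(), which is what lets the whole "(0 0 0)" through as one token.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.StringTokenizer;

    public class TransitionParser {
        public static void main(String[] args) {
            String transitionInput = "(0 0 0)";
            List<String> transition = new ArrayList<>();

            // Parentheses AND spaces are delimiters; no returnDelims flag.
            StringTokenizer st = new StringTokenizer(transitionInput, "() ");
            while (st.hasMoreTokens()) {
                transition.add(st.nextToken());
            }
            System.out.println(transition); // prints [0, 0, 0]
        }
    }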

How to distinguish between added sentences and altered sentences with difflib and nltk?

有些话、适合烂在心里 submitted on 2019-12-24 15:32:46
Question: Downloading this page and making a very minor edit to it, changing the first 65 in this paragraph to 68, I then run it through the following code to pull out the diffs:

    import bs4
    from bs4 import BeautifulSoup
    import urllib2
    import lxml.html as lh

    url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
    response = urllib2.urlopen(url)
    content = response.read()  # get response as list of lines
    root = lh.fromstring(content)
    section1 = root.xpath("//div[@class = 'column-12']") …
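
The excerpt ends before the actual diffing. One way to tell added sentences from altered ones (a sketch, not the poster's code) is to run difflib.SequenceMatcher over the two sentence lists and read its opcodes: insert means an added sentence, replace means an altered one. nltk.sent_tokenize stands in for whatever segmentation is used:

    import difflib
    from nltk import sent_tokenize  # requires nltk.download('punkt')

    def classify_changes(old_text, new_text):
        old_sents = sent_tokenize(old_text)
        new_sents = sent_tokenize(new_text)
        sm = difflib.SequenceMatcher(None, old_sents, new_sents)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "replace":
                # old sentence(s) rewritten as new sentence(s)
                for old, new in zip(old_sents[i1:i2], new_sents[j1:j2]):
                    print("ALTERED:", old, "->", new)
            elif op == "insert":
                for new in new_sents[j1:j2]:
                    print("ADDED:", new)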

word_tokenize TypeError: expected string or buffer [closed]

孤街醉人 submitted on 2019-12-24 12:42:35
Question (closed as off-topic): When calling word_tokenize I get the following error:

    File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322, in _slices_from_text
      for match in self._lang_vars.period_context_re().finditer(text):
    TypeError: expected string or buffer

I have a large text file (1500.txt) from which I want to remove …
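
That TypeError almost always means word_tokenize was handed something other than a str, for example the open file object itself or a list of lines. A minimal sketch of the usual fix, assuming the goal is simply to tokenize the file's contents:

    from nltk.tokenize import word_tokenize

    # word_tokenize expects one string; passing the file object (or a
    # list of lines) raises "TypeError: expected string or buffer".
    with open("1500.txt", encoding="utf-8") as f:
        text = f.read()  # read() returns a str, which is what punkt wants

    tokens = word_tokenize(text)
    print(tokens[:20])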

Create analyzer with Edge N Gram analyzer and char filter which replaces space with new line

ε祈祈猫儿з submitted on 2019-12-24 11:15:30
Question: I have the below type of text coming in: foo bar, hello world, etc. I created an analyzer using the Edge NGram tokenizer, and running the analyze API on it produces the tokens below:

    {
      "tokens": [
        { "token": "f",   "start_offset": 0, "end_offset": 1, "type": "word", "position": 1 },
        { "token": "fo",  "start_offset": 0, "end_offset": 2, "type": "word", "position": 2 },
        { "token": "foo", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3 },
        { "token": "b",   "start_offset": 4, "end_offset": 5, "type": …
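
The analyzer settings themselves are not shown. A sketch matching the title (all names invented): a mapping char filter turns each space into a newline, and an edge_ngram tokenizer limited to letters and digits then emits grams per word rather than across the whole phrase. Strictly speaking, the token_chars restriction already splits on whitespace, so the char filter is redundant; it is included because the title asks for it.

    PUT my-index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "space_to_newline": {
              "type": "mapping",
              "mappings": ["\\u0020 => \\n"]
            }
          },
          "tokenizer": {
            "edge_ngram_tok": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 10,
              "token_chars": ["letter", "digit"]
            }
          },
          "analyzer": {
            "edge_ngram_per_word": {
              "type": "custom",
              "char_filter": ["space_to_newline"],
              "tokenizer": "edge_ngram_tok",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }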

API design regarding file tokenization, object value setting and enums of token positions

淺唱寂寞╮ submitted on 2019-12-24 09:50:08
Question: In my Android application I need to read a large amount of data from a set of .ini files that were originally (and still are) deployed with a Windows application. The application relates to a particular piece of hardware, and the basic purpose of the .ini files is to describe programming constants, real-time data, and UI elements for the hardware device. To move on to the specific Java design question I would like assistance with, here's an example of how a typical row of data from an …
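
The example row is cut off, so the sketch below is purely hypothetical: it only illustrates the general shape the question is circling, naming comma-separated token positions with an enum instead of magic indices. The field names and the comma delimiter are assumptions.

    // Hypothetical: positions of tokens within one .ini data row.
    enum Field { LABEL, MIN, MAX, UNITS }

    final class IniRow {
        private final String[] tokens;

        IniRow(String line) {
            this.tokens = line.split(",");  // delimiter assumed
        }

        String get(Field field) {
            // The enum's declaration order doubles as the column index.
            return tokens[field.ordinal()].trim();
        }
    }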

Tokenization based pattern replacement in web.config.token

时间秒杀一切 submitted on 2019-12-24 08:10:07
Question: I am using Release Manager 2015 to deploy my application, with Microsoft's Extension Utilities pack: Extension Utility Pack - Documentation. It simply states:

Tokenization based pattern replacement: This task finds the pattern __<pattern>__ and replaces it with the value of the variable named <pattern>. E.g. if you have a variable defined as foo with value bar, running this task on a file that contains __foo__ will change it to bar.

So in my web.config.token …
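
As an illustration of the documented convention (the file contents here are invented, since the poster's web.config.token is cut off): with a release variable foo defined as bar, the task rewrites the placeholder below.

    <!-- web.config.token -->
    <appSettings>
      <!-- __foo__ becomes bar after the replacement task runs -->
      <add key="Environment" value="__foo__" />
    </appSettings>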