tokenize

Find corresponding open/close brackets

余生长醉 submitted on 2019-12-25 01:59:40
Question: A follow-up to a previous question: PHP how to best edit an RTF File. I believe I have a solution, but need some more help. I found that if I use merge fields in my template builder, my PHP code can find/replace fields that match the pattern "{\field}". The problem, though, is that I would need to find the whole string, remove all RTF tags, and compare the text left behind. The first step is to find the full markup, and this is where I am stuck. I would need to be able to find the …
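
Editor's sketch (the excerpt is cut off above, so this is an assumption about the matching step, not the poster's actual code): in PHP, one-level-deep {\field ...} groups can be pulled out with a regex, and RTF control words stripped afterwards to leave the plain text for comparison. The file name and both patterns are illustrative only.

    <?php
    $rtf = file_get_contents('template.rtf');  // hypothetical input file

    // Match "{\field" followed by plain characters or one nested brace
    // group, up to the closing brace (assumes at most one nesting level).
    preg_match_all('/\{\\\\field(?:[^{}]|\{[^{}]*\})*\}/', $rtf, $matches);

    foreach ($matches[0] as $field) {
        // Strip RTF control words (e.g. \fldinst, \f0) and braces,
        // leaving only the human-readable text behind.
        $plain = preg_replace('/\\\\[a-z]+-?\d*\s?|[{}]/i', '', $field);
        echo trim($plain), "\n";
    }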

Elasticsearch - combine terms into one

丶灬走出姿态 submitted on 2019-12-25 01:17:31
Question: For example, Elasticsearch tokenizes The Whole Foods Market into the terms [the, whole, foods, market]. Is there a tokenizer or token filter that can combine all the terms into one? In this example, I want the single term thewholefoodsmarket.

Source: https://stackoverflow.com/questions/58716832/elasticsearch-combine-terms-into-one
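
No answer survives in this excerpt. One common approach (an assumption, not taken from the thread) is a custom analyzer that removes whitespace with a pattern_replace char filter before a keyword tokenizer, so the entire input becomes one lowercased term; the index and analyzer names here are invented:

    PUT my-index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "strip_spaces": {
              "type": "pattern_replace",
              "pattern": "\\s+",
              "replacement": ""
            }
          },
          "analyzer": {
            "single_term": {
              "type": "custom",
              "char_filter": ["strip_spaces"],
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }

Analyzing "The Whole Foods Market" with single_term then yields exactly one token, thewholefoodsmarket.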

Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy

倾然丶 夕夏残阳落幕 submitted on 2019-12-24 19:39:01
Question: I am trying to tag and parse text that has already been split into sentences and has already been tokenized. As an example:

    sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

The fastest approach to processing batches of text is .pipe(). However, it is not clear to me how I can use that with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but that threw an error:

    docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents]
    nlp.tagger …
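
The code stops mid-statement. A sketch of one workable pattern under the spaCy v2-era API the question uses: build Doc objects straight from the word lists and apply the pipeline components to them, bypassing the tokenizer entirely (in spaCy v3, nlp.pipe() also accepts pre-built Doc objects directly):

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

    # One Doc per pre-segmented sentence, using the given tokens as-is.
    docs = [Doc(nlp.vocab, words=words) for words in sents]

    # Run the remaining components (tagger, parser, ...) on each Doc.
    for i, doc in enumerate(docs):
        for name, proc in nlp.pipeline:
            doc = proc(doc)
        docs[i] = doc

    for doc in docs:
        print([(t.text, t.pos_, t.dep_) for t in doc])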

XSLT tokenize - capturing the separators

荒凉一梦 submitted on 2019-12-24 17:58:25
Question: Here is a piece of XSL code that tokenizes a text into fragments separated by punctuation and similar characters. I'd like to ask whether it is possible to somehow capture the strings by which the text was tokenized, for example the comma or dot.

    <xsl:stylesheet version="2.0" exclude-result-prefixes="xs xdt err fn"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:err="http://www.w3 …
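
The stylesheet is truncated, but since it is XSLT 2.0, the usual answer is xsl:analyze-string, which, unlike the tokenize() function, hands you the separators via xsl:matching-substring. A minimal sketch; the $text variable, the element names, and the separator character class are placeholders:

    <xsl:analyze-string select="$text" regex="[,.;:!?]+">
      <xsl:matching-substring>
        <!-- the separator the text was split on, e.g. "," or "." -->
        <sep><xsl:value-of select="."/></sep>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- the fragment between separators -->
        <frag><xsl:value-of select="."/></frag>
      </xsl:non-matching-substring>
    </xsl:analyze-string>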

Ignore parentheses with string tokenizer?

孤街醉人 submitted on 2019-12-24 15:52:06
Question: I have an input that looks like (0 0 0). I would like to ignore the parentheses and only add the numbers, in this case 0, to an ArrayList. I am using Scanner to read from a file, and this is what I have so far:

    transitionInput = data.nextLine();
    st = new StringTokenizer(transitionInput, "()", true);
    while (st.hasMoreTokens()) {
        transition.add(st.nextToken(","));
    }

However, the output looks like [(0 0 0)]; I would like to ignore the parentheses.

Answer 1: You are first using () as delimiters, then …
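
The answer is cut off mid-sentence, but its likely direction, sketched here as an assumption: make the parentheses and the space all delimiters, and drop the true (returnDelims) flag so delimiters are never handed back as tokens. Note the original code also switches the delimiter set to "," inside nextToken(), which is what lets the whole "(0 0 0)" through as one token.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.StringTokenizer;

    public class TransitionParser {
        public static void main(String[] args) {
            String transitionInput = "(0 0 0)";
            List<String> transition = new ArrayList<>();

            // Parentheses AND spaces are delimiters; no returnDelims flag.
            StringTokenizer st = new StringTokenizer(transitionInput, "() ");
            while (st.hasMoreTokens()) {
                transition.add(st.nextToken());
            }
            System.out.println(transition); // prints [0, 0, 0]
        }
    }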

How to distinguish between added sentences and altered sentences with difflib and nltk?

有些话、适合烂在心里 submitted on 2019-12-24 15:32:46
Question: Downloading this page and making a very minor edit to it, changing the first 65 in this paragraph to 68, I then run it through the following code to pull out the diffs:

    import bs4
    from bs4 import BeautifulSoup
    import urllib2
    import lxml.html as lh

    url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
    response = urllib2.urlopen(url)
    content = response.read()  # get response as list of lines
    root = lh.fromstring(content)
    section1 = root.xpath("//div[@class = 'column-12']") …
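
The excerpt ends before the actual diffing. One way to tell added sentences from altered ones (a sketch, not the poster's code) is to run difflib.SequenceMatcher over the two sentence lists and read its opcodes: insert means an added sentence, replace means an altered one. nltk.sent_tokenize stands in for whatever segmentation is used:

    import difflib
    from nltk import sent_tokenize  # requires nltk.download('punkt')

    def classify_changes(old_text, new_text):
        old_sents = sent_tokenize(old_text)
        new_sents = sent_tokenize(new_text)
        sm = difflib.SequenceMatcher(None, old_sents, new_sents)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "replace":
                # old sentence(s) rewritten as new sentence(s)
                for old, new in zip(old_sents[i1:i2], new_sents[j1:j2]):
                    print("ALTERED:", old, "->", new)
            elif op == "insert":
                for new in new_sents[j1:j2]:
                    print("ADDED:", new)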

word_tokenize TypeError: expected string or buffer [closed]

孤街醉人 submitted on 2019-12-24 12:42:35
Question (closed as off-topic): When calling word_tokenize I get the following error:

    File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1322, in _slices_from_text
      for match in self._lang_vars.period_context_re().finditer(text):
    TypeError: expected string or buffer

I have a large text file (1500.txt) from which I want to remove …
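
That TypeError almost always means word_tokenize was handed something other than a str, for example the open file object itself or a list of lines. A minimal sketch of the usual fix, assuming the goal is simply to tokenize the file's contents:

    from nltk.tokenize import word_tokenize

    # word_tokenize expects one string; passing the file object (or a
    # list of lines) raises "TypeError: expected string or buffer".
    with open("1500.txt", encoding="utf-8") as f:
        text = f.read()  # read() returns a str, which is what punkt wants

    tokens = word_tokenize(text)
    print(tokens[:20])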

Create analyzer with Edge N Gram analyzer and char filter which replaces space with new line

ε祈祈猫儿з submitted on 2019-12-24 11:15:30
Question: I have the below type of text coming in: foo bar, hello world, etc. I created an analyzer using the Edge NGram tokenizer, and running the analyze API on it produces the tokens below:

    {
      "tokens": [
        { "token": "f",   "start_offset": 0, "end_offset": 1, "type": "word", "position": 1 },
        { "token": "fo",  "start_offset": 0, "end_offset": 2, "type": "word", "position": 2 },
        { "token": "foo", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3 },
        { "token": "b",   "start_offset": 4, "end_offset": 5, "type": …
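
The analyzer settings themselves are not shown. A sketch matching the title (all names invented): a mapping char filter turns each space into a newline, and an edge_ngram tokenizer limited to letters and digits then emits grams per word rather than across the whole phrase. Strictly speaking, the token_chars restriction already splits on whitespace, so the char filter is redundant; it is included because the title asks for it.

    PUT my-index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "space_to_newline": {
              "type": "mapping",
              "mappings": ["\\u0020 => \\n"]
            }
          },
          "tokenizer": {
            "edge_ngram_tok": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 10,
              "token_chars": ["letter", "digit"]
            }
          },
          "analyzer": {
            "edge_ngram_per_word": {
              "type": "custom",
              "char_filter": ["space_to_newline"],
              "tokenizer": "edge_ngram_tok",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }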

API design regarding file tokenization, object value setting and enums of token positions

淺唱寂寞╮ submitted on 2019-12-24 09:50:08
Question: In my Android application I need to read a large amount of data from a set of .ini files that were originally (and still are) deployed with a Windows application. The application relates to a particular piece of hardware, and the basic purpose of the .ini files is to describe programming constants, real-time data, and UI elements for the hardware device. To move on to the specific Java design question I would like assistance with, here's an example of how a typical row of data from an …
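
The example row is cut off, so the sketch below is purely hypothetical: it only illustrates the general shape the question is circling, naming comma-separated token positions with an enum instead of magic indices. The field names and the comma delimiter are assumptions.

    // Hypothetical: positions of tokens within one .ini data row.
    enum Field { LABEL, MIN, MAX, UNITS }

    final class IniRow {
        private final String[] tokens;

        IniRow(String line) {
            this.tokens = line.split(",");  // delimiter assumed
        }

        String get(Field field) {
            // The enum's declaration order doubles as the column index.
            return tokens[field.ordinal()].trim();
        }
    }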

Tokenization based pattern replacement in web.config.token

时间秒杀一切 submitted on 2019-12-24 08:10:07
Question: I am using Release Manager 2015 to deploy my application, with Microsoft's Extension Utilities pack: Extension Utility Pack - Documentation. It simply states:

Tokenization based pattern replacement: This task finds the pattern __<pattern>__ and replaces it with the value of the variable named <pattern>. E.g. if you have a variable defined as foo with value bar, running this task on a file that contains __foo__ will change it to bar.

So in my web.config.token …
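
As an illustration of the documented convention (the file contents here are invented, since the poster's web.config.token is cut off): with a release variable foo defined as bar, the task rewrites the placeholder below.

    <!-- web.config.token -->
    <appSettings>
      <!-- __foo__ becomes bar after the replacement task runs -->
      <add key="Environment" value="__foo__" />
    </appSettings>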