tokenize

Create new tokens and tuples from existing ones based on conditions

Submitted by 谁都会走 on 2019-12-24 06:28:13
Question: This is closely related to a previous question, but I am having difficulty adapting it for my use case. I have a sentence: "Forbes Asia 200 Best Under 500 Billion 2011" I have tokens like: oldTokens = [u'Forbes', u'Asia', u'200', u'Best', u'Under', u'500', u'Billion', u'2011'] And the indices where a previous parser has figured out there should be location or number slots: numberTokenIDs = {(7,): 2011.0, (2,): 200.0, (5,6): 500000000000.00} locationTokenIDs = {(0, 1): u'Forbes Asia'} The
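The question is cut off above, but it is essentially about collapsing index groups into single slots. A minimal Python sketch of one way to do this, assuming each index tuple covers a contiguous run of tokens (the function and variable names here are mine, not from the original question):

```python
def merge_tokens(old_tokens, *id_maps):
    # Map the first index of each group to (full index set, replacement value).
    groups = {}
    for id_map in id_maps:
        for ids, value in id_map.items():
            groups[ids[0]] = (set(ids), value)

    new_tokens = []
    i = 0
    while i < len(old_tokens):
        if i in groups:
            ids, value = groups[i]
            new_tokens.append(value)     # collapse the whole group into one slot
            i = max(ids) + 1             # skip past the consumed tokens
        else:
            new_tokens.append(old_tokens[i])
            i += 1
    return new_tokens

old_tokens = ['Forbes', 'Asia', '200', 'Best', 'Under', '500', 'Billion', '2011']
number_ids = {(7,): 2011.0, (2,): 200.0, (5, 6): 500000000000.0}
location_ids = {(0, 1): 'Forbes Asia'}
print(merge_tokens(old_tokens, number_ids, location_ids))
```

With the question's data this yields one merged token per slot, e.g. `'Forbes Asia'` for indices (0, 1) and `500000000000.0` for indices (5, 6).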

Splitting of strings based on the required length

Submitted by 天涯浪子 on 2019-12-24 05:46:31
Question: Is there an easy way to split a string based on a required length? For example, I have a string: <Data>AAAAABBBBB1111122222RRRRR<Data> and I want to populate an output like this: AAAAA BBBBB 11111 22222 RRRRR Thank you. Answer 1: You can use analyze-string to break up the data: <xsl:template match="Data"> <xsl:variable name="tokens" as="xs:string*"> <xsl:analyze-string select="." regex=".{{1,5}}"> <xsl:matching-substring> <xsl:sequence select="."/> </xsl:matching-substring> </xsl
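The XSLT answer is truncated above; the same `.{1,5}` regex idea can be sketched in Python (assuming, as in the example, chunks of at most five characters):

```python
import re

def chunk(s, size=5):
    # Same idea as the XSLT regex ".{1,5}": greedy chunks of up to `size` characters.
    return re.findall('.{1,%d}' % size, s)

print(chunk('AAAAABBBBB1111122222RRRRR'))
```

Because the regex is greedy, every chunk is exactly `size` characters long except possibly the last one.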

What's the best way to have StringTokenizer split up a line of text into predefined variables

Submitted by ╄→гoц情女王★ on 2019-12-24 04:48:09
Question: I'm not sure if the title is very clear, but basically what I have to do is read a line of text from a file and split it into 8 different string variables. Each line will have the same 8 chunks in the same order (title, author, price, etc.). So for each line of text, I want to end up with 8 strings. The first problem is that the last two fields in the line may or may not be present, so I need to do something with StringTokenizer.hasMoreTokens, otherwise it will die messily when fields 7 and
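In Java the usual fix is to guard each read with StringTokenizer.hasMoreTokens (or use String.split, which simply yields fewer elements). A Python sketch of the same split-and-pad idea; the delimiter and the last five field names are hypothetical, since the question only names the first three:

```python
FIELDS = ['title', 'author', 'price', 'field4', 'field5', 'field6', 'field7', 'field8']

def parse_line(line, delimiter='\t'):
    # Split into chunks, then pad so optional trailing fields
    # become empty strings instead of raising errors.
    parts = line.rstrip('\n').split(delimiter)
    parts += [''] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

print(parse_line('A Title\tAn Author\t9.99\td4\td5\td6'))
```

A line with only six fields still produces all eight keys; the missing trailing fields are simply empty.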

Not able to tag a Hindi sentence properly

Submitted by 浪尽此生 on 2019-12-24 04:43:09
Question: I have recently started a project on Hindi data processing. I have tried executing the code below but have not gotten the expected output. e = u"पूर्ण प्रतिबंध हटाओ : इराक" tokens = nltk.word_tokenize(e) from nltk import pos_tag print tokens tag = nltk.pos_tag(tokens) print tag The output I obtained is shown below: [u'\u092a\u0942\u0930\u094d\u0923', u'\u092a\u094d\u0930\u0924\u093f\u092c\u0902\u0927', u'\u0939\u091f\u093e\u0913', u':', u'\u0907\u0930\u093e\u0915'] [(u'\u092a\u0942\u0930

Insert text in between file lines in Python

Submitted by 时间秒杀一切 on 2019-12-24 03:29:22
Question: I have a file that I am currently reading from using fo = open("file.txt", "r") Then by doing file = open("newfile.txt", "w") file.write(fo.read()) file.write("Hello at the end of the file") fo.close() file.close() I basically copy the file to a new one, but also add some text at the end of the newly created file. How would I be able to insert that line, say, in between two lines separated by an empty line? I.e.: line 1 is right here <---- I want to insert here line 3 is right here Can I
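One way, sketched below, is to read the lines into a list and replace the blank line in place; this in-memory version assumes the file fits in memory and that the insertion point is the first empty line:

```python
def insert_at_blank(text, new_line):
    # Replace the first empty line with `new_line`; everything else is kept.
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            lines[i] = new_line
            break
    return '\n'.join(lines)

print(insert_at_blank('line 1 is right here\n\nline 3 is right here',
                      '<---- I want to insert here'))
```

To apply this to real files, read the source with `open(...).read()`, transform, and write the result to the new file, as in the question's copy step.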

Java code reads UTF-8 text incorrectly

Submitted by 谁说我不能喝 on 2019-12-24 00:43:01
Question: I'm having a problem reading UTF-8 characters in my code (running in Eclipse). I have a text file which has a few lines in it, for example: אך 1234 NOTE: There is a \t before the word, and the word should appear on the left, the number on the right... I don't know how to reverse them here, sorry. That is, a Hebrew word and then a number. I need to separate the word from the number somehow. I tried this: BufferedReader br = new BufferedReader(new FileReader(text)); String content; while (
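The likely culprit is that Java's FileReader always uses the platform default charset; the standard Java fix is to wrap a FileInputStream in an InputStreamReader constructed with an explicit UTF-8 charset. The same explicit-encoding principle, sketched in Python (the file name and contents are illustrative):

```python
# Write a sample line (tab, Hebrew word, tab, number), UTF-8 encoded.
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write('\t\u05d0\u05da\t1234\n')

# Read it back, naming the encoding explicitly instead of relying on
# the platform default -- the usual cause of mojibake.
with open('text.txt', encoding='utf-8') as f:
    for line in f:
        word, number = line.split()
        print(word, int(number))
```

Splitting on whitespace separates the Hebrew word from the number regardless of the leading tab.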

Antlr3 matching tokens without whitespace

Submitted by 元气小坏坏 on 2019-12-23 21:06:13
Question: Given the input "term >1", the number (1) and comparison operator (>) should generate separate nodes in an AST. How can this be achieved? In my tests, matching only occurred if "c" and "1" were separated with a space, like so: " term < 1 ". Current grammar: startExpression : orEx; expressionLevel4 : LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression; expressionLevel3 : (fieldExpression) | expressionLevel4 ; expressionLevel2 : (nearExpression) | expressionLevel3 ; expressionLevel1 :

Lucene, indexing already/externally tokenized tokens and defining own analyzing process

Submitted by 我与影子孤独终老i on 2019-12-23 17:06:33
Question: In the process of using Lucene, I am a bit disappointed. I do not see or understand how I should proceed to feed a Lucene analyzer with something that is already and directly indexable, or how I should proceed to create my own analyzer... For example, if I have a List<MyCustomToken>, which already contains many tokens (and actually much more information about capitalization, etc., that I would also like to index as features on each MyCustomToken), if I understand well what I have read, I

Custom Solr tokenizer is only invoked for the first query

Submitted by ∥☆過路亽.° on 2019-12-23 12:52:41
Question: I created a custom tokenizer; it seems to work fine when checked with admin/analysis.jsp and with System.out logging. However, when I query the field that uses this custom tokenizer, I see that it is only invoked for the first query string (checked via System.out logging). Could you help me by pointing out what I am doing wrong? This is my code: package com.fosp.searchengine; import java.io.Reader; import org.apache.lucene.analysis.WhitespaceTokenizer; import org.apache.solr