tokenize

Create new tokens and tuples from existing ones based on conditions

Submitted by 谁都会走 on 2019-12-24 06:28:13
Question: This is closely related to a previous question, but I am having difficulty adapting it for my use case. I have a sentence: "Forbes Asia 200 Best Under 500 Billion 2011" I have tokens like: oldTokens = [u'Forbes', u'Asia', u'200', u'Best', u'Under', u'500', u'Billion', u'2011'] And the indices where a previous parser has figured out there should be location or number slots: numberTokenIDs = {(7,): 2011.0, (2,): 200.0, (5,6): 500000000000.00} locationTokenIDs = {(0, 1): u'Forbes Asia'} The
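The question is cut off above, but it is essentially about collapsing index groups into single slots. A minimal Python sketch of one way to do this, assuming each index tuple covers a contiguous run of tokens (the function and variable names here are mine, not from the original question):

```python
def merge_tokens(old_tokens, *id_maps):
    # Map the first index of each group to (full index set, replacement value).
    groups = {}
    for id_map in id_maps:
        for ids, value in id_map.items():
            groups[ids[0]] = (set(ids), value)

    new_tokens = []
    i = 0
    while i < len(old_tokens):
        if i in groups:
            ids, value = groups[i]
            new_tokens.append(value)     # collapse the whole group into one slot
            i = max(ids) + 1             # skip past the consumed tokens
        else:
            new_tokens.append(old_tokens[i])
            i += 1
    return new_tokens

old_tokens = ['Forbes', 'Asia', '200', 'Best', 'Under', '500', 'Billion', '2011']
number_ids = {(7,): 2011.0, (2,): 200.0, (5, 6): 500000000000.0}
location_ids = {(0, 1): 'Forbes Asia'}
print(merge_tokens(old_tokens, number_ids, location_ids))
```

With the question's data this yields one merged token per slot, e.g. `'Forbes Asia'` for indices (0, 1) and `500000000000.0` for indices (5, 6).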

Splitting of strings based on the required length

Submitted by 天涯浪子 on 2019-12-24 05:46:31
Question: Is there an easy way to split a string based on a required length? For example, I have a string: <Data>AAAAABBBBB1111122222RRRRR<Data> and I want to populate an output like this: AAAAA BBBBB 11111 22222 RRRRR Thank you. Answer 1: You can use analyze-string to break up the data: <xsl:template match="Data"> <xsl:variable name="tokens" as="xs:string*"> <xsl:analyze-string select="." regex=".{{1,5}}"> <xsl:matching-substring> <xsl:sequence select="."/> </xsl:matching-substring> </xsl
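The XSLT answer is truncated above; the same `.{1,5}` regex idea can be sketched in Python (assuming, as in the example, chunks of at most five characters):

```python
import re

def chunk(s, size=5):
    # Same idea as the XSLT regex ".{1,5}": greedy chunks of up to `size` characters.
    return re.findall('.{1,%d}' % size, s)

print(chunk('AAAAABBBBB1111122222RRRRR'))
```

Because the regex is greedy, every chunk is exactly `size` characters long except possibly the last one.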

What's the best way to have StringTokenizer split up a line of text into predefined variables

Submitted by ╄→гoц情女王★ on 2019-12-24 04:48:09
Question: I'm not sure if the title is very clear, but basically what I have to do is read a line of text from a file and split it into 8 different string variables. Each line will have the same 8 chunks in the same order (title, author, price, etc.). So for each line of text, I want to end up with 8 strings. The first problem is that the last two fields in the line may or may not be present, so I need to do something with StringTokenizer.hasMoreTokens, otherwise it will die messily when fields 7 and
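In Java the usual fix is to guard each read with StringTokenizer.hasMoreTokens (or use String.split, which simply yields fewer elements). A Python sketch of the same split-and-pad idea; the delimiter and the last five field names are hypothetical, since the question only names the first three:

```python
FIELDS = ['title', 'author', 'price', 'field4', 'field5', 'field6', 'field7', 'field8']

def parse_line(line, delimiter='\t'):
    # Split into chunks, then pad so optional trailing fields
    # become empty strings instead of raising errors.
    parts = line.rstrip('\n').split(delimiter)
    parts += [''] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

print(parse_line('A Title\tAn Author\t9.99\td4\td5\td6'))
```

A line with only six fields still produces all eight keys; the missing trailing fields are simply empty.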

Not able to tag a Hindi sentence properly

Submitted by 浪尽此生 on 2019-12-24 04:43:09
Question: I have recently started a project on Hindi data processing. I have tried executing the code below but have not gotten the expected output. e = u"पूर्ण प्रतिबंध हटाओ : इराक" tokens = nltk.word_tokenize(e) from nltk import pos_tag print tokens tag = nltk.pos_tag(tokens) print tag The output I obtained is shown below: [u'\u092a\u0942\u0930\u094d\u0923', u'\u092a\u094d\u0930\u0924\u093f\u092c\u0902\u0927', u'\u0939\u091f\u093e\u0913', u':', u'\u0907\u0930\u093e\u0915'] [(u'\u092a\u0942\u0930

Insert text in between file lines in Python

Submitted by 时间秒杀一切 on 2019-12-24 03:29:22
Question: I have a file that I am currently reading from using fo = open("file.txt", "r") Then by doing file = open("newfile.txt", "w") file.write(fo.read()) file.write("Hello at the end of the file") fo.close() file.close() I basically copy the file to a new one, but also add some text at the end of the newly created file. How would I be able to insert that line, say, in between two lines separated by an empty line? I.e.: line 1 is right here <---- I want to insert here line 3 is right here Can I
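One way, sketched below, is to read the lines into a list and replace the blank line in place; this in-memory version assumes the file fits in memory and that the insertion point is the first empty line:

```python
def insert_at_blank(text, new_line):
    # Replace the first empty line with `new_line`; everything else is kept.
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            lines[i] = new_line
            break
    return '\n'.join(lines)

print(insert_at_blank('line 1 is right here\n\nline 3 is right here',
                      '<---- I want to insert here'))
```

To apply this to real files, read the source with `open(...).read()`, transform, and write the result to the new file, as in the question's copy step.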

Java code reads UTF-8 text incorrectly

Submitted by 谁说我不能喝 on 2019-12-24 00:43:01
Question: I'm having a problem reading UTF-8 characters in my code (running in Eclipse). I have a text file which has a few lines in it, for example: אך 1234 NOTE: There is a \t before the word, and the word should appear on the left, the number on the right... I don't know how to reverse them here, sorry. That is, a Hebrew word and then a number. I need to separate the word from the number somehow. I tried this: BufferedReader br = new BufferedReader(new FileReader(text)); String content; while (
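The likely culprit is that Java's FileReader always uses the platform default charset; the standard Java fix is to wrap a FileInputStream in an InputStreamReader constructed with an explicit UTF-8 charset. The same explicit-encoding principle, sketched in Python (the file name and contents are illustrative):

```python
# Write a sample line (tab, Hebrew word, tab, number), UTF-8 encoded.
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write('\t\u05d0\u05da\t1234\n')

# Read it back, naming the encoding explicitly instead of relying on
# the platform default -- the usual cause of mojibake.
with open('text.txt', encoding='utf-8') as f:
    for line in f:
        word, number = line.split()
        print(word, int(number))
```

Splitting on whitespace separates the Hebrew word from the number regardless of the leading tab.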

Antlr3 matching tokens without whitespace

Submitted by 元气小坏坏 on 2019-12-23 21:06:13
Question: Given the input "term >1", the number (1) and comparison operator (>) should generate separate nodes in an AST. How can this be achieved? In my tests, matching only occurred if "c" and "1" were separated with a space, like so: " term < 1 ". Current grammar: startExpression : orEx; expressionLevel4 : LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression; expressionLevel3 : (fieldExpression) | expressionLevel4 ; expressionLevel2 : (nearExpression) | expressionLevel3 ; expressionLevel1 :

Lucene, indexing already/externally tokenized tokens and defining own analyzing process

Submitted by 我与影子孤独终老i on 2019-12-23 17:06:33
Question: In the process of using Lucene, I am a bit disappointed. I do not see or understand how I should proceed to feed a Lucene analyzer with something that is already and directly indexable, or how I should proceed to create my own analyzer... For example, if I have a List<MyCustomToken>, which already contains many tokens (and actually much more information about capitalization, etc., that I would also like to index as features on each MyCustomToken), if I understand well what I have read, I

Custom Solr tokenizer is only invoked for the first query

Submitted by ∥☆過路亽.° on 2019-12-23 12:52:41
Question: I created a custom tokenizer; it seems to work fine when checked with admin/analysis.jsp and with System.out logging. However, when I query the field that uses this custom tokenizer, I see that it is only invoked for the first query string (checked via System.out logging). Could you help me by pointing out what I am doing wrong? This is my code: package com.fosp.searchengine; import java.io.Reader; import org.apache.lucene.analysis.WhitespaceTokenizer; import org.apache.solr