tokenize

What is an efficient data structure for tokenized data in Python?

≯℡__Kan透↙ submitted on 2019-12-11 02:52:37
Question: I have a pandas DataFrame that has a column with some text. I want to modify the DataFrame so that there is a column for every distinct word that occurs across all rows, with a boolean indicating whether or not that word occurs in that particular row's value of my text column. I have some code to do this: from pandas import * a = read_table('file.tsv', sep='\t', index_col=False) b = DataFrame(a['text'].str.split().tolist()).stack().value_counts() for i in b.index: a[i] = Series(numpy.zeros
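For reference, a minimal sketch of one way to build such indicator columns, assuming a DataFrame with a 'text' column as in the question; Series.str.get_dummies does the per-word expansion, and the result can be cast to bool:

    import pandas as pd

    # Toy data standing in for the TSV file in the question.
    a = pd.DataFrame({'text': ['the cat sat', 'the dog ran', 'cat and dog']})

    # One indicator column per distinct whitespace-separated word (0/1),
    # cast to bool and joined back onto the original frame.
    indicators = a['text'].str.get_dummies(sep=' ').astype(bool)
    a = a.join(indicators)
    print(a)

If the vocabulary is large, a sparse structure (for example the matrix produced by scikit-learn's CountVectorizer with binary=True) is usually more memory-efficient than one dense column per word.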

elasticsearch tokenize “H&R Blocks” as “H”, “R”, “H&R”, “Blocks”

落花浮王杯 submitted on 2019-12-11 02:03:27
Question: I want to preserve the special character in the token while still tokenizing on special characters. Say I have the phrase "H&R Blocks"; I want to tokenize it as "H", "R", "H&R", "Blocks". I read this post http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html, which explains how to preserve the special character. Answer 1: Try using the word_delimiter token filter. Reading the docs on its use, you can set the parameter preserve_original: true to do
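For reference, a minimal sketch of index settings that use the word_delimiter filter with preserve_original, as the answer suggests; the filter and analyzer names here are made up for illustration:

    # Settings dict as it might be passed to an Elasticsearch index-creation call.
    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "split_keep_original": {
                        "type": "word_delimiter",
                        "preserve_original": True,
                    }
                },
                "analyzer": {
                    "company_name": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "split_keep_original"],
                    }
                },
            }
        }
    }
    # With this analyzer, "H&R Blocks" should come out roughly as
    # "h&r" (the preserved original), "h", "r", and "blocks".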

Python3.0 - tokenize and untokenize

﹥>﹥吖頭↗ submitted on 2019-12-11 01:49:34
Question: I am using something similar to the following simplified script to parse snippets of Python from a larger file: import io import tokenize src = 'foo="bar"' src = bytes(src.encode()) src = io.BytesIO(src) src = list(tokenize.tokenize(src.readline)) for tok in src: print(tok) src = tokenize.untokenize(src) Although the code is not the same in Python 2.x, it uses the same idiom and works just fine. However, running the above snippet using Python 3.0, I get this output: (57, 'utf-8', (0, 0), (0, 0)
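For comparison, a minimal round-trip sketch on Python 3 that tokenizes from a str instead of bytes; generate_tokens avoids the leading ENCODING token shown in the question's output, and untokenize then returns a str:

    import io
    import tokenize

    src = 'foo="bar"'

    # generate_tokens takes a readline over str, so no ENCODING token is emitted.
    tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
    for tok in tokens:
        print(tok)

    # With full 5-tuples, untokenize reconstructs the source as a str.
    print(tokenize.untokenize(tokens))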

Elasticsearch: index a field with keyword tokenizer but without stopwords

拈花ヽ惹草 submitted on 2019-12-11 01:42:49
Question: I am looking for a way to search company names with keyword tokenizing but without stopwords. For example: the indexed company name is "Hansel und Gretel Gmbh.", where "und" and "Gmbh" are stop words for the company name. If the search term is "Hansel Gretel", that document should be found; if the search term is "Hansel", no document should be found; and if the search term is "hansel gmbh", no document should be found either. I have tried to combine the keyword tokenizer with stopwords in
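As a point of reference, a stop filter only removes individual tokens, so it has no effect behind the keyword tokenizer, which emits the whole name as a single token; stopword handling has to sit behind a tokenizer that actually splits the name. A minimal sketch of such an analyzer (names made up for illustration, and this alone does not enforce the requirement that "Hansel" by itself must not match):

    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "company_stop": {
                        "type": "stop",
                        "stopwords": ["und", "gmbh"],  # company-name stopwords
                    }
                },
                "analyzer": {
                    "company_words": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "company_stop"],
                    }
                },
            }
        }
    }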

tokenizing a string of data into a vector of structs?

痞子三分冷 submitted on 2019-12-11 00:14:00
Question: So I have the following string of data, which is being received through a TCP Winsock connection, and would like to do an advanced tokenization into a vector of structs, where each struct represents one record. std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n"; struct table_t { std::string key; std::string first; std::string last; std::string rank; std::string additional; }; Each record in the string is delimited by a newline. My attempt at splitting up
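For reference, the same two-level split (records on newlines, fields on ':') expressed as a short Python sketch; the field names mirror the struct in the question:

    from collections import namedtuple

    Record = namedtuple('Record', 'key first last rank additional')

    buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n"

    # Outer split: one record per line; inner split: one field per ':'.
    records = [Record(*line.split(':')) for line in buf.splitlines() if line]
    for r in records:
        print(r)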

get the last token of a string in C

不羁的心 submitted on 2019-12-10 23:45:34
Question: What I want to do is, given an input string whose size and number of tokens I won't know in advance, print its last token. E.g.: char* s = "some/very/big/string"; char* token; const char delimiter[2] = "/"; token = strtok(s, delimiter); while (token != NULL) { printf("%s\n", token); token = strtok(NULL, delimiter); } return token; I want the return to be the last token, but what I get is (null). Any workarounds? I've searched the web and can't seem to find an answer to this. At least

Responsibilities of the Lexer and the Parser

柔情痞子 submitted on 2019-12-10 22:04:31
Question: I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant. For the input foo = 42, three tokens are recognized: foo (identifier), = (symbol), 42 (integer literal). So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens: foo
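One common way to handle this is to make the integer-literal rule refuse to match when an identifier character immediately follows, so 42bar is rejected in the lexer instead of being split into 42 and bar. A minimal regex-based sketch in Python (token names and rules invented for illustration):

    import re

    # The (?![A-Za-z0-9_]) lookahead stops '42bar' from lexing as INT '42' + IDENT 'bar'.
    TOKEN_SPEC = [
        ('IDENT',  r'[A-Za-z_][A-Za-z0-9_]*'),
        ('INT',    r'[0-9]+(?![A-Za-z0-9_])'),
        ('ASSIGN', r'='),
        ('SKIP',   r'\s+'),
    ]
    MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

    def lex(text):
        pos = 0
        while pos < len(text):
            m = MASTER.match(text, pos)
            if not m:
                raise SyntaxError(f'invalid token at position {pos}: {text[pos:]!r}')
            if m.lastgroup != 'SKIP':
                yield (m.lastgroup, m.group())
            pos = m.end()

    print(list(lex('foo = 42')))   # [('IDENT', 'foo'), ('ASSIGN', '='), ('INT', '42')]
    # list(lex('foo = 42bar')) raises SyntaxError: '42bar' satisfies neither rule.

An alternative is to lex maximal words first and validate them afterwards, but the lookahead keeps the error inside the lexer itself.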

Can someone give a simple explanation of the elements of Natural Language Processing?

十年热恋 submitted on 2019-12-10 20:26:52
Question: I'm new to Natural Language Processing and I'm confused about the terms used. What is tokenization? POS tagging? Entity identification? Is tokenization just splitting the text into parts that can have a meaning, or giving a meaning to those parts? And what is it called when I determine that something is a noun, verb, or adjective? And what if I want to divide text into dates, names, currency? I need a simple explanation of the areas/terms used in NLP. Answer 1: To add to dmn's explanation: In general,
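To make the three terms concrete, a small NLTK-based sketch (assuming the relevant NLTK data packages, such as the punkt tokenizer and the POS/NER models, have already been downloaded):

    import nltk

    sentence = "Apple paid 1 billion dollars for the startup on January 3, 2014."

    # Tokenization: split raw text into word-like units.
    tokens = nltk.word_tokenize(sentence)

    # POS tagging: label each token as noun, verb, adjective, etc.
    tagged = nltk.pos_tag(tokens)

    # Named-entity recognition: group tokens into entities such as
    # organizations, dates, and monetary amounts.
    entities = nltk.ne_chunk(tagged)

    print(tokens)
    print(tagged)
    print(entities)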

Tokenizing non-English Text in Python

自古美人都是妖i submitted on 2019-12-10 20:09:27
Question: I have a Persian text file that has some lines like this: ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف I want to generate a list of words from this line. For me the word borders are numbers, like 6, 7, etc. in the above line, and also the ، character, so the list should be: [ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف'] I want to do this in Python 3.3. What is the best way of doing this? I really appreciate any help. EDIT: I got a number of answers, but when
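A minimal sketch of one way to do this with a regular expression, treating runs of digits, whitespace, and the Arabic comma ، as the word borders (the zero-width non-joiner inside words such as آب‌دهان is not whitespace, so those words stay intact):

    import re

    line = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'

    # Split on any run of digits, whitespace, or the Arabic comma,
    # then drop the empty strings the split leaves behind.
    words = [w for w in re.split(r'[0-9\s،]+', line) if w]
    print(words)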

How to implement XSLT tokenize function?

不问归期 submitted on 2019-12-10 17:47:12
Question: It seems the EXSLT tokenize function is not available with PHP's XSLTProcessor (XSLT 1.0). I tried to implement it in pure XSL, but I can't make it work: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:func="http://exslt.org/functions" xmlns:exsl="http://exslt.org/common" xmlns:my="http://mydomain.com/"> <func:function name="my:tokenize"> <xsl:param name="string"/> <xsl:param name="separator" select="'|'"/> <xsl:variable name="item" select="substring-before