tokenize

What is an efficient data structure for tokenized data in Python?

≯℡__Kan透↙ submitted on 2019-12-11 02:52:37
Question: I have a pandas DataFrame that has a column with some text. I want to modify the DataFrame so that there is a column for every distinct word that occurs across all rows, with a boolean indicating whether or not that word occurs in that particular row's value of my text column. I have some code to do this: from pandas import * a = read_table('file.tsv', sep='\t', index_col=False) b = DataFrame(a['text'].str.split().tolist()).stack().value_counts() for i in b.index: a[i] = Series(numpy.zeros
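For reference, a minimal sketch of one way to build such indicator columns, assuming a DataFrame with a 'text' column as in the question; Series.str.get_dummies does the per-word expansion, and the result can be cast to bool:

    import pandas as pd

    # Toy data standing in for the TSV file in the question.
    a = pd.DataFrame({'text': ['the cat sat', 'the dog ran', 'cat and dog']})

    # One indicator column per distinct whitespace-separated word (0/1),
    # cast to bool and joined back onto the original frame.
    indicators = a['text'].str.get_dummies(sep=' ').astype(bool)
    a = a.join(indicators)
    print(a)

If the vocabulary is large, a sparse structure (for example the matrix produced by scikit-learn's CountVectorizer with binary=True) is usually more memory-efficient than one dense column per word.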

elasticsearch tokenize “H&R Blocks” as “H”, “R”, “H&R”, “Blocks”

落花浮王杯 submitted on 2019-12-11 02:03:27
Question: I want to preserve the special character in the token while still tokenizing on special characters. Say I have the phrase "H&R Blocks"; I want to tokenize it as "H", "R", "H&R", "Blocks". I read this post http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html, which explains how to preserve the special character. Answer 1: Try using the word_delimiter token filter. Reading the docs on its use, you can set the parameter preserve_original: true to do
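For reference, a minimal sketch of index settings that use the word_delimiter filter with preserve_original, as the answer suggests; the filter and analyzer names here are made up for illustration:

    # Settings dict as it might be passed to an Elasticsearch index-creation call.
    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "split_keep_original": {
                        "type": "word_delimiter",
                        "preserve_original": True,
                    }
                },
                "analyzer": {
                    "company_name": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "split_keep_original"],
                    }
                },
            }
        }
    }
    # With this analyzer, "H&R Blocks" should come out roughly as
    # "h&r" (the preserved original), "h", "r", and "blocks".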

Python3.0 - tokenize and untokenize

﹥>﹥吖頭↗ submitted on 2019-12-11 01:49:34
Question: I am using something similar to the following simplified script to parse snippets of Python from a larger file: import io import tokenize src = 'foo="bar"' src = bytes(src.encode()) src = io.BytesIO(src) src = list(tokenize.tokenize(src.readline)) for tok in src: print(tok) src = tokenize.untokenize(src) Although the code is not the same in Python 2.x, it uses the same idiom and works just fine. However, running the above snippet using Python 3.0, I get this output: (57, 'utf-8', (0, 0), (0, 0)
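For comparison, a minimal round-trip sketch on Python 3 that tokenizes from a str instead of bytes; generate_tokens avoids the leading ENCODING token shown in the question's output, and untokenize then returns a str:

    import io
    import tokenize

    src = 'foo="bar"'

    # generate_tokens takes a readline over str, so no ENCODING token is emitted.
    tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
    for tok in tokens:
        print(tok)

    # With full 5-tuples, untokenize reconstructs the source as a str.
    print(tokenize.untokenize(tokens))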

Elasticsearch: index a field with keyword tokenizer but without stopwords

拈花ヽ惹草 submitted on 2019-12-11 01:42:49
Question: I am looking for a way to search company names with keyword tokenizing but without stopwords. For example: the indexed company name is "Hansel und Gretel Gmbh.", where "und" and "Gmbh" are stop words for the company name. If the search term is "Hansel Gretel", that document should be found; if the search term is "Hansel", no document should be found; and if the search term is "hansel gmbh", no document should be found either. I have tried to combine the keyword tokenizer with stopwords in
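As a point of reference, a stop filter only removes individual tokens, so it has no effect behind the keyword tokenizer, which emits the whole name as a single token; stopword handling has to sit behind a tokenizer that actually splits the name. A minimal sketch of such an analyzer (names made up for illustration, and this alone does not enforce the requirement that "Hansel" by itself must not match):

    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "company_stop": {
                        "type": "stop",
                        "stopwords": ["und", "gmbh"],  # company-name stopwords
                    }
                },
                "analyzer": {
                    "company_words": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "company_stop"],
                    }
                },
            }
        }
    }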

tokenizing a string of data into a vector of structs?

痞子三分冷 submitted on 2019-12-11 00:14:00
Question: So I have the following string of data, which is being received through a TCP Winsock connection, and would like to do an advanced tokenization into a vector of structs, where each struct represents one record. std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n"; struct table_t { std::string key; std::string first; std::string last; std::string rank; std::string additional; }; Each record in the string is delimited by a newline. My attempt at splitting up
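For reference, the same two-level split (records on newlines, fields on ':') expressed as a short Python sketch; the field names mirror the struct in the question:

    from collections import namedtuple

    Record = namedtuple('Record', 'key first last rank additional')

    buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n"

    # Outer split: one record per line; inner split: one field per ':'.
    records = [Record(*line.split(':')) for line in buf.splitlines() if line]
    for r in records:
        print(r)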

get the last token of a string in C

不羁的心 submitted on 2019-12-10 23:45:34
Question: What I want to do is, given an input string whose size and number of tokens I won't know in advance, print its last token. E.g.: char* s = "some/very/big/string"; char* token; const char delimiter[2] = "/"; token = strtok(s, delimiter); while (token != NULL) { printf("%s\n", token); token = strtok(NULL, delimiter); } return token; I want the return to be the last token, but what I get is (null). Any workarounds? I've searched the web and can't seem to find an answer to this. At least

Responsibilities of the Lexer and the Parser

柔情痞子 submitted on 2019-12-10 22:04:31
Question: I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant. For the input foo = 42, three tokens are recognized: foo (identifier), = (symbol), 42 (integer literal). So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens: foo
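One common way to handle this is to make the integer-literal rule refuse to match when an identifier character immediately follows, so 42bar is rejected in the lexer instead of being split into 42 and bar. A minimal regex-based sketch in Python (token names and rules invented for illustration):

    import re

    # The (?![A-Za-z0-9_]) lookahead stops '42bar' from lexing as INT '42' + IDENT 'bar'.
    TOKEN_SPEC = [
        ('IDENT',  r'[A-Za-z_][A-Za-z0-9_]*'),
        ('INT',    r'[0-9]+(?![A-Za-z0-9_])'),
        ('ASSIGN', r'='),
        ('SKIP',   r'\s+'),
    ]
    MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

    def lex(text):
        pos = 0
        while pos < len(text):
            m = MASTER.match(text, pos)
            if not m:
                raise SyntaxError(f'invalid token at position {pos}: {text[pos:]!r}')
            if m.lastgroup != 'SKIP':
                yield (m.lastgroup, m.group())
            pos = m.end()

    print(list(lex('foo = 42')))   # [('IDENT', 'foo'), ('ASSIGN', '='), ('INT', '42')]
    # list(lex('foo = 42bar')) raises SyntaxError: '42bar' satisfies neither rule.

An alternative is to lex maximal words first and validate them afterwards, but the lookahead keeps the error inside the lexer itself.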

Can someone give a simple explanation of the elements of Natural Language Processing?

十年热恋 submitted on 2019-12-10 20:26:52
Question: I'm new to Natural Language Processing and I'm confused about the terms used. What is tokenization? POS tagging? Entity identification? Is tokenization just splitting the text into parts that can have a meaning, or giving a meaning to those parts? And what is it called when I determine that something is a noun, verb, or adjective? And what if I want to divide text into dates, names, currency? I need a simple explanation of the areas/terms used in NLP. Answer 1: To add to dmn's explanation: In general,
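To make the three terms concrete, a small NLTK-based sketch (assuming the relevant NLTK data packages, such as the punkt tokenizer and the POS/NER models, have already been downloaded):

    import nltk

    sentence = "Apple paid 1 billion dollars for the startup on January 3, 2014."

    # Tokenization: split raw text into word-like units.
    tokens = nltk.word_tokenize(sentence)

    # POS tagging: label each token as noun, verb, adjective, etc.
    tagged = nltk.pos_tag(tokens)

    # Named-entity recognition: group tokens into entities such as
    # organizations, dates, and monetary amounts.
    entities = nltk.ne_chunk(tagged)

    print(tokens)
    print(tagged)
    print(entities)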

Tokenizing non-English Text in Python

自古美人都是妖i submitted on 2019-12-10 20:09:27
Question: I have a Persian text file that has some lines like this: ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف I want to generate a list of words from this line. For me the word borders are numbers, like 6, 7, etc. in the above line, and also the ، character, so the list should be: [ 'ذوب','خوی','بزاق','آب‌دهان','یم','زهاب','آبرو','حیثیت' ,'شرف'] I want to do this in Python 3.3. What is the best way of doing this? I really appreciate any help. EDIT: I got a number of answers, but when
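A minimal sketch of one way to do this with a regular expression, treating runs of digits, whitespace, and the Arabic comma ، as the word borders (the zero-width non-joiner inside words such as آب‌دهان is not whitespace, so those words stay intact):

    import re

    line = 'ذوب 6 خوی 7 بزاق ،آب‌دهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'

    # Split on any run of digits, whitespace, or the Arabic comma,
    # then drop the empty strings the split leaves behind.
    words = [w for w in re.split(r'[0-9\s،]+', line) if w]
    print(words)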

How to implement XSLT tokenize function?

不问归期 submitted on 2019-12-10 17:47:12
Question: It seems the EXSLT tokenize function is not available with PHP's XSLTProcessor (XSLT 1.0). I tried to implement it in pure XSL, but I can't make it work: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:func="http://exslt.org/functions" xmlns:exsl="http://exslt.org/common" xmlns:my="http://mydomain.com/"> <func:function name="my:tokenize"> <xsl:param name="string"/> <xsl:param name="separator" select="'|'"/> <xsl:variable name="item" select="substring-before