tokenize

Solr: Can't search for numbers mixed with characters

亡梦爱人 submitted on 2019-12-23 11:59:34
Question: I have some items in my index (Solr 4.4) whose names look like Foobar 135g, where the 135g refers to a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found. I analysed the query on the Solr admin panel's "Analysis" page, and everything looks good there: the fields are indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the tokens). But there has to be an issue
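
A mismatch between index-time and query-time analysis is a common cause when letter/number compounds like 135g are involved. Below is a generic Solr 4.x fieldType sketch (the type name text_mixed and the filter settings are assumptions, not the asker's actual schema): WordDelimiterFilter indexes both the split parts (135, g) and the joined form (135g), while the query side skips catenation so phrase queries still line up.

    <fieldType name="text_mixed" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- index the parts (135, g) and the catenated form (135g) -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateAll="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- no catenation at query time, so phrase positions match -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>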

How does the String.Split method determine separator precedence when passed multiple multi-character separators?

随声附和 submitted on 2019-12-23 08:03:40
Question: If you have this code:

    "......".Split(new String[]{"...", ".."}, StringSplitOptions.None);

the resulting array elements are:

1. ""
2. ""
3. ""

Now if you reverse the order of the separators,

    "......".Split(new String[]{"..", "..."}, StringSplitOptions.None);

the resulting array elements are:

1. ""
2. ""
3. ""
4. ""

From these two examples I feel inclined to conclude that the Split method recursively tokenizes as it goes through each element of the array from left to right. However, once we
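
The behaviour in the excerpt is consistent with a single left-to-right scan in which, at each position, the separators are tried in array order and the first match wins. A small Python sketch of that inferred algorithm (an illustration of the deduced behaviour, not the actual .NET implementation):

    def split_multi(s, separators):
        # Scan left to right; at each index try the separators in
        # array order and consume the first one that matches.
        parts, start, i = [], 0, 0
        while i < len(s):
            for sep in separators:
                if s.startswith(sep, i):
                    parts.append(s[start:i])
                    i += len(sep)
                    start = i
                    break
            else:
                i += 1
        parts.append(s[start:])
        return parts

    print(split_multi("......", ["...", ".."]))   # ['', '', '']
    print(split_multi("......", ["..", "..."]))   # ['', '', '', '']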

Elasticsearch - how to make shorter phrase more relevant in result

你。 submitted on 2019-12-23 05:52:16
Question: Hi, can somebody please explain how to achieve a higher score in search results for a shorter phrase compared with a longer phrase? E.g. if I have two words, ABCXXX and ABCXXXXX, indexed with an edge ngram tokenizer, and I search for the token ABC, the score is the same for both. I would like to get a higher score for ABCXXX and a lower one for ABCXXXXX. The mapping looks like:

    { "settings": { "index": { "refresh_interval": "1m", "number_of_shards": "1", "number_of_replicas": "1", "analysis": { "filter": { "autocomplete_filter": {
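
One common pattern for this is to index an exact (not ngrammed) subfield next to the autocomplete field and boost it, so matches closer to the full term rank higher. A hedged query sketch (the field names name.autocomplete and name.raw are assumptions, since the mapping above is truncated):

    {
      "query": {
        "bool": {
          "should": [
            { "match": { "name.autocomplete": "ABC" } },
            { "match": { "name.raw": { "query": "ABC", "boost": 2 } } }
          ]
        }
      }
    }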

Modify Python nltk.word_tokenize to exclude "#" as a delimiter

血红的双手。 submitted on 2019-12-23 05:00:16
Question: I am using Python's NLTK library to tokenize my sentences. If my code is

    text = "C# billion dollars; we don't own an ounce C++"
    print nltk.word_tokenize(text)

I get this as my output:

    ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, ,, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the way + isn't a delimiter and thus C++ appears as a single token? I want my output to be ['C#', 'billion', 'dollars', '
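
One workaround that avoids touching the tokenizer itself is to post-process the token list and glue a stray '#' back onto the preceding token. A minimal sketch, assuming '#' always belongs to the token before it (as in C#):

    import nltk

    def tokenize_keep_hash(text):
        # Re-attach '#' to the preceding token, so ['C', '#'] -> ['C#'].
        merged = []
        for tok in nltk.word_tokenize(text):
            if tok == "#" and merged:
                merged[-1] += "#"
            else:
                merged.append(tok)
        return merged

    print(tokenize_keep_hash("C# billion dollars; we don't own an ounce C++"))
    # ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']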

Access Violation With Pointers? - C++

岁酱吖の submitted on 2019-12-23 04:26:53
Question: I've written a simple string-tokenizing program using pointers for a recent school project. However, I'm having trouble with my StringTokenizer::Next() method, which, when called, is supposed to return a pointer to the first letter of the next word in the char array. I get no compile-time errors, but I get a runtime error which states: Unhandled exception at 0x012c240f in Project 5.exe: 0xC0000005: Access violation reading location 0x002b0000. The program currently tokenizes the char array,

Splitting a sentence without any whitespace/separators into a sentence with whitespace

纵然是瞬间 submitted on 2019-12-23 02:29:06
Question: I'm working on an end-of-semester project for a Programming Languages course. The assignment is given below. I'm finishing writing it in Java, but I'm having a lot of trouble writing it in Prolog, so this question is as much a request for help with the assignment as it is an attempt to understand Prolog better. Any help that I can get would be GREATLY appreciated. A sentence contains words, all occurring in a dictionary, that happen to be concatenated without
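
The core of the assignment is a classic word-break search: try each dictionary word as a prefix and recurse on the remainder, backtracking on failure (which is exactly the shape a Prolog solution gets for free). A minimal Python sketch of the idea, with a made-up example dictionary:

    def segment(sentence, dictionary):
        # Try each dictionary word as a prefix, then segment the rest.
        # Returns one valid word list, or None if no split exists.
        if not sentence:
            return []
        for word in dictionary:
            if sentence.startswith(word):
                rest = segment(sentence[len(word):], dictionary)
                if rest is not None:
                    return [word] + rest
        return None

    words = ["the", "these", "sea", "apples", "are", "good"]
    print(segment("theseapplesaregood", words))
    # ['these', 'apples', 'are', 'good']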

How to insert a character every N characters in a string in C++

自闭症网瘾萝莉.ら submitted on 2019-12-22 11:36:30
Question: How can I insert a character into a string after every character? I need to insert '|' between every two characters of the string. In other words (C++), "Tokens all around!" turns into "T|o|k|e|n|s| |a|l|l| |a|r|o|u|n|d|!" (no, that's not an array). Thanks

Answer 1:

    #include <sstream>
    #include <string>

    std::string tokenize(const std::string& s) {
        if (!s.size()) {
            return "";
        }
        std::stringstream ss;
        ss << s[0];
        // emit '|' before every character after the first
        for (std::size_t i = 1; i < s.size(); i++) {
            ss << '|' << s[i];
        }
        return ss.str();
    }

Answer 2: I think I'd use a standard algorithm and

String tokenizer without using strtok()

半世苍凉 submitted on 2019-12-22 10:45:45
Question: I'm in the process of writing a string tokenizer without using strtok(). This is mainly for my own betterment and for a greater understanding of pointers. I think I almost have it, but I've been receiving the following errors:

    myToc.c:25: warning: assignment makes integer from pointer without a cast
    myToc.c:35: warning: assignment makes integer from pointer without a cast
    myToc.c:44: error: invalid type argument of 'unary *' (have 'int')

What I'm doing is looping through the string sent to the method, finding each delimiter, and replacing

Solr not tokenizing protected words

半腔热情 submitted on 2019-12-22 10:28:36
Question: I have documents in Solr/Lucene (3.x) with a special copy field facet_headline, in order to have an unstemmed field for faceting. Sometimes two or more words belong together and should be handled/counted as one word, for example "kim jong il". So the headline "Saturday: kim jong il had died" should be split into:

    Saturday
    kim jong il
    had
    died

For this reason I decided to use protected words (protwords), where I added kim jong il. The schema.xml looks like this. <fieldType name="facet
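
For context, a protwords file is conventionally wired into the analysis chain after the tokenizer, as in the generic Solr 3.x sketch below (the field type name is made up; this is not the asker's truncated schema). Note that it shields tokens from later filters such as stemmers, a step that happens after the tokenizer has already split the text.

    <fieldType name="facet_text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- marks words listed in protwords.txt so later filters
             (e.g. the stemmer below) leave them unchanged -->
        <filter class="solr.KeywordMarkerFilterFactory" protectedWords="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>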