Create analyzer with Edge N Gram analyzer and char filter which replaces space with new line

ε祈祈猫儿з submitted on 2019-12-24 11:15:30

Question


I have text like the following coming in: foo bar, hello world, etc. I created an analyzer using the Edge NGram tokenizer, and with the analyze API it produces the tokens below.

{
  "tokens": [
    {
      "token": "f",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "fo",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "foo",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "b",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    },
    {
      "token": "ba",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 5
    },
    {
      "token": "bar",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 6
    }
  ]
}

But when my code passes the text "foo bar" to the tokenStream method, it creates the following tokens:

f, fo, foo, foo , foo b, foo ba, foo bar.

These do not match the tokens returned by the analyze API. How can I add a char filter that removes the spaces in the text, so that the Edge NGram tokenizer is applied to the individual terms?

So, for the foo bar example, calling the tokenStream method should create the following tokens:

f, fo, foo, b, ba, bar.

I tried adding a char filter to the Java code that creates the analyzer. The code is below.

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    NormalizeCharMap normalizeCharMap = new NormalizeCharMap();
    normalizeCharMap.add(" ", "\\u2424");
    Reader replaceDots = new MappingCharFilter(normalizeCharMap, reader);
    TokenStream result = new EdgeNGramTokenizer(replaceDots, EdgeNGramTokenizer.DEFAULT_SIDE, 1, 30);
    return result;
}

But it takes \u2424 literally, as is. Also, please let me know whether my Analyzer code is correct.
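One likely reason the replacement shows up as literal text: in Java source, "\\u2424" is an escaped backslash followed by the characters u2424, so the string holds six characters, whereas "\u2424" is the single character U+2424 (SYMBOL FOR NEWLINE). A minimal sketch using only the JDK; the class name EscapeDemo and the helper mapSpaces are hypothetical, and String.replace merely stands in for the mapping char filter:

```java
// Hypothetical demo class; not part of the original analyzer code.
final class EscapeDemo {
    // Escaped backslash followed by the letters u2424: six characters in total.
    static final String LITERAL = "\\u2424";
    // A real Unicode escape: the single character U+2424 (SYMBOL FOR NEWLINE).
    static final String SYMBOL = "\u2424";

    // Stand-in for the char filter mapping: replace every space with the given text.
    static String mapSpaces(String input, String replacement) {
        return input.replace(" ", replacement);
    }

    public static void main(String[] args) {
        System.out.println(LITERAL.length()); // 6
        System.out.println(SYMBOL.length());  // 1
        // The space becomes six visible characters, so the result has 12 characters.
        System.out.println(mapSpaces("foo bar", LITERAL).length()); // 12
        // The space becomes one character, so the length stays at 7.
        System.out.println(mapSpaces("foo bar", SYMBOL).length());  // 7
    }
}
```

Even with the single-character form, this only changes which character sits between the terms; whether the tokenizer then splits on it is a separate question.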


Answer 1:


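What you tested with the analyze API is an edge-ngram token filter, which is different from an edge-ngram tokenizer. In your code, the tokenizer builds its n-grams from the raw input, spaces included, whereas a token filter builds them from each token emitted by the tokenizer that precedes it.

In your code, you need to replace EdgeNGramTokenizer with EdgeNGramTokenFilter if you want the same behavior as you tested with the analyze API.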
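The difference can be illustrated without Lucene. The plain-Java sketch below only imitates the two behaviours; it is not the Lucene implementation. The class and method names are hypothetical, and splitting on whitespace is an assumed upstream tokenizer. wholeInput grows front n-grams across the entire text, matching the tokenStream output in the question, while perTerm splits first and then grows n-grams per term, matching the analyze API output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it mimics the observed outputs only.
final class EdgeNGramSketch {
    // Front edge n-grams of one string, with lengths from minGram up to maxGram.
    static List<String> edgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, text.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(text.substring(0, len));
        }
        return grams;
    }

    // Tokenizer-like behaviour: n-grams over the raw input, spaces included.
    static List<String> wholeInput(String text, int minGram, int maxGram) {
        return edgeNGrams(text, minGram, maxGram);
    }

    // Filter-like behaviour: split on whitespace first (assumed tokenizer),
    // then build n-grams for each term separately.
    static List<String> perTerm(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            grams.addAll(edgeNGrams(term, minGram, maxGram));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("foo bar", 1, 30));
        System.out.println(perTerm("foo bar", 1, 30));
    }
}
```

For "foo bar" with gram sizes 1 to 30, wholeInput returns the seven tokens listed in the question (f through foo bar), and perTerm returns f, fo, foo, b, ba, bar.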



Source: https://stackoverflow.com/questions/51930012/create-analyzer-with-edge-n-gram-analyzer-and-char-filter-which-replaces-space-w
