How does the edge ngram token filter differ from the ngram token filter?

时光取名叫无心 2021-01-07 17:38

As I am new to Elasticsearch, I cannot tell the difference between the ngram token filter and the edge ngram token filter.

How these two differ f

2 Answers
  • 2021-01-07 18:22

    ngram slides a cursor across the text while breaking it into fragments:

    Text: Red Wine
    
    Options:
        min_gram: 2
        max_gram: 3
    
    Result: Re, Red, ed, Wi, Win, in, ine, ne
    

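    As you can see, at each start position the tokenizer emits fragments of length min_gram up to max_gram, then moves the cursor one character forward and repeats. The text is first split into the words Red and Wine (as happens when token_chars is set to letter and digit), so no fragment spans the space.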


    edge_ngram does exactly the same thing as ngram, but it doesn't move the cursor:

    Text: Red Wine
    
    Options:
        min_gram: 2
        max_gram: 3
    
    Result: Re, Red
    

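    Why didn't it return Wi or Win? Because the start position never moves: every fragment begins at position zero of the token, and only its length grows from min_gram to max_gram. Note also that no token_chars are set here, so the whole text "Red Wine" is treated as a single token. With token_chars set to letter and digit, each word would be its own token and the result would be Re, Red, Wi, Win.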


    Think of edge_ngram as a substring function whose start index is fixed at zero, as in JavaScript:

    // ngram
    let str = "Red Wine";
    console.log(str.substring(0, 2)); // Re
    console.log(str.substring(0, 3)); // Red
    console.log(str.substring(1, 3)); // ed, start from position 1
    // ...
    
    // edge_ngram
    // notice that the start position is always zero
    console.log(str.substring(0, 2)); // Re
    console.log(str.substring(0, 3)); // Red
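
    The substring calls above can be generalized into two small loops. This is only a rough sketch of the idea in plain JavaScript, not Elasticsearch's actual implementation: the helper names splitTokens, ngrams and edgeNgrams are made up for illustration, and the splitting only approximates token_chars: ["letter", "digit"] for ASCII text.

```javascript
// Hypothetical helper: split on anything that is not an ASCII letter or digit,
// roughly what token_chars: ["letter", "digit"] does.
function splitTokens(text) {
  return text.split(/[^A-Za-z0-9]+/).filter((t) => t.length > 0);
}

// ngram: every start position, every length from minGram to maxGram.
function ngrams(token, minGram, maxGram) {
  const out = [];
  for (let start = 0; start < token.length; start++) {
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      out.push(token.substring(start, start + len));
    }
  }
  return out;
}

// edge ngram: same lengths, but the start position is pinned to 0.
function edgeNgrams(token, minGram, maxGram) {
  const out = [];
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    out.push(token.substring(0, len));
  }
  return out;
}

const text = "Red Wine";

// ngram applied to each word
const ngramResult = splitTokens(text).flatMap((t) => ngrams(t, 2, 3));
// edge ngram applied to the whole text as one token (no token_chars)
const edgeWholeText = edgeNgrams(text, 2, 3);
// edge ngram applied to each word (token_chars: letter, digit)
const edgePerWord = splitTokens(text).flatMap((t) => edgeNgrams(t, 2, 3));

console.log(ngramResult.join(", "));   // Re, Red, ed, Wi, Win, in, ine, ne
console.log(edgeWholeText.join(", ")); // Re, Red
console.log(edgePerWord.join(", "));   // Re, Red, Wi, Win
```

    The only difference between the two functions is the outer loop over start positions, which is the "moving cursor" that edge ngram lacks.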
    

    Try it yourself in Kibana:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "my_ngram_tokenizer" : {
              "type" : "ngram",
              "min_gram": 2,
              "max_gram": 3,
              "token_chars": [
                "letter",
                "digit"
              ]
            },
            "my_edge_ngram_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 3
            }
          }
        }
      }
    }
    
    POST my_index/_analyze
    {
      "tokenizer": "my_ngram_tokenizer",
      "text": "Red Wine"
    }
    
    POST my_index/_analyze
    {
      "tokenizer": "my_edge_ngram_tokenizer", 
      "text": "Red Wine"
    }
    
  • 2021-01-07 18:34

    I think the documentation is pretty clear on this:

    This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

    And the best example for nGram tokenizer again comes from the documentation:

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'

    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
    

    With this tokenizer definition:

    "type" : "nGram",
    "min_gram" : "2",
    "max_gram" : "3",
    "token_chars": [ "letter", "digit" ]
    

    In short:

    • the tokenizer first splits the text into tokens according to its configuration (here, token_chars). In this example: FC, Schalke, 04.
    • nGram generates groups of characters of at least min_gram and at most max_gram characters from each token. Basically, the tokens are split into small chunks, and a chunk can start at any character of the token.
    • edgeNGram does the same, but the chunks always start at the beginning of each token. Basically, the chunks are anchored at the first character of the token.

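    For the same text as above, an edgeNGram with min_gram 2 and max_gram 5 generates this: FC, Sc, Sch, Scha, Schal, 04 (with max_gram 3, as in the nGram definition above, the list stops at FC, Sc, Sch, 04). Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).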
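
    To check these lists by hand, here is a minimal sketch in plain JavaScript. It is not Elasticsearch code: the function name grams and its edgeOnly flag are invented for illustration, and the split on non-alphanumeric ASCII characters only approximates token_chars: [ "letter", "digit" ].

```javascript
// Hypothetical helper: tokenize on non-letter/non-digit characters, then
// emit n-grams per token. With edgeOnly = true, only start position 0 is used.
function grams(text, minGram, maxGram, edgeOnly) {
  const tokens = text.split(/[^A-Za-z0-9]+/).filter(Boolean);
  const out = [];
  for (const token of tokens) {
    // Last start position that can still fit a chunk of minGram characters.
    const lastStart = edgeOnly ? 0 : token.length - minGram;
    for (let start = 0; start <= lastStart; start++) {
      for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
        out.push(token.slice(start, start + len));
      }
    }
  }
  return out;
}

const input = "FC Schalke 04";
const nGram23 = grams(input, 2, 3, false);
const edge23 = grams(input, 2, 3, true);
const edge25 = grams(input, 2, 5, true);

console.log(nGram23.join(", ")); // FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
console.log(edge23.join(", "));  // FC, Sc, Sch, 04
console.log(edge25.join(", "));  // FC, Sc, Sch, Scha, Schal, 04
```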
