solr scoring - fieldnorm

迷失自我 2021-01-19 05:24

I have the following records and the scores against them when I search for "iphone" -

Record1: FieldName - DisplayName : "Iphone" FieldName - Name : "Iphone"

2 Answers
  • 2021-01-19 06:03

    fieldnorm takes into account the field length, i.e. the number of indexed terms.
    The field type used for the fields DisplayName & Name is text, which applies the stopword and word-delimiter filters.

    Record 1 - Iphone
    Would generate a single token - Iphone

    Record 2 - The Iphone Book
    Would generate 2 tokens - Iphone, Book
    "The" would be removed by the stopword filter.

    Record 3 - iPhone
    Would also generate 2 tokens - i, Phone
    Because iPhone has a case change, the word delimiter filter with splitOnCaseChange splits iPhone into the 2 tokens i and Phone, producing the same field norm as Record 2.
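    The effect of the token counts above on the raw (pre-encoding) norm can be sketched as follows. This is an illustrative computation of lengthNorm = 1/sqrt(number of terms), not Solr's actual code path; the class and method names are hypothetical:

    ```java
    public class LengthNormSketch {
        // lengthNorm = 1 / sqrt(number of terms in the field)
        static float lengthNorm(int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }

        public static void main(String[] args) {
            // Record 1 ("Iphone") analyzes to a single token
            System.out.println(lengthNorm(1)); // 1.0
            // Records 2 and 3 both analyze to 2 tokens,
            // so they share the same, smaller norm (~0.7071)
            System.out.println(lengthNorm(2));
        }
    }
    ```

    A shorter field yields a larger norm, which is why Record 1 outscores the other two.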

  • 2021-01-19 06:19

    This is the answer to user1021590's follow-up question/answer on the "da vinci code" search example.

    The reason all the documents get the same score is a subtle implementation detail of lengthNorm. Lucene's TFIDFSimilarity documentation states the following about norm(t, d):

    the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

    If you dig into the code, you see that this float-to-byte encoding is implemented as follows:

    public static byte floatToByte315(float f)
    {
        int bits = Float.floatToRawIntBits(f);
        // Keep only the high-order bits: sign, exponent, and 3 mantissa
        // bits (counting the implicit leading 1)
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3))
        {
            // Underflow: 0 for zero/negative input, else the smallest positive code
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100)
        {
            // Overflow: saturate at the largest code (0xFF)
            return -1;
        }
        // Re-bias the exponent so the code fits in a single byte
        return (byte) (smallfloat - ((63 - 15) << 3));
    }
    

    and the decoding of that byte to float is done as:

    public static float byte315ToFloat(byte b)
    {
        // Byte 0 is a special case: it decodes to exactly 0
        if (b == 0)
            return 0.0f;
        // Shift the code back into the float's exponent/mantissa position
        int bits = (b & 0xff) << (24 - 3);
        // Restore the exponent bias removed during encoding
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
    

    lengthNorm is calculated as 1 / sqrt( number of terms in field ). This is then encoded for storage using floatToByte315. For a field with 3 terms, we get:

    floatToByte315( 1/sqrt(3.0) ) = 120

    and for a field with 4 terms, we get:

    floatToByte315( 1/sqrt(4.0) ) = 120

    so both of them get decoded to:

    byte315ToFloat(120) = 0.5.
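    The collision can be checked directly; the sketch below copies the two SmallFloat methods quoted above into a hypothetical standalone class and runs the 3-term vs. 4-term comparison:

    ```java
    public class NormCollision {
        // Copied from Lucene's SmallFloat (see the listings above)
        static byte floatToByte315(float f) {
            int bits = Float.floatToRawIntBits(f);
            int smallfloat = bits >> (24 - 3);
            if (smallfloat <= ((63 - 15) << 3)) {
                return (bits <= 0) ? (byte) 0 : (byte) 1;
            }
            if (smallfloat >= ((63 - 15) << 3) + 0x100) {
                return -1;
            }
            return (byte) (smallfloat - ((63 - 15) << 3));
        }

        static float byte315ToFloat(byte b) {
            if (b == 0)
                return 0.0f;
            int bits = (b & 0xff) << (24 - 3);
            bits += (63 - 15) << 24;
            return Float.intBitsToFloat(bits);
        }

        public static void main(String[] args) {
            // lengthNorm for 3-term and 4-term fields
            byte n3 = floatToByte315((float) (1.0 / Math.sqrt(3.0))); // 120
            byte n4 = floatToByte315((float) (1.0 / Math.sqrt(4.0))); // 120
            System.out.println(n3 + " " + n4);      // prints "120 120"
            // Both decode to the same norm, so both fields score identically
            System.out.println(byte315ToFloat(n3)); // prints "0.5"
        }
    }
    ```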

    The doc also states this:

    The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.

    UPDATE: As of Solr 4.10, this implementation and corresponding statements are part of DefaultSimilarity.
