Speed up regex string search in MongoDB

半阙折子戏 · asked 2021-01-01 20:02

I'm trying to use MongoDB to implement a natural language dictionary. I have a collection of lexemes, each of which has a number of wordforms as subdocuments. This is what

2 Answers
  • 2021-01-01 20:18

    One possibility would be to store all the variants that you think might be useful as an array element — I'm not sure whether that is possible in your case though!

        {
            "number" : "pl",
            "surface_form" : "skrejjen",
            "surface_forms" : [ "skrej", "skre" ],
            "phonetic" : "'skrɛjjɛn",
            "pattern" : "CCCVCCVC"
        }
    

    I would probably also suggest not storing 1000 word forms with each word, but turning this around to have smaller documents. The smaller your documents are, the less MongoDB has to read into memory for each search (as long as the search conditions don't require a full scan, of course):

    {
        "word": {
            "pos" : "N",
            "lemma" : "skrun",
            "gloss" : "screw"
        },
        "form" : {
            "number" : "sg",
            "surface_form" : "skrun",
            "phonetic" : "ˈskruːn",
            "gender" : "m"
        },
        "source" : "Mayer2013"
    }
    
    {
        "word": {
            "pos" : "N",
            "lemma" : "skrun",
            "gloss" : "screw"
        },
        "form" : {
            "number" : "pl",
            "surface_form" : "skrejjen",
            "phonetic" : "'skrɛjjɛn",
            "pattern" : "CCCVCCVC"
        },
        "source" : "Mayer2013"
    }
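    The size argument above can be illustrated with a rough back-of-the-envelope comparison (synthetic documents, counts chosen only to show the orders of magnitude involved): matching one small per-form document reads far less data than pulling in a lexeme with 1000 embedded forms.

    ```javascript
    // One document per word form (the refactored layout):
    const form = {
      word: { pos: "N", lemma: "skrun", gloss: "screw" },
      form: { number: "sg", surface_form: "skrun", phonetic: "ˈskruːn", gender: "m" },
      source: "Mayer2013",
    };

    // One lexeme with 1000 embedded word forms (the original layout),
    // with synthetic form entries just to get a plausible size:
    const embeddedLexeme = {
      pos: "N",
      lemma: "skrun",
      gloss: "screw",
      wordforms: Array.from({ length: 1000 }, (_, i) => ({
        surface_form: `form${i}`,
        phonetic: `f${i}`,
      })),
    };

    const small = JSON.stringify(form).length;
    const big = JSON.stringify(embeddedLexeme).length;
    // A single form document is a tiny fraction of the embedded lexeme:
    console.log(small < big / 100);
    ```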
    

    I also doubt that MySQL would perform better here with searches for random word forms, as it would have to do a full table scan just as MongoDB does. The only thing that could help there is a query cache — but that is something you could quite easily build into your search UI/API at the application level.
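    A minimal sketch of such an application-side query cache — `runQuery` stands in for the actual database call, and all names here are hypothetical:

    ```javascript
    // Memoize results per query string so repeated searches skip the database.
    function makeQueryCache(runQuery) {
      const cache = new Map();
      return function cachedQuery(queryString) {
        if (!cache.has(queryString)) {
          cache.set(queryString, runQuery(queryString));
        }
        return cache.get(queryString);
      };
    }

    // Example with a fake backend that counts how often it is actually hit:
    let hits = 0;
    const fakeBackend = (q) => { hits += 1; return `results for ${q}`; };
    const search = makeQueryCache(fakeBackend);

    search("skru");
    search("skru");    // served from cache; backend is not called again
    console.log(hits); // 1
    ```

    In a real application you would also want some eviction policy (LRU, TTL) so the cache doesn't grow without bound.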

  • 2021-01-01 20:37

    As suggested by Derick, I refactored the data in my database so that I have "wordforms" as a collection rather than as subdocuments under "lexemes". The results were in fact better! Here are some speed comparisons. The last example, using hint, intentionally bypasses the index on surface_form; in the old schema this was actually faster.

    Old schema (see original question)

    Query                                                              Avg. Time
    db.lexemes.find({"wordforms.surface_form":"skrun"})                0s
    db.lexemes.find({"wordforms.surface_form":/^skr/})                 1.0s
    db.lexemes.find({"wordforms.surface_form":/skru/})                 > 3mins !
    db.lexemes.find({"wordforms.surface_form":/skru/}).hint('_id_')    2.8s
    

    New schema (see Derick's answer)

    Query                                                              Avg. Time
    db.wordforms.find({"surface_form":"skrun"})                        0s
    db.wordforms.find({"surface_form":/^skr/})                         0.001s
    db.wordforms.find({"surface_form":/skru/})                         1.4s
    db.wordforms.find({"surface_form":/skru/}).hint('_id_')            3.0s
    

    For me this is pretty good evidence that the refactored schema makes searching faster, and is worth the redundant data (or the extra join required).
