I'm trying to use MongoDB to implement a natural language dictionary. I have a collection of lexemes, each of which has a number of wordforms as subdocuments. This is what
One possibility would be to store all the variants that you're thinking might be useful as an array element — not sure whether that might be possible though!
{
"number" : "pl",
"surface_form" : "skrejjen",
"surface_forms: [ "skrej", "skre" ],
"phonetic" : "'skrɛjjɛn",
"pattern" : "CCCVCCVC"
}
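If you do store the variants as an array, a multikey index makes exact lookups on any variant cheap. A shell sketch (untested here; assumes the array sits under the wordforms subdocuments as in your schema, and createIndex is spelled ensureIndex in older shells):

```javascript
// An index on an array field is multikey: every element is indexed,
// so an exact match on any variant avoids a full collection scan.
db.lexemes.createIndex({ "wordforms.surface_forms": 1 })
db.lexemes.find({ "wordforms.surface_forms": "skrej" })
```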
I would probably also suggest to not store 1000 word forms with each word, but turn this around to have smaller documents. The smaller your documents are, the less MongoDB would have to read into memory for each search (as long as the search conditions don't require a full scan of course):
{
"word": {
"pos" : "N",
"lemma" : "skrun",
"gloss" : "screw",
},
"form" : {
"number" : "sg",
"surface_form" : "skrun",
"phonetic" : "ˈskruːn",
"gender" : "m"
},
"source" : "Mayer2013"
}
{
"word": {
"pos" : "N",
"lemma" : "skrun",
"gloss" : "screw",
},
"form" : {
"number" : "pl",
"surface_form" : "skrejjen",
"phonetic" : "'skrɛjjɛn",
"pattern" : "CCCVCCVC"
},
"source" : "Mayer2013"
}
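With one document per wordform, a single index on the flat field serves both exact matches and anchored (^-prefixed) regex queries as index range scans. A shell sketch (untested here):

```javascript
// One plain index covers exact and prefix lookups on the new collection.
db.wordforms.createIndex({ surface_form: 1 })
db.wordforms.find({ surface_form: "skrun" })
db.wordforms.find({ surface_form: /^skr/ })
```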
I also doubt that MySQL would perform better here for searches on arbitrary word forms, as it would have to do a full table scan just as MongoDB does. The only thing that could help is a query cache, but that is something you could quite easily build into the search UI/API of your application.
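Such a query cache can be a few lines in the application layer. A minimal sketch in plain JavaScript (the function names are hypothetical, and the string result is a stand-in for a real db.wordforms.find(...) call):

```javascript
// Wrap the search function so repeated terms are served from memory.
function cached(queryFn) {
  const cache = new Map();
  return function (term) {
    if (!cache.has(term)) {
      cache.set(term, queryFn(term)); // first lookup pays the full cost
    }
    return cache.get(term);           // repeats never touch the database
  };
}

// Usage: wrap whatever function actually runs the MongoDB query.
let calls = 0;
const search = cached(function (term) {
  calls += 1;                         // stand-in for db.wordforms.find(...)
  return "results for " + term;
});
search("skrun");
search("skrun");                      // second call does not hit the database
```

Invalidate the cache (or cap its size) whenever the dictionary data changes.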
As suggested by Derick, I refactored the data in my database such that I have "wordforms" as a collection rather than as subdocuments under "lexemes".
The results were in fact better!
Here are some speed comparisons. The last example in each table uses hint('_id_') to intentionally bypass the index on surface_form; in the old schema that was actually faster than letting MongoDB use the index, since an unanchored regex like /skru/ cannot use the index as a range scan, and walking the whole index can be slower than scanning the collection.
Old schema (see original question):

    Query                                                            Avg. time
    db.lexemes.find({"wordforms.surface_form":"skrun"})              0s
    db.lexemes.find({"wordforms.surface_form":/^skr/})               1.0s
    db.lexemes.find({"wordforms.surface_form":/skru/})               > 3 mins!
    db.lexemes.find({"wordforms.surface_form":/skru/}).hint('_id_')  2.8s
New schema (see Derick's answer):

    Query                                                    Avg. time
    db.wordforms.find({"surface_form":"skrun"})              0s
    db.wordforms.find({"surface_form":/^skr/})               0.001s
    db.wordforms.find({"surface_form":/skru/})               1.4s
    db.wordforms.find({"surface_form":/skru/}).hint('_id_')  3.0s
For me this is pretty good evidence that the refactored schema makes searching faster, and is worth the redundant data (and the extra application-side join).
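That "extra join" amounts to a second indexed lookup done in the application. A shell sketch (untested here; assumes an additional index on word.lemma):

```javascript
// Find one wordform, then fetch its sibling forms via the embedded lemma.
var form = db.wordforms.findOne({ surface_form: "skrejjen" })
var allForms = db.wordforms.find({ "word.lemma": form.word.lemma })
```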