问题
I have already asked a similar question at Calculating Word Proximity in an inverted Index. However i felt that the question was too general and not refined enough. So here goes.
I have a List which contains the location of tokens in a document. for each token it goes as
public List<int> hitLocation;
Lets say the the document is
Java programming language has a name similar to java island in Indonesia however
local language in java bears no resemblance to the programming language called java.
and the query is
java island language
So Say i lock on to the Java HitList and attempt to directly calculate the distance between the Java HisList, Island HitList and Language Hitlist.
Now the first problem is that there are 4 java tokens occurrences in the sentence. Which one do i select. Assuming i select the first one.
I go onto the island token list and after comparing find it that it adjacent to the second occurrence of java. So i change my selection and lock onto the second occurrence of java.
Proceeding to the third token language i find that it situated at quite a distance from our selection however i find it that it is quite near the first java occurrence.
So you see the dilemma here if now again revert back to the original selection i.e the first occurrence of java the distance to second token "island" increases and if i stay with my current selection the sheer distance of the second occurrence of the token "language" will make relevance busted.
Previously there was the suggestion of dot product however i am at loss on how to proceed forward with that option.
Any other solution would also be welcomed.
I Understand that this question is quite detailed. However i have searched long and hard and haven't found any question like this on this topic.
I feel if this question is answered it will be a great addition to the community and will make anybody who is designing anything related to relevancy quite happy.
Thank You.
回答1:
You seem to be using the hit lists a little differently then how they are intended to be used (at least given my understanding).
Typically people compare hit lists returned by different documents. This is how they rank one document as being "more relevant" than a different document.
That said, if you want to find all locations of some multi-word phrase like "java island" given the locations of the words "java" and "island" you would...
- Get a list of locations for "java"
- Get a list of locations for "island"
- Sort both lists
- Iterate through both lists at the same time. You start be getting the first entry of both lists. Now test this pair of entries. I.E., if these entries are "off by one" you have found one instance of "java island" (or perhaps "island java"). Get the next entry in the list that currently shows the minimum value. Test this new pair of entries. Repeat.
BTW -- The dot product is more useful when comparing 2 different documents.
回答2:
Well, since you explicitly ask about the dot product suggestion, i'll try to explain a little more formally what I had in mind. Keep in mind that it's not very efficient as it might convert the complexity from basing on the lengths of the hitlists, into something based on the length of the text (unless there's some trick to cut that).
My initial thought was to convert each hitlist into a series of binary value at the text length, high where there's a hit and low otherwise.
for e.g. java would look
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
But since you want proximity, convert each occurrence into a pyramid, for e.g. -
3 2 1 0 0 0 1 2 3 2 1 0 0 0 1 2 3 2 0 0 0 0 0 1 2 3
Same way for island -
0 0 0 0 0 0 0 1 2 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Now a dot product would give you some sort of proximity "score" between the two vectors, since it accumulates all the locations where two words are close (the closer the better). Java and island can be said to have a mutual score of 16. For a higher threshold you could stretch the pyramid further, or play with the shape.
Now, here you add another suggestion that this method isn't very suited for, you also want to catch the exact location of highest proximity, this isn't very well defined IMHO, what if word1 matches word2 (at some level) in position1, but word2 matches word3 at the same level in position2 - what location would you want?
Also, keep in mind that this method is O(text_length * words^2), that might be good in some cases, but very bad for others (if you're searching the bible for e.g.)
回答3:
Ok so guys i realize that i am answering my own question and a bit late.
So to all those people trying to calculate word proximity starting from inverted index should take a look at this link http://www.ardendertat.com/2011/05/31/how-to-implement-a-search-engine-part-2-query-index/
来源:https://stackoverflow.com/questions/19034271/computing-the-dot-product-for-calculating-proximity