Inverted Index in Python not returning desired results

Posted by 孤街浪徒 on 2019-12-13 09:29:29

Question


I'm having trouble getting proper results from an inverted index in Python. I load a list of strings into the variable 'strlist', and my inverse index function should loop over the strings and return each word together with where it occurs. Here is what I have so far:

def inverseIndex(strlist):
  d={}
  for x in range(len(strlist)):
    for y in strlist[x].split():
      for index, word in set(enumerate([y])):
        if word in d:
          d=d.update(index)
        else:
          d._setitem_(index,word)
        break
      break
    break
  return d

Now when I run inverseIndex(strlist)

all it returns is {0:'This'}, whereas what I want is a dictionary mapping all the words in 'strlist' to the set d.

Is my initial approach wrong? Am I tripping up in the if/else? Any help pointing me in the right direction is greatly appreciated.


Answer 1:


Based on what you're saying, I think you're trying to get some data like this:

strlist = ["hello world", "foo bar", "red cat"]
data_wanted = {
    "foo" : 1,
    "hello" : 0,
    "cat" : 2,
    "world" : 0,
    "red" : 2,
    "bar" : 1
}

So what you should be doing is adding the words as keys to a dictionary, with each value being the index of the string in strlist in which that word is located.

def locateWords(strlist):
    d = {}
    for i, substr in enumerate(strlist):   # gives you the index and the item itself
        for word in substr.split():
            d[word] = i   # a later occurrence overwrites an earlier one
    return d

If the word occurs in more than one string in strlist, you should change the code to the following:

def locateWords(strlist):
    d = {}
    for i, substr in enumerate(strlist):
        for word in substr.split():
            if word not in d:
                d[word] = [i]        # first time this word is seen
            else:
                d[word].append(i)    # record every further occurrence
    return d

This changes the values to lists, which contain the indices of the substrings in strlist which contain that word.
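As a variation on the list-based version, here is a minimal sketch that collects the indices in sets via collections.defaultdict, so a word repeated within one string is recorded only once. The function name build_inverted_index and the sample data are illustrative assumptions, not part of the original question.

```python
from collections import defaultdict


def build_inverted_index(strlist):
    """Map each word to the sorted list of indices of the strings containing it."""
    index = defaultdict(set)
    for i, text in enumerate(strlist):
        for word in text.split():
            index[word].add(i)  # a set ignores repeats within one string
    # Convert to a plain dict with sorted lists for predictable output
    return {word: sorted(positions) for word, positions in index.items()}


sample = ["hello world", "foo bar hello", "red cat cat"]
result = build_inverted_index(sample)
print(result["hello"])  # [0, 1]
print(result["cat"])    # [2]
```

Whether lists or sets fit better depends on whether repeated occurrences inside a single string should be counted.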

Some of your code's problems explained

  1. {} is not a set, it's a dictionary.
  2. break forces a loop to terminate immediately - you didn't want to end the loop early because you still had data to process.
  3. d.update(index) will give you a TypeError: 'int' object is not iterable. This method expects a mapping or an iterable of key-value pairs and merges those entries into the dictionary. Normally you would use a list of tuples for this: [("foo",1), ("hello",0)].
  4. You don't normally want to call d.__setitem__ directly (and you misspelled it with single underscores anyway). You'd just use d[key] = value.
  5. You can iterate using a "for each" style loop instead, like my code above shows. Looping over the range means you are looping over the indices. (Not exactly a problem, but it could lead to extra bugs if you're not careful to use the indices properly).
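Points 1 and 3 in the list above can be checked with a short sketch. The variable names here are made up for illustration.

```python
empty = {}
print(type(empty).__name__)   # dict
print(type(set()).__name__)   # set (an empty set needs set(), not {})

d = {}
d.update([("foo", 1), ("hello", 0)])  # iterable of (key, value) pairs
print(d)  # {'foo': 1, 'hello': 0}

error_type = None
try:
    d.update(5)  # an int is neither a mapping nor an iterable of pairs
except TypeError as exc:
    error_type = type(exc).__name__
    print(error_type, exc)
```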

It looks like you are coming from another programming language in which braces indicate sets and there is a keyword which ends control blocks (like if, fi). It's easy to confuse syntax when you're first starting - but if you run into trouble running the code, look at the exceptions you get and search them on the web!

P.S. I'm not sure why you wanted a set - if there are duplicates, you probably want to know all of their locations, not just the first or the last one or anything in between. Just my $0.02.




Answer 2:


break is not an end-of-block marker; it means "if you hit this line of code, exit the loop immediately". You probably don't want all those break statements.

I'm not sure what you think the update method does.

d.update(index)

will try to treat index as a dict or a sequence of key-value pairs and add all the mappings in index to d. Since index is a number, this doesn't seem to be what you expect update to do. Also, update returns None, which is the Python equivalent of not returning anything, so you probably don't want to assign its value to d.
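A small sketch of why assigning the result of update back to d loses the dictionary. The keys used are arbitrary examples.

```python
d = {"This": 0}
result = d.update({"is": 0})  # update mutates d in place
print(result)                 # None
print(d)                      # {'This': 0, 'is': 0}

# Reassigning d to the return value throws the dictionary away
d = d.update({"a": 0})
print(d is None)              # True
```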

I'm not sure what you expect

for index, word in set(enumerate([y])):

to do. Let's go over what it does. [y] creates a 1-element list whose only element is y. enumerate([y]) will then return an iterator yielding a single element, the tuple (0, y). set(enumerate([y])) will then take all the items from that iterator (so just one item) and make a set containing those items. Finally, for index, word in set(enumerate([y])): will iterate over that one-item set, executing a single loop iteration with index == 0 and word == y. This is probably not what you were trying to do.
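The walkthrough above can be reproduced directly. The sample sentence is an assumption, chosen to match the 'This' seen in the question's output.

```python
y = "This"
pairs = set(enumerate([y]))
print(pairs)  # {(0, 'This')}

for index, word in pairs:
    print(index, word)  # 0 This

# For comparison, enumerating the split words numbers each word in turn
numbered = list(enumerate("This is a test".split()))
print(numbered)  # [(0, 'This'), (1, 'is'), (2, 'a'), (3, 'test')]
```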

The __setitem__ special method (which has two underscores on each side) is called by Python to implement element assignment.

d.__setitem__(index, word)

is better written as

d[index] = word
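To illustrate, both spellings store the same entry, while the single-underscore name from the question does not exist on dict. This is only a sketch with made-up keys.

```python
d = {}
d.__setitem__("hello", 0)  # works, but is rarely written by hand
d["world"] = 0             # the usual spelling of the same operation
print(d)  # {'hello': 0, 'world': 0}

misspelled_failed = False
try:
    d._setitem_("foo", 1)  # single underscores: no such method
except AttributeError:
    misspelled_failed = True
print(misspelled_failed)  # True
```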

If you want to iterate over strlist, then instead of using range(len(strlist)), you can iterate over strlist directly.

  for x in range(len(strlist)):
    for y in strlist[x].split():

is equivalent to

  for string in strlist:
    for y in string.split():

since looping over strlist will give the items of strlist.
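A quick check of that equivalence, using an assumed two-string sample list. When the position is also needed, as it is for an inverted index, enumerate provides both the index and the item.

```python
strlist = ["hello world", "foo bar"]

# Index-based loop and direct loop visit the same words in the same order
by_index = [w for x in range(len(strlist)) for w in strlist[x].split()]
direct = [w for string in strlist for w in string.split()]
print(by_index == direct)  # True

# enumerate yields (index, item) pairs
numbered = [(i, string.split()) for i, string in enumerate(strlist)]
print(numbered)  # [(0, ['hello', 'world']), (1, ['foo', 'bar'])]
```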

I hope that helps.



Source: https://stackoverflow.com/questions/17554977/inverted-index-in-python-not-returning-desired-results
