Python: how to sort a list of strings by substring relevance?

[愿得一人] 2021-01-16 06:38

I have a list of strings, for example:

["foo bar SOME baz TEXT bob",
"SOME foo bar baz bob TEXT",
"SOME foo TEXT",
"         


        
3 Answers
  • 2021-01-16 06:58

    See your friendly neighborhood sorting tutorial. You'll need to sort with a key function. Here's a trivial one to give you the idea: it measures the distance in characters between the two words and uses that as the ranking metric (smaller is better).

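    sentences = ["foo bar SOME baz TEXT bob",
                 "SOME foo bar baz bob TEXT",
                 "SOME foo TEXT",
                 "foo bar SOME TEXT baz",
                 "SOME TEXT"]
    
    def match_score(sentence):
        # Distance in characters between the starts of the two keywords;
        # a smaller distance means a closer match.
        some_pos = sentence.find("SOME")
        text_pos = sentence.find("TEXT")
        return abs(text_pos - some_pos)
    
    # list.sort is stable, so strings with equal scores keep their original order.
    sentences.sort(key=match_score)
    
    for item in sentences:
        print(item)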
    

    Output:

    foo bar SOME TEXT baz
    SOME TEXT
    foo bar SOME baz TEXT bob
    SOME foo TEXT
    SOME foo bar baz bob TEXT
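
One caveat with the distance metric: `str.find` returns -1 when the substring is absent, so a string missing one keyword can still get a small score (for example, "only TEXT" scores 6). A minimal sketch of one way to handle this, assuming strings without both keywords should sort last; the `first` and `second` parameters and the extra sample strings are illustrative, not from the question:

```python
import math

def match_score(sentence, first="SOME", second="TEXT"):
    """Character distance between the two keywords, or infinity if either is missing."""
    first_pos = sentence.find(first)
    second_pos = sentence.find(second)
    if first_pos == -1 or second_pos == -1:
        # Assumption: a string without both keywords is the worst possible match.
        return math.inf
    return abs(second_pos - first_pos)

# Hypothetical sample data, including strings that lack a keyword.
sentences = ["SOME foo TEXT", "no keywords here", "SOME TEXT", "only TEXT"]
ranked = sorted(sentences, key=match_score)
print(ranked)
```

Because the sort is stable, strings that lack a keyword keep their original relative order at the end of the list.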
    
  • 2021-01-16 07:05

    Here is my take on it.

    l = ["foo bar SOME baz TEXT bob",
         "SOME foo bar baz bob TEXT",
         "SOME foo TEXT",
         "foo bar SOME TEXT baz",
         "SOME TEXT"]
    
    l.sort(key=lambda x: (x.find("SOME")-x.find("TEXT"))*0.9-0.1*x.find("SOME"), reverse=True)
    
    print(l)
    

    OUTPUT:

    ['SOME TEXT', 'foo bar SOME TEXT baz', 'SOME foo TEXT', 'foo bar SOME baz TEXT bob', 'SOME foo bar baz bob TEXT']
    

    What this does is sort the list with a major weight (0.9) on the distance between "SOME" and "TEXT" and a minor weight (0.1) on the position of "SOME" in the string.

    Another, longer way would be to first group the strings by the distance between "SOME" and "TEXT", and then sort each group by the position of "SOME".
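
The group-then-sort idea can also be expressed with a tuple key, since Python compares tuples element by element. This is a sketch rather than the answerer's code; `rank_key` is a name made up here, and it assumes every string contains both keywords:

```python
def rank_key(s):
    some_pos = s.find("SOME")
    text_pos = s.find("TEXT")
    # Primary criterion: distance between the keywords.
    # Tie-breaker: how early "SOME" appears in the string.
    return (abs(text_pos - some_pos), some_pos)

strings = ["foo bar SOME baz TEXT bob",
           "SOME foo bar baz bob TEXT",
           "SOME foo TEXT",
           "foo bar SOME TEXT baz",
           "SOME TEXT"]

ranked = sorted(strings, key=rank_key)
print(ranked)
```

For this sample list the tuple key gives the same order as the weighted formula, without having to tune the 0.9 and 0.1 weights.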

  • 2021-01-16 07:09

    You can use difflib.SequenceMatcher to achieve something very similar to your desired output:

    >>> import difflib
    >>> l = ["foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT", "SOME foo TEXT", "foo bar SOME TEXT baz", "SOME TEXT"]
    >>> sorted(l, key=lambda z: difflib.SequenceMatcher(None, z, "SOME TEXT").ratio(), reverse=True)
    ['SOME TEXT', 'SOME foo TEXT', 'foo bar SOME TEXT baz', 'foo bar SOME baz TEXT bob', 'SOME foo bar baz bob TEXT']
    

    In case it isn't obvious, the only difference from your desired output is that the two elements "foo bar SOME TEXT baz" and "SOME foo TEXT" are swapped.
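
To see why those two elements swap, it may help to print the scores. `ratio()` is documented as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings, so extra words lower the score even when "SOME TEXT" appears intact. A small sketch; the `similarity` helper is a name introduced here for illustration:

```python
import difflib

def similarity(s, query="SOME TEXT"):
    # ratio() returns a float in [0, 1]; 1.0 means the two strings are identical.
    return difflib.SequenceMatcher(None, s, query).ratio()

for s in ("SOME TEXT", "SOME foo TEXT", "foo bar SOME TEXT baz"):
    print(f"{similarity(s):.3f}  {s}")
```

The shorter "SOME foo TEXT" scores higher than "foo bar SOME TEXT baz" because the same matched characters are divided by a smaller total length.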
