Most efficient way to find partial string matches in a large file of strings (Python)

眼角桃花 2021-01-12 19:46

I downloaded the Wikipedia article titles file, which contains the name of every Wikipedia article. I need to search it for all the article titles that could be a match.

3 Answers
  • 2021-01-12 20:04

    I'd suggest you put your data into an sqlite database, and use the SQL 'like' operator for your searches.
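
    A minimal sketch of this approach, using Python's standard `sqlite3` module. The table name `titles`, the helper `search_like`, and the sample data are made up for illustration; a real run would load the downloaded titles file instead. Note that a `LIKE` pattern starting with `%` generally cannot use an ordinary index, so each query scans the whole table.

```python
import sqlite3

# Hypothetical sample data standing in for the Wikipedia titles file.
sample_titles = [
    "Ice_Hockey",
    "Hockey_stick",
    "Field_hockey",
    "Smoking",
    "Python_(programming_language)",
]

# In-memory database for the sketch; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT)")
conn.executemany(
    "INSERT INTO titles (title) VALUES (?)",
    [(t,) for t in sample_titles],
)
conn.commit()


def search_like(conn, query):
    """Return titles that contain `query` as a substring.

    SQLite's LIKE is case-insensitive for ASCII characters by default.
    """
    # Escape LIKE wildcards so "%" and "_" in the query match literally.
    escaped = (
        query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT title FROM titles WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        (f"%{escaped}%",),
    )
    return [row[0] for row in rows]


matches = search_like(conn, "hock")
```

    Escaping matters here because article titles use `_` in place of spaces, and an unescaped `_` in a `LIKE` pattern matches any single character.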

  • 2021-01-12 20:05

    If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.

    This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".
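
    The dictionary idea described above could be sketched roughly as follows. The function names `build_word_index` and `lookup` are hypothetical, and the sketch assumes that words in a title are separated by underscores or other non-alphanumeric characters. It matches whole words only and does no stemming.

```python
import re
from collections import defaultdict


def build_word_index(titles):
    """Map each lowercase word to the list of titles that contain it."""
    index = defaultdict(list)
    for title in titles:
        # Split on runs of non-word characters and underscores; drop empties.
        words = {w for w in re.split(r"[\W_]+", title.lower()) if w}
        for word in words:
            index[word].append(title)
    return index


def lookup(index, query_word):
    """Return all titles containing the query word, ignoring case."""
    return index.get(query_word.lower(), [])


# Hypothetical sample data standing in for the Wikipedia titles file.
sample_titles = ["Ice_Hockey", "Hockey_stick", "Smoking"]
word_index = build_word_index(sample_titles)
hockey_titles = lookup(word_index, "HOCKEY")
```

    With this sample data, looking up "smoke" returns an empty list even though "Smoking" is present, which is the similar-word limitation mentioned above.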

  • 2021-01-12 20:08

    Greg's answer is good if you want to match on individual words. If you want to match on substrings you'll need something a bit more complicated, like a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_Hockey" when someone searched for "hock".
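
    A full suffix tree is fairly involved to implement, so the sketch below swaps in a simpler relative: a sorted list of all title suffixes (a naive suffix array) searched by binary search. This is not a suffix tree, and the function names are hypothetical. It stores every suffix explicitly, so it only illustrates the idea on small inputs; a data set the size of the Wikipedia titles file would need a more compact structure.

```python
def build_suffix_list(titles):
    """Return a sorted list of (lowercase suffix, title index) pairs."""
    entries = []
    for i, title in enumerate(titles):
        lowered = title.lower()
        for start in range(len(lowered)):
            entries.append((lowered[start:], i))
    entries.sort()
    return entries


def search_substring(entries, titles, query):
    """Return titles containing `query` as a substring, ignoring case."""
    q = query.lower()
    # Binary search for the first suffix that is >= the query.
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid][0] < q:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that start with the query are contiguous from here.
    found = set()
    while lo < len(entries) and entries[lo][0].startswith(q):
        found.add(entries[lo][1])
        lo += 1
    # Report matches in their original order.
    return [titles[i] for i in sorted(found)]


# Hypothetical sample data standing in for the Wikipedia titles file.
sample_titles = ["Ice_Hockey", "Hockey_stick", "Smoking"]
suffix_entries = build_suffix_list(sample_titles)
hock_matches = search_substring(suffix_entries, sample_titles, "hock")
```

    The key property is the same one a suffix tree exploits: every substring of a title is a prefix of one of its suffixes, so a substring query becomes a prefix lookup over the sorted suffixes.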
