Java: Search in HashMap keys based on regex?

醉话见心 2020-12-05 01:12

I'm building a thesaurus using a HashMap to store the synonyms.

I'm trying to search through the words based on a regular expression: the method will have to take

6 Answers
  • 2020-12-05 01:42

    Regular expressions are case sensitive by default. You want:

    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
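
To show how that flag fits the original question, here is a minimal sketch that scans a map's keys with a case-insensitive pattern. The class name `KeySearch`, the method `matchingKeys`, and the sample entries are assumptions made for illustration; they do not come from the asker's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class KeySearch {
    // Returns every key that the regex matches in full, ignoring case.
    static List<String> matchingKeys(Map<String, ?> map, String regex) {
        // Compile once, outside the loop, and reuse the pattern for each key.
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String key : map.keySet()) {
            if (p.matcher(key).matches()) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the question.
        Map<String, List<String>> thesaurus = new HashMap<>();
        thesaurus.put("Cat", Arrays.asList("feline"));
        thesaurus.put("car", Arrays.asList("automobile"));
        thesaurus.put("dog", Arrays.asList("canine"));

        // Prints the keys starting with c or C; HashMap does not guarantee their order.
        System.out.println(matchingKeys(thesaurus, "c.*"));
    }
}
```

This still visits every key, so it is a linear scan; the hash lookup is not used for the pattern search itself.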
    
  • 2020-12-05 01:49

    It looks like you're using your regexes inappropriately. By default, "c" matches only a lowercase c, not an uppercase one.

    That said, I'd suggest you look into using an embedded database with full text search capabilities.

  • 2020-12-05 01:54

    But, hmm:

    (a) Why would you use a HashMap if you intend to always search it sequentially? That's a lot of wasted overhead to process the hash keys and all when you never use them. Surely a simple ArrayList or LinkedList would be a better idea.

    (b) What does this have to do with a thesaurus? Why would you search a thesaurus using regular expressions? If I want to know synonyms for, say, "cat", I would think that I would search for "cat", not "c.*".

    My first thought on how to build a thesaurus would be ... well, I guess the first question I'd ask is, "Is synonym an equivalence relationship?", i.e. if A is a synonym for B, does it follow that B is a synonym for A? And if A is a synonym for B and B is a synonym for C, then is A a synonym for C? Assuming the answers to these questions are "yes", then what we want to build is something that divides all the words in the language into sets of synonyms, so we then can map any word in each set to all the other words in that set. So what you need is a way to take any word, map it to some sort of nexus point, and then go from that nexus point to all of the words that map to it.

    This would be straightforward on a database: Just create a table with two columns, say "word" and "token", each with its own index. All synonyms map to the same token. The token can be anything as long as it's unique for any given set of synonyms, like a sequence number. Then search for the given word, find the associated token, and then get all the words with that token. For example we might create records with (big,1), (large,1), (gigantic,1), (cat,2), (feline,2), etc. Search for "big" and you get 1, then search for 1 and you get "big", "large", and "gigantic".

    I don't know any class in the built-in Java collections that does this. The easiest way I can think of is to build two co-ordinated hash tables: One that maps words to tokens, and another that maps tokens to an array of words. So table 1 might have big->1, large->1, gigantic->1, cat->2, feline->2, etc. Then table 2 maps 1->[big,large,gigantic], 2->[cat,feline], etc. You look up in the first table to map a word to a token, and in the second to map that token back to a list of words. It's clumsy because all the data is stored redundantly. Maybe there's a better solution, but I'm not getting it off the top of my head. (Well, it would be easy if we assume that we're going to sequentially search the entire list of words every time, but performance would suck as the list got big.)
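
The two co-ordinated tables described above could be sketched roughly as follows. The class name `TokenThesaurus` and its methods are hypothetical, and the sketch assumes each word is added exactly once, under a single token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenThesaurus {
    // Table 1: word -> token identifying its synonym group.
    private final Map<String, Integer> wordToToken = new HashMap<>();
    // Table 2: token -> all words in that group.
    private final Map<Integer, List<String>> tokenToWords = new HashMap<>();

    // Registers a word under a token; assumes the word was not added before.
    void add(String word, int token) {
        wordToToken.put(word, token);
        tokenToWords.computeIfAbsent(token, t -> new ArrayList<>()).add(word);
    }

    // Maps the word to its token, then the token back to the group's words.
    List<String> synonyms(String word) {
        Integer token = wordToToken.get(word);
        if (token == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(tokenToWords.get(token));
    }

    public static void main(String[] args) {
        TokenThesaurus t = new TokenThesaurus();
        t.add("big", 1);
        t.add("large", 1);
        t.add("gigantic", 1);
        t.add("cat", 2);
        t.add("feline", 2);
        System.out.println(t.synonyms("large")); // [big, large, gigantic]
    }
}
```

As in the database version, the result includes the word that was searched for; filter it out if only the other synonyms are wanted.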

  • 2020-12-05 01:58

    Is that the regular expression you're using?

    The Matcher.matches() method returns true only if the entire input sequence matches the expression (per the Javadoc), so you would need to use "c.*" in this case, not "c*", and you would also need to match case-insensitively.
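
A small sketch of the difference, using made-up sample words and a hypothetical helper named `fullMatch`:

```java
import java.util.regex.Pattern;

class MatchesDemo {
    // True only if the regex matches the entire word, ignoring case.
    static boolean fullMatch(String regex, String word) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("c*", "cat"));  // false: "c*" means zero or more c's, nothing else
        System.out.println(fullMatch("c*", "ccc"));  // true
        System.out.println(fullMatch("c.*", "cat")); // true: a c followed by anything
        System.out.println(fullMatch("c.*", "Cat")); // true, thanks to CASE_INSENSITIVE
    }
}
```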

  • 2020-12-05 01:59

    Responding to Jay of "But Hmm" above,

    (I'd add a comment but don't have the rep.)

    Searching it sequentially is doing it the slow way. Doing it with regular expressions is to descend into madness. Doing it with a database is a programming cop-out. Sure, if your data set were massive that might be required, but remember: "for this assignment we're asked to use Java Collection Map". We should be figuring out the proper way to use this Java collection.

    The reason it isn't obvious is because it isn't one collection. It's two. But it isn't two maps. It’s not an ArrayList. What’s missing is a Set. It's a map to sets of synonyms.

    Set<String> will let you build your lists of synonyms. You can make as many as you like. Two sets of synonyms would make a good example. It's a Set not an ArrayList because you don't want duplicate words.

    Map<String, Set<String>> will let you quickly find your way from any word to its synonym set.

    Build your sets. Then build the map. Write a helper method to build the map that takes a map and a set.

    addSet(Map<String, Set<String>> map, Set<String> newSet)

    This method just loops newSet and adds the strings to the map as keys and the reference to newSet as the value. You’d call addSet once for every set.

    Now that your data structure is built we should be able to find stuff. To make that a little more robust, remember to clean your search key before you search. Use trim() to get rid of meaningless whitespace. Use toLowerCase() to get rid of meaningless capitalization. You should have done both of these on the synonym data before (or while) building the sets. Do that and who needs regular expressions for this? This way is much faster and, more importantly, safer. Regular expressions are very powerful but can be a nightmare to debug when they go wrong. Don't use them just because you think they're cool.
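
Put together, the approach might look something like this sketch. Only the `addSet` signature comes from the answer above; the class name `SetThesaurus`, the helpers `clean` and `synonyms`, the use of LinkedHashSet (chosen here just to get a predictable print order), and the sample words are all assumptions. It also assumes the words were already cleaned when the sets were built.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SetThesaurus {
    // Removes surrounding whitespace and capitalization from a search key.
    static String clean(String word) {
        return word.trim().toLowerCase(Locale.ROOT);
    }

    // Maps every word in newSet to the same shared set instance.
    static void addSet(Map<String, Set<String>> map, Set<String> newSet) {
        for (String word : newSet) {
            map.put(word, newSet);
        }
    }

    // Cleans the key, then does a single hash lookup; empty set if unknown.
    static Set<String> synonyms(Map<String, Set<String>> map, String word) {
        return map.getOrDefault(clean(word), Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> map = new HashMap<>();
        // Sets are built from already-cleaned (trimmed, lower-case) words.
        addSet(map, new LinkedHashSet<>(Arrays.asList("big", "large", "gigantic")));
        addSet(map, new LinkedHashSet<>(Arrays.asList("cat", "feline")));

        System.out.println(synonyms(map, "  Large ")); // [big, large, gigantic]
    }
}
```

Because every key in a group points at the same Set object, the words are stored once per set rather than copied per key.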

  • 2020-12-05 02:00

    You need to specify case insensitivity: Pattern.compile("c", Pattern.CASE_INSENSITIVE). To find a word with a c in it you need to use matcher.find(). Matcher.matches() tries to match the whole string.
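
For illustration, a short sketch contrasting the two calls on a made-up word (the class name `FindDemo` and the helper `containsMatch` are hypothetical):

```java
import java.util.regex.Pattern;

class FindDemo {
    // True if the regex matches anywhere inside the word, ignoring case.
    static boolean containsMatch(String regex, String word) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(word).find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("c", Pattern.CASE_INSENSITIVE);
        System.out.println(containsMatch("c", "Arctic"));  // true: a c occurs inside the word
        System.out.println(p.matcher("Arctic").matches()); // false: the whole word is not just "c"
        System.out.println(p.matcher("C").matches());      // true: the whole input is a single c
    }
}
```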
