fuzzy-comparison

How can I recognize slightly modified images?

Posted by 前提是你 on 2019-11-28 17:34:41
I have a very large database of jpeg images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Duplicate images are two images that have many (around half) of their pixels with identical values and the rest are off by about +/- 3 in their R/G/B values. The images are identical to the naked eye. It's the kind of difference you'd get from re-compressing a jpeg. I already have a foolproof way to detect if two images are identical: I sum the delta-brightness over all the pixels and compare to a threshold. This method has proven 100% accurate but doing 1 photo
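A common way to make this scale (not described in the question itself) is a perceptual hash: shrink each image until per-pixel noise washes out, then compare small fingerprints instead of whole images. A minimal Python sketch, assuming Pillow is installed and using hypothetical file paths:

    from PIL import Image

    def average_hash(path, hash_size=8):
        # Downscale to a tiny grayscale thumbnail; +/-3 noise in R/G/B
        # values almost never flips a bit after averaging.
        img = Image.open(path).convert("L").resize((hash_size, hash_size))
        pixels = list(img.getdata())
        avg = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p > avg)
        return bits

    def hamming(h1, h2):
        # Count of differing bits between two fingerprints.
        return bin(h1 ^ h2).count("1")

    # Near-duplicates differ in only a few bits; the threshold needs tuning.
    # is_dup = hamming(average_hash("a.jpg"), average_hash("b.jpg")) <= 5

Each image is hashed once, so finding candidate pairs among 2 million photos reduces to comparing 64-bit integers rather than full pixel scans.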

Fuzzy Regular Expressions

Posted by 旧街凉风 on 2019-11-28 16:30:27
In my work I have, with great results, used approximate string matching algorithms such as Damerau–Levenshtein distance to make my code less vulnerable to spelling mistakes. Now I have a need to match strings against simple regular expressions such as TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0, while T Schedule for 10. Jan should return 2. This could be done by generating all strings in the regex (in this case 100x12) and finding the best match, but that doesn't seem practical. Do you have any ideas how to do this effectively? Thomas Ahle
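For what it's worth, the third-party regex module (a drop-in replacement for Python's re) supports approximate matching directly. A minimal sketch, with the month alternation shortened to the three alternatives shown in the question:

    import regex  # third-party: pip install regex

    # {e} permits edit errors; BESTMATCH makes the engine minimize them.
    pattern = r"(?:TV Schedule for \d\d (Jan|Feb|Mar)){e}"

    m = regex.fullmatch(pattern, "T Schedule for 10. Jan", regex.BESTMATCH)
    subs, ins, dels = m.fuzzy_counts
    print(subs + ins + dels)  # total edits, here 2 (missing "V", extra ".")

An exact string such as TV Schedule for 10 Jan yields a total of 0, matching the scoring the question asks for, without enumerating every string the regex can generate.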

Find Match of two data frames and rewrite the answer as data frame

Posted by 耗尽温柔 on 2019-11-28 02:35:42
Question: I have two data frames which are cleaned and merged into a single csv file. The data frames look like this:

Source: chang chun petrochemical, chang chun plastics, church dwight, citrix systems pacific
Master: CHANG CHUN GROUP, CHURCH AND DWIGHT CO INC, CITRIX SYSTEMS ASIA PACIFIC P L, CNH INDUSTRIAL N.V

From these, I have to take each name in Source, check it against every Master name, find the most relevant match, and print the output as another data frame. The data frames above are only a few
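A minimal sketch of one way to do this in Python, assuming pandas data frames with a hypothetical name column; difflib's get_close_matches performs the fuzzy lookup:

    import pandas as pd
    from difflib import get_close_matches

    source = pd.DataFrame({"name": ["chang chun petrochemical", "church dwight",
                                    "citrix systems pacific"]})
    master = pd.DataFrame({"name": ["CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
                                    "CITRIX SYSTEMS ASIA PACIFIC P L"]})

    def best_match(name, candidates):
        # Compare case-insensitively, but return the original casing.
        lowered = [c.lower() for c in candidates]
        hits = get_close_matches(name.lower(), lowered, n=1, cutoff=0.4)
        return candidates[lowered.index(hits[0])] if hits else None

    result = source.copy()
    result["match"] = source["name"].apply(
        lambda n: best_match(n, master["name"].tolist()))
    print(result)  # one row per Source name with its best Master match

The cutoff value is a guess; raising it trades recall for precision.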

Techniques for finding near duplicate records

Posted by 孤人 on 2019-11-27 02:50:45
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!". My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar. My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some
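The question is about R, but the idea translates directly; a Python sketch under the same plan (normalize, then compare), with blocking on the first token so the comparison is not all-pairs:

    import re
    from collections import defaultdict
    from difflib import SequenceMatcher

    SYNONYMS = {"limited": "ltd", "incorporated": "inc"}

    def normalize(name):
        name = name.lower()
        for long_form, short_form in SYNONYMS.items():
            name = name.replace(long_form, short_form)
        return re.sub(r"[^a-z ]", "", name).strip()

    def near_duplicates(names, threshold=0.9):
        # Group names by their first normalized token and only compare
        # within groups, avoiding a full n^2 scan of the table.
        blocks = defaultdict(list)
        for n in names:
            norm = normalize(n)
            blocks[norm.split()[0] if norm else ""].append((n, norm))
        for group in blocks.values():
            for i, (a, na) in enumerate(group):
                for b, nb in group[i + 1:]:
                    if SequenceMatcher(None, na, nb).ratio() >= threshold:
                        yield a, b

    names = ["Some Company Limited", "SOME COMPANY LTD!", "Another Firm Inc"]
    print(list(near_duplicates(names)))
    # [('Some Company Limited', 'SOME COMPANY LTD!')]

Blocking assumes near-duplicates share a first word, which is a simplification; sorted-token or phonetic keys are common refinements.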

Fuzzy String Comparison

Posted by 五迷三道 on 2019-11-26 23:47:22
What I am striving to complete is a program which reads in a file and compares each sentence against the original sentence. A sentence which is a perfect match to the original will receive a score of 1, and a sentence which is the total opposite will receive a 0. All other fuzzy sentences will receive a grade between 0 and 1. I am unsure which operation to use to allow me to complete this in Python 3. I have included the sample text, in which Text 1 is the original and the other following strings are the comparisons. Text: Sample Text 1: It was a dark and stormy night. I was
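The standard-library answer here is difflib.SequenceMatcher, whose ratio() is exactly a 0-to-1 similarity score. A minimal sketch with stand-in sentences, since the sample text is truncated above:

    from difflib import SequenceMatcher

    original = "It was a dark and stormy night."  # stand-in for Sample Text 1

    candidates = [
        "It was a dark and stormy night.",   # identical
        "It was a dark and rainy night.",    # close
        "Completely different words here.",  # unrelated
    ]

    for sentence in candidates:
        score = SequenceMatcher(None, original, sentence).ratio()
        print(round(score, 2), sentence)
    # 1.0 for the identical sentence, progressively lower as text diverges

Note that ratio() rarely hits exactly 0 on real text, so "total opposite" in practice means a score near 0 rather than exactly 0.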

How can I match fuzzy match strings from two datasets?

Posted by 时光怂恿深爱的人放手 on 2019-11-26 17:32:25
I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other list had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS. So far AGREP is the closest tool I've found that might work. I can use Levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the
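The Levenshtein distance AGREP uses is easy to state precisely; a minimal Python sketch (illustrative only, not the AGREP implementation):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance: deletions,
        # insertions and substitutions each cost 1.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def best_match(name, candidates):
        # Smallest edit distance wins; keep the distance so weak
        # matches can be rejected with a threshold later.
        return min((levenshtein(name.lower(), c.lower()), c)
                   for c in candidates)

    print(best_match("Acme Widgets Ltd", ["ACME WIDGET LTD", "Apex Holdings"]))
    # (1, 'ACME WIDGET LTD')

Keeping the distance alongside the match matters because the nearest string can still be a bad match when no true counterpart exists in the other list.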

Joining two datasets using fuzzy logic

Posted by 怎甘沉沦 on 2019-11-26 16:56:51
Question: I'm trying to do a fuzzy logic join in R between two datasets: the first data set has the name of a location and a column called config; the second data set has the name of a location and two additional attributes that need to be summarized before they are joined to the first data set. I would like to use the name column to join the two data sets. However, the name column may have additional or leading characters in either data set, or have one word contained inside a larger word. So for
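The question targets R, but the matching rule it describes (extra leading or trailing characters, or one word contained in a larger one) can be sketched in Python with pandas; column names and example rows here are hypothetical:

    import pandas as pd

    left = pd.DataFrame({"name": ["Fort McMurray", "Edmonton Intl"],
                         "config": ["A", "B"]})
    right = pd.DataFrame({"name": ["fort mcmurray airport",
                                   "fort mcmurray airport", "edmonton"],
                          "value": [1, 2, 5]})

    # Summarize the second data set before joining, as described above.
    right = right.groupby("name", as_index=False)["value"].sum()

    def norm(s):
        return s.lower().strip()

    # Match when one normalized name contains the other, which tolerates
    # extra words or characters on either side.
    rows = [{"name": l.name, "config": l.config, "value": r.value}
            for l in left.itertuples(index=False)
            for r in right.itertuples(index=False)
            if norm(l.name) in norm(r.name) or norm(r.name) in norm(l.name)]

    print(pd.DataFrame(rows))

Containment alone over-matches short names ("edmonton" would match anything containing it), so a real join would combine it with an edit-distance cutoff.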
