Duke deduplication engine: exact same record not matched

回眸只為那壹抹淺笑 submitted on 2020-01-24 19:09:27

Question


I am attempting to use Duke to match records from one CSV file against another. Both CSV files have the columns ID, Model, Price, CompanyName, Review and Url. I want to match the first file against the second to find duplicate records.

package no.priv.garshol.duke;

import no.priv.garshol.duke.matchers.PrintMatchListener;

public class RunDuke {

  public static void main(String[] argv) throws Exception {
    Configuration config =
        ConfigLoader
            .load("/home/kishore/Duke-master/doc/example-data/presonalCare.xml");
    Processor proc = new Processor(config);
    proc.addMatchListener(new PrintMatchListener(true, true, true, false, config.getProperties(),
        true));
    proc.link();
    proc.close();
  }

}

Here is an example of personalCare.xml:

<!-- For more information, see https://github.com/larsga/Duke/wiki/
     Improvements needed:
     - some area numbers have spaces in them
     - not stripping accents from names -->
<duke>
    <schema>
        <threshold>0.7</threshold>
        <property type="id">
            <name>ID</name>
        </property>
        <property>
            <name>Model</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.4</low>
            <high>0.8</high>
        </property>
        <property>
            <name>Price</name>
            <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
            <low>0.04</low>
            <high>0.73</high>
        </property>
        <property>
            <name>CompanyName</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.4</low>
            <high>0.8</high>
        </property>
        <property>
            <name>Review</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.12</low>
            <high>0.93</high>
        </property>
        <property>
            <name>Url</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.12</low>
            <high>0.93</high>
        </property>
    </schema>

    <database class="no.priv.garshol.duke.databases.InMemoryDatabase">
    </database>

    <group>
        <csv>
            <param name="input-file" value="personal_care_11.csv" />
            <param name="header-line" value="false" />
            <column name="1" property="ID" />
            <column name="2" property="Model" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
            <column name="3" property="Price" />
            <column name="4" property="CompanyName" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
            <column name="5" property="Review" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
            <column name="6" property="Url" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        </csv>
    </group>

    <group>
        <csv>
            <param name="input-file" value="personal_care_11.csv" />
            <param name="header-line" value="false" />
            <column name="1" property="ID" />
            <column name="2" property="Model" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
            <column name="3" property="Price" />
            <column name="4" property="CompanyName" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
            <column name="5" property="Review" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
            <column name="6" property="Url" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        </csv>
    </group>


</duke>

The above code runs fine, but it does not match an identical record, for example:

STHDRNFKAQ4AFYE8,Littmann 3M Classic II S.E Acoustic Stethoscope,6297,Littmann,,http://dl.flipkart.com/dl/littmann-3m-classic-ii-s-e-acoustic-stethoscope/p/itme3uhzbqxhzfda?pid=STHDRNFKAQFAFYE8&affid=3ba0de4902524e2b90e43b84b89ea0ef

This record is in both CSV files. I would also like to know what the low and high property values in the .xml file do, and how to choose the low and high values for each column.


Answer 1:


You are doing record linkage (two data sets) and not deduplication (single data set), so take out the .deduplicate() call.

Also, please don't use the 'no.priv.garshol.duke' package name. You should never use domain names you don't own yourself.

Anyway, the reason you can't find any matches is that the two records have the same ID. Duke verifies that it's not reporting records as matching themselves, and so the match gets filtered out. If you make a copy of the CSV file, use that copy for group 2, and change the ID in it, then Duke finds the duplicate.
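
One way to try this is to generate the copy with altered IDs programmatically. The following is a minimal sketch and not part of Duke: the class name CsvIdRewriter, the "B-" prefix and the file names in the usage note are made up for illustration, and it assumes one record per line with the ID as the first column and no commas inside the ID.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Duke: writes a copy of a CSV file in which
// every ID (first column) carries a prefix, so the two groups no longer share IDs.
final class CsvIdRewriter {

  // Prefixes the first comma-separated field of a single CSV line.
  // Assumption: the ID is the first column and contains no commas or quotes.
  static String prefixId(String line, String prefix) {
    if (line.trim().isEmpty()) {
      return line; // leave blank lines untouched
    }
    int comma = line.indexOf(',');
    String id = comma < 0 ? line : line.substring(0, comma);
    String rest = comma < 0 ? "" : line.substring(comma);
    return prefix + id + rest;
  }

  // Applies prefixId to every line and returns the rewritten lines.
  static List<String> prefixIds(List<String> lines, String prefix) {
    List<String> out = new ArrayList<>(lines.size());
    for (String line : lines) {
      out.add(prefixId(line, prefix));
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("usage: java CsvIdRewriter <input.csv> <output.csv>");
      return;
    }
    List<String> in = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
    // "B-" is an arbitrary marker chosen for this sketch.
    Files.write(Paths.get(args[1]), prefixIds(in, "B-"), StandardCharsets.UTF_8);
  }
}
```

After compiling, running something like java CsvIdRewriter personal_care_11.csv personal_care_11_copy.csv and pointing the second group's input-file at the copy should give the two groups distinct IDs.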

Here's what happens when I try that:

[lars.garshol@laptop tmp]$ java -cp ~/cvs-co/duke/duke-core/target/duke-core-1.3-SNAPSHOT.jar:. no.priv.garshol.duke.Duke --showmatches presonalCare.xml 

MATCH 0.9982630751840313
ID: 'SHDRNFKAQ4AFYE8', Model: 'littmann 3m classic ii s.e acoustic stethoscope', Price: '6297', CompanyName: 'littmann', Url: 'http://dl.flipkart.com/dl/littmann-3m-classic-ii-s-e-acoustic-stethoscope/p/itme3uhzbqxhzfda?pid=sthdrnfkaqfafye8&affid=3ba0de4902524e2b90e43b84b89ea0ef', 
ID: 'STHDRNFKAQ4AFYE8', Model: 'littmann 3m classic ii s.e acoustic stethoscope', Price: '6297', CompanyName: 'littmann', Url: 'http://dl.flipkart.com/dl/littmann-3m-classic-ii-s-e-acoustic-stethoscope/p/itme3uhzbqxhzfda?pid=sthdrnfkaqfafye8&affid=3ba0de4902524e2b90e43b84b89ea0ef', 
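
On the question about low and high: the score in the output above can be reproduced from the high values in the configuration, which gives a hint about what they mean. The sketch below is not Duke's own code. It assumes that a property whose two values are identical contributes its high value as a probability, that the empty Review field is skipped (it does not appear in the output), and that the per-property probabilities are combined with the naive Bayes formula starting from a neutral 0.5. The class and method names are made up for illustration.

```java
// Hypothetical sketch, not Duke's implementation: reproduces the match score
// from the "high" values in the configuration under the stated assumptions.
final class DukeScoreSketch {

  // Naive Bayes combination of two independent probabilities.
  static double combine(double p1, double p2) {
    return (p1 * p2) / ((p1 * p2) + ((1.0 - p1) * (1.0 - p2)));
  }

  // Folds all per-property probabilities into one score, starting from 0.5.
  static double score(double... probabilities) {
    double p = 0.5; // neutral starting point: no evidence either way
    for (double q : probabilities) {
      p = combine(p, q);
    }
    return p;
  }

  public static void main(String[] args) {
    // high values from the config for the properties that are identical in
    // both records: Model (0.8), Price (0.73), CompanyName (0.8), Url (0.93).
    // Review is empty in this record and is assumed to be skipped.
    System.out.println(score(0.8, 0.73, 0.8, 0.93)); // approximately 0.99826
  }
}
```

Under the same assumption, low is the probability a property contributes when its two values are clearly different. A property whose mismatch should count strongly against a match (such as Price here, with low 0.04) gets a low well below 0.5, and a property whose agreement is strong evidence for a match gets a high close to 1.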


Source: https://stackoverflow.com/questions/32991390/duke-deduplication-engine-exact-same-record-not-matched
