Determining duplicates in a datatable

走了就别回头了 2021-01-25 03:22

I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id and owner_org_id) in t

2 answers
  • 2021-01-25 03:36

    Your criterion is off. You are comparing sets of objects that you are not interested in (Except excludes them).

    Instead, be as clear (data type) as possible and keep it simple:

    public bool Equals(DataRow x, DataRow y)
    {   
        // Usually you are dealing with INT keys
        return (x["PRODUCT_ID"] as int?) == (y["PRODUCT_ID"] as int?)
          && (x["OWNER_ORG_ID"] as int?) == (y["OWNER_ORG_ID"] as int?);
    
        // If you really are dealing with strings, this is the equivalent:
        // return (x["PRODUCT_ID"] as string) == (y["PRODUCT_ID"] as string)
        //  && (x["OWNER_ORG_ID"] as string) == (y["OWNER_ORG_ID"] as string);
    }  
    

    Check for null if that is a possibility. Maybe you want to exclude rows that are equal because their IDs are null.
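    If this Equals method is meant to be used through IEqualityComparer<DataRow> with hash-based operators such as Distinct or Except, it also needs a GetHashCode that agrees with it, because those operators only call Equals on rows whose hash codes match. A minimal sketch, assuming INT key columns; the class name ProductOwnerComparer is made up for illustration, and rows whose keys are both NULL are treated as equal here:

    ```csharp
    using System.Collections.Generic;
    using System.Data;

    // Hypothetical name; compares rows on the two key columns only.
    public sealed class ProductOwnerComparer : IEqualityComparer<DataRow>
    {
        public bool Equals(DataRow x, DataRow y)
        {
            if (ReferenceEquals(x, y)) return true;
            if (x == null || y == null) return false;

            // "as int?" turns DBNull into null, so two NULL keys compare as equal.
            return (x["PRODUCT_ID"] as int?) == (y["PRODUCT_ID"] as int?)
                && (x["OWNER_ORG_ID"] as int?) == (y["OWNER_ORG_ID"] as int?);
        }

        public int GetHashCode(DataRow row)
        {
            // Hash exactly the values that Equals compares; NULL keys hash as 0.
            int productId = (row["PRODUCT_ID"] as int?) ?? 0;
            int ownerOrgId = (row["OWNER_ORG_ID"] as int?) ?? 0;
            unchecked { return (productId * 397) ^ ownerOrgId; }
        }
    }
    ```

    With that in place, something like dtCSV.AsEnumerable().Distinct(new ProductOwnerComparer()) should return one row per key pair. If rows with NULL keys must not count as equal, filtering them out before comparing is probably simpler than changing Equals.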

    Observe the int?. This is not a typo. The question mark is required if you are dealing with database values from columns that can be NULL. The reason is that NULL values are represented by the type DBNull in C#. Using the as operator just gives you null in this case (instead of an InvalidCastException). If you are sure you are dealing with INT NOT NULL, cast with (int).

    The same is true for strings. (string) asserts you are expecting non-null DB values.
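    The difference between the two casts can be checked with a small sketch. The class and method names below are made up for illustration, and the one-row table is sample data, not the CSV from the question:

    ```csharp
    using System;
    using System.Data;

    public static class NullCastDemo
    {
        // Builds a one-row table whose PRODUCT_ID is DBNull (hypothetical sample data).
        public static DataRow MakeRowWithNullId()
        {
            var table = new DataTable();
            table.Columns.Add("PRODUCT_ID", typeof(int));
            return table.Rows.Add(DBNull.Value);
        }

        // "as int?" yields null for DBNull instead of throwing.
        public static int? ReadWithAs(DataRow row)
        {
            return row["PRODUCT_ID"] as int?;
        }

        // A direct (int) cast throws InvalidCastException when the value is DBNull.
        public static bool DirectCastThrows(DataRow row)
        {
            try
            {
                int value = (int)row["PRODUCT_ID"];
                return false;
            }
            catch (InvalidCastException)
            {
                return true;
            }
        }
    }
    ```

    Note that "as int?" also returns null when the column holds a value of a different type, for example a string read from a CSV file, so a column of unexpected type silently looks like a column full of NULLs.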

    EDIT1:

    Had the type wrong. ItemArray is not a hashtable. Use the row directly.

    EDIT2:

    Added string example, some comment

    For a more straightforward way, check "How to select distinct rows in a datatable and store into an array".

    EDIT3:

    Some explanation regarding the casts.

    The other link I suggested does the same as your code. I forgot your original intent ;-) I just saw your code and responded to the most obvious error I saw. Sorry.

    Here is how I would solve the problem

    using System.Data;  // DataTable and its AsEnumerable() extension
    using System.Linq;
    
    var q = dtCSV
        .AsEnumerable()
        .GroupBy(r => new { ProductId = (int)r["PRODUCT_ID"], OwnerOrgId = (int)r["OWNER_ORG_ID"] })
        .Where(g => g.Count() > 1).SelectMany(g => g);
    
    var duplicateRows = q.ToList();
    

    I don't know if this is 100% correct; I don't have an IDE at hand. You'll also need to adjust the casts to the appropriate type. See my addition above.

  • 2021-01-25 03:44

    You could use LINQ-To-DataSet and Enumerable.Except/Intersect:

    var tbl1ID = tbl1.AsEnumerable()
            .Select(r => new
            {
                product_id = r.Field<String>("product_id"),
                owner_org_id = r.Field<String>("owner_org_id"),
            });
    var tbl2ID = tbl2.AsEnumerable()
            .Select(r => new
            {
                product_id = r.Field<String>("product_id"),
                owner_org_id = r.Field<String>("owner_org_id"),
            });
    
    
    var unique = tbl1ID.Except(tbl2ID);
    var both = tbl1ID.Intersect(tbl2ID);
    
    var tblUnique = (from uniqueRow in unique
                    join row in tbl1.AsEnumerable()
                    on uniqueRow equals new
                    {
                        product_id = row.Field<String>("product_id"),
                        owner_org_id = row.Field<String>("owner_org_id")
                    }
                    select row).CopyToDataTable();
    var tblBoth = (from bothRow in both
                  join row in tbl1.AsEnumerable()
                  on bothRow equals new
                  {
                      product_id = row.Field<String>("product_id"),
                      owner_org_id = row.Field<String>("owner_org_id")
                  }
                  select row).CopyToDataTable();
    

    Edit: Obviously I've misunderstood your requirement a little bit. So you only have one DataTable and want to get all unique and all duplicate rows; that's even more straightforward. You can use Enumerable.GroupBy with an anonymous type containing both fields:

    var groups = tbl1.AsEnumerable()
        .GroupBy(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id")
        });
    var tblUniques = groups
        .Where(grp => grp.Count() == 1)
        .Select(grp => grp.Single())
        .CopyToDataTable();
    var tblDuplicates = groups
        .Where(grp => grp.Count() > 1)
        .SelectMany(grp => grp)
        .CopyToDataTable();
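
    One thing to watch for: CopyToDataTable throws an InvalidOperationException when the sequence contains no rows, which happens here whenever the table has no duplicates (or no uniques). A minimal sketch of a guard, assuming string columns as above; the class and method names are made up for illustration:

    ```csharp
    using System.Collections.Generic;
    using System.Data;
    using System.Linq;

    public static class DuplicateFinder
    {
        // Returns the rows that share a (product_id, owner_org_id) pair with at least one other row.
        // Falls back to an empty table with the source schema when nothing is duplicated,
        // because CopyToDataTable cannot be called on an empty sequence.
        public static DataTable GetDuplicates(DataTable source)
        {
            List<DataRow> duplicates = source.AsEnumerable()
                .GroupBy(r => new
                {
                    ProductId = r.Field<string>("product_id"),
                    OwnerOrgId = r.Field<string>("owner_org_id")
                })
                .Where(g => g.Count() > 1)
                .SelectMany(g => g)
                .ToList();

            return duplicates.Count > 0 ? duplicates.CopyToDataTable() : source.Clone();
        }
    }
    ```

    DataTable.Clone copies only the structure (columns and constraints), so the caller always gets a table with the expected columns, whether or not any duplicates exist.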
    