I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id and owner_org_id) in t
Your criterion is off. You are comparing sets of objects that you are not interested in (Except excludes them).
Instead, be as explicit as possible about the data type and keep it simple:
public bool Equals(DataRow x, DataRow y)
{
    // Usually you are dealing with INT keys
    return (x["PRODUCT_ID"] as int?) == (y["PRODUCT_ID"] as int?)
        && (x["OWNER_ORG_ID"] as int?) == (y["OWNER_ORG_ID"] as int?);
    // If you really are dealing with strings, this is the equivalent:
    // return (x["PRODUCT_ID"] as string) == (y["PRODUCT_ID"] as string)
    //     && (x["OWNER_ORG_ID"] as string) == (y["OWNER_ORG_ID"] as string);
}
Check for null if that is a possibility. Maybe you want to exclude rows that are equal only because their IDs are null.
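The Equals method above is only one half of an IEqualityComparer<DataRow>. Hash-based LINQ operators such as Distinct, Except and GroupBy call GetHashCode first, so the two methods have to agree. Below is a minimal sketch of a complete comparer. The class name ProductOwnerComparer is made up for illustration, and the code assumes both columns are typed as int in the DataTable. If the CSV loader left them as strings, "as int?" yields null for every row and all rows would compare equal, so use the string variant in that case.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical comparer; column names are the ones used in the answer above.
public sealed class ProductOwnerComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;

        // "as int?" yields null for DBNull.Value instead of throwing.
        // Note: two NULL keys compare as equal here.
        return (x["PRODUCT_ID"] as int?) == (y["PRODUCT_ID"] as int?)
            && (x["OWNER_ORG_ID"] as int?) == (y["OWNER_ORG_ID"] as int?);
    }

    public int GetHashCode(DataRow row)
    {
        // Must be consistent with Equals: equal keys produce equal hashes.
        int? productId = row["PRODUCT_ID"] as int?;
        int? ownerOrgId = row["OWNER_ORG_ID"] as int?;
        unchecked
        {
            // Nullable<int>.GetHashCode() returns 0 when there is no value.
            return (productId.GetHashCode() * 397) ^ ownerOrgId.GetHashCode();
        }
    }
}
```

With such a comparer, something like dtCSV.AsEnumerable().GroupBy(r => r, new ProductOwnerComparer()) would group rows by both keys. Whether NULL keys should count as equal is a design decision; filter those rows out beforehand if they should not.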
Note the int? in the casts. This is not a typo. The question mark is required if you are dealing with database values from columns that can be NULL. The reason is that NULL values will be represented by the type DBNull in C#. Using the as operator just gives you null in this case (instead of an InvalidCastException).
If you are sure you are dealing with INT NOT NULL, cast with (int).
The same is true for strings. (string) asserts you are expecting non-null DB values.
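To make the difference between the two casts concrete, here is a small sketch. The class and method names are invented for the example; it builds a one-row table whose nullable int column holds NULL and reads it both ways.

```csharp
using System;
using System.Data;

// Hypothetical demo class, not part of the original answer.
public static class CastDemo
{
    // Builds a one-row table whose PRODUCT_ID column holds a database NULL.
    public static DataRow RowWithNullId()
    {
        var table = new DataTable();
        table.Columns.Add("PRODUCT_ID", typeof(int)); // AllowDBNull defaults to true
        return table.Rows.Add(DBNull.Value);
    }

    // Tolerant read: DBNull.Value is not an int, so "as int?" returns null.
    public static int? SafeRead(DataRow row)
    {
        return row["PRODUCT_ID"] as int?;
    }

    // Strict read: unboxing DBNull.Value to int throws InvalidCastException.
    public static int StrictRead(DataRow row)
    {
        return (int)row["PRODUCT_ID"];
    }
}
```

SafeRead returns null for the NULL row, while StrictRead throws. The strict form is still useful when a NULL would indicate a bug that you want to surface early.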
EDIT1:
Had the type wrong. ItemArray is not a hashtable. Use the row directly.
EDIT2:
Added string example and some comments.
For a more straightforward way, check "How to select distinct rows in a datatable and store into an array".
EDIT3:
Some explanation regarding the casts.
The other link I suggested does the same as your code. I forgot your original intent ;-) I just saw your code and responded to the most obvious error I saw. Sorry.
Here is how I would solve the problem:
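// AsEnumerable() for DataTable lives in the System.Data namespace
// (assembly System.Data.DataSetExtensions), not in System.Data.Linq.
using System.Data;
using System.Linq;

// Group rows by both key columns; every group with more than one row is a set of duplicates.
var q = dtCSV
    .AsEnumerable()
    .GroupBy(r => new { ProductId = (int)r["PRODUCT_ID"], OwnerOrgId = (int)r["OWNER_ORG_ID"] })
    .Where(g => g.Count() > 1)
    .SelectMany(g => g);
var duplicateRows = q.ToList();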
I don't know if this is 100% correct; I don't have an IDE at hand. You'll also need to adjust the casts to the appropriate type. See my addition above.
You could use LINQ-To-DataSet and Enumerable.Except/Intersect:
var tbl1ID = tbl1.AsEnumerable()
    .Select(r => new
    {
        product_id = r.Field<String>("product_id"),
        owner_org_id = r.Field<String>("owner_org_id"),
    });
var tbl2ID = tbl2.AsEnumerable()
    .Select(r => new
    {
        product_id = r.Field<String>("product_id"),
        owner_org_id = r.Field<String>("owner_org_id"),
    });

var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);

var tblUnique = (from uniqueRow in unique
                 join row in tbl1.AsEnumerable()
                 on uniqueRow equals new
                 {
                     product_id = row.Field<String>("product_id"),
                     owner_org_id = row.Field<String>("owner_org_id")
                 }
                 select row).CopyToDataTable();
var tblBoth = (from bothRow in both
               join row in tbl1.AsEnumerable()
               on bothRow equals new
               {
                   product_id = row.Field<String>("product_id"),
                   owner_org_id = row.Field<String>("owner_org_id")
               }
               select row).CopyToDataTable();
Edit: Obviously I've misunderstood your requirement a little bit. So you only have one DataTable and want to get all unique and all duplicate rows. That's even more straightforward. You can use Enumerable.GroupBy with an anonymous type containing both fields:
var groups = tbl1.AsEnumerable()
    .GroupBy(r => new
    {
        product_id = r.Field<String>("product_id"),
        owner_org_id = r.Field<String>("owner_org_id")
    });
var tblUniques = groups
    .Where(grp => grp.Count() == 1)
    .Select(grp => grp.Single())
    .CopyToDataTable();
var tblDuplicates = groups
    .Where(grp => grp.Count() > 1)
    .SelectMany(grp => grp)
    .CopyToDataTable();
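One caveat with the snippet above: CopyToDataTable throws an InvalidOperationException when the sequence contains no rows, which happens here if the table has no duplicates at all (or no unique rows). A possible guard is sketched below. The extension method name CopyToDataTableOrEmpty is hypothetical, not a framework API; it falls back to an empty table with the same schema.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical helper, not part of System.Data.
public static class DataRowSequenceExtensions
{
    // Copies the rows into a new DataTable, or returns an empty clone of
    // schemaSource's structure when the sequence has no rows.
    public static DataTable CopyToDataTableOrEmpty(
        this IEnumerable<DataRow> rows, DataTable schemaSource)
    {
        var list = rows.ToList(); // materialize once so the query is not run twice
        return list.Count > 0
            ? list.CopyToDataTable()
            : schemaSource.Clone(); // Clone() copies columns and constraints, no data
    }
}
```

With this in place, the last statement could end in .SelectMany(grp => grp).CopyToDataTableOrEmpty(tbl1) instead of .CopyToDataTable().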