Do you need to DISTINCT your datatable? In this article, I will show you a faster solution for DISTINCT a datatable.
There is some other solutions for DISTINCT a datatble, like http://weblogs.sqlteam.com/davidm/archive/2004/01/21/753.aspx. As you seen, this solution firstly sorts the datatable rows using Datatable.Select method. Then we get a collection of rows which all of the same rows will be placed next to each other. Finally, just compare the rows with previous one, if they are equal, remove it, otherwise, insert to the result datatable.
{
if(Values == null)
return false;
for(int i = 0; i < Values.Length; i++)
{
if(!Values[i].Equals(OtherValues[i]))
return false;
}
return true;
}
public static DataTable Distinct(DataTable Table, DataColumn[] Columns)
{
//Empty table
DataTable table = new DataTable("Distinct");
//Sort variable
string sort = string.Empty;
//Add Columns & Build Sort expression
for(int i = 0; i < Columns.Length; i++)
{
table.Columns.Add(Columns[i].ColumnName,Columns[i].DataType);
sort += Columns[i].ColumnName + ",";
}
//Select all rows and sort
DataRow[] sortedrows = Table.Select(string.Empty,sort.Substring(0,sort.Length-1));
object[] currentrow = null;
object[] previousrow = null;
table.BeginLoadData();
foreach(DataRow row in sortedrows)
{
//Current row
currentrow = new object[Columns.Length];
for(int i = 0; i < Columns.Length; i++)
{
currentrow[i] = row[Columns[i].ColumnName];
}
//Match Current row to previous row
if(!SQLOps.RowEqual(previousrow, currentrow))
table.LoadDataRow(currentrow,true);
//Previous row
previousrow = new object[Columns.Length];
for(int i = 0; i < Columns.Length; i++)
{
previousrow[i] = row[Columns[i].ColumnName];
}
}
table.EndLoadData();
return table;
}
It works fine for a small datatable However, to sort the rows in a big datatable is a time-consuming job. And what's more if the datatable has more columns the row equal routine also hits the performance. I thought we can be faster, with the help of hash code and hash table.
First, we do not sort the rows in the datatable, but calculate the hash code for each row. And then a hashtable will be used to save all the different hash codes for the rows in the datatable. If we meet a new row whose hash code is contained in the hashtable, it means there is some rows equal to this row. Otherwise, we should save the new row to the result datatable.
With my solution, the DISTINCT operation can be about 14 times faster than the solution above, while testing 1000 columns* 1000rows datatable. However, we use a hashtable to save the hashcode.
Ok, let's view the details, Code never lies.
{
DataTable table = source.Clone();
object[] currentrow = null;
System.Collections.Hashtable ht = new System.Collections.Hashtable();
table.BeginLoadData();
foreach(DataRow row in Table.Rows)
{
int hash = string.Concat(row.ItemArray).GetHashCode();
if (ht.Contains(hash))
continue;
ht.Add(hash,hash);
//Insert a copy of current row
currentrow = new object[Columns.Length];
Array.Copy(row.ItemArray, currentrow, Columns.Length);
table.LoadDataRow(currentrow,true);
}
table.EndLoadData();
return table;
}