Code never lies!


                                                   -- iret

导航

A faster DISTINCT for DataTable

Posted on 2005-11-21 13:07  iret  阅读(2798)  评论(3编辑  收藏  举报

Do you need to DISTINCT your datatable? In this article, I will show you a faster solution for DISTINCT a datatable.

There is some other solutions for DISTINCT a datatble, like http://weblogs.sqlteam.com/davidm/archive/2004/01/21/753.aspx. As you seen, this solution firstly sorts the datatable rows using Datatable.Select method. Then we get a collection of rows which all of the same rows will be placed next to each other. Finally, just compare the rows with previous one, if they are equal, remove it, otherwise, insert to the result datatable.

private static bool RowEqual(object[] Values, object[] OtherValues)
{
      
if(Values == null)
            
return false;

      
for(int i = 0; i < Values.Length; i++)
      
{
            
if(!Values[i].Equals(OtherValues[i]))
                  
return false;
      }

      
return true;
}
 
public static DataTable Distinct(DataTable Table, DataColumn[] Columns)
{
      
//Empty table
      DataTable table = new DataTable("Distinct");
      
//Sort variable
      string sort = string.Empty;
      
//Add Columns & Build Sort expression
      for(int i = 0; i < Columns.Length; i++)
      
{
            table.Columns.Add(Columns[i].ColumnName,Columns[i].DataType);

            sort 
+= Columns[i].ColumnName + ",";
      }


      
//Select all rows and sort
      DataRow[] sortedrows = Table.Select(string.Empty,sort.Substring(0,sort.Length-1));      

      
object[] currentrow = null;
      
object[] previousrow = null;      

      table.BeginLoadData();
      
foreach(DataRow row in sortedrows)
      
{
            
//Current row
            currentrow = new object[Columns.Length];

            
for(int i = 0; i < Columns.Length; i++)
            
{
                  currentrow[i] 
= row[Columns[i].ColumnName];
            }
 

            
//Match Current row to previous row
            if(!SQLOps.RowEqual(previousrow, currentrow))
                  table.LoadDataRow(currentrow,
true);

            
//Previous row
            previousrow = new object[Columns.Length];

            
for(int i = 0; i < Columns.Length; i++)
            
{
                  previousrow[i] 
= row[Columns[i].ColumnName];
            }

      }


      table.EndLoadData();
      
return table;

}


It works fine for a small datatable However, to sort the rows in a big datatable is a time-consuming job. And what's more if the datatable has more columns the row equal routine also hits the performance. I thought we can be faster, with the help of hash code and hash table.

First, we do not sort the rows in the datatable, but calculate the hash code for each row. And then a hashtable will be used to save all the different hash codes for the rows in the datatable. If we meet a new row whose hash code is contained in the hashtable, it means there is some rows equal to this row. Otherwise, we should save the new row to the result datatable.

With my solution, the DISTINCT operation can be about 14 times faster than the solution above, while testing 1000 columns* 1000rows datatable. However, we use a hashtable to save the hashcode.

Ok, let's view the details, Code never lies.

public static DataTable Distinct3(DataTable source, DataColumn[] Columns)
{
            DataTable table 
= source.Clone();

            
object[] currentrow = null;
            
            System.Collections.Hashtable ht 
= new System.Collections.Hashtable();
            table.BeginLoadData();

            
foreach(DataRow row in Table.Rows)
            
{
                
                
int hash = string.Concat(row.ItemArray).GetHashCode();
                
if (ht.Contains(hash))
                    
continue;
                ht.Add(hash,hash);
                
//Insert a copy of current row 
                currentrow = new object[Columns.Length]; 
                Array.Copy(row.ItemArray, currentrow, Columns.Length); 
                table.LoadDataRow(currentrow,
true); 

            }


            table.EndLoadData();
            
return table;

}