DataTable.Select and Performance Issue in C#

误落风尘 2020-12-15 01:56

I'm importing the data from three tab-delimited files into DataTables, and after that I need to go through every row of the master table and find all the rows in two child tables

6 Answers
  • 2020-12-15 02:19

    Have you run it through a profiler? That should be the first step. Anyhow, this might help:

    • Read the master text file into memory line by line. Put the master record into a dictionary as the key. Add it to the dataset (1 pass through master).

    • Read each child text file line by line, adding each record as a value for the appropriate master record in the dictionary created above.

    • Now you have everything in the dictionary in memory, only doing 1 pass through each file. Do a final pass through the dictionary/children and process each column and perform final calcs.
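
    The steps above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: the class and method names (MasterChildIndex, Build), the keyColumn parameter, and the idea that the master key sits in the same tab-separated column in both files are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MasterChildIndex
{
    // Groups child records under the master key found in the given column.
    // keyColumn is a hypothetical parameter: the index of the tab-separated
    // field that holds the master record's key in both inputs.
    public static Dictionary<string, List<string[]>> Build(
        IEnumerable<string> masterLines,
        IEnumerable<string> childLines,
        int keyColumn)
    {
        var index = new Dictionary<string, List<string[]>>();

        // Pass 1: one dictionary entry per master record.
        foreach (string line in masterLines)
        {
            string[] fields = line.Split('\t');
            index[fields[keyColumn]] = new List<string[]>();
        }

        // Pass 2: attach each child record to its master, if one exists.
        foreach (string line in childLines)
        {
            string[] fields = line.Split('\t');
            List<string[]> children;
            if (index.TryGetValue(fields[keyColumn], out children))
            {
                children.Add(fields);
            }
        }

        return index;
    }

    public static void Main()
    {
        // In-memory sample lines stand in for the real files.
        var master = new[] { "A\tAlpha", "B\tBeta" };
        var child = new[] { "A\t1", "A\t2", "B\t3", "Z\t9" };

        Dictionary<string, List<string[]>> index = Build(master, child, 0);
        foreach (KeyValuePair<string, List<string[]>> pair in index)
        {
            Console.WriteLine(pair.Key + ": " + pair.Value.Count + " child row(s)");
        }
    }
}
```

    With real files, the two line sequences could come from System.IO.File.ReadLines(path), which streams the file instead of loading it whole. Each lookup is then a single hash probe rather than a scan of the child table.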

  • 2020-12-15 02:20

    DataTables can be given Relationships with other DataTables in a DataSet. See http://msdn.microsoft.com/en-us/library/ay82azad%28VS.71%29.aspx for a bit of discussion and as a starting point for browsing. I don't have much experience using them, but as I understand it they will do what you want (assuming your tables are in a suitable format). I would assume they are more efficient than doing the same thing manually, but I may be wrong. It might be worth seeing whether they work for you and benchmarking to see whether they are an improvement.
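
    As a rough illustration of what such a relationship looks like in code, here is a small sketch. The table, column and relation names ("Master", "Child", "MasterId", "MasterChild") are made up for the example and do not come from the question.

```csharp
using System;
using System.Data;

public static class RelationLookup
{
    // Builds a two-table DataSet linked by a DataRelation.
    // All table, column and relation names here are illustrative.
    public static DataSet BuildSample()
    {
        var master = new DataTable("Master");
        master.Columns.Add("Id", typeof(int));
        master.Rows.Add(1);
        master.Rows.Add(2);

        var child = new DataTable("Child");
        child.Columns.Add("MasterId", typeof(int));
        child.Columns.Add("Value", typeof(string));
        child.Rows.Add(1, "a");
        child.Rows.Add(1, "b");
        child.Rows.Add(2, "c");

        var ds = new DataSet();
        ds.Tables.Add(master);
        ds.Tables.Add(child);

        // Link Child.MasterId to Master.Id. With this overload the matching
        // unique and foreign-key constraints are created as well.
        ds.Relations.Add("MasterChild", master.Columns["Id"], child.Columns["MasterId"]);
        return ds;
    }

    public static void Main()
    {
        DataSet ds = BuildSample();
        foreach (DataRow parent in ds.Tables["Master"].Rows)
        {
            // Fetch the child rows through the relation instead of Select().
            DataRow[] children = parent.GetChildRows("MasterChild");
            Console.WriteLine(parent["Id"] + " -> " + children.Length + " child row(s)");
        }
    }
}
```

    Because the relation enforces constraints by default, every child row must reference an existing master row and the master key must be unique; data that violates either will make the setup throw.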

  • 2020-12-15 02:29

    As of .NET 4.5, the issue is still there.

    Here are the results of a simple benchmark comparing the CPU time of DataTable.Select with different dictionary implementations (results are in milliseconds, with a comma as the decimal separator):

        #Rows Table.Select  Hashtable[] SortedList[] Dictionary[]
         1000        43,31         0,01         0,06         0,00
         6000       291,73         0,07         0,13         0,01
        11000       604,79         0,04         0,16         0,02
        16000       914,04         0,05         0,19         0,02
        21000      1279,67         0,05         0,19         0,02
        26000      1501,90         0,05         0,17         0,02
        31000      1738,31         0,07         0,20         0,03
    

    Problem:

    The DataTable.Select method creates a "System.Data.Select" class instance internally, and this "Select" class creates indexes based on the fields (columns) specified in the query. The Select class reuses the indexes it has created, but the DataTable implementation does not reuse the Select instance, so the indexes are re-created every time DataTable.Select is invoked. (This behaviour can be observed by decompiling System.Data.)

    Solution:

    Assume the following query

    DataRow[] rows = data.Select("COL1 = 'VAL1' AND (COL2 = 'VAL2' OR COL2 IS NULL)");
    

    Instead, create and fill a Dictionary whose keys correspond to the different value combinations of the columns used in the filter. (This relatively expensive operation needs to be done only once; the dictionary instance must then be reused.)

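    // Index rows by the combined values of the filter columns (build once, then reuse).
    Dictionary<string, List<DataRow>> di = new Dictionary<string, List<DataRow>>();
    
    foreach (DataRow dr in data.Rows)
    {
        // "<NULL>" stands in for DBNull so that null values get their own key.
        string key = (dr["COL1"] == DBNull.Value ? "<NULL>" : dr["COL1"]) + "//" + (dr["COL2"] == DBNull.Value ? "<NULL>" : dr["COL2"]);
        List<DataRow> bucket;
        if (!di.TryGetValue(key, out bucket))
        {
            // First row seen for this key: create its list.
            bucket = new List<DataRow>();
            di.Add(key, bucket);
        }
        bucket.Add(dr);
    }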
    

    Then query the Dictionary (several lookups may be required) to filter the rows, and combine the results into a List:

    string key1 = "VAL1//VAL2";
    string key2 = "VAL1//<NULL>";
    List<DataRow> results = new List<DataRow>();
    if (di.ContainsKey(key1))
    {
        results.AddRange(di[key1]);
    }
    if (di.ContainsKey(key2))
    {
        results.AddRange(di[key2]);
    }
    
  • 2020-12-15 02:29

    I know this is an old question, and the code underpinning this issue may have changed, but I've recently encountered (and gained some insight into) this very issue.

    For anyone coming along at a later date ... here's what I found.

    Performance of the DataTable.Select(condition) is quite sensitive to the nature and structure of the 'condition' you provide. This looks like a bug to me (where would I report it to Microsoft?) but it may merely be a quirk.

    I've written a set of tests to demonstrate the issue that are structured as follows:

    1. Define a datatable with a few simple columns, like this:

      var dataTable = new DataTable();
      var idCol = dataTable.Columns.Add("Id", typeof(Int32));
      dataTable.Columns.Add("Code", typeof(string));
      dataTable.Columns.Add("Name", typeof(string));
      dataTable.Columns.Add("FormationDate", typeof(DateTime));
      dataTable.Columns.Add("Income", typeof(Decimal));
      dataTable.Columns.Add("ChildCount", typeof(Int32));
      dataTable.Columns.Add("Foreign", typeof(Boolean));
      dataTable.PrimaryKey = new DataColumn[1] { idCol };

    2. Populate the table with 40000 records, each with a unique 'Code' field.

    3. Perform a batch of 'selects' (each with different parameters) against the datatable using two similar, but differently formatted, queries and record and compare the total time taken by each of the two formats.
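
    A minimal sketch of such a harness, for anyone who wants to reproduce the comparison. It assumes a cut-down table with only the Id and Code columns, and the names SelectTiming, BuildTable and Time are hypothetical. Timings depend on machine and runtime version, so no expected numbers are given.

```csharp
using System;
using System.Data;
using System.Diagnostics;

public static class SelectTiming
{
    // Builds a table with a primary key on Id and a unique Code per row.
    public static DataTable BuildTable(int rowCount)
    {
        var table = new DataTable();
        DataColumn idCol = table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Code", typeof(string));
        table.PrimaryKey = new[] { idCol };

        for (int i = 0; i < rowCount; i++)
        {
            table.Rows.Add(i, "C" + i);
        }
        return table;
    }

    // Runs `lookups` Select calls, substituting an existing code into the
    // filter format each time, and returns the elapsed milliseconds.
    public static long Time(DataTable table, string filterFormat, int lookups)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < lookups; i++)
        {
            DataRow[] rows = table.Select(string.Format(filterFormat, "C" + i));
            if (rows.Length != 1)
            {
                throw new InvalidOperationException("Unexpected match count.");
            }
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        DataTable table = BuildTable(40000);
        Console.WriteLine("no brackets:   " + Time(table, "[Code] = '{0}'", 320) + " ms");
        Console.WriteLine("with brackets: " + Time(table, "([Code] = '{0}')", 320) + " ms");
    }
}
```

    The number of lookups must not exceed the row count, since each lookup is expected to hit exactly one existing code.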

    You get remarkable results. Testing, for example, the below two conditions side-by-side:

    Q1: [Code] = 'XX'

    Q2: ([Code] = 'XX')

    I make multiple Select calls with each of the two queries above, replacing XX on each iteration with a valid code that exists in the datatable. The result?

    Time comparison for 320 lookups against 40000 records: 180 msec total search time with no brackets, 6871 msec total search time for search WITH brackets

    Yes - 38 times slower if you just have the extra brackets surrounding the condition. There are other scenarios which react differently.

    For example, [Code] = '{searchCode}' OR 1=0 vs ([Code] = '{searchCode}' OR 1=0) take similar (slow) times to execute, but:

    [Code] = '{searchCode}' AND 1=1 vs ([Code] = '{searchCode}' AND 1=1) again shows the non-bracketed version to be close to 40 times faster.

    I've not investigated all scenarios, but it seems that the introduction of brackets - either redundantly around a simple comparison check, or as required to specify sub-expression precedence - or the presence of an 'OR' slows the query down considerably.

    I could speculate that the issue is caused by how the datatable parses the condition you use and how it creates and uses internal indexes ... but I won't.

  • 2020-12-15 02:38

    You can speed it up a lot by using a dictionary. For example:

    if (distinctdt.Rows.Count > 0)
    {
        // build index of C1 values to speed inner loop
        Dictionary<string, DataRow> masterIndex = new Dictionary<string, DataRow>();
        foreach (DataRow row in rawMasterdt.Rows)
            masterIndex[row["C1"].ToString()] = row;
    
        int count = 0;
        foreach (DataRow offer in distinctdt.Rows)
        {
    

    Then in place of

        string exp = "C1 = " + "'" + offer[0].ToString() + "'" + "";
        DataRow masterRow = rawMasterdt.Select(exp)[0];
    

    You would do this

    // TryGetValue does a single lookup and leaves masterRow null when the key is missing.
    DataRow masterRow;
    if (!masterIndex.TryGetValue(offer[0].ToString(), out masterRow))
        masterRow = null;
    
  • 2020-12-15 02:40

    If you create a DataRelation between your parent and child DataTables, you can look up child rows by invoking DataRow.GetChildRows(DataRelation) on the parent row (or DataRow.GetChildRelName in the case of typed DataSets). The search will apply a TreeMap lookup, and performance should be fine even with a lot of child rows.

    If you have to search for rows based on criteria other than a DataRelation's foreign keys, I recommend using DataView.Sort / DataView.FindRows() instead of DataTable.Select() as soon as you have to query the data more than once. DataView.FindRows() is based on a TreeMap lookup (O(log(N))), whereas DataTable.Select() has to scan all rows (O(N)). This article contains more details: http://arnosoftwaredev.blogspot.com/2011/02/when-datatableselect-is-slow-use.html
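
    A small sketch of the DataView approach, using a made-up table ("Items" with "Code" and "Name" columns) and a hypothetical ViewLookup class purely for illustration:

```csharp
using System;
using System.Data;

public static class ViewLookup
{
    // Sample data; table and column names are illustrative only.
    public static DataTable BuildSample()
    {
        var table = new DataTable("Items");
        table.Columns.Add("Code", typeof(string));
        table.Columns.Add("Name", typeof(string));
        table.Rows.Add("A1", "first");
        table.Rows.Add("B2", "second");
        table.Rows.Add("A1", "third");
        return table;
    }

    public static void Main()
    {
        DataTable table = BuildSample();

        // Create the sorted view once and reuse it for every lookup;
        // FindRows searches on the column(s) named in Sort.
        var view = new DataView(table);
        view.Sort = "Code";

        DataRowView[] matches = view.FindRows("A1");
        foreach (DataRowView match in matches)
        {
            Console.WriteLine(match["Name"]);
        }
    }
}
```

    The benefit depends on keeping the DataView around between lookups. Creating a new sorted view for each query would pay the sorting cost every time.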
