Group characters of varchar field

Question

I am creating a tool that imports data from several vendors. Unfortunately the data is not generated by me, so I have to work with it as it comes. I have come across the following situation.

I have a table like the following:

ID    |StartDate   |Availability
========================================
H1    |20130728    |YYYYYYNNNNQQQQQ
H2    |20130728    |NNNNYYYYYYY
A3    |20130728    |NNQQQQNNNNNNNNYYYYYY
A2    |20130728    |NNNNNYYYYYYNNNNNN

To explain what this data means: every letter in the Availability column is the availability flag for a specific date, starting from the date given in the StartDate column.

Y : Available
N : Not Available
Q : On Request

For instance, for ID H1, 20130728 - 20130802 is available, 20130803 - 20130806 is not available, and 20130807 - 20130811 is available on request.
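The position-to-date mapping in the H1 example can be checked with a few DATEADD calls. This is only a small sketch: the variable @StartDate is introduced here for illustration, and it assumes the StartDate value can be handled as a date.

```sql
-- Position p (1-based) in Availability maps to StartDate + (p - 1) days.
-- @StartDate is a hypothetical variable holding H1's StartDate.
DECLARE @StartDate date = '20130728';

SELECT DATEADD(DAY,  1 - 1, @StartDate) AS FirstY,  -- 2013-07-28
       DATEADD(DAY,  6 - 1, @StartDate) AS LastY,   -- 2013-08-02
       DATEADD(DAY,  7 - 1, @StartDate) AS FirstN,  -- 2013-08-03
       DATEADD(DAY, 10 - 1, @StartDate) AS LastN,   -- 2013-08-06
       DATEADD(DAY, 11 - 1, @StartDate) AS FirstQ,  -- 2013-08-07
       DATEADD(DAY, 15 - 1, @StartDate) AS LastQ;   -- 2013-08-11
```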

What I need to do is transform this table into the following layout:

ID    |Available   |StartDate   |EndDate
========================================
H1    |Y           |20130728    |20130802    
H1    |N           |20130803    |20130806    
H1    |Q           |20130807    |20130811
H2    |N           |20130728    |20130731
H2    |Y           |20130801    |20130807
A3    |N           |20130728    |20130729
A3    |Q           |20130730    |20130802
A3    |N           |20130803    |20130810
A3    |Y           |20130811    |20130816
A2    |N           |20130728    |20130801
A2    |Y           |20130802    |20130807
A2    |N           |20130808    |20130813

The initial table has approximately 40,000 rows. The Availability column may have several days (I've seen up to 800).

What I have tried is to turn the Availability string into rows, group consecutive days together, and then take the min and max date for each group. For this I used three or four CTEs.

This works fine for a few IDs, but when I apply it to the whole table it takes ages. (I stopped the initial test run after a full night's sleep and it still hadn't finished, and yes, I mean I was asleep while it was running!)

I have estimated that if I turn each character into a single row, I end up with something like 14.5 million rows.

So I am asking: is there a more efficient way of doing this? (I know there is, but I need you to tell me.)

Thanks in advance.

Answer 1:

This can be done in SQL Server, using recursive CTEs. Here is an example:

with t as (
    select 'H1' as id, cast('20130728' as date) as StartDate,
           'YYYYYYNNNNQQQQQ' as Availability union all
    select 'H2' as id, cast('20130728' as date) as StartDate,
           'NNNNYYYYYYY' as Availability union all
    select 'H3' as id, cast('20130728' as date) as StartDate,
           'NQ' as Availability 
   ),
   cte as (
     select id, left(Availability, 1) as Available,
            StartDate as thedate,
            substring(Availability, 2, 1000) as RestAvailability,
            1 as i,
            1 as periodcnt
     from t
     union all
     select t.id, left(RestAvailability, 1),
            dateadd(dd, 1, thedate),
            substring(RestAvailability, 2, 1000) as RestAvailability,
            1 + cte.i,
            (case when substring(t.Availability, i, 1) = substring(t.Availability, i+1, 1)
                  then periodcnt
                  else periodcnt + 1
             end)
     from t join
          cte
          on t.id = cte.id
     where len(RestAvailability) > 0

   )
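select id, min(thedate) as StartDate, max(thedate) as EndDate, Available
from cte
group by id, periodcnt, Available
-- The default limit is 100 recursion levels; longer Availability strings
-- need the limit raised or lifted (0 = no limit).
option (maxrecursion 0);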

The way this works is that it first spreads out the dates, which is a "typical" use of recursive CTEs. In the process, it also keeps track of whether Available has changed from the previous value (in the column periodcnt), using string manipulation for the comparison.

With this information, the final result is simply an aggregation from this CTE.

Answer 2:

Since SQL Server is not the best tool for this, I would probably set up an Integration Services package and use a script component, written in C#, to generate the several output records from each input row.

Answer 3:

Did you try CROSS APPLY? It may give better performance. This isn't a full answer, just another way to parse the string.

Edit: I'm now using a table variable for the index table.

DECLARE @MaxLen INT
SELECT @MaxLen = MAX(LEN(Availability))
FROM InputTable 
DECLARE @a TABLE (i int)

;WITH x AS
(
    SELECT 1 AS i
    UNION ALL SELECT i + 1 FROM x WHERE i < @MaxLen  -- yields 1..@MaxLen
)
INSERT INTO @a 
SELECT i FROM x
OPTION (MAXRECURSION 0);


;WITH cte AS (
SELECT *, DATEADD(DAY, i-1, StartDate) StatusAtDay
FROM InputTable t
cross apply (
    select SUBSTRING(t.Availability, i, 1)  as c, i
    from @a
    WHERE LEN(Availability) >= i
    ) ca
)
SELECT *
FROM cte
order by 1

I tried this with about 5,000 rows where the length of Availability was over 1,250, and it took 19 seconds (writing the output to a temp table).
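To get from per-day rows like these to start/end ranges, one option is the well-known "gaps and islands" grouping: within each ID, the position i minus a ROW_NUMBER() partitioned by ID and flag stays constant across a run of identical consecutive flags. The following is only a sketch under assumptions, not a measured solution: the table variable @InputTable is a hypothetical stand-in for InputTable with two of the question's sample rows, StartDate is assumed to be of type date, strings are assumed to be at most 1000 characters, and the numbers table @n is filled from sys.all_objects purely as a convenient row source.

```sql
-- Hypothetical sample data standing in for InputTable.
DECLARE @InputTable TABLE (ID varchar(10), StartDate date, Availability varchar(1000));
INSERT INTO @InputTable (ID, StartDate, Availability)
VALUES ('H1', '20130728', 'YYYYYYNNNNQQQQQ'),
       ('H2', '20130728', 'NNNNYYYYYYY');

-- Numbers 1..1000 (assumed upper bound on the string length).
DECLARE @n TABLE (i int PRIMARY KEY);
INSERT INTO @n (i)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

WITH days AS (
    -- One row per character: the flag and its 1-based position.
    SELECT t.ID, t.StartDate, n.i,
           SUBSTRING(t.Availability, n.i, 1) AS Flag
    FROM @InputTable AS t
    JOIN @n AS n ON n.i <= LEN(t.Availability)
),
runs AS (
    -- i minus the per-flag row number is constant within one run.
    SELECT ID, StartDate, i, Flag,
           i - ROW_NUMBER() OVER (PARTITION BY ID, Flag ORDER BY i) AS grp
    FROM days
)
SELECT ID, Flag AS Available,
       DATEADD(DAY, MIN(i) - 1, StartDate) AS StartDate,
       DATEADD(DAY, MAX(i) - 1, StartDate) AS EndDate
FROM runs
GROUP BY ID, StartDate, Flag, grp
ORDER BY ID, MIN(i);
```

For H1 this returns the three ranges Y 2013-07-28 to 2013-08-02, N 2013-08-03 to 2013-08-06, and Q 2013-08-07 to 2013-08-11. Whether it is fast enough on the full table would have to be measured.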

Answer 4:

I have tried a different approach. Instead of using the SQLXMLBulkLoad library with the initial XML file, I thought I could do the transformation with LINQ and then bulk load the output into the database.

So my initial xml was something like:

<vacancies>
  <vacancy>
    <code>AT1010.200.1</code>
    <startday>2010-07-01</startday>
    <availability>YYYYYYNNNNQQQQQ</availability>
    <changeover>CCIIOOX</changeover>
    <minstay>GGGGGGGG</minstay>
    <flexbooking>YYYYY</flexbooking>
  </vacancy>
  <vacancy>
    <code>AT1010.200.2</code>
    <startday>2010-07-01</startday>
    <availability>NNNNYYYYYYY</availability>
    <changeover>CCIIOOX</changeover>
    <minstay>GGGGGGGG</minstay>
    <flexbooking>YYYYY</flexbooking>
  </vacancy>
  <vacancy>
    <code>AT1010.200.3</code>
    <startday>2010-07-01</startday>
    <availability>NNQQQQNNNNNNNNYYYYYY</availability>
    <changeover>CCIIOOX</changeover>
    <minstay>GGGGGGGG</minstay>
    <flexbooking>YYYYY</flexbooking>
  </vacancy>
  <vacancy>
    <code>AT1010.200.4</code>
    <startday>2010-07-01</startday>
    <availability>NNNNNYYYYYYNNNNNN</availability>
    <changeover>CCIIOOX</changeover>
    <minstay>GGGGGGGG</minstay>
    <flexbooking>YYYYY</flexbooking>
  </vacancy>
</vacancies>

The task here would be to create a new xml that would have start and end dates for each group of availability flags.

XElement xe = XElement.Load(file);

int i = 0;
char previousFlag = ' ';
int GroupIndex = 0;

XElement vacancies =
new XElement
(
    "vacancies",
    xe.Elements("vacancy")
    .Select
    (
        x =>
        {
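            // Reset the per-vacancy state, including previousFlag, so the
            // first flag of every vacancy always starts group 1.
            i = 0;
            GroupIndex = 0;
            previousFlag = ' ';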
            return new
            {
                availabilities = x.Element("availability")
                .Value
                .Select
                (
                    v =>
                    {
                        if (previousFlag != v)
                        {
                            GroupIndex++;
                        }
                        previousFlag = v;
                        return new
                        {
                            Code = x.Element("code").Value,
                            startday = x.Element("startday").Value,
                            Date = DateTime.Parse(x.Element("startday").Value).AddDays(i++),
                            GIndex = GroupIndex
                        };
                    }
                )
            };
        }
    )
    .SelectMany
    (
        x =>
        x.availabilities
    )
    .GroupBy
    (
        g =>
        new
        {
            Code = g.Code,
            startday = g.startday,
            GroupIndex = g.GIndex
        }
    )
    .Select
    (
        x =>
        new XElement
        (
            "vacancy",
            new XElement("code", x.Key.Code),
            new XElement("startday", x.Key.startday),
            new XElement("GroupIndex", x.Key.GroupIndex),
            new XElement("minDate", x.Min(z => z.Date)),
            new XElement("maxDate", x.Max(z => z.Date))
        )
    )
);
vacancies.Save(outputfile);

Opening the output file, I get the following XML:

<vacancies>
  <vacancy>
    <code>AT1010.200.1</code>
    <startday>2010-07-01</startday>
    <GroupIndex>1</GroupIndex>
    <minDate>2010-07-01T00:00:00</minDate>
    <maxDate>2010-07-06T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.1</code>
    <startday>2010-07-01</startday>
    <GroupIndex>2</GroupIndex>
    <minDate>2010-07-07T00:00:00</minDate>
    <maxDate>2010-07-10T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.1</code>
    <startday>2010-07-01</startday>
    <GroupIndex>3</GroupIndex>
    <minDate>2010-07-11T00:00:00</minDate>
    <maxDate>2010-07-15T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.2</code>
    <startday>2010-07-01</startday>
    <GroupIndex>1</GroupIndex>
    <minDate>2010-07-01T00:00:00</minDate>
    <maxDate>2010-07-04T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.2</code>
    <startday>2010-07-01</startday>
    <GroupIndex>2</GroupIndex>
    <minDate>2010-07-05T00:00:00</minDate>
    <maxDate>2010-07-11T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.3</code>
    <startday>2010-07-01</startday>
    <GroupIndex>1</GroupIndex>
    <minDate>2010-07-01T00:00:00</minDate>
    <maxDate>2010-07-02T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.3</code>
    <startday>2010-07-01</startday>
    <GroupIndex>2</GroupIndex>
    <minDate>2010-07-03T00:00:00</minDate>
    <maxDate>2010-07-06T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.3</code>
    <startday>2010-07-01</startday>
    <GroupIndex>3</GroupIndex>
    <minDate>2010-07-07T00:00:00</minDate>
    <maxDate>2010-07-14T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.3</code>
    <startday>2010-07-01</startday>
    <GroupIndex>4</GroupIndex>
    <minDate>2010-07-15T00:00:00</minDate>
    <maxDate>2010-07-20T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.4</code>
    <startday>2010-07-01</startday>
    <GroupIndex>1</GroupIndex>
    <minDate>2010-07-01T00:00:00</minDate>
    <maxDate>2010-07-05T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.4</code>
    <startday>2010-07-01</startday>
    <GroupIndex>2</GroupIndex>
    <minDate>2010-07-06T00:00:00</minDate>
    <maxDate>2010-07-11T00:00:00</maxDate>
  </vacancy>
  <vacancy>
    <code>AT1010.200.4</code>
    <startday>2010-07-01</startday>
    <GroupIndex>3</GroupIndex>
    <minDate>2010-07-12T00:00:00</minDate>
    <maxDate>2010-07-17T00:00:00</maxDate>
  </vacancy>
</vacancies>

This is flat and ready to be processed by the SQLXMLBulkLoad tool with no further processing needed.

My initial XML was 60MB and it was transformed into a 45MB file in under one minute. Although I have not yet tested SQLXMLBulkLoad on the new file, I expect it to be very fast, given its performance with the initial file.

I will still try all your solutions, as they are certainly worth a try, and I will accept the best of them.

Thank you all for the effort.

Source: https://stackoverflow.com/questions/17943169/group-characters-of-varchar-field

Tags

sql

sql-server

sql-server-2008-r2

sql-server-express