(SQL) Identify positions of multiple occurrences of a string format within a field

问题

I need to split a narrative field (free text) into multiple rows. Format is currently along the lines of:

Case_Reference | Narrative
```````````````|`````````````````````````````````````
XXXX/XX-123456 | [Endless_Text up to ~50k characters]

Within the narrative field as text, individual entries (when various agents have done something to the case) begin with the entry date followed by two spaces (i.e. 'dd/mm/yyyy '), with the values of the dates changing with each entry within that same field.

In other words, after trawling for a better delimiter, the only one I can use is this format of string, so I need to identify multiple positions within the Narrative text where the format (would mask be a better word?) matches 'dd/mm/yyyy '.

I can identify multiple occurrences of a consistent string no problem, but it's identifying it where I'm essentially looking for:

'%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %'

PATINDEX of course returns the first occurrence/position of this, but so far as I'm aware, there's no way to "modify" this (i.e. a created function) to allow for picking up the rest of the occurrences/positions of this they way we can with CHARINDEX (since PATINDEX doesn't have a starting position parameter).

For clarity, I'm not looking for code to delimit this directly as I need to further manipulate each entry, so it's purely the positions of multiple occurrences of the string within the Narrative text I'm looking for.

Any help would be very much appreciated.

For clarity, there's no option to do this pre-import, so it needs to be done on this landed data.

Desired output would be

 Case_Reference1 | 1st_Position_of_Delimiter_String  
 Case_Reference1 | 2nd_Position_of_Delimiter_String  
 Case_Reference2 | 1st_Position_of_Delimiter_String  
 Case_Reference2 | 2nd_Position_of_Delimiter_String  
 Case_Reference2 | 3rd_Position_of_Delimiter_String

回答1:

You might solve this with an recursive CTE

DECLARE @tbl TABLE (Case_Reference NVARCHAR(MAX),Narrative NVARCHAR(MAX));
INSERT INTO @tbl VALUES
 (N'C1',N'01/02/2000  Some text with     blanks 02/03/2000  More text 03/04/2000  An even more')
,(N'C2',N'01/02/2000  Test for C2 02/03/2000  One more for C2 03/04/2000  An even more 04/05/2000  Blah')
,(N'C3',N'01/02/2000  Test for C3 02/03/2000  One more for C3 03/04/2000  An even more')
 ;

WITH recCTE AS
(
    SELECT 1 AS Step,Case_Reference,Narrative,CAST(1 AS BIGINT) AS StartsAt,NewPos.EndsAt+10 AS EndsAt,LEN(Narrative) AS MaxLen
          ,SUBSTRING(Narrative,NewPos.EndsAt+10+1,999999) AS RestString
    FROM @tbl AS tbl
    CROSS APPLY(SELECT PATINDEX('%[0-3][0-9]/[0-1][0-9]/[1-2][0-9][0-9][0-9]  %',SUBSTRING(Narrative,12,9999999))) AS NewPos(EndsAt)

    UNION ALL

    SELECT r.Step+1,r.Case_Reference,r.Narrative,r.EndsAt+1,CASE WHEN NewPos.EndsAt>0 THEN r.EndsAt+NewPos.EndsAt+10 ELSE r.MaxLen END,r.MaxLen
          ,SUBSTRING(r.RestString,NewPos.EndsAt+10+1,999999) 
    FROM recCTE AS r
    CROSS APPLY(SELECT PATINDEX('%[0-3][0-9]/[0-1][0-9]/[1-2][0-9][0-9][0-9]  %',SUBSTRING(r.RestString,12,99999999))) AS NewPos(EndsAt)
    WHERE r.EndsAt<r.MaxLen
)
SELECT Step,Case_Reference,StartsAt,EndsAt
      ,SUBSTRING(Narrative,StartsAt,EndsAt-StartsAt+1) AS OutputString 
FROM recCTE

ORDER BY Case_Reference,Step

The result

+------+----------------+----------+--------+---------------------------------------+
| Step | Case_Reference | StartsAt | EndsAt | OutputString                          |
+------+----------------+----------+--------+---------------------------------------+
| 1    | C1             | 1        | 38     | 01/02/2000  Some text with     blanks |
+------+----------------+----------+--------+---------------------------------------+
| 2    | C1             | 39       | 60     | 02/03/2000  More text                 |
+------+----------------+----------+--------+---------------------------------------+
| 3    | C1             | 61       | 84     | 03/04/2000  An even more              |
+------+----------------+----------+--------+---------------------------------------+
| 1    | C2             | 1        | 24     | 01/02/2000  Test for C2               |
+------+----------------+----------+--------+---------------------------------------+
| 2    | C2             | 25       | 52     | 02/03/2000  One more for C2           |
+------+----------------+----------+--------+---------------------------------------+
| 3    | C2             | 53       | 77     | 03/04/2000  An even more              |
+------+----------------+----------+--------+---------------------------------------+
| 4    | C2             | 78       | 93     | 04/05/2000  Blah                      |
+------+----------------+----------+--------+---------------------------------------+
| 1    | C3             | 1        | 24     | 01/02/2000  Test for C3               |
+------+----------------+----------+--------+---------------------------------------+
| 2    | C3             | 25       | 52     | 02/03/2000  One more for C3           |
+------+----------------+----------+--------+---------------------------------------+
| 3    | C3             | 53       | 76     | 03/04/2000  An even more              |
+------+----------------+----------+--------+---------------------------------------+

回答2:

Try this recursive cte

declare @t table 
(
    caseref varchar(20),
    narrative varchar(max)
)

insert into @t values('Case_Reference1', 'blah 10/11/2016  something 13/11/2016  something else');
insert into @t values('Case_Reference2', '11/11/2016  something 12/11/2016  something else 14/11/2016  something yet still');
insert into @t values('Case_Reference3', 'should find nothing');

with cte (caseref, pos, remainingstr) as 
(
    select caseref, 
        patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative),
        substring(narrative, patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative) + 12, len(narrative) - 12 - patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative))
    from @t
    where patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', narrative) > 0

    union all

    select caseref,
        pos + 12 + patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr),
        substring(remainingstr, patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr) + 12, len(remainingstr) - 12 - patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr))
    from cte
    where patindex('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]  %', remainingstr) > 0

)
select caseref, pos
from cte
order by caseref, pos

来源：https://stackoverflow.com/questions/40738970/sql-identify-positions-of-multiple-occurrences-of-a-string-format-within-a-fie

标签

sql

sql-server

sql-server-2012