SSIS Derived Column - Parse Text between break returns

遥遥无期 2021-01-23 23:06

I have a text field from a SQL Server Source. It is a phone number field that typically has this format:

Home: 555-555-1212
Work: 555-555-1212
Cell: 555-555-1212         


        
1 answer
  • 2021-01-23 23:42

    I look at your data and I see

    Home:|555-555-1212|Work:|555-555-1212|Cell:|555-555-1212|Emergency:|555-555-1212

    I'm using the pipe character, |, as a placeholder for where I would segment that string, which is basically wherever you have whitespace (space, tab, newline, etc).

    There are two approaches to this. I'll start with the easy one.

    Script Component

    String.Split is your friend here: called with no arguments, it breaks that source data apart on whitespace.

    I added a new Script Component, acting as a Transformation, and created 4 output columns, all strings of length 12 with code page 1252: Home, Work, Cell, and Emergency. I populate them like so:

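    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Split() with no arguments splits on every whitespace character.
        // Assuming the lines end in CR+LF, the empty string between \r and \n
        // takes a slot of its own, so the numbers land at 1, 4, 7 and 10.
        string[] split = Row.PhoneData.Split();
    
        Row.Home = split[1];
        Row.Work = split[4];
        Row.Cell = split[7];
        Row.Emergency = split[10];
    }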
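
To see why the numbers sit at indexes 1, 4, 7 and 10, here is a minimal console sketch. The sample string and its CR+LF line breaks are assumptions (a column that uses bare line feeds would shift the indexes), and SplitDemo is a made-up name for illustration only.

```csharp
using System;

public static class SplitDemo
{
    // Same call as in the Script Component: no arguments means
    // "split on any whitespace character", and empty entries are kept.
    public static string[] Tokens(string phoneData)
    {
        return phoneData.Split();
    }

    public static void Main()
    {
        // Assumed sample: four labelled numbers separated by CR+LF.
        string sample = "Home: 555-555-1212\r\nWork: 555-555-1212\r\n" +
                        "Cell: 555-555-1212\r\nEmergency: 555-555-1212";

        string[] tokens = Tokens(sample);
        for (int i = 0; i < tokens.Length; i++)
        {
            // Brackets make the empty entries between \r and \n visible.
            Console.WriteLine(i + ": [" + tokens[i] + "]");
        }
    }
}
```

If the line-break style is not guaranteed, calling Split((char[])null, StringSplitOptions.RemoveEmptyEntries) drops the empty entries, which should put the numbers at 1, 3, 5 and 7 whichever line break is used.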
    

    Derived Column

    I'm not going to build out a full-blown implementation of this. The Script Component above is much simpler, but I run into situations where ETL devs say they aren't allowed to use Script tasks/components, and that's usually because people reached for them first instead of last.

    The approach here is to have lots of Derived Column components in your Data Flow. It won't hurt performance and can in fact make things easier. It will definitely make your debugging easier, as you'll have lots of it to do.

    DER Find Colons

    This would add 4 columns to the data flow: HomeColonPosition, WorkColonPosition, etc. You've already started down this path, but build it out into the actual data flow: you'll need to reference these positions, and it's easier to fix the one calculation that populates a column than a calculation that's wrong and repeated everywhere. Because the third argument to FINDSTRING is the occurrence to look for rather than a starting position, each column can ask for its colon directly instead of chaining off the previous colon's position.

    Thus, instead of Work being

    FINDSTRING(PhoneData, ":", FINDSTRING(PhoneData, ":", 1) + 1)
    

    which asks for an occurrence number equal to the first colon's position plus one, it would just be

    FINDSTRING(PhoneData, ":", 2)
    

    Just knowing the position of the 4 colons in that string, I can figure out where the phone numbers are (maybe). The position of the colon + 2 (the colon and the space) is the starting point, and the number runs for 12 characters from there, e.g. SUBSTRING(PhoneData, HomeColonPosition + 2, 12).
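
The offset arithmetic can be checked outside SSIS. The sketch below is C#, not SSIS expression syntax: FindString is a hypothetical helper that mimics FINDSTRING's 1-based result and its occurrence argument (returning 0 for "not found" is this sketch's assumption), NumberAfter mirrors the colon + 2, length 12 rule, and the sample value is invented.

```csharp
using System;

public static class ColonOffsets
{
    // Hypothetical stand-in for SSIS FINDSTRING: the 1-based position of the
    // requested occurrence of search in text, or 0 when there is no such occurrence.
    public static int FindString(string text, string search, int occurrence)
    {
        int index = -1;
        for (int i = 0; i < occurrence; i++)
        {
            index = text.IndexOf(search, index + 1, StringComparison.Ordinal);
            if (index < 0) return 0;
        }
        return index + 1;
    }

    // Mirrors "colon position + 2, then 12 characters" with 1-based positions;
    // the "- 1" converts to the 0-based index that Substring expects.
    public static string NumberAfter(string text, int colonPosition)
    {
        return text.Substring(colonPosition + 2 - 1, 12);
    }

    public static void Main()
    {
        string sample = "Home: 555-555-1212\r\nWork: 555-555-0000"; // assumed data
        int homeColon = FindString(sample, ":", 1);
        int workColon = FindString(sample, ":", 2);

        Console.WriteLine(NumberAfter(sample, homeColon));
        Console.WriteLine(NumberAfter(sample, workColon));
    }
}
```

Substring throws when fewer than 12 characters follow the colon, which is a small preview of what inconsistent data does to a fixed-offset approach.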

    Where this approach gets ugly, much as with the script approach, is when the data isn't consistent.
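
One way to soften that in a Script Component is to key on the labels rather than on fixed positions. This is a sketch under assumptions rather than a tested solution: it presumes each entry sits on its own line as "Label: number", and PhoneParser and Parse are illustrative names.

```csharp
using System;
using System.Collections.Generic;

public static class PhoneParser
{
    // Returns label -> number for every "Label: number" line found.
    // Missing, reordered or blank lines are tolerated; lines without a colon are skipped.
    public static Dictionary<string, string> Parse(string phoneData)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrEmpty(phoneData)) return result;

        // Split on either line-break character and drop the empty pieces.
        string[] lines = phoneData.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon < 0) continue;

            string label = line.Substring(0, colon).Trim();
            string number = line.Substring(colon + 1).Trim();
            if (label.Length > 0) result[label] = number;
        }
        return result;
    }

    public static void Main()
    {
        // Assumed messy input: reordered, a blank line, no space after one colon.
        var phones = Parse("Work: 555-555-1212\n\nHome:555-555-3434  ");

        string home;
        Console.WriteLine(phones.TryGetValue("Home", out home) ? home : "(none)");
        Console.WriteLine(phones.ContainsKey("Cell") ? phones["Cell"] : "(none)");
    }
}
```

Inside Input0_ProcessInputRow, each output column could then be filled from the dictionary, with the column's _IsNull property set to true when its label is absent.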
