How to load a mixed record type fixed-width file with two headers into two separate files

北恋 2021-01-18 23:12

I got a task to load a strangely formatted text file. The file contains unwanted data too. It contains two headers back to back and data for each header is specified on alte

1 answer
  • 2021-01-19 00:02

    Ignore first 3 rows

    To ignore the first 3 rows, configure the Flat File Connection Manager to skip them by setting "Header rows to skip" to 3.


    Split file and remove bad rows

    1. Configure connection managers

    In the source Flat File Connection Manager, go to the Advanced tab, delete all columns except one, then set its data type to DT_STR and its MaxLength to 4000.

    Add two more connection managers, one for each destination file. Each must define only one column with a max length of 4000.

    2. Configure Data flow task

    Add a Data Flow Task and place a Flat File Source inside it. Select the source file connection manager.

    Add a conditional split with the following expressions:

    File1

    FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19
    

    File2

    FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10
    

    The expressions above are based on the expected output you mentioned in the question. I tried to find keywords that are unique to each header, and I split the data rows based on the number of space-separated tokens in each row.

    Finally, map each output to a Flat File Destination component.
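
    To make the routing logic easier to test outside SSIS, the two Conditional Split expressions can be approximated in C#. This is only a sketch: the `RowRouter` class and its method names are hypothetical, and it assumes that FINDSTRING is case-sensitive, that TOKENCOUNT ignores empty tokens between consecutive spaces, and that rows matching neither expression go to the default output.

    ```csharp
    using System;

    // Hypothetical helper that mirrors the two Conditional Split expressions.
    public static class RowRouter
    {
        // Counts space-delimited tokens, skipping empty ones
        // (assumed to match TOKENCOUNT([Column 0], " ")).
        public static int TokenCount(string line) =>
            line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

        // FINDSTRING is 1-based and returns 0 when nothing is found, so
        // "FINDSTRING(...) > 1" corresponds to a 0-based IndexOf greater than 0.
        private static bool FoundAfterFirstChar(string line, string keyword) =>
            line.IndexOf(keyword, StringComparison.Ordinal) > 0;

        // Conditions are checked in order and the first match wins,
        // as in a Conditional Split.
        public static string Route(string line)
        {
            if (FoundAfterFirstChar(line, "OPENING") || FoundAfterFirstChar(line, "DATE")
                || TokenCount(line) == 19)
                return "File1";

            if (FoundAfterFirstChar(line, "A/C") || FoundAfterFirstChar(line, "FACTOR")
                || TokenCount(line) == 10)
                return "File2";

            return "Discard"; // default output: unwanted rows
        }
    }
    ```

    Note that with `> 1`, a keyword that starts at the very first character of the row is not matched; if the header rows can begin without leading spaces, `> 0` may be the safer comparison.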

    Experiments

    The execution result is shown in the following screenshots:


    Update 1 - Remove duplicates

    To remove duplicates, you can refer to the following link:

    • How to remove duplicate rows from flat file using SSIS?
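
    The linked answer is not reproduced here. As a generic illustration only (not necessarily the approach that answer takes), duplicate rows can be dropped while keeping the first occurrence and the original order by remembering the rows already seen. The `DuplicateRemover` class is a hypothetical name used for this sketch.

    ```csharp
    using System.Collections.Generic;

    // Hypothetical helper: keeps the first occurrence of each row.
    public static class DuplicateRemover
    {
        public static List<string> RemoveDuplicates(IEnumerable<string> lines)
        {
            var seen = new HashSet<string>();
            var result = new List<string>();
            foreach (var line in lines)
            {
                // HashSet.Add returns false when the row was already seen.
                if (seen.Add(line))
                {
                    result.Add(line);
                }
            }
            return result;
        }
    }
    ```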

    Update 2 - Remove only duplicate headers + replace spaces with tabs

    If you need only to remove duplicate headers then you can do this in two steps:

    1. Add a script component after each conditional split output to flag unwanted rows
    2. Add a conditional split to filter rows based on the script component output

    In addition, because the column values do not contain spaces, you can use a regular expression to replace each run of consecutive spaces with a single tab, which makes the file consistent.
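
    A small standalone sketch of the space-to-tab replacement, using the same pattern as the script in this answer (`[ ]{2,}`, which matches runs of two or more spaces). The `SpaceNormalizer` class name is hypothetical.

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper that converts space-padded columns to tab-delimited text.
    public static class SpaceNormalizer
    {
        // Matches two or more consecutive spaces.
        private static readonly Regex MultiSpace = new Regex("[ ]{2,}");

        public static string ToTabDelimited(string line) =>
            // Leading padding is removed first so the row does not start with a tab.
            MultiSpace.Replace(line.TrimStart(), "\t");
    }
    ```

    With this pattern a single space between two values is left as it is, and trailing padding of two or more spaces becomes a trailing tab. If columns can be separated by just one space, a pattern such as `[ ]+` combined with `Trim()` might fit better.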

    Script Component

    In the Script Component, select Column0 as an input column. Then add an output column of type DT_BOOL named outFlag, and another output column named outColumn0 of type DT_STR with a length of 4000.

    Then write the following script in the Script Editor (C#).

    First, make sure that you add the RegularExpressions namespace:

    using System.Text.RegularExpressions;
    

    Script Code

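    // Count how many times each header type has been seen so far.
    int SEOCount = 0;
    int NOMCount = 0;

    // Matches runs of two or more consecutive spaces.
    Regex regex = new Regex("[ ]{2,}", RegexOptions.None);

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        string line = Row.Column0.Trim();

        if (line.StartsWith("SEO"))
        {
            // Keep only the first SEO header row.
            if (SEOCount == 0)
            {
                SEOCount++;
                Row.outFlag = true;
            }
            else
            {
                Row.outFlag = false;
            }
        }
        else if (line.StartsWith("NOM"))
        {
            // Keep only the first NOM header row.
            if (NOMCount == 0)
            {
                NOMCount++;
                Row.outFlag = true;
            }
            else
            {
                Row.outFlag = false;
            }
        }
        else if (line.StartsWith("PAGE"))
        {
            // Page marker rows are always dropped.
            Row.outFlag = false;
        }
        else
        {
            // Every other row is a data row and is kept.
            Row.outFlag = true;
        }

        // Remove leading spaces and replace each run of spaces with a tab.
        Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), "\t");
    }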
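
    The flagging logic can also be tried outside SSIS. The sketch below is a hypothetical standalone variant (the `HeaderFlagger` class is not part of the package): it tracks the header prefixes already seen in a set instead of using one counter per header, and it uses ordinal string comparison, which is an assumption about the data.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical standalone version of the header-flagging logic.
    public class HeaderFlagger
    {
        private static readonly string[] HeaderPrefixes = { "SEO", "NOM" };
        private readonly HashSet<string> seenHeaders = new HashSet<string>();

        // Returns true when the row should be written to the destination.
        public bool ShouldKeep(string line)
        {
            string trimmed = line.Trim();

            // Page marker rows are always dropped.
            if (trimmed.StartsWith("PAGE", StringComparison.Ordinal))
                return false;

            foreach (string prefix in HeaderPrefixes)
            {
                if (trimmed.StartsWith(prefix, StringComparison.Ordinal))
                {
                    // Add returns true only the first time a prefix is seen,
                    // so repeated headers are flagged for removal.
                    return seenHeaders.Add(prefix);
                }
            }

            // Anything else is treated as a data row.
            return true;
        }
    }
    ```

    Adding another header type then only requires extending `HeaderPrefixes`.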
    

    Conditional Split

    Add a conditional split after each Script Component and use the following expression to filter out the duplicate headers:

    [outFlag] == True
    

    Then connect the conditional split to the destination. Make sure to map outColumn0 to the destination column.

    Package link

    • https://www.dropbox.com/s/d936u4xo3mkzns8/Package.dtsx?dl=0