How can I load a large flat file into a database table using SSIS?

佛祖请我去吃肉 2020-11-27 07:15

I'm not sure how it works, so I'm looking for the right solution. I think SSIS is the right way to go, but I have never used it before.

Scenario:

Every mornin

5 Answers
  • 2020-11-27 07:36

    SSIS sounds like the way to go. The way I've seen your type of issue handled previously is with a staging table. The new file loads into the staging table; then staging and production are compared: obsolete records are archived (not just deleted) from production, existing rows with changes are updated (again, with the original data archived somewhere), and new rows are inserted.

    Note: your definition of "obsolete" needs to be very, very precise. For instance, should something be archived away just because a matching row does not exist in your most recent file? Should it stay for X amount of time in case it shows up in a subsequent file? These and other questions should be considered.

    Almost any standard SSIS tutorial should point you down the correct path for how to do each of these steps.
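
    As a rough T-SQL sketch of that staging-table pattern (all table and column names here are made up for illustration, since the question's schema isn't shown):

    -- Hypothetical tables: dbo.Production, dbo.Staging, dbo.Production_Archive
    -- 1) Archive rows that are about to disappear from production.
    INSERT INTO dbo.Production_Archive (ItemID, ItemName, ItemType, ArchivedAt)
    SELECT p.ItemID, p.ItemName, p.ItemType, GETDATE()
    FROM dbo.Production AS p
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Staging AS s WHERE s.ItemID = p.ItemID);

    -- 2) Remove the now-archived rows, update changed rows, insert new rows.
    DELETE p
    FROM dbo.Production AS p
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Staging AS s WHERE s.ItemID = p.ItemID);

    UPDATE p
    SET p.ItemName = s.ItemName,
        p.ItemType = s.ItemType
    FROM dbo.Production AS p
    JOIN dbo.Staging AS s ON s.ItemID = p.ItemID
    WHERE p.ItemName <> s.ItemName OR p.ItemType <> s.ItemType;

    INSERT INTO dbo.Production (ItemID, ItemName, ItemType)
    SELECT s.ItemID, s.ItemName, s.ItemType
    FROM dbo.Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Production AS p WHERE p.ItemID = s.ItemID);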

  • 2020-11-27 07:37

    I would give MERGE a shot. Make sure both tables end up with an index on ItemID (a minimal sketch follows the MERGE below).

    -- Upsert from the staging table into the target table,
    -- then remove target rows that no longer appear in the staged file.
    MERGE [dbo].[ItemInfo] AS target
    USING
    (
        SELECT stg.ItemID, stg.ItemName, stg.ItemType
        FROM [dbo].[ItemInfo_Staging] stg
    ) AS SOURCE
    ON SOURCE.ItemID = target.ItemID

    WHEN MATCHED THEN
        UPDATE SET
            target.ItemName = SOURCE.ItemName
            , target.ItemType = SOURCE.ItemType

    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ItemID, ItemName, ItemType)
            VALUES (SOURCE.ItemID, SOURCE.ItemName, SOURCE.ItemType)

    WHEN NOT MATCHED BY SOURCE THEN
        DELETE
    ;
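
    For the index advice, a minimal sketch (index names are hypothetical; the staging table name is taken from the MERGE above):

    -- Hypothetical supporting indexes for the MERGE join key;
    -- adjust names and uniqueness to your actual schema.
    CREATE UNIQUE INDEX IX_ItemInfo_ItemID
        ON [dbo].[ItemInfo] (ItemID);

    CREATE INDEX IX_ItemInfo_Staging_ItemID
        ON [dbo].[ItemInfo_Staging] (ItemID);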
    
  • 2020-11-27 07:46

    Yes, SSIS can meet the requirements you have specified in the question. The following example should give you an idea of how it can be done. The example uses SQL Server as the back-end. Some basic test scenarios performed on the package are provided below. Sorry for the lengthy answer.

    Step-by-step process:

    1. In the SQL Server database, create two tables, namely dbo.ItemInfo and dbo.Staging. The create table queries are available under the Scripts section, and the structure of these tables is shown in screenshot #1. ItemInfo will hold the actual data, and the Staging table will hold the staging data used to compare against and update the actual records. The Id column in both tables is an auto-generated unique identity column. The IsProcessed column in ItemInfo will be used to identify and delete records that are no longer valid.

    2. Create an SSIS package and create 5 variables as shown in screenshot #2. I have used the .txt extension for the tab-delimited files, hence the value *.txt in the variable FileExtension. The FilePath variable will be assigned a value at run-time. The FolderLocation variable denotes where the files are located. The SQLPostLoad and SQLPreLoad variables denote the stored procedures used during the pre-load and post-load operations. Scripts for these stored procedures are provided under the Scripts section.

    3. Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. The Columns section of the flat file connection contains the column-level information. Screenshot #5 shows a preview of the column data.

    4. Configure the control flow as shown in screenshot #6. Configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load will truncate the staging table and set the IsProcessed flag to false for all rows in the ItemInfo table. Post Load will apply the updates and delete rows from the database that are not found in the file. Refer to the stored procedures used in those tasks to understand what these Execute SQL tasks do.

    5. Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is a derived column transformation and its configuration is shown in screenshot #12. Check Exist is a lookup transformation and its configuration is shown in screenshots #13 - #15 (a possible form of its lookup query is sketched after this list). The Lookup No Match Output is redirected to Destination Split, and the Lookup Match Output is redirected to Staging Split. Destination Split and Staging Split have the exact same configuration, shown in screenshot #16. The reason for 9 different destinations for both the destination and staging tables is to improve the performance of the package.

    6. All the destination tasks 0 - 8 are configured to insert data into table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.

    7. On the Flat File connection manager, set the ConnectionString property to use the variable FilePath as shown in screenshot #19. This will enable the package to use the value set in the variable as it loops through each file in a folder.
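
    The exact query behind the Check Exist lookup in step 5 is only visible in the screenshots, so treat the following as an assumption rather than the original configuration: such a lookup would typically cache just the key column from the destination table, along the lines of:

    -- Assumed query for the "Check Exist" lookup transformation:
    -- cache only the column needed to decide match vs. no-match.
    SELECT ItemId
    FROM   dbo.ItemInfo;

    Rows with no match (new items) then flow to the ItemInfo destinations, while matched rows flow to the Staging destinations for the post-load update.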

    Test scenarios:

    Test results may vary from machine to machine. In this scenario the file was located locally on the machine; files on a network share might load more slowly. This is provided just to give you an idea, so please take these results with a grain of salt.
    
    1. The package was executed on a 64-bit machine with a single-core Xeon CPU at 2.5 GHz and 3.00 GB RAM.

    2. Loaded a flat file with 1 million rows. The package executed in about 2 minutes 47 seconds. Refer to screenshots #20 and #21.

    3. Used the queries provided under the Test queries section to modify the data and simulate updates, deletes and creation of new records for the second run of the package.

    4. Loaded the same file containing the 1 million rows after those queries were executed against the database. The package executed in about 1 minute 35 seconds. Refer to screenshots #22 and #23. Please note the number of rows redirected to the destination and staging tables in screenshot #22.

    Hope that helps.

    Test queries:

    --These records will be deleted during the next run
    --because the item IDs will no longer match the file data.
    --(111111 row(s) affected)
    UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)

    --These records will be updated back to their original item type of 'General'
    --during the next run, because that is the value present in the file.
    --(222222 row(s) affected)
    UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)

    --These records will be reloaded into the table from the file.
    --(111111 row(s) affected)
    DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)
    

    Flat File Connection Columns:

    Name        InputColumnWidth     DataType          OutputColumnWidth
    ----------  ----------------     ---------------   -----------------
    Id          8                    string [DT_STR]   8
    ItemId      11                   string [DT_STR]   11
    ItemName    21                   string [DT_STR]   21
    ItemType    9                    string [DT_STR]   9
    

    Scripts (to create both tables and stored procedures):

    CREATE TABLE [dbo].[ItemInfo](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [ItemId] [varchar](255) NOT NULL,
        [ItemName] [varchar](255) NOT NULL,
        [ItemType] [varchar](255) NOT NULL,
        [IsProcessed] [bit] NULL,
        CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
        CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
    GO
    
    CREATE TABLE [dbo].[Staging](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [ItemId] [varchar](255) NOT NULL,
        [ItemName] [varchar](255) NOT NULL,
        [ItemType] [varchar](255) NOT NULL,
     CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
    GO
    
    CREATE PROCEDURE [dbo].[PostLoad]
    AS
    BEGIN
        SET NOCOUNT ON;
    
        UPDATE      ITM
        SET         ITM.ItemName    = STG.ItemName
                ,   ITM.ItemType    = STG.ItemType 
                ,   ITM.IsProcessed = 1
        FROM        dbo.ItemInfo    ITM
        INNER JOIN  dbo.Staging     STG
        ON          ITM.ItemId      = STG.ItemId;
    
        DELETE FROM dbo.ItemInfo
        WHERE       IsProcessed = 0;
    END
    GO
    
    CREATE PROCEDURE [dbo].[PreLoad]
    AS
    BEGIN
        SET NOCOUNT ON;
    
        TRUNCATE TABLE dbo.Staging;     
    
        UPDATE  dbo.ItemInfo 
        SET     IsProcessed = 0;
    END
    GO
    

    (Screenshots #1 - #23, referenced in the steps above, are not reproduced here.)

  • 2020-11-27 07:48

    Assuming you are using SQL Agent (or a similar scheduler):

    Reqs 1/4) I would have a precursor step handle the FTP and/or file copy steps. I don't like to clutter my packages with file manipulation if I can avoid it.

    Reqs 2/3) At the control flow level, the package design is going to look like an Execute SQL Task connected to a Data Flow Task connected to another Execute SQL Task. As @AllenG indicated, you'd be best served by loading into a staging table via the Data Flow Task. The first Execute SQL Task will purge any rows from the staging table (TRUNCATE TABLE dbo.DAILY_STAGE).

    Approximate table design looks like this. The MICHAEL_BORN table is your existing table and the DAILY_STAGE is where your data flow will land.

    CREATE TABLE DBO.MICHAEL_BORN
    (
        ID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED
    ,   ItemID int NOT NULL
    ,   ItemName varchar(20) NOT NULL
    ,   ItemType varchar(20) NOT NULL
    )
    CREATE TABLE dbo.DAILY_STAGE
    (
        ItemID int NOT NULL PRIMARY KEY CLUSTERED
    ,   ItemName varchar(20) NOT NULL
    ,   ItemType varchar(20) NOT NULL
    )
    

    For demonstration purposes, I will load the above tables with sample data via T-SQL:

    -- Original data
    INSERT INTO
        dbo.MICHAEL_BORN
    VALUES
        (2345,'Apple','Fruit')
    ,   (4578, 'Bannana','Fruit')
    
    
    -- Daily load runs
    -- Adds a new fruit (pear), corrects misspelling of banana, eliminates apple
    INSERT INTO
        dbo.DAILY_STAGE
    VALUES
        (7721,'Pear','Fruit')
    ,   (4578, 'Banana','Fruit')
    

    The Execute SQL task will take advantage of the MERGE statement available in 2008+ editions of SQL Server. Please note the trailing semi-colon is part of the MERGE statement. Failure to include it will result in an error of "A MERGE statement must be terminated by a semi-colon (;)."

    -- MERGE statement
    -- http://technet.microsoft.com/en-us/library/bb510625.aspx
    -- Given the above scenario, this script will
    -- 1)  Update the matched (4578 bannana/banana) row
    -- 2)  Add the new (pear) row
    -- 3)  Remove the unmatched (apple) row
    
    MERGE
        dbo.[MICHAEL_BORN] AS T
    USING
    (
        SELECT
            ItemID
        ,   ItemName
        ,   ItemType
        FROM
            dbo.DAILY_STAGE
    
    ) AS S
    ON T.ItemID = S.ItemID
    WHEN
        MATCHED THEN
        UPDATE
        SET
            T.ItemName = S.ItemName
        ,   T.ItemType = S.ItemType
    WHEN
        NOT MATCHED THEN
        INSERT
        (
            ItemID
        ,   ItemName
        ,   ItemType
        )
        VALUES
        (
            ItemID
        ,   ItemName
        ,   ItemType
        )
    WHEN
        NOT MATCHED BY SOURCE THEN
        DELETE
        ;
    

    Req 5) Efficiency depends entirely on your data and how wide your rows are, but it shouldn't be terrible.

    -- Performance testing
    -- Assumes you have a similar fast row number generator function
    -- http://billfellows.blogspot.com/2009/11/fast-number-generator.html
    
    TRUNCATE TABLE dbo.MICHAEL_BORN
    TRUNCATE TABLE dbo.DAILY_STAGE
    
    -- load initial rows
    -- 20ish seconds
    INSERT INTO
        dbo.MICHAEL_BORN
    SELECT
        N.number AS ItemID
    ,   'Spam & eggs ' + CAST(N.number AS varchar(10)) AS ItemName
    ,   'SPAM' AS ItemType
    --, CASE N.number % 2 WHEN 0 THEN N.number + 1000000 ELSE N.number END AS UpTheEvens
    FROM
        dbo.GenerateNumbers(1000000) N
    
    
    -- Load staging table
    -- Odds get item type switched out
    -- Evens get delete and new ones created
    -- 20ish seconds
    INSERT INTO
        dbo.DAILY_STAGE
    SELECT
        CASE N.number % 2 WHEN 0 THEN N.number + 1000000 ELSE N.number END AS ItemID
    ,   'Spam & eggs ' + CAST(N.number AS varchar(10)) AS ItemName
    ,   CASE N.number % 2 WHEN 0 THEN 'SPAM' ELSE 'Not much spam' END AS ItemType
    FROM
        dbo.GenerateNumbers(1000000) N
    
    
    -- Run MERGE statement, 32 seconds 1.5M rows upserted
    -- Probably fast enough for you
    
  • 2020-11-27 07:48

    I just want to offer my ideas for the next person who comes across this question, so here is a suggestion for each scenario:

    1. Get the file from FTP or a local folder. I would suggest using Dropbox, Google Drive, or any other file-syncing cloud service of your choice (see this link for details).

    2. Load all the flat file data into a staging table, as you suggested. Comparing the data is then easily done with a MERGE between your staging table and the target table on your unique column (ID); see this link for how to use a MERGE script. The 2nd and 3rd scenarios are solved if you use a MERGE script.

    For the last two scenarios, I suggest you use a SQL Agent job to run the package automatically, scheduled for off hours or a time when the server is not busy. Search for "Run a Package Using a SQL Server Agent Job" in your favorite search engine and you will find plenty of blog posts that show how it is done.
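
    For the scheduling part, here is a minimal T-SQL sketch of such a SQL Agent job (the job name, schedule, and package path are all hypothetical; the Agent's graphical wizard achieves the same thing):

    USE msdb;
    GO

    -- Hypothetical job that runs the SSIS package from a file every day at 03:00.
    EXEC dbo.sp_add_job
        @job_name = N'Daily flat file load';

    EXEC dbo.sp_add_jobstep
        @job_name  = N'Daily flat file load',
        @step_name = N'Run SSIS package',
        @subsystem = N'SSIS',
        @command   = N'/FILE "C:\SSIS\LoadItems.dtsx" /REPORTING E';

    EXEC dbo.sp_add_schedule
        @schedule_name = N'Daily 3 AM',
        @freq_type     = 4,          -- daily
        @freq_interval = 1,
        @active_start_time = 30000;  -- 03:00:00 (HHMMSS)

    EXEC dbo.sp_attach_schedule
        @job_name      = N'Daily flat file load',
        @schedule_name = N'Daily 3 AM';

    EXEC dbo.sp_add_jobserver
        @job_name = N'Daily flat file load';
    GO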
