Question
I am trying to import a text file so that the result is just words, each in a separate row of one column. For example, the text:
'Hello Mom,
we meet again'
should give 5 records:
'Hello'
'Mom,'
'we'
'meet'
'again'
I tried to accomplish this with BULK INSERT with ROWTERMINATOR = ' ', but there is a problem: the newline is not treated as a terminator too, and I get 'Mom,we' in one of the results.
From what I know, there is no way to add a second ROWTERMINATOR to BULK INSERT (true?).
What is the best way you know to achieve the result specified above? The file cannot be preprocessed outside of SQL Server, and the method should be useful for hundreds of files with thousands of lines of words, imported at different times, not just once.
Answer 1:
Given:
The file cannot be preprocessed outside of SQL Server
Option 1
Why not use OPENROWSET(BULK...)? This would allow you to import/insert (which takes care of the row terminator) while at the same time splitting (which takes care of the field terminator). Depending on whether or not you can create a Format File, it should look something like one of the following:
Format File = split each row
INSERT INTO dbo.TableName (ColumnName)
SELECT split.SplitVal
FROM OPENROWSET(BULK 'path\to\file.txt',
                FORMATFILE = 'Path\to\FormatFile.XML') data(eachrow)
CROSS APPLY SQL#.String_Split(data.eachrow, N' ', 2) split;
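For reference, the XML format file for that single-column case might look something like the following. This is a sketch following the BCP XML format-file schema; the TERMINATOR and MAX_LENGTH values are assumptions you would adapt to your files:

```xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <!-- one field per line of the file, terminated by the newline -->
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="8000"/>
  </RECORD>
  <ROW>
    <!-- exposed as the single column "eachrow" used in the query above -->
    <COLUMN SOURCE="1" NAME="eachrow" xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>
```

If your files use bare LF line endings rather than CRLF, the TERMINATOR would be "\n" instead.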
No Format File = split entire file as a single row
INSERT INTO dbo.TableName (ColumnName)
SELECT split.SplitVal
FROM OPENROWSET(BULK 'path\to\file.txt', SINGLE_CLOB) data(allrows)
CROSS APPLY SQL#.String_Split(
                  REPLACE(data.allrows, NCHAR(10), N' '),
                  N' ',
                  2 -- remove empty entries
                ) split;
Notes:
- For both methods you need to use a string splitter. SQLCLR-based splitters are the fastest, and in the examples above I used one from the SQL# library (which I created, but the String_Split function is available in the Free version). You can also write your own. If you do write your own and are not using a Format File, it might be a good idea to allow for multiple split characters so you can pass in both " " and "\n" and get rid of the REPLACE().
- If you can write your own SQLCLR string splitter, then it might be even better to just write a SQLCLR stored procedure that accepts an input parameter for @FilePath, reads the file, does the splitting, and spits out the words as many rows of a single column:
  INSERT INTO dbo.TableName (ColumnName) EXEC dbo.MySQLCLRproc(N'C:\path\to\file.txt');
- If you are not using (or cannot use) a Format File, then be sure to use the proper "SINGLE_" option, as you can do either SINGLE_CLOB (returns VARCHAR(MAX) for a standard ASCII file) or SINGLE_NCLOB (returns NVARCHAR(MAX) for a Unicode file).
- Even if you can create a Format File, it might be more efficient to pull in the entire file as a single string, depending on the size of the files: splitting a large string can be done rather quickly and is a single function call, whereas a file of thousands of short lines means thousands of function calls that are also fast, but likely not 1000 times faster than the single call. But if the file is 1 MB or larger, then I would probably still opt for the Format File and processing as many short lines.
Option 2
If by "pre-processed" you mean altered, but that there is no restriction on simply reading them and inserting the data from something external to SQL Server, you should write a small .NET app which reads the rows, splits the lines, and inserts the data by calling a Stored Procedure that accepts a Table-Valued Parameter (TVP). I have detailed this approach in another answer here on S.O.:
How can I insert 10 million records in the shortest time possible?
This could be compiled as a Console App and used in batch (i.e. .CMD / .BAT) scripts and even scheduled as a Windows Task.
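The splitting step such an app would perform is straightforward; here is a rough sketch in Python for illustration (the linked answer uses .NET, and the TVP insert call is omitted here):

```python
def split_into_words(text: str) -> list[str]:
    # str.split() with no separator splits on any run of whitespace
    # (spaces, tabs, \r, \n) and drops empty entries, so it handles
    # both the space and newline "terminators" in a single pass.
    return text.split()

# Each resulting list would then be streamed to a stored procedure
# via a table-valued parameter, as described in the linked answer.
```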
Source: https://stackoverflow.com/questions/27730419/bulk-insert-with-two-row-terminators