Find non-ASCII characters in varchar columns using SQL Server

遥遥无期 asked 2020-12-02 14:38

How can rows with non-ASCII characters be returned using SQL Server?
If you can show how to do it for one column, that would be great.

I am doing something like this

8 Answers
  • 2020-12-02 14:41

    I've been running this bit of code with success

    declare @UnicodeData table (
         data nvarchar(500)
    )
    insert into 
        @UnicodeData
    values 
        (N'Horse�')
        ,(N'Dog')
        ,(N'Cat')
    
    select
        data
    from
        @UnicodeData 
    where
        data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))
    

    This works well for known columns.

    For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

    declare 
        @sql    varchar(max)    = ''
        ,@table sysname         = 'mytable' -- enter your table here
    
    ;with ColumnData as (
        select
            RowId               = row_number() over (order by c.COLUMN_NAME)
            ,c.COLUMN_NAME
            ,ColumnName         = '[' + c.COLUMN_NAME + ']'
            ,TableName          = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']' 
        from
            INFORMATION_SCHEMA.COLUMNS c
        where
            c.DATA_TYPE         = 'nvarchar'
            and c.TABLE_NAME    = @table
    )
    select
        @sql = @sql
            + 'select FieldName = ''' + c.ColumnName + ''', InvalidCharacter = ' + c.ColumnName
            + ' from ' + c.TableName
            + ' where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) '
            + case when c.RowId <> (select max(RowId) from ColumnData) then ' union all ' else '' end
            + char(13)
    from
        ColumnData c
    
    -- check
    -- print @sql
    exec (@sql)
    

    I'm not a fan of dynamic SQL but it does have its uses for exploratory queries like this.

  • 2020-12-02 14:45

    Here is a solution for the single column search using PATINDEX.
    It also displays the StartPosition, InvalidCharacter and ASCII code.

    select line,
      patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) as [Position],
      substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1) as [InvalidCharacter],
      ascii(substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1)) as [ASCIICode]
    from  staging.APARMRE1
    where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) >0
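
    If you need every invalid character in a row rather than just the first, the same PATINDEX expression can be applied repeatedly with a recursive CTE. This is a sketch (not part of the original answer), reusing the staging.APARMRE1 / Line names from above:

    ;with Hits as (
        select line,
               patindex('%[^ !-~]%' collate Latin1_General_BIN, line) as pos
        from   staging.APARMRE1
        where  patindex('%[^ !-~]%' collate Latin1_General_BIN, line) > 0
        union all
        -- each recursive step finds the next hit after the previous position
        select h.line,
               h.pos + patindex('%[^ !-~]%' collate Latin1_General_BIN,
                                substring(h.line, h.pos + 1, len(h.line)))
        from   Hits h
        where  patindex('%[^ !-~]%' collate Latin1_General_BIN,
                        substring(h.line, h.pos + 1, len(h.line))) > 0
    )
    select line, pos,
           substring(line, pos, 1)        as InvalidCharacter,
           ascii(substring(line, pos, 1)) as ASCIICode
    from Hits
    option (maxrecursion 0)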
    
  • 2020-12-02 14:45

    There is a user-defined function available on the web called 'Parse Alphanumeric'. Google "UDF parse alphanumeric" and you should find the code for it. This user-defined function removes all characters that don't fit between 0-9, a-z, and A-Z.

    Select * from Staging.APARMRE1 ar
    where udf_parsealpha(ar.last_name) <> ar.last_name
    

    That should bring back any records that have a last_name with invalid chars for you... though your bonus-points question is a bit more of a challenge. I think a case statement could handle it, though. This is a bit of pseudocode, and I'm not entirely sure it'd work.

    Select id, case when udf_parsealpha(ar.last_name) <> ar.last_name then 'last name'
    when udf_parsealpha(ar.first_name) <> ar.first_name then 'first name'
    when udf_parsealpha(ar.Address1) <> ar.Address1 then 'Address1'
    end, 
    case when udf_parsealpha(ar.last_name) <> ar.last_name then ar.last_name
    when udf_parsealpha(ar.first_name) <> ar.first_name then ar.first_name
    when udf_parsealpha(ar.Address1) <> ar.Address1 then ar.Address1
    end
    from Staging.APARMRE1 ar
    where udf_parsealpha(ar.last_name) <> ar.last_name or
    udf_parsealpha(ar.first_name) <> ar.first_name or
    udf_parsealpha(ar.Address1) <> ar.Address1
    

    I wrote this in the forum post box, so I'm not quite sure it'll function as is, but it should be close. I'm also not sure how it behaves if a single record has two fields with invalid chars: only the first matching branch of each case expression will be reported.

    As an alternative, you should be able to change the from clause away from a single table and into a subquery that looks something like:

    select id, fieldname, value from (
    Select id, 'last_name' as fieldname, last_name as value
    from Staging.APARMRE1
    Union all
    Select id, 'first_name' as fieldname, first_name as value
    from Staging.APARMRE1
    ---(and repeat unions for each field)
    ) sub
    where udf_parsealpha(value) <> value
    

    The benefit here is that for every additional column you only need to extend the union statement, whereas the case-statement version of this script repeats each comparison three times per column.

  • 2020-12-02 14:48

    To find which field has invalid characters:

    SELECT * FROM Staging.APARMRE1 FOR XML AUTO, TYPE
    

    You can test it with this query:

    SELECT top 1 'char 31: '+char(31)+' (hex 0x1F)' field
    from sysobjects
    FOR XML AUTO, TYPE
    

    The result will be:

    Msg 6841, Level 16, State 1, Line 3 FOR XML could not serialize the data for node 'field' because it contains a character (0x001F) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

    This is very useful when you write XML files and get invalid-character errors while validating them.
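
    As the error message itself suggests, the offending data can still be serialized by converting it to varbinary and adding the BINARY BASE64 directive. A sketch, reusing the same test query:

    SELECT top 1 convert(varbinary(max), 'char 31: ' + char(31) + ' (hex 0x1F)') field
    from sysobjects
    FOR XML AUTO, BINARY BASE64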

  • 2020-12-02 14:51

    This script searches for non-ASCII characters in one column. It builds a string of all valid characters, here code points 32 to 127, and then searches for rows that don't match the list:

    declare @str varchar(256)   -- 96 characters plus 96 '|' escape markers; varchar(128) would silently truncate
    declare @i int
    set @str = ''
    set @i = 32
    while @i <= 127
        begin
        set @str = @str + '|' + char(@i)
        set @i = @i + 1
        end
    
    select  col1
    from    YourTable
    where   col1 like '%[^' + @str + ']%' escape '|'
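
    The same check can also be written without the loop by using an explicit character range, assuming a binary collation so the range is interpreted as code points 32-126 (note the loop above additionally allows 127, the DEL character):

    select  col1
    from    YourTable
    where   col1 like '%[^ -~]%' collate Latin1_General_BIN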
    
  • 2020-12-02 14:59

    Running the various solutions on some real-world data (12M rows, varchar length ~30, around 9k dodgy rows, no full-text index in play), the patindex solution is the fastest, and it also selects the most rows.

    (I pre-ran km.'s solution to bring the cache to a known state, ran the 3 processes, and finally ran km.'s again; the last 2 runs of km.'s gave times within 2 seconds.)

    patindex solution by Gerhard Weiss -- Runtime 0:38, returns 9144 rows

    select dodgyColumn from myTable fcc
    WHERE  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,dodgyColumn ) >0
    

    the substring-numbers solution by MT. -- Runtime 1:16, returned 8996 rows

    select dodgyColumn from myTable fcc
    INNER JOIN dbo.Numbers32k dn ON dn.number<(len(fcc.dodgyColumn ))
    WHERE ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))<32 
        OR ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))>127
    

    udf solution by Deon Robertson -- Runtime 3:47, returns 7316 rows

    select dodgyColumn 
    from myTable 
    where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn , 1) = 1
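
    The UDF used here is not shown in the post. As a sketch only, a function with a matching signature might look like this (the second parameter is an assumption: it is taken to control whether control characters below 32 also count as non-ASCII):

    create function dbo.udf_test_ContainsNonASCIIChars
    (
        @input nvarchar(max),
        @checkControlChars bit
    )
    returns bit
    as
    begin
        declare @i int = 1
        while @i <= len(@input)
        begin
            -- unicode() returns the code point of the character at position @i
            declare @c int = unicode(substring(@input, @i, 1))
            if @c > 127 or (@checkControlChars = 1 and @c < 32)
                return 1
            set @i = @i + 1
        end
        return 0
    end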
    