Escaping non-ASCII characters (or how to remove the BOM?)

一个人的身影 2021-01-13 04:16

I need to create an ANSI text file from an Access recordset that outputs to JSON and YAML. I can write the file, but the output is coming out with the original characters, a

2 answers
  • 2021-01-13 04:53

    Late to the game here, but I can't be the only coder who got fed up with my SQL imports being broken by text files with a Byte Order Marker. There are very few 'Stack questions that touch on the problem - this is one of the closest - so I'm posting an overlapping answer here.

    I say 'overlapping' because the code below is solving a slightly different problem to yours - the primary purpose is writing a Schema file for a folder with a heterogeneous collection of files - but the BOM-handling segment is clearly marked.

    The key functionality is that we iterate through all the '.csv' files in a folder, and we test each file with a quick nibble of the first four bytes: we only strip out the Byte Order Marker if we see one.
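
    The 'quick nibble' can be sketched on its own. This is a minimal illustration rather than part of the answer's code: `DetectBom` is a hypothetical helper name, the file path in the usage line is made up, and only the UTF-8 and UTF-16 marks discussed here are recognised.

    ```vba
    Option Explicit

    ' Hypothetical helper: reports which byte order mark, if any, starts a file.
    ' Returns "UTF-8", "UTF-16LE", "UTF-16BE", or "" when none of them is found.
    Public Function DetectBom(ByVal strPath As String) As String
        Dim hndFile As Integer
        Dim arrHead(0 To 3) As Byte     ' fixed-size buffer, zero-filled by VBA

        hndFile = FreeFile
        Open strPath For Binary Access Read As #hndFile
        ' Read up to four bytes from the start of the file (position 1)
        If LOF(hndFile) > 0 Then Get #hndFile, 1, arrHead
        Close #hndFile

        If arrHead(0) = &HEF And arrHead(1) = &HBB And arrHead(2) = &HBF Then
            DetectBom = "UTF-8"
        ElseIf arrHead(0) = &HFF And arrHead(1) = &HFE Then
            ' Note: a UTF-32LE mark (FF FE 00 00) also lands in this branch
            DetectBom = "UTF-16LE"
        ElseIf arrHead(0) = &HFE And arrHead(1) = &HFF Then
            DetectBom = "UTF-16BE"
        Else
            DetectBom = ""
        End If
    End Function
    ```

    A call such as `Debug.Print DetectBom("C:\Data\import.csv")` would then print the detected encoding name, or an empty line for a file without a mark.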

    After that, we're working in low-level file-handling code from the primordial C. We have to, all the way down to using byte arrays, because everything else that you do in VBA will deposit the Byte Order Markers embedded in the structure of a string variable.

    So, without further adodb, here's the code:

    BOM-Disposal code for text files in a schema.ini file:

    Public Sub SetSchema(strFolder As String)
        On Error Resume Next

        ' Write a Schema.ini file to the data folder.
        ' This is necessary if we do not have the registry privileges to set the
        ' correct 'ImportMixedTypes=Text' registry value, which overrides IMEX=1
        ' The code also checks for ANSI or UTF-8 and UTF-16 files, and applies a
        ' usable setting for CharacterSet ( UNICODE|ANSI ) with a horrible hack.
        ' OEM codepage-defined text is not supported: further coding is required
        ' ...And we strip out Byte Order Markers, if we see them - the OLEDB SQL
        ' provider for textfiles can't deal with a BOM in a UTF-16 or UTF-8 file
        ' Not implemented: handling tab-delimited files or other delimiters. The
        ' code assumes a header row with columns, specifies 'scan all rows', and
        ' imposes 'read the column as text' if the data types are mixed.

        Dim strSchema As String
        Dim strFile As String
        Dim hndFile As Long
        Dim arrFile() As Byte
        Dim arrBytes(0 To 4) As Byte

        If Right(strFolder, 1) <> "\" Then strFolder = strFolder & "\"

        ' Dir() is an iterator function when you call it with a wildcard:
        strFile = VBA.FileSystem.Dir(strFolder & "*.csv")

        Do While Len(strFile) > 0

            hndFile = FreeFile
            Open strFolder & strFile For Binary As #hndFile
            Get #hndFile, , arrBytes
            Close #hndFile

            strSchema = strSchema & "[" & strFile & "]" & vbCrLf
            strSchema = strSchema & "Format=CSVDelimited" & vbCrLf
            strSchema = strSchema & "ImportMixedTypes=Text" & vbCrLf
            strSchema = strSchema & "MaxScanRows=0" & vbCrLf

            If arrBytes(2) = 0 Or arrBytes(3) = 0 Then ' this is a hack
                strSchema = strSchema & "CharacterSet=UNICODE" & vbCrLf
            Else
                strSchema = strSchema & "CharacterSet=ANSI" & vbCrLf
            End If

            strSchema = strSchema & "ColNameHeader = True" & vbCrLf
            strSchema = strSchema & vbCrLf

            ' BOM disposal - Byte order marks confuse OLEDB text drivers:
            If arrBytes(0) = &HFE And arrBytes(1) = &HFF _
            Or arrBytes(0) = &HFF And arrBytes(1) = &HFE Then

                hndFile = FreeFile
                Open strFolder & strFile For Binary As #hndFile
                ReDim arrFile(0 To LOF(hndFile) - 1)
                Get #hndFile, , arrFile
                Close #hndFile

                BigReplace arrFile, arrBytes(0) & arrBytes(1), ""

                hndFile = FreeFile
                Open strFolder & strFile For Binary As #hndFile
                Put #hndFile, , arrFile
                Close #hndFile
                Erase arrFile

            ElseIf arrBytes(0) = &HEF And arrBytes(1) = &HBB And arrBytes(2) = &HBF Then

                hndFile = FreeFile
                Open strFolder & strFile For Binary As #hndFile
                ReDim arrFile(0 To LOF(hndFile) - 1)
                Get #hndFile, , arrFile
                Close #hndFile

                BigReplace arrFile, arrBytes(0) & arrBytes(1) & arrBytes(2), ""

                hndFile = FreeFile
                Open strFolder & strFile For Binary As #hndFile
                Put #hndFile, , arrFile
                Close #hndFile
                Erase arrFile

            End If

            strFile = ""
            strFile = Dir
        Loop

        If Len(strSchema) > 0 Then
            strFile = strFolder & "Schema.ini"
            hndFile = FreeFile
            Open strFile For Binary As #hndFile
            Put #hndFile, , strSchema
            Close #hndFile
        End If

    End Sub

    Public Sub BigReplace(ByRef arrBytes() As Byte, ByRef SearchFor As String, ByRef ReplaceWith As String)
        On Error Resume Next

        Dim varSplit As Variant

        varSplit = Split(arrBytes, SearchFor)
        arrBytes = Join$(varSplit, ReplaceWith)

        Erase varSplit
    End Sub
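
    For orientation, this is roughly the section that SetSchema appends to Schema.ini for each file. The file name `customers.csv` is a made-up example, and the fragment assumes the third and fourth bytes of that file are non-zero, so the 'hack' test picks ANSI:

    ```ini
    [customers.csv]
    Format=CSVDelimited
    ImportMixedTypes=Text
    MaxScanRows=0
    CharacterSet=ANSI
    ColNameHeader = True
    ```

    A file whose third or fourth byte is zero would get `CharacterSet=UNICODE` in place of the ANSI line.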

    The code's easier to understand if you know that a Byte Array can be assigned to a VBA.String, and vice versa. The BigReplace() function is a hack that sidesteps some of VBA's inefficient string-handling, especially allocation: you'll find that large files cause serious memory and performance problems if you do it any other way.
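
    Two VBA details are worth checking before reusing the listing as it stands: `&` applied to two Byte values concatenates their decimal text (254 and 255 become "254255"), and a file opened `For Binary` is not truncated when fewer bytes are written back to it. A more conservative alternative is to skip the mark by byte offset and recreate the file. The sketch below is an illustration under those assumptions, not the answer's method: `StripLeadingBom` is a hypothetical name, and it treats only a BOM at the very start of the file.

    ```vba
    Option Explicit

    ' Hypothetical helper: removes a leading UTF-8 or UTF-16 byte order mark.
    ' Returns the number of bytes removed (0 when no mark was found).
    Public Function StripLeadingBom(ByVal strPath As String) As Long
        Dim hndFile As Integer
        Dim arrHead(0 To 2) As Byte     ' zero-filled, so short files are safe
        Dim arrBody() As Byte
        Dim lngSize As Long
        Dim lngSkip As Long

        hndFile = FreeFile
        Open strPath For Binary Access Read As #hndFile
        lngSize = LOF(hndFile)
        If lngSize >= 2 Then Get #hndFile, 1, arrHead

        If lngSize >= 3 And arrHead(0) = &HEF And arrHead(1) = &HBB _
                And arrHead(2) = &HBF Then
            lngSkip = 3                 ' UTF-8 mark
        ElseIf lngSize >= 2 And ((arrHead(0) = &HFF And arrHead(1) = &HFE) _
                Or (arrHead(0) = &HFE And arrHead(1) = &HFF)) Then
            lngSkip = 2                 ' UTF-16 mark, either byte order
        End If

        ' Binary positions are 1-based: read everything after the mark
        If lngSkip > 0 And lngSize > lngSkip Then
            ReDim arrBody(0 To lngSize - lngSkip - 1)
            Get #hndFile, lngSkip + 1, arrBody
        End If
        Close #hndFile

        If lngSkip = 0 Then Exit Function

        ' Delete first: writing in Binary mode would leave the old tail behind
        Kill strPath
        hndFile = FreeFile
        Open strPath For Binary Access Write As #hndFile
        If lngSize > lngSkip Then Put #hndFile, 1, arrBody
        Close #hndFile

        StripLeadingBom = lngSkip
    End Function
    ```

    Because the body is read straight into a Byte array from an offset, the data never passes through a String, which keeps the memory cost at one copy of the file.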

  • 2021-01-13 05:12

    OK, I found some example code on how to remove the BOM. I would have thought it would be possible to do this more elegantly when actually writing the text in the first place. Never mind. The following code removes the BOM.

    (This was originally posted by Simon Pedersen at http://www.imagemagick.org/discourse-server/viewtopic.php?f=8&t=12705)

    ' Removes the Byte Order Mark - BOM from a text file with UTF-8 encoding
    ' The BOM defines that the file was stored with an UTF-8 encoding.

    Public Function RemoveBOM(filePath)

        ' Create a reader and a writer
        Dim writer, reader, fileSize
        Set writer = CreateObject("Adodb.Stream")
        Set reader = CreateObject("Adodb.Stream")

        ' Load from the text file we just wrote
        reader.Open
        reader.LoadFromFile filePath

        ' Copy all data from reader to writer, except the BOM
        writer.Mode = 3
        writer.Type = 1
        writer.Open
        reader.Position = 5
        reader.CopyTo writer, -1

        ' Overwrite file
        writer.SaveToFile filePath, 2

        ' Return file name
        RemoveBOM = filePath

        ' Kill objects
        Set writer = Nothing
        Set reader = Nothing
    End Function
    

    It might be useful for someone else.
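
    A variant that opens the reader in binary mode and checks for the UTF-8 signature before skipping anything might look like the following. It is a sketch under assumptions: `RemoveUtf8Bom` is a hypothetical name, only the three-byte UTF-8 mark (EF BB BF) is handled, and the constants are declared locally because late binding does not expose the ADO enumerations.

    ```vba
    Option Explicit

    ' Hypothetical variant: strips a UTF-8 BOM only when one is present.
    ' Returns True when the file was rewritten.
    Public Function RemoveUtf8Bom(ByVal filePath As String) As Boolean
        Const adTypeBinary As Long = 1
        Const adSaveCreateOverWrite As Long = 2

        Dim reader As Object
        Dim writer As Object
        Dim head As Variant

        Set reader = CreateObject("ADODB.Stream")
        reader.Type = adTypeBinary          ' set before Open: raw bytes
        reader.Open
        reader.LoadFromFile filePath

        ' A file holding only the BOM is left unchanged in this sketch
        If reader.Size > 3 Then
            head = reader.Read(3)           ' Variant holding a Byte array
            If head(0) = &HEF And head(1) = &HBB And head(2) = &HBF Then
                Set writer = CreateObject("ADODB.Stream")
                writer.Type = adTypeBinary
                writer.Open

                reader.Position = 3         ' first byte after the mark
                reader.CopyTo writer        ' copies the rest of the stream
                writer.SaveToFile filePath, adSaveCreateOverWrite
                writer.Close
                RemoveUtf8Bom = True
            End If
        End If

        reader.Close
        Set writer = Nothing
        Set reader = Nothing
    End Function
    ```

    With both streams in binary mode the offset is simply the length of the mark, and files without a BOM are not touched at all.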
