Parse CSV, ignoring commas inside string literals in VBA?

后端 未结 11 1789
青春惊慌失措
青春惊慌失措 2020-12-30 03:27

I have a VBA application that runs every day. It checks a folder where CSVs are downloaded automatically, and adds their contents to a database. When parsing them, I reali

相关标签:
11条回答
  • 2020-12-30 04:03

    The first way to solve this problem is to look at the structure of the line from the csv file (int,int,"String literal, will have at most one comma", etc). A naive solution would be (Assuming that the line don't have any semicolons)

    Function splitLine1(line As String) As String()
    
       Dim temp() As String
       'Splits the line in three. The string delimited by " will be at temp(1)
       temp = Split(line, Chr(34)) 'chr(34) = "
    
       'Replaces the commas in the numeric fields by semicolons
       temp(0) = Replace(temp(0), ",", ";")
       temp(2) = Replace(temp(2), ",", ";")
    
       'Joins the temp array with quotes and then splits the result using the semicolons
       splitLine1 = Split(Join(temp, Chr(34)), ";")
    
    End Function
    

    This function only solves this particular problem. Another way to do the job is using the regular expression object from VBScript.

    Function splitLine2(line As String) As String()
    
        Dim regex As Object
        Set regex = CreateObject("vbscript.regexp")
        regex.IgnoreCase = True
        regex.Global = True
    
        'This pattern matches only commas outside quotes
        'Pattern = ",(?=([^"]*"[^"]*")*(?![^"]*"))"
        regex.Pattern = ",(?=([^" & Chr(34) & "]*" & Chr(34) & "[^" & Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))"
    
        'regex.replaces will replace the commas outside quotes with semicolons and then the
        'Split function will split the result based on the semicollons
        splitLine2 = Split(regex.Replace(line, ";"), ";")
    
    End Function
    

    This method seems much more cryptic, but does not deppends on the structure of the line

    You can read more about regular expressions patterns in VBScript Here

    0 讨论(0)
  • 2020-12-30 04:05

    If you are working with MS Access tables, there are advantages in simply importing text from disk. For example:

    ''If you have a reference to the Windows Script Host Object Model
    Dim fs As New FileSystemObject
    Dim ts As TextStream
    
    ''For late binding
    ''Dim fs As Object
    ''Dim ts As Object
    ''Set fs=CreateObject("Scripting.FileSystemObject")
    
    Set ts = fs.CreateTextFile("z:\docs\import.csv", True)
    
    sData = "1,2,3,""This should,be one part"",5,6,7"
    
    ts.Write sData
    ts.Close
    
    ''Just for testing, your table will already exist
    ''sSQL = "Create table Imports (f1 int, f2 int, f3 int, f4 text, " _
    ''     & "f5 int, f6 int, f7 int)"
    ''CurrentDb.Execute sSQL
    
    ''The fields will be called F1,F2 ... Fn in the text file
    sSQL = "INSERT INTO Imports SELECT * FROM " _
         & "[text;fmt=delimited;hdr=no;database=z:\docs\].[import.csv]"
    CurrentDb.Execute sSQL
    
    0 讨论(0)
  • 2020-12-30 04:05

    Taking your comments into account you could take the easy way out here

    • split on " --> gives you 3 or more entries (could be more due to doublequotes inside the string literal)
    • split first part on ,
    • keep part 2 to n-1 together (is your string literal)
    • split the last part on ,
    0 讨论(0)
  • We had a similar CSV parsing challenge in excel recently, and implemented a solution adapted from Javascript code to parse CSV data:

    Function SplitCSV(csvText As String, delimiter As String) As String()
    
        ' Create a regular expression to parse the CSV values
        Dim RegEx As New RegExp
    
        ' Create pattern which will match each column in the CSV, wih submatches for each of the groups in the regex
        ' Match Groups:  Delimiter            Quoted fields                  Standard fields
        RegEx.Pattern = "(" + delimiter + "|^)(?:\""([^\""]*(?:\""\""[^\""]*)*)\""|([^\""\""" + delimiter + """]*))"
        RegEx.Global = True
        RegEx.IgnoreCase = True
    
        ' Create an array to hold all pattern matches (i.e. columns)
        Dim Matches As MatchCollection
        Set Matches = RegEx.Execute(csvText)
    
        ' Create an array to hold output data
        Dim Output() As String
    
        ' Create int to track array location when iterating
        Dim i As Integer
        i = 0
    
        ' Manually add blank if first column is blank, since VBA regex misses this
        If csvText Like ",*" Then
            ReDim Preserve Output(i)
            Output(i) = ""
            i = i + 1
        End If
    
        ' Iterate over all pattern matches and get values into output array
        Dim Match As Match
        Dim MatchedValue As String
        For Each Match In Matches
    
            ' Check to see which kind of value we captured (quoted or unquoted)
            If (Len(Match.SubMatches(1)) > 0) Then
                ' We found a quoted value. When we capture this value, unescape any double quotes
                MatchedValue = Replace(Match.SubMatches(1), """""", """")
            Else
                ' We found a non-quoted value
                MatchedValue = Match.SubMatches(2)
            End If
    
            ' Now that we have our value string, let's add it to the data array
            ReDim Preserve Output(i)
            Output(i) = MatchedValue
            i = i + 1
    
        Next Match
    
        ' Return the parsed data
        SplitCSV = Output
    
    End Function
    
    0 讨论(0)
  • 2020-12-30 04:11

    @Gimp said...

    The current answers do not contain enough detail.

    I'm running into the same problem. Looking for more detail in this answer.

    To elaborate on @MRAB's answer:

    Function ParseCSV(FileName)
        Dim Regex       'As VBScript_RegExp_55.RegExp
        Dim MatchColl   'As VBScript_RegExp_55.MatchCollection
        Dim Match       'As VBScript_RegExp_55.Match
        Dim FS          'As Scripting.FileSystemObject
        Dim Txt         'As Scripting.TextStream
        Dim CSVLine
        ReDim ToInsert(0)
    
        Set FS = CreateObject("Scripting.FileSystemObject")
        Set Txt = FS.OpenTextFile(FileName, 1, False, -2)
        Set Regex = CreateObject("VBScript.RegExp")
    
        Regex.Pattern = """[^""]*""|[^,]*"    '<- MRAB's answer
        Regex.Global = True
    
        Do While Not Txt.AtEndOfStream
            ReDim ToInsert(0)
            CSVLine = Txt.ReadLine
            For Each Match In Regex.Execute(CSVLine)
                If Match.Length > 0 Then
                    ReDim Preserve ToInsert(UBound(ToInsert) + 1)
                    ToInsert(UBound(ToInsert) - 1) = Match.Value
                End If
            Next
            InsertArrayIntoDatabase ToInsert
        Loop
        Txt.Close
    End Function
    

    You need to customize the InsertArrayIntoDatabase Sub for your own table. Mine has several text fields named f00, f01, etc...

    Sub InsertArrayIntoDatabase(a())
        Dim rs As DAO.Recordset
        Dim i, n
        Set rs = CurrentDb().TableDefs("tbl").OpenRecordset()
        rs.AddNew
        For i = LBound(a) To UBound(a)
            n = "f" & Format(i, "00") 'fields in table are f00, f01, f02, etc..
            rs.Fields(n) = a(i)
        Next
        rs.Update
    End Sub
    

    Note that instead of using CurrentDb() in InsertArrayIntoDatabase(), you should really use a global variable that gets set to the value of CurrentDb() before ParseCSV() runs, because running CurrentDb() in a loop is very slow, especially on a very large file.

    0 讨论(0)
  • 2020-12-30 04:12

    A simple regex for parsing a CSV line, assuming no quotes inside quoted fields, is:

    "[^"]*"|[^,]*
    

    Each match will return a field.

    0 讨论(0)
提交回复
热议问题