Handling consecutive quotes when splitting CSV lines

Posted by 允我心安 on 2019-12-11 10:00:20

Question


I am struggling with parsing values from a CSV file because of two consecutive double quotes "".

Here's an example of a CSV line I pulled from Wikipedia: 1997,Ford,E350,"Super, ""luxurious"" truck"

I have tried to find different ways to account for it.

The result I keep getting is:

"1997"
"Ford"
"E350"
"Super,"
""Super"
" ""luxurious"" truck""

This is my VB.Net function.

Private Function splitCSV(ByVal sLine As String) As List(Of String)
    Dim comA As Integer = -1, comB = -1, quotA = -1, quotB = -1, pos = 0
    Dim parsed As New List(Of String)
    Dim quote As String = """"
    Dim comma As String = ","
    Dim len As Integer = sLine.Length
    Dim first As Boolean = True

    comA = sLine.IndexOf(comma, pos)                        ' Find the next comma.
    quotA = sLine.IndexOf(quote, pos)                       ' Find the next quotation mark.

    ' This if function works if there is only one field in the given row.
    If comA < 0 Then
        parsed.Add(False)
        Return parsed
    End If

    While pos < len                                                     ' While not at end of the string

        comB = sLine.IndexOf(comma, comA + 1)                               ' Find the second comma
        quotB = sLine.IndexOf(quote, quotA + 1)                             ' Find the second quotation mark

        ' Looking for the actual second quote mark
        '     Skips over the double quotation marks.

        If quotA > -1 And quotA < comB Then                                 ' If the quotation mark is before the first comma

            If Math.Abs(quotA - quotB).Equals(1) Then
                Dim tempA As Integer = quotA
                Dim tempB As Integer = quotB

                ' Looking for the actual second quote mark
                '     Skips over the double quotation marks.
                While (Math.Abs(tempA - tempB).Equals(1))
                    tempA = tempB

                    If Not tempA.Equals(sLine.LastIndexOf(quote)) Then
                        tempB = sLine.IndexOf(quote, tempA + 1)

                    Else
                        tempA = tempB - 2
                    End If

                End While

                quotB = tempB
            End If

            If quotB < 0 Then                                                   ' If second quotation mark does not exist
                parsed.Add(False)                                                   ' End the function and Return False

                Return parsed
            End If

            parsed.Add(sLine.Substring(quotA + 1, quotB - quotA - 1))       ' Otherwise, add the substring of initial and end quotation marks.
            quotA = quotB                                                       ' Give quotA the position of quotB
            pos = quotB                                                         ' Mark the current position

        ElseIf comA < comB Then
            If first Then                                                   ' If it is the first comma in the line,
                parsed.Add(sLine.Substring(pos, comA))                          ' Parse the first field
                first = False                                                   ' The future commas will not be considered as the first one.
            End If

            comB = sLine.IndexOf(comma, comA + 1)                           ' Find the second comma

            If comB > comA Then                                             ' If the second comma exists
                parsed.Add(sLine.Substring(comA + 1, comB - comA - 1))          ' Add the substring of the first and second comma.
                comA = comB                                                     ' Give comA the position of comB
                pos = comB                                                      ' Mark the current position

            End If

        ElseIf len > 0 Then                                                 ' If the first comma does not exist, as long as sLine has something,
            parsed.Add(sLine.Substring(pos + 1, len - pos - 1))                         ' Return the substring from position to end of string.
            pos = len                                                           ' Mark the position at the end to exit out of while loop


        End If

    End While

    Return parsed                                                           ' Return parsed list of string
End Function

Answer 1:


The TextFieldParser is really pretty good with this sort of thing, certainly easier than rolling your own. It was easy to test this: I copied your sample to a file, then:

Imports Microsoft.VisualBasic.FileIO
...
Using parser = New TextFieldParser("C:\Temp\CSVPARSER.TXT")
    parser.Delimiters = New String() {","}
    parser.TextFieldType = FieldType.Delimited
    parser.HasFieldsEnclosedInQuotes = True

    While Not parser.EndOfData
        ' ReadFields returns the fields of the current line as a String array.
        Dim data As String() = parser.ReadFields()

        ' use pipe to show column breaks:
        Dim s = String.Join("|", data)
        Console.WriteLine(s)

    End While
End Using

HasFieldsEnclosedInQuotes = True is the important setting in this case, since it tells the parser that a quoted field may contain the delimiter. Result:

1997|Ford|E350|Super, "luxurious" truck

The comma after Super looks out of place, and may well be, but it is inside the quotes in the original line, so it belongs to the field: 1997,Ford,E350,"Super, ""luxurious"" truck"

There are other libraries/packages which also do well with various CSV layouts and formats.
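
If you already have the line in memory, as the splitCSV function in the question does, TextFieldParser can also read from a TextReader instead of a file. Here is a minimal sketch of that, not a definitive implementation: the module name CsvLineDemo and the helper SplitCsvLine are made up for illustration, and it assumes the input is a single line.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module CsvLineDemo
    ' Hypothetical helper (not from the original post): splits one CSV line held in memory.
    Function SplitCsvLine(line As String) As List(Of String)
        Dim fields As New List(Of String)

        ' TextFieldParser has a constructor that accepts any TextReader.
        Using reader As New StringReader(line)
            Using parser As New TextFieldParser(reader)
                parser.TextFieldType = FieldType.Delimited
                parser.SetDelimiters(",")
                parser.HasFieldsEnclosedInQuotes = True

                ' An empty input has no data, so the result stays empty.
                If Not parser.EndOfData Then
                    fields.AddRange(parser.ReadFields())
                End If
            End Using
        End Using

        Return fields
    End Function

    Sub Main()
        ' The sample line from the question; quotes are doubled again for the VB string literal.
        Dim line As String = "1997,Ford,E350,""Super, """"luxurious"""" truck"""

        ' Use pipe to show column breaks, as in the file-based example.
        Console.WriteLine(String.Join("|", SplitCsvLine(line)))
    End Sub
End Module
```

For the sample line this should print the same pipe-separated result as the file-based version above.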




Answer 2:


I've had to parse these types of files before. Here's what I ended up writing. Basically, you scan the incoming text one character at a time. A quote toggles the "in quoted text" state, unless it is immediately followed by another quote, in which case the pair stands for one literal quote character. While you're in quoted text, the delimiter is ignored.

    Protected Function FlexSplitLine(incoming As String, fieldDelimiter As String, quoteDelimiter As String) As String()
        Dim rval As New List(Of String)
        Dim index As Integer
        Dim Word As New System.Text.StringBuilder
        Dim inQuote As Boolean
        Dim QuoteChar As Char
        Dim CommaChar As Char

        index = 0

        If quoteDelimiter Is Nothing OrElse quoteDelimiter.Length = 0 Then
            quoteDelimiter = """"
        End If

        If fieldDelimiter Is Nothing OrElse fieldDelimiter.Length = 0 Then
            fieldDelimiter = ","
        End If

        QuoteChar = quoteDelimiter(0)
        CommaChar = fieldDelimiter(0)

        Do While index < incoming.Length
            If incoming(index) = QuoteChar Then
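                ' A doubled quote is an escaped literal quote only inside quoted text;
                ' outside of it, "" is an empty quoted field.
                If inQuote AndAlso index < incoming.Length - 1 AndAlso incoming(index + 1) = QuoteChar Then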
                    Word.Append(QuoteChar)
                    index += 1
                Else
                    inQuote = Not inQuote
                End If
            ElseIf incoming(index) = CommaChar AndAlso Not inQuote Then
                rval.Add(Word.ToString)
                Word.Length = 0
            Else
                Word.Append(incoming(index))
            End If

            index += 1
        Loop

        If inQuote Then
            Throw New IndexOutOfRangeException("Ran past the end of the line while looking for the ending quote character.")
        End If

        rval.Add(Word.ToString)

        Return rval.ToArray
    End Function
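
To test a splitter like this it can help to generate input with the inverse operation, which also shows where the doubled quotes come from: a field containing a comma, a quote, or a line break is wrapped in quotes, and every quote inside it is doubled. The sketch below assumes that convention; CsvEscapeDemo and EscapeCsvField are made-up names, not part of the code above.

```vbnet
Imports System
Imports System.Linq

Module CsvEscapeDemo
    ' Hypothetical helper (not from the original post): quotes a field only when needed.
    Function EscapeCsvField(field As String) As String
        Dim needsQuotes As Boolean =
            field.Contains(",") OrElse field.Contains("""") OrElse
            field.Contains(vbCr) OrElse field.Contains(vbLf)

        If Not needsQuotes Then
            Return field
        End If

        ' Double every embedded quote, then wrap the whole field in quotes.
        Return """" & field.Replace("""", """""") & """"
    End Function

    Sub Main()
        Dim fields As String() = {"1997", "Ford", "E350", "Super, ""luxurious"" truck"}

        ' Join the escaped fields with commas to rebuild the CSV line.
        Console.WriteLine(String.Join(",", fields.Select(Function(f) EscapeCsvField(f))))
        ' Prints: 1997,Ford,E350,"Super, ""luxurious"" truck"
    End Sub
End Module
```

Feeding the generated line back into the splitter should give you the four original fields.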


Source: https://stackoverflow.com/questions/35468148/handling-consecutive-quotes-when-splitting-csv-lines
