How to ensure that a file has only unique lines in VB.NET when the file is very large

面向向阳花 2021-01-24 18:48

Language: VB.NET. File size: about 1 GB.

Encoding of the text file: UTF-8 (so each character is represented by a varying number of bytes)

2 Answers
  • 2021-01-24 19:30

    Depending on how long the lines are, you may be able to compute an MD5 hash value for each line and store that in a HashSet:

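    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Linq
    Imports System.Security.Cryptography
    Imports System.Text

    Using sr As New StreamReader("myFile"), hasher As MD5 = MD5.Create()
        Dim hashes As New HashSet(Of String)
        Dim l As String = sr.ReadLine()

        ' ReadLine returns Nothing at end of file. BaseStream.Position is not a
        ' reliable end-of-file test because StreamReader reads ahead into a buffer.
        While l IsNot Nothing
            Dim hash As String = String.Join(String.Empty, hasher.ComputeHash(Encoding.UTF8.GetBytes(l)).Select(Function(x) x.ToString("x2")))

            ' HashSet.Add returns False when the value is already present.
            If Not hashes.Add(hash) Then
                'Lines are not unique
                Exit While
            End If

            l = sr.ReadLine()
        End While
    End Using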
    

    Untested, but this may be fast enough for your needs. I can't think of something much faster that still maintains some semblance of conciseness :)

  • 2021-01-24 19:46

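    This is the current answer. It keeps only each line's hash code and file offset in memory, re-reads the file to confirm a real match when two hash codes collide, and writes the unique lines to a new file: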

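    Public Sub makeUniqueForLargeFiles(ByVal strFileSource As String)
        ' reserveFileName and defaultEncoding are helpers defined elsewhere in the poster's code (not shown here).
        Using sr As New System.IO.StreamReader(strFileSource)
            Using sw As New System.IO.StreamWriter(reserveFileName(strFileSource, False, True), False, defaultEncoding)
                sr.Peek()
                ' Maps a line's hash code to the offsets of the lines already written with that hash code.
                Dim lines As New Generic.Dictionary(Of Integer, System.Collections.Generic.List(Of Long))
                ' Caveat: StreamReader reads the file in buffered blocks, so BaseStream.Position reports
                ' how far the underlying stream has been read, which is usually past the start of the
                ' current line. Offsets taken this way are approximate; exact line offsets require
                ' counting bytes manually.
                While sr.BaseStream.Position < sr.BaseStream.Length
                    Dim offset = sr.BaseStream.Position
                    Dim l As String = sr.ReadLine()
                    Dim nextOffset = sr.BaseStream.Position
                    Dim hash = l.GetHashCode
                    Do ' a trick to put the for each in a "nest" that we can exit from
                        If lines.ContainsKey(hash) Then
                            Using sr2 = New System.IO.StreamReader(strFileSource)
                                For Each offset1 In lines.Item(hash)
                                    sr2.BaseStream.Position = offset1
                                    ' Drop any data buffered before the seek so ReadLine starts at offset1.
                                    sr2.DiscardBufferedData()
                                    Dim l2 = sr2.ReadLine
                                    If l = l2 Then
                                        Exit Do ' duplicate found; the Using block still disposes sr2 on this exit
                                    End If
                                Next
                            End Using
                        Else
                            lines.Add(hash, New Generic.List(Of Long))
                        End If
                        lines.Item(hash).Add(offset)
                        sw.WriteLine(l)
                    Loop While False
                    sr.BaseStream.Position = nextOffset
                End While
            End Using
        End Using
    End Sub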
    