Language: vb.net File size: 1GB, and stuff.
Encoding of the text file: UTF-8 (so each character is represented by a different number of bytes)
Depending on how long the lines are, you may be able to compute an MD5 hash value for each line and store that in a HashSet:
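' Needs Imports System.IO, System.Linq, System.Security.Cryptography and System.Text
Using sr As New StreamReader("myFile")
    ' Holds the hex MD5 digest of every line seen so far
    Dim lines As New HashSet(Of String)
    Using hasher As MD5 = MD5.Create()
        ' ReadLine returns Nothing at end of file; comparing BaseStream.Position
        ' with Length stops early because StreamReader reads ahead in blocks
        Dim l As String = sr.ReadLine()
        While l IsNot Nothing
            Dim hash As String = String.Join(String.Empty, hasher.ComputeHash(Encoding.UTF8.GetBytes(l)).Select(Function(x) x.ToString("x2")))
            ' Add returns False when the hash is already in the set
            If Not lines.Add(hash) Then
                'Lines are not unique
                Exit While
            End If
            l = sr.ReadLine()
        End While
    End Using
End Using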
Untested, but this may be fast enough for your needs. I can't think of something much faster that still maintains some semblance of conciseness :)
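If memory becomes a concern with a 1GB file, one possible variation is to keep each digest as a 16-byte Guid instead of a 32-character hex string. The sketch below is untested against a file of that size and assumes UTF-8 input; the module name LineUniqueness, the function name AllLinesUnique and its filePath parameter are made up for illustration. As with the code above, two different lines that happen to share an MD5 digest would be reported as duplicates.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text

Module LineUniqueness
    ' Hypothetical helper: returns True when no two lines share an MD5 digest.
    Public Function AllLinesUnique(ByVal filePath As String) As Boolean
        Dim seen As New HashSet(Of Guid)
        Using hasher As MD5 = MD5.Create()
            Using sr As New StreamReader(filePath, Encoding.UTF8)
                Dim line As String = sr.ReadLine()
                While line IsNot Nothing
                    ' An MD5 digest is exactly 16 bytes, the size the Guid constructor expects
                    Dim key As New Guid(hasher.ComputeHash(Encoding.UTF8.GetBytes(line)))
                    ' Add returns False when the key is already in the set
                    If Not seen.Add(key) Then Return False
                    line = sr.ReadLine()
                End While
            End Using
        End Using
        Return True
    End Function
End Module
```

Returning from inside the Using blocks still disposes the reader and the hasher, so the early exit on the first duplicate is safe.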
This is the contemporary answer
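Public Sub makeUniqueForLargeFiles(ByVal strFileSource As String)
    ' reserveFileName and defaultEncoding are helpers defined elsewhere in the project
    Dim targetFileName = reserveFileName(strFileSource, False, True)
    Using fs As New System.IO.FileStream(strFileSource, System.IO.FileMode.Open, System.IO.FileAccess.Read)
        Using fs2 As New System.IO.FileStream(strFileSource, System.IO.FileMode.Open, System.IO.FileAccess.Read)
            Using sw As New System.IO.StreamWriter(targetFileName, False, defaultEncoding)
                ' Skip a UTF-8 byte order mark if the file starts with one
                Dim bom(2) As Byte
                If Not (fs.Read(bom, 0, 3) = 3 AndAlso bom(0) = &HEF AndAlso bom(1) = &HBB AndAlso bom(2) = &HBF) Then fs.Position = 0
                ' Maps a hash code to the byte offsets of the unique lines that produced it
                Dim lines As New Generic.Dictionary(Of Integer, System.Collections.Generic.List(Of Long))
                Dim offset = fs.Position
                Dim l = readLineFromStream(fs)
                While l IsNot Nothing
                    Dim hash = l.GetHashCode
                    Dim isDuplicate = False
                    If lines.ContainsKey(hash) Then
                        ' Same hash: re-read the earlier lines to rule out a hash collision
                        For Each offset1 In lines.Item(hash)
                            fs2.Position = offset1
                            If l = readLineFromStream(fs2) Then
                                isDuplicate = True
                                Exit For
                            End If
                        Next
                    Else
                        lines.Add(hash, New Generic.List(Of Long))
                    End If
                    If Not isDuplicate Then
                        lines.Item(hash).Add(offset)
                        sw.WriteLine(l)
                    End If
                    offset = fs.Position
                    l = readLineFromStream(fs)
                End While
            End Using
        End Using
    End Using
End Sub

' Reads one line of raw bytes from the stream's current position and decodes it as UTF-8.
' Reading bytes directly keeps the offsets exact; StreamReader reads ahead in blocks,
' so its BaseStream.Position is not the start of the line just returned.
' Handles LF and CRLF line endings; returns Nothing at end of file.
Private Function readLineFromStream(ByVal stream As System.IO.Stream) As String
    Dim b = stream.ReadByte()
    If b = -1 Then Return Nothing
    Dim bytes As New System.Collections.Generic.List(Of Byte)
    While b <> -1 AndAlso b <> 10
        bytes.Add(CByte(b))
        b = stream.ReadByte()
    End While
    If bytes.Count > 0 AndAlso bytes(bytes.Count - 1) = 13 Then bytes.RemoveAt(bytes.Count - 1)
    Return System.Text.Encoding.UTF8.GetString(bytes.ToArray())
End Function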