Question
I have been given the task of writing a VB program that reads a large .txt file (anywhere from 500 MB to 2 GB). Each line of the file usually starts with a 13-digit number, followed by lots of other info (e.g. "1578597500548 info info info info etc."). The user enters a 13-digit number, and my program searches the file for that number at the beginning of each line; if it is found, the full line is written to a new .txt file.
My current program works perfectly, but I'm noticing that the StreamReader/adding-to-the-list part takes up around 90% of the processing time, averaging around 27 seconds per run. Any ideas on how to speed it up? Here's what I have written.
Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    Dim wtr As IO.StreamWriter
    Dim listy As New List(Of String)
    Dim i = 0
    stpw.Reset()
    stpw.Start()
    'reading in file of large data 700mb and larger
    Using Reader As New StreamReader("G:\USER\FOLDER\tester.txt")
        While Reader.EndOfStream = False
            listy.Add(Reader.ReadLine)
        End While
    End Using
    'have a textbox which finds user query number
    Dim result = From n In listy
                 Where n.StartsWith(TextBox1.Text)
                 Select n
    'writes results found into new file
    wtr = New StreamWriter("G:\USER\searched-number.txt")
    For Each word As String In result
        wtr.WriteLine(word)
    Next
    wtr.Close()
    stpw.Stop()
    Debug.WriteLine(stpw.Elapsed.TotalMilliseconds)
    Application.Exit()
End Sub
UPDATE: I've taken the suggestion not to put everything into a list first and to check each line as it is read instead. That is about 5 seconds faster, but it still takes 23 seconds to complete. It is also writing out the line above the one that starts with the number I'm searching for, so please tell me where I'm going wrong. Thanks guys!
wtr = New StreamWriter("G:\Karl\searchednumber.txt")
Using Reader As New StreamReader("G:\Karl\AC\tester.txt")
    While Reader.EndOfStream = False
        lineIn = Reader.ReadLine
        If Reader.ReadLine.StartsWith(TextBox1.Text) Then
            wtr.WriteLine(lineIn)
        Else
            Continue While
        End If
    End While
    wtr.Close()
End Using
Answer 1:
Index the file when the program loads.
Create a Dictionary(Of ULong, Long), and when the program loads, read through the file. For each line, add an entry to the dictionary with the 13-digit value at the front of the line as the ULong key and the line's position in the file stream as the Long value.
Then, when a user puts in a key, you can check the dictionary, which will be almost instant, and then seek to the exact location on disk you need.
Building the file index at program start may take a few moments, but you'll only ever have to do it once. Right now, you either need to search through the entire thing every time a user wants to do a search, or keep several hundred megabytes of text file data in memory. Once you have the index, looking up a value in the dictionary and then seeking directly to it should appear to happen almost instantly.
I just saw this comment:
there could be more than 1 occurrences of a 13 digit number so must search the whole file.
Based on that, the index should be a Dictionary(Of ULong, List(Of Long)), where adding a value to an entry first creates a list instance if one doesn't already exist, then adds the new value to the list.
Here's a basic attempt, typed directly into the reply window without the aid of test data or Visual Studio, so it likely still contains several bugs:
Imports System.IO

Public Class MyFileIndexer
    Private initialCapacity As Integer = 1
    Private Property FilePath As String
    Private Index As Dictionary(Of ULong, List(Of Long))

    Public Sub New(filePath As String)
        Me.FilePath = filePath
        RebuildIndex()
    End Sub

    Public Sub RebuildIndex()
        Index = New Dictionary(Of ULong, List(Of Long))()
        Using sr As New StreamReader(FilePath)
            Dim Line As String = sr.ReadLine()
            Dim position As Long = 0
            While Line IsNot Nothing
                'Process this line; skip lines that don't start with a 13-digit number
                Dim key As ULong
                If Line.Length >= 13 AndAlso ULong.TryParse(Line.Substring(0, 13), key) Then
                    Dim item As List(Of Long) = Nothing
                    If Not Index.TryGetValue(key, item) Then
                        item = New List(Of Long)(initialCapacity)
                        Index.Add(key, item)
                    End If
                    item.Add(position)
                End If
                'Prep for next line.
                'sr.BaseStream.Position is no use here: StreamReader reads ahead in blocks,
                'so it points past the buffered data rather than at the start of the next line.
                'Count bytes instead. ASSUMPTION: ASCII or UTF-8 text, no byte order mark,
                'and CRLF (2-byte) line endings. Adjust this if your file differs.
                position += System.Text.Encoding.UTF8.GetByteCount(Line) + 2
                Line = sr.ReadLine()
            End While
        End Using
    End Sub

    'Expect key to be a 13-character numeric string
    Public Function Search(key As String) As List(Of String)
        'Will throw an exception if parsing fails. Be prepared for that.
        Dim realKey As ULong = ULong.Parse(key)
        Return Search(realKey)
    End Function

    'Returns Nothing when the key is not in the index
    Public Function Search(key As ULong) As List(Of String)
        Dim lines As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, lines) Then Return Nothing
        Dim result As New List(Of String)()
        Using sr As New StreamReader(FilePath)
            For Each position As Long In lines
                sr.BaseStream.Seek(position, SeekOrigin.Begin)
                'Drop the reader's buffered data so the next read starts at the new position
                sr.DiscardBufferedData()
                result.Add(sr.ReadLine())
            Next position
        End Using
        Return result
    End Function
End Class
'Somewhere public, when your application starts up:
Public Index As New MyFileIndexer("G:\USER\FOLDER\tester.txt")

Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
    Dim lines As List(Of String) = Nothing
    Try
        lines = Index.Search(TextBox1.Text)
    Catch
        'Do something here
    End Try
    If lines IsNot Nothing Then
        Using sw As New StreamWriter($"G:\USER\{TextBox1.Text}.txt")
            For Each line As String In lines
                sw.WriteLine(line)
            Next
        End Using
    End If
End Sub
And for fun, here's a generic version of the class that lets you supply your own key selector function, so it can index any file that stores a key in each line. I could see that being generally useful for, say, larger CSV data sets.
Public Class MyFileIndexer(Of TKey)
    Private initialCapacity As Integer = 1
    Private Property FilePath As String
    Private Index As Dictionary(Of TKey, List(Of Long))
    Private GetKey As Func(Of String, TKey)

    Public Sub New(filePath As String, keySelector As Func(Of String, TKey))
        Me.FilePath = filePath
        Me.GetKey = keySelector
        RebuildIndex()
    End Sub

    Public Sub RebuildIndex()
        Index = New Dictionary(Of TKey, List(Of Long))()
        Using sr As New StreamReader(FilePath)
            Dim Line As String = sr.ReadLine()
            Dim position As Long = 0
            While Line IsNot Nothing
                'The key selector runs on every line, including blank or malformed ones
                Dim key As TKey = GetKey(Line)
                Dim item As List(Of Long) = Nothing
                If Not Index.TryGetValue(key, item) Then
                    item = New List(Of Long)(initialCapacity)
                    Index.Add(key, item)
                End If
                item.Add(position)
                'Prep for next line.
                'sr.BaseStream.Position is no use here: StreamReader reads ahead in blocks,
                'so it points past the buffered data rather than at the start of the next line.
                'Count bytes instead. ASSUMPTION: ASCII or UTF-8 text, no byte order mark,
                'and CRLF (2-byte) line endings. Adjust this if your file differs.
                position += System.Text.Encoding.UTF8.GetByteCount(Line) + 2
                Line = sr.ReadLine()
            End While
        End Using
    End Sub

    'Returns Nothing when the key is not in the index
    Public Function Search(key As TKey) As List(Of String)
        Dim lines As List(Of Long) = Nothing
        If Not Index.TryGetValue(key, lines) Then Return Nothing
        Dim result As New List(Of String)()
        Using sr As New StreamReader(FilePath)
            For Each position As Long In lines
                sr.BaseStream.Seek(position, SeekOrigin.Begin)
                'Drop the reader's buffered data so the next read starts at the new position
                sr.DiscardBufferedData()
                result.Add(sr.ReadLine())
            Next position
        End Using
        Return result
    End Function
End Class
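As a rough sketch of how the generic class might be used, here is a call that indexes a CSV file by its first column. It relies on the MyFileIndexer(Of TKey) class above; the file name orders.csv and the key "A1001" are made up purely for illustration, and the selector assumes the key never contains a comma.

```vbnet
'Hypothetical file and key, for illustration only.
'The lambda is the key selector: it returns the first comma-separated field of each line.
Dim csvIndex As New MyFileIndexer(Of String)(
    "G:\USER\FOLDER\orders.csv",
    Function(line) line.Split(","c)(0))

'Search returns Nothing when the key is not in the index.
Dim matches As List(Of String) = csvIndex.Search("A1001")
If matches IsNot Nothing Then
    For Each row As String In matches
        Console.WriteLine(row)
    Next
End If
```

Note that a header row, if the file has one, simply ends up in the index under its first column name.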
Source: https://stackoverflow.com/questions/53520276/reading-large-text-file-very-slow