Java : How to find string patterns in a LARGE binary file?

风流意气都作罢 提交于 2019-12-24 18:02:34

问题


I'm trying to write a program that will read a VERY LARGE binary file and try to find the occurrence of 2 different strings and then print the indexes that matches the patterns. For the example's sake let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].

I was able to code this for small binary files because I was reading each character as a byte and then saving it in an Arraylist. Then starting from the beginning of the Arraylist, I was comparing the byte arraylist(byte[] data) with the byte[] pattern.

I need to find a way to do the same but WITHOUT writing the entire binary file in memory for comparison. That means I should be able to compare while reading each character (I should not save the entire binary file in memory). Assume the binary file only contains characters.

Any suggestions on how this can be achieved ? Thank you all in advance.


回答1:


Google "finite state machine".

Or, read the file one byte at a time, if the byte just doesn't match the first character of the search term, go on to the next byte. If it does match, now you're looking for the next character in the sequence. I.e., your state has gone from 0, to 1. If your state equals (or passes) the length of the search string, you found it!

Implementation/debugging left to the reader.




回答2:


Seems like you are really looking for Aho-Corasick string matching algorithm.

The algorithm builds an automaton from the given dictionary you have, and then allows you to find matches using a single scan of your input string.

The wikipedia article links to this java implementation




回答3:


There are specialised algorithms for this but let's try a simple one first.

You can start with making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you don't need to keep any bytes that are from earlier than your longest pattern.

So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.

As I said, there are algorithms more effective than this but it's a good start.




回答4:


Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you backtrack if it doesn't match at some point. If the sequence you're looking for is too large, you could save the offset and re-open the file for reading.

  • Working with streams: http://docs.oracle.com/javase/tutorial/essential/io/
  • String matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm

Or if you just want something to copy and paste you could look at this SO question.



来源:https://stackoverflow.com/questions/23350456/java-how-to-find-string-patterns-in-a-large-binary-file

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!