Searching for a sequence of Bytes in a Binary File with Java

前端 未结 4 2063
故里飘歌
故里飘歌 2020-12-01 05:08

I have a sequence of bytes that I have to search for in a set of Binary files using Java.

Example: I\'m searching for the byte sequence DEADBEEF (in hex

相关标签:
4条回答
  • 2020-12-01 05:53

    No, there is no built-in method to do that. But, directly copied from HERE (with two fixes applied to the original code):

    /**
     * Knuth-Morris-Pratt Algorithm for Pattern Matching
     */
    class KMPMatch {
        /**
         * Finds the first occurrence of the pattern in the text.
         */
        public static int indexOf(byte[] data, byte[] pattern) {
            if (data.length == 0) return -1;
    
            int[] failure = computeFailure(pattern);    
            int j = 0;
    
            for (int i = 0; i < data.length; i++) {
                while (j > 0 && pattern[j] != data[i]) {
                    j = failure[j - 1];
                }
                if (pattern[j] == data[i]) { j++; }
                if (j == pattern.length) {
                    return i - pattern.length + 1;
                }
            }
            return -1;
        }
    
        /**
         * Computes the failure function using a boot-strapping process,
         * where the pattern is matched against itself.
         */
        private static int[] computeFailure(byte[] pattern) {
            int[] failure = new int[pattern.length];
    
            int j = 0;
            for (int i = 1; i < pattern.length; i++) {
                while (j > 0 && pattern[j] != pattern[i]) {
                    j = failure[j - 1];
                }
                if (pattern[j] == pattern[i]) {
                    j++;
                }
                failure[i] = j;
            }
    
            return failure;
        }
    }
    
    0 讨论(0)
  • 2020-12-01 06:02

    For those who prefer libraries, there is an implementation (source below) of the Knuth-Morris-Pratt algorithm in Twitter's Elephant-Bird open source library (Apache license).

    You can find the library on Github at: https://github.com/twitter/elephant-bird

    package com.twitter.elephantbird.util;
    
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    
    /**
     * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
     * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    public class StreamSearcher {
    
      protected byte[] pattern_;
      protected int[] borders_;
    
      // An upper bound on pattern length for searching. Throws exception on longer patterns
      public static final int MAX_PATTERN_LENGTH = 1024;
    
      public StreamSearcher(byte[] pattern) {
        setPattern(pattern);
      }
    
      /**
       * Sets a new pattern for this StreamSearcher to use.
       * @param pattern
       *          the pattern the StreamSearcher will look for in future calls to search(...)
       */
      public void setPattern(byte[] pattern) {
        if (pattern.length > MAX_PATTERN_LENGTH) {
          throw new IllegalArgumentException("The maximum pattern length is " + MAX_PATTERN_LENGTH);
        }
    
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
      }
    
      /**
       * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
       * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
       * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
       * another reasonable default, i.e. leave the stream unchanged.
       *
       * @return bytes consumed if found, -1 otherwise.
       * @throws IOException
       */
      public long search(InputStream stream) throws IOException {
        long bytesRead = 0;
    
        int b;
        int j = 0;
    
        while ((b = stream.read()) != -1) {
          bytesRead++;
    
          while (j >= 0 && (byte)b != pattern_[j]) {
            j = borders_[j];
          }
          // Move to the next character in the pattern.
          ++j;
    
          // If we've matched up to the full pattern length, we found it.  Return,
          // which will automatically save our position in the InputStream at the point immediately
          // following the pattern match.
          if (j == pattern_.length) {
            return bytesRead;
          }
        }
    
        // No dice, Note that the stream is now completely consumed.
        return -1;
      }
    
      /**
       * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
       * and aids in implementation of the Knuth-Moore-Pratt string search.
       * <p>
       * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
       */
      protected void preProcess() {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length) {
          while (j >= 0 && pattern_[i] != pattern_[j]) {
            j = borders_[j];
          }
          borders_[++i] = ++j;
        }
      }
    }
    
    0 讨论(0)
  • 2020-12-01 06:05
    private int bytesIndexOf(byte[] source, byte[] search, int fromIndex) {
        boolean find = false;
        int i;
        for (i = fromIndex; i < (source.length - search.length); i++) {
            if (source[i] == search[0]) {
                find = true;
                for (int j = 0; j < search.length; j++) {
                    if (source[i + j] != search[j]) {
                        find = false;
                    }
                }
            }
            if (find) {
                break;
            }
        }
        if (!find) {
            return -1;
        }
        return i;
    }
    
    0 讨论(0)
  • 2020-12-01 06:07

    You can find sequence of bytes from giga-bytes order file using bigdoc.

    Lib and Example here on Github at: https://github.com/riversun/bigdoc

    package org.example;
    
    import java.io.File;
    import java.util.List;
    
    import org.riversun.bigdoc.bin.BigFileSearcher;
    
    public class Example {
    
        public static void main(String[] args) throws Exception {
    
            byte[] searchBytes = "hello world.".getBytes("UTF-8");
    
            File file = new File("/var/tmp/yourBigfile.bin");
    
            BigFileSearcher searcher = new BigFileSearcher();
    
            List<Long> findList = searcher.searchBigFile(file, searchBytes);
    
            System.out.println("positions = " + findList);
        }
    }
    

    If you want to search it on memory,check this. Examples here on Github at: https://github.com/riversun/finbin

     import java.util.List;
    
     import org.riversun.finbin.BigBinarySearcher;
    
     public class Example {
    
         public static void main(String[] args) throws Exception {
    
             BigBinarySearcher bbs = new BigBinarySearcher();
    
             byte[] iamBigSrcBytes = "Hello world.It's a small world.".getBytes("utf-8");
    
             byte[] searchBytes = "world".getBytes("utf-8");
    
             List<Integer> indexList = bbs.searchBytes(iamBigSrcBytes, searchBytes);
    
             System.out.println("indexList=" + indexList);
         }
     }
    

    Returns all the matched positions in the array of bytes

    It also can withstand a large array of bytes:)

    0 讨论(0)
提交回复
热议问题