Filter (search and replace) array of bytes in an InputStream

清歌不尽 2020-11-30 05:49

I have an InputStream that takes an HTML file as its input. I have to get the bytes from the input stream.

I have a string: "XYZ". I'd like

6 Answers
  • 2020-11-30 06:20

    Not sure you have chosen the best approach to solve your problem.

    That said, I don't like to (and have as policy not to) answer questions with "don't" so here goes...

    Have a look at FilterInputStream.

    From the documentation:

    A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.


    It was a fun exercise to write it up. Here's a complete example for you:

    import java.io.*;
    import java.util.*;
    
    class ReplacingInputStream extends FilterInputStream {
    
        LinkedList<Integer> inQueue = new LinkedList<Integer>();
        LinkedList<Integer> outQueue = new LinkedList<Integer>();
        final byte[] search, replacement;
    
        protected ReplacingInputStream(InputStream in,
                                       byte[] search,
                                       byte[] replacement) {
            super(in);
            this.search = search;
            this.replacement = replacement;
        }
    
        private boolean isMatchFound() {
            Iterator<Integer> inIter = inQueue.iterator();
            for (int i = 0; i < search.length; i++)
                // Mask to 0..255: the queue holds unsigned values from read(), bytes are signed.
                if (!inIter.hasNext() || (search[i] & 0xFF) != inIter.next())
                    return false;
            return true;
        }
    
        private void readAhead() throws IOException {
            // Work up some look-ahead.
            while (inQueue.size() < search.length) {
                int next = super.read();
                inQueue.offer(next);
                if (next == -1)
                    break;
            }
        }
    
        @Override
        public int read() throws IOException {    
            // Only look for more input when no output byte is already pending.
            if (outQueue.isEmpty()) {
                readAhead();
    
                if (isMatchFound()) {
                    for (int i = 0; i < search.length; i++)
                        inQueue.remove();
    
                    for (byte b : replacement)
                        // Mask so a 0xFF replacement byte is not mistaken for end of stream (-1).
                        outQueue.offer(b & 0xFF);
                } else
                    outQueue.add(inQueue.remove());
            }
    
            return outQueue.remove();
        }
    
        // TODO: Override the other read methods.
    }
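
    The TODO above matters because FilterInputStream's bulk read(byte[], int, int) delegates straight to the wrapped stream, so callers that read into a buffer would bypass a per-byte filter. Here is a minimal sketch of routing the bulk read and skip through read(). The UpperCasingInputStream class and its demo method are hypothetical stand-ins, not part of the answer's code, and the loop favors clarity over throughput.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical per-byte filter: upper-cases ASCII letters.
class UpperCasingInputStream extends FilterInputStream {

    UpperCasingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return (b >= 'a' && b <= 'z') ? b - 32 : b;
    }

    // Funnel bulk reads through read() so the transformation is never skipped.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (off < 0 || len < 0 || len > buf.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + n++] = (byte) b;
        }
        return n == 0 ? -1 : n;
    }

    // skip() must also consume filtered bytes, not raw ones.
    @Override
    public long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n && read() != -1) {
            skipped++;
        }
        return skipped;
    }

    // mark/reset would desynchronize any filter state, so report it as unsupported.
    @Override
    public boolean markSupported() {
        return false;
    }

    // Reads through the bulk API; read(byte[]) calls read(buf, 0, buf.length).
    static String demo(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new UpperCasingInputStream(new ByteArrayInputStream(input))) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            return new String(buf, 0, Math.max(n, 0), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    The same three overrides should carry over to a replacing stream, since they only depend on read() being correct.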
    

    Example Usage

    class Test {
        public static void main(String[] args) throws Exception {
    
            byte[] bytes = "hello xyz world.".getBytes("UTF-8");
    
            ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
    
            byte[] search = "xyz".getBytes("UTF-8");
            byte[] replacement = "abc".getBytes("UTF-8");
    
            InputStream ris = new ReplacingInputStream(bis, search, replacement);
    
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
    
            int b;
            while (-1 != (b = ris.read()))
                bos.write(b);
    
            // Decode with the same charset that was used to encode the input.
            System.out.println(new String(bos.toByteArray(), "UTF-8"));
    
        }
    }
    

    Given the bytes for the string "hello xyz world." it prints:

    hello abc world.
    
  • 2020-11-30 06:26

    There isn't any built-in functionality for search-and-replace on byte streams (InputStream).

    And, a method for completing this task efficiently and correctly is not immediately obvious. I have implemented the Boyer-Moore algorithm for streams, and it works well, but it took some time. Without an algorithm like this, you have to resort to a brute-force approach where you look for the pattern starting at every position in the stream, which can be slow.
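
    To give an idea of what a skip-table search looks like, here is a rough sketch of the Horspool simplification of Boyer-Moore over an in-memory byte array. This is not the stream implementation mentioned above (that code is not shown here); the class and method names are made up for illustration, and a stream version would additionally need to buffer at least a pattern's worth of bytes.

```java
import java.util.Arrays;

// Hypothetical helper: Boyer-Moore-Horspool search over a byte array.
class HorspoolSearch {

    /** Returns the index of the first occurrence of pattern in data, or -1 if absent. */
    static int indexOf(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (m == 0) {
            return 0;
        }

        // For each byte value, how far the window may shift when that byte
        // is aligned with the last position of the pattern.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= data.length - m) {
            // Compare right to left.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Skip ahead based on the byte under the pattern's last position.
            pos += shift[data[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

    For example, searching "hello xyz world" for "xyz" inspects only three window positions before finding the match at index 6, whereas the brute-force scan tries every starting position.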

    Even if you decode the HTML as text, using a regular expression to match patterns might be a bad idea, since HTML is not a "regular" language.

    So, even though you've run into some difficulties, I suggest you pursue your original approach of parsing the HTML as a document. While you are having trouble with the character encoding, it will probably be easier, in the long run, to fix the right solution than it will be to jury-rig the wrong solution.

  • 2020-11-30 06:30

    The following approach will work, but I don't know how big the performance impact is.

    1. Wrap the InputStream with an InputStreamReader,
    2. wrap the InputStreamReader with a FilterReader that replaces the strings, then
    3. wrap the FilterReader with a ReaderInputStream.

    It is crucial to choose the appropriate encoding, otherwise the content of the stream will become corrupted.
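
    Steps 1 and 2 might look roughly like the sketch below, using only JDK classes. ReplacingReader and its demo method are hypothetical names invented for this example, the charset is assumed to be UTF-8, and the window-based matching is just one simple way to do the replacement. Step 3 is left out because ReaderInputStream is not a JDK class; Apache Commons IO ships a class of that name.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical FilterReader that replaces every occurrence of `search`.
class ReplacingReader extends FilterReader {
    private final String search;
    private final String replacement;
    private final StringBuilder window = new StringBuilder();  // look-ahead, at most search.length() chars
    private final StringBuilder pending = new StringBuilder(); // replacement chars not yet returned
    private boolean eof;

    ReplacingReader(Reader in, String search, String replacement) {
        super(in);
        if (search.isEmpty()) {
            throw new IllegalArgumentException("search must not be empty");
        }
        this.search = search;
        this.replacement = replacement;
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (pending.length() > 0) {
                char c = pending.charAt(0);
                pending.deleteCharAt(0);
                return c;
            }
            // Fill the look-ahead window up to the length of the search string.
            while (!eof && window.length() < search.length()) {
                int c = in.read();
                if (c == -1) {
                    eof = true;
                } else {
                    window.append((char) c);
                }
            }
            if (window.length() == 0) {
                return -1;
            }
            if (window.length() == search.length() && window.indexOf(search) == 0) {
                // Full match: drop it and queue the replacement instead.
                window.setLength(0);
                pending.append(replacement);
                continue;
            }
            char c = window.charAt(0);
            window.deleteCharAt(0);
            return c;
        }
    }

    // Route bulk reads through read() so they see the replaced text too.
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Step 1 (decode bytes with an explicit charset) plus step 2 (filter the chars).
    static String demo(byte[] input, String search, String replacement) {
        try (Reader r = new ReplacingReader(
                new InputStreamReader(new ByteArrayInputStream(input), StandardCharsets.UTF_8),
                search, replacement)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Whatever adapter is used for step 3 has to encode with the same charset that step 1 decoded with, otherwise the bytes coming out will not match the bytes going in.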

    If you want to use regular expressions to replace the strings, then you can use Streamflyer, a tool of mine, which is a convenient alternative to FilterReader. You will find an example for byte streams on the webpage of Streamflyer. Hope this helps.

  • 2020-11-30 06:35

    I came up with this simple piece of code when I needed to serve a template file in a Servlet replacing a certain keyword by a value. It should be pretty fast and low on memory. Then using Piped Streams I guess you can use it for all sorts of things.

    /JC

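    // Needs: import java.io.*; import java.nio.charset.StandardCharsets;
    public static void replaceStream(InputStream in, OutputStream out, String search, String replace) throws IOException
    {
        // Assumes UTF-8; pass the stream's real encoding if it differs.
        replaceStream(new InputStreamReader(in, StandardCharsets.UTF_8),
                      new OutputStreamWriter(out, StandardCharsets.UTF_8), search, replace);
    }
    
    public static void replaceStream(Reader in, Writer out, String search, String replace) throws IOException
    {
        char[] searchChars = search.toCharArray();
        int[] buffer = new int[searchChars.length];
    
        int x, r, si = 0, sm = searchChars.length;
        // read() returns -1 at end of stream; 0 is a valid char, so test against -1.
        while ((r = in.read()) != -1) {
    
            if (searchChars[si] == r) {
                // The char matches our pattern
                buffer[si++] = r;
    
                if (si == sm) {
                    // We have reached a matching string
                    out.write(replace);
                    si = 0;
                }
            } else if (si > 0) {
                // No match and buffered char(s), empty buffer
                for (x = 0; x < si; x++) {
                    out.write(buffer[x]);
                }
                si = 0;
                // The current char may itself start a new match ("xxyz" vs "xyz").
                // Patterns with a repeated prefix (e.g. "aab") can still be missed.
                if (searchChars[0] == r) {
                    buffer[si++] = r;
                } else {
                    out.write(r);
                }
            } else {
                // No match and nothing buffered, just pass the char forward
                out.write(r);
            }
        }
    
        // Empty buffer
        for (x = 0; x < si; x++) {
            out.write(buffer[x]);
        }
        // Without this, chars held inside an OutputStreamWriter are lost.
        out.flush();
    }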
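
    As for the Piped Streams idea, a rough sketch of how a push-style method like the one above could be exposed as an InputStream is to run it on a worker thread that writes into a PipedOutputStream. Everything named here (PipedFilterSketch, StreamFilter, asInputStream, demo) is hypothetical, and the upper-casing filter in demo is only a stand-in for the replacing method.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PipedFilterSketch {

    // Hypothetical shape of a push-style filter, e.g. (in, out) -> replaceStream(in, out, "a", "b").
    interface StreamFilter {
        void apply(InputStream in, OutputStream out) throws IOException;
    }

    // Runs the filter on a worker thread and returns the read end of the pipe.
    static InputStream asInputStream(InputStream source, StreamFilter filter) throws IOException {
        PipedInputStream pipeIn = new PipedInputStream();
        PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);
        Thread worker = new Thread(() -> {
            // Closing the write end is what signals end of stream to the reader.
            try (OutputStream out = pipeOut) {
                filter.apply(source, out);
            } catch (IOException e) {
                // Sketch only: real code should hand this error to the reading side.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return pipeIn;
    }

    static String demo(String text) {
        // Stand-in filter that upper-cases ASCII; a replacing filter would plug in the same way.
        StreamFilter upper = (in, out) -> {
            for (int b; (b = in.read()) != -1; ) {
                out.write(Character.toUpperCase(b));
            }
        };
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = asInputStream(new ByteArrayInputStream(input), upper)) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            for (int b; (b = in.read()) != -1; ) {
                bos.write(b);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    The reader and the writer must be on different threads, because a pipe's buffer is bounded and a single thread doing both can deadlock once it fills up.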
    
  • 2020-11-30 06:41

    I needed a solution to this, but found the answers here incurred too much memory and/or CPU overhead. The below solution significantly outperforms the others here in these terms based on simple benchmarking.

    This solution is especially memory-efficient, incurring no measurable cost even with >GB streams.

    That said, this is not a zero-CPU-cost solution. The CPU/processing-time overhead is probably reasonable for all but the most demanding/resource-sensitive scenarios, but the overhead is real and should be considered when evaluating the worthiness of employing this solution in a given context.

    In my case, our max real-world file size that we are processing is about 6MB, where we see added latency of about 170ms with 44 URL replacements. This is for a Zuul-based reverse-proxy running on AWS ECS with a single CPU share (1024). For most of the files (under 100KB), the added latency is sub-millisecond. Under high-concurrency (and thus CPU contention), the added latency could increase, however we are currently able to process hundreds of the files concurrently on a single node with no humanly-noticeable latency impact.

    The solution we are using:

    import java.io.IOException;
    import java.io.InputStream;
    
    public class TokenReplacingStream extends InputStream {
    
        private final InputStream source;
        private final byte[] oldBytes;
        private final byte[] newBytes;
        private int tokenMatchIndex = 0;
        private int bytesIndex = 0;
        private boolean unwinding;
        private int mismatch;
        private int numberOfTokensReplaced = 0;
    
        public TokenReplacingStream(InputStream source, byte[] oldBytes, byte[] newBytes) {
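            // Assertions are disabled by default, so check explicitly.
            if (oldBytes.length == 0) {
                throw new IllegalArgumentException("oldBytes must not be empty");
            }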
            this.source = source;
            this.oldBytes = oldBytes;
            this.newBytes = newBytes;
        }
    
        @Override
        public int read() throws IOException {
    
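            if (unwinding) {
                if (bytesIndex < tokenMatchIndex) {
                    // Mask to 0..255: read() must not return a negative byte value.
                    return oldBytes[bytesIndex++] & 0xFF;
                } else {
                    bytesIndex = 0;
                    tokenMatchIndex = 0;
                    unwinding = false;
                    if (mismatch != (oldBytes[0] & 0xFF)) {
                        return mismatch;
                    }
                    // The mismatched byte may itself begin a new match.
                    tokenMatchIndex = 1;
                }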
            } else if (tokenMatchIndex == oldBytes.length) {
                if (bytesIndex == newBytes.length) {
                    bytesIndex = 0;
                    tokenMatchIndex = 0;
                    numberOfTokensReplaced++;
                } else {
                    // Mask to 0..255 so high bytes are not returned as negative values.
                    return newBytes[bytesIndex++] & 0xFF;
                }
            }
    
            int b = source.read();
            // b is 0..255 (or -1); compare against the unsigned value of the token byte.
            if (b == (oldBytes[tokenMatchIndex] & 0xFF)) {
                tokenMatchIndex++;
            } else if (tokenMatchIndex > 0) {
                mismatch = b;
                unwinding = true;
            } else {
                return b;
            }
    
            return read();
    
        }
    
        @Override
        public void close() throws IOException {
            source.close();
        }
    
        public int getNumberOfTokensReplaced() {
            return numberOfTokensReplaced;
        }
    
    }
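
    One detail worth checking in any hand-written read(): InputStream.read() has to return a value in 0..255, or -1 at end of stream, while Java's byte type is signed. Returning or comparing raw array elements therefore goes wrong for bytes of 0x80 and above, which is exactly what multi-byte UTF-8 text contains. A small illustration (the class and method names are made up for this example):

```java
// Hypothetical demo of the signed-byte pitfall in read() implementations.
class SignedByteSketch {

    // What "return bytes[i];" effectively does: sign-extends the byte.
    static int unmasked(byte b) {
        return b;
    }

    // What read() should return: the unsigned value of the byte.
    static int masked(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte high = (byte) 0xFF;
        System.out.println(unmasked(high)); // -1, which a caller reads as end of stream
        System.out.println(masked(high));   // 255

        int fromStream = 0xE9; // a value as returned by InputStream.read()
        byte inToken = (byte) 0xE9;
        System.out.println(fromStream == inToken);          // false
        System.out.println(fromStream == (inToken & 0xFF)); // true
    }
}
```

    With pure ASCII tokens and input neither problem shows up, which is probably why it is easy to miss in testing.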
    
  • 2020-11-30 06:43

    I needed something like this as well and decided to roll my own solution instead of using the example above by @aioobe. Have a look at the code. You can pull the library from maven central, or just copy the source code.

    This is how you use it. In this case, I'm using a nested instance to replace two patterns to fix DOS and Mac line endings.

    new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
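
    The nesting order matters here: the inner stream runs first, so the two-byte DOS sequence (carriage return followed by line feed, "\r\n") has to be handled before the lone "\r". A quick way to convince yourself is the sketch below, which uses String.replace as a stand-in for the two nested streams; the class and method names are made up for illustration.

```java
// Hypothetical demo: why "\r\n" must be replaced before the lone "\r".
class LineEndingOrderSketch {

    // Same order as the nested streams: two-character sequence first.
    static String normalize(String s) {
        return s.replace("\r\n", "\n").replace("\r", "\n");
    }

    // Reversed order: every DOS line ending becomes two newlines.
    static String wrongOrder(String s) {
        return s.replace("\r", "\n").replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\rc";
        System.out.println(normalize(mixed).equals("a\nb\nc"));    // true
        System.out.println(wrongOrder(mixed).equals("a\n\nb\nc")); // true
    }
}
```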

    Here's the full source code:

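    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    // Validate is assumed to be the Apache Commons Lang class of that name.
    import org.apache.commons.lang3.Validate;
    
    /**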
     * Simple FilterInputStream that can replace occurrences of bytes with something else.
     */
    public class ReplacingInputStream extends FilterInputStream {
    
        // while matching, this is where the bytes go.
        int[] buf=null;
        int matchedIndex=0;
        int unbufferIndex=0;
        int replacedIndex=0;
    
        private final byte[] pattern;
        private final byte[] replacement;
        private State state=State.NOT_MATCHED;
    
        // simple state machine for keeping track of what we are doing
        private enum State {
            NOT_MATCHED,
            MATCHING,
            REPLACING,
            UNBUFFER
        }
    
        /**
         * @param is input
         * @return nested replacing stream that replaces \r\n (DOS) and \r (MAC) line endings with UNIX ones "\n".
         */
        public static InputStream newLineNormalizingInputStream(InputStream is) {
            return new ReplacingInputStream(new ReplacingInputStream(is, "\r\n", "\n"), "\r", "\n");
        }
    
        /**
         * Replace occurrences of pattern in the input. Note: input is assumed to be UTF-8 encoded. If that is not the case, use the byte[] based pattern and replacement.
         * @param in input
         * @param pattern pattern to replace.
         * @param replacement the replacement or null
         */
        public ReplacingInputStream(InputStream in, String pattern, String replacement) {
            this(in,pattern.getBytes(StandardCharsets.UTF_8), replacement==null ? null : replacement.getBytes(StandardCharsets.UTF_8));
        }
    
        /**
         * Replace occurrences of pattern in the input.
         * @param in input
         * @param pattern pattern to replace
         * @param replacement the replacement or null
         */
        public ReplacingInputStream(InputStream in, byte[] pattern, byte[] replacement) {
            super(in);
            Validate.notNull(pattern);
            Validate.isTrue(pattern.length>0, "pattern length should be > 0", pattern.length);
            this.pattern = pattern;
            this.replacement = replacement;
            // we will never match more than the pattern length
            buf = new int[pattern.length];
        }
    
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // copy of parent logic; we need to call our own read() instead of super.read(), which delegates instead of calling our read
            if (b == null) {
                throw new NullPointerException();
            } else if (off < 0 || len < 0 || len > b.length - off) {
                throw new IndexOutOfBoundsException();
            } else if (len == 0) {
                return 0;
            }
    
            int c = read();
            if (c == -1) {
                return -1;
            }
            b[off] = (byte)c;
    
            int i = 1;
            try {
                for (; i < len ; i++) {
                    c = read();
                    if (c == -1) {
                        break;
                    }
                    b[off + i] = (byte)c;
                }
            } catch (IOException ee) {
            }
            return i;
    
        }
    
        @Override
        public int read(byte[] b) throws IOException {
            // call our own read
            return read(b, 0, b.length);
        }
    
        @Override
        public int read() throws IOException {
            // use a simple state machine to figure out what we are doing
            int next;
            switch (state) {
            case NOT_MATCHED:
                // we are not currently matching, replacing, or unbuffering
                next=super.read();
                // mask to 0..255: next is unsigned, pattern bytes are signed
                if((pattern[0] & 0xFF) == next) {
                    // clear whatever was there
                    buf=new int[pattern.length];
                    // make sure we start at 0
                    matchedIndex=0;
    
                    buf[matchedIndex++]=next;
                    if(pattern.length == 1) {
                        // edge case: a one-byte pattern is already fully matched
                        if(replacement==null || replacement.length==0) {
                            // nothing to emit; drop the byte and stay in NOT_MATCHED
                            matchedIndex=0;
                        } else {
                            state=State.REPLACING;
                            // reset replace counter
                            replacedIndex=0;
                        }
                    } else {
                        // pattern longer than one byte: keep matching
                        state=State.MATCHING;
                    }
                    // recurse to continue matching
                    return read();
                } else {
                    return next;
                }
            case MATCHING:
                // the previous bytes matched part of the pattern
                next=super.read();
                if((pattern[matchedIndex] & 0xFF)==next) {
                    buf[matchedIndex++]=next;
                    if(matchedIndex==pattern.length) {
                        // we've found a full match!
                        if(replacement==null || replacement.length==0) {
                            // the replacement is empty, go straight to NOT_MATCHED
                            state=State.NOT_MATCHED;
                            matchedIndex=0;
                        } else {
                            // start replacing
                            state=State.REPLACING;
                            replacedIndex=0;
                        }
                    }
                } else {
                    // mismatch -> unbuffer
                    buf[matchedIndex++]=next;
                    state=State.UNBUFFER;
                    unbufferIndex=0;
                }
                return read();
            case REPLACING:
                // we've fully matched the pattern and are returning bytes from the replacement
                // mask to 0..255 so high bytes are not returned as negative values
                next=replacement[replacedIndex++] & 0xFF;
                if(replacedIndex==replacement.length) {
                    state=State.NOT_MATCHED;
                    replacedIndex=0;
                }
                return next;
            case UNBUFFER:
                // we partially matched the pattern before encountering a non matching byte
                // we need to serve up the buffered bytes before we go back to NOT_MATCHED
                next=buf[unbufferIndex++];
                if(unbufferIndex==matchedIndex) {
                    state=State.NOT_MATCHED;
                    matchedIndex=0;
                }
                return next;
    
            default:
                throw new IllegalStateException("no such state " + state);
            }
        }
    
        @Override
        public String toString() {
            return state.name() + " " + matchedIndex + " " + replacedIndex + " " + unbufferIndex;
        }
    
    }
    