Creating a sliding window iterator of slices of chars from a String

失恋的感觉 2021-01-17 23:57

I am looking for the best way to go from a String to a Windows<T> iterator using the windows function provided for slices.

I understa

3 answers
  •  生来不讨喜
    2021-01-18 00:46

    The problem you are facing is that a String is really represented as something like a Vec<u8> under the hood, with some APIs that let you access chars. In UTF-8, the representation of a code point can be anything from 1 to 4 bytes, and the encoded code points are packed together for space efficiency.

    The only slice you could get directly of an entire String, without copying everything, would be a &[u8], but you wouldn't know if the bytes corresponded to whole or just parts of code points.
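
To make the boundary problem concrete, here is a minimal sketch. The helper names utf8_widths and byte_windows_split_chars are made up for illustration; they only use standard library calls (char::len_utf8, slice windows, std::str::from_utf8).

```rust
/// Returns the UTF-8 encoded length, in bytes, of each char in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// Returns true if at least one `win_size`-byte window of `s` starts or ends
/// in the middle of a code point, so that it is not valid UTF-8 on its own.
/// Note: slice `windows` panics if `win_size` is 0.
fn byte_windows_split_chars(s: &str, win_size: usize) -> bool {
    s.as_bytes()
        .windows(win_size)
        .any(|w| std::str::from_utf8(w).is_err())
}
```

For a pure ASCII string every char is 1 byte wide and no byte window is ever split, which is why the byte-based approach only becomes a problem once multi-byte characters appear.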

    The char type corresponds exactly to a code point, and therefore has a size of 4 bytes, so that it can accommodate any possible value. So, if you build a slice of char by copying from a String, the result could be up to 4 times larger.
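
For comparison, the copying approach described above could look roughly like this. It is only a sketch: char_vec_windows and storage_sizes are hypothetical helper names, and the windows are converted back to String so the result owns its data.

```rust
use std::mem::size_of;

/// Copies `src` into a `Vec<char>` and collects every window of `win_size`
/// chars, turning each window back into an owned `String`.
fn char_vec_windows(src: &str, win_size: usize) -> Vec<String> {
    let chars: Vec<char> = src.chars().collect();
    chars
        .windows(win_size) // panics if win_size == 0
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Bytes used by `src` as UTF-8 versus as a slice of `char`.
fn storage_sizes(src: &str) -> (usize, usize) {
    (src.len(), src.chars().count() * size_of::<char>())
}
```

This is the simplest way to get real slice windows over chars, at the cost of one temporary allocation that is 4 bytes per char regardless of how compact the UTF-8 encoding was.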

    To avoid making a potentially large, temporary memory allocation, you should consider a more lazy approach – iterate through the String, making slices at exactly the char boundaries. Something like this:

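    // Yields every window of `win_size` consecutive chars as a `&str` slice of `src`.
    // Note: `win_size` must be at least 1, otherwise `win_size - 1` underflows.
    fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
        src.char_indices()
            .flat_map(move |(from, _)| {
                // Find the last char of the window that starts at byte offset `from`.
                src[from ..].char_indices()
                    .nth(win_size - 1)
                    .map(|(to, c)| {
                        &src[from .. from + to + c.len_utf8()]
                    })
            })
    }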
    

    This will give you an iterator where the items are &str, each with 3 chars:

    let tst = String::from("abcdefg");
    let windows = char_windows(&tst, 3);
    
    for win in windows {
        println!("{:?}", win);
    }
    

    The nice thing about this approach is that it hasn't done any copying at all - each &str produced by the iterator is still a slice into the original source String.
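
One possible refinement, offered as a sketch rather than as part of the answer above: the closure in char_windows rescans win_size chars for every window. Pairing each start boundary with the boundary win_size chars later avoids the rescan. The name char_windows_zip is hypothetical.

```rust
/// Yields every window of `win_size` consecutive chars as a `&str` slice of
/// `src`, walking the string with two iterators instead of rescanning.
/// With `win_size == 0` this yields one empty slice per char rather than panicking.
fn char_windows_zip<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    // Byte offset at which each char starts.
    let starts = src.char_indices().map(|(i, _)| i);
    // The same offsets plus the end of the string, shifted by `win_size` chars,
    // so each item is the byte offset where the matching window ends.
    let ends = src
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(src.len()))
        .skip(win_size);
    starts.zip(ends).map(move |(from, to)| &src[from..to])
}
```

Because every item is produced by slicing src at char boundaries, the windows still borrow from the original string; the first window, for example, starts at the same address as the source.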


    All of that complexity exists because Rust strings are always UTF-8 encoded. If you know for certain that your input string contains no multi-byte characters, you can treat it as ASCII bytes, and taking slices becomes easy:

    let tst = String::from("abcdefg");
    let inter = tst.as_bytes();
    // `windows` yields overlapping `&[u8]` slices of length 3.
    let windows = inter.windows(3);
    

    However, you now have slices of bytes, and you'll need to turn them back into strings to do anything with them:

    for win in windows {
        println!("{:?}", String::from_utf8_lossy(win));
    }
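
If the "only ASCII" assumption should be checked rather than trusted, a sketch along these lines is possible. The function name ascii_windows is hypothetical; str::is_ascii and std::str::from_utf8 are standard library calls.

```rust
/// Returns the byte windows of `src` as `&str` slices, or `None` if `src`
/// contains any non-ASCII character.
fn ascii_windows(src: &str, win_size: usize) -> Option<Vec<&str>> {
    if !src.is_ascii() {
        return None;
    }
    Some(
        src.as_bytes()
            .windows(win_size) // panics if win_size == 0
            // Every byte is ASCII, so each window is valid UTF-8 and this cannot fail.
            .map(|w| std::str::from_utf8(w).expect("ASCII is valid UTF-8"))
            .collect(),
    )
}
```

Unlike String::from_utf8_lossy, this never substitutes replacement characters: it either returns borrowed &str windows or refuses the input up front.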
    
