Creating a sliding window iterator of slices of chars from a String

失恋的感觉 2021-01-17 23:57

I am looking for the best way to go from String to Windows using the windows function provided for slices.

I understa

3 answers
  • 2021-01-18 00:43

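    You can use itertools to walk over the windows of any iterator as tuples. In the itertools 0.7.8 used here, tuple_windows supports a width of up to 4 (newer releases support wider tuples):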

    extern crate itertools; // 0.7.8
    
    use itertools::Itertools;
    
    fn main() {
        let input = "日本語";
    
        for (a, b) in input.chars().tuple_windows() {
            println!("{}, {}", a, b);
        }
    }
    

    See also:

    • Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?
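    If you only need pairs and would rather not pull in a crate, the same effect can be sketched with the standard library by zipping the char iterator with a copy of itself shifted by one. The helper name char_pairs is made up for this illustration and is not part of itertools or std:

```rust
// Pair each char with its successor using only the standard library.
// `char_pairs` is a hypothetical helper name chosen for this sketch.
fn char_pairs(input: &str) -> impl Iterator<Item = (char, char)> + '_ {
    // The second iterator starts one char later, so zip yields (c0, c1), (c1, c2), ...
    input.chars().zip(input.chars().skip(1))
}

fn main() {
    for (a, b) in char_pairs("日本語") {
        println!("{}, {}", a, b);
    }
}
```

    This walks the string twice in lockstep, which is fine for pairs but gets clumsy for wider windows, where tuple_windows is more convenient.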
  • 2021-01-18 00:46

    The problem you are facing is that a String is represented as something like a Vec<u8> under the hood, with APIs that let you access chars. In UTF-8, a code point is encoded in anywhere from 1 to 4 bytes, and the encodings are packed together for space efficiency.

    The only slice you could get directly of an entire String, without copying everything, would be a &[u8], but you wouldn't know if the bytes corresponded to whole or just parts of code points.

    The char type corresponds exactly to a code point (more precisely, a Unicode scalar value), and therefore has a size of 4 bytes, so that it can accommodate any possible value. So, if you build a slice of char by copying from a String, the result could be up to 4 times larger.
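    To make the size difference concrete, here is a small sketch that compares the bytes used by the UTF-8 text with the bytes the same text would take as a Vec<char>, and checks a byte offset that falls inside a code point. The helper name storage_sizes is only for illustration:

```rust
use std::mem::size_of;

// Bytes used by the UTF-8 text versus the same text stored as chars.
// `storage_sizes` is a hypothetical helper name chosen for this sketch.
fn storage_sizes(s: &str) -> (usize, usize) {
    let utf8_bytes = s.len(); // len() counts bytes, not chars
    let char_bytes = s.chars().count() * size_of::<char>();
    (utf8_bytes, char_bytes)
}

fn main() {
    for s in &["abc", "日本語"] {
        let (utf8, chars) = storage_sizes(s);
        println!("{:?}: {} bytes as UTF-8, {} bytes as chars", s, utf8, chars);
    }
    // Byte offset 1 falls inside the 3-byte encoding of '日', so it is not a boundary.
    println!("{}", "日本語".is_char_boundary(1));
}
```

    The 4x blow-up only shows for ASCII input; for text that is already mostly 3-byte or 4-byte code points the overhead is much smaller.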

    To avoid making a potentially large, temporary memory allocation, you should consider a lazier approach: iterate through the String, making slices at exactly the char boundaries. Something like this:

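    fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
        // A zero-width window is meaningless and would underflow `win_size - 1`.
        assert!(win_size > 0, "win_size must be at least 1");
        src.char_indices().flat_map(move |(from, _)| {
            // Find the last char of the window that starts at byte offset `from`.
            src[from..]
                .char_indices()
                .nth(win_size - 1)
                // `to` is relative to `from`; the window ends right after that char.
                .map(|(to, c)| &src[from..from + to + c.len_utf8()])
        })
    }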
    

    Called with a window size of 3, this gives you an iterator whose items are &str, each 3 chars long:

    let tst = String::from("abcdefg");
    let windows = char_windows(&tst, 3);
    
    for win in windows {
        println!("{:?}", win);
    }
    

    The nice thing about this approach is that it hasn't done any copying at all - each &str produced by the iterator is still a slice into the original source String.
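    As a quick check on multi-byte input, the sketch below restates the function so the example is self-contained, then verifies that every window points into the original buffer. The helper is_subslice is made up for this illustration; it only compares addresses and is not a standard library function:

```rust
// Restated from the answer so this example compiles on its own.
fn char_windows<'a>(src: &'a str, win_size: usize) -> impl Iterator<Item = &'a str> {
    assert!(win_size > 0, "win_size must be at least 1");
    src.char_indices().flat_map(move |(from, _)| {
        src[from..]
            .char_indices()
            .nth(win_size - 1)
            .map(|(to, c)| &src[from..from + to + c.len_utf8()])
    })
}

// Hypothetical helper: true if `part` lies inside the memory borrowed by `whole`.
fn is_subslice(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= start + whole.len()
}

fn main() {
    // Chars of 1, 2, 3 and 3 bytes, so the windows have different byte lengths.
    let text = String::from("aé日本");
    for win in char_windows(&text, 2) {
        println!("{:?} borrowed from source: {}", win, is_subslice(&text, win));
    }
}
```

    Note that each window rescans up to win_size chars from its start, so the total work is roughly the string length times the window size. That is usually fine for small windows.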


    All of that complexity exists because Rust's String and str are always UTF-8 encoded. If you know for certain that your input string contains only single-byte (ASCII) characters, you can treat it as ASCII bytes, and taking slices becomes easy:

    let tst = String::from("abcdefg");
    let inter = tst.as_bytes();
    let windows = inter.windows(3);
    

    However, you now have slices of bytes, and you'll need to turn them back into strings to do anything with them:

    for win in windows {
        println!("{:?}", String::from_utf8_lossy(win));
    }
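    If you are not completely sure the input is ASCII, one option is to check first and refuse anything else. This is only a sketch: the name ascii_windows and the choice to return None for non-ASCII input are assumptions made for the example, not something from the answer above.

```rust
// Returns the windows as &str only when the input is pure ASCII.
// `ascii_windows` is a hypothetical helper name chosen for this sketch.
fn ascii_windows(src: &str, win_size: usize) -> Option<Vec<&str>> {
    // slice::windows panics on a size of 0, so reject that up front as well.
    if !src.is_ascii() || win_size == 0 {
        return None;
    }
    Some(
        src.as_bytes()
            .windows(win_size)
            // Every byte window of an ASCII string is itself valid UTF-8,
            // so this conversion cannot fail and does not copy.
            .map(|w| std::str::from_utf8(w).expect("ASCII is valid UTF-8"))
            .collect(),
    )
}

fn main() {
    println!("{:?}", ascii_windows("abcdefg", 3));
    println!("{:?}", ascii_windows("日本語", 2));
}
```

    Using std::str::from_utf8 here instead of String::from_utf8_lossy keeps the windows as borrowed &str, so nothing is allocated per window.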
    
  • 2021-01-18 00:53

    This solution will work for your purpose:

    fn main() {
        let tst = String::from("abcdefg");
        let inter = tst.chars().collect::<Vec<char>>();
        let mut windows = inter.windows(3);
    
        // prints ['a', 'b', 'c']
        println!("{:?}", windows.next().unwrap());
        // prints ['b', 'c', 'd']
        println!("{:?}", windows.next().unwrap());
        // etc...
        println!("{:?}", windows.next().unwrap());
    }
    

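    A String can iterate over its chars, but it is not a slice of chars, so you have to collect them into a Vec<char>, which then derefs to a slice (&[char]) that has the windows method. This costs one allocation of 4 bytes per char.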
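    If you need each window as text rather than as a &[char], you can collect every window back into a String. A minimal sketch, with char_window_strings being a made-up name for the example:

```rust
// Collects every window of `win_size` chars into an owned String.
// `char_window_strings` is a hypothetical helper name chosen for this sketch.
// Like slice::windows, it panics if `win_size` is 0.
fn char_window_strings(src: &str, win_size: usize) -> Vec<String> {
    let chars: Vec<char> = src.chars().collect();
    chars
        .windows(win_size)
        // String implements FromIterator<&char>, so a window collects directly.
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    for win in char_window_strings("日本語", 2) {
        println!("{}", win);
    }
}
```

    This allocates one String per window on top of the Vec<char>, so it trades memory for simplicity.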
