Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

Posted by 泄露秘密 on 2019-12-01 02:54:10

Question


In Rust it's possible to get UTF-8 from bytes by doing this:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

This either works or it doesn't, but Python has the ability to handle errors, e.g.:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape')

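In this example the errors argument selects how invalid UTF-8 sequences are handled: instead of being ignored or replaced with a placeholder, the undecodable bytes are converted to escape codes that can be represented in the resulting string. (Strictly, surrogateescape maps each undecodable byte to a lone surrogate code point in the range U+DC80 to U+DCFF, while errors='backslashreplace' emits \xNN escape sequences, which are valid UTF-8.) See the Python docs for details.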

Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?


Answer 1:


Yes, via String::from_utf8_lossy, which replaces each invalid sequence with U+FFFD (the replacement character):

fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo
}
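
One detail worth knowing about from_utf8_lossy is that it returns a Cow<str> rather than a String, so valid input is borrowed without copying. The following is a minimal sketch of how that can be observed; the helper name decode_lossy is hypothetical and exists only for this illustration:

```rust
use std::borrow::Cow;

/// Decodes `bytes` lossily and reports whether any replacement was needed.
/// NOTE: `decode_lossy` is a hypothetical helper, not part of the standard library.
fn decode_lossy(bytes: &[u8]) -> (String, bool) {
    match String::from_utf8_lossy(bytes) {
        // The input was already valid UTF-8, so the Cow only borrows it.
        Cow::Borrowed(s) => (s.to_owned(), false),
        // At least one invalid sequence was replaced with U+FFFD,
        // which forced a new String to be allocated.
        Cow::Owned(s) => (s, true),
    }
}
```

Calling decode_lossy(b"hello") should take the Borrowed branch, while the byte array from the example above takes the Owned branch.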

If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.

A quickly hacked-up example:

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    //  No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}

You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...    
                handler(&mut output, bad);
    // ...
}
let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
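
Since the snippet above elides the loop body, here is one way the full closure-based version might look. This is a sketch under stated assumptions rather than the answer's exact code: the names decode_with and backslash_escape are hypothetical, and unlike the original (which passes the whole remaining slice and skips one byte), it uses Utf8Error::error_len to hand the handler only the invalid sequence itself.

```rust
use std::fmt::Write;
use std::str;

// Hypothetical name; decodes `bytes`, calling `handler` for each invalid sequence.
fn decode_with(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    let mut output = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // Everything that remains is valid UTF-8, so we are done.
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, rest) = bytes.split_at(e.valid_up_to());
                // SAFETY: `from_utf8` just reported that the bytes before
                // `valid_up_to()` are valid UTF-8.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });

                // `error_len()` is None when the input ends in the middle of a
                // sequence; in that case treat everything that is left as bad.
                let bad_len = e.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                handler(&mut output, bad);
                bytes = tail;
            }
        }
    }
}

// Hypothetical handler choice: write each bad byte as a `\xNN` escape.
fn backslash_escape(bytes: &[u8]) -> String {
    decode_with(bytes, |out, bad| {
        for b in bad {
            write!(out, "\\x{:02x}", b).unwrap();
        }
    })
}
```

With the same input as before, backslash_escape(&[104, 101, 0xFF, 108, 111]) yields he\xfflo.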

See also:

  • How do I convert a Vector of bytes (u8) to a string?
  • How to print a u8 slice as text if I don't care about the particular encoding?
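
On newer toolchains the manual loop in this answer may not be needed at all. The sketch below assumes Rust 1.79 or later, where the slice method utf8_chunks is stable; it yields alternating valid and invalid pieces without any unsafe code. The function name escape_invalid and the \xNN output format are choices made for this illustration only.

```rust
use std::fmt::Write;

// Assumes Rust 1.79+ for `<[u8]>::utf8_chunks`.
// `escape_invalid` is a hypothetical name used only in this sketch.
fn escape_invalid(bytes: &[u8]) -> String {
    let mut output = String::with_capacity(bytes.len());
    for chunk in bytes.utf8_chunks() {
        // The valid part of each chunk is already a &str.
        output.push_str(chunk.valid());
        // The invalid part (possibly empty) is written as `\xNN` escapes.
        for &b in chunk.invalid() {
            write!(output, "\\x{:02X}", b).unwrap();
        }
    }
    output
}
```

For the input used throughout this answer, escape_invalid(&[104, 101, 0xFF, 108, 111]) produces he\xFFlo.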



Answer 2:


You can either:

  1. Construct it yourself using strict UTF-8 decoding, which returns an error indicating the position where decoding failed; you can then escape the offending bytes. However, this is inefficient, since each failed attempt is decoded twice.

  2. Try third-party crates that provide more customizable charset decoders.
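
To make the first option concrete, the following sketch shows what the strict decoder's error actually reports. It relies only on the standard Utf8Error methods valid_up_to and error_len; the helper name first_error is hypothetical.

```rust
use std::str;

/// Returns `(valid_up_to, error_len)` for the first decoding error,
/// or `None` if `bytes` is entirely valid UTF-8.
/// NOTE: `first_error` is a hypothetical helper for illustration.
fn first_error(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    str::from_utf8(bytes)
        .err()
        // valid_up_to(): index of the first byte that is not part of valid UTF-8.
        // error_len(): length of the invalid sequence, or None if the input
        // simply ended in the middle of a multi-byte sequence.
        .map(|e| (e.valid_up_to(), e.error_len()))
}
```

For the bytes [104, 101, 0xFF, 108, 111] this reports position 2 with an invalid sequence of length 1, which is enough information to escape that byte and resume decoding from position 3.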



Source: https://stackoverflow.com/questions/41455206/is-it-possible-to-decode-bytes-to-utf-8-converting-errors-to-escape-sequences-i
