In Rust it's possible to get UTF-8 from bytes by doing this:
if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}
Yes, via String::from_utf8_lossy:
fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo
}
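One detail that may matter here: `String::from_utf8_lossy` returns a `Cow<str>`, so it only builds a new `String` when it actually had to replace something. A small sketch of telling the two cases apart (the helper name `describe` is made up for illustration):

```rust
use std::borrow::Cow;

// Hypothetical helper: reports whether the lossy conversion had to allocate.
fn describe(bytes: &[u8]) -> (&'static str, String) {
    match String::from_utf8_lossy(bytes) {
        // Input was already valid UTF-8: the result borrows from `bytes`.
        Cow::Borrowed(s) => ("borrowed", s.to_string()),
        // Invalid sequences were replaced with U+FFFD, so a new String was built.
        Cow::Owned(s) => ("owned", s),
    }
}
```

With valid input such as `b"hello"` this takes the `Borrowed` arm; with the `[104, 101, 0xFF, 108, 111]` input from above it takes the `Owned` arm.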
If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.
A quickly hacked-up example:
use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    // No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}
You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:
fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...
    handler(&mut output, bad);
    // ...
}

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
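For reference, here is one way the elided parts could be filled in. This is only a sketch: it reuses the loop from the first example unchanged apart from calling `handler`, and `escape_byte` is a made-up named handler equivalent to the closure shown above.

```rust
use std::fmt::Write;
use std::str;

// Same loop as the first example, with the recovery step delegated to `handler`.
fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    let mut output = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // Everything that was left is valid UTF-8.
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());
                // SAFETY: `str::from_utf8` just reported `good` as valid UTF-8.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });
                if bad.is_empty() {
                    return output;
                }
                // Let the caller decide how to represent the bad data.
                handler(&mut output, bad);
                // Skip one bad byte and try again.
                bytes = &bad[1..];
            }
        }
    }
}

// Hypothetical handler: writes the first offending byte as `\U{<decimal>}`.
fn escape_byte(output: &mut String, bad: &[u8]) {
    write!(output, "\\U{{{}}}", bad[0]).unwrap();
}
```

Calling `example(&[104, 101, 0xFF, 108, 111], escape_byte)` yields the same `he\U{255}lo` as the closure version. Note that the handler receives the whole remainder of the input, not just the single bad byte, so it should only look at the front of the slice.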
You can either:
Construct it yourself by using the strict UTF-8 decoding which returns an error indicating the position where the decoding failed, which you can then escape. But that's inefficient since you will decode each failed attempt twice.
Try third-party crates that provide more customizable charset decoders.
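A rough sketch of the first option, using `Utf8Error::valid_up_to` for the position of the failure and `Utf8Error::error_len` for the length of the invalid sequence. The function name `decode_escaped` and the `\xNN` escape format are assumptions chosen for illustration. Since the error reports how much of the input was valid, the valid prefix is taken as-is here instead of being decoded again.

```rust
use std::fmt::Write;
use std::str;

// Hypothetical helper: decodes `bytes`, escaping each invalid byte as `\xNN`.
fn decode_escaped(mut bytes: &[u8]) -> String {
    let mut output = String::new();
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(s) => {
                output.push_str(s);
                break;
            }
            Err(e) => {
                let (good, rest) = bytes.split_at(e.valid_up_to());
                // SAFETY: `from_utf8` reported `good` as valid UTF-8.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });
                // `error_len` is `None` when the input ends in the middle of
                // a sequence; in that case escape everything that is left.
                let bad_len = e.error_len().unwrap_or(rest.len());
                for b in &rest[..bad_len] {
                    write!(output, "\\x{:02X}", b).unwrap();
                }
                // Continue after the whole invalid sequence.
                bytes = &rest[bad_len..];
            }
        }
    }
    output
}
```

For the input `[104, 101, 0xFF, 108, 111]` this produces `he\xFFlo`; a truncated sequence at the end, such as `[0x68, 0xE2, 0x82]`, comes out as `h\xE2\x82`.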