As an educational exercise, I\'m looking at porting cvs-fast-export to Rust.
Its basic mode of operation is to parse a number of CVS master files into a intermediate for
Your analysis is correct. The ultimate issue is that when modifying the HashSet
, the compiler cannot guarantee that the mutations will not affect the existing allocations. Indeed, in general they might affect them, unless you add another layer of indirection, as you have identified.
This is a prime example of a place that unsafe
is useful. You, the programmer, can assert that the code will only ever be used in a particular way, and that particular way will allow the variable to be stable through any mutations. You can use the type system and module visibility to help enforce these conditions.
Note that String
already introduces a heap allocation. So long as you don't change the String
once it's allocated, you don't need an extra Box
.
Something like this seems like an OK start:
use std::{cell::RefCell, collections::HashSet, mem};
struct EasyInterner(RefCell<HashSet<String>>);
impl EasyInterner {
fn new() -> Self {
EasyInterner(RefCell::new(HashSet::new()))
}
fn intern<'a>(&'a self, s: &str) -> &'a str {
let mut set = self.0.borrow_mut();
if !set.contains(s) {
set.insert(s.into());
}
let interned = set.get(s).expect("Impossible missing string");
// TODO: Document the pre- and post-conditions that the code must
// uphold to make this unsafe code valid instead of copying this
// from Stack Overflow without reading it
unsafe { mem::transmute(interned.as_str()) }
}
}
fn main() {
let i = EasyInterner::new();
let a = i.intern("hello");
let b = i.intern("world");
let c = i.intern("hello");
// Still strings
assert_eq!(a, "hello");
assert_eq!(a, c);
assert_eq!(b, "world");
// But with the same address
assert_eq!(a.as_ptr(), c.as_ptr());
assert!(a.as_ptr() != b.as_ptr());
// This shouldn't compile; a cannot outlive the interner
// let x = {
// let i = EasyInterner::new();
// let a = i.intern("hello");
// a
// };
let the_pointer;
let i = {
let i = EasyInterner::new();
{
// Introduce a scope to contstrain the borrow of `i` for `s`
let s = i.intern("inner");
the_pointer = s.as_ptr();
}
i // moving i to a new location
// All outstanding borrows are invalidated
};
// but the data is still allocated
let s = i.intern("inner");
assert_eq!(the_pointer, s.as_ptr());
}
However, it may be much more expedient to use a crate like:
I somewhat disagree with @Shepmaster on the use of unsafe
here.
While right now it does not cause issue, should someone decide in the future to change the use of HashSet
to include some pruning (for example, to only ever keep a hundred authors in there), then unsafe
will bite you sternly.
In the absence of a strong performance reason, I would simply use a Rc<XXX>
. You can alias it easily enough: type InternedXXX = Rc<XXX>;
.
use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;
#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Rc<Vec<u16>>);
fn intern<T:Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> T {
if !set.contains(&item) {
let dupe = item.clone();
set.insert(dupe);
item
} else {
set.get(&item).unwrap().clone()
}
}
fn main() {
let mut set: HashSet<CvsNumber> = HashSet::new();
let c1 = CvsNumber(Rc::new(vec![1, 2]));
let c2 = intern(&mut set, c1);
let c3 = CvsNumber(Rc::new(vec![1, 2]));
let c4 = intern(&mut set, c3);
}