Question
I need to check whether all items in an array of serde_json::Value are unique. Since this type does not implement Hash, I came up with the following solution:
use serde_json::{json, Value};
use std::collections::HashSet;

fn is_unique(items: &[Value]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    for item in items.iter() {
        // `insert` returns `false` if the set already contained the value.
        if !seen.insert(item.to_string()) {
            return false;
        }
    }
    true
}

fn main() {
    let value1 = json!([1, 2]);
    assert!(is_unique(value1.as_array().unwrap()));
    let value2 = json!([1, 1]);
    assert!(!is_unique(value2.as_array().unwrap()));
}
I assume that it should only work if serde_json is built with the preserve_order feature (so that objects are serialized with the same key order every time), but I am not 100% sure about it.
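A small sketch probing that assumption (assuming serde_json's default features, where objects are backed by a sorted BTreeMap):

use serde_json::json;

fn main() {
    // Two semantically equal objects built with different key insertion order.
    let a = json!({"x": 1, "y": 2});
    let b = json!({"y": 2, "x": 1});
    assert_eq!(a, b); // `Value` equality does not depend on key order

    // With serde_json's default features, objects are stored in a BTreeMap and
    // always stringify with sorted keys, so this assertion holds; with the
    // `preserve_order` feature, insertion order is kept and it can fail.
    assert_eq!(a.to_string(), b.to_string());
}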
Main usage context: JSON Schema validation, specifically an implementation of the "uniqueItems" keyword.
Related use case: deduplication of JSON arrays to optimize JSON Schema inference on them.
For example, if the input data is [1, 2, {"foo": "bar"}], a straightforward inference might output this:
{
    "type": "array",
    "items": {
        "anyOf": [
            {"type": "integer"},
            {"type": "integer"},
            {"type": "object", "required": ["foo"]}
        ]
    }
}
The values in items/anyOf can be reduced to just two entries, since the two {"type": "integer"} schemas are duplicates.
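As a rough sketch of that reduction (a hypothetical dedup helper, reusing the string-key approach from the code above):

use serde_json::Value;
use std::collections::HashSet;

// Hypothetical helper: keep only the first occurrence of each distinct value,
// so duplicate `anyOf` variants collapse before the schema is emitted.
fn dedup(items: &[Value]) -> Vec<&Value> {
    let mut seen = HashSet::with_capacity(items.len());
    items
        .iter()
        .filter(|item| seen.insert(item.to_string()))
        .collect()
}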
Question: What would be the most time-efficient and correct way to check that there are no duplicates in an arbitrary JSON array?
I used serde_json = "1.0.48" and Rust 1.42.0.
Playground
Answer 1:
Converting each array item to a string is rather expensive: it requires at least one string allocation per item, and quite likely more than that. It's also difficult to make sure mappings (or "objects" in JSON language) are represented in a canonical form.
A faster and more robust alternative is to implement Hash for Value yourself. You need to define a newtype wrapper, since you can't implement a foreign trait on a foreign type. Here's a simple example implementation:
use serde_json::Value;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(PartialEq)]
struct HashValue<'a>(pub &'a Value);

impl Eq for HashValue<'_> {}

impl Hash for HashValue<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        use Value::*;
        match self.0 {
            Null => state.write_u32(3_221_225_473), // chosen randomly
            Bool(ref b) => b.hash(state),
            Number(ref n) => {
                if let Some(x) = n.as_u64() {
                    x.hash(state);
                } else if let Some(x) = n.as_i64() {
                    x.hash(state);
                } else if let Some(x) = n.as_f64() {
                    // `f64` does not implement `Hash`. However, floats in JSON are
                    // guaranteed to be finite, so we can use the `Hash`
                    // implementation in the `ordered-float` crate.
                    ordered_float::NotNan::new(x).unwrap().hash(state);
                }
            }
            String(ref s) => s.hash(state),
            Array(ref v) => {
                for x in v {
                    HashValue(x).hash(state);
                }
            }
            Object(ref map) => {
                let mut hash = 0;
                for (k, v) in map {
                    // We have no way of building a new hasher of type `H`, so we
                    // hardcode using the default hasher of a hash map.
                    let mut item_hasher = DefaultHasher::new();
                    k.hash(&mut item_hasher);
                    HashValue(v).hash(&mut item_hasher);
                    hash ^= item_hasher.finish();
                }
                state.write_u64(hash);
            }
        }
    }
}
The value for Null is chosen randomly to make it unlikely to collide with other entries. To calculate hashes for floating-point numbers, I used the ordered-float crate. For mappings, the code calculates a hash for each key/value pair and simply XORs these hashes together, which is order-independent. It's a bit unfortunate that we need to hardcode the hasher used for hashing the map entries. We could abstract that out by defining our own version of the Hash trait and then derive concrete implementations of std::hash::Hash from our custom Hash trait, but this complicates the code quite a bit, so I wouldn't do that unless you need to.
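A usage sketch tying this back to the question's is_unique function (assuming the HashValue wrapper above is in scope):

use std::collections::HashSet;

fn is_unique(items: &[Value]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // `insert` returns `false` when the set already contains an equal value,
    // so the first duplicate short-circuits the scan.
    items.iter().all(|item| seen.insert(HashValue(item)))
}

This avoids the per-item string allocation entirely: each item is hashed in place and only references are stored in the set.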
We can't derive Eq, since Value does not implement Eq. However, I believe this is just an oversight, so I filed an issue to add an Eq implementation (for which the PR has been accepted, so it will land in some future release).
Answer 2:
It depends on whether the JSON array is sorted. If it is already sorted, any duplicates must be adjacent, so a single linear scan comparing neighboring elements detects them in O(n). If it is not sorted, you can sort it first (e.g. with merge sort) and then do the same scan, for O(n log n) overall. Alternatively, you can compare every element against every other one, which is O(n^2).
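A minimal sketch of the sort-then-scan idea (note that Value itself does not implement Ord, so this sorts serialized forms and inherits the same canonical-form caveats discussed in Answer 1):

use serde_json::Value;

fn is_unique_sorted(items: &[Value]) -> bool {
    // `Value` is not `Ord`, so sort string representations instead.
    let mut keys: Vec<String> = items.iter().map(|v| v.to_string()).collect();
    keys.sort_unstable(); // O(n log n)
    // After sorting, duplicates are adjacent; scan neighbors in O(n).
    keys.windows(2).all(|pair| pair[0] != pair[1])
}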
Source: https://stackoverflow.com/questions/60882381/what-is-the-fastest-correct-way-to-detect-that-there-are-no-duplicates-in-a-json