What is the fastest correct way to detect that there are no duplicates in a JSON array?

 ̄綄美尐妖づ submitted on 2021-01-27 22:19:21

Question


I need to check whether all items in an array of serde_json::Value are unique. Since this type does not implement Hash, I came up with the following solution:

use serde_json::{json, Value};
use std::collections::HashSet;

fn is_unique(items: &[Value]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    for item in items.iter() {
        if !seen.insert(item.to_string()) {
            return false;
        }
    }
    true
}

fn main() {
    let value1 = json!([1, 2]);
    // `as_array` already yields a reference, so no extra `&` is needed.
    assert!(is_unique(value1.as_array().unwrap()));
    let value2 = json!([1, 1]);
    assert!(!is_unique(value2.as_array().unwrap()));
}

I assume that this only works if serde_json is built with the preserve_order feature (so that objects are serialized in the same order every time), but I am not 100% sure about it.

Main usage context:

JSON Schema validation. "uniqueItems" keyword implementation.

Related use case

Deduplication of JSON arrays to optimize JSON Schema inference on them.

For example, the input data is [1, 2, {"foo": "bar"}]. A straightforward inference might output this:

{
    "type": "array", 
    "items": {
        "anyOf": [
            {"type": "integer"}, 
            {"type": "integer"},
            {"type": "object", "required": ["foo"]}
        ]
    }
}

The values in items/anyOf can be reduced to only two.
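The reduction described here can be sketched as an order-preserving deduplication. This is only a minimal illustration: the helper name `dedup_preserving_order` is hypothetical, and it assumes the items implement `Hash` and `Eq`, which is exactly what serde_json::Value lacks in the version used in this question.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Returns the items with later duplicates removed, keeping first-seen order.
/// Hypothetical helper for illustration; not part of serde_json.
fn dedup_preserving_order<T: Hash + Eq + Clone>(items: &[T]) -> Vec<T> {
    let mut seen = HashSet::with_capacity(items.len());
    items
        .iter()
        // `insert` returns false when the item was already present.
        .filter(|item| seen.insert(*item))
        .cloned()
        .collect()
}
```

Applied to the inferred type names "integer", "integer", "object", this would keep only "integer" and "object".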

Question: What would be the most time-efficient and correct way to check that there are no duplicates in an arbitrary JSON array?

I used serde_json = "1.0.48"

Rust: 1.42.0



Answer 1:


Converting each array item to a string is rather expensive – it requires at least one string allocation per item, and quite likely more than that. It's also difficult to make sure mappings (or "objects" in JSON language) are represented in a canonical form.

A faster and more robust alternative is to implement Hash for Value yourself. You need to define a newtype wrapper, since you can't implement a foreign trait on a foreign type. Here's a simple example implementation:

use serde_json::Value;
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;

#[derive(PartialEq)]
struct HashValue<'a>(pub &'a Value);

impl Eq for HashValue<'_> {}

impl Hash for HashValue<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        use Value::*;
        match self.0 {
            Null => state.write_u32(3_221_225_473), // chosen randomly
            Bool(ref b) => b.hash(state),
            Number(ref n) => {
                if let Some(x) = n.as_u64() {
                    x.hash(state);
                } else if let Some(x) = n.as_i64() {
                    x.hash(state);
                } else if let Some(x) = n.as_f64() {
                    // `f64` does not implement `Hash`. However, floats in JSON are guaranteed to be
                    // finite, so we can use the `Hash` implementation in the `ordered-float` crate.
                    ordered_float::NotNan::new(x).unwrap().hash(state);
                }
            }
            String(ref s) => s.hash(state),
            Array(ref v) => {
                for x in v {
                    HashValue(x).hash(state);
                }
            }
            Object(ref map) => {
                let mut hash = 0;
                for (k, v) in map {
                    // We have no way of building a new hasher of type `H`, so we
                    // hardcode using the default hasher of a hash map.
                    let mut item_hasher = DefaultHasher::new();
                    k.hash(&mut item_hasher);
                    HashValue(v).hash(&mut item_hasher);
                    hash ^= item_hasher.finish();
                }
                state.write_u64(hash);
            }
        }
    }
}

The value for Null is chosen randomly to make it unlikely to collide with other entries. To calculate hashes for floating point numbers, I used the ordered-float crate. For mappings, the code calculates a hash for each key/value pair and simply XORs these hashes together, which makes the result independent of the order of the entries. It's a bit unfortunate that we need to hardcode the hasher used for hashing the map entries. We could abstract that out by defining our own version of the Hash trait and then deriving concrete implementations of std::hash::Hash from it, but this complicates the code quite a bit, so I wouldn't do that unless you need to.

We can't derive Eq, since Value does not implement Eq. However, I believe this is just an oversight, so I filed an issue to add an Eq implementation (the PR for it has been accepted, so it will land in some future release).
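To show how such a wrapper plugs into the uniqueness check, here is a std-only sketch. Note the assumptions: `Json` is a hypothetical stand-in for serde_json::Value (so the example builds without dependencies), `HashJson` plays the role of `HashValue`, and floats are hashed through `to_bits` rather than the ordered-float crate. The array branch also hashes the length, which is an addition of this sketch and not part of the code above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for `serde_json::Value`, so the sketch builds with std only.
#[derive(Debug, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64), // assumed finite, as JSON numbers are
    Str(String),
    Array(Vec<Json>),
    Object(HashMap<String, Json>),
}

// Borrowing newtype, analogous to `HashValue`.
#[derive(PartialEq)]
struct HashJson<'a>(&'a Json);

// Sound only under the assumption that no float is NaN.
impl Eq for HashJson<'_> {}

impl Hash for HashJson<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self.0 {
            Json::Null => state.write_u8(0),
            Json::Bool(b) => b.hash(state),
            Json::Int(i) => i.hash(state),
            // Adding 0.0 turns -0.0 into 0.0, so floats that compare equal hash equal.
            Json::Float(f) => (*f + 0.0).to_bits().hash(state),
            Json::Str(s) => s.hash(state),
            Json::Array(items) => {
                // Hash the length too, so nesting boundaries feed into the hash.
                items.len().hash(state);
                for item in items {
                    HashJson(item).hash(state);
                }
            }
            Json::Object(map) => {
                // XOR the per-entry hashes so iteration order does not matter.
                let mut combined = 0u64;
                for (key, value) in map {
                    let mut entry_hasher = DefaultHasher::new();
                    key.hash(&mut entry_hasher);
                    HashJson(value).hash(&mut entry_hasher);
                    combined ^= entry_hasher.finish();
                }
                state.write_u64(combined);
            }
        }
    }
}

/// Returns true if no two items compare equal.
fn is_unique(items: &[Json]) -> bool {
    let mut seen = HashSet::with_capacity(items.len());
    // `insert` returns false on the first repeated item, which stops `all` early.
    items.iter().all(|item| seen.insert(HashJson(item)))
}
```

Because the set stores borrowed wrappers, no item is cloned or serialized. Two objects with the same entries inserted in a different order are reported as duplicates, since both the equality check and the XOR-combined hash ignore entry order.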




Answer 2:


It depends on whether the JSON array is sorted. If it is, you can use binary search to check whether a value matches another value. To sort the array, you can use merge sort. The total complexity is then O(n log n) for sorting plus O(log n) for each lookup. Alternatively, you can iterate sequentially and compare each item against the others, which is O(n^2).
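A sort-based check along these lines might look like the sketch below. It swaps the per-value binary search for a single pass over neighbouring items after sorting, and it assumes the items implement `Ord`. As far as I know serde_json::Value does not, so for JSON values a custom total ordering would have to be supplied first. The function name `is_unique_sorted` is hypothetical.

```rust
/// Returns true if no two items compare equal, by sorting references and
/// comparing neighbours. Hypothetical helper for illustration.
fn is_unique_sorted<T: Ord>(items: &[T]) -> bool {
    // Sort references so the input slice is left untouched.
    let mut refs: Vec<&T> = items.iter().collect();
    // Any O(n log n) comparison sort works here.
    refs.sort();
    // After sorting, any duplicates sit next to each other.
    refs.windows(2).all(|pair| pair[0] != pair[1])
}
```

This costs O(n log n) comparisons and one extra vector of references, compared with O(n) expected insertions for a hash set.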



Source: https://stackoverflow.com/questions/60882381/what-is-the-fastest-correct-way-to-detect-that-there-are-no-duplicates-in-a-json
