Multiple indexes for a Java Collection - most basic solution?

小鲜肉 2020-11-29 22:08

I'm looking for the most basic solution to create multiple indexes on a Java Collection.

Required functionality:

  • When a Value is removed, all index en
14 answers
  • 2020-11-29 22:37

    Each index will basically be a separate Map. You can (and probably should) abstract this behind a class that manages the searches, indexing, updates and removals for you. It wouldn't be hard to do this fairly generically. But no, there's no standard out of the box class for this although it can easily be built from the Java Collections classes.
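As a rough illustration of the "one Map per index behind a managing class" idea, here is a minimal sketch using only the JDK. The names `IndexedCollection`, `addIndex` and `Person` are hypothetical and chosen for this example; the sketch assumes that the indexed properties of a value do not change while it is stored.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical wrapper: one HashMap per index, all kept in sync on add/remove. */
class IndexedCollection<V> {
    private final Set<V> values = new LinkedHashSet<>();
    private final Map<String, Function<V, ?>> extractors = new HashMap<>();
    private final Map<String, Map<Object, Set<V>>> indexes = new HashMap<>();

    /** Registers an index and back-fills it with the values already stored. */
    void addIndex(String name, Function<V, ?> keyExtractor) {
        extractors.put(name, keyExtractor);
        Map<Object, Set<V>> index = new HashMap<>();
        indexes.put(name, index);
        for (V value : values) {
            index.computeIfAbsent(keyExtractor.apply(value), k -> new LinkedHashSet<>()).add(value);
        }
    }

    /** Stores the value and adds it to every registered index. */
    void add(V value) {
        if (!values.add(value)) return; // already present
        for (Map.Entry<String, Function<V, ?>> e : extractors.entrySet()) {
            indexes.get(e.getKey())
                   .computeIfAbsent(e.getValue().apply(value), k -> new LinkedHashSet<>())
                   .add(value);
        }
    }

    /** Removes the value and every index entry that points at it. */
    boolean remove(V value) {
        if (!values.remove(value)) return false;
        for (Map.Entry<String, Function<V, ?>> e : extractors.entrySet()) {
            Map<Object, Set<V>> index = indexes.get(e.getKey());
            Object key = e.getValue().apply(value);
            Set<V> bucket = index.get(key);
            if (bucket != null) {
                bucket.remove(value);
                if (bucket.isEmpty()) index.remove(key); // drop empty buckets
            }
        }
        return true;
    }

    /** Looks up all values whose extracted key equals the given key. */
    Set<V> get(String indexName, Object key) {
        Map<Object, Set<V>> index = indexes.get(indexName);
        if (index == null) throw new IllegalArgumentException("No such index: " + indexName);
        return Collections.unmodifiableSet(index.getOrDefault(key, Collections.emptySet()));
    }
}

/** Hypothetical value type used only for the demo. */
final class Person {
    final String name;
    final String city;
    Person(String name, String city) { this.name = name; this.city = city; }
}

class IndexedCollectionDemo {
    public static void main(String[] args) {
        IndexedCollection<Person> people = new IndexedCollection<>();
        people.addIndex("name", p -> p.name);
        people.addIndex("city", p -> p.city);
        Person ann = new Person("Ann", "Oslo");
        people.add(ann);
        people.add(new Person("Bob", "Oslo"));
        System.out.println(people.get("city", "Oslo").size());
        people.remove(ann); // also clears the "name" and "city" entries for Ann
        System.out.println(people.get("name", "Ann").isEmpty());
    }
}
```

If an indexed property can change after insertion, the value would have to be removed and re-added (or the class extended with an explicit update method) so that the indexes stay consistent.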

  • 2020-11-29 22:38

    You need to check out Boon. :)

    http://rick-hightower.blogspot.com/2013/11/what-if-java-collections-and-java.html

    You can add n number of search indexes and lookup indexes. It also allows you to efficiently query primitive properties.

    Here is an example taken from the wiki (I am the author).

      repoBuilder.primaryKey("ssn")
              .searchIndex("firstName").searchIndex("lastName")
              .searchIndex("salary").searchIndex("empNum", true)
              .usePropertyForAccess(true);
    

    By default, search indexes allow duplicates. You can override that by providing a true flag as the second argument to searchIndex.

    Notice empNum is a searchable unique index.

    What if it were easy to query a complex set of Java objects at runtime? What if there were an API that kept your object indexes (really just TreeMaps and HashMaps) in sync? Well, then you would have Boon's data repo. This article shows how to use Boon's data repo utilities to query Java objects. This is part one. There can be many, many parts. :) Boon's data repo makes doing index-based queries on collections a lot easier.

    Why Boon data repo

    Boon's data repo allows you to treat Java collections more like a database, at least when it comes to querying the collections. Boon's data repo is not an in-memory database, and cannot substitute for arranging your objects into data structures optimized for your application. If you want to spend your time providing customer value and building your objects and classes and using the Collections API for your data structures, then DataRepo is meant for you. This does not preclude breaking out the Knuth books and coming up with an optimized data structure. It just helps keep the mundane things easy so you can spend your time making the hard things possible.

    Born out of need

    This project came out of a need. I was working on a project that planned to store a large collection of domain objects in memory for speed, and somebody asked an all-too-important question that I had overlooked: how are we going to query this data? My answer was that we would use the Collections API and the Streaming API. Then I tried to do this... Hmmm... I also tried using the JDK 8 stream API on a large data set, and it was slow. (Boon's data repo works with JDK7 and JDK8). It was a linear search/filter. This is by design, but for what I was doing, it did not work. I needed indexes to support arbitrary queries. Boon's data repo augments the streaming API.

    Boon's data repo does not endeavor to replace the JDK 8 stream API, and in fact it works well with it. Boon's data repo allows you to create indexed collections. The indexes can be anything (it is pluggable). At this moment in time, Boon's data repo indexes are based on ConcurrentHashMap and ConcurrentSkipListMap. By design, Boon's data repo works with standard collection libraries. There is no plan to create a set of custom collections. One should be able to plug in Guava, Concurrent Trees or Trove if one desires to do so. It provides a simplified API for doing so. It allows linear search for the sake of completeness, but I recommend using it primarily for its indexes and then using the streaming API for the rest (for type safety and speed).
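To make the difference between the two index types concrete, here is a hand-rolled sketch (not Boon's actual implementation) that keeps a hash index for equality lookups and a sorted index for range lookups over the same objects. The classes `Worker` and `WorkerIndexes` are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical domain class, used only for illustration. */
final class Worker {
    final String lastName;
    final int salary;
    Worker(String lastName, int salary) { this.lastName = lastName; this.salary = salary; }
}

/** Two indexes over the same objects: one hashed, one sorted. */
final class WorkerIndexes {
    // Hash index: expected constant-time equality lookups.
    private final Map<String, List<Worker>> byLastName = new ConcurrentHashMap<>();
    // Sorted index: logarithmic-time lookups that also support ranges.
    private final ConcurrentSkipListMap<Integer, List<Worker>> bySalary = new ConcurrentSkipListMap<>();

    void add(Worker w) {
        byLastName.computeIfAbsent(w.lastName, k -> new CopyOnWriteArrayList<>()).add(w);
        bySalary.computeIfAbsent(w.salary, k -> new CopyOnWriteArrayList<>()).add(w);
    }

    /** Equality query, answered by the hash index. */
    List<Worker> lastNameEquals(String lastName) {
        return byLastName.getOrDefault(lastName, Collections.emptyList());
    }

    /** Range query (min <= salary <= max), answered by the sorted index. */
    List<Worker> salaryBetween(int min, int max) {
        List<Worker> result = new ArrayList<>();
        for (List<Worker> bucket : bySalary.subMap(min, true, max, true).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}

class WorkerIndexesDemo {
    public static void main(String[] args) {
        WorkerIndexes idx = new WorkerIndexes();
        idx.add(new Worker("Smith", 195_000));
        idx.add(new Worker("Smith", 50_000));
        idx.add(new Worker("Jones", 200_000));
        System.out.println(idx.lastNameEquals("Smith").size());
        System.out.println(idx.salaryBetween(190_000, 200_000).size());
    }
}
```

A hash map cannot answer "between" or "starts with" queries without scanning every key, which is presumably why a sorted structure is used alongside it.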

    Sneak peek before the step by step

    Let's say you have a method that creates 200,000 employee objects like this:

     List<Employee> employees = TestHelper.createMetricTonOfEmployees(200_000);
    

    So now we have 200,000 employees. Let's search them...

    First wrap Employees in a searchable query:

       employees = query(employees);
    

    Now search:

      List<Employee> results = query(employees, eq("firstName", firstName));
    

    So what is the main difference between the above and the stream API?

      employees.stream().filter(emp -> emp.getFirstName().equals(firstName))
    

    Using Boon's DataRepo is about 20,000% faster! Ah, the power of HashMaps and TreeMaps. :) There is an API that looks just like your built-in collections. There is also an API that looks more like a DAO object or a Repo object.

    A simple query with the Repo/DAO object looks like this:

      List<Employee> employees = repo.query(eq("firstName", "Diana"));
    

    A more involved query would look like this:

      List<Employee> employees = repo.query(
          and(eq("firstName", "Diana"), eq("lastName", "Smith"), eq("ssn", "21785999")));
    

    Or this:

      List<Employee> employees = repo.query(
          and(startsWith("firstName", "Bob"), eq("lastName", "Smith"), lte("salary", 200_000),
                  gte("salary", 190_000)));
    

    Or even this:

      List<Employee> employees = repo.query(
          and(startsWith("firstName", "Bob"), eq("lastName", "Smith"), between("salary", 190_000, 200_000)));
    

    Or if you want to use JDK 8 stream API, this works with it not against it:

      int sum = repo.query(eq("lastName", "Smith")).stream().filter(emp -> emp.getSalary()>50_000)
          .mapToInt(b -> b.getSalary())
          .sum();
    

    The above would be much faster if the number of employees was quite large. It would narrow down to the employees whose last name is Smith and who have a salary above 50,000. Let's say you had 100,000 employees and only 50 named Smith. The index quickly narrows the 100,000 down to those 50, and then the filter runs over just 50 employees instead of the whole 100,000.
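The same "index first, stream second" pattern can be sketched with plain JDK collections. This is not Boon's API; `Staff` and `SalaryReport` are hypothetical names, and the index here is simply a map built once with `Collectors.groupingBy`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical domain class, used only for illustration. */
final class Staff {
    final String lastName;
    final int salary;
    Staff(String lastName, int salary) { this.lastName = lastName; this.salary = salary; }
}

final class SalaryReport {
    /** Builds the index once: last name -> employees with that last name. */
    static Map<String, List<Staff>> indexByLastName(List<Staff> all) {
        return all.stream().collect(Collectors.groupingBy(s -> s.lastName));
    }

    /** Index lookup first, then a stream over the (small) matching bucket only. */
    static int sumSalariesAbove(Map<String, List<Staff>> index, String lastName, int threshold) {
        return index.getOrDefault(lastName, Collections.emptyList()).stream()
                .filter(s -> s.salary > threshold)
                .mapToInt(s -> s.salary)
                .sum();
    }
}

class SalaryReportDemo {
    public static void main(String[] args) {
        List<Staff> all = Arrays.asList(
                new Staff("Smith", 60_000),
                new Staff("Smith", 40_000),
                new Staff("Jones", 70_000));
        Map<String, List<Staff>> index = SalaryReport.indexByLastName(all);
        // Only the two Smith entries are streamed; Jones is never visited.
        System.out.println(SalaryReport.sumSalariesAbove(index, "Smith", 50_000));
    }
}
```

Building the index costs one pass over the data, so the approach pays off when the same collection is queried repeatedly.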

    Here is a benchmark run from data repo of a linear search versus an indexed search, in nanoseconds:

    Name index  Time 218 
    Name linear  Time 3709120 
    Name index  Time 213 
    Name linear  Time 3606171 
    Name index  Time 219 
    Name linear  Time 3528839
    

    Someone recently said to me: "But with the streaming API, you can run the filter in parallel."

    Let's see how the math holds up:

    3,528,839 / 16 threads vs. 219
    
    220,552 vs. 219 (nanoseconds).
    

    Indexes win, but it was a photo finish. NOT! :)

    It was only about 1,000 times faster instead of about 16,000 times faster. So close.....

    I added some more features. They make heavy use of indexes. :)

    repo.updateByFilter(values(value("firstName", "Di")), and( eq("firstName", "Diana"), eq("lastName", "Smith"), eq("ssn", "21785999") ) );

    The above would be equivalent to

    UPDATE Employee e SET e.firstName='Di' WHERE e.firstName = 'Diana' and e.lastName = 'Smith' and e.ssn = '21785999'

    This allows you to set multiple fields at once on multiple records, which is useful if you are doing a bulk update.
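One detail worth spelling out is that updating an indexed field means the index itself has to move the affected objects from the old key to the new one. The sketch below shows that step with a single hand-written index; it is not how Boon implements updateByFilter, and `Emp` and `FirstNameIndex` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical domain class with one mutable, indexed field. */
final class Emp {
    String firstName;
    final String ssn;
    Emp(String firstName, String ssn) { this.firstName = firstName; this.ssn = ssn; }
}

/** A single index on firstName that stays consistent across bulk updates. */
final class FirstNameIndex {
    private final Map<String, Set<Emp>> byFirstName = new HashMap<>();

    void add(Emp e) {
        byFirstName.computeIfAbsent(e.firstName, k -> new HashSet<>()).add(e);
    }

    Set<Emp> find(String firstName) {
        return byFirstName.getOrDefault(firstName, Collections.emptySet());
    }

    /** Renames every employee indexed under oldName; returns how many were changed. */
    int updateFirstName(String oldName, String newName) {
        Set<Emp> matches = byFirstName.remove(oldName); // detach the old bucket
        if (matches == null) return 0;
        for (Emp e : matches) {
            e.firstName = newName;                       // update the objects
        }
        // Re-attach the objects under the new key so lookups keep working.
        byFirstName.computeIfAbsent(newName, k -> new HashSet<>()).addAll(matches);
        return matches.size();
    }
}

class FirstNameIndexDemo {
    public static void main(String[] args) {
        FirstNameIndex idx = new FirstNameIndex();
        idx.add(new Emp("Diana", "21785999"));
        idx.add(new Emp("Bob", "11111111"));
        System.out.println(idx.updateFirstName("Diana", "Di"));
        System.out.println(idx.find("Di").size());
    }
}
```

With several indexes, the same detach-and-reattach step would be needed for each index whose key depends on an updated field.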

    There are overloaded methods for all basic types, for when you have one value to update on each item returned from a filter:

      repo.updateByFilter("firstName", "Di",
              and( eq("firstName", "Diana"),
              eq("lastName", "Smith"),
                      eq("ssn", "21785999") ) );
    

    Here are some basic selection capabilities:

      List <Map<String, Object>> list =
         repo.query(selects(select("firstName")), eq("lastName", "Hightower"));
    

    You can have as many selects as you like. You can also bring the list back sorted:

      List <Map<String, Object>> list =
         repo.sortedQuery("firstName",selects(select("firstName")),
           eq("lastName", "Hightower"));
    

    You can select properties of related properties (i.e., employee.department.name).

      List <Map<String, Object>> list = repo.query(
              selects(select("department", "name")),
              eq("lastName", "Hightower"));
    
      assertEquals("engineering", list.get(0).get("department.name"));
    

    The above would try to use the fields of the classes. If you want to use the actual properties (emp.getFoo() vs. emp.foo), then you need to use selectPropPath.

      List <Map<String, Object>> list = repo.query(
              selects(selectPropPath("department", "name")),
              eq("lastName", "Hightower"));
    

    Note that select("department", "name") is much faster than selectPropPath("department", "name"), which could matter in a tight loop.

    By default all search indexes and lookup indexes allow duplicates (except for primary key index).

      repoBuilder.primaryKey("ssn")
              .searchIndex("firstName").searchIndex("lastName")
              .searchIndex("salary").searchIndex("empNum", true)
              .usePropertyForAccess(true);
    

    You can override that by providing a true flag as the second argument to searchIndex.

    Notice empNum is a searchable unique index.

    If you prefer or need, you can get even simple searches back as maps:

      List<Map<String, Object>> employees = repo.queryAsMaps(eq("firstName", "Diana"));
    

    I am not sure if this is a feature or a misfeature. My thought was that once you are dealing with data, you need to present that data in a way that does not tie consumers of the data to your actual API. Having a Map of String / basic types seems to be a way to achieve this. Note that the object-to-map conversion goes deep, as in:

      System.out.println(employees.get(0).get("department"));
    

    Yields:

    {class=Department, name=engineering}
    

    This can be useful for debugging and ad hoc queries for tooling. I am considering adding support to easily convert to a JSON string.

    Added the ability to query collection properties. This should work with collections and arrays as deeply nested as you like. Read that again because it was a real MF to implement!

      List <Map<String, Object>> list = repo.query(
              selects(select("tags", "metas", "metas2", "metas3", "name3")),
              eq("lastName", "Hightower"));
    
      print("list", list);
    
      assertEquals("3tag1", idx(list.get(0).get("tags.metas.metas2.metas3.name3"), 0));
    

    The print out of the above looks like this:

    list [{tags.metas.metas2.metas3.name3=[3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3,
    3tag1, 3tag2, 3tag3, 3tag1, 3tag2, 3tag3]},
    ...
    

    I created several relationship classes to test this:

    public class Employee {
    List <Tag> tags = new ArrayList<>();
    {
        tags.add(new Tag("tag1"));
        tags.add(new Tag("tag2"));
        tags.add(new Tag("tag3"));
    
    }
    ...
    public class Tag {
    ...
    List<Meta> metas = new ArrayList<>();
    {
        metas.add(new Meta("mtag1"));
        metas.add(new Meta("mtag2"));
        metas.add(new Meta("mtag3"));
    
    }
    
    }
    public class Meta {
     ...
       List<Meta2> metas2 = new ArrayList<>();
       {
           metas2.add(new Meta2("2tag1"));
           metas2.add(new Meta2("2tag2"));
           metas2.add(new Meta2("2tag3"));
    
       }
    
    }
    
    ...
    public class Meta2 {
    
    
    
    List<Meta3> metas3 = new ArrayList<>();
    {
        metas3.add(new Meta3("3tag1"));
        metas3.add(new Meta3("3tag2"));
        metas3.add(new Meta3("3tag3"));
    
    }
    public class Meta3 {
    
    ...
    

    You can also search by type:

      List<Employee> results = sortedQuery(queryableList, "firstName", typeOf("SalesEmployee"));
    
      assertEquals(1, results.size());
      assertEquals("SalesEmployee", results.get(0).getClass().getSimpleName());
    

    The above finds all employees with the simple classname of SalesEmployee. It also works with full class name as in:

      List<Employee> results = sortedQuery(queryableList, "firstName", typeOf("SalesEmployee"));
    
      assertEquals(1, results.size());
      assertEquals("SalesEmployee", results.get(0).getClass().getSimpleName());
    

    You can search by the actual class too:

      List<Employee> results = sortedQuery(queryableList, "firstName", instanceOf(SalesEmployee.class));
    
      assertEquals(1, results.size());
      assertEquals("SalesEmployee", results.get(0).getClass().getSimpleName());
    

    You can also query classes that implement certain interfaces:

      List<Employee> results = sortedQuery(queryableList, "firstName",      
                                  implementsInterface(Comparable.class));
    
      assertEquals(1, results.size());
      assertEquals("SalesEmployee", results.get(0).getClass().getSimpleName());
    

    You can also index nested fields/properties. They can be collection fields or non-collection fields, nested as deeply as you like:

      /* Create a repo, and decide what to index. */
      RepoBuilder repoBuilder = RepoBuilder.getInstance();
    
      /* Look at the nestedIndex. */
      repoBuilder.primaryKey("id")
              .searchIndex("firstName").searchIndex("lastName")
              .searchIndex("salary").uniqueSearchIndex("empNum")
              .nestedIndex("tags", "metas", "metas2", "name2");
    

    Later you can use the nestedIndex to search.

      List<Map<String, Object>> list = repo.query(
              selects(select("tags", "metas", "metas2", "name2")),
              eqNested("2tag1", "tags", "metas", "metas2", "name2"));
    

    The safe way to use the nestedIndex is to use eqNested. You can use eq, gt, gte, etc. if you have the index like so:

      List<Map<String, Object>> list = repo.query(
              selects(select("tags", "metas", "metas2", "name2")),
              eq("tags.metas.metas2.name2", "2tag1"));
    

    You can also add support for subclasses

      List<Employee> queryableList = $q(h_list, Employee.class, SalesEmployee.class,  
                      HourlyEmployee.class);
      List<Employee> results = sortedQuery(queryableList, "firstName", eq("commissionRate", 1));
      assertEquals(1, results.size());
      assertEquals("SalesEmployee", results.get(0).getClass().getSimpleName());
    
      results = sortedQuery(queryableList, "firstName", eq("weeklyHours", 40));
      assertEquals(1, results.size());
      assertEquals("HourlyEmployee", results.get(0).getClass().getSimpleName());
    

    The data repo has a similar feature in its DataRepoBuilder.build(...) method for specifying subclasses. This allows you to seamlessly query fields from subclasses and classes in the same repo or searchable collection.

  • 2020-11-29 22:39

    Here is how I am achieving this. Right now only the put, remove and get methods work; for the rest you need to override the desired methods.

    Example:

    MultiKeyMap<MultiKeyMap.Key,String> map = new MultiKeyMap<>();
    MultiKeyMap.Key key1 = map.generatePrimaryKey("keyA","keyB","keyC");
    MultiKeyMap.Key key2 = map.generatePrimaryKey("keyD","keyE","keyF");
    
    map.put(key1,"This is value 1");
    map.put(key2,"This is value 2");
    
    Log.i("MultiKeyMapDebug",map.get("keyA"));
    Log.i("MultiKeyMapDebug",map.get("keyB"));
    Log.i("MultiKeyMapDebug",map.get("keyC"));
    
    Log.i("MultiKeyMapDebug",""+map.get("keyD"));
    Log.i("MultiKeyMapDebug",""+map.get("keyE"));
    Log.i("MultiKeyMapDebug",""+map.get("keyF"));
    

    Output:

    MultiKeyMapDebug: This is value 1
    MultiKeyMapDebug: This is value 1
    MultiKeyMapDebug: This is value 1
    MultiKeyMapDebug: This is value 2
    MultiKeyMapDebug: This is value 2
    MultiKeyMapDebug: This is value 2
    

    MultiKeyMap.java:

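    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;
    
    /**
     * Created by hsn on 11/04/17.
     */
    public class MultiKeyMap<K extends MultiKeyMap.Key, V> extends HashMap<MultiKeyMap.Key, V> {
    
        // Maps each individual key string to the composite key that contains it.
        private final Map<String, MultiKeyMap.Key> keyMap = new HashMap<>();
    
        @Override
        public V get(Object key) {
            return super.get(keyMap.get(key));
        }
    
        @Override
        public V put(MultiKeyMap.Key key, V value) {
            // Register every part of the composite key as an alias for it.
            for (String keyS : key) {
                keyMap.put(keyS, key);
            }
            return super.put(key, value);
        }
    
        @Override
        public V remove(Object key) {
            MultiKeyMap.Key compositeKey = keyMap.get(key);
            if (compositeKey == null) {
                return null;
            }
            // Drop all aliases of the composite key so no stale entries remain.
            keyMap.keySet().removeAll(compositeKey);
            return super.remove(compositeKey);
        }
    
        public Key generatePrimaryKey(String... keys) {
            Key singleKey = new Key();
            for (String key : keys) {
                singleKey.add(key);
            }
            return singleKey;
        }
    
        // Static so that MultiKeyMap.Key is not a raw inner type and can be iterated as Strings.
        public static class Key extends ArrayList<String> {
    
        }
    
    }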
    
  • 2020-11-29 22:41

    If you want multiple indexes on your data, you can create and maintain multiple hash maps or use a library like Data Store:

    https://github.com/jparams/data-store

    Example:

    Store<Person> store = new MemoryStore<>();
    store.add(new Person(1, "Ed", 3));
    store.add(new Person(2, "Fred", 7));
    store.add(new Person(3, "Freda", 5));
    store.index("name", Person::getName);
    Person person = store.getFirst("name", "Ed");
    

    With data store you can create case-insensitive indexes and all sorts of cool stuff. Worth checking out.
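For comparison, a case-insensitive index can also be hand-rolled with the JDK alone. The sketch below does not use the Data Store library; `CaseInsensitiveIndex` is a hypothetical class that relies on a `TreeMap` ordered by `String.CASE_INSENSITIVE_ORDER`, and it assumes keys are never null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical index whose String keys are compared without regard to case. */
final class CaseInsensitiveIndex<V> {
    // The comparator makes "Ed", "ED" and "ed" land in the same bucket.
    private final Map<String, List<V>> index = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void put(String key, V value) {
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    List<V> get(String key) {
        return index.getOrDefault(key, Collections.emptyList());
    }
}

class CaseInsensitiveIndexDemo {
    public static void main(String[] args) {
        CaseInsensitiveIndex<Integer> idsByName = new CaseInsensitiveIndex<>();
        idsByName.put("Ed", 1);
        idsByName.put("Fred", 2);
        idsByName.put("ED", 3);
        System.out.println(idsByName.get("ed")); // both entries stored under "Ed"/"ED"
    }
}
```

Lookups here are O(log n) rather than O(1); normalizing keys (for example with `toLowerCase`) before putting them into a `HashMap` would be an alternative design.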

  • 2020-11-29 22:43

    I'm not sure I understand the question, but I think what you're asking for is multiple ways to map from different, unique keys to values and appropriate clean-up when a value goes away.

    I see that you don't want to roll your own, but there's a simple enough composition of map and multimap (I used the Guava multimap below, but the Apache one should work as well) to do what you want. I have a quick and dirty solution below (skipped the constructors, since that depends on what sort of underlying map/multimap you want to use):

    package edu.cap10.common.collect;
    
    import java.util.Collection;
    import java.util.Map;
    
    import com.google.common.collect.ForwardingMap;
    import com.google.common.collect.Multimap;
    
    public class MIndexLookupMap<T> extends ForwardingMap<Object,T>{
    
        Map<Object,T> delegate;
        Multimap<T,Object> reverse;
    
        @Override protected Map<Object, T> delegate() { return delegate; }
    
        @Override public void clear() {
            delegate.clear();
            reverse.clear();
        }
    
        @Override public boolean containsValue(Object value) { return reverse.containsKey(value); }
    
        @Override public T put(Object key, T value) {
            if (containsKey(key) && !get(key).equals(value)) reverse.remove(get(key), key); 
            reverse.put(value, key);
            return delegate.put(key, value);
        }
    
        @Override public void putAll(Map<? extends Object, ? extends T> m) {
            for (Entry<? extends Object,? extends T> e : m.entrySet()) put(e.getKey(),e.getValue());
        }
    
        public T remove(Object key) {
            T result = delegate.remove(key);
            reverse.remove(result, key);
            return result;
        }
    
        public void removeValue(T value) {
            for (Object key : reverse.removeAll(value)) delegate.remove(key);
        }
    
        public Collection<T> values() {
            return reverse.keySet();
        }   
    
    }
    

    Removing a value is O(number of keys mapped to it), but everything else is the same order as a typical map implementation (with some extra constant overhead, since you also have to maintain the reverse mapping).

    I just used Object keys (should be fine with appropriate implementations of equals() and hashCode() and key distinction) - but you could also have a more specific type of key.
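The same forward-map-plus-reverse-map composition can be sketched without Guava, which may make the bookkeeping easier to follow. `MultiKeyLookup` is a hypothetical name; the sketch assumes values are non-null and implement equals() and hashCode() consistently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical JDK-only variant: many keys per value, with clean-up when a value goes away. */
final class MultiKeyLookup<T> {
    private final Map<Object, T> forward = new HashMap<>();      // key -> value
    private final Map<T, Set<Object>> reverse = new HashMap<>(); // value -> all its keys

    T put(Object key, T value) {
        T previous = forward.put(key, value);
        if (previous != null && !previous.equals(value)) {
            unlink(previous, key); // the key no longer points at the old value
        }
        reverse.computeIfAbsent(value, v -> new HashSet<>()).add(key);
        return previous;
    }

    T get(Object key) {
        return forward.get(key);
    }

    T remove(Object key) {
        T value = forward.remove(key);
        if (value != null) unlink(value, key);
        return value;
    }

    /** Removes the value and every key that maps to it; cost is O(number of such keys). */
    void removeValue(T value) {
        Set<Object> keys = reverse.remove(value);
        if (keys != null) forward.keySet().removeAll(keys);
    }

    int size() {
        return forward.size();
    }

    private void unlink(T value, Object key) {
        Set<Object> keys = reverse.get(value);
        if (keys != null) {
            keys.remove(key);
            if (keys.isEmpty()) reverse.remove(value);
        }
    }
}

class MultiKeyLookupDemo {
    public static void main(String[] args) {
        MultiKeyLookup<String> lookup = new MultiKeyLookup<>();
        lookup.put("id:1", "alice");
        lookup.put("email:alice@example.com", "alice");
        lookup.put("id:2", "bob");
        lookup.removeValue("alice"); // both of alice's keys disappear
        System.out.println(lookup.get("id:1"));
        System.out.println(lookup.size());
    }
}
```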

  • 2020-11-29 22:46

    I've written a Table interface that includes methods like

    V put(R rowKey, C columnKey, V value) 
    V get(Object rowKey, Object columnKey) 
    Map<R,V> column(C columnKey) 
    Set<C> columnKeySet()
    Map<C,V> row(R rowKey)
    Set<R> rowKeySet()
    Set<Table.Cell<R,C,V>> cellSet()
    

    We'd like to include it in a future Guava release, but I don't know when that would happen. http://code.google.com/p/guava-libraries/issues/detail?id=173
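To show what a two-key structure along these lines could look like in practice, here is a minimal JDK-only sketch built on nested maps. `SimpleTable` is a hypothetical class that mirrors only a few of the method signatures listed above; it is not the author's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical two-key table backed by a map of rows, each row being a map of columns. */
final class SimpleTable<R, C, V> {
    private final Map<R, Map<C, V>> rows = new HashMap<>();

    /** Stores a value and returns the previous value at that cell, or null. */
    V put(R rowKey, C columnKey, V value) {
        return rows.computeIfAbsent(rowKey, r -> new HashMap<>()).put(columnKey, value);
    }

    V get(Object rowKey, Object columnKey) {
        Map<C, V> row = rows.get(rowKey);
        return row == null ? null : row.get(columnKey);
    }

    /** Read-only view of one row: column key -> value. */
    Map<C, V> row(R rowKey) {
        return Collections.unmodifiableMap(rows.getOrDefault(rowKey, Collections.emptyMap()));
    }

    /** Snapshot of one column: row key -> value. Requires a scan over all rows. */
    Map<R, V> column(C columnKey) {
        Map<R, V> result = new HashMap<>();
        for (Map.Entry<R, Map<C, V>> e : rows.entrySet()) {
            if (e.getValue().containsKey(columnKey)) {
                result.put(e.getKey(), e.getValue().get(columnKey));
            }
        }
        return result;
    }
}

class SimpleTableDemo {
    public static void main(String[] args) {
        SimpleTable<String, String, Integer> table = new SimpleTable<>();
        table.put("alice", "age", 30);
        table.put("alice", "floor", 2);
        table.put("bob", "age", 41);
        System.out.println(table.get("alice", "floor"));
        System.out.println(table.column("age").size());
    }
}
```

In this layout row lookups are cheap while column lookups scan every row; a second map keyed by column would trade memory for faster column access.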
