Updating record when referenced by multiple data structures

后端 未结 7 1211
独厮守ぢ
独厮守ぢ 2020-12-10 06:50

Suppose I have a record, e.g. Person, and I want to be able to look this person up through multiple data structures. Maybe there\'s an index by name, another in

7条回答
  •  有刺的猬
    2020-12-10 07:41

    We have two challenges. The first is "How do [we] propagate [a] change across ... various lookup structures". The second is to minimize the work done when we perform lookups.

    Let's make some working code so that we have something concrete to discuss.

    To begin with, let's look at what an "update" or "change" is. An update or change starts in one state, and ends up in another state. It's a function from the previous state to the next state. It basically is type Update = State -> State. In Haskell, we can make the state disappear by hiding it in some Monad; this is a very common practice, so despite the fact it looks "impure" it is very "Haskell-ish". You can read more about this idea by reading about the state monad.

    Here's a class similar to MonadState that lets us talk about values we can allocate (new), update (set), and inspect (get).

    -- Class for a typed dictionary in a monadic context
    class (Monad m) => MonadReference m where
        type Reference :: * -> *
        new :: (Typeable a) => a -> m (Reference a)
        set :: (Typeable a) => (Reference a) -> a -> m ()
        get :: (Typeable a) => (Reference a) -> m a
    

    We'll use this to write some very simple example code.

    data Person = Person {
        name :: String
    } deriving (Show, Typeable)
    
    data Company = Company {
        legalName :: String
    } deriving (Show, Typeable)
    
    -- the only thing we need MonadIO for in this exmple is printing output
    example1 :: (MonadIO m, MonadReference m) => m ()
    example1 = do
        -- alice :: Reference Person
        alice <- new $ Person { name = "Alice" }
        bob <- new $ Person { name = "Bob" }
        -- company :: Reference Company
        company <- new $ Company { legalName = "Eve's Surveillance" }
        (liftIO . print) =<< get alice
        (liftIO . print) =<< get bob
        (liftIO . print) =<< get company
        (liftIO . putStrLn) ""
        set alice Person { name = "Mike" }
        set company Company { legalName = "Mike's Meddling" }
        (liftIO . print) =<< get alice
        (liftIO . print) =<< get bob
        (liftIO . print) =<< get company
    

    We've used new, get, and set to create some References, inspect them, and modify them.

    To get this to work, we need a bit of boring boilerplate. We'll borrow IORef for our implementation of a Reference to run this code without writing too much code ourselves.

    {-# LANGUAGE TypeFamilies, DeriveDataTypeable #-}
    module Main (
        main
    ) where
    
    import Data.Typeable
    import Data.Traversable
    import Control.Applicative
    import Data.IORef
    
    --transformers package:
    import Control.Monad.IO.Class
    
    main = example1
    
    -- Instead of implementing a dictionary, for an example we'll just use IORefs when we have IO.
    instance MonadReference IO where
        type Reference = IORef
        new = newIORef
        set = writeIORef
        get = readIORef
    

    Now, on top of updating people, we'd also like to update the people in a multiple data structures. We'll look at two data structures: a list, [Person], and a tuple, (Person,Company). Now, we could make a list of References to people, say (people :: [Reference Person]) = [alice, bob], but this isn't very useful. For example, we don't really know how to show it. It would be more useful if Reference weren't intermingled inside the list. Naively, Reference [Person] would be more useful. But it would mean nothing to set this Reference, so clearly we have the wrong type. Reference [Person] would just let us call get to turn it into an m [Person], so we could skip that and just use m [Person]. Here's an example that does that:

    -- the only thing we need MonadIO for in this exmple is printing output
    example2 :: (MonadIO m, MonadReference m) => m ()
    example2 = do
        -- alice :: Reference Person
        alice <- new $ Person { name = "Alice" }
        bob <- new $ Person { name = "Bob" }
        -- company :: Reference Company
        company <- new $ Company { legalName = "Eve's Surveillance" }
        (liftIO . print) =<< get alice
        (liftIO . print) =<< get bob
        (liftIO . print) =<< get company
        let people = do
            a <- get alice
            b <- get bob
            return [a, b]
        let structure2 = do
            a <- get alice
            c <- get company
            return (a, c)
        (liftIO . print) =<< people
        (liftIO . print) =<< structure2
        (liftIO . putStrLn) ""
        set alice Person { name = "Mike" }
        set company Company { legalName = "Mike's Meddling" }
        (liftIO . print) =<< get alice
        (liftIO . print) =<< get bob
        (liftIO . print) =<< get company
        (liftIO . print) =<< people
        (liftIO . print) =<< structure2
    

    Now we know quite a bit about what a library or libraries for doing this should look like. Here are some of the requirements we might have already imagined:

    • We need something that keeps the state of all of the entities
    • We need a way to go from one state to a new state that has a new entity
    • We need a way to update an entity stored in the state
    • We need a way to retrieve an entity from the state

    Here are some requirements that emerge from experimenting with some code:

    • We need a way to make a state dependent value that depends on the current state of an entity. We saw this in get alice, get bob, and get company.
    • We need a way to make a state dependent value out of something constant. We saw this in the use of the (:), [], and (,) constructors.
    • We need a way to combine together multiple state dependent values into new state dependent values.

    There's are also a few problems with our example. If we embrace MonadReference m => m a as the type of a state dependent value of type a, there's nothing to stop something that we think is getting the value from the state from also modifying it.

    • A state dependent value shouldn't be able to modify the state.

    We also have performance problems. All of our state dependent values are being completely recalculated every time we use them. A good performance requirement might be:

    • A state dependent value shouldn't be calculated unless state that it depends on has been modified.

    Armed with these new requirements, we can make new interfaces. After we make new interfaces, we can equip them with a naive implementation. After we have a naive implementation, we can address our requirements for performance, and make a performant implementation.

    Some exercise that could prepare us for the next steps include reading about or playing with Control.Applicative, the publisher-subscriber design pattern, the operational monad and transformer Program and ProgramT or the free monad and transformer Free, FreeF, and FreeT, Data.Traversable, Control.Lens, and the knockout.js javascript library.

    Update: The new interfaces

    Based on our new requirements for what state dependent values are, we can write a new interface:

    -- Class for a monad with state dependent values
    class (MonadReference m, Applicative Computed, Monad Computed) => MonadComputed m where
        type Computed :: * -> *
        track :: (Typeable a) => Reference a -> m (Computed a)
        runComputed :: (Typeable a) => (Computed a) -> m a
    

    These address our new requirements as follows:

    • track makes a state dependent value that depends on a Reference, which satisfies our first new requirement.
    • Applicative's pure and Monad's return both provide a method by which to create new Computed values that contain a constant.
    • Applicative's <*> and Monad's >>= provide methods by which to combine computed values into new computed values.
    • The Computed type provides a means for an implementation to exclude unwanted types.

    Now we can write new example code in terms of this interface. We'll construct computed values three different ways: Using Data.Traversable's sequenceA on lists with the Applicative instance for Computed, using the Monad instance for Computed, and finally using the Applicative instance for Computed.

    -- the only thing we need MonadIO for in this exmple is printing output
    example :: (MonadIO m, MonadComputed m) => m ()
    example = do
        -- aliceRef :: Reference Person
        aliceRef <- new $ Person { name = "Alice" }
        -- alice :: Computed Person
        alice <- track aliceRef
        bobRef <- new $ Person { name = "Bob" }
        bob <- track bobRef
        -- companyRef :: Reference Company
        companyRef <- new $ Company { legalName = "Eve's Surveillance" }
        -- company :: Computed Company
        company <- track companyRef
        (liftIO . print) =<< runComputed alice
        (liftIO . print) =<< runComputed bob
        (liftIO . print) =<< runComputed company
        let people = Traversable.sequenceA [alice, bob]
        let structure2 = do
            a <- alice
            c <- company
            return (a, c)
        let structure3 = (pure (,)) <*> structure2 <*> bob
        (liftIO . print) =<< runComputed people
        (liftIO . print) =<< runComputed structure2
        (liftIO . print) =<< runComputed structure3
        (liftIO . putStrLn) ""
        set aliceRef Person { name = "Mike" }
        set companyRef Company { legalName = "Mike's Meddling" }
        (liftIO . print) =<< runComputed alice
        (liftIO . print) =<< runComputed bob
        (liftIO . print) =<< runComputed company
        (liftIO . print) =<< runComputed people
        (liftIO . print) =<< runComputed structure2
        (liftIO . print) =<< runComputed structure3
    

    Note that if we didn't want or need to track aliceRef and track bobRef independently, we could create a list of Computed values by mapM track [aliceRef, bobRef].

    Now we can make another simple implementation for IO, so that we can run our example and see that we are on the right track. We'll use operational's Program type to make this simple and get us both an Applicative and a Monad instance.

    -- Evaluate computations built in IO
    instance MonadComputed IO where
        -- Store the syntax tree in a Program from operational
        type Computed = Program IORef
        track = return . singleton
        runComputed c = case view c of 
            Return x -> return x
            ref :>>= k -> do
                value <- readIORef ref
                runComputed (k value)
    

    At this point the entire running example is:

    {-# LANGUAGE TypeFamilies, DeriveDataTypeable, FlexibleContexts #-}
    module Main (
        main
    ) where
    
    import Data.Typeable
    import qualified Data.Traversable as Traversable
    import Control.Applicative
    import Data.IORef
    
    --transformers package:
    import Control.Monad.IO.Class
    
    --operational package:
    import Control.Monad.Operational
    
    main = example
    
    data Person = Person {
        name :: String
    } deriving (Show, Typeable)
    
    data Company = Company {
        legalName :: String
    } deriving (Show, Typeable)
    
    -- the only thing we need MonadIO for in this exmple is printing output
    example :: (MonadIO m, MonadComputed m) => m ()
    example = do
        -- aliceRef :: Reference Person
        aliceRef <- new $ Person { name = "Alice" }
        -- alice :: Computed Person
        alice <- track aliceRef
        bobRef <- new $ Person { name = "Bob" }
        bob <- track bobRef
        -- companyRef :: Reference Company
        companyRef <- new $ Company { legalName = "Eve's Surveillance" }
        -- company :: Computed Company
        company <- track companyRef
        (liftIO . print) =<< runComputed alice
        (liftIO . print) =<< runComputed bob
        (liftIO . print) =<< runComputed company
        let people = Traversable.sequenceA [alice, bob]
        let structure2 = do
            a <- alice
            c <- company
            return (a, c)
        let structure3 = (pure (,)) <*> structure2 <*> bob
        (liftIO . print) =<< runComputed people
        (liftIO . print) =<< runComputed structure2
        (liftIO . print) =<< runComputed structure3
        (liftIO . putStrLn) ""
        set aliceRef Person { name = "Mike" }
        set companyRef Company { legalName = "Mike's Meddling" }
        (liftIO . print) =<< runComputed alice
        (liftIO . print) =<< runComputed bob
        (liftIO . print) =<< runComputed company
        (liftIO . print) =<< runComputed people
        (liftIO . print) =<< runComputed structure2
        (liftIO . print) =<< runComputed structure3
    
    
    -- Class for a typed dictionary in a monadic context
    class (Monad m) => MonadReference m where
        type Reference :: * -> *
        new :: (Typeable a) => a -> m (Reference a)
        set :: (Typeable a) => Reference a -> a -> m ()
        get :: (Typeable a) => Reference a -> m a
    
    
    -- Class for a monad with state dependent values
    class (MonadReference m, Applicative Computed, Monad Computed) => MonadComputed m where
        type Computed :: * -> *
        track :: (Typeable a) => Reference a -> m (Computed a)
        runComputed :: (Typeable a) => (Computed a) -> m a
    
    -- Instead of implementing a dictionary, for an example we'll just use IORefs when we have IO.
    instance MonadReference IO where
        type Reference = IORef
        new = newIORef
        set = writeIORef
        get = readIORef    
    
    -- Evaluate computations built in IO
    instance MonadComputed IO where
        -- Store the syntax tree in a Program from operational
        type Computed = Program IORef
        track = return . singleton
        runComputed c = case view c of 
            Return x -> return x
            ref :>>= k -> do
                value <- readIORef ref
                runComputed (k value)
    

    We still need to address our performance requirement to minimize the work done when we perform lookups. Our goal requirement was:

    • A state dependent value shouldn't be calculated unless state that it depends on has been modified.

    We can now clarify this to be in terms of our interface:

    • runComputed shouldn't be calculated unless a Computed value that it depends on has been modified since the last time runComputed was executed.

    We can now see that our desired solution is going to be something like cache invalidation or bottom-up query evaluation. I'd guess that in a language with lazy evaluation, they both work out to be about the same thing.

    Final Update: Performance

    Equipped with a new interface, we can now explore and address our performance goal. In doing so, I discovered that there's an additional, subtle requirement that we missed. We would like runComputed to reuse previously computed values if the value hasn't been changed. What we didn't notice is that Haskell's type system should and is preventing us from doing so. A value of type Computed a always means the same thing, it's never actually modified. So the computations that were building our structures will mean the same thing, "a computation constructed from these parts" even after we have executed runComputed. We need to slip in somewhere to put the side effect from the first runComputed. We can do this with the type m (Computed a) instead. The new method in MonadComputed m that does this is:

    share :: (Typeable a) => (Computed a) -> m (Computed a)
    

    The new Computed a we get back means something slightly different: "a possibly cached computation constructed from these parts". We were already doing something similar, but telling Haskell about it instead of telling our code. We wrote, for example:

        let people = Traversable.sequenceA [alice, bob]
    

    This let told the Haskell compiler that each time it encountered people it should use the same thunk. If we instead wrote Traversable.sequenceA [alice, bob] each time it would be used, the Haskell compiler probably wouldn't have created and maintained a pointer to a single thunk. This can be a nice thing to know when juggling memory. If you want to maintain something in memory and avoid computation, use let, if you want to recompute it to avoid holding onto the memory, don't use let. Here we explicitly want to hold on to our computed structures, so we are going to use our new equivalent, share

        people <- share $ Traversable.sequenceA [alice, bob]
    

    The remainder of the changes to the example code at the end are to demonstrate more possible updates.

    Now that we have the interface finalized, we can work on an implementation. This implementation will still take advantage of IO and IORefs. Our implementation is going to be based on subscribing to be notified of changes, and invalidating cached changes and their dependants when a change happens. This data structure stores a value and the subscribers that want to be notified:

    -- A published value for IO, using Weak references to the subscribers
    data Published a = Published {
        valueRef :: IORef a,
        subscribers :: IORef [Weak (IO ())]
    }
    

    Something that needs to be notified when something happens in IO could be as simple as IO (), but then the cycle between a dependant computation and a value would hold all the dependant computations in memory until the original value is forgotten. Instead a Weak pointer (from System.Mem.Weak) to the dependant's update action should allow the garbage collecter to collect these.

    First we'll implement MonadReference IO. Our code to handle References to entities is modified to peek through Published to get the value, and to execute all of the subscribers when the value is set.

    -- A new implementation that keeps an update list
    instance MonadReference IO where
        type Reference = Published
        new = newIORefPublished
        set = setIORefPublished
        get = readIORefPublished
    
    -- Separate implemenations for these, since we'd like to drop the Typeable constraint
    newIORefPublished value =
        do
            ref <- newIORef value
            subscribersRef <- newIORef []
            return Published { valueRef = ref, subscribers = subscribersRef }
    
    setIORefPublished published value =
        do
            writeIORef (valueRef published) value
            notify $ subscribers published                 
    
    
    --readIORefPublished = readIORef . valueRef
    
    readIORefPublished x = do
        putStrLn "getting"
        readIORef $ valueRef x
    

    Notifying the subscribers is a bit tricky. We need to forget about any subscriber that has been removed by garbage collection. I anticipated that a subscriber might be subscribing to things during its update action for the tricky case of binding, so when a subscriber is garbage collected, we don't assume that the new set of subscribers is the old set except for the garabage collected ones, instead we filter them as a separate cleanupWeakRefs step.

    notify :: IORef [Weak (IO ())] -> IO ()
    notify = go
        where
            go subscribersRef = do
                subscribers <- readIORef subscribersRef
                needsCleanup <- (liftM (any id)) (mapM notifySubscriber subscribers)
                when needsCleanup $ cleanupWeakRefs subscribersRef
            notifySubscriber weakSubscriber = do
                maybeSubscriber <- deRefWeak weakSubscriber
                case maybeSubscriber of
                    Nothing -> return True
                    Just subscriber -> subscriber >> return False
    
    
    cleanupWeakRefs :: IORef [Weak a] -> IO ()
    cleanupWeakRefs ref = do
        weaks <- readIORef ref
        newWeaks <- (liftM catMaybes) $ mapM testWeak weaks
        writeIORef ref newWeaks
        where
            testWeak weakRef = liftM (>> Just weakRef) $ deRefWeak weakRef
    

    We're done with our handling of entities, time to get on to the interesting and tricky part, the computations. Here's the complete data type for a computation or state dependent value:

    -- Data type for building computations
    data IORefComputed a where 
        Pure :: a -> IORefComputed a
        Apply :: IORefComputed (b -> a) -> IORefComputed b -> IORefComputed a
        Bound :: IORefComputed b -> (b -> IORefComputed a) -> IORefComputed a
        Tracked :: Published a -> IORefComputed a
        Shared :: Published (Either (IORefComputed a) a) -> IORefComputed a
    

    Pure represents values that don't depend on anything. Apply represents values built by applications of <*>. Bound represents values built using the Monad instance's >>=. Tracked are ordinary state dependent values made using track. Shared are the points at which we remember computations and are notified of changes to tracked values, made using share. We reuse the Published type to store a value and its subscribers, but the value we store is Either the computation that needs to be perfomed when the shared cache is dirty, (IORefComputed a), or the cached value when the cache is clean, a. Here are the instances that let the user use these:

    instance Monad IORefComputed where
        return = Pure
        (>>=) = Bound
        (>>) _ = id
    
    instance Applicative IORefComputed where
        pure = return
        (<*>) = Apply
    
    instance Functor IORefComputed where
        fmap = (<*>) . pure     
    
    -- Evaluate computations built in IO
    instance MonadComputed IO where
        type Computed = IORefComputed
        track = trackIORefComputed
        runComputed = evalIORefComputed
        share = shareIORefComputed
    
    -- Separate implementations, again to drop the Typeable constraint
    trackIORefComputed = return . Tracked
    

    Note: the optimization of >> almost certainly violates the Monad laws in the presence of _|_.

    Now we need to make the non-trivial implementations of runComputed and share. First we'll look at share, which does most of the new work:

    shareIORefComputed :: IORefComputed a -> IO (IORefComputed a)
    shareIORefComputed c =
        case c of          
            Apply cf cx -> do
                sharedf <- shareIORefComputed cf
                sharedx <- shareIORefComputed cx
                case (sharedf, sharedx) of
                    -- Optimize away constants
                    (Pure f, Pure x) -> return . Pure $ f x 
                    _ -> do
                        let sharedc = sharedf <*> sharedx
                        published <- newIORefPublished $ Left sharedc
                        -- What we are going to do when either argument changes
                        markDirty <- makeMarkDirty published published sharedc
                        subscribeTo sharedf markDirty
                        subscribeTo sharedx markDirty
                        return $ Shared published     
            Bound cx k -> do
                sharedx <- shareIORefComputed cx
                case cx of
                    -- Optimize away constants
                    (Pure x) -> shareIORefComputed $ k x
                    _ -> do
                        let dirtyc = sharedx >>= k
                        published <- newIORefPublished $ Left dirtyc
                        -- What we are going to do when the argument to k changes
                        markDirty <- makeMarkDirty published published dirtyc
                        subscribeTo sharedx markDirty            
                        return $ Shared published
            _ -> return c
    

    When we are asked to share an application of <*>, Apply, we first share both of its arguments. We optimize away the value if we can determine it to be constant. If we can't optimize it away, we make a new, initially dirty cache, and ask to be updated whenever either argument changes.

    Dealing with >>= is much more difficult. We share the argument to >>=, but we don't know what Computed value the function will return until we evaluate it with each argument. We say that it can be computed by evaluating the entire bind, and ask to have the cache invalidated whenever the argument changes. We will definitely want to improve this later.

    In all other cases there's nothing to be done to cache the value; it is either a constant or an already cached Tracked or Shared.

    If you are doubting the need for share, replace this definition with

    shareIORefComputed c = return c
    

    and run the examples. You'll see that every involved value is read every time we run runComputed. There's nothing you can do in runComputed to modify an existing Computed to know about a place it has been cached, because we can't change existing values in Haskell.

    Now we'll implement runComputed. The basic idea is that we evaluate things as before, but when we encounter a dirty shared cache we calculate its new value and update the cache. These updates do not trigger notification of the subscribers.

    evalIORefComputed :: IORefComputed a -> IO a
    evalIORefComputed c = 
        case c of
            Pure x -> return x
            Apply cf cx -> do
                f <- evalIORefComputed cf
                x <- evalIORefComputed cx
                return (f x) 
            Bound cx k -> do            
                value <- evalIORefComputed cx
                evalIORefComputed (k value)
            Tracked published -> readIORefPublished published
            Shared publishedThunk -> do
                thunk <- readIORefPublished publishedThunk
                case thunk of
                    Left computation@(Bound cx k) -> do
                        x <- evalIORefComputed cx
                        -- Make a shared version of the computed computation
                        currentExpression <- shareIORefComputed (k x)
                        let gcKeyedCurrentExpression = Left currentExpression
                        writeIORef (valueRef publishedThunk) gcKeyedCurrentExpression
                        markDirty <- makeMarkDirty publishedThunk gcKeyedCurrentExpression computation
                        subscribeTo currentExpression markDirty
                        evalIORefComputed c
                    Left computation -> do
                        value <- evalIORefComputed computation
                        writeIORef (valueRef publishedThunk) (Right value)
                        return value
                    Right x ->
                        return x
    

    This is straightforward except for what we do for a dirty shared >>=. We evaluate the argument, then we share the resulting computation. The trick is that we ask that the entire shared thunk be marked dirty when this new value is updated. We ask the garbage collected to forget about this when the dirty marking for this currentExpression is garbage collected. This provides a window during which the thunk might be marked dirty even if it no longer depends on currentExpression. Now a shared bind will be marked dirty by both changes to its argument, changes to the computed value that depended on its argument, and changes to computed values that recently depended on its argument and haven't been garbage collected yet.

    The remainder of the implementation is building the weak references to notifications and subscribing to a published value by prepending the new subscriber.

    makeMarkDirty :: Published (Either (IORefComputed a) a) -> k -> IORefComputed a -> IO (Weak (IO ())) 
    makeMarkDirty published key definition =
        do
            let markDirty = do
                existing <- readIORef (valueRef published)
                case existing of
                    Right _ -> setIORefPublished published $ Left definition
                    _ -> return ()
            mkWeak key markDirty Nothing
    
    
    subscribeTo :: IORefComputed a -> Weak (IO ()) -> IO ()
    subscribeTo (Tracked published) trigger = modifyIORef' (subscribers published) (trigger :)
    subscribeTo (Shared published) trigger = modifyIORef' (subscribers published) (trigger :)    
    subscribeTo _ _ = return ()            
    

    The complete compiling example code is on github. It requires the transformers package.

    If you run the example code, you'll notice that:

    • After the company name is changed, runComputed people performs only a single get to get the cached value
    • After no changes are made, all three structures each perform only a single get to get the entire cached structure
    • After bob is changed, , runComputed structure2 performs only a single get to get the cached value, and the work done to compute structure3 is less, even though bob is in structure3.
    • structure2, the one built with the Monad instance, requires the most work to compute, due to the extra intermediary shared values.

提交回复
热议问题