Here we go. This flow only requires a subset of the Vowpal Wabbit input format. First off, after a successful installation, we start a Vowpal Wabbit daemon:
vw --cb_explore 2 --daemon --port 26542 --save_resume
In the above, we tell VW to start a daemon serving a Contextual Bandit model, without any upfront training provided through old policy data. The model will be VW's default contextual bandits model, and, as specified above, it will assume just two actions to choose from. Vowpal will initially suggest actions at random, and will over time approach the optimal policy.
Let's just check that the daemon is up:
pgrep 'vw.*'
should return a list of matching processes.
If at any later time we wanted to stop the daemon before starting it again, we would simply run:
pkill -9 -f 'vw.*--port 26542'
Now let us simulate decision points and the costs obtained for the actions taken. In what follows I dispatch messages to the daemon from the terminal, but you can do the same with a tool like Postman or from your own code:
echo " | a b " | netcat localhost 26542
Here we just told Vowpal to suggest what action we should take for a context comprising the feature set (a, b).
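If you'd rather drive the daemon from your own code than from the terminal, here is a minimal Python sketch of dispatching the same message over a plain TCP socket; the helper name and connection details are just this example's assumptions, matching the daemon we started above:

import socket

def ask_vw(line, host="localhost", port=26542):
    """Send one VW-format line to the daemon and return its one-line reply."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((line + "\n").encode())
        return sock.makefile().readline().strip()

print(ask_vw(" | a b "))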
Vowpal succinctly replies not with a chosen action, but with a probability distribution over the two actions our model was instructed to choose from:
0.975000 0.025000
These values are of course just the result of the model's initialization, as it hasn't seen any costs yet! Now our application using Vowpal is expected to choose an action at random according to this distribution; this part is not implemented by Vowpal but left to application code. The Contextual Bandits model relies on us sampling from this distribution when choosing the action to be played against the environment; if we do not follow this expectation, the algorithm may not accomplish its learning.
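To make that sampling step concrete, here is a small sketch (plain Python, variable names are just for illustration) that parses the daemon's reply and draws an action according to those probabilities:

import random

reply = "0.975000 0.025000"               # the daemon's reply from above
pmf = [float(p) for p in reply.split()]
actions = list(range(1, len(pmf) + 1))    # VW numbers actions starting at 1
chosen = random.choices(actions, weights=pmf, k=1)[0]
print(chosen)   # 1 with probability 0.975, 2 with probability 0.025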
So imagine we sampled from this distribution and got action 1, then executed that action in the real-world environment (for the same context a b we asked Vowpal to recommend for). Imagine we got back a cost of 0.7 this time. We have to communicate this cost back to Vowpal as feedback:
echo " 1:0.7:1 | a b " | netcat localhost 26542
Vowpal takes our feedback, and gives us back its updated prediction for this context:
0.975000 0.025000
We don't care about this prediction right now, unless we wish to get a recommendation for the exact same context again, but we get it back anyway.
Obviously it's the same recommendation as before, as a single piece of feedback isn't enough for the model to learn anything. Repeat this process many times, for many different contexts, and the predictions returned by Vowpal will adapt, shifting per what the model has learned.
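To give a feel for what repeating this many times looks like, here is a toy end-to-end loop under the same assumptions as above (daemon on localhost:26542, two actions), with a completely made-up environment that hands back a cost for the played action:

import random
import socket

def ask_vw(line, host="localhost", port=26542):
    with socket.create_connection((host, port)) as sock:
        sock.sendall((line + "\n").encode())
        return sock.makefile().readline().strip()

def fake_cost(context, action):
    # Hypothetical environment: action 1 works well for "a b", action 2 for "c d".
    best = 1 if context == "a b" else 2
    return 0.1 if action == best else 0.9

contexts = ["a b", "c d"]
for _ in range(1000):
    ctx = random.choice(contexts)
    pmf = [float(p) for p in ask_vw(f" | {ctx} ").split()]                # ask for the distribution
    action = random.choices(range(1, len(pmf) + 1), weights=pmf, k=1)[0]  # sample an action from it
    cost = fake_cost(ctx, action)                                         # play it, observe a cost
    prob = pmf[action - 1]
    ask_vw(f"{action}:{cost}:{prob} | {ctx} ")                            # report the cost back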
Note that I speak of costs and not rewards here: unlike much of the literature on the algorithms implemented in Vowpal, the command-line version at least takes costs as feedback, not rewards.
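If your environment naturally produces rewards rather than costs, a simple conversion before reporting back (an assumption of this write-up, not something VW mandates) is to negate:

def reward_to_cost(reward):
    # Lower cost means better in VW's eyes, so a negated reward works;
    # for rewards already in [0, 1], 1.0 - reward is another common choice.
    return -reward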