Following the available docs and resources, it is not really clear how to accomplish a simple getting-started flow where you launch Vowpal Wabbit as a daemon (possibly even without any pre-learnt model) and have it learn and explore purely online. I'm looking for a flow where I feed in a context, get back a recommendation, and feed back a cost/reward.
So let me skip the technical description of what I've tried and simply ask for a clear demonstration of what I consider the essentials here:
- How can I demonstrate, through a daemon, that learning is taking place, not offline from batch data but purely from online interaction? Any good suggestions?
- How do I report back a cost/reward for a selected action in daemon mode? Once per action? In bulk? And either way, how?
- Somewhat related: would you recommend running a live contextual-bandits system through the daemon, or rather through one of the language APIs?
- Alternatively, can you point at where the server code sits inside the (gigantic) code base? It could be a good place to start exploring systematically.
I typically get back a distribution (with as many entries as there are allowed actions) as a reply for every input sent, and typically the same distribution regardless of what I sent in. Maybe it takes a whole learning epoch with the default --cb_explore algorithm; I wouldn't know, and I'm not sure the epoch duration can be set from outside.
I understand that a lot of effort has gone into enabling learning from past interactions and from cb-ified data. However, I think there should also be some explanation available covering the more-or-less pragmatic essentials above.
Thanks so much!
Here it goes. This flow only requires a subset of the Vowpal Wabbit input format. First, after a successful installation, we start a Vowpal Wabbit daemon:
vw --cb_explore 2 --daemon --port 26542 --save_resume
In the above, we tell VW to start a contextual-bandits model-serving daemon, without any upfront training from old policy data. The model is VW's default contextual-bandits model and, as specified above, it assumes there are just two actions to choose from; --save_resume keeps the extra state needed so that online learning can be continued from a saved model later. Vowpal will initially suggest actions more or less arbitrarily, and will over time approach the optimal policy.
Let's just check that the daemon is up: pgrep 'vw.*' should return a list of processes.
If at any later time we want to stop the daemon and start it again, we simply run pkill -9 -f 'vw.*--port 26542'.
Now let's simulate decision points and the costs obtained for the actions taken. In the following I dispatch messages to the daemon from the terminal, but you can do the same from any TCP client or from your own code (a small Python sketch follows below):
echo " | a b " | netcat localhost 26542
Here we just asked Vowpal to suggest what action we should take for a context comprising the feature set (a, b).
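If you would rather talk to the daemon from your own code instead of netcat, the exchange is just one line of text over a TCP socket. Here is a minimal Python sketch, assuming the daemon started above is listening on localhost:26542 (the helper name send_to_vw and the one-connection-per-message style are my own choices, not anything prescribed by VW):
import socket

def send_to_vw(line, host="localhost", port=26542):
    # Send one VW-format line to the daemon and read back its one-line reply.
    with socket.create_connection((host, port)) as sock:
        sock.sendall((line + "\n").encode())
        reply = sock.makefile().readline()
    return reply.strip()

print(send_to_vw(" | a b "))   # prints something like: 0.975000 0.025000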
Vowpal succinctly replies not with a chosen action, but with a probability distribution over the two actions our model was told to choose from, i.e. the probability with which each action should be played:
0.975000 0.025000
These probabilities are of course only the product of the as-yet-untrained model plus the default exploration scheme, as it hasn't seen any costs yet! Now our application is expected to sample an action at random according to this distribution; this part is not implemented by Vowpal but is left to the application code. The contextual-bandits machinery relies on us sampling from this distribution to choose the action actually played against the environment; if we don't follow this expectation, the algorithm may not manage to learn.
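Here is a minimal sketch of that sampling step, assuming the space-separated reply format shown above (the function name sample_action is purely illustrative):
import random

def sample_action(reply):
    # One probability per action; VW action ids are 1-based.
    probs = [float(p) for p in reply.split()]
    # Draw an index according to those probabilities.
    idx = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx + 1, probs[idx]

action, prob = sample_action("0.975000 0.025000")
print(action, prob)   # usually 1 0.975, occasionally 2 0.025
Keep the probability of the action you actually played: it goes back to Vowpal together with the observed cost.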
So imagine we sampled from this distribution and got action 1, then executed that action in the real-world environment (for the same context a b we asked Vowpal to recommend for), and this time observed a cost of 0.7. We have to communicate this cost back to Vowpal as feedback, as a label of the form action:cost:probability in front of the same context features:
echo " 1:0.7:1 | a b " | netcat localhost 26542
Vowpal takes our feedback and gives us back its updated prediction for this context:
0.975000 0.025000
We don't particularly care about it right now, unless we want a recommendation for this exact same context again, but we get the updated prediction back anyway. Obviously it is the same prediction as before, since a single piece of feedback isn't enough for the model to learn anything. Repeat this process many times, over many different contexts, and the predictions returned by Vowpal will adapt and shift according to what it has learned; the simulation sketch further below demonstrates this.
Note that I speak of costs and not rewards here: unlike much of the literature on these algorithms, Vowpal (the command-line version, at least) takes costs as feedback, not rewards, so lower is better.
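Putting it all together, and to address the "how can I demo that learning is taking place" part: below is an end-to-end simulation sketch against the daemon started above. Everything in it is my own scaffolding, not part of VW: the simulated_cost environment, the 2000 rounds, and the single persistent connection (kept mainly so all examples flow through one channel, without depending on how the daemon spreads separate connections across its workers). Only the line format going over the wire is VW's:
import random
import socket

def simulated_cost(context, action):
    # Invented environment: action 1 is cheap when the context is "a",
    # action 2 is cheap when the context is "b".
    best = 1 if context == "a" else 2
    return 0.1 if action == best else 0.9

with socket.create_connection(("localhost", 26542)) as sock:
    stream = sock.makefile("rw")

    def exchange(line):
        # Send one VW-format line, read back the daemon's one-line reply.
        stream.write(line + "\n")
        stream.flush()
        return stream.readline().strip()

    for _ in range(2000):
        context = random.choice(["a", "b"])
        # 1. Ask for the action distribution for this context.
        probs = [float(p) for p in exchange(f" | {context} ").split()]
        # 2. Sample an action from it (1-based action ids).
        idx = random.choices(range(len(probs)), weights=probs, k=1)[0]
        action, prob = idx + 1, probs[idx]
        # 3. "Play" the action and observe its cost.
        cost = simulated_cost(context, action)
        # 4. Report action:cost:probability back with the same context features.
        exchange(f" {action}:{cost}:{prob} | {context} ")

    # If online learning took place, most of the probability mass should now
    # sit on the cheap action for each context.
    print("context a:", exchange(" | a "))
    print("context b:", exchange(" | b "))
Printing the two distributions every few hundred iterations (instead of only at the end) gives an even clearer picture of the predictions drifting apart purely from online interaction, with no batch data involved.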
Source: https://stackoverflow.com/questions/48194011/how-to-demo-vowpal-wabbits-contextual-bandits-in-real-online-mode