I’m checking Vowpal Wabbit’s documentation for how it’s actually learning. Traditional Contextual Bandits learn by having F(context, action) = Reward, find action that maxim