Is it reasonable for l1/l2 regularization to cause all feature weights to be zero in vowpal wabbit?


As you correctly suspected, the regularization term dominates the loss calculation and produces this result: the regularization argument passed on the command line, --l1 0.05, is far too large.

Why does it work this way? vw applies the --l1 regularization value (and likewise --l2) directly to the calculated sum of gradients; that is, the value is used as an absolute amount rather than one relative to the gradients. As learning converges, the sum of gradients typically shrinks toward zero, so a large regularization value quickly dominates it. Once progress plateaus (too early, because of the large L1), the learner can no longer extract more information from further examples, and the weights are truncated to zero.

Setting --l1 to a high value therefore imposes a high floor on the convergence process.
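
For concreteness, here is a hedged sketch of how you might reproduce and compare the two settings from the command line. The data and model file names are placeholders; --l1, -d and --readable_model are standard vw options:

    # train.dat and the model file names are placeholders for your own files.
    # Overly large L1 penalty: weights get truncated toward zero.
    vw -d train.dat --l1 0.05 --readable_model model.l1_large.txt

    # Much smaller L1 penalty: useful weights survive in the model.
    vw -d train.dat --l1 3.1e-10 --readable_model model.l1_small.txt

Comparing the two readable model dumps (and the reported average loss) should make the truncation effect visible.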

As the vw-hypersearch results below show, using a much smaller --l1 regularization term can improve the end result significantly:

+----------+----------------+
| l1 value | final avg loss |
+----------+----------------+
| 5.1e-02  |       0.692970 |
| 3.1e-10  |       0.116083 |
+----------+----------------+
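
If you want to find a good value automatically, the vw-hypersearch script shipped in vw's utl/ directory performs a one-dimensional search over a single parameter. A hedged sketch (train.dat and the search bounds are placeholders, and the exact invocation may differ between vw versions):

    # vw-hypersearch substitutes each trial value for the '%' placeholder
    # and reports the --l1 value that achieved the lowest average loss.
    vw-hypersearch 1e-12 1 vw --l1 % train.dat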