Question
I got a weird result from vw, which uses an online learning scheme for logistic regression. When I add --l1 or --l2 regularization, I get all predictions at 0.5 (which means all feature weights are 0, since sigmoid(0) = 0.5).
Here's my command:
vw -d training_data.txt --loss_function logistic -f model_l1 --invert_hash model_readable_l1 --l1 0.05 --link logistic
...and here's the learning progress output:
using l1 regularization = 0.05
final_regressor = model_l1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = training_data.txt
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.5000 120
0.423779 0.154411 2 2.0 -1.0000 0.1431 141
0.325755 0.227731 4 4.0 -1.0000 0.1584 139
0.422596 0.519438 8 8.0 -1.0000 0.4095 147
0.501649 0.580701 16 16.0 -1.0000 0.4638 139
0.509752 0.517856 32 32.0 -1.0000 0.4876 131
0.571194 0.632636 64 64.0 1.0000 0.2566 140
0.572743 0.574291 128 128.0 -1.0000 0.4292 139
0.597763 0.622783 256 256.0 -1.0000 0.4936 143
0.602377 0.606992 512 512.0 1.0000 0.4996 147
0.647667 0.692957 1024 1024.0 -1.0000 0.5000 119
0.670407 0.693147 2048 2048.0 -1.0000 0.5000 146
0.681777 0.693147 4096 4096.0 -1.0000 0.5000 115
0.687462 0.693147 8192 8192.0 -1.0000 0.5000 145
0.690305 0.693147 16384 16384.0 -1.0000 0.5000 145
0.691726 0.693147 32768 32768.0 -1.0000 0.5000 116
0.692437 0.693147 65536 65536.0 -1.0000 0.5000 117
0.692792 0.693147 131072 131072.0 -1.0000 0.5000 117
0.692970 0.693147 262144 262144.0 -1.0000 0.5000 147
BTW, the total number of features is nearly 80,000, but each example contains only a tiny fraction of them (that's why current features is only around 100).
Here's my guess: in the objective/loss function, might the second term (the regularization loss) dominate the whole equation and lead to this phenomenon?
loss = example_loss + regularization_loss
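A back-of-the-envelope check of that guess (the numbers are assumed, purely for illustration): with ~80,000 weights, even a small average weight magnitude makes an L1 term with lambda = 0.05 dwarf a typical per-example logistic loss, which would push the minimizer toward all-zero weights.

import math

n_features = 80_000        # from the question
lam = 0.05                 # the --l1 value used
avg_abs_weight = 0.01      # assumed average |weight|, for illustration only

example_loss = math.log(1 + math.exp(-1.0))     # logistic loss at margin 1.0
reg_loss = lam * n_features * avg_abs_weight    # L1 term: lambda * sum(|w|)

print(f"example loss ~ {example_loss:.3f}")     # ~0.313
print(f"L1 penalty   ~ {reg_loss:.1f}")         # 40.0 -> dominates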
I also tried another dataset (the other day's), tuning --l1 with vw-hypersearch:
$vw-hypersearch -L 1e-10 5e-4 vw --l1 % training_data.txt
vw-hypersearch: -L: using log-space search
trying 1.38099196677199e-06 ...................... 0.121092 (best)
trying 3.62058586892961e-08 ...................... 0.116472 (best)
trying 3.81427762457755e-09 ...................... 0.116095 (best)
trying 9.49219282204347e-10 ...................... 0.116084 (best)
trying 4.01833137620189e-10 ...................... 0.116083 (best)
trying 2.36222250814353e-10 ...................... 0.116083 (best)
loss(2.36222e-10) == loss(4.01833e-10): 0.116083
trying 3.08094024967111e-10 ...................... 0.116083 (best)
3.08094e-10 0.116083
Answer 1:
As you correctly suspected, the regularization term dominates the loss calculation, leading to this result. This is because the regularization argument passed on the command line, --l1 0.05, is too large.
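A quick sanity check on the log above: the plateau loss of 0.693147 is exactly ln(2), the logistic loss of constantly predicting 0.5, i.e. of a model that has learned nothing:

import math
# ln(2) is the logistic loss of a constant 0.5 prediction; it matches
# the 0.693147 the run converges to once all weights are truncated away.
print(math.log(2))   # 0.6931471805599453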
Why does it work this way? vw applies the --l1 (the same applies to --l2) regularization value directly to the calculated sum-of-gradients; that is, the value used is absolute rather than relative. After some convergence, the sum-of-gradients often gets close to zero, so the regularization value dominates it. As the learning rate plateaus (too early, due to the large L1), the learner can't extract more information from further examples.
Setting --l1 to a high value imposes a high floor on the convergence process.
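Here is a minimal sketch of that floor, in the style of a truncated-gradient L1 update (vw's actual schedule differs in its details, so treat this only as an illustration of the arithmetic): once the gradient steps fall below the truncation amount, every weight gets clipped back to exactly zero.

def truncate(w, amount):
    """Soft-threshold: shrink w toward 0 by `amount`, clipping at 0."""
    if w > 0:
        return max(0.0, w - amount)
    return min(0.0, w + amount)

lr, lam = 0.5, 0.05           # the learning rate and --l1 from the question
w = 0.0
for grad in (-0.02, 0.01, -0.03, -0.015):  # assumed small late-stage gradients
    w -= lr * grad            # ordinary gradient step
    w = truncate(w, lr * lam) # absolute L1 truncation of 0.025
    print(f"w = {w:+.4f}")    # stays +0.0000 on every step

Each truncation (0.025) exceeds each gradient step (at most 0.015), so w is pinned at 0 and the prediction stays at sigmoid(0) = 0.5.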
As the vw-hypersearch result above shows, using a much smaller --l1 regularization value can improve the end result significantly:
+----------+----------------+
| l1 value | final avg loss |
+----------+----------------+
| 5e-02    | 0.692970       |
| 3.1e-10  | 0.116083       |
+----------+----------------+
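With a value in that range (e.g. the 3.08e-10 that vw-hypersearch settled on above), the original command can be rerun with only the --l1 argument changed:
vw -d training_data.txt --loss_function logistic -f model_l1 --invert_hash model_readable_l1 --l1 3.08e-10 --link logistic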
Source: https://stackoverflow.com/questions/32752833/is-it-reasonable-for-l1-l2-regularization-to-cause-all-feature-weights-to-be-zer