I got a weird result from vw, which uses an online learning scheme for logistic regression. When I add --l1 or --l2 regularization, all predictions come out at 0.5 (which means all feature weights are 0).
Here's my command:
vw -d training_data.txt --loss_function logistic -f model_l1 --invert_hash model_readable_l1 --l1 0.05 --link logistic
...and here's the learning process info:
using l1 regularization = 0.05
final_regressor = model_l1
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = training_data.txt
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 -1.0000 0.5000 120
0.423779 0.154411 2 2.0 -1.0000 0.1431 141
0.325755 0.227731 4 4.0 -1.0000 0.1584 139
0.422596 0.519438 8 8.0 -1.0000 0.4095 147
0.501649 0.580701 16 16.0 -1.0000 0.4638 139
0.509752 0.517856 32 32.0 -1.0000 0.4876 131
0.571194 0.632636 64 64.0 1.0000 0.2566 140
0.572743 0.574291 128 128.0 -1.0000 0.4292 139
0.597763 0.622783 256 256.0 -1.0000 0.4936 143
0.602377 0.606992 512 512.0 1.0000 0.4996 147
0.647667 0.692957 1024 1024.0 -1.0000 0.5000 119
0.670407 0.693147 2048 2048.0 -1.0000 0.5000 146
0.681777 0.693147 4096 4096.0 -1.0000 0.5000 115
0.687462 0.693147 8192 8192.0 -1.0000 0.5000 145
0.690305 0.693147 16384 16384.0 -1.0000 0.5000 145
0.691726 0.693147 32768 32768.0 -1.0000 0.5000 116
0.692437 0.693147 65536 65536.0 -1.0000 0.5000 117
0.692792 0.693147 131072 131072.0 -1.0000 0.5000 117
0.692970 0.693147 262144 262144.0 -1.0000 0.5000 147
BTW, the total number of features is nearly 80,000, and each sample contains only a tiny fraction of them (that's why current features is only around 100).
Here's my guess: in the objective/loss function, the second term (the regularization loss) might dominate the whole equation, which leads to this phenomenon:
loss = example_loss + regularization_loss
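Written out, I think this roughly corresponds to the standard L1-regularized logistic objective (my own notation, not something taken from vw's documentation):

\min_{w} \;\; \frac{1}{N}\sum_{i=1}^{N} \log\bigl(1 + e^{-y_i\, w^{\top} x_i}\bigr) \;+\; \lambda_1 \lVert w \rVert_1

where the first term is the example loss and the second term is the regularization loss.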
I also tried vw-hypersearch on another dataset (from a different day):
$vw-hypersearch -L 1e-10 5e-4 vw --l1 % training_data.txt
vw-hypersearch: -L: using log-space search
trying 1.38099196677199e-06 ...................... 0.121092 (best)
trying 3.62058586892961e-08 ...................... 0.116472 (best)
trying 3.81427762457755e-09 ...................... 0.116095 (best)
trying 9.49219282204347e-10 ...................... 0.116084 (best)
trying 4.01833137620189e-10 ...................... 0.116083 (best)
trying 2.36222250814353e-10 ...................... 0.116083 (best)
loss(2.36222e-10) == loss(4.01833e-10): 0.116083
trying 3.08094024967111e-10 ...................... 0.116083 (best)
3.08094e-10 0.116083
As you correctly suspected, the regularization term dominates the loss calculation, leading to this result. This is because the regularization argument passed on the command line, --l1 0.05, is too large.
Why does it work this way? vw applies the --l1 (and likewise the --l2) regularization value directly to the calculated sum-of-gradients, i.e. the value used is absolute rather than relative. After some convergence, the sum-of-gradients often gets close to zero, so the regularization value dominates it. Because the learning rate plateaus too early (due to the large L1), the learner can't extract more information from further examples.
Setting --l1 to a high value imposes a high floor on the convergence process.
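To make the mechanism concrete, here is a minimal Python sketch of an SGD logistic update followed by an absolute, per-step L1 shrinkage. It only illustrates the "absolute rather than relative" point; it is not vw's actual source code, and the function names and constants are assumptions:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, x, y, lr, l1):
    """One logistic-regression update on a sparse example x = {feature: value},
    with label y in {-1, +1}, followed by an absolute L1 shrink toward zero."""
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    grad_scale = -y * sigmoid(-margin)  # derivative of the logistic loss w.r.t. the raw score w.x
    for f, v in x.items():
        w[f] = w.get(f, 0.0) - lr * grad_scale * v
        # Absolute shrinkage: when lr * l1 is larger than the typical update,
        # it drags every touched weight straight back to zero.
        w[f] = math.copysign(max(abs(w[f]) - lr * l1, 0.0), w[f])
    return w

# With --l1 0.05 the shrinkage (lr * l1) quickly outweighs the gradient term,
# so all weights collapse to 0 and every prediction becomes sigmoid(0) = 0.5,
# which is exactly the behaviour in the log above.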
As the vw-hypersearch result above shows, using a much smaller --l1 regularization term can improve the end result significantly:
+----------+----------------+
| l1 value | final avg loss |
+----------+----------------+
| 5.0e-02  | 0.692970 |
| 3.1e-10 | 0.116083 |
+----------+----------------+
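With the tuned coefficient, retraining is just the original command with the --l1 value swapped in (shown only as an example of plugging the hypersearch result back in; otherwise identical to the command in the question):

vw -d training_data.txt --loss_function logistic -f model_l1 --invert_hash model_readable_l1 --l1 3.1e-10 --link logistic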
Source: https://stackoverflow.com/questions/32752833/is-it-reasonable-for-l1-l2-regularization-to-cause-all-feature-weights-to-be-zer