gradient-descent

How to determine the learning rate and the variance in a gradient descent algorithm?

爱⌒轻易说出口 submitted on 2019-12-03 02:39:36
I started learning machine learning last week. When I wanted to write a gradient descent script to estimate the model parameters, I ran into a problem: how do I choose an appropriate learning rate and variance? I found that different (learning rate, variance) pairs may lead to different results, and sometimes you can't even get convergence. Also, if you switch to another training data set, a well-chosen (learning rate, variance) pair probably won't work anymore. For example (script below), when I set the learning rate to 0.001 and the variance to 0.00001, for 'data1' I can get the suitable theta0_guess and theta1…
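The asker's script is cut off above, but the shape of such a loop is standard. Here is a minimal sketch, assuming the "variance" in the question is the convergence tolerance (the threshold on the change in cost used to stop iterating); all names below are illustrative, not taken from the original script:

    import numpy as np

    def gradient_descent(x, y, alpha=0.001, tol=0.00001, max_iters=100000):
        # alpha: learning rate; tol: stopping threshold on the change in cost
        theta0, theta1 = 0.0, 0.0
        prev_cost = float("inf")
        for _ in range(max_iters):
            err = (theta0 + theta1 * x) - y     # prediction error per sample
            theta0 -= alpha * err.mean()        # simultaneous update of both
            theta1 -= alpha * (err * x).mean()  # parameters from the same err
            cost = (err ** 2).mean() / 2
            if abs(prev_cost - cost) < tol:     # converged: cost barely moved
                break
            prev_cost = cost
        return theta0, theta1

A learning rate that is too large makes the cost oscillate or diverge; one that is too small makes convergence crawl, which is why a pair tuned on one data set often fails on another unless the features are rescaled.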

Selection of Mini-batch Size for Neural Network Regression

最后都变了- submitted on 2019-12-03 00:51:43
I am doing a neural network regression with 4 features. How do I determine the mini-batch size for my problem? I see people use batch sizes of 100 ~ 1000 for computer vision, with 32*32*3 features per image; does that mean I should use a batch size of 1 million? I have billions of data points and tens of GB of memory, so there is no hard requirement stopping me from doing that. I also observed that a mini-batch of size ~1000 makes convergence much faster than a batch size of 1 million. I thought it should be the other way around, since the gradient calculated with a larger batch size is most…
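One way to see why the smaller batch wins in practice: with a fixed compute budget, batch size trades gradient accuracy against the number of parameter updates per epoch. A sketch of the bookkeeping (illustrative; step_fn is an assumed hook that performs one gradient update):

    import numpy as np

    def run_epoch(X, y, step_fn, batch_size=1000):
        # one pass over the data in shuffled mini-batches
        n = X.shape[0]
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            step_fn(X[batch], y[batch])
        # batch_size=1000 on a billion samples gives ~1e6 updates per epoch,
        # versus a single (more accurate) update with batch_size = n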

Why do we need to call zero_grad() in PyTorch? [duplicate]

孤人 submitted on 2019-12-02 20:17:59
This question already has an answer here: Why do we need to explicitly call zero_grad()? (4 answers) The method zero_grad() needs to be called during training, but the documentation is not very helpful: | zero_grad(self) | Sets gradients of all model parameters to zero. Why do we need to call this method? In PyTorch, we need to set the gradients to zero before starting backpropagation because PyTorch accumulates the gradients on subsequent backward passes. This is convenient while training RNNs. So, the default action is to accumulate (i.e. sum) the gradients on every loss.backward() call…
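In practice the call sits at the top of each training step. A standard PyTorch loop, showing exactly where zero_grad() fits (the model, loss, and loader here are illustrative stand-ins):

    import torch

    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for x, y in loader:            # `loader` yields (input, target) batches
        optimizer.zero_grad()      # clear gradients left over from the last step
        loss = loss_fn(model(x), y)
        loss.backward()            # sums new gradients into each parameter's .grad
        optimizer.step()           # update parameters from the freshly summed grads

Skipping zero_grad() means each step moves along the sum of all gradients computed so far, which is rarely what you want outside of deliberate gradient accumulation.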

why gradient descent when we can solve linear regression analytically

╄→гoц情女王★ submitted on 2019-12-02 14:02:47
What is the benefit of using gradient descent in the linear regression setting? It looks like we can solve the problem (finding the theta0..n that minimize the cost function) analytically, so why would we still want to use gradient descent to do the same thing? Thanks. Answer 1: When you use the normal equations to solve the cost function analytically, you have to compute theta = (X^T X)^(-1) X^T y, where X is your matrix of input observations and y your output vector. The problem with this operation is the time complexity of calculating the inverse of an n-by-n matrix, which is O(n^3), and as n increases it can take a very long time to…
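In code, the normal-equation solution is a one-liner; a sketch (numpy, illustrative), using a linear solve instead of an explicit inverse since forming the inverse is both slower and less numerically stable:

    import numpy as np

    def normal_equation(X, y):
        # theta = (X^T X)^(-1) X^T y, computed as a linear solve;
        # the O(n^3) cost is in factorizing the n-by-n matrix X^T X,
        # where n is the number of features
        return np.linalg.solve(X.T @ X, X.T @ y)

Gradient descent sidesteps that cubic cost, which is why it is preferred once the feature count gets large.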

Caffe: what will happen if two layers backprop gradients to the same bottom blob?

被刻印的时光 ゝ submitted on 2019-12-02 13:23:11
Question: I'm wondering what happens if I have a layer generating a bottom blob that is further consumed by two subsequent layers, both of which will generate some gradients to fill bottom.diff in the backpropagation stage. Will the two gradients be added up to form the final gradient, or can only one of them survive? In my understanding, Caffe layers need to memset bottom.diff to all zeros before filling it with some computed gradients, right? Will the memset flush out the already-computed gradients by…
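The question cuts off, but the underlying convention is shared by essentially all autodiff frameworks: when one output feeds several consumers, their gradients are summed. A quick check of that convention in PyTorch (an analogue for intuition, not Caffe itself):

    import torch

    x = torch.ones(3, requires_grad=True)
    a = 2 * x                       # first consumer of x
    b = 3 * x                       # second consumer of x
    (a.sum() + b.sum()).backward()
    print(x.grad)                   # tensor([5., 5., 5.]): 2 + 3, summed, not overwritten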

Where can I have a look at TensorFlow gradient descent main loop?

时光怂恿深爱的人放手 submitted on 2019-12-02 10:48:19
Question: (Sorry if this sounds a bit naive.) I want to have a look at the meat of the TensorFlow implementation of GradientDescent, and see for myself how they handle the termination condition, step-size adaptiveness, etc. I traced the code down to training_ops.apply_gradient_descent, but I can't find the implementation :( Answer 1: The TensorFlow Optimizer interface (which GradientDescentOptimizer implements) defines a single step of minimization. Termination conditions or adjusting the step size is…
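That single-step contract means the loop, and therefore any stopping rule, lives in user code, not in the optimizer. The shape of such a loop in modern TensorFlow (illustrative, TF 2.x API; the toy objective and tolerance are assumptions):

    import tensorflow as tf

    w = tf.Variable(5.0)
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)

    for step in range(1000):                  # iteration budget: the caller's choice
        with tf.GradientTape() as tape:
            loss = (w - 3.0) ** 2             # toy objective with minimum at w = 3
        grads = tape.gradient(loss, [w])
        opt.apply_gradients(zip(grads, [w]))  # the optimizer only takes one step
        if abs(w.numpy() - 3.0) < 1e-6:       # termination: also the caller's choice
            break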

mxnet gradient descent for linear regression, variable types error

只谈情不闲聊 submitted on 2019-12-02 08:29:39
I'm trying to implement a simple gradient descent for linear regression. It works normally if I compute the gradient manually (by using the analytical expression), but now I was trying to implement it with autograd from the mxnet module. This is the code (it also relies on pandas, imported here for completeness):

    import pandas as pd
    from mxnet import autograd, np, npx
    npx.set_np()

    def main():
        # learning algorithm parameters
        nr_epochs = 1000
        alpha = 0.01
        # read data, insert column of ones (to include bias with other parameters)
        data = pd.read_csv("dataset.txt", header=0, index_col=None, sep="\s+")
        data.insert(0, "x_0", 1, True)  # insert column of "1"s as x_0
        m = data.shape…
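The snippet cuts off before the autograd part, but the "variable types" error in this setting is typically caused by mixing pandas/numpy arrays with mxnet.np arrays inside autograd.record(). A minimal sketch of how the rest of such a loop usually looks, converting everything to mxnet arrays first (illustrative, not the asker's actual code):

    from mxnet import autograd, np, npx
    npx.set_np()

    def fit(X_host, y_host, alpha=0.01, nr_epochs=1000):
        X = np.array(X_host)             # convert host data to mxnet.np arrays
        y = np.array(y_host)
        theta = np.zeros(X.shape[1])
        theta.attach_grad()              # allocate gradient storage for theta
        for _ in range(nr_epochs):
            with autograd.record():      # record ops for differentiation
                loss = ((np.dot(X, theta) - y) ** 2).mean()
            loss.backward()              # populates theta.grad
            theta -= alpha * theta.grad  # plain update, outside record()
        return theta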

Gradient calculation for softmax version of triplet loss

我的未来我决定 submitted on 2019-12-01 08:51:49
Question: I have been trying to implement the softmax version of the triplet loss in Caffe described in Hoffer and Ailon, Deep Metric Learning Using Triplet Network, ICLR 2015. I have tried this, but I am finding it hard to calculate the gradient, as the L2 in the exponent is not squared. Can someone please help me here? Answer 1: Implementing the L2 norm using existing layers of Caffe can save you all the hassle. Here's one way to compute ||x1-x2||_2 in Caffe for "bottom"s x1 and x2 (assuming x1 and x2 are B-by…
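For the step the question is stuck on: the derivative of the non-squared norm is the normalized difference vector, d||x1 - x2||_2 / dx1 = (x1 - x2) / ||x1 - x2||_2 (undefined at x1 = x2, so implementations usually add a small epsilon to the denominator). A quick numerical sanity check of that formula (illustrative):

    import numpy as np

    x1, x2 = np.random.randn(5), np.random.randn(5)
    analytic = (x1 - x2) / np.linalg.norm(x1 - x2)

    # central finite differences on each coordinate of x1
    eps = 1e-6
    numeric = np.array([
        (np.linalg.norm(x1 + eps * np.eye(5)[i] - x2)
         - np.linalg.norm(x1 - eps * np.eye(5)[i] - x2)) / (2 * eps)
        for i in range(5)
    ])
    assert np.allclose(analytic, numeric, atol=1e-5)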