As a toy example, I'm trying to fit the function f(x) = 1/x from 100 noise-free data points. MATLAB's default implementation is phenomenally successful in terms of mean squared error.

I tried training for 50,000 iterations and the error got down to 0.00012. It takes about 180 seconds on a Tesla K40.
It seems that for this kind of problem, first-order gradient descent is not a good fit (pun intended); you need Levenberg–Marquardt or L-BFGS. I don't think anyone has implemented them in TensorFlow yet.
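To make the setup concrete, here is a minimal sketch of this kind of experiment with plain first-order gradient descent, assuming the TF 1.x API. The input range [1, 10], the single 10-unit tanh hidden layer, and the 0.01 learning rate are my own guesses for illustration, not the architecture or hyperparameters from the experiment above.

```python
import numpy as np
import tensorflow as tf

# 100 noise-free samples of f(x) = 1/x (the range [1, 10] is assumed,
# the original post does not say which interval was used)
x_data = np.linspace(1.0, 10.0, 100).reshape(-1, 1).astype(np.float32)
y_data = 1.0 / x_data

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

# A small fully connected net; width and activation are guesses
h = tf.layers.dense(x, 10, activation=tf.nn.tanh)
y_hat = tf.layers.dense(h, 1)

loss = tf.reduce_mean(tf.square(y_hat - y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(50000):
        _, mse = sess.run([train_op, loss],
                          feed_dict={x: x_data, y: y_data})
        if step % 10000 == 0:
            print(step, mse)
    print("final MSE:", mse)
```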
Edit
Use tf.train.AdamOptimizer(0.1) for this problem. It gets to 3.13729e-05 after 4000 iterations. Also, the GPU with the default placement strategy seems like a bad idea for this problem: there are many small operations, and the overhead makes the GPU version run 3x slower than the CPU on my machine.
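Here is the same sketch with the two changes suggested above: tf.train.AdamOptimizer(0.1) instead of plain gradient descent, and the whole graph pinned to the CPU with tf.device('/cpu:0'). Again, the data range and network size are assumptions; only the optimizer, learning rate, and iteration count come from the edit above.

```python
import numpy as np
import tensorflow as tf

x_data = np.linspace(1.0, 10.0, 100).reshape(-1, 1).astype(np.float32)
y_data = 1.0 / x_data

# Pin everything to the CPU: the graph is a handful of tiny ops, so
# kernel-launch overhead makes the GPU slower here.
with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, [None, 1])
    y = tf.placeholder(tf.float32, [None, 1])
    h = tf.layers.dense(x, 10, activation=tf.nn.tanh)   # assumed architecture
    y_hat = tf.layers.dense(h, 1)
    loss = tf.reduce_mean(tf.square(y_hat - y))
    train_op = tf.train.AdamOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(4000):
        _, mse = sess.run([train_op, loss],
                          feed_dict={x: x_data, y: y_data})
    print("MSE after 4000 Adam steps:", mse)
```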