TensorFlow gradient with respect to a matrix

Question

Just for context, I'm trying to implement a gradient descent algorithm with TensorFlow.

I have a matrix X

[ x1 x2 x3 x4 ]
[ x5 x6 x7 x8 ]

which I multiply by some feature vector Y to get Z

      [ y1 ]
Z = X [ y2 ]  = [ z1 ]
      [ y3 ]    [ z2 ]
      [ y4 ]

I then put Z through a softmax function, and take the log. I'll refer to the output matrix as W.

All this is implemented as follows (little bit of boilerplate added so it's runnable)

import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.placeholder)

sess = tf.Session()
num_features = 4
num_actions = 2

policy_matrix = tf.get_variable("params", (num_actions, num_features))
state_ph = tf.placeholder("float", (num_features, 1))
# Use the Python variable (policy_matrix), not its graph name "params".
action_linear = tf.matmul(policy_matrix, state_ph)
action_probs = tf.nn.softmax(action_linear, axis=0)
action_problogs = tf.log(action_probs)

W (corresponding to action_problogs) looks like

[ w1 ]
[ w2 ]

I'd like to find the gradient of w1 with respect to the matrix X; that is, I'd like to calculate

          [ d/dx1 w1 ]
d/dX w1 =      .
               .
          [ d/dx8 w1 ]

(preferably still looking like a matrix so I can add it to X, but I'm really not concerned about that)

I was hoping that tf.gradients would do the trick. I calculated the "gradient" like so

problog_gradient = tf.gradients(action_problogs, policy_matrix)

However, when I inspect problog_gradient, here's what I get

[<tf.Tensor 'foo_4/gradients/foo_4/MatMul_grad/MatMul:0' shape=(2, 4) dtype=float32>]

Note that this has exactly the same shape as X, but that it really shouldn't. I was hoping to get a list of two gradients, each with respect to 8 elements. I suspect that I'm instead getting two gradients, but each with respect to four elements.

I'm very new to TensorFlow, so I'd appreciate an explanation of what's going on and how I might achieve the behavior I desire.

Answer 1:

tf.gradients expects a scalar function, so when it is given a non-scalar tensor it sums up the entries by default; what you get back is the gradient of that sum, which is why the result has the same shape as X. That is the default behavior simply because all of the gradient descent algorithms need that type of functionality, and stochastic gradient descent (or variations thereof) is the preferred method inside TensorFlow. You won't find any of the more advanced algorithms (like BFGS or something) because they simply haven't been implemented yet (and they would require a true Jacobian, which also hasn't been implemented). For what it's worth, here is a functioning Jacobian implementation that I wrote:

def map(f, x, dtype=None, parallel_iterations=10):
    '''
    Apply f to each of the elements in x using the specified number of parallel iterations.

    Important points:
    1. By "elements in x", we mean that we will be applying f to x[0],...x[tf.shape(x)[0]-1].
    2. The output size of f(x[i]) can be arbitrary. However, if the dtype of that output
       is different than the dtype of x, then you need to specify that as an additional argument.
    '''
    if dtype is None:
        dtype = x.dtype

    n = tf.shape(x)[0]
    loop_vars = [
        tf.constant(0, n.dtype),
        tf.TensorArray(dtype, size=n),
    ]
    _, fx = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j + 1, result.write(j, f(x[j]))),
        loop_vars,
        parallel_iterations=parallel_iterations
    )
    return fx.stack()

def jacobian(fx, x, parallel_iterations=10):
    '''
    Given a tensor fx, which is a function of x, vectorize fx (via tf.reshape(fx, [-1])),
    and then compute the jacobian of each entry of fx with respect to x.
    Specifically, if x has shape (m,n,...,p), and fx has L entries (tf.size(fx)=L), then
    the output will have shape (L,m,n,...,p), where output[i] has shape (m,n,...,p), with each entry
    denoting the gradient of the i-th entry of the flattened fx wrt the corresponding element of x.
    '''
    return map(lambda fxi: tf.gradients(fxi, x)[0],
               tf.reshape(fx, [-1]),
               dtype=x.dtype,
               parallel_iterations=parallel_iterations)

While this implementation works, it does not work when you try to nest it. For instance, if you try to compute the Hessian by using jacobian(jacobian(...)), then you get some strange errors. This is being tracked as Issue 675. I am still awaiting a response on why this throws an error. I believe that there is a deep-seated bug in either the while loop implementation or the gradient implementation, but I really have no idea.

Anyway, if you just need a Jacobian, try the code above.
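To see what the summing does in the question's setting, here is a minimal pure-Python sketch (no TensorFlow; the helper names forward, jacobian_rows and summed_gradient are made up for illustration and are not part of any library). It assumes W = log(softmax(X y)), for which d w_k / d X[i][j] = (delta_ki - p_i) * y_j with p = softmax(X y). Each J[k] has the same shape as X, and adding them up gives the single (2, 4) array that a summed gradient would produce.

```python
import math


def log_softmax(z):
    # Numerically stable log-softmax of a list of floats.
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]


def forward(X, y):
    # W = log(softmax(X y)); X is a list of rows, y is a flat list.
    z = [sum(xij * yj for xij, yj in zip(row, y)) for row in X]
    return log_softmax(z)


def jacobian_rows(X, y):
    # Analytic Jacobian (assumed formula, checked by finite differences below):
    # J[k][i][j] = d w_k / d X[i][j] = (delta_ki - p_i) * y_j
    p = [math.exp(v) for v in forward(X, y)]
    n = len(X)
    return [
        [[((1.0 if i == k else 0.0) - p[i]) * yj for yj in y] for i in range(n)]
        for k in range(n)
    ]


def summed_gradient(X, y):
    # Gradient of (w_1 + ... + w_n) with respect to X: the per-output
    # gradients added together, so the result has the shape of X.
    J = jacobian_rows(X, y)
    n, m = len(X), len(y)
    return [[sum(J[k][i][j] for k in range(n)) for j in range(m)] for i in range(n)]


# Arbitrary example values with the shapes from the question.
X = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]
y = [1.0, 2.0, -1.0, 0.5]

J = jacobian_rows(X, y)    # one X-shaped gradient per output entry
G = summed_gradient(X, y)  # a single X-shaped array

print(len(J), len(J[0]), len(J[0][0]))  # 2 2 4
print(len(G), len(G[0]))                # 2 4
```

In this sketch J[0] is the gradient of w1 with respect to X, already in matrix form, which is the quantity the question asks for; G is only the sum of J[0] and J[1].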