Recently, I read learning-to-learn, both its code and the accompanying paper. It aims at designing a general optimizer, like `tf.train.AdamOptimizer`, not by hand but by machine.
The idea is simple. First, we construct a trainable (rather than fixed, as usual) optimizer, naturally a recurrent neural network (RNN), since the optimization process is sequential. The parameters of the RNN are exposed and trainable. Next, we construct a performance measure (that is, a loss), which, also naturally, is the total loss along the optimization sequence: a better optimizer should yield a smaller total loss. That is all. We then train the parameters of the RNN optimizer by gradient descent, as usual. The authors did so on datasets such as CIFAR-10 with some specific model. As a result, the trained optimizer was found to generalize robustly to other datasets and models.
Let `f(theta, D)` be the loss of some specific model with parameters `theta` on some dataset `D`. Let `m(phi)` be the trainable optimizer, an RNN with trainable parameters `phi`. It accepts a gradient of `f` and returns the increment of `theta`, telling how `theta` should be updated in the optimization process. The total loss `L`, which serves as the loss for training `m`, is
n_iters = ...  # The number of iterations of the optimization of `f`.

def L(phi, n_iters=n_iters):
    # Initialize the parameters of the model being optimized.
    theta = ...
    total_loss = f(theta, D)
    for i in range(n_iters):
        gradient = compute_gradient(f(theta, D), theta)
        theta += m(phi)(gradient)
        total_loss += f(theta, D)
    return total_loss
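In symbols, the pseudocode above amounts to the following, where `T` stands for `n_iters`, `theta_0` is the initial value, and the subscript `t` is shorthand introduced here rather than notation taken from the paper:

```latex
\theta_{t+1} = \theta_t + m_\phi\bigl(\nabla_\theta f(\theta_t, D)\bigr),
\qquad
L(\phi) = \sum_{t=0}^{T} f(\theta_t, D).
```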
We then train `phi` by gradient descent, like
best_phi = argmin(L, method='gradient_descent')
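To make the procedure concrete, here is a minimal toy sketch in plain Python. Everything specific in it is an assumption for illustration only: the quadratic `f`, the one-parameter optimizer `m(phi)` that merely rescales the gradient (a stand-in for the RNN), the finite-difference estimate of the gradient of `L`, and the helper `train_phi`.

```python
# Toy illustration only: a quadratic loss and a one-parameter "optimizer".

def f(theta):
    """Loss of the toy model; it is minimized at theta = 3."""
    return (theta - 3.0) ** 2

def grad_f(theta):
    """Analytic gradient of f with respect to theta."""
    return 2.0 * (theta - 3.0)

def m(phi):
    """Hypothetical optimizer: maps a gradient to an increment of theta."""
    return lambda gradient: -phi * gradient

def L(phi, n_iters=5, theta_init=0.0):
    """Total loss accumulated along the optimization sequence."""
    theta = theta_init
    total_loss = f(theta)
    for _ in range(n_iters):
        theta += m(phi)(grad_f(theta))
        total_loss += f(theta)
    return total_loss

def train_phi(phi=0.1, meta_lr=0.001, n_steps=500, eps=1e-5):
    """Gradient descent on phi, with a central finite difference for dL/dphi."""
    for _ in range(n_steps):
        dL_dphi = (L(phi + eps) - L(phi - eps)) / (2.0 * eps)
        phi -= meta_lr * dL_dphi
    return phi

best_phi = train_phi()
print(best_phi, L(0.1), L(best_phi))
```

In this toy, the best optimizer is the one that jumps to the minimum of `f` in a single step, so training `phi` should drive it towards `0.5`. A real implementation would backpropagate through the unrolled loop instead of using finite differences.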
This looks simple, even dull, but it is challenging to implement in TensorFlow. Indeed, the general workflow in TensorFlow looks like this:
graph = ...
with graph.as_default():
    model = ...
    theta = tf.Variable(...)
    data = tf.placeholder(...)
    targets = tf.placeholder(...)
    predictions = model(data, theta)
    loss = tf.reduce_sum(tf.square(predictions - targets))
    ...  # Optimizer, train-Op, etc.

with tf.Session(graph=graph) as sess:
    ...  # initialization, run the train-Op iteratively, etc.
As you can see, `loss` (the `f` in our notation) is not a function, as the name “loss function” promises, but a quantity (explicitly, a `Tensor`). However, in the training of the RNN optimizer described above, the `theta` that feeds into the loss `f` is not a trainable `Variable`; it is a non-trainable quantity that is repeatedly updated by `m(phi)`. So if you want to use this loss quantity to train the RNN optimizer, you have to dig into the code and replace the `theta` defined by `tf.Variable(...)` with a `tf.placeholder(...)`. This is what learning-to-learn does, by employing a bit of magic: the `mock` module. However, this goes against an elegant programming habit: package the code and then treat it as a black box.
A preferable way to implement this is to use a function, that is, an “operation constructor” like `tf.multiply`. For instance,
def make_loss(theta, data, targets, model, name=None):
    """Implements the loss `f`, as an operator constructor."""
    # Some pre-defined function for checking tensors, shapes, dtypes, etc.
    # It is used in place of `tf.convert_to_tensor`, which would convert
    # a trainable tensor to a non-trainable one.
    check_arguments(theta, data, targets)
    with tf.name_scope(name, 'loss', [theta, data]):
        # Just copy-and-paste the previous code.
        predictions = model(data, theta)
        loss = tf.reduce_sum(tf.square(predictions - targets))
    return loss

graph = ...
with graph.as_default():
    model = ...
    theta_var = tf.Variable(...)
    data = tf.placeholder(...)
    targets = tf.placeholder(...)
    loss = make_loss(theta_var, data, targets, model)
    ...  # Optimizer, train-Op, etc.

with tf.Session(graph=graph) as sess:
    ...  # initialization, run the train-Op iteratively, etc.
The loss now becomes a reusable function. For instance, if we want `theta` to be a `tf.placeholder`, then we just write
with graph.as_default():
    theta_ph = tf.placeholder(...)
    new_loss = make_loss(theta_ph, data, targets, model)
This creates a new sub-graph in `graph`, wherein the common parts, i.e. `data`, `targets`, and `model`, are reused.
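The sharing can be pictured with a toy graph in plain Python. This is only a sketch of the idea, not TensorFlow: the `Node` class and the `collect` helper are hypothetical, and the model is folded into this toy `make_loss` as a single multiplication.

```python
class Node:
    """A minimal graph node: a kind label plus its input nodes."""
    def __init__(self, kind, inputs=()):
        self.kind = kind
        self.inputs = tuple(inputs)

def make_loss(theta, data, targets):
    """Operator constructor: builds new nodes on top of the given inputs."""
    predictions = Node("multiply", [data, theta])
    residual = Node("subtract", [predictions, targets])
    return Node("square", [residual])

def collect(node, seen=None):
    """Return the set of all nodes reachable from `node`."""
    seen = set() if seen is None else seen
    if node not in seen:
        seen.add(node)
        for parent in node.inputs:
            collect(parent, seen)
    return seen

data, targets = Node("placeholder"), Node("placeholder")
theta_var, theta_ph = Node("variable"), Node("placeholder")

# Two calls build two separate sub-graphs over the same input nodes.
loss = make_loss(theta_var, data, targets)
new_loss = make_loss(theta_ph, data, targets)

shared = collect(loss) & collect(new_loss)
print(sorted(node.kind for node in shared))
```

Each call creates its own `multiply`, `subtract`, and `square` nodes, while the `data` and `targets` nodes belong to both sub-graphs.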
Finally, a piece of code illustrates what I mean:
import numpy as np
import tensorflow as tf

def my_multiply(x, y, name=None):
    """Returns an `Op` for `x * y + 2`."""
    with tf.name_scope(name, 'my_multiply', [x, y]):
        return tf.multiply(x, y) + 2.

graph = tf.Graph()
with graph.as_default():
    xs = [
        tf.placeholder('float32', shape=[], name='x_1'),
        tf.Variable(initial_value=1.5, name='x_2'),
    ]
    # Note the trailing comma: `y` is a 1-tuple, so each result has shape [1].
    y = tf.placeholder('float32', shape=[], name='y'),
    my_multiply_ops = [my_multiply(x, y) for x in xs]
    init = tf.global_variables_initializer()

with tf.Session(graph=graph) as sess:
    sess.run(init)
    writer = tf.summary.FileWriter('logs', sess.graph)
    vals = sess.run(my_multiply_ops, feed_dict={xs[0]: 1., y: 2.})
    print(vals)
    writer.close()
which prints `[array([4.], dtype=float32), array([5.], dtype=float32)]`, as expected. You can simply copy and paste it, and then run it. By running `tensorboard --logdir=logs` in a terminal, you can visualize the graph, which shows exactly what I described.
PS: As for style, the official documentation is highly recommended. Its only shortcoming, for our purpose, is the usage of `tf.convert_to_tensor`, which converts a trainable tensor to a non-trainable one and is therefore not preferable.