Specifically, I tried a new recurrent neural network (RNN) architecture called IRNN, described in the recent paper from Hinton's group, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units." RNNs are notoriously difficult to train on such long dependencies, but IRNN overcomes this by initializing the recurrent weights with the identity matrix and using ReLU as the activation function. Awesome!
In this post, I will write about my experiment with IRNN: recognizing MNIST digits by feeding all 784 pixels to the recurrent net one by one, in sequential order (the experiment in Section 4.2 of the paper).
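To make the setup concrete, here is a tiny sketch of how one MNIST image becomes a 784-step sequence (the array names are mine, just for illustration):

```python
import numpy as np

image = np.random.rand(28, 28).astype(np.float32)  # stand-in for one MNIST digit
sequence = image.reshape(784, 1)  # 784 time steps, one pixel per step
# The RNN reads sequence[0], sequence[1], ..., sequence[783] in order
# and predicts the digit class only after seeing the last pixel.
```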
The techniques and best parameter values in the paper are:
- Initialize the recurrent weight matrix with the identity matrix
- Initialize the other weight matrices with samples from a Gaussian distribution with mean 0 and standard deviation (std) 0.001
- Use ReLU as the activation function
- Train the network using SGD
- Learning rate: 10^-8, gradient clipping value: 1, and mini-batch size: 16
The other settings are the same as in the paper. A quick sketch of what this recipe amounts to is shown below.
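Here is a minimal NumPy sketch of the initialization and a single recurrent step (the hidden size and all variable names are my own choices for illustration, not the paper's code):

```python
import numpy as np

n_input, n_hidden = 1, 100  # one pixel per step; hidden size is my assumption

# Recurrent weights start as the identity matrix
W_hh = np.eye(n_hidden, dtype=np.float32)
# Other weights are drawn from a Gaussian with mean 0 and std 0.001
W_xh = np.random.normal(0.0, 0.001, (n_hidden, n_input)).astype(np.float32)
b_h = np.zeros(n_hidden, dtype=np.float32)

def irnn_step(h_prev, x):
    # A plain RNN cell, but with ReLU instead of tanh
    return np.maximum(W_hh @ h_prev + W_xh @ x + b_h, 0.0)

def clip_gradient(grad, threshold=1.0):
    # Rescale the gradient if its norm exceeds the clipping value of 1
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

# SGD update with the paper's tiny learning rate
# (gW_hh would come from backpropagation through time):
# W_hh -= 1e-8 * clip_gradient(gW_hh)
```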
The problem is that it takes 50 minutes to run each epoch (one forward and backward pass over the whole dataset) on my local environment (CPU). Perhaps it would be better to buy a GPU or use an AWS GPU instance. Anyway, I have been running it on the CPU for two days so far! The results are shown in the following figure. Though the learning is very slow, the net definitely learns! Cool! I will continue my experiment.
In the paper, they train for up to 1,000,000 steps. At first, I thought one step was just one parameter update from 16 examples (one mini-batch), but after trying it myself, I started to think a step is not a single update but rather the updates over a whole dataset. I am not sure what the word ("step" or "iteration" in the paper) literally means, but if it is just one update, my plot above cannot be explained when compared with the plot in the paper.
Technical terms in deep learning and machine learning are sometimes confusing to people like me who studied science in a language other than English. I am sometimes not sure whether "epoch", "iteration", or "step" means a single mini-batch update or all the updates over a whole dataset. Does it depend on the situation, or is there a clear distinction? Does anybody know?
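For what it's worth, under the convention I usually see (one step = one mini-batch update), the numbers work out like this (my own back-of-the-envelope arithmetic, not from the paper):

```python
n_train, batch_size = 60000, 16            # MNIST training set size, paper's batch size
updates_per_epoch = n_train // batch_size  # 3750 mini-batch updates per epoch
epochs_for_1m = 1000000 / updates_per_epoch
print(updates_per_epoch, epochs_for_1m)    # 3750 updates/epoch, ~267 epochs
```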
Anyway, I implemented IRNN for pixel-by-pixel MNIST successfully and relatively easily. I think Chainer made a huge difference: implementing a recurrent net is easier with Chainer than with other popular deep learning libraries such as Theano or Torch.
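To show what I mean, here is a minimal sketch of an IRNN in Chainer's define-by-run style. This is not my actual code; the class and attribute names are made up for illustration, and API details may differ between Chainer versions:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class IRNN(chainer.Chain):
    def __init__(self, n_hidden=100, n_out=10):
        super(IRNN, self).__init__()
        with self.init_scope():
            # Input-to-hidden and readout weights: Gaussian with std 0.001
            self.x2h = L.Linear(1, n_hidden,
                                initialW=chainer.initializers.Normal(0.001))
            # Hidden-to-hidden weights: identity matrix
            self.h2h = L.Linear(n_hidden, n_hidden,
                                initialW=chainer.initializers.Identity())
            self.h2y = L.Linear(n_hidden, n_out,
                                initialW=chainer.initializers.Normal(0.001))

    def __call__(self, xs):
        # xs: a list of (batch, 1) pixel arrays, one per time step
        h = None
        for x in xs:
            pre = self.x2h(x) if h is None else self.x2h(x) + self.h2h(h)
            h = F.relu(pre)
        return self.h2y(h)  # classify after reading the last pixel
```

The point is that the recurrence is just a plain Python for loop, which is what makes writing recurrent nets in Chainer feel so natural.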
I will finish this post with my implementation code. In the next post, I may explain how to set up Chainer (though it's super easy) and walk through the code.