Firstly, my blog caught the attention of Chainer's developers and was introduced on their official Twitter account. Thanks, PFI.
Also, there is a Reddit thread on the paper I implemented. Richard Socher had already proposed the trick of initializing with the identity matrix before the paper. The first version of Hinton's paper did not cite Socher's paper, but the latest version does. You can check this on arxiv.org by comparing versions 1 and 2. It is surprising that even a godlike researcher such as Hinton failed to notice the work from Stanford. I think this means that deep learning is progressing so fast that even the top researchers sometimes miss well-known work (like the paper from Stanford).
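For readers who have not seen the trick, here is a minimal sketch of what identity initialization means for a ReLU recurrent layer. It is written in plain Python rather than Chainer, and the function names (`identity_matrix`, `irnn_step`) are my own hypothetical helpers, not taken from the paper or from any library. The only point it illustrates is that, with identity recurrent weights, zero bias, and zero input, a non-negative hidden state is carried over unchanged from one step to the next.

```python
def identity_matrix(n):
    """Return an n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def irnn_step(h, x, w_hh, w_xh, b):
    """One recurrent step: h_new = ReLU(w_hh * h + w_xh * x + b)."""
    rec = matvec(w_hh, h)
    inp = matvec(w_xh, x)
    return relu([r + i + bias for r, i, bias in zip(rec, inp, b)])


# Hypothetical sizes, chosen only for illustration.
n_hidden, n_in = 3, 2
w_hh = identity_matrix(n_hidden)                   # identity recurrent weights
w_xh = [[0.0] * n_in for _ in range(n_hidden)]     # input weights zeroed for this demo
b = [0.0] * n_hidden                               # zero bias

h = [0.5, 0.0, 2.0]
for _ in range(10):
    h = irnn_step(h, [0.0] * n_in, w_hh, w_xh, b)  # no input signal
print(h)
```

In a real network the input weights would not be zero; they are zeroed here only so that the effect of the identity recurrent matrix can be seen in isolation.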
Last but not least, my friend (@masvsn) refined my code and achieved 94% test accuracy. It seems that Adam makes a difference. Thank you, @masvsn. I can learn more of his implementation techniques from the refined code.
He is a Ph.D. student in computer vision. In fact, I know him in person (but he does not identify himself online, so I will keep his name anonymous), and I have been sneaking into his weekly seminar on machine learning, which emphasizes deep learning. I knew a little about deep learning and general feed-forward neural networks, but it was he who "deeply" taught me the latest progress in deep learning, along with useful and practical techniques. In particular, his implementation exercises helped me a lot in improving my implementation skills. Perhaps I could not have implemented IRNN if I had not attended his seminar. I really appreciate him.

@masvsn's tweet (June 19, 2015): "I refined @satoshi2373's code with batchsize=128 and using Adam then achieved over 95% test accuracy at epoch 80" pic.twitter.com/Hwkt79sbmg
One thing I am wondering about is why IRNN shows no sign of overfitting so far. This is the latest plot from my own implementation. I have been running it for a week. Yeah, it is still learning, very slowly.
Neither his plot nor mine shows any symptom of overfitting. I cannot judge this from the original paper because it only plots test accuracy, not training accuracy. However, if the same holds for other problems, IRNN has a great ability to generalize. If anybody knows, please let me know in the comments.
I will finish this post with his code. Thank you again, @masvsn.