The Azimuth Project
Deep learning


Wikipedia describes the field as:

Deep learning (also called deep structural learning or hierarchical learning) is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using model architectures composed of multiple non-linear transformations.

Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition where they have been shown to produce state-of-the-art results on various tasks.

Alternatively, “deep learning” has been characterized as “just a buzzword for”,[5] or “largely a rebranding of”, neural networks.[6]


A comprehensive historical survey of methods can be found in:

In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications.

Simple, low-complexity, problem-solving NNs

The bias/variance dilemma is often addressed through strong prior assumptions. Weight decay encourages near-zero weights by penalizing larger weights.

In a Bayesian framework, weight decay can be derived from Gaussian or Laplacian weight priors (Hinton and van Camp, 1993).
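As a minimal sketch of the weight-decay idea above (not taken from the survey): a zero-mean Gaussian prior with variance σ² contributes w²/(2σ²) to the negative log-posterior, which is an L2 penalty with strength lam = 1/σ². The function name `sgd_step_with_weight_decay` and the parameters `lr` and `lam` are illustrative assumptions.

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, lam=0.01):
    """One gradient step on loss + (lam / 2) * sum(w ** 2).

    The penalty's gradient is lam * w, so every weight is pulled toward
    zero in proportion to its size. A Laplacian prior would instead give
    an L1 penalty whose gradient is lam * sign(w).
    """
    return [w - lr * (g + lam * w) for w, g in zip(weights, grads)]


# With a zero data gradient, the step does nothing except shrink the weights.
w = [2.0, -1.0, 0.0]
w_new = sgd_step_with_weight_decay(w, grads=[0.0, 0.0, 0.0], lr=0.1, lam=0.5)
print(w_new)
```

Larger `lam` corresponds to a narrower prior, and therefore to stronger shrinkage.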

Many UL methods automatically and robustly generate distributed, sparse representations of input patterns through well-known feature detectors such as off-center-on-surround-like structures, as well as orientation-sensitive edge detectors and Gabor filters. They extract simple features related to those observed in early visual pre-processing stages of biological systems (Jones and Palmer, 1987).
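To make the Gabor filters mentioned above concrete, here is a small sketch that builds one by hand: a cosine carrier multiplied by a Gaussian envelope, rotated by an angle `theta`. The function name and the parameters (`wavelength`, `sigma`, `gamma`, `psi`) follow a common textbook parameterisation and are assumptions, not something specified in the survey.

```python
import math


def gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0, gamma=0.5, psi=0.0):
    """Return a size x size Gabor kernel as a list of rows (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# Print the kernel for a vertical-edge-like detector (theta = 0).
for row in gabor_kernel():
    print(" ".join(f"{v:+.2f}" for v in row))
```

Varying `theta` gives a bank of orientation-selective detectors of the kind the paragraph describes.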

1991: The fundamental deep learning problem of gradient descent

This is the problem of vanishing or exploding gradients (a.k.a. the long time lag problem).
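A toy illustration of why this happens, under a deliberately simplified assumption: a scalar linear recurrence h_t = w * h_{t-1}. The gradient of h_T with respect to h_0 is then w^T, which shrinks or grows exponentially with the number of steps unless w is exactly 1. The function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(weight, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = weight * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= weight  # chain rule: one factor of `weight` per time step
    return grad


for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, steps=100))
```

In a real network each step also multiplies by the derivative of the activation function (at most 1 for tanh), which makes shrinking even more likely.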

Hessian-free optimization (Sec. 5.6.2) can alleviate the problem for FNNs (Møller, 1993; Pearlmutter, 1994; Schraudolph, 2002; Martens, 2010) and RNNs (Martens and Sutskever, 2011) (Sec. 5.20).

Hessian-free optimisation

Schmidhuber (2014) describes how this works.

The space of NN weight matrices can also be searched without relying on error gradients, thus avoiding the Fundamental Deep Learning Problem altogether. Random weight guessing sometimes works better than more sophisticated methods (Hochreiter and Schmidhuber, 1996). Certain more complex problems are better solved by using Universal Search (Levin, 1973b) for weight matrix-computing programs written in a universal programming language (Schmidhuber, 1997). Some are better solved by using linear methods to obtain optimal weights for connections to output events (Sec. 2), and evolving weights of connections to other events—this is called Evolino (Schmidhuber et al., 2007).
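The simplest gradient-free search mentioned above, random weight guessing, can be sketched on a toy problem. The example assumes a single threshold unit learning the logical AND function; the names `predict`, `random_weight_guessing` and `AND_DATA`, and the sampling range [-1, 1], are illustrative choices rather than details from the cited papers.

```python
import random


def predict(weights, x):
    """Single threshold unit: output 1 if w1*x1 + w2*x2 + b > 0, else 0."""
    w1, w2, b = weights
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def random_weight_guessing(data, max_tries=10000, seed=0):
    """Draw random weights until every example is classified correctly."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        if all(predict(weights, x) == y for x, y in data):
            return weights, attempt
    return None, max_tries  # no solution found within the budget


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, tries = random_weight_guessing(AND_DATA)
print(weights, tries)
```

No error gradient is ever computed, so there is nothing to vanish or explode. The cost is that the number of guesses needed grows quickly with problem difficulty.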

Also in 2011 it was shown (Martens and Sutskever, 2011) that Hessian-free optimization…can alleviate the Fundamental Deep Learning Problem (Sec. 5.9) in recurrent neural networks (RNNs), outperforming standard gradient-based long short-term memory (LSTM) RNNs (Sec. 5.13) on several tasks.
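The core idea behind Hessian-free methods is to use curvature without ever forming the Hessian: only Hessian-vector products are needed, and these feed a conjugate-gradient (CG) solve for the update direction. The sketch below is a simplification under stated assumptions. It approximates the Hessian-vector product by a finite difference of gradients (the cited work uses exact products and Gauss-Newton approximations instead), and it uses a hypothetical two-parameter quadratic objective `grad_fn`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hessian_vector_product(grad_fn, w, v, eps=1e-6):
    """Approximate H v as (grad(w + eps * v) - grad(w)) / eps."""
    g0 = grad_fn(w)
    g1 = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]


def conjugate_gradient(matvec, rhs, iters=50, tol=1e-10):
    """Solve matvec(x) = rhs for a symmetric positive-definite operator."""
    x = [0.0] * len(rhs)
    r = list(rhs)
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def grad_fn(w):
    # Gradient of the toy objective f(w) = 0.5 * w^T A w - b^T w
    # with A = [[3, 1], [1, 2]] and b = [1, 1] (illustrative values).
    return [3 * w[0] + w[1] - 1, w[0] + 2 * w[1] - 1]


w = [0.0, 0.0]
g = grad_fn(w)
# Newton-like step: solve H d = -g using only Hessian-vector products.
step = conjugate_gradient(lambda v: hessian_vector_product(grad_fn, w, v),
                          [-gi for gi in g])
w_new = [wi + si for wi, si in zip(w, step)]
print(w_new)
```

On this quadratic a single step lands near the minimiser, where A w = b. On a real network the CG solve is truncated and damped, and the step is repeated.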

There are also RNN algorithms (Jaeger, 2004; Schmidhuber et al., 2007; Pascanu et al., 2013b; Koutník et al., 2014) that at least sometimes yield better results than steepest descent for LSTM RNNs.

More recently, LSTM RNNs won several international pattern recognition competitions and set benchmark records on large and complex data sets, e.g., Sec. 5.17, 5.21, 5.22. Gradient-based LSTM is no panacea though—other methods sometimes outperformed it at least on certain tasks.

Convolutional neural networks


A key challenge in designing convolutional network models is sizing them appropriately. Many factors are involved in these decisions, including number of layers, feature maps, kernel sizes, etc. Complicating this further is the fact that each of these influences not only the numbers and dimensions of the activation units, but also the total number of parameters. In this paper we focus on assessing the independent contributions of three of these linked variables: the numbers of layers, feature maps, and parameters. To accomplish this, we employ a recursive convolutional network whose weights are tied between layers; this allows us to vary each of the three factors in a controlled setting. We find that while increasing the numbers of layers and parameters each have clear benefit, the number of feature maps (and hence dimensionality of the representation) appears ancillary, and finds most of its benefit through the introduction of more weights. Our results (i) empirically confirm the notion that adding layers alone increases computational power, within the context of convolutional layers, and (ii) suggest that precise sizing of convolutional feature map dimensions is itself of little concern; more attention should be paid to the number of parameters in these layers instead.


  • number of layers
  • feature maps
  • kernel sizes

We can design a recursive model to have the same number of layers and parameters as the standard convolutional model, and thereby see if the number of feature maps (which differs) is important or not. Or we can match the number of feature maps and parameters to see if the number of layers (and number of non-linearities) matters.
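The bookkeeping behind this comparison can be sketched by counting parameters with and without weight tying. The helpers below are hypothetical and rest on assumptions not stated in the excerpt: square kernels, one bias per output map, and a first layer that maps 3 input channels to the feature maps and is never tied.

```python
def conv_layer_params(in_maps, out_maps, kernel_size, bias=True):
    """Weights (plus optional biases) of one convolutional layer."""
    n = in_maps * out_maps * kernel_size * kernel_size
    return n + (out_maps if bias else 0)


def untied_params(num_layers, maps, kernel_size, in_channels=3):
    """Standard model: every layer has its own weights."""
    first = conv_layer_params(in_channels, maps, kernel_size)
    rest = (num_layers - 1) * conv_layer_params(maps, maps, kernel_size)
    return first + rest


def tied_params(num_layers, maps, kernel_size, in_channels=3):
    """Recursive model: all layers beyond the first share one set of weights."""
    first = conv_layer_params(in_channels, maps, kernel_size)
    shared = conv_layer_params(maps, maps, kernel_size) if num_layers > 1 else 0
    return first + shared


print(untied_params(4, 32, 3))  # four layers, 32 maps, 3x3 kernels
print(tied_params(4, 32, 3))    # same depth and width, far fewer parameters
```

Because the tied count does not depend on depth, the tied model can be widened until its parameter count matches an untied one. That separates the effect of feature maps from the effect of parameters.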

Recent gains have been found by using stacks of multiple unpooled convolution layers. For example, the convolutional model proposed by Krizhevsky et al. [13] for ImageNet classification has five convolutional layers which turn out to be key to its performance. In [30], Zeiler and Fergus reimplemented this model and adjusted different parts in turn. One of the largest effects came from changing two convolutional layers in the middle of the model: removing them resulted in a 4.9% drop in performance, while expanding them improved performance by 3.0%. By comparison, removing the top two densely connected layers yielded a 4.3% drop, and expanding them a 1.7% gain, even though they have far more parameters. Hence the use of multiple convolutional layers is vital and the development of superior models relies on understanding their properties.

The model we employ has relations to recurrent neural networks. These are well-studied models [11, 21, 27], naturally suited to temporal and sequential data. For example, they have recently been shown to deliver excellent performance for phoneme recognition [8].

Our network can be considered a purely discriminative, convolutional version of LISTA or DrSAE.

In the multilayer convolutional network used here all layers beyond the first have the same size and connection topology.
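A one-dimensional toy version of such a network, with the same kernel reused by every layer beyond the first, might look like the sketch below. It is an illustration only: the names, the zero padding that keeps the layer size fixed, the ReLU non-linearity and the odd kernel length are all assumptions, and the "convolution" is implemented as an unflipped sliding dot product.

```python
def conv1d_same(signal, kernel):
    """Sliding dot product with zero padding; output length equals input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def relu(xs):
    return [max(0.0, x) for x in xs]


def recursive_conv(signal, first_kernel, shared_kernel, num_layers):
    """The first layer has its own kernel; all later layers reuse shared_kernel."""
    h = relu(conv1d_same(signal, first_kernel))
    for _ in range(num_layers - 1):
        h = relu(conv1d_same(h, shared_kernel))  # tied weights across layers
    return h


print(recursive_conv([0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], num_layers=3))
```

Since every layer beyond the first has the same size and topology, depth can be changed by altering `num_layers` alone, without adding parameters.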


See also artificial intelligence.



Torch7 is a scientific computing framework with wide support for machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language, LuaJIT, and an underlying C implementation.

Among other things, it provides:

  • a powerful N-dimensional array
  • lots of routines for indexing, slicing, transposing, …
  • amazing interface to C, via LuaJIT
  • linear algebra routines
  • neural network, and energy-based models
  • numeric optimization routines

Torch implementation of the softmax algorithm for convolutional neural networks in Lua.

Deep learning in Haskell.


Convolutional neural networks

  • Michael Mathieu, Mikael Henaff, Yann LeCun, Fast Training of Convolutional Networks through FFTs

Hessian-free networks

Hierarchical neural networks

  • Ranzato, M. A., Huang, F., Boureau, Y., and LeCun, Y. (2007). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’07), pages 1–8. IEEE Press.
  • Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932.
  • Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. In J. P. et al., editors, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press.

Monoidal neural networks


Survey Papers on Deep Learning

Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp.1-127, 2009.

Yoshua Bengio, Aaron Courville, Pascal Vincent, Representation Learning: A Review and New Perspectives, arXiv, 2012.

Deep Learning Code Tutorials

The Deep Learning Tutorials are a walk-through with code for several important Deep Architectures (in progress; teaching material for Yoshua Bengio’s IFT6266 course).

Unsupervised Feature and Deep Learning

Stanford’s Unsupervised Feature and Deep Learning tutorials have wiki pages and MATLAB code examples for several basic concepts and algorithms used for unsupervised feature learning and deep learning.

Videos

Deep Learning Representations

  • Yoshua Bengio’s Google tech talk on Deep Learning Representations at Google Montreal (Google Montreal, 11/13/2012)
  • Deep Learning with Multiplicative Interactions: Geoffrey Hinton’s talk at the Redwood Center for Theoretical Neuroscience (UC Berkeley, March 2010).

Recent developments on Deep Learning

  • Geoffrey Hinton’s Google Tech Talk, March 2010.
  • Learning Deep Hierarchies of Representations: a general presentation given by Yoshua Bengio in September 2009, also at Google.

A New Generation of Neural Networks

Geoffrey Hinton’s December 2007 Google TechTalk.

Deep Belief Networks

Geoffrey Hinton’s 2007 NIPS Tutorial [updated 2009] on Deep Belief Networks: 3-hour video, ppt, pdf, readings.

Training deep networks efficiently

Geoffrey Hinton’s talk at Google about dropout, “Brains, Sex, and Machine Learning”.

Deep Learning and NLP

  • Yoshua Bengio and Richard Socher’s talk, “Deep Learning for NLP (without magic)”, at ACL 2012.
  • Tutorial on Learning Deep Architectures: Yoshua Bengio and Yann LeCun’s presentation at the “ICML Workshop on Learning Feature Hierarchies” on June 18th 2009.

Energy-based Learning

  • [LeCun et al 2006]. A Tutorial on Energy-Based Learning, in Bakir et al. (eds) “Predicting Structured Outputs”, MIT Press 2006: a 60-page tutorial on energy-based learning, with an emphasis on structured-output models. The tutorial includes an annotated bibliography of discriminative learning, with a simple view of CRF, maximum-margin Markov nets, and graph transformer networks.

  • A 2006 tutorial on Energy-Based Learning given at the 2006 CIAR Summer School: Neural Computation & Adaptive Perception. [Energy-Based Learning: Slides in DjVu (5.2MB), Slides in PDF (18.2MB)] [Deep Learning for Generic Object Recognition: Slides in DjVu (3.8MB), Slides in PDF (11.6MB)]

  • ECCV 2010 Tutorial

  • Feature learning for Image Classification (by Kai Yu and Andrew Ng): introducing a paradigm of feature learning from unlabeled images, with an emphasis on applications to supervised image classification.

  • NIPS 2010 Workshop

Deep Learning and Unsupervised Feature Learning: basic concepts about unsupervised feature learning and deep learning methods with links to papers and code.

Geoffrey Hinton’s Online Neural networks Course on Coursera.