Wikipedia describes the field as:
Deep learning (also called deep structured learning or hierarchical learning) is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using model architectures composed of multiple non-linear transformations.
Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition where they have been shown to produce state-of-the-art results on various tasks.
Alternatively, “deep learning” has been characterized as “just a buzzword for”,[5] or “largely a rebranding of”, neural networks.[6]
A comprehensive historical survey of methods can be found in:
In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Scholkopf et al., 1998) in numerous important applications.
The bias/variance dilemma is often addressed through strong prior assumptions. Weight decay encourages near-zero weights by penalizing larger weights.
In a Bayesian framework, weight decay can be derived from Gaussian or Laplacian weight priors (Hinton and van Camp, 1993).
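As a concrete illustration, here is a minimal numpy sketch of a gradient step on an L2-penalized (weight-decayed) objective; the toy quadratic loss, the decay coefficient `lam`, and the learning rate are assumptions for illustration, not taken from the surveys quoted above.

```python
import numpy as np

def l2_penalized_grad_step(w, grad_loss, lam=1e-4, lr=0.01):
    """One gradient step on loss(w) + (lam/2) * ||w||^2.

    The extra lam * w term shrinks the weights toward zero at every step
    ("weight decay"); in the Bayesian view it corresponds to a zero-mean
    Gaussian prior on the weights with variance 1/lam.
    """
    grad = grad_loss(w) + lam * w   # gradient of the penalized objective
    return w - lr * grad

# Toy usage: quadratic loss 0.5 * ||w - 1||^2 (purely illustrative).
w = np.random.randn(5)
for _ in range(1000):
    w = l2_penalized_grad_step(w, lambda v: v - 1.0, lam=0.1, lr=0.1)
print(w)  # settles near 1/1.1 ≈ 0.91, below 1.0, because the decay term pulls toward zero
```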
Many UL methods automatically and robustly generate distributed, sparse representations of input patterns through well-known feature detectors such as off-center-on-surround-like structures, as well as orientation-sensitive edge detectors and Gabor filters. They extract simple features related to those observed in the early visual pre-processing stages of biological systems (Jones and Palmer, 1987).
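For intuition, the sketch below builds a 2-D Gabor filter kernel with plain numpy: a Gaussian envelope modulating a sinusoidal carrier, the kind of orientation-sensitive edge detector referred to above. The particular parameter values are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size=21, sigma=3.0, theta=0.0, wavelength=6.0, phase=0.0):
    """Return a size x size Gabor kernel: Gaussian envelope * cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier oscillates along orientation theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + y_r**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + phase)
    return envelope * carrier

kernel = gabor_kernel(theta=np.pi / 4)   # responds to edges at roughly 45 degrees
print(kernel.shape)                      # (21, 21)
```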
This is the problem of vanishing or exploding gradients (a.k.a. the long time lag problem).
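A toy numpy sketch of why this happens: backpropagation through many layers multiplies many Jacobian factors, so gradient norms shrink or blow up roughly geometrically with depth. The depth, width, and weight scales below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norm(depth, scale, width=50):
    """Norm of a gradient pushed back through `depth` random linear layers."""
    g = rng.standard_normal(width)
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        g = W.T @ g          # one step of backpropagation through a linear layer
    return np.linalg.norm(g)

for scale in (0.5, 1.0, 2.0):
    print(scale, backprop_norm(depth=50, scale=scale))
# scale < 1: the norm collapses toward 0 (vanishing); scale > 1: it explodes.
```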
Hessian-free optimization can alleviate the problem for FNNs (Møller, 1993; Pearlmutter, 1994; Schraudolph, 2002; Martens, 2010) (Sec. 5.6.2) and RNNs (Martens and Sutskever, 2011) (Sec. 5.20).
Schmidhuber (2014) describes how this works.
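Hessian-free methods never form the Hessian explicitly; they only need Hessian-vector products, for example inside conjugate-gradient steps. Below is a minimal finite-difference sketch of such a product (an approximation, not the exact R-operator of Pearlmutter, 1994); `grad_fn` and the quadratic test function are illustrative assumptions.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H(w) @ v using only two gradient evaluations:
    Hv ≈ (∇f(w + eps*v) - ∇f(w - eps*v)) / (2*eps).
    grad_fn is assumed to return the gradient of the loss at w."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

# Illustrative check on a quadratic f(w) = 0.5 * w.T @ A @ w, whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda w: A @ w
w, v = np.zeros(2), np.array([1.0, -1.0])
print(hessian_vector_product(grad_fn, w, v))  # ≈ A @ v = [2., -1.]
```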
The space of NN weight matrices can also be searched without relying on error gradients, thus avoiding the Fundamental Deep Learning Problem altogether. Random weight guessing sometimes works better than more sophisticated methods (Hochreiter and Schmidhuber, 1996). Certain more complex problems are better solved by using Universal Search (Levin, 1973b) for weight matrix-computing programs written in a universal programming language (Schmidhuber, 1997). Some are better solved by using linear methods to obtain optimal weights for connections to output events (Sec. 2), and evolving weights of connections to other events—this is called Evolino (Schmidhuber et al., 2007).
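As a hedged illustration of the first of these ideas, random weight guessing simply samples whole weight matrices and keeps the best one under the task loss. The toy regression task and sampling range below are assumptions for illustration, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task: fit y = X @ w_true with a single linear layer (illustrative only).
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def loss(w):
    return np.mean((X @ w - y) ** 2)

# Random weight guessing: sample candidate weight vectors, keep the best so far.
best_w, best_loss = None, np.inf
for _ in range(10000):
    w = rng.uniform(-3, 3, size=3)       # guess an entire weight vector at once
    l = loss(w)
    if l < best_loss:
        best_w, best_loss = w, l

print(best_w, best_loss)   # on this easy problem the best guess lands roughly near w_true
```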
Also in 2011 it was shown (Martens and Sutskever, 2011) that Hessian-free optimization…can alleviate the Fundamental Deep Learning Problem (Sec. 5.9) in recurrent neural networks (RNNs), outperforming standard gradient-based long short-term memory (LSTM) RNNs (Sec. 5.13) on several tasks.
There are also RNN algorithms (Jaeger, 2004; Schmidhuber et al., 2007; Pascanu et al., 2013b; Koutník et al., 2014) that at least sometimes yield better results than steepest descent for LSTM RNNs.
More recently, LSTM RNNs won several international pattern recognition competitions and set benchmark records on large and complex data sets, e.g., Sec. 5.17, 5.21, 5.22. Gradient-based LSTM is no panacea though—other methods sometimes outperformed it at least on certain tasks.
Abstract
A key challenge in designing convolutional network models is sizing them appropriately. Many factors are involved in these decisions, including the numbers of layers, feature maps, kernel sizes, etc. Complicating this further is the fact that each of these influences not only the numbers and dimensions of the activation units, but also the total number of parameters. In this paper we focus on assessing the independent contributions of three of these linked variables: the numbers of layers, feature maps, and parameters. To accomplish this, we employ a recursive convolutional network whose weights are tied between layers; this allows us to vary each of the three factors in a controlled setting. We find that while increasing the numbers of layers and parameters each have clear benefit, the number of feature maps (and hence dimensionality of the representation) appears ancillary, and finds most of its benefit through the introduction of more weights. Our results (i) empirically confirm the notion that adding layers alone increases computational power, within the context of convolutional layers, and (ii) suggest that precise sizing of convolutional feature map dimensions is itself of little concern; more attention should be paid to the number of parameters in these layers instead.
Decisions
We can design a recursive model to have the same number of layers and parameters as the standard convolutional model, and thereby see if the number of feature maps (which differs) is important or not. Or we can match the number of feature maps and parameters to see if the number of layers (and number of non-linearities) matters.
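To make the weight-tying idea behind these comparisons concrete, here is a minimal sketch (in 1-D with plain numpy, not the authors' 2-D convolutional architecture): applying the same kernel repeatedly adds layers and non-linearities without adding parameters, which is what lets the three factors be varied independently. The sizes and ReLU choice are illustrative assumptions.

```python
import numpy as np

def recursive_conv(x, kernel, depth):
    """Apply the SAME 1-D convolution kernel `depth` times with a ReLU
    in between: more layers and non-linearities, but no extra parameters,
    because the weights are tied across layers."""
    h = x
    for _ in range(depth):
        h = np.convolve(h, kernel, mode="same")  # tied weights at every layer
        h = np.maximum(h, 0.0)                   # ReLU non-linearity
    return h

x = np.random.randn(64)            # toy 1-D "feature map"
kernel = np.random.randn(5) * 0.3
shallow = recursive_conv(x, kernel, depth=1)
deep = recursive_conv(x, kernel, depth=8)   # 8 layers, still only 5 parameters
print(shallow.shape, deep.shape)
```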
…
recent gains have been found by using stacks of multiple unpooled convolution layers. For example, the convolutional model proposed by Krizhevsky et al. [13]
for ImageNet classification has five convolutional layers which turn out to be key to its performance. In [30], Zeiler and Fergus reimplemented this model and adjusted different parts in turn. One of the largest effects came from changing two convolutional layers in the middle of the model: removing them resulted in a 4.9% drop in performance, while expanding them improved performance by 3.0%. By comparison, removing the top two densely connected layers yielded a 4.3% drop, and expanding them a 1.7% gain, even though they have far more parameters. Hence the use of multiple convolutional layers is vital and the development of superior models relies on understanding their properties.
The model we employ has relations to recurrent neural networks. These are well-studied models [11, 21, 27], naturally suited to temporal and sequential data. For example, they have recently been shown to deliver excellent performance for phoneme recognition [8]
Our network can be considered a purely discriminative, convolutional version of LISTA or DrSAE.
In the multilayer convolutional network used here all layers beyond the first have the same size and connection topology.
Torch7 is a scientific computing framework with wide support for machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language, LuaJIT, and an underlying C implementation.
Among other things, it provides:
- a powerful N-dimensional array
- lots of routines for indexing, slicing, transposing, …
- amazing interface to C, via LuaJIT
- linear algebra routines
- neural network, and energy-based models
- numeric optimization routines
Torch implementation of the softmax algorithm for convolutional neural networks in Lua.
Deep learning in Haskell.
Martens, J. (2010). Deep learning via Hessian-free optimization. In Fürnkranz, J. and Joachims, T., editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742, Haifa, Israel. Omnipress.
Ryan Kiros, Training Neural Networks with Stochastic Hessian-Free Optimization (2013)
Razvan Pascanu and Yoshua Bengio, Revisiting natural gradient for deep networks (2014)
Oriol Vinyals and Daniel Povey, Krylov subspace descent for deep learning (2011)
W. Ross Morrow, Hessian-free methods for checking the second order sufficient conditions in equality-constrained optimisation and equilibrium problems (2011)
Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp.1-127, 2009.
Yoshua Bengio, Aaron Courville, Pascal Vincent, [Representation Learning: A Review and New Perspectives](http://arxiv.org/abs/1206.5538), arXiv, 2012.
The Deep Learning Tutorials are a walk-through with code for several important Deep Architectures (in progress; teaching material for Yoshua Bengio’s IFT6266 course).
Stanford’s Unsupervised Feature and Deep Learning tutorials have wiki pages and MATLAB code examples for several basic concepts and algorithms used for unsupervised feature learning and deep learning.
Geoffrey Hinton’s Google Tech Talk, March 2010.
Learning Deep Hierarchies of Representations, a general presentation done by Yoshua Bengio in September 2009, also at Google.
Geoffrey Hinton’s December 2007 Google Tech Talk.
Geoffrey Hinton’s 2007 NIPS Tutorial [updated 2009] on Deep Belief Networks: 3-hour video, ppt, pdf, readings
Geoffrey Hinton’s talk at Google about dropout and “Brain, Sex and Machine Learning”.
[LeCun et al., 2006]. A Tutorial on Energy-Based Learning, in Bakir et al. (eds) “Predicting Structured Outputs”, MIT Press 2006: a 60-page tutorial on energy-based learning, with an emphasis on structured-output models. The tutorial includes an annotated bibliography of discriminative learning, with a simple view of CRF, maximum-margin Markov nets, and graph transformer networks.
A 2006 Tutorial on Energy-Based Learning given at the 2006 CIAR Summer School: Neural Computation & Adaptive Perception. [Energy-Based Learning: Slides in DjVu (5.2MB), Slides in PDF (18.2MB)] [Deep Learning for Generic Object Recognition: Slides in DjVu (3.8MB), Slides in PDF (11.6MB)]
ECCV 2010 Tutorial
Feature learning for Image Classification (by Kai Yu and Andrew Ng): introducing a paradigm of feature learning from unlabeled images, with an emphasis on applications to supervised image classification.
NIPS 2010 Workshop
Deep Learning and Unsupervised Feature Learning: basic concepts about unsupervised feature learning and deep learning methods, with links to papers and code.
Geoffrey Hinton’s online Neural Networks course on Coursera.