From "Scaling learning algorithms towards AI", Y. Bengio and Y. LeCun, in Large-Scale Kernel Machines, 2007.
"A common explanation for the difficulty of deep network learning is the presence of local minima or plateaus in the loss function. Gradient-based optimization methods that start from random initial conditions appear to often get trapped in poor local minima or plateaus. The problem seems particularly dire for narrow networks (with few hidden units or with a bottleneck) and for networks with many symmetries (i.e., fully-connected networks in which hidden units are exchangeable). The solution recently introduced by Hinton et al. for training deep layered networks is based on a greedy, layer-wise unsupervised learning phase. The unsupervised learning phase provides an initial configuration of the parameters with which a gradient-based supervised learning phase is initialized. The main idea of the unsupervised phase is to pair each feed-forward layer with a feed-back layer that attempts to reconstruct the input of the layer from its output. This reconstruction criterion guarantees that most of the information contained in the input is preserved in the output of the layer. The resulting architecture is a so-called Deep Belief Network (DBN). After the initial unsupervised training of each feed-forward/feed-back pair, the feed-forward half of the network is refined using a gradient-descent based supervised method (back-propagation). This training strategy holds great promise as a principle to break through the problem of training deep networks. ... This strategy has not yet been much exploited in machine learning, but it is at the basis of the greedy layer-wise constructive learning algorithm for DBNs. More precisely, each layer is trained in an unsupervised way so as to capture the main features of the distribution it sees as input. It produces an internal representation for its input that can be used as input for the next layer.
In a DBN, each layer is trained as a Restricted Boltzmann Machine [Teh and Hinton, 2001] using the Contrastive Divergence [Hinton, 2002] approximation of the log-likelihood gradient. The outputs of each layer (i.e., hidden units) constitute a factored and distributed representation that estimates causes for the input of the layer. After the layers have been thus initialized, a final output layer is added on top of the network (e.g., predicting the class probabilities), and the whole deep network is fine-tuned by a gradient-based optimization of the prediction error. The only difference with an ordinary multi-layer neural network resides in the initialization of the parameters, which is not random, but is performed through unsupervised training of each layer in a sequential fashion."
Emphasis added by O.S.
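The greedy layer-wise scheme described in the first quoted paragraph, pairing each feed-forward layer with a feed-back layer that reconstructs the layer's input from its output, can be sketched as a stack of tied-weight autoencoders. This is a minimal NumPy illustration, not the authors' actual setup; the layer sizes, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder_layer(X, n_hidden, lr=0.5, epochs=200):
    # One feed-forward/feed-back pair: W encodes, its transpose (tied weights) decodes.
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_h = np.zeros(n_hidden)  # hidden (code) bias
    b_v = np.zeros(n_in)      # reconstruction bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)        # feed-forward: encode the input
        R = sigmoid(H @ W.T + b_v)      # feed-back: reconstruct the input
        dR = (R - X) * R * (1.0 - R)    # grad of 0.5*||R - X||^2 at decoder pre-activation
        dH = (dR @ W) * H * (1.0 - H)   # backprop through the tied encoder
        W -= lr * (X.T @ dH + dR.T @ H) / len(X)
        b_h -= lr * dH.mean(axis=0)
        b_v -= lr * dR.mean(axis=0)
    return W, b_h

def greedy_pretrain(X, layer_sizes):
    # Train one layer at a time; each layer's representation
    # becomes the input of the next layer, as in the quoted passage.
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b_h = train_autoencoder_layer(H, n_hidden)
        params.append((W, b_h))
        H = sigmoid(H @ W + b_h)
    return params
```

The weights returned by `greedy_pretrain` would then initialize the feed-forward half of the network before supervised fine-tuning with back-propagation, which is the initialization role the passage describes.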
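The second quoted paragraph says each layer of a DBN is trained as a Restricted Boltzmann Machine using the Contrastive Divergence approximation of the log-likelihood gradient. A minimal sketch of CD-1 for a binary RBM follows; the sizes, learning rate, and use of probabilities in the final statistics are common illustrative choices, not details taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible biases
        self.b_h = np.zeros(n_hidden)   # hidden biases
        self.lr = lr

    def sample_h(self, v):
        # Hidden units form a factored, distributed representation of v.
        p = sigmoid(v @ self.W + self.b_h)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        p = sigmoid(h @ self.W.T + self.b_v)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1_step(self, v0):
        # Positive phase: hidden statistics driven by the data.
        ph0, h0 = self.sample_h(v0)
        # Negative phase: one step of Gibbs sampling (the reconstruction).
        pv1, _ = self.sample_v(h0)
        ph1, _ = self.sample_h(pv1)
        n = v0.shape[0]
        # CD-1 approximation to the log-likelihood gradient.
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - pv1) ** 2))  # reconstruction error
```

Stacking such RBMs, with each machine's hidden probabilities feeding the next, and then adding an output layer for gradient-based fine-tuning, gives the training procedure the passage describes: only the initialization differs from an ordinary multi-layer network.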