1 Introduction
Recently, deep learning has emerged as a useful technique for data classification as well as for finding feature representations. We consider the scenario of multiclass classification. A deep neural network maps each feature vector to one of the class labels by the connection of nodes in a multi-layer structure. Between two adjacent layers, a weight matrix maps the inputs (values in the previous layer) to the outputs (values in the current layer). Assume the training set includes $(\boldsymbol{y}^i, \boldsymbol{x}^i)$, $i = 1, \ldots, l$, where $\boldsymbol{x}^i \in \mathbb{R}^{n_0}$ is the feature vector and $\boldsymbol{y}^i \in \mathbb{R}^{K}$ is the label vector. If $\boldsymbol{x}^i$ is associated with label $k$, then
$$\boldsymbol{y}^i = [\underbrace{0, \ldots, 0}_{k-1}, 1, 0, \ldots, 0]^T \in \mathbb{R}^{K},$$
where $K$ is the number of classes and $1, \ldots, K$ are the possible labels. After collecting all weights and biases as the model vector $\boldsymbol{\theta}$ and choosing a loss function $\xi(\boldsymbol{\theta}; \boldsymbol{x}^i, \boldsymbol{y}^i)$, a neural-network problem can be written as the following optimization problem:
$$\min_{\boldsymbol{\theta}} \; f(\boldsymbol{\theta}), \qquad (1)$$
where
$$f(\boldsymbol{\theta}) = \frac{1}{2C} \boldsymbol{\theta}^T \boldsymbol{\theta} + \frac{1}{l} \sum_{i=1}^{l} \xi(\boldsymbol{\theta}; \boldsymbol{x}^i, \boldsymbol{y}^i). \qquad (2)$$
The regularization term avoids overfitting the training data, while the parameter $C$ balances the regularization term and the loss term. The function $f(\boldsymbol{\theta})$ is non-convex because of the connection between weights in different layers. This non-convexity and the large number of weights cause tremendous difficulties in training large-scale deep neural networks. To apply an optimization algorithm to solve (2), the calculation of the function, gradient, and Hessian can be expensive. Currently, stochastic gradient (SG) methods are the most commonly used way to train deep neural networks (e.g., Bottou, 1991; LeCun et al., 1998b; Bottou, 2010; Zinkevich et al., 2010; Dean et al., 2012; Moritz et al., 2015). In particular, some expensive operations can be efficiently conducted in GPU environments (e.g., Ciresan et al., 2010; Krizhevsky et al., 2012; Hinton et al., 2012). Besides stochastic gradient methods, some works such as Martens (2010), Kiros (2013), and He et al. (2016) have considered a Newton method that uses Hessian information. Other optimization methods such as ADMM have also been considered (Taylor et al., 2016).
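As a concrete illustration, the objective in (2) is a regularization term plus an average loss. The following sketch assumes a squared loss and a generic `predict` function; all names are illustrative and not from the paper's implementation:

```python
import numpy as np

def objective(theta, X, Y, predict, C):
    """Evaluate the regularized objective of (2):
    (1/2C) theta^T theta + (1/l) sum of per-instance losses.

    theta   : flat parameter vector
    X, Y    : features and one-hot labels, one row per instance
    predict : hypothetical function mapping (theta, x) -> network output
    C       : regularization parameter balancing the two terms
    """
    reg = theta.dot(theta) / (2.0 * C)
    losses = [np.sum((predict(theta, x) - y) ** 2) for x, y in zip(X, Y)]
    return reg + np.mean(losses)
```

A larger `C` de-emphasizes the regularization term relative to the average loss.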
When the model or the data set is large, distributed training is needed. Following the design of the objective function in (2), we note it is easy to achieve data parallelism: if data instances are stored on different computing nodes, then each machine can calculate the local sum of training losses independently. (Training deep neural networks with data parallelism has been considered in SG, Newton, and other optimization methods. For example, He et al. (2015) implement a parallel Newton method by letting each node store a subset of instances.) However, achieving model parallelism is more difficult because of the complicated structure of deep neural networks. In this work, by considering a model that is stored in a distributed manner, we propose a novel distributed Newton method for deep learning. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices to reduce the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even if some nodes have not finished their tasks.
To keep the discussion focused, among the various types of neural networks, we consider standard feedforward networks in this work. We do not consider other types such as the convolutional networks that are popular in computer vision.
This work is organized as follows. Section 2 introduces existing Hessian-free Newton methods for deep learning. In Section 3, we propose a distributed Newton method for training neural networks. We then develop novel techniques in Section 4 to reduce running time and memory consumption. In Section 5 we analyze the cost of the proposed algorithm. Additional implementation techniques are given in Section 6. Then Section 7 reviews some existing optimization methods, while experiments in Section 8 demonstrate the effectiveness of the proposed method. Programs used for experiments in this paper are available at
http://www.csie.ntu.edu.tw/~cjlin/papers/dnn.
Supplementary materials including a list of symbols and additional experiments can be found at the same web address.
2 Hessian-free Newton Method for Deep Learning
In this section, we begin by introducing feedforward neural networks and then review existing Hessian-free Newton methods for solving the optimization problem.
2.1 Feedforward Networks
A multi-layer neural network maps each feature vector to a class vector via the connection of nodes. There is a weight matrix between two adjacent layers to map the input vector (the previous layer) to the output vector (the current layer). The network in Figure 1 is an example.
Let $n_m$ denote the number of nodes at the $m$th layer. We use $n_0$-$n_1$-$\cdots$-$n_L$ to represent the structure of the network. (Note that $n_0$ is the number of features and $n_L$ is the number of classes.) The weight matrix
$$W^m \in \mathbb{R}^{n_{m-1} \times n_m}$$
and the bias vector
$$\boldsymbol{b}^m \in \mathbb{R}^{n_m}$$
are used at the $m$th layer. Let
$$\boldsymbol{z}^{0,i} = \boldsymbol{x}^i$$
be the feature vector for the $i$th instance, and let $\boldsymbol{s}^{m,i}$ and $\boldsymbol{z}^{m,i}$ denote the input and output vectors of the $i$th instance at the $m$th layer, respectively. We can use
$$\boldsymbol{s}^{m,i} = (W^m)^T \boldsymbol{z}^{m-1,i} + \boldsymbol{b}^m, \qquad z_j^{m,i} = \sigma(s_j^{m,i}), \quad j = 1, \ldots, n_m, \qquad (3)$$
to derive the values of the next layer, where $\sigma(\cdot)$ is the activation function.
If $W^m$'s columns are concatenated into the following vector
$$\boldsymbol{w}^m = \begin{bmatrix} W^m_{:,1} \\ \vdots \\ W^m_{:,n_m} \end{bmatrix} \in \mathbb{R}^{n_{m-1} n_m},$$
then we can define
$$\boldsymbol{\theta} = \begin{bmatrix} \boldsymbol{w}^1 \\ \boldsymbol{b}^1 \\ \vdots \\ \boldsymbol{w}^L \\ \boldsymbol{b}^L \end{bmatrix}$$
as the weight vector of the whole deep neural network. The total number of parameters is
$$n = \sum_{m=1}^{L} (n_{m-1} + 1)\, n_m.$$
Because $\boldsymbol{z}^{L,i}$ is the output vector of the $i$th data instance, by using a loss function to compare it with the label vector $\boldsymbol{y}^i$, a neural network solves the following regularized optimization problem:
$$\min_{\boldsymbol{\theta}} \; f(\boldsymbol{\theta}),$$
where
$$f(\boldsymbol{\theta}) = \frac{1}{2C} \boldsymbol{\theta}^T \boldsymbol{\theta} + \frac{1}{l} \sum_{i=1}^{l} \xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i), \qquad (4)$$
$C$ is a regularization parameter, and $\xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i)$ is a convex function of $\boldsymbol{z}^{L,i}$. Note that we rewrite the loss function in (2) as $\xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i)$ because $\boldsymbol{z}^{L,i}$ is decided by $\boldsymbol{\theta}$ and $\boldsymbol{x}^i$. In this work, we consider the following loss function:
$$\xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i) = \big\| \boldsymbol{z}^{L,i} - \boldsymbol{y}^i \big\|^2. \qquad (5)$$
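The forward process (3) and the squared loss (5) can be sketched as follows (assuming a tanh activation; all names are illustrative):

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Forward process of (3): s^m = (W^m)^T z^{m-1} + b^m, z^m = sigma(s^m).

    weights : list of matrices W^m with shape (n_{m-1}, n_m)
    biases  : list of vectors b^m with shape (n_m,)
    Returns the output vector z^L of the last layer.
    """
    z = x
    for W, b in zip(weights, biases):
        s = W.T.dot(z) + b   # input vector s^m of the current layer
        z = activation(s)    # output vector z^m of the current layer
    return z

def squared_loss(z_L, y):
    """Loss function (5): squared distance between output and label."""
    return np.sum((z_L - y) ** 2)
```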
The gradient of $f(\boldsymbol{\theta})$ is
$$\nabla f(\boldsymbol{\theta}) = \frac{1}{C} \boldsymbol{\theta} + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{\boldsymbol{z}^{L,i}} \xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i), \qquad (6)$$
where
$$J^i = \begin{bmatrix} \dfrac{\partial z_1^{L,i}}{\partial \theta_1} & \cdots & \dfrac{\partial z_1^{L,i}}{\partial \theta_n} \\ \vdots & & \vdots \\ \dfrac{\partial z_{n_L}^{L,i}}{\partial \theta_1} & \cdots & \dfrac{\partial z_{n_L}^{L,i}}{\partial \theta_n} \end{bmatrix}, \quad i = 1, \ldots, l, \qquad (7)$$
is the Jacobian of $\boldsymbol{z}^{L,i}$, which is a function of $\boldsymbol{\theta}$. The Hessian matrix of $f(\boldsymbol{\theta})$ is
$$\nabla^2 f(\boldsymbol{\theta}) = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i + \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{n_L} \frac{\partial \xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i)}{\partial z_j^{L,i}} \, \nabla^2_{\boldsymbol{\theta}} z_j^{L,i}, \qquad (8)$$
where $I$ is the identity matrix and
$$B^i_{ts} = \frac{\partial^2 \xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i)}{\partial z_t^{L,i} \, \partial z_s^{L,i}}, \quad t, s = 1, \ldots, n_L. \qquad (9)$$
From now on, for simplicity, we let
$$\xi_i \equiv \xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i).$$
2.2 Hessian-free Newton Method
For standard Newton methods, at the $k$th iteration, we find a direction $\boldsymbol{d}$ minimizing the following second-order approximation of the function value:
$$\min_{\boldsymbol{d}} \; \nabla f(\boldsymbol{\theta}^k)^T \boldsymbol{d} + \frac{1}{2} \boldsymbol{d}^T H^k \boldsymbol{d}, \qquad (10)$$
where $H^k = \nabla^2 f(\boldsymbol{\theta}^k)$ is the Hessian matrix of $f$ at $\boldsymbol{\theta}^k$. To solve (10), we first calculate the gradient vector by a backward process based on (3) through the following equations:
$$\frac{\partial \xi_i}{\partial w_{tj}^m} = z_t^{m-1,i} \, \frac{\partial \xi_i}{\partial s_j^{m,i}}, \qquad (11)$$
$$\frac{\partial \xi_i}{\partial b_j^m} = \frac{\partial \xi_i}{\partial s_j^{m,i}}, \qquad (12)$$
$$\frac{\partial \xi_i}{\partial s_j^{m,i}} = \frac{\partial \xi_i}{\partial z_j^{m,i}} \, \sigma'(s_j^{m,i}), \qquad (13)$$
$$\frac{\partial \xi_i}{\partial z_t^{m-1,i}} = \sum_{j=1}^{n_m} w_{tj}^m \, \frac{\partial \xi_i}{\partial s_j^{m,i}}. \qquad (14)$$
Note that formally the summation in (13) should be
$$\frac{\partial \xi_i}{\partial s_j^{m,i}} = \sum_{t=1}^{n_m} \frac{\partial \xi_i}{\partial z_t^{m,i}} \, \frac{\partial z_t^{m,i}}{\partial s_j^{m,i}},$$
but it is simplified because $z_j^{m,i}$ is associated with only $s_j^{m,i}$.
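The backward process of (11)-(14) can be sketched for a single instance as follows (assuming a tanh activation and the squared loss of (5); this is an illustrative sketch, not the paper's code):

```python
import numpy as np

def backward_gradient(x, y, weights, biases, activation=np.tanh,
                      activation_grad=lambda s: 1.0 - np.tanh(s) ** 2):
    """Per-layer gradients (dW^m, db^m) of xi_i = ||z^L - y||^2
    via the backward equations (11)-(14)."""
    # forward pass, storing s^m and z^m of every layer
    zs, ss = [x], []
    for W, b in zip(weights, biases):
        s = W.T.dot(zs[-1]) + b
        ss.append(s)
        zs.append(activation(s))
    # d xi / d z^L for the squared loss (5)
    dz = 2.0 * (zs[-1] - y)
    grads = []
    for m in reversed(range(len(weights))):
        ds = dz * activation_grad(ss[m])  # (13)
        dW = np.outer(zs[m], ds)          # (11)
        db = ds                           # (12)
        dz = weights[m].dot(ds)           # (14), passed to the previous layer
        grads.append((dW, db))
    return grads[::-1]
```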
If $H^k$ is positive definite, then (10) is equivalent to solving the following linear system:
$$H^k \boldsymbol{d} = -\nabla f(\boldsymbol{\theta}^k). \qquad (15)$$
Unfortunately, for the optimization problem (10), it is well known that the objective function may be non-convex, so $H^k$ is not guaranteed to be positive definite. Following Schraudolph (2002), we can use the Gauss-Newton matrix as an approximation of the Hessian. That is, we remove the last term in (8) and obtain the following positive-definite matrix:
$$G = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i. \qquad (16)$$
Note that from (9), each $B^i$, $i = 1, \ldots, l$, is positive semi-definite if we require that $\xi(\boldsymbol{z}^{L,i}; \boldsymbol{y}^i)$ is a convex function of $\boldsymbol{z}^{L,i}$. Therefore, instead of using (15), we solve the following linear system to find a direction $\boldsymbol{d}^k$ for deep neural networks:
$$(G^k + \lambda_k I)\, \boldsymbol{d}^k = -\nabla f(\boldsymbol{\theta}^k), \qquad (17)$$
where $G^k$ is the Gauss-Newton matrix at the $k$th iteration and we add the term $\lambda_k I$ because we consider the Levenberg-Marquardt method (see details in Section 4.5).
For deep neural networks, because the total number of weights may be very large, it is hard to store the Gauss-Newton matrix. Therefore, Hessian-free algorithms have been applied to solve (17); examples include Martens (2010) and Ngiam et al. (2011). Specifically, conjugate gradient (CG) methods are often used, so a sequence of Gauss-Newton matrix-vector products is conducted. Martens (2010) and Wang et al. (2015) use the $\mathcal{R}$ operator (Pearlmutter, 1994) to implement the product without storing the Gauss-Newton matrix.
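A CG loop only needs products of $(G + \lambda I)$ with vectors, never $G$ itself. The following sketch forms those products from explicitly stored $J^i$ and $B^i$ (which the $\mathcal{R}$-operator approach avoids storing); all names are illustrative:

```python
import numpy as np

def gauss_newton_product(v, Js, Bs, C, l, lam):
    """Product (G + lam*I) v with G from (16), given explicit J^i, B^i."""
    Gv = v / C + lam * v
    for J, B in zip(Js, Bs):
        Gv += J.T.dot(B.dot(J.dot(v))) / l   # (J^i)^T B^i (J^i v)
    return Gv

def cg_solve(grad, Js, Bs, C, l, lam, iters=50, tol=1e-8):
    """Conjugate gradient for (17): (G + lam*I) d = -grad."""
    d = np.zeros_like(grad)
    r = -grad - gauss_newton_product(d, Js, Bs, C, l, lam)
    p = r.copy()
    rs = r.dot(r)
    for _ in range(iters):
        Ap = gauss_newton_product(p, Js, Bs, C, l, lam)
        alpha = rs / p.dot(Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

Each CG iteration touches every instance once through one matrix-vector product, which is the operation the distributed method of Section 3 parallelizes.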
Because the use of the $\mathcal{R}$ operator in the Newton method is not the focus of this work, we leave detailed discussion to Sections II–III of the supplementary materials.
3 Distributed Training by Variable Partition
The main computational bottleneck in a Hessian-free Newton method is the sequence of matrix-vector products in the CG procedure. To reduce the running time, parallel matrix-vector multiplications should be conducted. However, the $\mathcal{R}$ operator discussed in Section 2 and Section II of the supplementary materials is inherently sequential: in a forward process, results in the current layer must be finished before those in the next. In this section, we propose an effective distributed algorithm for training deep neural networks.
3.1 Variable Partition
Instead of using the $\mathcal{R}$ operator to calculate the matrix-vector product, we consider the whole Jacobian matrix and directly use the Gauss-Newton matrix in (16) for the matrix-vector products in the CG procedure. This setting is possible for the following reasons:

- A distributed environment is used.
- With some techniques, we do not need to explicitly store every element of the Jacobian matrix.
Details will be described in the rest of this paper. To begin, we split each $J^i$ into partitions
$$J^i = [\, J^i_1 \;\; \cdots \;\; J^i_P \,].$$
Because the number of columns in $J^i$ is the same as the number of variables in the optimization problem, essentially we partition the variables into $P$ subsets. Specifically, we split the neurons in each layer into several groups. Then the weights connecting one group of the current layer to one group of the next layer form a subset of our variable partition. For example, consider the neural network in Figure 2. By splitting its three layers into groups, we obtain the total number of partitions shown there; each partition in Figure 2 is responsible for a sub-matrix of $J^i$. In addition, we distribute the bias variables of each layer to the partitions corresponding to the first neuron subgroup of that layer, so a layer's bias vector is split across those partitions. By the variable partition, we achieve model parallelism. Further, our data points are split in a feature-wise way to the nodes corresponding to partitions between layers 0 and 1. Therefore, we also have data parallelism.
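The neuron-grouping idea can be sketched as follows: splitting the neurons of two adjacent layers into groups induces a block partition of the weight matrix between them (an illustrative sketch; `partition_weights` is not from the paper's code):

```python
import numpy as np

def partition_weights(W, row_groups, col_groups):
    """Split weight matrix W (n_{m-1} x n_m) into blocks, one block per
    pair (neuron subgroup of layer m-1, neuron subgroup of layer m).

    Returns a dict mapping (row_group, col_group) -> sub-matrix; each
    sub-matrix corresponds to one variable partition between the layers.
    """
    row_idx = np.array_split(np.arange(W.shape[0]), row_groups)
    col_idx = np.array_split(np.arange(W.shape[1]), col_groups)
    return {(r, c): W[np.ix_(ri, ci)]
            for r, ri in enumerate(row_idx)
            for c, ci in enumerate(col_idx)}
```

Splitting two adjacent layers into $g_1$ and $g_2$ groups yields $g_1 \times g_2$ partitions between them.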
With the variable partition, the second term of the Gauss-Newton matrix (16) for the $i$th instance can be represented as
$$(J^i)^T B^i J^i = \begin{bmatrix} (J^i_1)^T \\ \vdots \\ (J^i_P)^T \end{bmatrix} B^i \, [\, J^i_1 \;\; \cdots \;\; J^i_P \,].$$
In the CG procedure to solve (17), the product between the Gauss-Newton matrix and a vector $\boldsymbol{v} = [\boldsymbol{v}_1; \ldots; \boldsymbol{v}_P]$ is
$$G \boldsymbol{v} = \frac{1}{C} \boldsymbol{v} + \frac{1}{l} \sum_{i=1}^{l} \begin{bmatrix} (J^i_1)^T B^i \left( \sum_{p=1}^{P} J^i_p \boldsymbol{v}_p \right) \\ \vdots \\ (J^i_P)^T B^i \left( \sum_{p=1}^{P} J^i_p \boldsymbol{v}_p \right) \end{bmatrix}.$$
However, after the variable partition, each $J^i_p$ may still be a huge matrix. The total space for storing $J^i_p, \forall i$, at one partition is roughly
$$l \times n_L \times \frac{n}{P}.$$
If $l$, the number of data instances, is so large that
$$l \times n_L \gg \frac{n}{P},$$
then storing $J^i_p, \forall i$, requires more space than the Gauss-Newton matrix. To reduce the memory consumption, we will propose effective techniques in Sections 3.3, 4.3, and 6.1.
3.2 Distributed Function Evaluation
From (3) we know how to evaluate the function value on a single machine, but the implementation in a distributed environment is not trivial. Here we check the details from the perspective of an individual partition. Consider a partition that involves neurons in subsets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively; this partition thus owns the weights $w_{tj}^m$, $t \in T_{m-1}$, $j \in T_m$. Because (3) is a forward process, we assume that
$$z_t^{m-1,i}, \quad t \in T_{m-1}, \; \forall i,$$
are available at the current partition. The goal is to generate
$$z_j^{m,i}, \quad j \in T_m, \; \forall i,$$
and pass them to partitions between layers $m$ and $m+1$. To begin, recall from (3) that
$$s_j^{m,i} = \sum_{t=1}^{n_{m-1}} w_{tj}^m z_t^{m-1,i} + b_j^m, \quad j \in T_m. \qquad (19)$$
Then, from (3), the following local values can be calculated for $j \in T_m$, $\forall i$:
$$\hat{s}_j^{m,i} = \sum_{t \in T_{m-1}} w_{tj}^m z_t^{m-1,i}. \qquad (20)$$
After the local sum in (20) is obtained, we must sum up values in partitions between layers $m-1$ and $m$ that share the same subset $T_m$:
$$s_j^{m,i} = \sum_{p} \hat{s}_{j,p}^{m,i} + b_j^m, \qquad z_j^{m,i} = \sigma(s_j^{m,i}), \qquad (21)$$
where $j \in T_m$, $i = 1, \ldots, l$, and $p$ runs over the partitions between layers $m-1$ and $m$ that correspond to $T_m$. The resulting values should be broadcast to partitions between layers $m$ and $m+1$ that correspond to the neuron subset $T_m$. We explain details of (21) and the broadcast operation in Section 3.2.1.
3.2.1 Allreduce and Broadcast Operations
The goal of (21) is to generate and broadcast values to some partitions between layers $m$ and $m+1$, so a reduce operation seems to be sufficient. However, we will explain in Section 3.3 that for the Jacobian evaluation, and then the product between the Gauss-Newton matrix and a vector, the partitions between layers $m-1$ and $m$ corresponding to $T_m$ also need $s_j^{m,i}$ for calculating
$$\sigma'(s_j^{m,i}), \quad j \in T_m, \; \forall i. \qquad (22)$$
To this end, we consider an allreduce operation so that not only are values reduced from the partitions between layers $m-1$ and $m$, but the result is also broadcast back to them. After this is done, we make the same result available in partitions between layers $m$ and $m+1$ by choosing the partition corresponding to the first neuron subgroup of layer $m-1$ to conduct a broadcast operation. Note that for partitions between layers $L-1$ and $L$ (i.e., the last layer), a broadcast operation is not needed.
Consider the example in Figure 2. The three partitions sharing the same neuron subgroup of the next layer must all get the values calculated via (21):
(23)
The three local sums are available at the three partitions, respectively. We first conduct an allreduce operation so that the summed values are available at all three partitions. Then we choose one of them to broadcast the values to the corresponding partitions between the next two layers.
Depending on the system configurations, suitable ways can be considered for implementing the allreduce and the broadcast operations (Thakur et al., 2005). In Section IV of supplementary materials we give details of our implementation.
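A single-process simulation of the two collectives may clarify the pattern (real implementations would use MPI-style collectives as discussed by Thakur et al., 2005; all names are illustrative):

```python
import numpy as np

def allreduce(local_vectors):
    """Sum the local vectors held by a set of partitions and give every
    one of them the result (simulated allreduce for the sum in (21))."""
    total = np.sum(local_vectors, axis=0)
    return [total.copy() for _ in local_vectors]

def broadcast(value, n_receivers):
    """One chosen partition sends a value to n_receivers partitions
    (simulated broadcast to the next layer's partitions)."""
    return [value.copy() for _ in range(n_receivers)]
```

In the paper's scheme, the allreduce runs among the partitions between layers $m-1$ and $m$ that share the same neuron subset, and one designated partition then broadcasts the result to the next layer's partitions.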
To derive the loss value, we need one final reduce operation. For the example in Figure 2, in the end the local loss sums are respectively available in the partitions between the last two layers. We then need the following reduce operation
(24)
and let one partition have the loss term in the objective value.
We have discussed the calculation of the loss term in the objective value, but we also need to obtain the regularization term. One possible setting is that, before the loss-term calculation, we run a reduce operation to sum up all local regularization terms. For example, in one partition corresponding to neuron subgroups $T_{m-1}$ at layer $m-1$ and $T_m$ at layer $m$, the local value is
$$\sum_{t \in T_{m-1}} \sum_{j \in T_m} \left( w_{tj}^m \right)^2. \qquad (25)$$
On the other hand, we can embed the calculation into the forward process for obtaining the loss term. The idea is that we append the local regularization term in (25) to the vector in (20) for the allreduce operation in (21). The cost is negligible because we only increase the length of each vector by one. After the allreduce operation, we broadcast the resulting vector to partitions between layers $m$ and $m+1$ that correspond to the neuron subgroup $T_m$. We cannot let each partition keep the broadcast value for subsequent allreduce operations, because regularization terms in previous layers would then be counted several times. To this end, we allow only the partition corresponding to $T_m$ in layer $m$ and the first neuron subgroup in layer $m+1$ to collect the value and include it with its local regularization term for the subsequent allreduce operation. By continuing the forward process, in the end we obtain the whole regularization term.
We use Figure 2 to give an illustration. The allreduce operation in (23) now also calculates
(26) 
The resulting value is broadcasted to
Then only one designated partition collects the value and generates the following local sum:
In the end, each partition between the last two layers contains the regularization terms accumulated from its preceding partitions.
We can then extend the reduce operation in (24) to generate the final value of the regularization term.
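The embedding trick can be sketched as follows: each partition appends its accumulated regularization value to the vector of local sums, so one allreduce handles both quantities (an illustrative sketch; the local value follows the form of (25)):

```python
import numpy as np

def local_regularization(W_block):
    """Sum of squared weights owned by this partition, as in (25)."""
    return np.sum(W_block ** 2)

def message_with_reg(local_sums, reg_so_far, W_block):
    """Append the accumulated regularization value to the vector of local
    sums (20); the allreduce in (21) then also sums the regularization
    term, growing each message by only one entry."""
    return np.append(local_sums, reg_so_far + local_regularization(W_block))
```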
3.3 Distributed Jacobian Calculation
From (7), and similar to the way of calculating the gradient in (11)-(14), the Jacobian matrix satisfies the following properties:
$$\frac{\partial z_u^{L,i}}{\partial w_{tj}^m} = \frac{\partial z_u^{L,i}}{\partial s_j^{m,i}} \, z_t^{m-1,i}, \qquad (28)$$
$$\frac{\partial z_u^{L,i}}{\partial b_j^m} = \frac{\partial z_u^{L,i}}{\partial s_j^{m,i}}, \qquad (29)$$
where $u = 1, \ldots, n_L$, $t = 1, \ldots, n_{m-1}$, and $j = 1, \ldots, n_m$. However, these formulations do not reveal how they are calculated in a distributed setting. Similar to Section 3.2, we check the details from the perspective of any variable partition. Assume the current partition involves neurons in subsets $T_{m-1}$ and $T_m$ from layers $m-1$ and $m$, respectively. Then we aim to obtain the following Jacobian components:
$$\frac{\partial z_u^{L,i}}{\partial w_{tj}^m} \;\; \text{and} \;\; \frac{\partial z_u^{L,i}}{\partial b_j^m}, \quad u = 1, \ldots, n_L, \; t \in T_{m-1}, \; j \in T_m, \; \forall i.$$
Before showing how to calculate them, we first get from (3) that
$$\frac{\partial z_u^{L,i}}{\partial s_j^{m,i}} = \frac{\partial z_u^{L,i}}{\partial z_j^{m,i}} \, \sigma'(s_j^{m,i}), \qquad (30)$$
$$\frac{\partial s_j^{m,i}}{\partial w_{tj}^m} = z_t^{m-1,i}, \qquad (31)$$
$$\frac{\partial z_u^{L,i}}{\partial z_t^{m-1,i}} = \sum_{j=1}^{n_m} w_{tj}^m \, \frac{\partial z_u^{L,i}}{\partial s_j^{m,i}}. \qquad (32)$$
From (28)-(32), the elements of the local Jacobian matrix can be derived by
$$\frac{\partial z_u^{L,i}}{\partial w_{tj}^m} = \frac{\partial z_u^{L,i}}{\partial z_j^{m,i}} \, \sigma'(s_j^{m,i}) \, z_t^{m-1,i} \qquad (33)$$
and
$$\frac{\partial z_u^{L,i}}{\partial b_j^m} = \frac{\partial z_u^{L,i}}{\partial z_j^{m,i}} \, \sigma'(s_j^{m,i}), \qquad (34)$$
where $u = 1, \ldots, n_L$, $t \in T_{m-1}$, $j \in T_m$, and $i = 1, \ldots, l$.
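For one partition, the computation of (33)-(34) can be sketched in vectorized form (assuming a tanh activation; array names are illustrative, not from the paper's code):

```python
import numpy as np

def local_jacobian(dz_L_dz_m, s_m, z_prev,
                   act_grad=lambda s: 1.0 - np.tanh(s) ** 2):
    """Local Jacobian elements via (33)-(34) for one instance.

    dz_L_dz_m : (n_L, |T_m|) array of dz_u^L / dz_j^m
    s_m       : (|T_m|,) inputs s_j^m at the current layer
    z_prev    : (|T_{m-1}|,) outputs of the previous layer
    Returns (dW, db) with dW[u, t, j] = dz_u^L/dw_tj and
    db[u, j] = dz_u^L/db_j.
    """
    ds = dz_L_dz_m * act_grad(s_m)           # dz_u^L / ds_j^m, via (30)
    dW = np.einsum('uj,t->utj', ds, z_prev)  # multiply by z_t^{m-1}: (33)
    return dW, ds                            # ds itself is (34)
```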
We now discuss how to make the values on the right-hand side of (33) and (34) available at the current computing node. From (19), we have
$$z_t^{m-1,i}, \quad t \in T_{m-1}, \; \forall i,$$
available from the forward process of calculating the function value. Further, in (21)-(22), to obtain the values for layers $m-1$ and $m$, we use an allreduce operation rather than a reduce operation, so that
$$s_j^{m,i}, \quad j \in T_m, \; \forall i,$$
are available at the current partition between layers $m-1$ and $m$. Therefore, $\sigma'(s_j^{m,i})$ in (33)-(34) can be obtained. The remaining issue is to generate $\partial z_u^{L,i} / \partial z_j^{m,i}$. We will show that these values can be obtained by a backward process. Because the discussion assumes that we are currently at a partition between layers $m-1$ and $m$, we show details of generating $\partial z_u^{L,i} / \partial z_t^{m-1,i}$ and dispatching them to partitions between layers $m-2$ and $m-1$. From (3) and (30), $\partial z_u^{L,i} / \partial z_t^{m-1,i}$ can be calculated by
$$\frac{\partial z_u^{L,i}}{\partial z_t^{m-1,i}} = \sum_{j=1}^{n_m} w_{tj}^m \, \frac{\partial z_u^{L,i}}{\partial z_j^{m,i}} \, \sigma'(s_j^{m,i}). \qquad (35)$$
Therefore, we consider a backward process of using $\partial z_u^{L,i} / \partial z_j^{m,i}$ to generate $\partial z_u^{L,i} / \partial z_t^{m-1,i}$. In a distributed system, from (32) and (35),