Deep Gradient Compression for Distributed Training



Motivation

e.g., Uber's Horovod framework requires an expensive high-bandwidth network

Enable distributed training on less expensive networks, e.g., AWS 1 Gbit/s Ethernet

Democratize deep learning training

Distributed training basics

Gradient basics

- SGD
- Optimization problem, single node: like finding a direction when climbing downhill
- Multiple nodes: each node has its own images and finds its own direction. How do they merge? They need to communicate and exchange gradients over the network.
- The exchange can be bulky, e.g., AlexNet has ~240 MB of weights and ResNet ~100 MB. Every iteration, every node has to exchange ~100 MB of gradients with every other node, which makes the network the bottleneck of the infrastructure.

In synchronized training, each node needs to know every other node's computed gradient.

Parallel computing and Load distribution
1. Data parallelism
Different chunks of data go to different nodes; this is easier to implement. Every node runs the same model (CNN or RNN): node 1 may get training images 1-32, node 2 the next 32 images, and so on. All nodes share the same model but are fed different chunks of data; each computes local gradients from its own chunk and then exchanges gradients with the others.

Can be implemented in two ways:
a. Parameter server (centralized): it receives the gradients from all nodes, sums them up, computes the average, updates its copy of the weights, and then broadcasts the result to all training nodes.

b. All-reduce operation (decentralized): every node receives every other node's computed gradients and then calculates the average. A basic implementation still has a master training node (a tree-like structure); more advanced variants use e.g. a butterfly structure. See the sketch after this list.


2. Model parallelism
Different chunks of the model go to different nodes.
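A minimal sketch of what data-parallel gradient averaging computes, simulated in NumPy instead of a real communication library; the worker count, gradient size, and learning rate here are made-up values for illustration:

```python
import numpy as np

# Toy setup: 4 workers, each computing a local gradient on its own data chunk.
# In a real system each worker is a separate process/GPU and the averaging is
# done by a parameter server or an all-reduce collective.
num_workers = 4
dim = 8
rng = np.random.default_rng(0)

local_grads = [rng.normal(size=dim) for _ in range(num_workers)]

# Parameter-server view: gather all gradients, average, broadcast the result.
avg_grad = np.mean(local_grads, axis=0)

# All-reduce view: every worker ends up holding the same averaged gradient.
synced = [avg_grad.copy() for _ in range(num_workers)]

# Every worker applies the identical SGD step, so all model replicas stay in sync.
weights = np.zeros(dim)
lr = 0.1
weights -= lr * synced[0]
print("averaged gradient:", avg_grad)
print("updated weights:  ", weights)
```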

Deep gradient compression 
The basic idea is to reduce the amount of gradient we need to send out.
Surprisingly, only about 0.1% of the gradients really need to be sent out over the network; the major portion can be held locally.

Some of the gradients are very small, not zero but small. So, sort the gradients by magnitude and only send out the top 0.1% largest ones. But just sending 0.1% hurts the prediction accuracy.
If we simply drop the small gradients, accuracy suffers => instead, locally accumulate them over more iterations until they get large enough, then send them out. In this way, accuracy can be recovered.
This is almost equivalent to increasing the batch size over N iterations.
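A minimal sketch of top-k sparsification with local accumulation of the unsent residual, written in NumPy; the tensor size and the exact threshold selection are illustrative assumptions, not the setup from the talk:

```python
import numpy as np

def sparsify_with_residual(grad, residual, keep_ratio=0.001):
    """Send only the largest-magnitude entries; keep the rest accumulated locally.

    grad:       current local gradient
    residual:   locally accumulated gradient that was not sent in earlier steps
    keep_ratio: fraction of entries to transmit (0.001 = top 0.1%)
    """
    acc = residual + grad                       # accumulate before selecting
    k = max(1, int(keep_ratio * acc.size))
    thresh = np.partition(np.abs(acc), -k)[-k]  # magnitude of the k-th largest entry
    mask = np.abs(acc) >= thresh
    sent = np.where(mask, acc, 0.0)             # sparse message sent over the network
    new_residual = np.where(mask, 0.0, acc)     # small entries wait for later rounds
    return sent, new_residual

rng = np.random.default_rng(0)
residual = np.zeros(10_000)
for step in range(5):
    grad = rng.normal(scale=1e-3, size=10_000)
    sent, residual = sparsify_with_residual(grad, residual)
    print(f"step {step}: sent {np.count_nonzero(sent)} of {sent.size} values")
```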

This still suffers a 1.6% loss of accuracy for image classification (CIFAR-10, 92.9% -> 91.3%) and a 3.3% accuracy drop for the language model (speech recognition).

Very large batch sizes (8k or more) cause generalization issues, for reasons that are not well understood.

Further improvement: momentum
Momentum uses a weighted average of the previous gradients and the current gradient, which gives a new vector called the velocity. We should do local accumulation of the velocity rather than local accumulation of the raw gradients.
Images: from -1.6% => -0.3%, but it still did not converge for speech recognition.
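A hedged sketch of accumulating the velocity locally instead of the raw gradient; the momentum coefficient, keep ratio, and tensor size are illustrative assumptions:

```python
import numpy as np

def momentum_corrected_step(grad, velocity, acc, momentum=0.9, keep_ratio=0.001):
    """Accumulate the momentum velocity locally instead of the raw gradient (sketch)."""
    velocity = momentum * velocity + grad       # weighted average of past and current gradients
    acc = acc + velocity                        # accumulate the velocity, not the raw gradient
    k = max(1, int(keep_ratio * acc.size))
    thresh = np.partition(np.abs(acc), -k)[-k]
    mask = np.abs(acc) >= thresh
    sent = np.where(mask, acc, 0.0)             # sparse update sent over the network
    acc = np.where(mask, 0.0, acc)              # transmitted entries are cleared locally
    return sent, velocity, acc

rng = np.random.default_rng(0)
velocity = np.zeros(10_000)
acc = np.zeros(10_000)
for step in range(3):
    grad = rng.normal(scale=1e-3, size=10_000)
    sent, velocity, acc = momentum_corrected_step(grad, velocity, acc)
    print(f"step {step}: sent {np.count_nonzero(sent)} values")
```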

Improvement III

Gradient clipping to prevent gradient explosion.
Change the order from (sparsify, sum, then clip) to (sparsify, clip locally, then sum).
It helps the speech model (LSTM) converge and improves the accuracy drop to -2.0%.
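A sketch of the ordering change: clip each worker's gradient locally before it enters the accumulation/sum, instead of clipping the aggregated gradient. The clipping threshold and its per-node scaling here are assumptions for illustration:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale grad down so its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

num_workers = 4
global_clip = 5.0
# Assumed heuristic: shrink the per-node threshold so the summed gradient stays
# roughly within the original global bound.
local_clip = global_clip / np.sqrt(num_workers)

rng = np.random.default_rng(0)
grads = [rng.normal(scale=3.0, size=1000) for _ in range(num_workers)]

# Before: clip AFTER summation (needs the dense sum, awkward with sparse sends).
clipped_after_sum = clip_by_norm(np.sum(grads, axis=0), global_clip)

# After: clip each local gradient BEFORE accumulation / sparsification, then sum.
clipped_before_sum = np.sum([clip_by_norm(g, local_clip) for g in grads], axis=0)

print("norm, clip after sum: ", np.linalg.norm(clipped_after_sum))
print("norm, clip before sum:", np.linalg.norm(clipped_before_sum))
```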

Improvement IV
Long-tail accumulation: some gradients stay accumulated locally for a long time (~2k iterations) before they are sent, so it is necessary to cut or mask these stale gradients.
The accuracy drop improves from -0.3% to -0.1% for CV and from -2.0% to -0.5% for speech; the mechanism is to periodically cut (mask out) the stale velocity.
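A sketch of what masking the stale velocity could look like: wherever an accumulated entry is sent, the matching momentum entry is zeroed too. The function name, keep ratio, and random inputs are illustrative assumptions:

```python
import numpy as np

def send_and_mask_stale_velocity(velocity, acc, keep_ratio=0.001):
    """Whenever an accumulated entry is sent, zero the matching momentum entry too."""
    k = max(1, int(keep_ratio * acc.size))
    thresh = np.partition(np.abs(acc), -k)[-k]
    mask = np.abs(acc) >= thresh                # entries transmitted this round
    sent = np.where(mask, acc, 0.0)
    acc = np.where(mask, 0.0, acc)              # clear the sent accumulation
    velocity = np.where(mask, 0.0, velocity)    # also drop the now-stale momentum
    return sent, velocity, acc

rng = np.random.default_rng(0)
velocity = rng.normal(size=10_000)
acc = rng.normal(size=10_000)
sent, velocity, acc = send_and_mask_stale_velocity(velocity, acc)
print("velocity entries zeroed:", np.count_nonzero(velocity == 0.0))
```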

Warm-up training
Exponentially increase the sparsity until it reaches 99.9%.
+0.37% for CV and +0.4% for speech.
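A sketch of an exponential warm-up schedule for the sparsity; the starting sparsity and number of warm-up epochs are assumptions, and only the 99.9% endpoint comes from the notes:

```python
def warmup_sparsity(epoch, warmup_epochs=4, start=0.75, final=0.999):
    """Exponentially raise the sparsity during warm-up, then hold it at `final`.

    The density (fraction of gradients sent) shrinks by a roughly constant
    factor each epoch until it reaches the final 0.1%.
    """
    if epoch >= warmup_epochs:
        return final
    start_density, final_density = 1.0 - start, 1.0 - final
    ratio = (final_density / start_density) ** (1.0 / warmup_epochs)
    return 1.0 - start_density * ratio ** epoch

for epoch in range(6):
    print(epoch, f"{warmup_sparsity(epoch):.4f}")
```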