Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training.
Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms.
In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured in the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike.
We study how this relationship varies with the training algorithm, model, and dataset and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality.
Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.
1 Introduction
Neural networks have proven highly effective at solving a wide variety of prediction tasks, including image classification, machine translation, and speech recognition.
Larger models trained on larger datasets are partly responsible for these recent successes and, in general, we expect that models trained on more data will continue to yield improvements in predictive performance hestness2017deep.
Although modern GPUs and custom neural network accelerators let us train state-of-the-art models faster than ever before, training time still limits both the predictive performance of these techniques and how widely they can be applied. For many important problems, the best models are still improving at the end of training because researchers cannot afford to train for more than a few days or weeks at a time. In extreme cases, training must end before completing a single pass over the data (e.g. anil2018large). One way to reduce training time is to increase the rate at which data is processed during training. This can facilitate dramatic improvements in model quality, not only by allowing more data to be processed, but also by decreasing the experiment iteration time and allowing researchers to try new ideas and configurations more rapidly. Faster training also allows neural networks to be deployed in applications where models must be updated frequently, for instance when training data are added or removed regularly and new models must be produced.
Data parallelism offers a straightforward, popular means of accelerating neural network training. For our purposes, data parallelism refers to distributing training examples across multiple processors to compute gradient updates (or higher-order derivative information) and then aggregating these locally computed updates. As long as the training objective decomposes into a sum over training examples, data parallelism is model agnostic and applicable to any neural network architecture. In contrast, the maximum degree of model parallelism (distributing parameters and computation across different processors for the same training examples) depends on the model size and structure. Although data parallelism can be simple to implement, ultimately, large scale systems should consider all types of parallelism at their disposal. In this work, we focus on the costs and benefits of data parallelism in the synchronous training setting.
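To make the aggregation step concrete, the following minimal sketch (illustrative only, not the implementation used in our experiments) splits one batch across several simulated workers, aggregates their locally computed gradient sums, and checks that the result matches the gradient a single process would compute, since the objective decomposes into a sum over training examples:

```python
import numpy as np

# Minimal sketch of synchronous data-parallel gradient computation: each simulated
# worker computes the gradient sum for its shard of the batch, the shard sums are
# aggregated, and a single update follows. The linear least-squares loss is purely
# illustrative.

def local_grad_sum(theta, x_shard, y_shard):
    # Sum over this shard of per-example gradients of 0.5 * (x . theta - y)^2.
    residual = x_shard @ theta - y_shard
    return x_shard.T @ residual

rng = np.random.default_rng(0)
x, y = rng.normal(size=(64, 10)), rng.normal(size=64)
theta = np.zeros(10)

num_workers = 4
shards = np.array_split(np.arange(64), num_workers)   # one batch split across workers
grad = sum(local_grad_sum(theta, x[s], y[s]) for s in shards) / 64

# Because the objective is a sum over examples, this matches the single-process gradient.
assert np.allclose(grad, x.T @ (x @ theta - y) / 64)
theta -= 0.1 * grad                                    # one synchronous SGD step
```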
Hardware for training neural networks is trending towards ever-increasing capacity for data parallelism. Specialized systems using GPUs or custom ASICs (e.g. jouppi2017datacenter) combined with high-performance interconnect technology are unlocking unprecedented scales of data parallelism where the costs and benefits have not yet been well studied. On the one hand, if data parallelism can provide a significant speedup at the limits of today’s systems, we should build much bigger systems. On the other hand, if additional data parallelism comes with minimal benefits or significant costs, we might consider designing systems to maximize serial execution speed, exploit other types of parallelism, or even prioritize separate design goals such as power use or cost.
There is considerable debate in the literature about the costs and benefits of data parallelism in neural network training and several papers take seemingly contradictory positions. Some authors contend that large-scale data parallelism is harmful in a variety of ways, while others contend that it is beneficial. The range of conjectures, suggestive empirical results, and folk knowledge seems to cover most of the available hypothesis space. Answering these questions definitively has only recently become important (as increasing amounts of data parallelism have become practical) so it is perhaps unsurprising that the literature remains equivocal, especially in the absence of sufficiently comprehensive experimental data.
In this work, we attempt to provide the most rigorous and extensive experimental study on the effects of data parallelism on neural network training to date. In order to achieve this goal, we consider realistic workloads up to the current limits of data parallelism. We try to avoid making assumptions about how the optimal metaparameters vary as a function of batch size. Finally, in order to guide future work, we consider any remaining limitations in our methodology, and we discuss what we see as the most interesting unanswered questions that arise from our experiments.
1.1 Scope
We restrict our attention to variants of mini-batch stochastic gradient descent (SGD), which are the dominant algorithms for training neural networks. These algorithms iteratively update the model’s parameters in the direction opposite an estimate of the gradient of the training objective. The gradient is estimated at each step using a different subset, or batch, of training examples. See Section 2.2 for a more detailed description of these algorithms. A data-parallel implementation computes gradients for different training examples in each batch in parallel, and so, in the context of mini-batch SGD and its variants, we equate the batch size with the amount of data parallelism. (Mini-batch SGD can be implemented in a variety of ways, including data-serially, but a data-parallel implementation is always possible in principle.) We restrict our attention to synchronous SGD because of its popularity and advantages over asynchronous SGD chen2016revisiting.
Practitioners are primarily concerned with out-of-sample error (see Section 2.1) and the cost they pay to achieve that error. Cost can be measured in a variety of ways, including training time and hardware costs. Training time can be decomposed into number of steps multiplied by average time per step, and hardware cost into number of steps multiplied by average hardware cost per step. The average time and hardware costs depend on the practitioner’s hardware model, but the number of training steps is hardware-agnostic and can be used to compute the total costs for any hardware model given its average per-step costs. Furthermore, for an idealized data-parallel system, the wall time is conveniently proportional to the number of steps. Therefore, we focus on number of training steps as our main unit of measurement for training cost.
An alternative hardware-agnostic measure of training cost is the number of training examples processed, or equivalently the number of passes (epochs) over the training dataset. This measure is common in the literature, and describes the case where the per-step costs are proportional to the number of examples processed. However, the time cost per step, at least, typically grows sub-linearly with the batch size in a data-parallel implementation. In fact, in systems such as TPU pods (accelerators designed for machine learning workloads; see https://www.blog.google/products/google-cloud/google-cloud-offer-tpus-machine-learning/), there can be a range of batch sizes for which the time per step is almost constant. Under a realistic data-parallel hardware model, a neural network that trains in fewer steps with a larger batch size incurs a lower time cost, even if it processes more epochs of training data. Indeed, we will point out cases where enforcing a budget on training epochs in prior work may have painted an incomplete picture of the costs and benefits of data parallelism.
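As a toy illustration of this cost accounting (the per-step times below are invented stand-ins, not measurements of any real system), consider an idealized data-parallel hardware model with near-constant time per step alongside a serial model whose per-step time is proportional to the number of examples processed:

```python
# Toy illustration of the cost accounting above: total training time equals the number
# of steps times the average time per step, which depends on the hardware model.

def wall_time(steps, batch_size, time_per_step):
    return steps * time_per_step(batch_size)

ideal_parallel = lambda b: 0.5            # time per step ~constant over a range of batch sizes
serial = lambda b: 0.5 * b / 64           # time per step proportional to examples processed

# A run needing half as many steps at twice the batch size (and hence the same number of
# epochs) is a wall-time win on the idealized data-parallel system, but not on the serial one.
print(wall_time(10_000, 64, ideal_parallel), wall_time(5_000, 128, ideal_parallel))
print(wall_time(10_000, 64, serial), wall_time(5_000, 128, serial))
```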
In light of practitioners’ primary concerns of out-of-sample error and the resources needed to achieve it, we believe the following questions are the most important to study in order to understand the costs and benefits of data parallelism with mini-batch SGD and its variants:
- What is the relationship between batch size and number of training steps to reach a goal out-of-sample error?
- What governs this relationship?
- Do large batch sizes incur a cost in out-of-sample error?
1.2 Contributions of this work
- We show that the relationship between batch size and number of training steps to reach a goal out-of-sample error has the same characteristic form across six different families of neural network, three training algorithms, and seven datasets. Specifically, for each workload (model, training algorithm, and dataset), increasing the batch size initially decreases the required number of training steps proportionally, but eventually there are diminishing returns until finally increasing the batch size no longer changes the required number of training steps. To the best of our knowledge, we are the first to experimentally validate this relationship across models, training algorithms, and datasets while independently tuning the learning rate, momentum, and learning rate schedule (where applicable) for each batch size. Unlike prior work that made strong assumptions about these metaparameters, our results reveal a universal relationship that holds across all workloads we considered, across different error goals, and when considering either training error or out-of-sample error.
- We show that the maximum useful batch size varies significantly between workloads and depends on properties of the model, training algorithm, and dataset. Specifically, we show that:
- SGD with momentum (as well as Nesterov momentum) can make use of much larger batch sizes than plain SGD, suggesting future work to study the batch size scaling properties of other algorithms.
- Some models allow training to scale to much larger batch sizes than others. We include experimental data on the relationship between various model properties and the maximum useful batch size, demonstrating that the relationship is not as simple as one might hope from previous work (e.g. wider models do not always scale better to larger batch sizes).
- The effect of the dataset on the maximum useful batch size tends to be smaller than the effects of the model and training algorithm, but this effect does not depend on dataset size in a consistent way.
- The optimal values of training metaparameters, such as the learning rate, do not consistently follow any simple relationships with the batch size, despite the popularity of various heuristics to adjust them. Previously suggested learning rate heuristics do not hold across problems or across all batch sizes. Assuming a simple heuristic – such as linearly scaling the learning rate with the batch size – may result in worse solutions or divergent training for batch sizes sufficiently far from the base batch size.
- Finally, by reviewing the specifics of the experimental protocols used in prior work, we at least partially reconcile conflicting stances in the literature on whether increasing the batch size degrades model quality. Specifically, we show that assumptions about computational budgets and the procedures for selecting metaparameters at different batch sizes can explain many of the disagreements in the literature. We find no evidence that increasing the batch size necessarily degrades model quality, but additional regularization techniques may become important at larger batch sizes.
2 Setup and background
2.1 Learning
Throughout this paper, a data distribution is a probability distribution $\mathcal{D}$ over a data domain $\mathcal{Z}$. For example, we might consider a supervised learning task over a domain $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the set of 32-by-32-pixel color images and $\mathcal{Y}$ the possible labels denoting what appears in the image. A training set is a collection of examples from the data domain, conventionally assumed to be drawn i.i.d. from the data distribution $\mathcal{D}$.
A machine learning model is a function that, given parameters $\theta$ from some set $\Theta$, and given a data point $z \in \mathcal{Z}$, produces a prediction whose quality is measured by a differentiable non-negative scalar-valued loss function. (Technically, the loss need only be sub-differentiable, and extending our setup to this end is straightforward.) We denote by $\ell(\theta; z)$ the loss of a prediction made by the model, under parameters $\theta$, on the data point $z$. We denote by $L$ the out-of-sample loss or expected loss:

$$L(\theta) = \mathbb{E}_{z \sim \mathcal{D}}\left[\ell(\theta; z)\right] \qquad (1)$$

and by $\hat{L}$ the empirical average loss under a dataset $S$:

$$\hat{L}(\theta) = \frac{1}{|S|} \sum_{z \in S} \ell(\theta; z). \qquad (2)$$
When $S$ is the training set, we call $\hat{L}$ the average training loss. We will say that the data source $\mathcal{D}$, loss $\ell$, and model with parameter set $\Theta$ together specify a learning task, in which our aim is to find parameters $\theta$ that achieve low out-of-sample loss (1), while given access only to training examples. A common approach is to find parameters of low average training loss (2) as an estimate of the out-of-sample loss shalev2014understanding.
When minimizing the average training loss $\hat{L}$, it is common to add regularization penalties to the objective function. For a differentiable penalty $R(\theta)$ and regularization weight $\lambda \geq 0$, the training objective might be:

$$J(\theta) = \frac{1}{|S|} \sum_{z \in S} \ell(\theta; z) + \lambda R(\theta), \qquad (3)$$

where $S$ is the training set.
In practice, we often approach a task by replacing its loss with another that is more amenable to training. For instance, in supervised classification, we might be tasked with learning under the 0/1 loss, which is an indicator of whether a prediction is correct (e.g. matches a ground-truth label), but we train by considering instead a surrogate loss (e.g. the logistic loss) that is more amenable to continuous optimization. When the surrogate loss bounds the original, achieving low loss under the surrogate implies low loss under the original. To distinguish the two, we say error to describe the original loss (e.g. 0/1), and we save loss to refer to the surrogate used in training.
2.2 Algorithms
The dominant algorithms for training neural networks are based on mini-batch stochastic gradient descent robbins1951stochastic, kiefer1952stochastic, rumelhart1986learning, bottou2008tradeoffs, lecun2015deep. Given an initial point $\theta_0 \in \Theta$, mini-batch SGD attempts to decrease the objective $J$ via the sequence of iterates

$$\theta_{t+1} = \theta_t - \eta_t \hat{\nabla} J(\theta_t; B_t),$$

where each $B_t$ is a random subset of training examples, the sequence of positive scalars $\{\eta_t\}$ is called the learning rate, and where, for any parameters $\theta$ and batch $B$,

$$\hat{\nabla} J(\theta; B) = \frac{1}{|B|} \sum_{z \in B} \nabla \ell(\theta; z) + \lambda \nabla R(\theta). \qquad (4)$$

(In experiments, we pick any of the iterates $\theta_t$ for which we estimate that the out-of-sample loss is low according to a validation dataset.)
When the examples in $B$ are a uniformly random subset of training examples, $\hat{\nabla} J(\theta; B)$ forms an unbiased estimate of the gradient of the objective that we call a stochastic gradient. In our larger-scale experiments, when we sample subsequent batches $B_t$, we actually follow the common practice of cycling through permutations of the training set shamir2016without.
Variants of SGD commonly used with neural networks include SGD with momentum polyak1964some, rumelhart1986learning, SutskeverEtAl_icml2013, Nesterov momentum nesterov1983method, SutskeverEtAl_icml2013, RMSProp hinton2012rmsprop, and Adam kingma2014adam. All of these optimization procedures, or optimizers, interact with the training examples only by repeatedly estimating stochastic gradients (4), so they support the same notion of batch size that we equate with the scale of data parallelism.
In this work, we focus on SGD, SGD with momentum, and Nesterov momentum as optimizers. The latter two optimizers are configured by a learning rate $\eta_t$ and a scalar $\gamma$ that we call momentum. (These iteration rules take slightly different forms across the literature and across library implementations; here we present and use the update rules used by the MomentumOptimizer class in TensorFlow abadi2016tensorflow.) They define the iterates:

SGD with momentum:
$$v_{t+1} = \gamma v_t + \hat{\nabla} J(\theta_t; B_t), \qquad \theta_{t+1} = \theta_t - \eta_t v_{t+1}$$

Nesterov momentum:
$$v_{t+1} = \gamma v_t + \hat{\nabla} J(\theta_t; B_t), \qquad \theta_{t+1} = \theta_t - \eta_t \left( \gamma v_{t+1} + \hat{\nabla} J(\theta_t; B_t) \right)$$

given $\gamma$ and an initial $v_0 = 0$. Note that plain SGD can be recovered from either optimizer by taking $\gamma = 0$. The outcome of using these optimizers should therefore be no worse if, in any experiment, the momentum is tuned across values including zero.
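The following sketch mirrors the update rules written above (it is illustrative and not tied to any particular library's internals) and shows that setting the momentum to zero recovers the plain SGD update:

```python
import numpy as np

def momentum_step(theta, v, grad, eta, gamma, nesterov=False):
    """One parameter update following the (Nesterov) momentum rules written above."""
    v = gamma * v + grad                          # accumulate the velocity
    if nesterov:
        theta = theta - eta * (gamma * v + grad)
    else:
        theta = theta - eta * v
    return theta, v

# With gamma = 0, both variants reduce to plain SGD: theta <- theta - eta * grad.
theta, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, 0.5])
print(momentum_step(theta, v, grad, eta=0.1, gamma=0.0)[0])                 # [0.95, -2.05]
print(momentum_step(theta, v, grad, eta=0.1, gamma=0.0, nesterov=True)[0])  # [0.95, -2.05]
```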
Suppose we run SGD with momentum under a constant learning rate, i.e. take $\eta_t = \eta$ for all $t$. Unrolling the updates from the initial point $\theta_0$ (with $v_0 = 0$), at a given iteration $t$ the algorithm computes:

$$\theta_t = \theta_0 - \eta \sum_{s=0}^{t-1} \left( \sum_{k=s}^{t-1} \gamma^{k-s} \right) \hat{\nabla} J(\theta_s; B_s).$$

For any fixed $s$, the coefficient accompanying the stochastic gradient $\hat{\nabla} J(\theta_s; B_s)$ in the above update is $\eta \sum_{k=s}^{t-1} \gamma^{k-s} = \eta \, \frac{1 - \gamma^{t-s}}{1 - \gamma}$. We define the effective learning rate $\eta_{\text{eff}}$ as the value of this coefficient at the end of training ($t = T$), in the limit of a large number of training steps ($T \to \infty$, while $s$ is held fixed):

$$\eta_{\text{eff}} = \frac{\eta}{1 - \gamma}.$$

Put intuitively, $\eta_{\text{eff}}$ captures the contribution of a given mini-batch gradient to the parameter values at the end of training.
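As a quick numerical illustration of this limit, the snippet below sums the geometric series of coefficients for a long run and compares it to $\eta / (1 - \gamma)$:

```python
# Quick numerical check that the coefficient multiplying an early mini-batch gradient
# approaches eta / (1 - gamma) as training runs for many steps.
eta, gamma, total_steps = 0.1, 0.9, 1_000

coefficient = eta * sum(gamma ** k for k in range(total_steps))  # eta * (1 - gamma^T) / (1 - gamma)
print(coefficient)            # ~= 1.0
print(eta / (1.0 - gamma))    # effective learning rate eta_eff = 1.0
```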
2.3 Additional terminology in experiments
When we refer to a data-parallel implementation of SGD, we mean one that computes the summands of (4) in parallel, and then synchronizes to coordinate their summation.
The models and algorithms in our experiments are modifiable by what we call metaparameters. (These are sometimes called “hyperparameters” elsewhere; we prefer a different name so as not to clash with the notion of a hyperparameter in Bayesian statistics.) These include architectural choices, such as the number of layers in a neural network, and training parameters, such as learning rates and regularization weights. When we use the term model, we typically assume that all architectural metaparameters have been set. In our experiments, we tune the training metaparameters by selecting the values that yield the best performance on a validation set. We use the term workload to jointly refer to a dataset, model, and training algorithm.
3 Related work
3.1 Steps to reach a desired out-of-sample error
Convergence upper bounds from the theory of stochastic (convex) optimization can be specialized to involve terms dependent on batch size, so in this sense they comprise basic related work. These upper bounds arise from worst-case analysis, and moreover make convexity and regularity assumptions that are technically violated in neural network training, so whether they predict the actual observed behavior of our experimental workloads is an empirical question in its own right.
Given a sequence of examples drawn i.i.d. from a data source, an upper bound on the performance of SGD applied to Lipschitz convex losses is hazan2016oco, shalev2014understanding:

$$\mathbb{E}\left[ F(\hat{\theta}_T) \right] - F(\theta^*) \leq O\!\left( \frac{1}{\sqrt{T}} \right) \qquad (5)$$

for any batch size, with constants that depend on the Lipschitz constant suppressed. Here, $F$ is our objective function, $F(\theta^*)$ is its value at the global optimum, and $\hat{\theta}_T$ denotes the final output of the algorithm supposing it took $T$ iterations. (This is not necessarily the $T$-th iterate $\theta_T$, which may differ from $\hat{\theta}_T$ if the algorithm averages its iterates.) Meanwhile, when losses are convex and the objective is smooth, accelerated parallel mini-batch SGD enjoys the bound lan2012optimal:

$$\mathbb{E}\left[ F(\hat{\theta}_T) \right] - F(\theta^*) \leq O\!\left( \frac{1}{T^2} + \frac{1}{\sqrt{bT}} \right) \qquad (6)$$

where $b$ is the batch size and the suppressed constants depend on the smoothness constant and on a bound on the variance of the stochastic gradients.
Compared to sequential processing without batching (i.e. a batch size of one), the bounds (5) and (6) offer two extremes, respectively:
- No benefit: Increasing the batch size does not change the number of steps to convergence, as per (5).
- A $b$-fold benefit: The term in (6) proportional to $1/\sqrt{bT}$ dominates the bound. Increasing the batch size by a multiplicative factor decreases the number of steps to a given suboptimality by the same factor.
In other words, under these simplifications, batching cannot hurt the asymptotic guarantees of steps to convergence, but it could be wasteful of examples. The two extremes imply radically different guidance for practitioners, so the critical task of establishing a relationship between batch size and number of training steps remains one to resolve experimentally, even having consulted known theoretical results.
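To make the two extremes concrete, the following illustrative snippet treats the bounds (5) and (6) as if they were tight, with constants suppressed, and solves for the number of steps needed to reach a fixed suboptimality at several batch sizes:

```python
import math

# Illustrative only: treat the upper bounds (5) and (6) as if they were tight (constants
# suppressed) and solve for the steps T needed to reach suboptimality eps at batch size b.

def steps_bound_5(eps, b):
    # 1 / sqrt(T) <= eps  =>  T ~ 1 / eps^2, independent of the batch size.
    return math.ceil(1.0 / eps ** 2)

def steps_bound_6(eps, b):
    # Assume the 1 / sqrt(bT) term dominates: 1 / sqrt(bT) <= eps  =>  T ~ 1 / (b * eps^2).
    return math.ceil(1.0 / (b * eps ** 2))

for b in (1, 2, 4, 8):
    print(b, steps_bound_5(0.01, b), steps_bound_6(0.01, b))
# Bound (5): steps unchanged as b grows (no benefit).
# Bound (6): steps fall in proportion to b (a b-fold benefit).
```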
A few recent papers propose analytical notions of a critical batch size: a point at which a transition occurs from a $b$-fold benefit to no benefit. Under assumptions including convexity, ma2017power derive such a critical batch size, and argue that a batch size of one is optimal for minimizing the number of training epochs required to reach a given target error. Under different assumptions, yin2017gradient establish a critical batch size and a pathological loss function that together exhibit a transition from a $b$-fold benefit to no benefit. Although they experiment with neural networks, their experiments are designed to investigate the effect of data redundancy and they do not provide enough information to reveal the empirical relationship between batch size and number of training steps. Focusing on linear least-squares regression, jain2018parallelizing also derive a threshold batch size, here in terms of (i) the operator norm of the objective’s Hessian and (ii) a constant from a fourth-moment bound on example inputs.
To our knowledge, in all previous work that aims to analytically characterize a critical batch size, the thresholds defined are either (i) parameter-dependent, or (ii) specific to linear least-squares regression. A critical batch size that depends on model parameters can change over the course of optimization; it is not a problem-wide threshold that can be estimated efficiently a priori. Focusing on least-squares has issues as well: while it sheds intuitive light on how batching affects stochastic optimization locally, the quantities defined inherently cannot generalize to the non-linear optimization setting of neural network training, both because the objective’s Hessian is not constant across the space of parameters as it is in a quadratic problem, and more broadly because it is unclear whether the Hessian of the objective is still the correct analogue to consider.
wilson2003general present a relevant empirical study on the relationship between batch size and training speed for neural networks. They found that a fully connected neural network with a single hidden layer took more epochs to converge with larger batch sizes when trained with plain SGD. They also found that using a batch size equal to the size of the training set took more epochs to converge than a batch size of one on several small datasets. However, their measurement protocol and assumptions limit the conclusions we can draw from their results. One issue is that training time was measured to different out-of-sample errors for different batch sizes on the same dataset. To compare training speed fairly, the error goal should be fixed across all training runs being compared. Additionally, only four learning rates were tried for each dataset, but quite often the best learning rate was at one of the two extremes, and it appeared that a better learning rate might be found outside of the four possibilities allowed. Finally, despite the conclusions of the authors, their results do not necessarily imply slower training with larger batch sizes in a data-parallel implementation of mini-batch SGD: for the most part, their larger batch size experiments took fewer training steps than the corresponding batch size one experiments.
In the last few years, increasingly specialized computing systems have spurred practitioners to try much larger batch sizes than ever before, while increasingly promising results have driven hardware designers to create systems capable of even more data parallelism. chen2016revisiting used a pool of synchronized worker machines to increase the effective batch size of mini-batch SGD. They demonstrated speedups in both wall time and steps to convergence for an Inception model szegedy2016rethinking on ImageNet russakovsky2015imagenet by scaling the effective batch size from 1,600 to 6,400. More recently, goyal2017accurate showed that the number of training epochs could be held constant across a range of batch sizes to achieve the same validation error for ResNet-50 he2016deep on ImageNet. Holding the number of training epochs constant is equivalent to scaling the number of training steps inversely with the batch size, and this reduction in training steps with increasing batch size produced near-linear wall time speedups on their hardware. Although this hints at a $b$-fold benefit regime in which increasing the batch size reduces the number of training steps by the same factor, the authors did not attempt to minimize the number of training steps (or epochs) required to reach the goal at each batch size separately. It is unclear whether any of the batch sizes that achieved the goal could do so in fewer steps than given, or how many steps the other batch sizes would have needed to achieve the same error goal.
Two studies performed concurrently with this work also investigate the relationship between batch size and training speed for neural networks. chen2018effect provide experimental evidence of a problem-dependent critical batch size after which a $b$-fold benefit is no longer achieved for plain mini-batch SGD. They contend that wider and shallower networks have larger critical batch sizes, and while their empirical results were equivocal for this particular claim, they show that the threshold batch size can depend on aspects of both the dataset and the model. Additionally, the anonymous authors of an ICLR 2019 submission (anon2019computational, under review at the time of writing) study how three previously proposed heuristics for adjusting the learning rate as a function of batch size (linear scaling, square root scaling, and no scaling) affect the number of training steps required to reach a particular result. They find that if the learning rate is tuned for the smallest batch size only, all three of these common scaling techniques break down for larger batch sizes and result in either (i) divergent training, or (ii) training that cannot reach the same error goal within a fixed number of training epochs. They also describe a basic relationship between batch size and training steps to a fixed error goal, which comprises three regions: a $b$-fold benefit initially, then diminishing returns, and finally no benefit for all batch sizes greater than a maximum useful batch size. However, at least at the time of writing, their results are inconclusive because (i) not all model-dataset pairs exhibit this basic relationship, (ii) it does not appear consistently across error goals, and (iii) the relationship is primarily evident in training error but not out-of-sample error. These inconsistent results may be due to suboptimal pre-determined learning rates arising from the scaling rules, especially at larger batch sizes. Finally, they also find that the maximum useful batch size depends on aspects of the model and the dataset type, but not on the dataset size. Since all their experiments use plain mini-batch SGD, their results are unable to reveal any effects from the choice of optimizer and might not generalize to other popular optimizers, such as SGD with momentum.
3.2 Solution quality
The literature contains some seemingly conflicting claims regarding the effect of batch size on solution quality (out-of-sample error at the conclusion of training). Primarily, the debate centers on whether increasing the batch size incurs a cost in solution quality. keskar2016large argue that “large batch” training converges to so-called “sharp” minima with worse generalization properties. (The term “large batch” is inherently ambiguous, and in this case accompanies experiments in keskar2016large that only compare two absolute batch sizes per dataset, rather than charting out a curve to its apparent extremes.) However, DinhPBB17 show that a minimum with favorable generalization properties can be made, through reparameterization, arbitrarily sharp in the same sense. lecun-98x suggest that a batch size of one can result in better solutions because the noisier updates allow for the possibility of escaping from local minima in a descent algorithm. However, they also note that we usually stop training long before reaching any sort of critical point. hoffer2017train argue that increasing the batch size need not degrade out-of-sample error at all, assuming training has gone on long enough. goyal2017accurate, among others, tested batch sizes larger than those used in keskar2016large without noticing any reduction in solution quality. Still, their results with yet larger batch sizes do not rule out the existence of a more sudden degradation once the batch size is large enough. Meanwhile, GoodfellowEtAlBook2016 state that small batches can provide a regularization effect such that they result in the best observed out-of-sample error, although in this case other regularization techniques might serve equally well.
Alas, the best possible out-of-sample error for a particular model and dataset cannot be measured unconditionally due to practical limits on wall time and hardware resources, as well as practical limits on our ability to tune optimization metaparameters (e.g. the learning rate). An empirical study can only hope to measure solution quality subject to the budgets allowed for each model experiment, potentially with caveats due to limitations of the specific procedures for selecting the metaparameters. To the best of our knowledge, all published results handle the training budget issue in exactly one of three ways: by ignoring budgets (training to convergence, which is not always possible); by using a step budget (restricting the number of gradient descent updates performed); or by using an epoch budget (restricting the number of training examples processed). As discussed further in Section 4.8, training to convergence is not always practical: we find that millions of training steps for small batch sizes, or thousands of epochs for large batch sizes, are required to saturate performance even for a dataset as small and simple as MNIST, which in our experiments corresponded to more than 25 hours of wall time for each metaparameter configuration. There are, of course, budgets in between an epoch budget and a step budget that might allow the possibility of trading off time, computation, and/or solution quality; for example, it may be possible to trade the total number of gradient computations for faster training time to reach the same quality solution, but we are not aware of work that emphasizes such budgets. Furthermore, while some published results tune the learning rate anew for each batch size, others tune for only a single batch size and use a preordained heuristic to set the learning rate for the remaining batch sizes (the most common heuristics are constant, square root, and linear learning rate scaling rules). Tuning metaparameters at a single batch size and then heuristically adjusting them for others could clearly create a systematic advantage for trials at batch sizes near the one tuned. All in all, the conclusions we can draw from previous studies depend on the budgets they assume and on how they select metaparameters across batch sizes. The following subsections review their experimental procedures with this in mind.
3.2.1 Studies that ignore budgets
Training without a budget means using manual inspection or a heuristic to determine the stopping time, typically when the model is considered to have converged. None of the studies we mention below, in this category, tuned learning rates or other optimization metaparameters separately for different batch sizes, at least for the experiments they performed relevant to solution quality.
keskar2016large trained several neural network architectures on MNIST and CIFAR-10, each with two batch sizes, and found that the larger batch size consistently achieved worse out-of-sample error after training error had ceased to improve. However, all models used batch normalization ioffe2015batch and presumably computed the batch normalization statistics using the full batch size. For a fair comparison between batch sizes, batch normalization statistics should be computed over the same number of examples or else the training objective differs between batch sizes goyal2017accurate. Indeed, hoffer2017train found that computing batch normalization statistics over larger batches can degrade solution quality, which suggests an alternative explanation for the result of keskar2016large. Moreover, keskar2016large reported that data augmentation eliminated the difference in solution quality between small and large batch experiments.
chen2016revisiting and smith2018bayesian trained neural networks with two different batch sizes each. chen2016revisiting observed no difference in solution quality when scaling the batch size from 1,600 to 6,400 for an Inception model on ImageNet. smith2018bayesian trained a small neural network on just 1,000 examples sampled from MNIST, and observed that the larger batch size overfit more than the small batch size resulting in worse out-of-sample error. However, this gap was mitigated by applying L2 regularization.
3.2.2 Studies with step budgets
hoffer2017train trained neural networks with two different batch sizes on several image datasets. They found that, by computing batch normalization statistics over a fixed number of examples per iteration (“ghost batch normalization”), and by scaling the learning rate with the square root of the batch size instead of some other heuristic, the solution quality arising from the larger batch size was as good as or better than the smaller batch size. However, the largest batch size they used was still moderate, which does not rule out an effect appearing at still larger batch sizes, as suggested by goyal2017accurate. Moreover, it remains open whether their proposed learning rate heuristic extends to arbitrarily large batch sizes, or whether it eventually breaks down for batch sizes sufficiently far from the base batch size.
3.2.3 Studies with epoch budgets
An epoch budget corresponds to fixing the total number of per-example gradient computations, but, in an idealized data-parallel implementation of SGD, it also corresponds to a step (or even wall time) budget that scales inversely with the batch size. With an epoch budget, a larger batch size can only achieve the same solution quality as a smaller batch size if it achieves perfect scaling efficiency (a $b$-fold reduction in steps from a $b$-fold increase in batch size, as described in Section 3.1).
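A small worked example (using the ImageNet training set size from Table 1 and a 90 epoch budget, as in the study discussed next) shows how quickly a fixed epoch budget shrinks the implied step budget as the batch size grows:

```python
# Worked example of how a fixed epoch budget shrinks the implied step budget as the
# batch size grows (idealized; ignores partial batches).
dataset_size, epoch_budget = 1_281_167, 90

for batch_size in (256, 1024, 8192):
    step_budget = epoch_budget * dataset_size // batch_size
    print(batch_size, step_budget)
# Doubling the batch size halves the allowed steps, so matching a smaller batch size's
# solution quality requires perfect scaling efficiency.
```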
masters2018revisiting show that after a critical batch size depending on the model and dataset, solution quality degrades with increasing batch size when using a fixed epoch budget. Their results effectively show a limited region of $b$-fold benefit for those model-dataset pairs when trained with SGD, although they did not investigate whether this critical batch size depended on the optimizer used, and they did not consider more than one epoch budget for each problem. We reproduced a subset of their experiments and discuss them in Section 5.
goyal2017accurate recently popularized a linear learning rate scaling heuristic for training the ResNet-50 model using different batch sizes. Using this heuristic, a 90 epoch budget, and SGD with momentum without adjusting or tuning the momentum, they increased the batch size from 256 to 8,192 with no loss in accuracy. However, their learning rate heuristic broke down for even larger batch sizes. Inspired by these results, a sequence of follow-up studies applied additional techniques to further increase the batch size while still achieving the same accuracy and using the same 90 epoch budget. These follow-up studies CodreanuEtAl2017, you2017imagenet, AkibaEtAl2017 confirm that the best solution quality for a given batch size will also depend on the exact optimization techniques used.
There are several additional papers lin2018don, devarakonda2017adabatch, anon2019computational with experiments relevant to solution quality that use an epoch budget, tune the learning rate for the smallest batch size, and then use a heuristic to choose the learning rate for all larger batch sizes. For instance, devarakonda2017adabatch and lin2018don used linear learning rate scaling, and anon2019computational tried constant, square root, and linear learning rate scaling heuristics. All of them conclude that, under a fixed epoch budget, small batch sizes achieve better solution quality than large batch sizes, for various notions of “small” and “large.” This could just as easily be an artifact of the learning rate heuristics, and a possible alternative conclusion is that these heuristics are limited (as heuristics can often be).
4 Experiments and results
The primary quantity we measure is the number of steps needed to first reach a desired out-of-sample error, or steps to result. To measure steps to result, we used seven image and text datasets with training set sizes ranging from 45,000 to 26 billion examples. Table 1 summarizes these datasets and Appendix A provides the full details. We chose six families of neural network to train on these datasets. For MNIST and Fashion MNIST, we chose a simple fully connected neural network and a simple convolutional neural network (CNN). For CIFAR-10, we chose the ResNet-8 model without batch normalization, partly to compare our results to masters2018revisiting, and partly to have a version of ResNet without batch normalization. For ImageNet, we chose ResNet-50, which uses batch normalization and residual connections, and VGG-11, which uses neither. For Open Images, we chose ResNet-50. For LM1B, we chose the Transformer model and an LSTM model. For Common Crawl, we chose the Transformer model. Table 2 summarizes these models and Appendix B provides the full details.
Dataset | Type | Task | Training set size | Evaluation Metric |
MNIST | Image | Classification | 55,000 | Classification error |
Fashion MNIST | Image | Classification | 55,000 | Classification error |
CIFAR-10 | Image | Classification | 45,000 | Classification error |
ImageNet | Image | Classification | 1,281,167 | Classification error |
Open Images | Image | Classification (multi-label) | 4,526,492 | Average precision |
LM1B | Text | Language modeling | 30,301,028 sentences | Cross entropy error |
Common Crawl | Text | Language modeling | ~26 billion sentences | Cross entropy error |
Model Class | Sizes | Optimizers | Datasets | Learning rate schedule |
Fully Connected | Various | SGD | MNIST | Constant |
Simple CNN | Base, Narrow, Wide | SGD, Momentum, Nesterov momentum | MNIST, Fashion MNIST | Constant |
ResNet | ResNet-8 | SGD, Nesterov momentum | CIFAR-10 | Linear decay |
ResNet | ResNet-50 | Nesterov momentum | ImageNet, Open Images | Linear decay |
VGG | VGG-11 | Nesterov momentum | ImageNet | Linear decay |
Transformer | Base, Narrow and shallow, Shallow, Wide | SGD, Momentum, Nesterov momentum | LM1B, Common Crawl | Constant |
LSTM | — | Nesterov momentum | LM1B | Constant |
Measuring steps to result requires a particular value of out-of-sample error to be chosen as the goal. Ideally, for each task and model, we would select the best achievable error, but since validation error is noisy, the best error is sometimes obtained unreliably. Moreover, for some tasks, the validation error continues to improve steadily beyond the maximum practical training time. Therefore, we generally tried to select the best validation error that we could achieve reliably within a practical training time.
Table 2 shows the learning rate schedule we used for each model and dataset. Learning rate schedules are often used to accelerate neural network training, but finding the best schedule is an optimization problem in its own right wu2018understanding. Instead, researchers typically choose from a range of common learning rate functions based on validation performance and individual preference. These functions range from piecewise constants to cosine functions loshchilov2016sgdr. While most schedules decay the learning rate monotonically over training, some researchers also “warm up” the learning rate at the start of training (e.g. he2016deep), particularly when training with large batch sizes goyal2017accurate. We ran experiments with both constant learning rates and with learning rate decay. We used decay for ResNet-8, ResNet-50, and VGG-11, which significantly reduced training time for those models. We selected our decay function by running an extensive set of experiments with ResNet-50 on ImageNet (see Appendix C for details). We chose linear decay because it performed at least as well as all other schedules we tried, while also being the simplest and requiring only two additional metaparameters. In experiments that used linear decay, we specified metaparameters $(\alpha, T_{\text{decay}})$ such that the learning rate decayed linearly from the initial learning rate $\eta_0$ to a final value $\alpha \eta_0$, reached after $T_{\text{decay}}$ steps. That is, the learning rate at step $t$ is given by

$$\eta_t = \eta_0 - (1 - \alpha)\, \eta_0 \cdot \frac{\min(t, T_{\text{decay}})}{T_{\text{decay}}}.$$
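A minimal sketch of this schedule under the parameterization just described (initial learning rate $\eta_0$, decay factor $\alpha$, and decay steps $T_{\text{decay}}$):

```python
# Minimal sketch of the linear decay schedule as parameterized above: decay from eta_0
# to alpha * eta_0 over t_decay steps, then hold constant.

def linear_decay_learning_rate(step, eta_0, alpha, t_decay):
    progress = min(step, t_decay) / t_decay
    return eta_0 - (1.0 - alpha) * eta_0 * progress

print(linear_decay_learning_rate(0, 1.0, 0.01, 10_000))       # 1.0
print(linear_decay_learning_rate(5_000, 1.0, 0.01, 10_000))   # 0.505
print(linear_decay_learning_rate(50_000, 1.0, 0.01, 10_000))  # 0.01
```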
Steps to result depends on the training metaparameters, and, for a given task and model, each batch size might have a different metaparameter configuration that minimizes steps to result. In all experiments, we independently tuned the metaparameters at each batch size, including the initial learning rate $\eta_0$ and, where learning rate decay was used, the decay schedule $(\alpha, T_{\text{decay}})$. Also, unless otherwise specified, we used the Nesterov momentum optimizer SutskeverEtAl_icml2013 and tuned the momentum $\gamma$. Tuning anew for each batch size is extremely important since otherwise we would not be measuring steps to result as a function of batch size; rather, we would be measuring steps to result as a function of batch size and the specific values of the learning rate and other metaparameters. We used quasi-random search BousquetEtAl_LDS_2017 to tune the metaparameters with equal budgets of non-divergent trials for different batch sizes. (We discarded trials with a divergent training loss, which typically occurred when the learning rate was too high.) We selected metaparameter search spaces by hand based on preliminary experiments. The exact number of non-divergent trials needed to produce stable results depends on the search space, but 100 trials seemed to suffice in all of our experiments. (LSTM on LM1B used 50 trials because we only tuned the learning rate with the momentum fixed; we validated that tuning the momentum did not significantly affect the results for a subset of batch sizes.) If the optimal trial occurred near the boundary of the search space, or if the goal validation error was not achieved within the search space, we repeated the search with a new search space. We measured steps to result for each batch size by selecting the metaparameter tuning trial that reached the goal validation error in the fewest number of steps.
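The following sketch illustrates this selection rule; the data structure and names are illustrative rather than the tooling we actually used:

```python
# Sketch of the measurement: for each batch size, steps-to-result is the fewest steps
# any non-divergent tuning trial needed to first reach the goal validation error.

def steps_to_result(trials, goal_error):
    best = None
    for history in trials.values():                 # history: list of (step, val_error)
        for step, val_error in sorted(history):
            if val_error <= goal_error:
                best = step if best is None else min(best, step)
                break                               # first time this trial reaches the goal
    return best                                     # None if no trial reached the goal

trials = {
    "trial_0": [(1_000, 0.35), (2_000, 0.24)],
    "trial_1": [(1_000, 0.28), (1_500, 0.25)],
    "trial_2": [(1_000, 0.50), (2_000, 0.45)],      # never reaches the goal
}
print(steps_to_result(trials, goal_error=0.25))     # 1500
```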
4.1 Steps to result depends on batch size in a similar way across problems
To get a sense of the basic empirical relationship, we measured the number of steps required to reach a goal validation error as a function of batch size across several different datasets and models (Figure 1). In all cases, as the batch size grows, there is an initial period of perfect scaling (a $b$-fold benefit, indicated with a dashed line on the plots) where the number of steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism, where additional parallelism provides no benefit whatsoever. In other words, for any given problem and without making strong assumptions about learning rates or other optimizer parameters, we can achieve both extremes suggested by theory (see Section 3.1). A priori, it is not obvious that every workload in our experiments should exhibit perfect scaling at the smallest batch sizes instead of immediately showing diminishing returns.
4.2 Validating our measurement protocol
If the curves in Figure 1 were sensitive to the exact choice of goal validation error, then measuring the steps needed to first reach a particular validation error would not be a meaningful proxy for training speed. For small changes in the goal validation error, we do not care about vertical shifts as long as the transition points between the three scaling regions remain relatively unchanged. Figure 2 shows that varying the error goal only vertically shifts the steps-to-result curve, at least for modest variations centered around a good absolute validation error. Furthermore, although we ultimately care about out-of-sample error, if our steps-to-result plots looked very different when measuring the steps needed to reach a particular training error, then we would need to present our results somewhat differently and include both curves. However, switching to training error does not change the plots much at all (see Figure 13 in the Appendix).
Our experiments depend on extensive metaparameter tuning for the learning rate, momentum, and, where applicable, the learning rate schedule. For each experiment, we verified our metaparameter search space by checking that the optimal trial was not too close to a boundary of the space. See Figures 14 and 15 in the Appendix for examples of how we verified our search spaces.
4.3 Some models can exploit much larger batch sizes than others
We investigated whether some models can make more use of larger batches than others by experimenting with different models while keeping the dataset and optimizer fixed. We explored this question in two ways: (i) by testing completely different model architectures on the same dataset, and (ii) by varying the size (width and depth) of a model within a particular model family. Since the absolute number of steps needed to reach a goal validation error depends on the model, the steps-to-result vs batch size curves for each model generally appear at different vertical offsets from each other. Since we primarily care about the locations of the perfect scaling, diminishing returns, and maximal data parallelism regions, we normalized the y-axis of each plot by dividing by the number of steps needed to reach the goal for a particular batch size and dataset. This normalization corresponds to a vertical shift of each curve (on log-scale plots), and makes it easier to compare different models. Appendix D contains all plots in this section without the y-axis normalized.
Figures 2(a)–2(c) show that the model architecture significantly affects the relationship between batch size and the number of steps needed to reach a goal validation error. In Figure 2(a), the curve for the Fully Connected model flattens later than for the Simple CNN model on MNIST (although in this case the Simple CNN model can ultimately achieve better performance than the Fully Connected model). In Figure 2(b), the curve for ResNet-50 flattens much later than the curve for VGG-11, indicating that ResNet-50 can make better use of large batch sizes on this dataset. Unlike ResNet-50, VGG-11 does not use batch normalization or residual connections. Figure 2(c) shows that Transformer can make better use of large batch sizes than LSTM on LM1B.
Figures 2(d)–2(f) show that varying the depth and width can affect a model’s ability to exploit larger batches, but not necessarily in a consistent way across different model architectures. In Figure 2(d), the regions of perfect scaling, diminishing returns, and maximum useful batch size do not change much when the width and depth are varied for the Fully Connected model on MNIST, although the shallowest model seems less able to exploit larger batches than the deeper models. This contrasts with the findings of chen2018effect, although they changed width and depth simultaneously while keeping the number of parameters fixed. For Simple CNN on MNIST, the relationship between batch size and steps to a goal validation error seems not to depend on width at all (Figure 15(e) in the Appendix shows that the curves are the same even when the y-axis is not normalized). However, in Figure 2(f), the curves for narrower Transformer models on LM1B flatten later than for wider Transformer models, while the depth seems to have less of an effect. Thus, reducing width appears to allow Transformer to make more use of larger batch sizes on LM1B.
4.4 Momentum extends perfect scaling to larger batch sizes, but matches plain SGD at small batch sizes
We investigated whether some optimizers can make better use of larger batches than others by experimenting with plain SGD, SGD with momentum, and Nesterov momentum on the same model and dataset. Since plain SGD is a special case of both Nesterov momentum and SGD with momentum (with $\gamma = 0$ in each case), and since we tune $\gamma$ in all experiments, we expect that experiments with either of these optimizers should do no worse than plain SGD at any batch size. However, it is not clear a priori whether momentum optimizers should outperform SGD, either by taking fewer training steps or by extending the perfect scaling region to larger batch sizes.
Figure 4 shows that Nesterov momentum and SGD with momentum can both extend the perfect scaling region beyond that achieved by SGD, and thus can significantly reduce the number of training steps required to reach a goal validation error at larger batch sizes. However, at batch sizes small enough that all optimizers are within their perfect scaling region, momentum optimizers perform identically to SGD without momentum. Though initially surprising, this identical performance at small batch sizes is consistent with observations made in kidambi2018insufficiency. In our experiments, we did not see a large difference between Nesterov momentum and SGD with momentum – Nesterov momentum appears to scale slightly better for Transformer on LM1B, but both perform about equally well for Simple CNN on MNIST.
4.5 The dataset matters, but may be secondary to the model or the optimizer
We investigated whether properties of the dataset make some problems able to exploit larger batch sizes than others by experimenting with different datasets while keeping the model and optimizer fixed. We explored this question in two ways: (i) by testing the same model on completely different datasets, and (ii) by testing the same model on different subsets of the same dataset. We normalized the y-axis of all plots in this section in the same way as Section 4.3. Appendix D contains all plots in this section without the y-axis normalized.
Figure 5 shows that changing the dataset can affect the relationship between batch size and the number of steps needed to reach a goal validation error. Figure 4(a) shows that Fashion MNIST deviates from perfect scaling at a slightly larger batch size than MNIST for the Simple CNN model. Figure 4(b) shows that ImageNet and Open Images are extremely similar in how well ResNet-50 can make use of larger batch sizes, although, if anything, ImageNet might make slightly better use of larger batch sizes. Figure 4(c) shows that LM1B scales slightly better with increasing batch size than Common Crawl for Transformer. Since Fashion MNIST is the same size as MNIST, Open Images is larger than ImageNet, and Common Crawl is far larger than LM1B, these differences are not simply a matter of larger datasets making larger batch sizes more valuable.
To disentangle the effects from changes to the distribution and changes to the number of examples, we generated steps to result vs batch size plots for different random subsets of MNIST (Figure 5(a)) and ImageNet (Figure 5(b)). For MNIST, we selected subsets of different sizes, while for ImageNet, we selected a random subset of half the images and a similar sized subset that only includes images from half of the classes. At least on MNIST, any effect on the maximum useful batch size is extremely small or nonexistent. For ImageNet, Figure 5(b) shows that the random subset of half the images deviates from perfect scaling sooner than the full dataset, but the curve for the subset with half the classes is very close to the curve for the full dataset and, if anything, deviates from perfect scaling later, even though it contains roughly the same number of images as the random subset.
4.6 Regularization can be more helpful at some batch sizes than others
We used label smoothing szegedy2016rethinking to regularize training in our experiments with ResNet-50 on ImageNet. Without label smoothing, we could not achieve our goal validation error rate of 0.25 at the largest batch sizes within our training budget. With a fixed compute budget for each batch size, label smoothing improved the error by as much as one percentage point at large batch sizes, while having no apparent effect at small batch sizes (Figure 6(a)). Meanwhile, if multiple choices for the label smoothing metaparameter achieved the goal within the training budget, then label smoothing did not change the number of steps needed (Figure 6(b)).
We confirmed that label smoothing reduced overfitting at large batch sizes for ResNet-50 on ImageNet (see Figure 19 in the Appendix). This result is consistent with the idea that noise from small batch training is a form of implicit regularization (e.g. GoodfellowEtAlBook2016). However, although our results show that other forms of regularization can serve in place of this noise, it might be difficult to select and tune other forms of regularization for large batch sizes. For example, we unsuccessfully tried to control overfitting with larger batch sizes by increasing the L2 weight penalty and by applying additive Gaussian gradient noise before we obtained good results with label smoothing.
Finally, we also tried label smoothing with Simple CNN on MNIST and Fashion MNIST, and found that it generally helped all batch sizes, with no consistent trend of helping smaller or larger batch sizes more (see Figure 20 in the Appendix). This may be because these datasets are sufficiently small and simple that overfitting is an issue at all batch sizes.
4.7 The best learning rate and momentum vary with batch size
Across all problems we considered, the effective learning rate ($\eta_{\text{eff}} = \eta / (1 - \gamma)$; see Section 2.2) that minimized the number of training steps to a goal validation error tended to increase with increasing batch size (Figure 8). However, it did not always follow either a linear or square root scaling heuristic, despite the popularity of these rules of thumb. In some cases, the optimal effective learning rate even decreased for larger batch sizes. We also found that the best effective learning rate should be chosen by jointly tuning the learning rate and momentum, rather than tuning only the learning rate. For example, the optimal way to scale the effective learning rate for Transformer was to increase the momentum while decreasing the learning rate or holding it constant (see Figures 22 and 23 in the Appendix). This is a refinement of past prescriptions that only change the learning rate while keeping the momentum fixed.
We further investigated the relationship between learning rate, momentum, and training speed by examining our metaparameter search spaces for different batch sizes and model sizes. For this analysis, we used Transformer on LM1B with Nesterov momentum because the metaparameter search spaces are consistent between all batch and model sizes, and can be easily visualized because they consist only of the constant learning rate $\eta$ and the momentum $\gamma$. We observe the following behaviors:
- With increasing batch size, the region in metaparameter space corresponding to rapid training in terms of epochs becomes smaller (Figure 9), while the region in metaparameter space corresponding to rapid training in terms of step-count grows larger (Figure 10, although it eventually plateaus for batch sizes in the maximal data parallelism regime). Thus, with a fixed error goal and in a setting where training epochs are constrained (e.g. a compute budget), it may become more challenging to choose good values for the metaparameters with increasing batch size. Conversely, with a fixed error goal and in a setting where training steps are constrained (e.g. a wall-time budget), it may become easier to choose good values for the metaparameters with increasing batch size.
- The metaparameters yielding the fastest training are typically on the edge of the feasible region of the search space (Figures 9 and 10). In other words, small changes in the optimal metaparameters might make training diverge. This behavior may pose a challenge for metaparameter optimization techniques, such as Gaussian Process approaches, that assume a smooth relationship between metaparameter values and model performance. It could motivate techniques such as learning rate warm-up that enable stability at larger eventual learning rates, since the maximum stable learning rate depends on the current model parameters. That said, we did not need to use learning rate warm-up for any of our problems. We also did not observe this behavior for ResNet-50 on ImageNet. Figure 21 in the Appendix shows the results from a range of effective learning rates near the optimum for ResNet-50 on ImageNet and Transformer on LM1B.
- Smaller models have larger stable learning rates (Figure 11). This is consistent with recent work predicting that the largest stable learning rate is inversely proportional to layer width karakida2018universal.
4.8 Solution quality depends on compute budget more than batch size
[Figure 12: validation error vs. batch size under a step budget and an epoch budget for (a) Simple CNN on MNIST, (b) Simple CNN on Fashion MNIST, (c) Transformer (Shallow and Narrow) on LM1B, and (d) Transformer (Base) on LM1B.]
We investigated the relationship between batch size and out-of-sample error for Simple CNN on MNIST and Fashion MNIST, and for two sizes of Transformer on LM1B. For each task, we ran a quasi-random metaparameter search over the constant learning rate and Nesterov momentum. For MNIST and Fashion MNIST, we also added label smoothing and searched over the label smoothing parameter to mitigate any confounding effects of overfitting (see Section 4.6). We ran 100 metaparameter trials for each batch size with a large practical wall-time budget.
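For readers unfamiliar with quasi-random search, the sketch below shows one way such trials can be generated, using a Halton sequence with a log-uniform learning rate and a uniform momentum; the sampling scheme and bounds shown are illustrative assumptions, not our actual search spaces.

```python
import math

def halton(index, base):
    """Van der Corput / Halton radical inverse for one dimension."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def sample_trial(i, lr_bounds=(1e-4, 1e1), momentum_bounds=(0.0, 0.999)):
    """One quasi-random trial: log-uniform learning rate, uniform momentum.

    The bounds here are illustrative placeholders, not our actual search spaces.
    """
    u_lr, u_mom = halton(i + 1, 2), halton(i + 1, 3)
    log_lo, log_hi = math.log10(lr_bounds[0]), math.log10(lr_bounds[1])
    learning_rate = 10 ** (log_lo + u_lr * (log_hi - log_lo))
    momentum = momentum_bounds[0] + u_mom * (momentum_bounds[1] - momentum_bounds[0])
    return learning_rate, momentum

# 100 quasi-random trials, as used for each batch size in our experiments.
trials = [sample_trial(i) for i in range(100)]
```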
To disentangle the effects of the batch size from the compute budget, we compared batch sizes subject to budgets of either training steps or training epochs. For each batch size and compute budget, we found the model checkpoint that achieved the best validation accuracy across all metaparameter trials, and across all training steps that fell within the compute budget. Figure 12 shows the validation error for these best-validation-error checkpoints, as a function of batch size, for a range of compute budgets. We observe that, subject to a budget on training steps, larger batch sizes achieve better out-of-sample error than smaller batch sizes, but subject to a budget on training epochs, smaller batch sizes achieve better out-of-sample error than larger batch sizes. These observations are likely explained by the fact that, for a fixed number of training steps, larger batch sizes train on more data, while for a fixed number of epochs, smaller batch sizes perform more training steps.
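The asymmetry between the two kinds of budget is just bookkeeping, which the short sketch below makes explicit; the dataset and budget sizes are MNIST-scale numbers used purely for illustration.

```python
import math

def steps_for_epoch_budget(num_epochs, dataset_size, batch_size):
    """Number of optimizer steps permitted by an epoch budget."""
    return num_epochs * math.ceil(dataset_size / batch_size)

def epochs_for_step_budget(num_steps, dataset_size, batch_size):
    """Number of passes over the data permitted by a step budget."""
    return num_steps * batch_size / dataset_size

# With a fixed step budget, a larger batch size sees more training examples;
# with a fixed epoch budget, it performs fewer optimizer steps.
print(steps_for_epoch_budget(10, 60000, 64))     # 9380 steps
print(steps_for_epoch_budget(10, 60000, 1024))   # 590 steps
print(epochs_for_step_budget(10000, 60000, 64))    # ~10.7 epochs
print(epochs_for_step_budget(10000, 60000, 1024))  # ~170.7 epochs
```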
The workloads in Figure 12 represent two distinct modes of neural network training. For the small MNIST and Fashion MNIST datasets, we chose training budgets that would saturate (or almost saturate) performance at each batch size. In other words, out-of-sample error cannot be improved any further by simply increasing the budget, with caveats due to practical limitations on our ability to find optimal values for the metaparameters. Figures 12a and 12b show that differences in maximum performance between batch sizes on these datasets are very small (Figures 24 and 25 in the Appendix contain zoomed versions of these plots). We cannot rule out that any differences at this magnitude are due to noise from metaparameter choices and training stochasticity. Thus, for these workloads at least, the effect of batch size on solution quality is either very small or nonexistent. On the other hand, we cannot saturate performance with Transformer on LM1B within a practical training time. In this case, the scenario is much simpler: for a given batch size, the best error is achieved by the largest compute budget. Larger batch sizes are favored by compute budgets defined in terms of training steps, while smaller batch sizes are favored by compute budgets defined in terms of training epochs.
Taken together, these observations suggest that in practice the relevant question is not which batch size leads to the best performance, but rather how compute budget varies as a function of batch size. Although we attempted to saturate performance with MNIST and Fashion MNIST, we found that it took millions of training steps for small batch sizes, and thousands of epochs for large batch sizes, even for datasets as small and simple as these. Indeed, despite sampling 100 metaparameter configurations per batch size and training for up to 25 hours per trial, it is still not certain whether we truly saturated performance at the smallest and largest batch sizes (see Figures 24 and 25 in the Appendix). Thus, the regime of saturated performance is of limited practical concern for most workloads – the compute budget required to saturate performance is likely beyond what a practitioner would typically use. For realistic workloads, practitioners should be most concerned with identifying the batch size at which they can most efficiently apply their compute.
5 Discussion
Our goals in measuring the effects of data parallelism on neural network training were twofold:
first, we hoped to produce actionable advice for practitioners, and
second, we hoped to understand the utility of building systems capable of very high degrees of data parallelism.
Our results indicate that, for idealized data parallel hardware, there is a universal relationship between training time and batch size, but there is dramatic variation in how well different workloads can make use of larger batch sizes.
Across all our experiments, increasing the batch size initially reduced the number of training steps needed proportionally. However, the batch size at which this perfect scaling regime ended differed widely from workload to workload.
As the batch size increases beyond the perfect scaling regime, there are diminishing returns (where doubling the batch size reduces the number of training steps needed by less than half) that end with a maximum useful batch size (beyond which increasing the batch size no longer reduces the number of training steps needed). Once again, the maximum useful batch size is extremely problem-dependent and varied considerably between the workloads in our experiments. Other workloads may have the region of perfect scaling end at batch sizes even smaller or larger than the range we observed, as well as having even smaller or larger maximum useful batch sizes.
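The following toy curve illustrates the three scaling regimes described above; it is a schematic with arbitrary constants, not a functional form fit to our measurements.

```python
def steps_to_goal(batch_size, base_steps=2**20, min_steps=2**10):
    """Schematic steps-to-result curve exhibiting the three regimes.

    For small batch sizes the first term dominates and doubling the batch size
    roughly halves the steps needed (perfect scaling); as the batch size grows
    the curve flattens (diminishing returns) and approaches min_steps (the
    maximal data parallelism regime). The constants are arbitrary placeholders.
    """
    return min_steps + base_steps / batch_size

for b in [2 ** k for k in range(0, 16, 3)]:
    print(f"batch size {b:6d}: ~{int(steps_to_goal(b))} steps to goal")
```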
On the one hand, the possibility that perfect scaling can extend to very large batch sizes for some workloads is good news for practitioners because it suggests that efficient data-parallel systems can provide extremely large speedups for neural network training.
On the other hand, the wide variation in scaling behavior across workloads is bad news because any given workload might have a maximum useful batch size well below the limits of our hardware.
Moreover, for a new workload, measuring the training steps needed as a function of batch size and confirming the boundaries of the three basic scaling regimes requires expensive experiments. In this work, we have only described how to characterize the scaling behavior retrospectively, by tuning the optimization metaparameters at every batch size.
Although anon2019computational also described the same basic scaling behavior we found, in their experiments the relationship did not appear consistently across problems, across error goals, or in out-of-sample error. In light of our own results, the heuristics they assumed for adjusting the learning rate as a function of batch size are the likely cause of these inconsistencies, but this explanation only drives home the inconvenience of having to carefully re-tune the metaparameters at every new batch size. We were unable to find reliable support for any of the previously proposed heuristics for adjusting the learning rate as a function of batch size. Thus we are forced to recommend that practitioners tune all optimization parameters anew when they change the batch size, or they risk masking the true behavior of the training procedure.
If the scaling behavior of workloads with respect to batch size has a simple dependence on properties of the workload, then we might be able to predict the limits of perfect scaling (or the maximum useful batch size) before running extensive experiments. We could then prioritize workloads to run on specialized hardware or decide whether gaining access to specialized hardware would be useful for a given workload of interest. On the one hand, our results are bad news for practitioners because they show that accurate scaling predictions must depend on a combination of non-obvious properties of the model, properties of the optimizer, and properties of the dataset. On the other hand, we have a lot of control over the choice of model and optimizer and there is some indication that model and optimizer properties might be responsible for the largest portion of the variation between workloads. Our results comparing SGD and SGD with momentum (or Nesterov momentum) show that, at least for the problems we tried, momentum can extend perfect scaling to much larger batch sizes, offering clear guidance for practitioners. Other optimizers, such as KFAC martens2015optimizing,grosse2016kronecker,ba2016distributed, or optimization techniques designed specifically for massively data parallel systems li2014efficient, might allow perfect scaling to extend much further. Intuitively, it seems plausible that optimizers that estimate local curvature information might be able to benefit more from large batches than optimizers that only use gradients.
Although the model seems to have a large effect on the maximum useful batch size and the limit of perfect scaling, our results do not give definitive answers on exactly how to design models that scale better for a given optimizer and dataset. Even when we kept the model family fixed, we observed somewhat inconsistent results from changing the model width and depth. chen2018effect suggested that wider models can exploit larger batch sizes than narrower models, but their theoretical arguments only apply to linear networks and fully connected networks with a single hidden layer. In contrast, we found that narrower variants of the Transformer model scaled better to larger batch sizes, although it is unclear if the same notion of “width” transfers between different types of neural networks.
Unlike the model and optimizer, we generally have much less control over the dataset. Unfortunately, the properties of the dataset also affect how well training scales in practice. Our results are equivocal on whether the number of training examples has any effect, but changing the dataset entirely can certainly change the scaling behavior with respect to batch size.
Finally, our results at least partially reconcile conflicting stances in the literature on whether increasing the batch size degrades model quality. Our experiments show that:
- Any study that only tunes the learning rate for one batch size and then uses a heuristic to choose the learning rate for other batch sizes goyal2017accurate,keskar2016large,hoffer2017train,lin2018don,devarakonda2017adabatch,anon2019computational gives a systematic advantage to the batch size used in tuning (as well as nearby batch sizes). Our results did not show a simple relationship between the optimal learning rate and batch size that scales indefinitely (see Figures 8 and 22), so the use of simple heuristics for batch sizes sufficiently far from the base batch size could very well explain the degraded solutions and divergent training reported in prior work. Similarly, the optimal values of other metaparameters, such as the momentum and learning rate decay schedule, should not be assumed to remain constant or scale in a simple way as the batch size increases.
- Assuming an epoch budget when comparing solution quality between batch sizes masters2018revisiting,goyal2017accurate,lin2018don,devarakonda2017adabatch, in effect, limits an investigation to the perfect scaling region of the steps to result vs batch size curve (see Figure 1). This budget favors smaller batch sizes because they will perform more optimizer steps for the same number of training examples (see Section 4.8). Certainly, there are situations where an epoch budget is appropriate, but there may exist budgets just outside the perfect scaling region that can achieve the same quality solution, and those budgets may still represent a significant reduction in the number of training steps required. Moreover, even for a fixed model and dataset, simply changing the optimizer can significantly extend the perfect scaling regime to larger batch sizes. For example, masters2018revisiting found that test performance of ResNet-8 (without batch normalization) on CIFAR-10 with a fixed epoch budget degraded after batch size 16, but considered only plain mini-batch SGD. Our experiments confirmed that perfect scaling ends at batch size 16 with plain mini-batch SGD, but using Nesterov momentum extends the perfect scaling regime to batch size 256 (see Figure 0(c)).
- Assuming a step budget when comparing solution quality between batch sizes hoffer2017train might favor larger batch sizes because they will see more training examples for the same number of gradient updates (see Section 4.8). A step budget is likely sufficient for a larger batch size to reach at least the same performance as a smaller batch size: we never saw the number of steps to reach a goal validation error increase when the batch size was increased (see Figure 1).
- Increasing the batch size reduces noise in the gradient estimates (see Equation 4; a toy numerical illustration follows this list). However, the noise in updates due to small batches might, in some cases, provide a helpful regularization effect GoodfellowEtAlBook2016,smith2018bayesian. Thankfully, other regularization techniques may be able to replace this effect. In our experiments with ResNet-50 on ImageNet, we could not initially achieve the same quality solution at the largest batch sizes as we could with smaller batch sizes within our training budget, but label smoothing eliminated this degradation in solution quality (see Section 4.6). Others have also used regularization techniques, such as data augmentation keskar2016large and L2 regularization smith2018bayesian, to eliminate the "generalization gap" between two batch sizes.
- Finally, although we do not believe there is an inherent degradation in solution quality associated with increasing the batch size, depending on the compute budget, it may be more difficult to find good values for the metaparameters at larger batch sizes (see Figures 9 and 10). Specifically, with increasing batch size, the region in metaparameter space corresponding to rapid training in terms of epochs becomes smaller, while the region in metaparameter space corresponding to rapid training in terms of step-count grows larger, at least for some problems. This suggests that metaparameter optimization in a training-epoch constrained setting (e.g. a compute budget) may become more challenging with increasing batch size.
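As a toy numerical illustration of the noise-reduction point above (not the setting of Equation 4), the sketch below averages synthetic per-example gradients and shows the variance of the mini-batch gradient estimate falling roughly as one over the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example gradients for a single scalar parameter, drawn with
# some spread around the true gradient. This only illustrates how the variance
# of the mini-batch gradient estimate shrinks with batch size.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

for batch_size in [1, 16, 256, 4096]:
    usable = (len(per_example_grads) // batch_size) * batch_size
    batch_means = per_example_grads[:usable].reshape(-1, batch_size).mean(axis=1)
    print(f"batch size {batch_size:5d}: variance of gradient estimate "
          f"~ {batch_means.var():.5f}")
# The printed variances fall roughly as 1 / batch_size.
```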
5.1 Limitations of our experimental protocol
When interpreting our results, one should keep in mind any limitations of our experimental protocol, even if they seem minor. We do not believe any of these limitations are debilitating, and we hope that describing these potential areas of concern will spur methodological innovation in future work.
First, we were unable to avoid some amount of human judgment when tuning metaparameters. Although we did not tune metaparameters by hand, we specified the search spaces for automatic tuning by hand and they may not have been equally appropriate for all batch sizes, despite our best efforts. We are most confident in our search spaces that tuned the fewest metaparameters (such as in our experiments that only tuned learning rate and momentum). We found it quite difficult to be confident that our tuning was sufficient when we searched over learning rate decay schedules; readers should be aware that the steps to result measurement is generally quite sensitive to the learning rate schedule. Thus, we may not have sampled enough trials at some batch sizes or, nearly equivalently, our search spaces may have been too wide at some batch sizes. Even though we verified that the best trial was not on the boundary of the search space, this by no means guarantees that we found the globally optimal metaparameters.
Smaller batch sizes typically had more opportunities to measure validation error and, when validation error was noisy, got more chances to sample a lucky validation error. Batch sizes (usually larger ones) that did not reach the goal validation error using the first search space we tried used revised search spaces that gave them an extra bite of the apple, so to speak.
Finally, our analysis does not consider how robustly we can reach a goal error rate. For instance, we did not distinguish between batch sizes where all 100 trials achieved the goal validation error and batch sizes where only one of the 100 trials achieved the goal. The maximum or minimum value over a set of trials is not usually a very robust statistic, but something like the 50th percentile trial is a close to meaningless quantity that mostly reveals information about the search space. We tried to strike a balance between our desire to study realistic workloads and our desire to be able to repeat our experiments so many times over that these uncertainty questions become trivial. Ultimately, for this work, we opted for simplicity of presentation and reported results for optimal trials.
6 Conclusions and future work
Increasing the batch size is a simple way to produce valuable speedups across a range of workloads, but, for all the workloads we tried, the benefits diminished well within the limits of state-of-the-art hardware.
Unfortunately, blindly increasing the batch size to the current limits of our hardware will not produce a large speedup for all workloads.
However, our results suggest that some optimization algorithms may be able to consistently extend perfect scaling across many models and datasets.
Future work should repeat our measurements with other optimizers, beyond the closely related ones we tried, to see if an optimizer with the right properties already exists. If we focus less on generality and only crave speedups for specific, high-value problems, we can alternatively consider changing the model to extend perfect scaling to much larger batch sizes.
However, unlike the optimizer, practitioners are likely to tailor their model architectures to the specific problems at hand. Therefore, instead of searching for a single model architecture that happens to scale extremely well, future work should try to uncover general principles for designing models that can scale perfectly to larger batch sizes. Even if such principles remain elusive, we would still benefit from methods to prospectively predict the scaling behavior of a given workload without requiring careful metaparameter tuning at several different batch sizes. Although not all of these avenues of future work may pan out, the deep learning community can always benefit from methodical experiments designed to test hypotheses, characterize phenomena, and reduce confusion, to balance more exploratory work designed to generate new ideas for algorithms and models.