Keras multi-gpu batch normalization
1) How does the batch normalization layer work with multi_gpu_model?
Is it calculated separately on each GPU, or is it somehow synchronized between GPUs?
2) Which batch normalization parameters are saved when saving a model (since, when using multiple GPUs in Keras, the original model must be saved, as suggested here)?
The docs of multi_gpu_model say:
"Specifically, this function implements single-machine multi-GPU data parallelism."
What does this mean for batch normalization?
deep-learning keras tensorflow batch-normalization
asked Mar 22 at 14:34 by Antonio Jurić, edited Mar 22 at 15:49 by Ethan
1 Answer
1) How does the batch normalization layer work with multi_gpu_model?
For N GPUs, there are N copies of the model, one on each GPU. For each copy, the forward and backward passes are executed on a sub-batch (each sub-batch is 1/Nth of the full batch). This means that batch normalization is effectively sub-batch normalization: each replica normalizes with the statistics of its own sub-batch and has no access to the rest of the batch.
# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
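For concreteness, here is a minimal end-to-end sketch of the setup that snippet assumes. It is my illustration rather than code from the answer: the layer sizes, random data, and GPU count are invented, and it targets the pre-TF2 keras.utils.multi_gpu_model API.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.utils import multi_gpu_model

# The template model is built on the CPU; BatchNormalization is an ordinary layer in it.
model = Sequential([
    Dense(64, activation='relu', input_shape=(100,)),
    BatchNormalization(),  # at train time, statistics come from the replica's sub-batch only
    Dense(10, activation='softmax'),
])

# Replicate the template model on 8 GPUs (single-machine data parallelism).
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Illustrative random data.
x = np.random.random((1024, 100))
y = np.random.random((1024, 10))

# The global batch of 256 is split into 8 sub-batches of 32, so each BN layer
# normalizes over 32 samples, not 256.
parallel_model.fit(x, y, epochs=20, batch_size=256)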
2) Which batch normalization parameters are saved when saving a model (since, when using multiple GPUs in Keras, the original model must be saved, as suggested here)?
One unified (template) set of weights is saved instead of N different sets. Each GPU computes a different gradient because it receives a different sub-batch. Then either the weights are updated separately on each GPU and synchronized periodically, or the N outputs and gradients are aggregated on the template model (on the CPU) and the new weights are broadcast back to the GPUs. In both cases there is a single unified (template) model, although in the first case the template may lag behind the replicas between synchronization points.
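As a sketch of what that means for saving (again my addition, reusing the hypothetical model, parallel_model, x, and y from the example above): saving the template model writes the single unified set of weights, including each BatchNormalization layer's gamma, beta, moving_mean, and moving_variance.
from keras.models import load_model

# Train through the parallel wrapper; the weights live in the shared template model.
parallel_model.fit(x, y, epochs=20, batch_size=256)

# Save via the template model (which shares the same weights as the replicas);
# this writes one unified copy of every parameter, BN statistics included.
model.save('my_model.h5')

# Reload the template later; wrap it with multi_gpu_model again only if needed.
restored = load_model('my_model.h5')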
Extra remarks
- Some researchers have proposed a specific synchronization technique for batch normalization that uses the whole batch instead of a sub-batch. They state: "Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means that the data are normalized within each GPU." (A sketch of a present-day synchronized setup follows after these remarks.)
- Because of weight synchronization we cannot expect a linear speed-up w.r.t. the number of GPUs.
- Some technicalities of multi_gpu_model are discussed in this github issue. multi_gpu_model has a speed gain when weights are sparse (in comparison to Dense layers); otherwise weight synchronization becomes a bottleneck.
- Also, Nvidia provides an example diagram of the GPU-to-GPU weight synchronization flow.
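To make the first remark concrete: in current TensorFlow a synchronized variant is available out of the box, for example tf.keras.layers.experimental.SyncBatchNormalization (TF 2.2+) used under tf.distribute.MirroredStrategy rather than multi_gpu_model. The following is my rough sketch with invented shapes and data, not something from the original answer:
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
        # Aggregates mean/variance across all replicas, i.e. over the full global batch.
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Illustrative random data.
x = np.random.random((1024, 100)).astype('float32')
y = np.random.random((1024, 10)).astype('float32')

# The global batch of 256 is still split across replicas, but BN statistics are synchronized.
model.fit(x, y, epochs=2, batch_size=256)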
answered Mar 22 at 15:49 by Esmailian
How do you know that Keras does weight synchronization? The docs of the multi_gpu_model function only say that the original model and the parallel model share weights ("# Save model via the template model (which shares the same weights)").
– Antonio Jurić
Mar 25 at 9:49
@AntonioJurić You are right, that is an imprecise statement: if all N copies had exactly the same weights at all times, what would .fit be doing on separate GPUs with separate data? The weights differ temporarily and are unified periodically.
– Esmailian
Mar 25 at 11:15