Keras multi-gpu batch normalization



1) How does the batch normalization layer work with multi_gpu_model?

Is it calculated separately on each GPU, or is it somehow synchronized between GPUs?

2) Which batch normalization parameters are saved when saving a model? (Since, when using multiple GPUs in Keras, the original model must be saved, as suggested here.)

The docs of multi_gpu_model say:

    Specifically, this function implements single-machine multi-GPU data parallelism.

What does this mean for batch normalization?










Tags: deep-learning, keras, tensorflow, batch-normalization






asked Mar 22 at 14:34 by Antonio Jurić, edited Mar 22 at 15:49 by Ethan
1 Answer
1) How does the batch normalization layer work with multi_gpu_model?

For N GPUs, there are N copies of the model, one on each GPU. For each copy, the forward and backward passes are executed on a sub-batch (each sub-batch is 1/N-th of the full batch). This means batch normalization is effectively sub-batch normalization: each copy normalizes using only its own sub-batch statistics and has no access to the rest of the batch.

    # This `fit` call will be distributed on 8 GPUs.
    # Since the batch size is 256, each GPU will process 32 samples.
    parallel_model.fit(x, y, epochs=20, batch_size=256)
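To make the difference concrete, here is a small NumPy sketch (an illustration only, not Keras internals) comparing the whole-batch statistics with the per-sub-batch statistics that each GPU ends up using:

    import numpy as np

    np.random.seed(0)
    batch = np.random.randn(256, 8)    # a batch of 256 samples with 8 features
    sub_batches = np.split(batch, 8)   # what each of 8 GPUs would see (32 samples each)

    # Whole-batch statistics (what a synchronized BN would use)
    print("full-batch mean:", batch.mean(axis=0)[:3])

    # Per-GPU statistics (what unsynchronized BN actually uses)
    for i, sb in enumerate(sub_batches[:2]):
        print(f"GPU {i} sub-batch mean:", sb.mean(axis=0)[:3])

The per-GPU means and variances differ from the full-batch ones, and each replica's BatchNormalization layer normalizes with its own values.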



2) Which batch normalization parameters are saved when saving a model (since, when using multiple GPUs in Keras, the original model must be saved, as suggested here)?

One unified (template) set of weights is saved instead of N different sets. Each GPU computes a different gradient because it receives a different sub-batch. Then either the weights are updated separately on each GPU and synchronized periodically, or the N outputs and gradients are aggregated on the template model (on the CPU) and the new weights are broadcast back to the GPUs. In both cases there is a unified (template) model, although in the first case the template model may lag behind the replicas if it is synchronized only occasionally.
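For reference, the pattern from the Keras documentation is to build the template model on the CPU, wrap it with multi_gpu_model for training, and save via the template. A minimal sketch, where build_model(), x and y are hypothetical placeholders:

    import tensorflow as tf
    from keras.utils import multi_gpu_model

    with tf.device('/cpu:0'):
        model = build_model()   # hypothetical helper returning a Keras model with BN layers

    # Replicate the template model on 8 GPUs (data parallelism).
    parallel_model = multi_gpu_model(model, gpus=8)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')

    # x, y: your training data (placeholders here)
    parallel_model.fit(x, y, epochs=20, batch_size=256)

    # Save via the template model (which shares the same weights): its weights,
    # including each BatchNormalization layer's gamma, beta, moving_mean and
    # moving_variance, are what get serialized.
    model.save('my_model.h5')

So the saved batch-normalization parameters are the template model's gamma, beta and moving mean/variance, not N per-GPU copies.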



Extra remarks

1. Some researchers have proposed a synchronized batch normalization technique that uses statistics from the whole batch instead of each sub-batch (a sketch of the aggregation idea follows this list). They state:

    Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means that the data are normalized within each GPU.

2. Because of weight synchronization, we cannot expect a linear speed-up with respect to the number of GPUs.

3. Some technicalities of multi_gpu_model are discussed in this GitHub issue. multi_gpu_model gives a speed gain when the weights are sparse (in comparison to large Dense layers); otherwise, weight synchronization becomes the bottleneck.

4. Also, here is an example of the GPU-to-GPU weight synchronization flow from Nvidia: [image not reproduced]
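Regarding remark 1, a rough NumPy sketch of the aggregation idea behind synchronized BN (an illustration, not the paper's implementation): each GPU contributes its sub-batch sum and sum of squares, and combining them reproduces the whole-batch mean and variance.

    import numpy as np

    np.random.seed(0)
    batch = np.random.randn(256, 8)
    sub_batches = np.split(batch, 8)   # one chunk per "GPU"

    # Each device computes local sums; an all-reduce would add them across devices.
    n = sum(sb.shape[0] for sb in sub_batches)
    total_sum = sum(sb.sum(axis=0) for sb in sub_batches)
    total_sq = sum((sb ** 2).sum(axis=0) for sb in sub_batches)

    global_mean = total_sum / n
    global_var = total_sq / n - global_mean ** 2    # E[x^2] - (E[x])^2

    # Matches the statistics computed over the full batch on a single device.
    assert np.allclose(global_mean, batch.mean(axis=0))
    assert np.allclose(global_var, batch.var(axis=0))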






answered Mar 22 at 15:49 by Esmailian, edited 2 hours ago
• How do you know that Keras does weight synchronization? The docs of the multi_gpu_model function only say that the original model and the parallel model share weights ("# Save model via the template model (which shares the same weights)"). – Antonio Jurić, Mar 25 at 9:49











• @AntonioJurić You are right, that is an inaccurate statement: if all N copies had the same weights at all times, what would .fit be doing on the separate GPUs with separate data? So the weights differ temporarily and are unified periodically. – Esmailian, Mar 25 at 11:15