
Keras multi-gpu batch normalization



1) How does the batch normalization layer work with multi_gpu_model?

Is it calculated separately on each GPU, or is it somehow synchronized between GPUs?

2) Which batch normalization parameters are saved when saving a model? (Since, when using multiple GPUs in Keras, the original (template) model must be saved, as suggested here.)

In the docs of multi_gpu_model it says:

    Specifically, this function implements single-machine multi-GPU data parallelism.

What does this mean for batch normalization?
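
For concreteness, the usage pattern I am asking about is roughly the following (a sketch based on the multi_gpu_model example in the Keras 2.x docs; the network itself is only a placeholder):

    import numpy as np
    import tensorflow as tf
    from keras.models import Sequential
    from keras.layers import Dense, BatchNormalization
    from keras.utils import multi_gpu_model

    # Build the template model on the CPU so its weights live in host memory.
    with tf.device('/cpu:0'):
        model = Sequential([
            Dense(64, activation='relu', input_shape=(20,)),
            BatchNormalization(),  # gamma, beta, moving_mean, moving_variance
            Dense(1, activation='sigmoid'),
        ])

    # Replicate the model on 2 GPUs; each replica processes half of every batch.
    parallel_model = multi_gpu_model(model, gpus=2)
    parallel_model.compile(loss='binary_crossentropy', optimizer='adam')

    x = np.random.rand(1024, 20)
    y = np.random.randint(0, 2, size=(1024, 1))
    parallel_model.fit(x, y, epochs=2, batch_size=256)  # 128 samples per GPU

    # Save via the template model, as the docs suggest; this writes a single
    # set of weights, including the BatchNormalization statistics.
    model.save('model.h5')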










Tags: deep-learning, keras, tensorflow, batch-normalization






asked Mar 22 at 14:34 by Antonio Jurić
edited Mar 22 at 15:49 by Ethan




















          1 Answer
1) How does the batch normalization layer work with multi_gpu_model?

For N GPUs, there are N copies of the model, one on each GPU. For each copy, the forward and backward passes are executed on a sub-batch (each sub-batch is 1/Nth of the full batch). This means batch normalization is effectively sub-batch normalization: each GPU computes its statistics from its own sub-batch and has no access to the rest of the batch.

    # This `fit` call will be distributed on 8 GPUs.
    # Since the batch size is 256, each GPU will process 32 samples.
    parallel_model.fit(x, y, epochs=20, batch_size=256)
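
To make this concrete, here is a small NumPy illustration (not Keras code) of how normalizing eight sub-batches of 32 separately differs from normalizing the full batch of 256 at once:

    import numpy as np

    rng = np.random.default_rng(0)
    batch = rng.normal(loc=3.0, scale=2.0, size=(256, 1))  # one feature, batch of 256

    def normalize(x, eps=1e-5):
        # BN transform (without the learned gamma/beta) over the batch axis.
        return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

    # Whole-batch normalization (single-GPU behaviour).
    full = normalize(batch)

    # Sub-batch normalization (each of 8 "GPUs" normalizes its 32-sample slice).
    sub = np.concatenate([normalize(s) for s in np.split(batch, 8)])

    # The results differ because every sub-batch uses its own mean and variance.
    print(np.abs(full - sub).max())  # strictly positive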



2) Which batch normalization parameters are saved when saving a model (since, when using multiple GPUs in Keras, the original template model must be saved, as suggested here)?

One unified (template) set of weights is saved instead of N different sets. Each GPU computes a different gradient because it receives a different sub-batch. Then either (a) the weights are updated separately on each GPU and synchronized periodically, or (b) the N outputs and gradients are aggregated on the template model (on the CPU), and the updated weights are broadcast back to the GPUs. In both cases there is a single unified (template) model, although in case (a) the template may lag behind the replicas between synchronizations.
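
As an illustration of scheme (b), here is a toy NumPy simulation of one data-parallel update step for a single weight vector. It is a conceptual sketch of "compute per-replica gradients, aggregate on the template, broadcast back", not what Keras literally executes internally:

    import numpy as np

    rng = np.random.default_rng(1)
    n_gpus, batch_size, n_features = 4, 256, 8
    w_template = np.zeros(n_features)        # weights held by the template model
    X = rng.normal(size=(batch_size, n_features))
    y = X @ rng.normal(size=n_features)      # toy linear-regression targets

    def gradient(w, X_sub, y_sub):
        # Gradient of the mean squared error on this sub-batch.
        return X_sub.T @ (X_sub @ w - y_sub) / len(y_sub)

    # Each "GPU" receives 1/N of the batch and computes a gradient
    # against its own copy of the template weights.
    grads = [gradient(w_template.copy(), xs, ys)
             for xs, ys in zip(np.split(X, n_gpus), np.split(y, n_gpus))]

    # Aggregate (average) the N gradients on the template, apply the update,
    # then broadcast the new weights back to every replica.
    lr = 0.1
    w_template -= lr * np.mean(grads, axis=0)
    replicas = [w_template.copy() for _ in range(n_gpus)]  # broadcast step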



Extra remarks:

1. Some researchers have proposed a synchronized batch normalization technique that uses the statistics of the whole batch instead of each sub-batch (a toy simulation of the idea is sketched after these remarks). They state:

    Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means that the data are normalized within each GPU.

2. Because of the weight synchronization, we cannot expect a linear speed-up with respect to the number of GPUs.

3. Some technicalities of multi_gpu_model are discussed in this GitHub issue. multi_gpu_model gives a speed gain when the weights are sparse (as opposed to large Dense layers); otherwise weight synchronization becomes the bottleneck.

4. Also, Nvidia provides an example diagram of the GPU-to-GPU weight synchronization flow.
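
Returning to remark 1: to clarify what "synchronized" batch normalization means, here is a toy NumPy simulation in which each GPU contributes only the sum and sum of squares of its sub-batch, those partial statistics are all-reduced, and every GPU then normalizes with the same global mean and variance. It is a sketch of the idea, not the implementation used by any particular framework:

    import numpy as np

    rng = np.random.default_rng(2)
    n_gpus = 4
    sub_batches = [rng.normal(loc=2.0, scale=3.0, size=(32, 1)) for _ in range(n_gpus)]

    # Each GPU contributes two numbers per feature: sum and sum of squares.
    counts = [s.shape[0] for s in sub_batches]
    sums = [s.sum(axis=0) for s in sub_batches]
    sq_sums = [(s ** 2).sum(axis=0) for s in sub_batches]

    # "All-reduce": combine the partial statistics into global batch statistics.
    n = sum(counts)
    mean = sum(sums) / n
    var = sum(sq_sums) / n - mean ** 2

    # Every GPU now normalizes its sub-batch with the *same* global statistics.
    eps = 1e-5
    normalized = [(s - mean) / np.sqrt(var + eps) for s in sub_batches]

    # Sanity check: identical to normalizing the concatenated batch directly.
    full = np.concatenate(sub_batches)
    reference = (full - full.mean(axis=0)) / np.sqrt(full.var(axis=0) + eps)
    print(np.allclose(np.concatenate(normalized), reference))  # True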






answered Mar 22 at 15:49 by Esmailian (later edited)
• How do you know that Keras does weight synchronization? The docs of the multi_gpu_model function only say that the original model and the parallel model share weights ("# Save model via the template model (which shares the same weights)"). – Antonio Jurić, Mar 25 at 9:49

• @AntonioJurić You are right, that is an inaccurate statement: if all N copies had the same weights at all times, what would .fit be doing on separate GPUs with separate data? So the weights differ temporarily and are unified periodically. – Esmailian, Mar 25 at 11:15











