Sampling Big Data for Predictive Analytics in Python



In practice, how does one go about sampling from a big data set (e.g., roughly 50 million distinct observations) to perform ML using Python?
Most non-parametric models (e.g., SVMs, ensemble models) start to strain computer resources on much smaller sets (e.g., 200 features, 200K observations).

How is this done in practice in industry?

Other questions here touch on this but don't ask it explicitly, so this is not a duplicate.
Thanks in advance.

Tags: machine-learning, bigdata, sampling

asked Mar 31 at 15:03 by Windstorm1981

1 Answer

This is what I do in projects:

• Pre-process the data in the DB / data lake. The aim is to:

• A. Form batches (this might require a new table with shuffled indices)

• B. Create a copy with normalization and other feature-related transformations applied

After this, try algorithms that support batch learning. Neural networks support batch learning, as do a few other algorithms (https://sklearn.org/modules/scaling_strategies.html#incremental-learning).

With batch learning, you can plot the loss for each batch and see whether the algorithm is working.

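The workflow above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not a production recipe: a synthetic in-memory list stands in for the shuffled batch table, and the names `iter_batches`, `RunningScaler`, `MiniBatchLogReg`, `train_in_batches` and `make_synthetic_rows` are hypothetical helpers written for this example. In a real project, scikit-learn estimators that expose `partial_fit` (listed on the page linked above) would probably replace the hand-written scaler and model.

```python
import math
import random


def iter_batches(rows, batch_size):
    """Yield consecutive slices; stands in for reading batches from a pre-shuffled table."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class RunningScaler:
    """Per-feature mean and variance, updated one batch at a time (Welford's method)."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def partial_fit(self, xs):
        for x in xs:
            self.n += 1
            for j, v in enumerate(x):
                delta = v - self.mean[j]
                self.mean[j] += delta / self.n
                self.m2[j] += delta * (v - self.mean[j])

    def transform(self, xs):
        # Fall back to 1.0 for constant features to avoid dividing by zero.
        std = [math.sqrt(m2 / self.n) or 1.0 for m2 in self.m2]
        return [[(v - m) / s for v, m, s in zip(x, self.mean, std)] for x in xs]


class MiniBatchLogReg:
    """Logistic regression trained with one gradient step per batch."""

    def __init__(self, n_features, lr):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

    def log_loss(self, xs, ys):
        eps = 1e-12  # keeps log() finite for saturated predictions
        total = 0.0
        for x, y in zip(xs, ys):
            p = self.predict_proba(x)
            total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
        return total / len(ys)

    def partial_fit(self, xs, ys):
        grad_w, grad_b = [0.0] * len(self.w), 0.0
        for x, y in zip(xs, ys):
            err = self.predict_proba(x) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        self.b -= self.lr * grad_b / len(ys)
        self.w = [w - self.lr * g / len(ys) for w, g in zip(self.w, grad_w)]


def train_in_batches(rows, n_features, batch_size=100, lr=0.5):
    """Each row is [feature_1, ..., feature_n, label]; returns model, scaler, per-batch losses."""
    scaler, model, losses = RunningScaler(n_features), MiniBatchLogReg(n_features, lr), []
    for batch in iter_batches(rows, batch_size):
        xs, ys = [r[:-1] for r in batch], [r[-1] for r in batch]
        scaler.partial_fit(xs)
        xs = scaler.transform(xs)
        losses.append(model.log_loss(xs, ys))  # loss on the batch before learning from it
        model.partial_fit(xs, ys)
    return model, scaler, losses


def make_synthetic_rows(n_rows, seed=0):
    """Toy two-feature binary data, used only to make the sketch runnable."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(5.0, 3.0)
        rows.append([x1, x2, 1 if x1 + (x2 - 5.0) / 3.0 > 0 else 0])
    return rows


rows = make_synthetic_rows(5000)
random.Random(1).shuffle(rows)  # mirrors the "shuffled indices" step
model, scaler, losses = train_in_batches(rows, n_features=2)
for i in range(0, len(losses), 10):
    print(f"batch {i:3d}  log loss {losses[i]:.3f}")
```

The `losses` list is what you would plot: each value is computed on a batch before the model learns from it, so a curve that trends downward suggests the algorithm is learning, while a flat or rising one is a cue to revisit the learning rate, the features, or the shuffling.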













answered Mar 31 at 16:44 by Shamit Verma