Sampling Big Data for Predictive Analytics in Python
In practice, how does one go about sampling from a big data set (e.g., roughly 50 million distinct observations) to perform ML using Python?
Most non-parametric models (e.g., SVMs, ensemble models) start to strain computing resources with much smaller sets (e.g., 200 features, 200K observations).
How is this done in practice in industry?
Other questions here touch on this but don't ask it explicitly, so this is not a duplicate.
Thanks in advance.
machine-learning bigdata sampling
asked Mar 31 at 15:03 by Windstorm1981
1 Answer
This is what I do in projects:
- Pre-process the data in the DB / data lake. The aim is to:
- A. Form batches (this might require a new table with shuffled indices)
- B. Create a copy with normalization and other feature-related tasks applied
After this, try algorithms that support batch learning. Neural networks support batch learning, as do a few other algorithms (https://sklearn.org/modules/scaling_strategies.html#incremental-learning).
With batch learning, you can plot the loss for each batch and see whether the algorithm is working.
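The workflow above could be sketched roughly as follows. This is a minimal illustration under several assumptions, not the exact pipeline from the answer: in-memory NumPy arrays stand in for the DB / data lake, the helper names `iter_batches` and `train_in_batches` are hypothetical, the task is binary classification, and scikit-learn's `SGDClassifier` (a linear model trained with hinge loss) is swapped in as one example of an estimator that supports `partial_fit`. Each batch is scored before the model trains on it, so the recorded loss is an out-of-sample estimate.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler


def iter_batches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs in a freshly shuffled order.

    Assumption: the in-memory arrays stand in for reading batches from a
    DB table keyed by shuffled indices.
    """
    order = rng.permutation(len(X))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


def train_in_batches(X, y, batch_size=500, seed=0):
    """Binary classification only: return (model, scaler, per-batch losses)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)

    # Pass 1: accumulate normalization statistics one batch at a time.
    scaler = StandardScaler()
    for X_batch, _ in iter_batches(X, y, batch_size, rng):
        scaler.partial_fit(X_batch)

    # Pass 2: incremental training with the default hinge loss.
    model = SGDClassifier(random_state=seed)
    losses = []
    for X_batch, y_batch in iter_batches(X, y, batch_size, rng):
        X_scaled = scaler.transform(X_batch)
        if hasattr(model, "coef_"):
            # Score the batch BEFORE training on it (mean hinge loss).
            y_signed = np.where(y_batch == classes[1], 1.0, -1.0)
            margins = y_signed * model.decision_function(X_scaled)
            losses.append(float(np.maximum(0.0, 1.0 - margins).mean()))
        model.partial_fit(X_scaled, y_batch, classes=classes)
    return model, scaler, losses


if __name__ == "__main__":
    # Synthetic data, purely for illustration.
    demo_rng = np.random.default_rng(1)
    Z = demo_rng.normal(size=(5000, 5))
    labels = (Z[:, 0] + Z[:, 1] > 0).astype(int)
    features = Z * np.array([1.0, 10.0, 100.0, 0.1, 5.0]) + 50.0
    _, _, batch_losses = train_in_batches(features, labels)
    print(batch_losses)
```

The returned `losses` list can be passed to any plotting library; a curve that trends downward and then flattens suggests the algorithm is learning, while a flat or rising curve suggests it is not.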
answered Mar 31 at 16:44 by Shamit Verma