How important is the input data for a ML model?
For the last 4-6 weeks I have been learning and working with ML for the first time: reading blogs, articles, documentation, etc. and practising, and I have asked a lot of questions on Stack Overflow as well.
While I have gained some hands-on experience, I still have a very basic doubt:
When I train on an input data set of 1,000 records, the model's prediction accuracy is, say, 75%. When I use 50,000 records, the accuracy drops to 65%.
1) Does that mean the model's behaviour depends entirely on the input data being fed into it?
2) If #1 is true, then in the real world, where we don't control the input data, how will the model work?
Example: for suggesting products to a customer, the input data to the model would be the customers' past purchases. As the quantity of input data increases, will the prediction accuracy increase or decrease?
Please let me know if I need to add further details to my question.
Thanks.
Edit 1 - Added the frequency distribution of my input data below:
Edit 2 - Added the confusion matrix and classification report:
Tags: machine-learning, predictive-modeling, machine-learning-model
Asked Mar 18 at 22:31 by ranit.b; edited 2 days ago
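For reference, a minimal, self-contained sketch of how the frequency distribution, confusion matrix and classification report referenced in the edits above are typically produced with pandas and scikit-learn. The data and the classifier here are synthetic placeholders, not the question's actual dataset or model:

```python
# Hypothetical sketch: producing a class frequency distribution, a confusion
# matrix and a classification report. Data is synthetic, with deliberately
# imbalanced classes, to mirror the kind of output described in the edits.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# Frequency distribution of the target classes (cf. Edit 1)
print(pd.Series(y).value_counts())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix and per-class precision/recall (cf. Edit 2)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```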
Robin Nicole (Mar 18 at 23:00): It looks like your model overfits. Did you try a train/test split?

ranit.b (Mar 18 at 23:07): Thanks Robin. Yes, I have a 75/25 split. Just out of curiosity, may I ask what hint made you think that the model overfits? P.S. I added the frequency distribution of my input data to the question.

Robin Nicole (Mar 18 at 23:15): Then I guess it is your test accuracy which decreases. If your training accuracy keeps increasing but your test accuracy decreases, it means your model is overfitting.

I_Play_With_Data (2 days ago): Would you please consider moving this as a comment on the original question? It works much better that way than as a standalone answer.

ranit.b (2 days ago): Just added the confusion matrix and classification report. Do you think this is a case of class imbalance (where the output classes with less training data are never predicted)?
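Following up on the comments, a hedged sketch of the check Robin Nicole describes: compare accuracy on the 75% training split against accuracy on the held-out 25% test split. The dataset and the decision tree below are illustrative assumptions, not the asker's actual setup:

```python
# Sketch of the overfitting check discussed in the comments: compare accuracy
# on the training split against accuracy on the held-out test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)   # the 75/25 split mentioned above

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2%}, test accuracy: {test_acc:.2%}")
# A large gap (e.g. ~100% train vs ~65% test) points to overfitting;
# low accuracy on both splits points to underfitting.
```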
2 Answers
To answer your first question: the accuracy of a model depends heavily on the "quality" of the input data. Your training data should represent the same scenario as the environment in which the final model will be deployed.

There are two possible reasons for the behaviour you describe:

1. When you added more data, there may be no good relationship between the input features and the label for the new examples. It is often said that a small, clean dataset is better than a large, messy one.

2. If the 49,000 records added afterwards come from the same distribution as the original 1,000 (i.e. the relationship between features and label still holds), there are again two possibilities:

A. If accuracy is low on the training set as well as the test set (e.g. training accuracy 70% and test accuracy 65%), you are underfitting: the model is too simple for the data, or the features do not carry enough signal.

B. If your training accuracy is near 100% while test accuracy is 65%, you are overfitting: the model is too complex, so you should use a simpler algorithm or regularize it.

Note: since you haven't mentioned the training accuracy, it is difficult to say which of the two is happening.

Now for your second question about real-world deployment. There is a problem called model staleness, which is the tendency of a model's accuracy to degrade over time. There is an article by a product manager at Google explaining the staleness problem and how it can be addressed; that should answer your second question.

Let me know if something is not clear.

– Sagar Shelke (new contributor, answered 2 days ago)
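To illustrate the A/B distinction in this answer (this sketch is an editorial addition, not part of the original post), one can sweep a model-complexity parameter and watch where training and validation accuracy diverge. A decision tree's `max_depth` on synthetic data is used purely as an example:

```python
# Illustration of the underfitting/overfitting distinction above:
# sweep a model-complexity parameter and compare train vs. validation accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, flip_y=0.1, random_state=0)

depths = np.arange(1, 21)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.2%}  validation={va:.2%}")
# Small depths: both scores low (case A, underfitting).
# Large depths: train approaches 100% while validation stalls or drops
# (case B, overfitting).
```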
ranit.b (2 days ago): Many thanks, Sagar, great advice. Let me reconsider my model from your viewpoint, and I will go through the article you posted later tonight.
There is a myth that more data automatically means better classification. The model also needs to grow in complexity accordingly; otherwise the model just ends up overfitting the data.

Taking a well-chosen random sample of the data is often a better strategy for training a model than feeding in every bit of data we can find.

– abunickabhi (answered 2 days ago)
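As a rough way to check whether more data actually helps a fixed model (again an illustrative addition, not part of the original answer), scikit-learn's `learning_curve` reports training and validation accuracy as the training set grows. The dataset and logistic regression below are assumptions:

```python
# Sketch: how accuracy evolves as the training set grows, to test the
# "more data = better model" assumption for a fixed model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20_000, flip_y=0.2, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.05, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_train={n:6d}  train={tr:.2%}  validation={va:.2%}")
# If the validation score plateaus early, extra records add little;
# if it keeps climbing, the model can still benefit from more data.
```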
ranit.b (2 days ago): Thanks for sharing your thoughts. I guess I need to understand the overfitting concept more thoroughly.