Choosing k value in KNN classifier?




I'm working on a classification problem and decided to use a KNN classifier for it.



If k=131 gave me an AUC of 0.689 and k=71 gave me an AUC of 0.682, what should my ideal k be?



Does choosing a higher k mean more usage of computational resources? If so, can I go with k=71, or should I always use the k with the maximum score, no matter what?










machine-learning k-nn






asked Apr 8 at 18:36 by user214
  • So, are you calculating AUC using cross-validation? – pythinker, Apr 8 at 19:16










  • @pythinker Yes. – user214, Apr 8 at 19:26
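To make the comment thread concrete, here is a minimal sketch of the cross-validated AUC search over a grid of k values with scikit-learn. The synthetic dataset and the grid are illustrative assumptions, not the asker's setup:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder binary-classification data; substitute your own X, y.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    for k in [11, 31, 71, 131]:  # hypothetical grid; includes the asker's values
        knn = KNeighborsClassifier(n_neighbors=k)
        # Mean AUC across 5 folds for this k.
        auc = cross_val_score(knn, X, y, cv=5, scoring="roc_auc").mean()
        print(f"k={k:>3}  mean CV AUC = {auc:.3f}")

Whether a jump from 0.682 to 0.689 justifies the larger k then becomes a question of whether the gain exceeds the fold-to-fold noise of the CV estimate.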
















2 Answers

























Because kNN is a non-parametric method, the computational cost of choosing k depends heavily on the size of the training data. If the training set is small, you can freely choose the k that achieves the best AUC on the validation set. If you have a large training set, choosing a large k can lead to a huge computational cost, which shows up as slow prediction on the test data.
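A rough sketch of how one might check this prediction-time effect; the dataset here is synthetic and the timings are machine-dependent, so treat it as an experiment rather than a benchmark:

    import time
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder "large-ish" training set; substitute your own data.
    X, y = make_classification(n_samples=50000, n_features=30, random_state=0)

    for k in (5, 131):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        start = time.perf_counter()
        knn.predict(X[:2000])  # time prediction on 2000 rows
        print(f"k={k:>3}: {time.perf_counter() - start:.2f} s to predict 2000 rows")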






answered Apr 8 at 19:39 – pythinker





















  • Do 100k rows and 8000 features qualify as big training data? Also, choosing high k values means we are underfitting; how can I know that I'm not underfitting when choosing a high k? – user214, Apr 8 at 19:44







  • Yes, that's actually a big training dataset. To ensure that you are not underfitting or overfitting, check the performance of your model on the training and validation datasets simultaneously. If the training score is low, you are underfitting. If the training score is much higher than the validation score, you are overfitting. The best case is when the training and validation scores are close. – pythinker, Apr 8 at 20:02
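A minimal sketch of that train-versus-validation check, assuming a binary target and scikit-learn; the dataset, split, and k=131 are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=131).fit(X_tr, y_tr)
    # AUC on the data the model was fit on vs. on held-out data:
    # both low -> underfitting; large gap -> overfitting.
    train_auc = roc_auc_score(y_tr, knn.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_val, knn.predict_proba(X_val)[:, 1])
    print(f"train AUC = {train_auc:.3f}   validation AUC = {val_auc:.3f}")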




















I was taught that the best way is to compute the error for each k, plot the errors, and look for the "elbow" in the plot.






answered Apr 8 at 18:40 – Stephen Ewing





















  • So I should go with k=131? – user214, Apr 8 at 18:46










  • It really depends. The higher your k, the higher your chance of overfitting. So if you try every k from 2 to 200 and plot the error for all of them, use the k where the curve starts to flatten out. – Stephen Ewing, Apr 8 at 18:48
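A minimal sketch of that elbow plot, assuming scikit-learn and matplotlib; the synthetic data and the step size through k are illustrative choices:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Cross-validated error rate (1 - accuracy) for each k in the sweep.
    ks = range(2, 201, 6)
    errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X, y, cv=5).mean()
              for k in ks]

    plt.plot(list(ks), errors, marker="o")
    plt.xlabel("k")
    plt.ylabel("cross-validated error rate")
    plt.title("Look for the elbow where the curve flattens")
    plt.show()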










