When should you balance a time series dataset?


I'm training a machine learning model to classify up/down trends in a time series, and my dataset is imbalanced (one trend class outnumbers the other). It seems necessary to balance the data, since the algorithm could otherwise learn a bias towards a particular trend, but balancing comes at the cost of a non-representative dataset. Should I balance my data? And if so, is random undersampling the right method?







Comments:

  • Which types of models do you use? Some models are less sensitive to imbalanced datasets.
    – Omri374, Feb 24 '18 at 20:41

  • @Omri374: I'm testing an LSTM network, SVM, and Random Forest classifier.
    – Jonathan Shobrook, Feb 25 '18 at 5:28

  • For SVMs and Random Forests, are you using a sliding window to create samples? If so, you can then perform sampling on the created windows.
    – Omri374, Feb 25 '18 at 9:07

  • For LSTMs, you could tweak the loss function. See here: stats.stackexchange.com/questions/197273/… and stackoverflow.com/questions/35155655/…
    – Omri374, Feb 25 '18 at 9:13

  • You might want to read this paper
    – iso_9001_, Mar 14 at 14:09















machine-learning classification time-series unbalanced-classes class-imbalance














asked Feb 22 '18 at 18:10 by Jonathan Shobrook; edited Feb 22 '18 at 18:58
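The sliding-window sampling suggested in the comments, together with the random undersampling asked about in the question, can be sketched roughly as follows. This is only a minimal illustration using the standard library: the toy `series`, the window width, the 75/25 chronological split, and the helper names `make_windows`, `undersample`, and `class_weights` are all assumptions made for the example, not anything taken from the poster's setup. The sketch balances only the training windows and leaves the later test windows at their natural class ratio, so evaluation still reflects the real distribution.

```python
import random
from collections import Counter


def make_windows(series, width):
    """Turn a series into (window, label) samples; label 1 means the next value is higher."""
    samples = []
    for i in range(len(series) - width):
        window = series[i:i + width]
        label = 1 if series[i + width] > series[i + width - 1] else 0
        samples.append((window, label))
    return samples


def undersample(samples, seed=0):
    """Randomly drop samples until every class has as many as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample in samples:
        by_class.setdefault(sample[1], []).append(sample)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced


def class_weights(samples):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(label for _, label in samples)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Hypothetical toy series that rises most of the time.
series = [1, 2, 3, 4, 3, 5, 6, 7, 6, 8, 9, 10, 9, 11, 12, 13]
samples = make_windows(series, width=3)

# Split chronologically first, so no future window ends up in the training set.
split = int(len(samples) * 0.75)
train, test = samples[:split], samples[split:]

# Option 1: balance the training windows only.
balanced_train = undersample(train)

# Option 2: keep every training window and weight the classes in the loss instead.
weights = class_weights(train)

print(Counter(label for _, label in train))
print(Counter(label for _, label in balanced_train))
print(weights)
```

Whether to undersample or to reweight is a judgment call: undersampling discards data, while class weights keep every window but require a model or loss that accepts per-class weights.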

















1 Answer












If you can change the loss function of the algorithm, that will be very helpful, and as a result you won't need to downsample your data. Many useful metrics have been introduced for evaluating the performance of classification methods on imbalanced datasets. Some of them are Kappa, CEN, MCEN, MCC, and DP.

Disclaimer:

If you use Python, the PyCM module can help you compute these metrics.

Here is a simple example that gets the recommended parameters from this module:

>>> from pycm import *
>>> cm = ConfusionMatrix(matrix={"Class1": {"Class1": 1, "Class2": 2}, "Class2": {"Class1": 0, "Class2": 5}})
>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

After that, any of these parameters that you want to use as the loss function can be used as follows:

>>> y_pred = model.predict(X)  # predictions of the implemented model (X stands for its input features)
>>> y_actu = data.target  # data labels
>>> cm = ConfusionMatrix(y_actu, y_pred)
>>> loss = cm.Kappa  # or any other parameter (example: cm.SOA1)




















answered Mar 14 at 14:47 by Alireza Zolanvari; edited 2 days ago
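To make two of the metrics named in the answer concrete, here is a small standard-library sketch that computes accuracy, Cohen's kappa, and MCC for the binary case directly from confusion-matrix counts. It is an illustration only and not PyCM's implementation: the function name `binary_scores` is made up for the example. The counts reuse the answer's example matrix, read with Class1 as the positive class and the outer keys as the actual labels (TP=1, FN=2, FP=0, TN=5), and are compared against a hypothetical model that always predicts Class2.

```python
import math


def binary_scores(tp, fn, fp, tn):
    """Accuracy, Cohen's kappa, and MCC from binary confusion-matrix counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the actual and predicted class totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: MCC is reported as 0 when a row or column total is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "kappa": kappa, "mcc": mcc}


# Counts from the answer's example matrix (Class1 treated as positive).
scores = binary_scores(tp=1, fn=2, fp=0, tn=5)

# Hypothetical baseline that always predicts the majority class (Class2).
baseline = binary_scores(tp=0, fn=3, fp=0, tn=5)

print(scores)
print(baseline)
```

The baseline still reaches 62.5% accuracy while its kappa and MCC are both 0, which is why these metrics are preferred on imbalanced data. Note that they are computed from hard predictions and are not differentiable, so with gradient-trained models such as an LSTM they are more commonly used for model selection or early stopping than as the training loss itself.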


























