How to use Machine Learning to discover important biomarkers in an unbalanced small data set


I am just starting out on this project, and I am only just learning machine learning and statistics, so I am somewhat unsure which approaches would be good to start with. I am sorry if this does not belong here.



The data set consists of patients carrying a certain disease; for each patient there are various biomarkers and physical measurements, such as heart rate, taken at different time points up until death, if the patient does die. I was told that the goal is to identify the key features associated with a patient dying.



I only have 33 patients, and only 16 of them have died. However, disregarding which patient each measurement came from, I have roughly 300 time slots. I first tried to approach this as a binary classification problem, classifying the 'death' time point against the other points. The problems were:



  1. the class imbalance, and

  2. how to interpret the models to discover the most important features.

For the imbalance, I tried SMOTE oversampling, which didn't work as well as I had hoped; I then randomly under-sampled the majority class, which gave decent results, but the data set became even smaller, so I wasn't sure it was a good idea.
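For reference, here is a minimal sketch of the two resampling-related options I mean, on synthetic placeholder data (not my real data set), using only NumPy and scikit-learn: class weighting, which discards no data, versus the random undersampling I tried.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))            # ~300 time slots, 5 measurements
y = (rng.random(300) < 0.1).astype(int)  # ~10% positive ('death') points

# Option 1: class weighting -- reweights the loss, discards no data.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: random undersampling of the majority class to a 1:1 ratio.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
clf_under = LogisticRegression().fit(X[idx], y[idx])
```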



Simple binary classification models such as Gaussian Naive Bayes and logistic regression did okay even with the imbalanced data, but (at least as far as I know) they don't give a way to discern feature importance.
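One commonly suggested workaround I have seen: after standardizing the features, the magnitudes of the logistic-regression coefficients can serve as a rough importance measure. A minimal sketch on synthetic placeholder data (not my real data set), where feature 2 is constructed to drive the label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# Make feature 2 the one that actually drives the label.
y = (X[:, 2] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_std, y)
importance = np.abs(clf.coef_[0])        # |coefficient| per feature
ranking = np.argsort(importance)[::-1]   # most to least influential
```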



So my main questions are:



  1. What is the best way to approach this problem? In general, what kinds of approaches work when you want to identify the most influential features (measurements)?


  2. If I do approach it as a binary classification problem, what can I do to combat the class imbalance?










  • I think you need to be a bit more specific about what your question is, or it will be difficult to provide you with helpful answers. – oW_, Apr 1 at 20:56










  • You are going to have a hard time getting solid results for testing biomarkers with sample sizes that low unless the effect sizes are huge. The number of different measurements doesn't really increase statistical power very much. You might want to look at prior research on "severity of illness" to get an idea of what has already been discovered. Quite a bit has already been done in the intensive-care literature using various techniques. – 42-, Apr 2 at 0:13

















Tags: machine-learning, data-mining






edited Apr 2 at 1:30 by Stephen Rauch










asked Apr 1 at 20:06 by Infinity


















1 Answer
If your goal is to identify important features, I would go for a decision tree, which inherently estimates the importance (separation capability) of each feature when selecting features to split its internal nodes. You can also use an ensemble of decision trees such as a random forest, which reports feature importances based on each feature's average impurity reduction across all of its trees.



This article can help you set up a basic experiment.
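A minimal sketch of the random-forest route on synthetic placeholder data (not the asker's data), where feature 0 is constructed to be the informative one; `feature_importances_` is scikit-learn's mean-impurity-reduction score:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)  # feature 0 is the informative one

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_  # normalized, sums to 1
top = int(np.argmax(importances))          # index of most important feature
```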






answered Apr 1 at 22:25 by Sajid Ahmed


























