How to use Machine Learning to discover important biomarkers in an unbalanced small data set
I am just starting a project, and I am only just learning machine learning and statistics, so I am somewhat unsure which approaches are good to start with. I am sorry if this does not belong here.
The data set covers patients carrying a certain disease; for each patient there are biomarkers and physical measurements, such as heart rate, recorded at different time points up until death, if they do die. I was told the goal is to identify the key features associated with a patient dying.
I only have 33 patients, and only 16 of them have died. However, if I disregard which patient each measurement came from, I have 300-odd time slots. I first tried to approach it as a binary classification problem, separating the 'death' time point from the other points. The problems were:
- the class imbalance, and
- how to interpret the models to discover the most important features.
For the imbalance, I tried SMOTE oversampling, which didn't work as well as I had hoped. I then randomly under-sampled the majority class, which gave decent results, but the data set became even smaller, so I wasn't sure whether it was a good idea.
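As a rough illustration of the two resampling strategies mentioned above, here is a minimal sketch assuming the imbalanced-learn library; X and y are placeholder arrays standing in for the real measurements and death labels, not the actual data:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Placeholder data: ~300 time slots, 10 biomarkers, 16 rare 'death' points.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = np.zeros(300, dtype=int)
y[:16] = 1

# SMOTE: synthesise new minority samples; k_neighbors must stay smaller
# than the number of minority samples actually available.
X_sm, y_sm = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)

# Random under-sampling: discard majority samples until the classes match.
X_us, y_us = RandomUnderSampler(random_state=0).fit_resample(X, y)
```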
Simple binary classification models like Gaussian Naive Bayes and logistic regression did okay even with the imbalanced data, but they don't (at least as far as I know) give a way to discern feature importance.
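One possible workaround, sketched below under the assumption of standardized features and the scikit-learn API, is to read the magnitudes of logistic regression coefficients as a rough importance signal (X and y are again placeholders, not the real data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder data, as in the resampling sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = np.zeros(300, dtype=int)
y[:16] = 1

# Scaling first makes coefficient magnitudes comparable across features;
# class_weight="balanced" re-weights the rare class instead of resampling.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_scaled, y)

# Larger |coefficient| = stronger linear association with the 'death' class.
ranking = np.argsort(-np.abs(clf.coef_[0]))
print(ranking[:5])  # indices of the five most influential features
```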
So my main questions are:
What is the best way to approach this problem, or in general, what kinds of approaches work when you want to identify the most influential features (data measurements)?
If I do want to approach it as a binary classification problem, what approaches can I take to combat class imbalance?
machine-learning data-mining
asked Apr 1 at 20:06 by Infinity, edited Apr 2 at 1:30 by Stephen Rauch♦
– oW_♦ (Apr 1 at 20:56): I think you need to be a bit more specific about what your question is, or it will be difficult to provide you with helpful answers.
– 42- (Apr 2 at 0:13): You are going to have a hard time getting solid results for testing biomarkers with sample sizes that low unless the effect sizes are huge. The number of different measurements doesn't really increase statistical power very much. You might want to look at prior research on "severity of illness" to get an idea of what has already been discovered. There is quite a bit already done in the intensive-care literature using various techniques.
1 Answer
If your goal is to identify important features, I would say go for a decision tree, which inherently calculates the importance/separation capability of each feature when selecting it to split the internal nodes. You can also go for an ensemble of decision trees such as a random forest, which returns feature importances based on the average impurity reduction across all of its trees.
This article can help you set up a basic experiment.
answered Apr 1 at 22:25 by Sajid Ahmed
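A minimal sketch of the random-forest route described in this answer, assuming scikit-learn; the feature matrix and labels below are placeholders rather than the real patient data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: ~300 time slots, 10 biomarkers, 16 rare 'death' points.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = np.zeros(300, dtype=int)
y[:16] = 1

# class_weight="balanced" re-weights the rare class instead of resampling.
forest = RandomForestClassifier(
    n_estimators=500, class_weight="balanced", random_state=0
).fit(X, y)

# feature_importances_ is the mean impurity decrease per feature,
# averaged over all trees; higher means more important.
for idx in np.argsort(-forest.feature_importances_)[:5]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```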