How to approach a machine learning problem? [closed]



I'm a beginner in machine learning with no real statistical background (just basic knowledge). I understand about half of what is said on forums about statistical methods, normalizing data, and plotting data distributions. Still, I managed to build several predictive models by following walkthroughs from people on Kaggle. Now I would like some guidance on the steps that must be taken while building a model. I'll use a project I worked on to demonstrate my steps; if you spot a beginner's mistake (and you will), please point it out.



I started by viewing my data (combine is the test set plus the train set, minus the target column to predict):



print(combine.isnull().sum())


age                   0
cmp               42855
code_dept           654
id_opportunite        0
mca                4539
nb_enfants         1624
nom_organisme         0
situation            58

Okay, the potential predictors have very different numbers of nulls!



NOTE:

situation: string; the person's social situation.

nb_enfants: int; the number of children.

cmp: string; the name of the company the person was previously contracted with (not information people commonly give over a phone call, which explains the huge number of nulls in that column).

nom_organisme: string; the name of the call center that referred this person to us (might be a good predictor; according to my plots, some centers send people who are statistically more likely to sign than others).

age: int; no explanation needed, I suppose.

code_dept: int; department IDs. A department is a slice of a big town, so these are location IDs. (It should logically be a good predictor, since the chance of signature is higher in some departments than in others.)



Question 1: Is there a conclusion to draw from this? Should a predictor like cmp be eliminated? Is there a must-do step, like replacing the nulls or predicting them?
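For illustration, here is a minimal pandas sketch of the usual options; the toy frame and the fill values are placeholders, not my real data:

import numpy as np
import pandas as pd

# Tiny stand-in for `combine`; the real frame has the same columns.
combine = pd.DataFrame({
    'age': [34, 51, 28],
    'cmp': [np.nan, 'ACME', np.nan],
    'nb_enfants': [2, np.nan, 0],
    'situation': ['marie', np.nan, 'celibataire'],
})

# Drop a column that is mostly missing (cmp is ~65% null in the real data).
combine = combine.drop(columns=['cmp'])

# Median-fill the numeric column, placeholder-fill the categorical one,
# so no rows are lost.
combine['nb_enfants'] = combine['nb_enfants'].fillna(combine['nb_enfants'].median())
combine['situation'] = combine['situation'].fillna('inconnue')

print(combine.isnull().sum())  # all zeros now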



These are most of the features I'll be using. I want to predict whether a person will sign with the company or not. I omitted the sign column above, but it is a binary value.



Here is my output distribution:




Signature: 11674 (17.7 percent), Non Signature: 54250 (82.3 percent), Total: 65924




So I have two output classes, 0 and 1, and they are imbalanced: 17.7% to 82.3%!

Question 2: Should I use SMOTE in this case, or not? And are there other conclusions to draw from this?
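In case SMOTE is the answer, here is a minimal sketch with imbalanced-learn on synthetic stand-in data with roughly the same 82/18 split; as I understand it, the resampling should be applied to the training split only, never to the test set:

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same 82/18 imbalance.
X, y = make_classification(n_samples=5000, weights=[0.82], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample the minority class on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_res))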



I started by removing what you call outliers, after making sure they were random noise and mistakes in the data capture rather than a pattern that needed to be taught to the model.



Then I visualized the columns one by one and noted their behavior against the sign column. Here are some examples:



[Plot: signature rate per nb_enfants]



It shows the percentage of people who signed (yellow) and who did not (blue) per number of children (from 0 to 5).



I also figured a value_counts was necessary:



df3.nb_enfants.value_counts()


[Output: value counts of nb_enfants]



Question 3: Are there other conclusions to draw from this plot, or other ways of extracting information from this candidate feature? And can I decide just by looking at this plot whether nb_enfants is a good predictor or not?
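One numeric complement to the plot, sketched on toy data and assuming df3 also holds the binary signer column: put the signature rate per group next to the group size, since a rate computed from a handful of rows means little.

import pandas as pd

# Tiny stand-in for df3; the real frame has ~66k rows.
df3 = pd.DataFrame({
    'nb_enfants': [0, 0, 0, 1, 1, 2, 2, 3, 4, 5],
    'signer':     [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Signature rate per number of children, next to the group size.
print(df3.groupby('nb_enfants')['signer'].agg(rate='mean', count='size'))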



Then I went on and 'manually' picked the features I liked (that's how I roll).

Just kidding: I picked the features that would logically have an impact on the prediction. (I know there are ways to spot features invisible to the naked eye, but I don't think I'm at that statistical level of displaying data and extracting information from it yet; any advice on how to do that is welcome.)

Question 4: What is a better indicator of a good predictor in my case?



I also manually encoded age into 7 classes and situation (which contains strings, so it needed encoding anyway) into 6 classes, and I created two features from code_dept (where the person lives) and nom_organisme (which center referred the person to us), assigning a class to each code_dept or nom_organisme based on the percentage of people who signed.
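For illustration, the same two ideas expressed with pandas instead of hand-written rules; the bin edges and the toy rows are placeholders, not my real ones, and the rate encoding is fit on training data only to avoid leaking the target:

import pandas as pd

train = pd.DataFrame({
    'age': [23, 35, 47, 62, 29, 51],
    'code_dept': [75, 13, 75, 69, 13, 75],
    'signer': [0, 1, 0, 1, 0, 1],
})

# Bin age into classes (these edges are illustrative).
train['age_classe'] = pd.cut(train['age'],
                             bins=[0, 25, 35, 45, 55, 65, 120], labels=False)

# Signature-rate encoding for code_dept, fit on TRAINING data only
# (fitting it on all the data would leak the target).
rates = train.groupby('code_dept')['signer'].mean()
train['classe_dep'] = train['code_dept'].map(rates)
print(train[['age_classe', 'classe_dep']])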



I didn't like dummy variables because I'm deploying this model later on a server, as a web service, and I need to encode the 'people' sent over requests the same way I encoded my training dataset. The only way I managed to do that was static encoding via IFs in a function that I apply to each incoming 'person'.

Question 5: I know static encoding is bad for maintenance and updates in general. Is there a way to do that web-service transformation without static encoding? And is it recommended at all to use static encoding on every feature?
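What I have seen suggested instead of hard-coded IFs is to fit the encoding once, persist the fitted object next to the model, and load it in the web service; a minimal scikit-learn/joblib sketch on a toy column (handle_unknown='ignore' keeps unseen categories from crashing the service):

import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_cat = pd.DataFrame({'situation': ['marie', 'celibataire', 'veuf']})

# Fit the encoding once, at training time, and persist the fitted object.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_cat)
joblib.dump(enc, 'situation_encoder.joblib')

# In the web service: load and apply the SAME fitted transform.
enc = joblib.load('situation_encoder.joblib')
row = pd.DataFrame({'situation': ['divorce']})  # unseen category -> all-zero row
print(enc.transform(row).toarray())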



This is a sample of a training row. [Screenshot of the row omitted.]

Question 6: Any remarks on the data?



I set up a correlation matrix for some of the features like this:



import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 14))
ax = sns.heatmap(train.drop(['id_opportunite', 'mca', 'code_dept',
                             'nom_organisme'], axis=1).corr(),
                 vmax=0.6, square=True, annot=True)


[Heatmap: correlation matrix]



Question 7: What conclusions should one draw from this matrix?



I went with:

X = train[['age', 'classe_dep', 'situation', 'nb_enfants', 'classe_organisme']]
y = train.loc[:, 'signer']



Result of the decision tree:

The maximum score of the Decision Tree algorithm is 84.11%, for max_leaf_nodes = 52.

Result on the test set: 75% accuracy.
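For reference, a sketch of how such a search could be run, on synthetic stand-in data; max_leaf_nodes is chosen by cross-validation on the training split so the test set stays untouched until the final score:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and the signer target.
X, y = make_classification(n_samples=5000, weights=[0.82], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Choose max_leaf_nodes by cross-validation on the training split only.
scores = {n: cross_val_score(DecisionTreeClassifier(max_leaf_nodes=n,
                                                    random_state=0),
                             X_train, y_train, cv=5).mean()
          for n in range(2, 80)}
best = max(scores, key=scores.get)
tree = DecisionTreeClassifier(max_leaf_nodes=best, random_state=0)
tree.fit(X_train, y_train)
print(best, tree.score(X_test, y_test))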



Question 8: Any general thoughts or remarks? Thanks!



An answer should verify my steps and give remarks on one or more of my questions. Thanks!










machine-learning python predictive-modeling supervised-learning

asked Mar 25 at 9:55 by Blenzus, edited Mar 26 at 9:44

closed as too broad by Dan Carter, Simon Larsson, Siong Thye Goh, Mark.F, oW_ Mar 25 at 18:27

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.

















  • I'm not a big fan of clustering numerical variables such as age into groups; I prefer to let the tree decide which splits are best. For the rest, you didn't provide many details, but I don't see anything clearly wrong. Why did you create the correlation matrix? – VD93, Mar 25 at 11:30

  • To see how the potential predictors interact with each other and with the prediction column. What other details would you rather see? I can always edit my question for clarity. Thanks. – Blenzus, Mar 25 at 11:38

  • Hi Blenzus, and welcome! This question is way too broad. If you could split it up into a more specific, detailed question, you're much more likely to get a useful answer. Asking "what am I doing wrong?" with your entire project is unlikely to get a response. To add to this: if you break your problem down into smaller, specific questions, I'm sure you'll find the answers to those questions already on this site. Good luck :) – Dan Carter, Mar 25 at 12:01

  • Thanks for the input, I will! – Blenzus, Mar 25 at 13:19















1 Answer

Plenty of questions there; I will answer the one about accuracy.

75% is better than random chance and might be useful, but note that with 82.3% of cases in the majority class, always predicting "no signature" would already score higher, so compare against that baseline too. Above all, you need to consider what is relevant for your application.

For example, suppose you are dealing with a security issue: denying access to someone who is entitled to it is less damaging than allowing access to someone who is not.

And if you want to reduce the number of phone calls made to sell a product, you want a model that tells you to call the maximum number of potential clients; even a 10% reduction in useless calls makes for a good, profitable model, as long as it isn't absurdly expensive to keep operating.
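To make those trade-offs visible, it may help to look at the confusion matrix and per-class precision/recall rather than accuracy alone; a minimal sketch with toy labels standing in for the model's real output:

from sklearn.metrics import classification_report, confusion_matrix

# Toy labels and predictions; substitute the model's real output.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=['non-signature', 'signature']))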






answered Mar 25 at 15:27 by Pedro Henrique Monforte, edited Mar 25 at 18:27 by Blenzus
