
How to approach a machine learning problem? [closed]





I'm a beginner in machine learning with no real statistical background (just basic knowledge). I understand about half of what is said on forums about statistical methods, techniques for normalizing data, and plotting data distributions. Still, I managed to create several predictive models by following other people's steps on Kaggle. Now I would like some guidance on the steps that must be taken while building a model. I'll use an example I worked on to demonstrate my steps; if you spot a beginner's mistake (and you will), please point it out.



I started by viewing my data (combine is the test set plus the train set, minus the target column to predict):



print(combine.isnull().sum())


age                  0
cmp              42855
code_dept          654
id_opportunite       0
mca               4539
nb_enfants        1624
nom_organisme        0
situation           58

Okay, some potential predictors have quite a few nulls!



NOTE:

situation: string; the person's social situation.

nb_enfants: int; the number of children.

cmp: string; the name of the company the person is currently contracted with (not a common piece of information to get from a person over a phone call, which explains the huge number of nulls in that column).

nom_organisme: string; the name of the call center that referred the person to us (might be a good predictor: according to my plots, some centers statistically send people who are more likely to sign than others).

age: int; no explanation needed, I suppose.

code_dept: int; department IDs. A department is a slice of a big town, so these are IDs for locations. (It should logically be a good predictor, since the chance of a signature is higher in some departments than in others.)



Question 1: Is there a conclusion to draw from this? Should a predictor like cmp be eliminated? Is there a must-do step, such as replacing the nulls or predicting them?
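For reference, this is the kind of simple imputation I could apply (just a sketch; the fill values are illustrative choices I have not validated):

    # Drop the column that is mostly null and hard to collect at prediction time
    combine = combine.drop('cmp', axis=1)

    # Fill numeric nulls with the median, categorical nulls with a sentinel value
    combine['nb_enfants'] = combine['nb_enfants'].fillna(combine['nb_enfants'].median())
    combine['situation'] = combine['situation'].fillna('inconnue')
    combine['code_dept'] = combine['code_dept'].fillna(combine['code_dept'].mode()[0])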



These are most of the features I'll be using. I want to predict whether a person will sign with the company or not. I omitted the sign column here, but it's a binary value.



This is my output distribution, just below:




Signature: 11674 (17.7 percent), Non Signature: 54250 (82.3 percent), Total: 65924




So I have two output classes, 0 and 1, and they are imbalanced: 17.7% to 82.3%!

Question 2: Should I use SMOTE in this case, or not? And are there other conclusions to draw from this?
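For context, the SMOTE call I would try looks like this (a sketch, assuming the imbalanced-learn package and an already-encoded training split; X_train and y_train are placeholder names, and resampling should only ever touch the training data, never the test set):

    from imblearn.over_sampling import SMOTE

    # Oversample the minority class (signatures) on the training split only
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)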



I started by removing what you call outliers, after making sure they were random noise and mistakes in the data capture (yes, I checked that it was noise rather than a pattern that needed to be taught to the model).



Then I visualized the columns one by one and noted their behavior against the sign column. Here are some examples:



[bar plot: signature rate by nb_enfants]



It shows the percentage of people who signed (yellow) and those who did not (blue) per number of children (from 0 to 5).



I also figured a value_counts was necessary:



df3.nb_enfants.value_counts()


[output: value counts of nb_enfants]



Question 3: Are there other conclusions to draw from this plot, or other ways of extracting information from this candidate feature? And can I decide, just by looking at this plot, whether nb_enfants is a good predictor or not?
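One complementary check I can run (a sketch, assuming a train DataFrame that still contains the target column signer) is the raw signing rate and sample count per value:

    # Signing rate and support for each number of children
    print(train.groupby('nb_enfants')['signer'].agg(['mean', 'count']))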



Then I went on and 'manually' picked the features I liked (that's how I roll).

Just kidding: I picked features that would logically have an impact on the prediction. (I know there are ways to spot features invisible to the naked eye, but I don't think I'm at that statistical level of displaying data and extracting information from it yet; any advice on how to do that is welcome.)



Question 4: What would be a better indicator of a good predictor in my case?
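One model-agnostic option I've read about (a sketch, assuming the features are already numerically encoded into a DataFrame X with target y) is the mutual information between each feature and the target:

    from sklearn.feature_selection import mutual_info_classif

    # Higher scores mean the feature carries more information about the target
    scores = mutual_info_classif(X, y, random_state=0)
    for name, score in zip(X.columns, scores):
        print(f'{name}: {score:.4f}')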



I also manually encoded age into 7 classes and situation (which contains strings and needed encoding anyway) into 6 classes, and I created two features based on code_dept (where the person lives) and nom_organisme (which call center referred the person to our company), assigning a class to each code_dept or nom_organisme based on the percentage of people who signed.



I avoided dummy variables because I'm deploying this model on a server later, to be used as a web service, and I need to encode the "people" that get sent over requests the same way I encoded my training dataset. The only way I managed to do that is static encoding via if-statements in a function that I apply to each "person" that comes through.



Question 5: I know static encoding is bad for maintenance and updates in general. Is there a way to do that web-service transformation without using static encoding? And is it recommended at all to use static encoding on every feature?
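The alternative I'm aware of but haven't tried (a sketch, not my current setup) is to fit the encoding as part of a scikit-learn Pipeline and persist the whole fitted object, so the web service replays exactly the training-time transformation:

    import joblib
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeClassifier

    # The encoding lives inside the model object instead of hand-written if-chains
    preprocess = ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'),
          ['situation', 'classe_dep', 'classe_organisme'])],
        remainder='passthrough')
    pipe = Pipeline([('prep', preprocess),
                     ('tree', DecisionTreeClassifier(max_leaf_nodes=52))])
    pipe.fit(X, y)

    # The server loads this single file and calls pipe.predict(new_person_df)
    joblib.dump(pipe, 'signature_model.joblib')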



This is a sample of a training row:



Question 6: Any remarks on the data?



I set up a correlation matrix for some features by doing this:



import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 14))
sns.heatmap(train.drop(['id_opportunite', 'mca', 'code_dept', 'nom_organisme'], axis=1).corr(),
            vmax=0.6, square=True, annot=True)


[heatmap: correlation matrix]



Question 7: What conclusions should one draw from this matrix?



I went with:
X = train[['age','classe_dep','situation','nb_enfants','classe_organisme']]
y = train.loc[:,'signer']



Result with a decision tree:

The maximum score for the Decision Tree algorithm is 84.11%, with max_leaf_nodes = 52.

Result on the test set: 75% accuracy.
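The gap between 84% and 75% makes me wonder about overfitting; here is the cross-validation check I'm thinking of (a sketch, reusing the X and y above):

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Cross-validated accuracy for a few tree sizes, as a sanity check
    for leaves in (16, 32, 52, 128):
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
        cv_acc = cross_val_score(tree, X, y, cv=5, scoring='accuracy').mean()
        print(f'max_leaf_nodes={leaves}: CV accuracy {cv_acc:.3f}')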



Question 8: Any general thoughts or remarks? Thanks!



An answer should verify my steps and give remarks on one or more of my questions. Thanks!













closed as too broad by Dan Carter, Simon Larsson, Siong Thye Goh, Mark.F, oW_ Mar 25 at 18:27


Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.

















  • I'm not a big fan of clustering numerical variables such as age into groups; I prefer to let the tree decide the best splits. For the rest, you didn't provide many details, but I don't see anything clearly wrong. Why did you create the correlation matrix?
    – VD93
    Mar 25 at 11:30










  • To see how the potential predictors interact with each other and with the prediction column. What other details would you rather see? I can always edit my question for clarity. Thanks.
    – Blenzus
    Mar 25 at 11:38










  • Hi Blenzus, and welcome! This question is way too broad. If you split it up into more specific, detailed questions, you're much more likely to get useful answers; asking "what am I doing wrong?" about your entire project is unlikely to get a response. And if you break your problem down into smaller specific questions, I'm sure you'll find the answers to many of them already on this site. Good luck :)
    – Dan Carter
    Mar 25 at 12:01











  • Thanks for the input, I will!
    – Blenzus
    Mar 25 at 13:19















Tags: machine-learning, python, predictive-modeling, supervised-learning






asked Mar 25 at 9:55 by Blenzus; edited Mar 26 at 9:44









1 Answer



















Plenty of questions there; I will answer the one about accuracy.

75% is better than random chance and might be useful, but you need to consider what is relevant for your application.

For example, suppose you are dealing with a security issue: denying access to someone who is entitled to it is less damaging than allowing access to someone who is not.

If you want to reduce the number of phone calls needed to sell a product, you want a model that tells you to call as many potential clients as possible; even a 10% reduction in useless calls makes for a good, profitable model, as long as it is not absurdly expensive to keep operating.
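As a sketch of what I mean (assuming a held-out y_test and predictions y_pred from your tree), per-class metrics expose the cost asymmetry that plain accuracy hides:

    from sklearn.metrics import classification_report, confusion_matrix

    # Rows of the confusion matrix are true classes, columns are predictions
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred,
                                target_names=['non-signature', 'signature']))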

































answered Mar 25 at 15:27 by Pedro Henrique Monforte; edited Mar 25 at 18:27 by Blenzus












