Right Regression Model to useRegression Model for explained model(Details inside)Best regression model to use for sales predictionChose the right regression analysisMaking Use of the Target Values for RegressionIs removing poorly predicted data points a valid approach?Regression turned into classifficationLinear Model for Linear RegressionSelecting the right time series modelWhen to use an ordinal logistic regression modelCan I use Linear Regression to model a nonlinear function?

"Whatever a Russian does, they end up making the Kalashnikov gun"? Are there any similar proverbs in English?

How would 10 generations of living underground change the human body?

Why didn't the Space Shuttle bounce back into space as many times as possible so as to lose a lot of kinetic energy up there?

How do I reattach a shelf to the wall when it ripped out of the wall?

Was there a shared-world project before "Thieves World"?

Contradiction proof for inequality of P and NP?

Why did some of my point & shoot film photos come back with one third light white or orange?

Pulling the rope with one hand is as heavy as with two hands?

Read line from file and process something

How to have a sharp product image?

What term is being referred to with "reflected-sound-of-underground-spirits"?

On The Origin of Dissonant Chords

How to stop co-workers from teasing me because I know Russian?

Your bread will be buttered on both sides

How can Republicans who favour free markets, consistently express anger when they don't like the outcome of that choice?

Negative Resistance

Does a large simulator bay have standard public address announcements?

Rivers without rain

Why does Mind Blank stop the Feeblemind spell?

Critique of timeline aesthetic

I preordered a game on my Xbox while on the home screen of my friend's account. Which of us owns the game?

Is there really no use for MD5 anymore?

Mistake in years of experience in resume?

How could Tony Stark make this in Endgame?



Right Regression Model to use


Regression Model for explained model(Details inside)Best regression model to use for sales predictionChose the right regression analysisMaking Use of the Target Values for RegressionIs removing poorly predicted data points a valid approach?Regression turned into classifficationLinear Model for Linear RegressionSelecting the right time series modelWhen to use an ordinal logistic regression modelCan I use Linear Regression to model a nonlinear function?













0












$begingroup$


I am trying to predict reservation count from a dataset with few features. Features are both categorical and continuous.



The dependent variable reservations looks like below: My dataset size is around 917 obs.



array([ 1, 7, 17, 2, 2, 13, 8, 11, 9, 4, 4, 3, 5, 2, 5, 7, 3,
12, 9, 13, 5, 2, 11, 13, 14, 19, 9, 11, 3, 6, 7, 10, 1, 6,
5, 10, 8, 5, 4, 3, 2, 10, 10, 10, 8, 13, 16, 6, 4, 6, 3,
11, 10, 1, 18, 7, 2, 12, 17, 4, 2, 19, 3, 4, 17, 13, 10, 2,
10, 1, 3, 4, 20, 3, 2, 1, 3, 5, 8, 8, 4, 3, 13, 3, 3,
5, 4, 17, 7, 6, 10, 5, 3, 9, 9, 8, 1, 5, 17, 5, 10, 9,
2, 7, 13, 2, 9, 1, 15, 13, 10, 4, 2, 4, 5, 4, 3, 3, 10,
4, 7, 5, 13, 12, 7, 5, 6, 9, 5, 11, 7, 1, 4, 12, 4, 3,
11, 1, 4, 4, 3, 7, 4, 11, 4, 1, 9, 2, 10, 10, 3, 4, 4,
3, 2, 7, 10, 7, 6, 1, 3, 19, 9, 3, 8, 20, 1, 12, 9, 13,
13, 2, 9, 4, 9, 2, 5, 6, 18, 3, 6, 8, 6, 4, 5, 13, 4,
8, 9, 5, 4, 8, 5, 2, 1, 6, 8, 3, 6, 4, 2, 6, 11, 5,
1, 5, 1, 5, 11, 11, 9, 3, 12, 2, 2, 9, 19, 7, 13, 13, 9,
2, 1, 1, 4, 3, 4, 9, 1, 25, 12, 8, 5, 18, 3, 1, 6, 17,
7, 4, 6, 9, 8, 10, 3, 8, 12, 5, 4, 4, 1, 9, 21, 4, 3,
3, 7, 13, 5, 12, 8, 8, 6, 3, 6, 7, 5, 3, 7, 3, 14, 3,
5, 2, 14, 16, 3, 8, 6, 13, 9, 3, 5, 4, 9, 4, 12, 12, 4,
9, 8, 11, 5, 13, 3, 2, 5, 4, 2, 1, 8, 8, 18, 11, 2, 5,
13, 4, 1, 2, 4, 1, 2, 2, 12, 2, 6, 19, 7, 20, 2, 10, 2,
9, 12, 9, 8, 1, 4, 8, 8, 12, 4, 8, 1, 3, 6, 9, 4, 3,
8, 2, 7, 15, 6, 5, 10, 6, 4, 3, 12, 5, 4, 13, 7, 2, 8,
5, 2, 4, 3, 14, 12, 3, 4, 3, 2, 15, 6, 14, 12, 11, 9, 5,
5, 7, 11, 10, 7, 9, 9, 7, 11, 5, 11, 3, 2, 5, 17, 5, 2,
6, 1, 10, 3, 13, 19, 5, 1, 3, 5, 3, 5, 6, 3, 9, 8, 2,
3, 2, 3, 7, 4, 9, 5, 1, 6, 14, 4, 8, 17, 13, 7, 1, 4,
5, 10, 5, 6, 2, 12, 5, 9, 3, 9, 9, 1, 5, 1, 2, 2, 5,
1, 4, 4, 13, 4, 25, 9, 10, 4, 3, 9, 13, 13, 2, 9, 2, 12,
4, 1, 20, 9, 10, 2, 5, 4, 10, 2, 6, 1, 7, 7, 7, 4, 8,
4, 3, 4, 13, 8, 3, 13, 12, 19, 9, 3, 2, 6, 7, 13, 8, 16,
7, 3, 11, 4, 10, 9, 12, 2, 8, 5, 2, 3, 4, 2, 1, 11, 5,
4, 2, 8, 12, 7, 5, 7, 7, 4, 6, 18, 2, 1, 6, 15, 11, 2,
5, 8, 3, 5, 9, 11, 5, 8, 6, 20, 1, 10, 3, 7, 1, 3, 5,
4, 4, 10, 11, 6, 1, 5, 4, 1, 2, 10, 4, 4, 11, 20, 5, 3,
2, 7, 8, 2, 10, 5, 1, 18, 5, 10, 5, 3, 8, 15, 2, 1, 14,
10, 7, 3, 5, 9, 3, 4, 21, 14, 1, 2, 1, 2, 4, 11, 9, 7,
6, 9, 18, 4, 6, 18, 12, 12, 4, 6, 3, 3, 9, 5, 12, 15, 3,
7, 3, 7, 4, 2, 15, 14, 7, 10, 5, 5, 5, 9, 3, 6, 3, 1,
11, 1, 5, 25, 8, 2, 24, 1, 12, 1, 6, 8, 5, 13, 4, 3, 3,
13, 4, 4, 18, 7, 13, 2, 8, 3, 4, 9, 2, 13, 12, 4, 5, 10,
9, 15, 1, 8, 8, 15, 10, 1, 9, 2, 2, 2, 2, 3, 6, 17, 7,
5, 5, 6, 12, 1, 8, 3, 1, 11, 4, 7, 8, 15, 6, 11, 9, 9,
13, 2, 3, 5, 3, 5, 12, 4, 4, 8, 7, 12, 2, 2, 4, 4, 12,
8, 11, 10, 6, 5, 1, 4, 2, 7, 3, 5, 15, 12, 12, 2, 9, 7,
4, 4, 5, 15, 5, 8, 13, 7, 2, 8, 12, 2, 13, 6, 24, 14, 3,
4, 1, 2, 8, 7, 5, 12, 8, 2, 6, 3, 7, 5, 2, 7, 3, 3,
1, 9, 9, 3, 12, 3, 2, 11, 11, 6, 3, 9, 12, 4, 8, 7, 5,
2, 10, 19, 1, 1, 10, 6, 2, 4, 2, 4, 4, 3, 7, 13, 9, 6,
2, 2, 2, 5, 13, 12, 2, 13, 12, 11, 10, 5, 8, 8, 15, 12, 3,
3, 9, 4, 6, 13, 15, 4, 7, 1, 12, 10, 9, 7, 3, 7, 4, 9,
2, 10, 2, 11, 10, 14, 3, 13, 8, 3, 12, 11, 10, 7, 5, 3, 3,
11, 3, 13, 9, 10, 20, 7, 12, 3, 6, 6, 18, 3, 10, 11, 10, 5,
6, 11, 4, 6, 7, 9, 13, 1, 14, 14, 13, 4, 3, 8, 5, 7, 14,
13, 13, 12, 8, 11, 12, 9, 8, 9, 4, 5, 4, 7, 5, 2, 3, 1,
7, 2, 1, 13, 5, 19, 9, 6, 9, 7])


When I plot the histogram of dependent variable I get this



enter image description here



So I used a log transform to remove some of the skewness.



as y=np.log(df["reservartions"].values)



Now the plot of distribution looks below:



enter image description here



Some of features.



type actual_price recommended_price num_videos image_ava text_length
1 67.85 59 5 0 7
0 100.70 53 5 0 224
0 74.00 74 4 1 21
0 135.00 75 1 0 184
0 59.36 53 2 1 31


Since actual_price and Recommended_price have huge correlation, I created a difference price of these two and dropped actual_price and recommended price.



But after running Linear regression or Random Forest Regression I get very poor results with R2 as 0.12 for both.



This shows the model is clearly not predicting and fitting well.



My dependent variable is clearly a positive variable. Is Linear Regression still right? Should I use Poisson regression? Log transformation makes sense?



EDIT:



Tried Poisson from Statsmodels. Gives worse results



 import statsmodels.api as sm 
poisson_mod = sm.Poisson(train_Y, train_X)
poisson_res = poisson_mod.fit(method="newton")
print(poisson_res.summary())

Optimization terminated successfully.
Current function value: 2.958960
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 637
Model: Poisson Df Residuals: 628
Method: MLE Df Model: 8
Date: Mon, 08 Oct 2018 Pseudo R-squ.: 0.09479
Time: 13:57:37 Log-Likelihood: -1884.9
converged: True LL-Null: -2082.2
LLR p-value: 2.506e-80
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
technology 0.0080 0.040 0.200 0.842 -0.070 0.086
street_parked 0.0014 0.030 0.046 0.963 -0.058 0.061
description 0.0002 0.000 0.884 0.377 -0.000 0.001
num_images_2 -0.0230 0.054 -0.430 0.667 -0.128 0.082
num_images_3 0.0619 0.053 1.160 0.246 -0.043 0.167
num_images_4 0.2234 0.050 4.501 0.000 0.126 0.321
num_images_5 0.2391 0.053 4.521 0.000 0.135 0.343
price_diff -0.0146 0.001 -16.300 0.000 -0.016 -0.013
Bias 2.1325 0.052 41.016 0.000 2.031 2.234
=================================================================================









share|improve this question











$endgroup$











  • $begingroup$
    I recommend you these tasks, after that, you can put the result again: 1. remove outlier if exist 2. normalize variables 3. put price and percent of off furthermore, big bias means your model could not predict target. you should think about feature engineering
    $endgroup$
    – parvij
    Oct 9 '18 at 5:30










  • $begingroup$
    These are feature engg. What do u mean by normalizations? I have already standardized the continuous features
    $endgroup$
    – Baktaawar
    Oct 10 '18 at 17:22











  • $begingroup$
    data for linear regression should be Gaussian. Box-Cox transformation changes your data. try it.
    $endgroup$
    – parvij
    Oct 11 '18 at 5:25










  • $begingroup$
    That is not true. You can have features of any distribution. It doesn't need to be gaussian. Its the error which is gaussian
    $endgroup$
    – Baktaawar
    Oct 11 '18 at 23:00















0












$begingroup$


I am trying to predict reservation count from a dataset with few features. Features are both categorical and continuous.



The dependent variable reservations looks like below: My dataset size is around 917 obs.



array([ 1, 7, 17, 2, 2, 13, 8, 11, 9, 4, 4, 3, 5, 2, 5, 7, 3,
12, 9, 13, 5, 2, 11, 13, 14, 19, 9, 11, 3, 6, 7, 10, 1, 6,
5, 10, 8, 5, 4, 3, 2, 10, 10, 10, 8, 13, 16, 6, 4, 6, 3,
11, 10, 1, 18, 7, 2, 12, 17, 4, 2, 19, 3, 4, 17, 13, 10, 2,
10, 1, 3, 4, 20, 3, 2, 1, 3, 5, 8, 8, 4, 3, 13, 3, 3,
5, 4, 17, 7, 6, 10, 5, 3, 9, 9, 8, 1, 5, 17, 5, 10, 9,
2, 7, 13, 2, 9, 1, 15, 13, 10, 4, 2, 4, 5, 4, 3, 3, 10,
4, 7, 5, 13, 12, 7, 5, 6, 9, 5, 11, 7, 1, 4, 12, 4, 3,
11, 1, 4, 4, 3, 7, 4, 11, 4, 1, 9, 2, 10, 10, 3, 4, 4,
3, 2, 7, 10, 7, 6, 1, 3, 19, 9, 3, 8, 20, 1, 12, 9, 13,
13, 2, 9, 4, 9, 2, 5, 6, 18, 3, 6, 8, 6, 4, 5, 13, 4,
8, 9, 5, 4, 8, 5, 2, 1, 6, 8, 3, 6, 4, 2, 6, 11, 5,
1, 5, 1, 5, 11, 11, 9, 3, 12, 2, 2, 9, 19, 7, 13, 13, 9,
2, 1, 1, 4, 3, 4, 9, 1, 25, 12, 8, 5, 18, 3, 1, 6, 17,
7, 4, 6, 9, 8, 10, 3, 8, 12, 5, 4, 4, 1, 9, 21, 4, 3,
3, 7, 13, 5, 12, 8, 8, 6, 3, 6, 7, 5, 3, 7, 3, 14, 3,
5, 2, 14, 16, 3, 8, 6, 13, 9, 3, 5, 4, 9, 4, 12, 12, 4,
9, 8, 11, 5, 13, 3, 2, 5, 4, 2, 1, 8, 8, 18, 11, 2, 5,
13, 4, 1, 2, 4, 1, 2, 2, 12, 2, 6, 19, 7, 20, 2, 10, 2,
9, 12, 9, 8, 1, 4, 8, 8, 12, 4, 8, 1, 3, 6, 9, 4, 3,
8, 2, 7, 15, 6, 5, 10, 6, 4, 3, 12, 5, 4, 13, 7, 2, 8,
5, 2, 4, 3, 14, 12, 3, 4, 3, 2, 15, 6, 14, 12, 11, 9, 5,
5, 7, 11, 10, 7, 9, 9, 7, 11, 5, 11, 3, 2, 5, 17, 5, 2,
6, 1, 10, 3, 13, 19, 5, 1, 3, 5, 3, 5, 6, 3, 9, 8, 2,
3, 2, 3, 7, 4, 9, 5, 1, 6, 14, 4, 8, 17, 13, 7, 1, 4,
5, 10, 5, 6, 2, 12, 5, 9, 3, 9, 9, 1, 5, 1, 2, 2, 5,
1, 4, 4, 13, 4, 25, 9, 10, 4, 3, 9, 13, 13, 2, 9, 2, 12,
4, 1, 20, 9, 10, 2, 5, 4, 10, 2, 6, 1, 7, 7, 7, 4, 8,
4, 3, 4, 13, 8, 3, 13, 12, 19, 9, 3, 2, 6, 7, 13, 8, 16,
7, 3, 11, 4, 10, 9, 12, 2, 8, 5, 2, 3, 4, 2, 1, 11, 5,
4, 2, 8, 12, 7, 5, 7, 7, 4, 6, 18, 2, 1, 6, 15, 11, 2,
5, 8, 3, 5, 9, 11, 5, 8, 6, 20, 1, 10, 3, 7, 1, 3, 5,
4, 4, 10, 11, 6, 1, 5, 4, 1, 2, 10, 4, 4, 11, 20, 5, 3,
2, 7, 8, 2, 10, 5, 1, 18, 5, 10, 5, 3, 8, 15, 2, 1, 14,
10, 7, 3, 5, 9, 3, 4, 21, 14, 1, 2, 1, 2, 4, 11, 9, 7,
6, 9, 18, 4, 6, 18, 12, 12, 4, 6, 3, 3, 9, 5, 12, 15, 3,
7, 3, 7, 4, 2, 15, 14, 7, 10, 5, 5, 5, 9, 3, 6, 3, 1,
11, 1, 5, 25, 8, 2, 24, 1, 12, 1, 6, 8, 5, 13, 4, 3, 3,
13, 4, 4, 18, 7, 13, 2, 8, 3, 4, 9, 2, 13, 12, 4, 5, 10,
9, 15, 1, 8, 8, 15, 10, 1, 9, 2, 2, 2, 2, 3, 6, 17, 7,
5, 5, 6, 12, 1, 8, 3, 1, 11, 4, 7, 8, 15, 6, 11, 9, 9,
13, 2, 3, 5, 3, 5, 12, 4, 4, 8, 7, 12, 2, 2, 4, 4, 12,
8, 11, 10, 6, 5, 1, 4, 2, 7, 3, 5, 15, 12, 12, 2, 9, 7,
4, 4, 5, 15, 5, 8, 13, 7, 2, 8, 12, 2, 13, 6, 24, 14, 3,
4, 1, 2, 8, 7, 5, 12, 8, 2, 6, 3, 7, 5, 2, 7, 3, 3,
1, 9, 9, 3, 12, 3, 2, 11, 11, 6, 3, 9, 12, 4, 8, 7, 5,
2, 10, 19, 1, 1, 10, 6, 2, 4, 2, 4, 4, 3, 7, 13, 9, 6,
2, 2, 2, 5, 13, 12, 2, 13, 12, 11, 10, 5, 8, 8, 15, 12, 3,
3, 9, 4, 6, 13, 15, 4, 7, 1, 12, 10, 9, 7, 3, 7, 4, 9,
2, 10, 2, 11, 10, 14, 3, 13, 8, 3, 12, 11, 10, 7, 5, 3, 3,
11, 3, 13, 9, 10, 20, 7, 12, 3, 6, 6, 18, 3, 10, 11, 10, 5,
6, 11, 4, 6, 7, 9, 13, 1, 14, 14, 13, 4, 3, 8, 5, 7, 14,
13, 13, 12, 8, 11, 12, 9, 8, 9, 4, 5, 4, 7, 5, 2, 3, 1,
7, 2, 1, 13, 5, 19, 9, 6, 9, 7])


When I plot the histogram of dependent variable I get this



enter image description here



So I used a log transform to remove some of the skewness.



as y=np.log(df["reservartions"].values)



Now the plot of distribution looks below:



enter image description here



Some of features.



type actual_price recommended_price num_videos image_ava text_length
1 67.85 59 5 0 7
0 100.70 53 5 0 224
0 74.00 74 4 1 21
0 135.00 75 1 0 184
0 59.36 53 2 1 31


Since actual_price and Recommended_price have huge correlation, I created a difference price of these two and dropped actual_price and recommended price.



But after running Linear regression or Random Forest Regression I get very poor results with R2 as 0.12 for both.



This shows the model is clearly not predicting and fitting well.



My dependent variable is clearly a positive variable. Is Linear Regression still right? Should I use Poisson regression? Log transformation makes sense?



EDIT:



Tried Poisson from Statsmodels. Gives worse results



 import statsmodels.api as sm 
poisson_mod = sm.Poisson(train_Y, train_X)
poisson_res = poisson_mod.fit(method="newton")
print(poisson_res.summary())

Optimization terminated successfully.
Current function value: 2.958960
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 637
Model: Poisson Df Residuals: 628
Method: MLE Df Model: 8
Date: Mon, 08 Oct 2018 Pseudo R-squ.: 0.09479
Time: 13:57:37 Log-Likelihood: -1884.9
converged: True LL-Null: -2082.2
LLR p-value: 2.506e-80
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
technology 0.0080 0.040 0.200 0.842 -0.070 0.086
street_parked 0.0014 0.030 0.046 0.963 -0.058 0.061
description 0.0002 0.000 0.884 0.377 -0.000 0.001
num_images_2 -0.0230 0.054 -0.430 0.667 -0.128 0.082
num_images_3 0.0619 0.053 1.160 0.246 -0.043 0.167
num_images_4 0.2234 0.050 4.501 0.000 0.126 0.321
num_images_5 0.2391 0.053 4.521 0.000 0.135 0.343
price_diff -0.0146 0.001 -16.300 0.000 -0.016 -0.013
Bias 2.1325 0.052 41.016 0.000 2.031 2.234
=================================================================================









share|improve this question











$endgroup$











  • $begingroup$
    I recommend you these tasks, after that, you can put the result again: 1. remove outlier if exist 2. normalize variables 3. put price and percent of off furthermore, big bias means your model could not predict target. you should think about feature engineering
    $endgroup$
    – parvij
    Oct 9 '18 at 5:30










  • $begingroup$
    These are feature engg. What do u mean by normalizations? I have already standardized the continuous features
    $endgroup$
    – Baktaawar
    Oct 10 '18 at 17:22











  • $begingroup$
    data for linear regression should be Gaussian. Box-Cox transformation changes your data. try it.
    $endgroup$
    – parvij
    Oct 11 '18 at 5:25










  • $begingroup$
    That is not true. You can have features of any distribution. It doesn't need to be gaussian. Its the error which is gaussian
    $endgroup$
    – Baktaawar
    Oct 11 '18 at 23:00













0












0








0





$begingroup$


I am trying to predict reservation count from a dataset with few features. Features are both categorical and continuous.



The dependent variable reservations looks like below: My dataset size is around 917 obs.



array([ 1, 7, 17, 2, 2, 13, 8, 11, 9, 4, 4, 3, 5, 2, 5, 7, 3,
12, 9, 13, 5, 2, 11, 13, 14, 19, 9, 11, 3, 6, 7, 10, 1, 6,
5, 10, 8, 5, 4, 3, 2, 10, 10, 10, 8, 13, 16, 6, 4, 6, 3,
11, 10, 1, 18, 7, 2, 12, 17, 4, 2, 19, 3, 4, 17, 13, 10, 2,
10, 1, 3, 4, 20, 3, 2, 1, 3, 5, 8, 8, 4, 3, 13, 3, 3,
5, 4, 17, 7, 6, 10, 5, 3, 9, 9, 8, 1, 5, 17, 5, 10, 9,
2, 7, 13, 2, 9, 1, 15, 13, 10, 4, 2, 4, 5, 4, 3, 3, 10,
4, 7, 5, 13, 12, 7, 5, 6, 9, 5, 11, 7, 1, 4, 12, 4, 3,
11, 1, 4, 4, 3, 7, 4, 11, 4, 1, 9, 2, 10, 10, 3, 4, 4,
3, 2, 7, 10, 7, 6, 1, 3, 19, 9, 3, 8, 20, 1, 12, 9, 13,
13, 2, 9, 4, 9, 2, 5, 6, 18, 3, 6, 8, 6, 4, 5, 13, 4,
8, 9, 5, 4, 8, 5, 2, 1, 6, 8, 3, 6, 4, 2, 6, 11, 5,
1, 5, 1, 5, 11, 11, 9, 3, 12, 2, 2, 9, 19, 7, 13, 13, 9,
2, 1, 1, 4, 3, 4, 9, 1, 25, 12, 8, 5, 18, 3, 1, 6, 17,
7, 4, 6, 9, 8, 10, 3, 8, 12, 5, 4, 4, 1, 9, 21, 4, 3,
3, 7, 13, 5, 12, 8, 8, 6, 3, 6, 7, 5, 3, 7, 3, 14, 3,
5, 2, 14, 16, 3, 8, 6, 13, 9, 3, 5, 4, 9, 4, 12, 12, 4,
9, 8, 11, 5, 13, 3, 2, 5, 4, 2, 1, 8, 8, 18, 11, 2, 5,
13, 4, 1, 2, 4, 1, 2, 2, 12, 2, 6, 19, 7, 20, 2, 10, 2,
9, 12, 9, 8, 1, 4, 8, 8, 12, 4, 8, 1, 3, 6, 9, 4, 3,
8, 2, 7, 15, 6, 5, 10, 6, 4, 3, 12, 5, 4, 13, 7, 2, 8,
5, 2, 4, 3, 14, 12, 3, 4, 3, 2, 15, 6, 14, 12, 11, 9, 5,
5, 7, 11, 10, 7, 9, 9, 7, 11, 5, 11, 3, 2, 5, 17, 5, 2,
6, 1, 10, 3, 13, 19, 5, 1, 3, 5, 3, 5, 6, 3, 9, 8, 2,
3, 2, 3, 7, 4, 9, 5, 1, 6, 14, 4, 8, 17, 13, 7, 1, 4,
5, 10, 5, 6, 2, 12, 5, 9, 3, 9, 9, 1, 5, 1, 2, 2, 5,
1, 4, 4, 13, 4, 25, 9, 10, 4, 3, 9, 13, 13, 2, 9, 2, 12,
4, 1, 20, 9, 10, 2, 5, 4, 10, 2, 6, 1, 7, 7, 7, 4, 8,
4, 3, 4, 13, 8, 3, 13, 12, 19, 9, 3, 2, 6, 7, 13, 8, 16,
7, 3, 11, 4, 10, 9, 12, 2, 8, 5, 2, 3, 4, 2, 1, 11, 5,
4, 2, 8, 12, 7, 5, 7, 7, 4, 6, 18, 2, 1, 6, 15, 11, 2,
5, 8, 3, 5, 9, 11, 5, 8, 6, 20, 1, 10, 3, 7, 1, 3, 5,
4, 4, 10, 11, 6, 1, 5, 4, 1, 2, 10, 4, 4, 11, 20, 5, 3,
2, 7, 8, 2, 10, 5, 1, 18, 5, 10, 5, 3, 8, 15, 2, 1, 14,
10, 7, 3, 5, 9, 3, 4, 21, 14, 1, 2, 1, 2, 4, 11, 9, 7,
6, 9, 18, 4, 6, 18, 12, 12, 4, 6, 3, 3, 9, 5, 12, 15, 3,
7, 3, 7, 4, 2, 15, 14, 7, 10, 5, 5, 5, 9, 3, 6, 3, 1,
11, 1, 5, 25, 8, 2, 24, 1, 12, 1, 6, 8, 5, 13, 4, 3, 3,
13, 4, 4, 18, 7, 13, 2, 8, 3, 4, 9, 2, 13, 12, 4, 5, 10,
9, 15, 1, 8, 8, 15, 10, 1, 9, 2, 2, 2, 2, 3, 6, 17, 7,
5, 5, 6, 12, 1, 8, 3, 1, 11, 4, 7, 8, 15, 6, 11, 9, 9,
13, 2, 3, 5, 3, 5, 12, 4, 4, 8, 7, 12, 2, 2, 4, 4, 12,
8, 11, 10, 6, 5, 1, 4, 2, 7, 3, 5, 15, 12, 12, 2, 9, 7,
4, 4, 5, 15, 5, 8, 13, 7, 2, 8, 12, 2, 13, 6, 24, 14, 3,
4, 1, 2, 8, 7, 5, 12, 8, 2, 6, 3, 7, 5, 2, 7, 3, 3,
1, 9, 9, 3, 12, 3, 2, 11, 11, 6, 3, 9, 12, 4, 8, 7, 5,
2, 10, 19, 1, 1, 10, 6, 2, 4, 2, 4, 4, 3, 7, 13, 9, 6,
2, 2, 2, 5, 13, 12, 2, 13, 12, 11, 10, 5, 8, 8, 15, 12, 3,
3, 9, 4, 6, 13, 15, 4, 7, 1, 12, 10, 9, 7, 3, 7, 4, 9,
2, 10, 2, 11, 10, 14, 3, 13, 8, 3, 12, 11, 10, 7, 5, 3, 3,
11, 3, 13, 9, 10, 20, 7, 12, 3, 6, 6, 18, 3, 10, 11, 10, 5,
6, 11, 4, 6, 7, 9, 13, 1, 14, 14, 13, 4, 3, 8, 5, 7, 14,
13, 13, 12, 8, 11, 12, 9, 8, 9, 4, 5, 4, 7, 5, 2, 3, 1,
7, 2, 1, 13, 5, 19, 9, 6, 9, 7])


When I plot the histogram of dependent variable I get this



enter image description here



So I used a log transform to remove some of the skewness.



as y=np.log(df["reservartions"].values)



Now the plot of distribution looks below:



enter image description here



Some of features.



type actual_price recommended_price num_videos image_ava text_length
1 67.85 59 5 0 7
0 100.70 53 5 0 224
0 74.00 74 4 1 21
0 135.00 75 1 0 184
0 59.36 53 2 1 31


Since actual_price and Recommended_price have huge correlation, I created a difference price of these two and dropped actual_price and recommended price.



But after running Linear regression or Random Forest Regression I get very poor results with R2 as 0.12 for both.



This shows the model is clearly not predicting and fitting well.



My dependent variable is clearly a positive variable. Is Linear Regression still right? Should I use Poisson regression? Log transformation makes sense?



EDIT:



Tried Poisson from Statsmodels. Gives worse results



 import statsmodels.api as sm 
poisson_mod = sm.Poisson(train_Y, train_X)
poisson_res = poisson_mod.fit(method="newton")
print(poisson_res.summary())

Optimization terminated successfully.
Current function value: 2.958960
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 637
Model: Poisson Df Residuals: 628
Method: MLE Df Model: 8
Date: Mon, 08 Oct 2018 Pseudo R-squ.: 0.09479
Time: 13:57:37 Log-Likelihood: -1884.9
converged: True LL-Null: -2082.2
LLR p-value: 2.506e-80
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
technology 0.0080 0.040 0.200 0.842 -0.070 0.086
street_parked 0.0014 0.030 0.046 0.963 -0.058 0.061
description 0.0002 0.000 0.884 0.377 -0.000 0.001
num_images_2 -0.0230 0.054 -0.430 0.667 -0.128 0.082
num_images_3 0.0619 0.053 1.160 0.246 -0.043 0.167
num_images_4 0.2234 0.050 4.501 0.000 0.126 0.321
num_images_5 0.2391 0.053 4.521 0.000 0.135 0.343
price_diff -0.0146 0.001 -16.300 0.000 -0.016 -0.013
Bias 2.1325 0.052 41.016 0.000 2.031 2.234
=================================================================================









share|improve this question











$endgroup$




I am trying to predict reservation count from a dataset with few features. Features are both categorical and continuous.



The dependent variable reservations looks like below: My dataset size is around 917 obs.



array([ 1, 7, 17, 2, 2, 13, 8, 11, 9, 4, 4, 3, 5, 2, 5, 7, 3,
12, 9, 13, 5, 2, 11, 13, 14, 19, 9, 11, 3, 6, 7, 10, 1, 6,
5, 10, 8, 5, 4, 3, 2, 10, 10, 10, 8, 13, 16, 6, 4, 6, 3,
11, 10, 1, 18, 7, 2, 12, 17, 4, 2, 19, 3, 4, 17, 13, 10, 2,
10, 1, 3, 4, 20, 3, 2, 1, 3, 5, 8, 8, 4, 3, 13, 3, 3,
5, 4, 17, 7, 6, 10, 5, 3, 9, 9, 8, 1, 5, 17, 5, 10, 9,
2, 7, 13, 2, 9, 1, 15, 13, 10, 4, 2, 4, 5, 4, 3, 3, 10,
4, 7, 5, 13, 12, 7, 5, 6, 9, 5, 11, 7, 1, 4, 12, 4, 3,
11, 1, 4, 4, 3, 7, 4, 11, 4, 1, 9, 2, 10, 10, 3, 4, 4,
3, 2, 7, 10, 7, 6, 1, 3, 19, 9, 3, 8, 20, 1, 12, 9, 13,
13, 2, 9, 4, 9, 2, 5, 6, 18, 3, 6, 8, 6, 4, 5, 13, 4,
8, 9, 5, 4, 8, 5, 2, 1, 6, 8, 3, 6, 4, 2, 6, 11, 5,
1, 5, 1, 5, 11, 11, 9, 3, 12, 2, 2, 9, 19, 7, 13, 13, 9,
2, 1, 1, 4, 3, 4, 9, 1, 25, 12, 8, 5, 18, 3, 1, 6, 17,
7, 4, 6, 9, 8, 10, 3, 8, 12, 5, 4, 4, 1, 9, 21, 4, 3,
3, 7, 13, 5, 12, 8, 8, 6, 3, 6, 7, 5, 3, 7, 3, 14, 3,
5, 2, 14, 16, 3, 8, 6, 13, 9, 3, 5, 4, 9, 4, 12, 12, 4,
9, 8, 11, 5, 13, 3, 2, 5, 4, 2, 1, 8, 8, 18, 11, 2, 5,
13, 4, 1, 2, 4, 1, 2, 2, 12, 2, 6, 19, 7, 20, 2, 10, 2,
9, 12, 9, 8, 1, 4, 8, 8, 12, 4, 8, 1, 3, 6, 9, 4, 3,
8, 2, 7, 15, 6, 5, 10, 6, 4, 3, 12, 5, 4, 13, 7, 2, 8,
5, 2, 4, 3, 14, 12, 3, 4, 3, 2, 15, 6, 14, 12, 11, 9, 5,
5, 7, 11, 10, 7, 9, 9, 7, 11, 5, 11, 3, 2, 5, 17, 5, 2,
6, 1, 10, 3, 13, 19, 5, 1, 3, 5, 3, 5, 6, 3, 9, 8, 2,
3, 2, 3, 7, 4, 9, 5, 1, 6, 14, 4, 8, 17, 13, 7, 1, 4,
5, 10, 5, 6, 2, 12, 5, 9, 3, 9, 9, 1, 5, 1, 2, 2, 5,
1, 4, 4, 13, 4, 25, 9, 10, 4, 3, 9, 13, 13, 2, 9, 2, 12,
4, 1, 20, 9, 10, 2, 5, 4, 10, 2, 6, 1, 7, 7, 7, 4, 8,
4, 3, 4, 13, 8, 3, 13, 12, 19, 9, 3, 2, 6, 7, 13, 8, 16,
7, 3, 11, 4, 10, 9, 12, 2, 8, 5, 2, 3, 4, 2, 1, 11, 5,
4, 2, 8, 12, 7, 5, 7, 7, 4, 6, 18, 2, 1, 6, 15, 11, 2,
5, 8, 3, 5, 9, 11, 5, 8, 6, 20, 1, 10, 3, 7, 1, 3, 5,
4, 4, 10, 11, 6, 1, 5, 4, 1, 2, 10, 4, 4, 11, 20, 5, 3,
2, 7, 8, 2, 10, 5, 1, 18, 5, 10, 5, 3, 8, 15, 2, 1, 14,
10, 7, 3, 5, 9, 3, 4, 21, 14, 1, 2, 1, 2, 4, 11, 9, 7,
6, 9, 18, 4, 6, 18, 12, 12, 4, 6, 3, 3, 9, 5, 12, 15, 3,
7, 3, 7, 4, 2, 15, 14, 7, 10, 5, 5, 5, 9, 3, 6, 3, 1,
11, 1, 5, 25, 8, 2, 24, 1, 12, 1, 6, 8, 5, 13, 4, 3, 3,
13, 4, 4, 18, 7, 13, 2, 8, 3, 4, 9, 2, 13, 12, 4, 5, 10,
9, 15, 1, 8, 8, 15, 10, 1, 9, 2, 2, 2, 2, 3, 6, 17, 7,
5, 5, 6, 12, 1, 8, 3, 1, 11, 4, 7, 8, 15, 6, 11, 9, 9,
13, 2, 3, 5, 3, 5, 12, 4, 4, 8, 7, 12, 2, 2, 4, 4, 12,
8, 11, 10, 6, 5, 1, 4, 2, 7, 3, 5, 15, 12, 12, 2, 9, 7,
4, 4, 5, 15, 5, 8, 13, 7, 2, 8, 12, 2, 13, 6, 24, 14, 3,
4, 1, 2, 8, 7, 5, 12, 8, 2, 6, 3, 7, 5, 2, 7, 3, 3,
1, 9, 9, 3, 12, 3, 2, 11, 11, 6, 3, 9, 12, 4, 8, 7, 5,
2, 10, 19, 1, 1, 10, 6, 2, 4, 2, 4, 4, 3, 7, 13, 9, 6,
2, 2, 2, 5, 13, 12, 2, 13, 12, 11, 10, 5, 8, 8, 15, 12, 3,
3, 9, 4, 6, 13, 15, 4, 7, 1, 12, 10, 9, 7, 3, 7, 4, 9,
2, 10, 2, 11, 10, 14, 3, 13, 8, 3, 12, 11, 10, 7, 5, 3, 3,
11, 3, 13, 9, 10, 20, 7, 12, 3, 6, 6, 18, 3, 10, 11, 10, 5,
6, 11, 4, 6, 7, 9, 13, 1, 14, 14, 13, 4, 3, 8, 5, 7, 14,
13, 13, 12, 8, 11, 12, 9, 8, 9, 4, 5, 4, 7, 5, 2, 3, 1,
7, 2, 1, 13, 5, 19, 9, 6, 9, 7])


When I plot the histogram of dependent variable I get this



enter image description here



So I used a log transform to remove some of the skewness.



as y=np.log(df["reservartions"].values)



Now the plot of distribution looks below:



enter image description here



Some of features.



type actual_price recommended_price num_videos image_ava text_length
1 67.85 59 5 0 7
0 100.70 53 5 0 224
0 74.00 74 4 1 21
0 135.00 75 1 0 184
0 59.36 53 2 1 31


Since actual_price and Recommended_price have huge correlation, I created a difference price of these two and dropped actual_price and recommended price.



But after running Linear regression or Random Forest Regression I get very poor results with R2 as 0.12 for both.



This shows the model is clearly not predicting and fitting well.



My dependent variable is clearly a positive variable. Is Linear Regression still right? Should I use Poisson regression? Log transformation makes sense?



EDIT:



Tried Poisson from Statsmodels. Gives worse results



 import statsmodels.api as sm 
poisson_mod = sm.Poisson(train_Y, train_X)
poisson_res = poisson_mod.fit(method="newton")
print(poisson_res.summary())

Optimization terminated successfully.
Current function value: 2.958960
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 637
Model: Poisson Df Residuals: 628
Method: MLE Df Model: 8
Date: Mon, 08 Oct 2018 Pseudo R-squ.: 0.09479
Time: 13:57:37 Log-Likelihood: -1884.9
converged: True LL-Null: -2082.2
LLR p-value: 2.506e-80
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
technology 0.0080 0.040 0.200 0.842 -0.070 0.086
street_parked 0.0014 0.030 0.046 0.963 -0.058 0.061
description 0.0002 0.000 0.884 0.377 -0.000 0.001
num_images_2 -0.0230 0.054 -0.430 0.667 -0.128 0.082
num_images_3 0.0619 0.053 1.160 0.246 -0.043 0.167
num_images_4 0.2234 0.050 4.501 0.000 0.126 0.321
num_images_5 0.2391 0.053 4.521 0.000 0.135 0.343
price_diff -0.0146 0.001 -16.300 0.000 -0.016 -0.013
Bias 2.1325 0.052 41.016 0.000 2.031 2.234
=================================================================================






machine-learning regression statistics






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Oct 8 '18 at 21:21







Baktaawar

















asked Oct 8 '18 at 3:44









BaktaawarBaktaawar

1164




1164











  • $begingroup$
    I recommend you these tasks, after that, you can put the result again: 1. remove outlier if exist 2. normalize variables 3. put price and percent of off furthermore, big bias means your model could not predict target. you should think about feature engineering
    $endgroup$
    – parvij
    Oct 9 '18 at 5:30










  • $begingroup$
    These are feature engg. What do u mean by normalizations? I have already standardized the continuous features
    $endgroup$
    – Baktaawar
    Oct 10 '18 at 17:22











  • $begingroup$
    data for linear regression should be Gaussian. Box-Cox transformation changes your data. try it.
    $endgroup$
    – parvij
    Oct 11 '18 at 5:25










  • $begingroup$
    That is not true. You can have features of any distribution. It doesn't need to be gaussian. Its the error which is gaussian
    $endgroup$
    – Baktaawar
    Oct 11 '18 at 23:00
















  • $begingroup$
    I recommend you these tasks, after that, you can put the result again: 1. remove outlier if exist 2. normalize variables 3. put price and percent of off furthermore, big bias means your model could not predict target. you should think about feature engineering
    $endgroup$
    – parvij
    Oct 9 '18 at 5:30










  • $begingroup$
    These are feature engg. What do u mean by normalizations? I have already standardized the continuous features
    $endgroup$
    – Baktaawar
    Oct 10 '18 at 17:22











  • $begingroup$
    data for linear regression should be Gaussian. Box-Cox transformation changes your data. try it.
    $endgroup$
    – parvij
    Oct 11 '18 at 5:25










  • $begingroup$
    That is not true. You can have features of any distribution. It doesn't need to be gaussian. Its the error which is gaussian
    $endgroup$
    – Baktaawar
    Oct 11 '18 at 23:00















$begingroup$
I recommend you these tasks, after that, you can put the result again: 1. remove outlier if exist 2. normalize variables 3. put price and percent of off furthermore, big bias means your model could not predict target. you should think about feature engineering
$endgroup$
– parvij
Oct 9 '18 at 5:30




$begingroup$
I recommend you these tasks, after that, you can put the result again: 1. remove outlier if exist 2. normalize variables 3. put price and percent of off furthermore, big bias means your model could not predict target. you should think about feature engineering
$endgroup$
– parvij
Oct 9 '18 at 5:30












$begingroup$
These are feature engg. What do u mean by normalizations? I have already standardized the continuous features
$endgroup$
– Baktaawar
Oct 10 '18 at 17:22





$begingroup$
These are feature engg. What do u mean by normalizations? I have already standardized the continuous features
$endgroup$
– Baktaawar
Oct 10 '18 at 17:22













$begingroup$
data for linear regression should be Gaussian. Box-Cox transformation changes your data. try it.
$endgroup$
– parvij
Oct 11 '18 at 5:25




$begingroup$
data for linear regression should be Gaussian. Box-Cox transformation changes your data. try it.
$endgroup$
– parvij
Oct 11 '18 at 5:25












$begingroup$
That is not true. You can have features of any distribution. It doesn't need to be gaussian. Its the error which is gaussian
$endgroup$
– Baktaawar
Oct 11 '18 at 23:00




$begingroup$
That is not true. You can have features of any distribution. It doesn't need to be gaussian. Its the error which is gaussian
$endgroup$
– Baktaawar
Oct 11 '18 at 23:00










1 Answer
1






active

oldest

votes


















0












$begingroup$

I think because your target is count so their distribution is Poisson and you should use Poisson regression.



I strongly recommend reading this, however, it's in R but will be helpful for you and also there is some similar python version in comments.






share|improve this answer











$endgroup$












  • $begingroup$
    Pls check the edit of Poisson I tried
    $endgroup$
    – Baktaawar
    Oct 8 '18 at 21:20












Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f39331%2fright-regression-model-to-use%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0












$begingroup$

I think because your target is count so their distribution is Poisson and you should use Poisson regression.



I strongly recommend reading this, however, it's in R but will be helpful for you and also there is some similar python version in comments.






share|improve this answer











$endgroup$












  • $begingroup$
    Pls check the edit of Poisson I tried
    $endgroup$
    – Baktaawar
    Oct 8 '18 at 21:20
















0












$begingroup$

I think because your target is count so their distribution is Poisson and you should use Poisson regression.



I strongly recommend reading this, however, it's in R but will be helpful for you and also there is some similar python version in comments.






share|improve this answer











$endgroup$












  • $begingroup$
    Pls check the edit of Poisson I tried
    $endgroup$
    – Baktaawar
    Oct 8 '18 at 21:20














0












0








0





$begingroup$

I think because your target is count so their distribution is Poisson and you should use Poisson regression.



I strongly recommend reading this, however, it's in R but will be helpful for you and also there is some similar python version in comments.






share|improve this answer











$endgroup$



I think because your target is count so their distribution is Poisson and you should use Poisson regression.



I strongly recommend reading this, however, it's in R but will be helpful for you and also there is some similar python version in comments.







share|improve this answer














share|improve this answer



share|improve this answer








edited Oct 9 '18 at 6:13

























answered Oct 8 '18 at 7:29









parvijparvij

485214




485214











  • $begingroup$
    Pls check the edit of Poisson I tried
    $endgroup$
    – Baktaawar
    Oct 8 '18 at 21:20

















  • $begingroup$
    Pls check the edit of Poisson I tried
    $endgroup$
    – Baktaawar
    Oct 8 '18 at 21:20
















$begingroup$
Pls check the edit of Poisson I tried
$endgroup$
– Baktaawar
Oct 8 '18 at 21:20





$begingroup$
Pls check the edit of Poisson I tried
$endgroup$
– Baktaawar
Oct 8 '18 at 21:20


















draft saved

draft discarded
















































Thanks for contributing an answer to Data Science Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f39331%2fright-regression-model-to-use%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Adding axes to figuresAdding axes labels to LaTeX figuresLaTeX equivalent of ConTeXt buffersRotate a node but not its content: the case of the ellipse decorationHow to define the default vertical distance between nodes?TikZ scaling graphic and adjust node position and keep font sizeNumerical conditional within tikz keys?adding axes to shapesAlign axes across subfiguresAdding figures with a certain orderLine up nested tikz enviroments or how to get rid of themAdding axes labels to LaTeX figures

Luettelo Yhdysvaltain laivaston lentotukialuksista Lähteet | Navigointivalikko

Gary (muusikko) Sisällysluettelo Historia | Rockin' High | Lähteet | Aiheesta muualla | NavigointivalikkoInfobox OKTuomas "Gary" Keskinen Ancaran kitaristiksiProjekti Rockin' High