Right Regression Model to use
I am trying to predict a reservation count from a dataset with a few features. The features are both categorical and continuous.
The dependent variable, reservations, looks like this (the dataset has about 917 observations):
array([ 1, 7, 17, 2, 2, 13, 8, 11, 9, 4, 4, 3, 5, 2, 5, 7, 3,
12, 9, 13, 5, 2, 11, 13, 14, 19, 9, 11, 3, 6, 7, 10, 1, 6,
5, 10, 8, 5, 4, 3, 2, 10, 10, 10, 8, 13, 16, 6, 4, 6, 3,
11, 10, 1, 18, 7, 2, 12, 17, 4, 2, 19, 3, 4, 17, 13, 10, 2,
10, 1, 3, 4, 20, 3, 2, 1, 3, 5, 8, 8, 4, 3, 13, 3, 3,
5, 4, 17, 7, 6, 10, 5, 3, 9, 9, 8, 1, 5, 17, 5, 10, 9,
2, 7, 13, 2, 9, 1, 15, 13, 10, 4, 2, 4, 5, 4, 3, 3, 10,
4, 7, 5, 13, 12, 7, 5, 6, 9, 5, 11, 7, 1, 4, 12, 4, 3,
11, 1, 4, 4, 3, 7, 4, 11, 4, 1, 9, 2, 10, 10, 3, 4, 4,
3, 2, 7, 10, 7, 6, 1, 3, 19, 9, 3, 8, 20, 1, 12, 9, 13,
13, 2, 9, 4, 9, 2, 5, 6, 18, 3, 6, 8, 6, 4, 5, 13, 4,
8, 9, 5, 4, 8, 5, 2, 1, 6, 8, 3, 6, 4, 2, 6, 11, 5,
1, 5, 1, 5, 11, 11, 9, 3, 12, 2, 2, 9, 19, 7, 13, 13, 9,
2, 1, 1, 4, 3, 4, 9, 1, 25, 12, 8, 5, 18, 3, 1, 6, 17,
7, 4, 6, 9, 8, 10, 3, 8, 12, 5, 4, 4, 1, 9, 21, 4, 3,
3, 7, 13, 5, 12, 8, 8, 6, 3, 6, 7, 5, 3, 7, 3, 14, 3,
5, 2, 14, 16, 3, 8, 6, 13, 9, 3, 5, 4, 9, 4, 12, 12, 4,
9, 8, 11, 5, 13, 3, 2, 5, 4, 2, 1, 8, 8, 18, 11, 2, 5,
13, 4, 1, 2, 4, 1, 2, 2, 12, 2, 6, 19, 7, 20, 2, 10, 2,
9, 12, 9, 8, 1, 4, 8, 8, 12, 4, 8, 1, 3, 6, 9, 4, 3,
8, 2, 7, 15, 6, 5, 10, 6, 4, 3, 12, 5, 4, 13, 7, 2, 8,
5, 2, 4, 3, 14, 12, 3, 4, 3, 2, 15, 6, 14, 12, 11, 9, 5,
5, 7, 11, 10, 7, 9, 9, 7, 11, 5, 11, 3, 2, 5, 17, 5, 2,
6, 1, 10, 3, 13, 19, 5, 1, 3, 5, 3, 5, 6, 3, 9, 8, 2,
3, 2, 3, 7, 4, 9, 5, 1, 6, 14, 4, 8, 17, 13, 7, 1, 4,
5, 10, 5, 6, 2, 12, 5, 9, 3, 9, 9, 1, 5, 1, 2, 2, 5,
1, 4, 4, 13, 4, 25, 9, 10, 4, 3, 9, 13, 13, 2, 9, 2, 12,
4, 1, 20, 9, 10, 2, 5, 4, 10, 2, 6, 1, 7, 7, 7, 4, 8,
4, 3, 4, 13, 8, 3, 13, 12, 19, 9, 3, 2, 6, 7, 13, 8, 16,
7, 3, 11, 4, 10, 9, 12, 2, 8, 5, 2, 3, 4, 2, 1, 11, 5,
4, 2, 8, 12, 7, 5, 7, 7, 4, 6, 18, 2, 1, 6, 15, 11, 2,
5, 8, 3, 5, 9, 11, 5, 8, 6, 20, 1, 10, 3, 7, 1, 3, 5,
4, 4, 10, 11, 6, 1, 5, 4, 1, 2, 10, 4, 4, 11, 20, 5, 3,
2, 7, 8, 2, 10, 5, 1, 18, 5, 10, 5, 3, 8, 15, 2, 1, 14,
10, 7, 3, 5, 9, 3, 4, 21, 14, 1, 2, 1, 2, 4, 11, 9, 7,
6, 9, 18, 4, 6, 18, 12, 12, 4, 6, 3, 3, 9, 5, 12, 15, 3,
7, 3, 7, 4, 2, 15, 14, 7, 10, 5, 5, 5, 9, 3, 6, 3, 1,
11, 1, 5, 25, 8, 2, 24, 1, 12, 1, 6, 8, 5, 13, 4, 3, 3,
13, 4, 4, 18, 7, 13, 2, 8, 3, 4, 9, 2, 13, 12, 4, 5, 10,
9, 15, 1, 8, 8, 15, 10, 1, 9, 2, 2, 2, 2, 3, 6, 17, 7,
5, 5, 6, 12, 1, 8, 3, 1, 11, 4, 7, 8, 15, 6, 11, 9, 9,
13, 2, 3, 5, 3, 5, 12, 4, 4, 8, 7, 12, 2, 2, 4, 4, 12,
8, 11, 10, 6, 5, 1, 4, 2, 7, 3, 5, 15, 12, 12, 2, 9, 7,
4, 4, 5, 15, 5, 8, 13, 7, 2, 8, 12, 2, 13, 6, 24, 14, 3,
4, 1, 2, 8, 7, 5, 12, 8, 2, 6, 3, 7, 5, 2, 7, 3, 3,
1, 9, 9, 3, 12, 3, 2, 11, 11, 6, 3, 9, 12, 4, 8, 7, 5,
2, 10, 19, 1, 1, 10, 6, 2, 4, 2, 4, 4, 3, 7, 13, 9, 6,
2, 2, 2, 5, 13, 12, 2, 13, 12, 11, 10, 5, 8, 8, 15, 12, 3,
3, 9, 4, 6, 13, 15, 4, 7, 1, 12, 10, 9, 7, 3, 7, 4, 9,
2, 10, 2, 11, 10, 14, 3, 13, 8, 3, 12, 11, 10, 7, 5, 3, 3,
11, 3, 13, 9, 10, 20, 7, 12, 3, 6, 6, 18, 3, 10, 11, 10, 5,
6, 11, 4, 6, 7, 9, 13, 1, 14, 14, 13, 4, 3, 8, 5, 7, 14,
13, 13, 12, 8, 11, 12, 9, 8, 9, 4, 5, 4, 7, 5, 2, 3, 1,
7, 2, 1, 13, 5, 19, 9, 6, 9, 7])
When I plot the histogram of the dependent variable it is clearly skewed (plot not reproduced here), so I used a log transform to reduce some of the skewness:
y = np.log(df["reservartions"].values)
The distribution of the transformed variable is shown in a second plot (also not reproduced here).
A sample of the features:
type actual_price recommended_price num_videos image_ava text_length
1 67.85 59 5 0 7
0 100.70 53 5 0 224
0 74.00 74 4 1 21
0 135.00 75 1 0 184
0 59.36 53 2 1 31
Since actual_price and recommended_price are highly correlated, I created a feature for the difference between the two (price_diff) and dropped both original price columns.
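A minimal sketch of that step, assuming a pandas DataFrame named df with the columns shown above (the DataFrame name is an assumption, not from the original post):

# df is assumed to hold the feature columns shown above
df["price_diff"] = df["actual_price"] - df["recommended_price"]
df = df.drop(columns=["actual_price", "recommended_price"])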
But after running linear regression or random forest regression I get very poor results, with an R² of about 0.12 for both, so the models are clearly neither fitting nor predicting well.
My dependent variable is a strictly positive count. Is linear regression still appropriate? Should I use Poisson regression instead? Does the log transformation make sense?
EDIT:
I tried Poisson regression from statsmodels; it gives even worse results:
import statsmodels.api as sm

# train_X already includes a manually added constant column ("Bias")
poisson_mod = sm.Poisson(train_Y, train_X)
poisson_res = poisson_mod.fit(method="newton")
print(poisson_res.summary())
Optimization terminated successfully.
Current function value: 2.958960
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 637
Model: Poisson Df Residuals: 628
Method: MLE Df Model: 8
Date: Mon, 08 Oct 2018 Pseudo R-squ.: 0.09479
Time: 13:57:37 Log-Likelihood: -1884.9
converged: True LL-Null: -2082.2
LLR p-value: 2.506e-80
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
technology 0.0080 0.040 0.200 0.842 -0.070 0.086
street_parked 0.0014 0.030 0.046 0.963 -0.058 0.061
description 0.0002 0.000 0.884 0.377 -0.000 0.001
num_images_2 -0.0230 0.054 -0.430 0.667 -0.128 0.082
num_images_3 0.0619 0.053 1.160 0.246 -0.043 0.167
num_images_4 0.2234 0.050 4.501 0.000 0.126 0.321
num_images_5 0.2391 0.053 4.521 0.000 0.135 0.343
price_diff -0.0146 0.001 -16.300 0.000 -0.016 -0.013
Bias 2.1325 0.052 41.016 0.000 2.031 2.234
=================================================================================
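For clarity, the pseudo R-squared in that summary is McFadden's measure, computed directly from the two log-likelihoods reported above:

# McFadden pseudo R-squared = 1 - LL_model / LL_null
print(1 - (-1884.9) / (-2082.2))   # ~0.0948, matching the 0.09479 in the summary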
Tags: machine-learning, regression, statistics

asked Oct 8 '18 at 3:44 by Baktaawar, edited Oct 8 '18 at 21:21
I recommend these steps, and then post your results again: 1. remove outliers if any exist; 2. normalize the variables; 3. include the price and the percentage off as features. Also, a large bias term means your model cannot predict the target; you should think about feature engineering.
– parvij, Oct 9 '18 at 5:30
These are feature-engineering steps. What do you mean by normalization? I have already standardized the continuous features.
– Baktaawar, Oct 10 '18 at 17:22
The data for linear regression should be Gaussian. A Box-Cox transformation changes the distribution of your data; try it.
– parvij, Oct 11 '18 at 5:25
That is not true. The features can follow any distribution; they don't need to be Gaussian. It is the error term that is assumed to be Gaussian.
– Baktaawar, Oct 11 '18 at 23:00
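For reference, the Box-Cox transform suggested in the comments can be applied with SciPy; a minimal sketch, where the column name is copied from the question's own code and everything else is illustrative:

from scipy import stats

# Box-Cox requires strictly positive values; that holds here since the counts start at 1.
y = df["reservartions"].values                 # column name as used in the question's code
y_transformed, fitted_lambda = stats.boxcox(y)  # lambda chosen by maximum likelihood
print("fitted lambda:", fitted_lambda)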
1 Answer
Because your target is a count, its distribution is likely Poisson, so you should use Poisson regression.
I strongly recommend reading this article; it is in R, but it will still be helpful, and there are similar Python versions in the comments.

answered Oct 8 '18 at 7:29 by parvij, edited Oct 9 '18 at 6:13
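As an illustration only (not part of the original answer), here is a minimal sketch of fitting the same kind of Poisson model through the statsmodels formula/GLM interface; the DataFrame name df and the exact column names in the formula are assumptions based on the variables shown in the question:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Column names below are assumptions based on the variables shown in the question
# (reservations target, num_images dummies, price_diff, etc.).
formula = ("reservations ~ technology + street_parked + description"
           " + C(num_images) + price_diff")
poisson_glm = smf.glm(formula, data=df, family=sm.families.Poisson()).fit()
print(poisson_glm.summary())

# Rough overdispersion check: a value well above 1 means the Poisson
# assumption that the variance equals the mean is strained.
print(poisson_glm.pearson_chi2 / poisson_glm.df_resid)

Since the sm.Poisson fit in the question's edit already estimates this model by maximum likelihood, the main extra value here is the dispersion check at the end.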
Please check the edit showing the Poisson model I tried.
– Baktaawar, Oct 8 '18 at 21:20