Right Regression Model to use
I am trying to predict a reservation count from a dataset with a few features. The features are both categorical and continuous.
The dependent variable, reservations, looks like this (the dataset has about 917 observations):
array([ 1, 7, 17, 2, 2, 13, 8, 11, 9, 4, 4, 3, 5, 2, 5, 7, 3,
12, 9, 13, 5, 2, 11, 13, 14, 19, 9, 11, 3, 6, 7, 10, 1, 6,
5, 10, 8, 5, 4, 3, 2, 10, 10, 10, 8, 13, 16, 6, 4, 6, 3,
11, 10, 1, 18, 7, 2, 12, 17, 4, 2, 19, 3, 4, 17, 13, 10, 2,
10, 1, 3, 4, 20, 3, 2, 1, 3, 5, 8, 8, 4, 3, 13, 3, 3,
5, 4, 17, 7, 6, 10, 5, 3, 9, 9, 8, 1, 5, 17, 5, 10, 9,
2, 7, 13, 2, 9, 1, 15, 13, 10, 4, 2, 4, 5, 4, 3, 3, 10,
4, 7, 5, 13, 12, 7, 5, 6, 9, 5, 11, 7, 1, 4, 12, 4, 3,
11, 1, 4, 4, 3, 7, 4, 11, 4, 1, 9, 2, 10, 10, 3, 4, 4,
3, 2, 7, 10, 7, 6, 1, 3, 19, 9, 3, 8, 20, 1, 12, 9, 13,
13, 2, 9, 4, 9, 2, 5, 6, 18, 3, 6, 8, 6, 4, 5, 13, 4,
8, 9, 5, 4, 8, 5, 2, 1, 6, 8, 3, 6, 4, 2, 6, 11, 5,
1, 5, 1, 5, 11, 11, 9, 3, 12, 2, 2, 9, 19, 7, 13, 13, 9,
2, 1, 1, 4, 3, 4, 9, 1, 25, 12, 8, 5, 18, 3, 1, 6, 17,
7, 4, 6, 9, 8, 10, 3, 8, 12, 5, 4, 4, 1, 9, 21, 4, 3,
3, 7, 13, 5, 12, 8, 8, 6, 3, 6, 7, 5, 3, 7, 3, 14, 3,
5, 2, 14, 16, 3, 8, 6, 13, 9, 3, 5, 4, 9, 4, 12, 12, 4,
9, 8, 11, 5, 13, 3, 2, 5, 4, 2, 1, 8, 8, 18, 11, 2, 5,
13, 4, 1, 2, 4, 1, 2, 2, 12, 2, 6, 19, 7, 20, 2, 10, 2,
9, 12, 9, 8, 1, 4, 8, 8, 12, 4, 8, 1, 3, 6, 9, 4, 3,
8, 2, 7, 15, 6, 5, 10, 6, 4, 3, 12, 5, 4, 13, 7, 2, 8,
5, 2, 4, 3, 14, 12, 3, 4, 3, 2, 15, 6, 14, 12, 11, 9, 5,
5, 7, 11, 10, 7, 9, 9, 7, 11, 5, 11, 3, 2, 5, 17, 5, 2,
6, 1, 10, 3, 13, 19, 5, 1, 3, 5, 3, 5, 6, 3, 9, 8, 2,
3, 2, 3, 7, 4, 9, 5, 1, 6, 14, 4, 8, 17, 13, 7, 1, 4,
5, 10, 5, 6, 2, 12, 5, 9, 3, 9, 9, 1, 5, 1, 2, 2, 5,
1, 4, 4, 13, 4, 25, 9, 10, 4, 3, 9, 13, 13, 2, 9, 2, 12,
4, 1, 20, 9, 10, 2, 5, 4, 10, 2, 6, 1, 7, 7, 7, 4, 8,
4, 3, 4, 13, 8, 3, 13, 12, 19, 9, 3, 2, 6, 7, 13, 8, 16,
7, 3, 11, 4, 10, 9, 12, 2, 8, 5, 2, 3, 4, 2, 1, 11, 5,
4, 2, 8, 12, 7, 5, 7, 7, 4, 6, 18, 2, 1, 6, 15, 11, 2,
5, 8, 3, 5, 9, 11, 5, 8, 6, 20, 1, 10, 3, 7, 1, 3, 5,
4, 4, 10, 11, 6, 1, 5, 4, 1, 2, 10, 4, 4, 11, 20, 5, 3,
2, 7, 8, 2, 10, 5, 1, 18, 5, 10, 5, 3, 8, 15, 2, 1, 14,
10, 7, 3, 5, 9, 3, 4, 21, 14, 1, 2, 1, 2, 4, 11, 9, 7,
6, 9, 18, 4, 6, 18, 12, 12, 4, 6, 3, 3, 9, 5, 12, 15, 3,
7, 3, 7, 4, 2, 15, 14, 7, 10, 5, 5, 5, 9, 3, 6, 3, 1,
11, 1, 5, 25, 8, 2, 24, 1, 12, 1, 6, 8, 5, 13, 4, 3, 3,
13, 4, 4, 18, 7, 13, 2, 8, 3, 4, 9, 2, 13, 12, 4, 5, 10,
9, 15, 1, 8, 8, 15, 10, 1, 9, 2, 2, 2, 2, 3, 6, 17, 7,
5, 5, 6, 12, 1, 8, 3, 1, 11, 4, 7, 8, 15, 6, 11, 9, 9,
13, 2, 3, 5, 3, 5, 12, 4, 4, 8, 7, 12, 2, 2, 4, 4, 12,
8, 11, 10, 6, 5, 1, 4, 2, 7, 3, 5, 15, 12, 12, 2, 9, 7,
4, 4, 5, 15, 5, 8, 13, 7, 2, 8, 12, 2, 13, 6, 24, 14, 3,
4, 1, 2, 8, 7, 5, 12, 8, 2, 6, 3, 7, 5, 2, 7, 3, 3,
1, 9, 9, 3, 12, 3, 2, 11, 11, 6, 3, 9, 12, 4, 8, 7, 5,
2, 10, 19, 1, 1, 10, 6, 2, 4, 2, 4, 4, 3, 7, 13, 9, 6,
2, 2, 2, 5, 13, 12, 2, 13, 12, 11, 10, 5, 8, 8, 15, 12, 3,
3, 9, 4, 6, 13, 15, 4, 7, 1, 12, 10, 9, 7, 3, 7, 4, 9,
2, 10, 2, 11, 10, 14, 3, 13, 8, 3, 12, 11, 10, 7, 5, 3, 3,
11, 3, 13, 9, 10, 20, 7, 12, 3, 6, 6, 18, 3, 10, 11, 10, 5,
6, 11, 4, 6, 7, 9, 13, 1, 14, 14, 13, 4, 3, 8, 5, 7, 14,
13, 13, 12, 8, 11, 12, 9, 8, 9, 4, 5, 4, 7, 5, 2, 3, 1,
7, 2, 1, 13, 5, 19, 9, 6, 9, 7])
When I plot the histogram of the dependent variable it is clearly skewed (plot not reproduced here), so I used a log transform to reduce some of the skewness:
y = np.log(df["reservartions"].values)
The distribution of the transformed variable is shown in a second plot (also not reproduced here).
A sample of the features:
type actual_price recommended_price num_videos image_ava text_length
1 67.85 59 5 0 7
0 100.70 53 5 0 224
0 74.00 74 4 1 21
0 135.00 75 1 0 184
0 59.36 53 2 1 31
Since actual_price and recommended_price are highly correlated, I created a feature for the difference between the two (price_diff) and dropped both original price columns.
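A minimal sketch of that step, assuming a pandas DataFrame named df with the columns shown above (the DataFrame name is an assumption, not from the original post):

# df is assumed to hold the feature columns shown above
df["price_diff"] = df["actual_price"] - df["recommended_price"]
df = df.drop(columns=["actual_price", "recommended_price"])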
But after running linear regression or random forest regression I get very poor results, with an R² of about 0.12 for both, so the models are clearly neither fitting nor predicting well.
My dependent variable is a strictly positive count. Is linear regression still appropriate? Should I use Poisson regression instead? Does the log transformation make sense?
EDIT:
I tried Poisson regression from statsmodels; it gives even worse results:
import statsmodels.api as sm

# train_X already includes a manually added constant column ("Bias")
poisson_mod = sm.Poisson(train_Y, train_X)
poisson_res = poisson_mod.fit(method="newton")
print(poisson_res.summary())
Optimization terminated successfully.
Current function value: 2.958960
Iterations 5
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 637
Model: Poisson Df Residuals: 628
Method: MLE Df Model: 8
Date: Mon, 08 Oct 2018 Pseudo R-squ.: 0.09479
Time: 13:57:37 Log-Likelihood: -1884.9
converged: True LL-Null: -2082.2
LLR p-value: 2.506e-80
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
technology 0.0080 0.040 0.200 0.842 -0.070 0.086
street_parked 0.0014 0.030 0.046 0.963 -0.058 0.061
description 0.0002 0.000 0.884 0.377 -0.000 0.001
num_images_2 -0.0230 0.054 -0.430 0.667 -0.128 0.082
num_images_3 0.0619 0.053 1.160 0.246 -0.043 0.167
num_images_4 0.2234 0.050 4.501 0.000 0.126 0.321
num_images_5 0.2391 0.053 4.521 0.000 0.135 0.343
price_diff -0.0146 0.001 -16.300 0.000 -0.016 -0.013
Bias 2.1325 0.052 41.016 0.000 2.031 2.234
=================================================================================
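For clarity, the pseudo R-squared in that summary is McFadden's measure, computed directly from the two log-likelihoods reported above:

# McFadden pseudo R-squared = 1 - LL_model / LL_null
print(1 - (-1884.9) / (-2082.2))   # ~0.0948, matching the 0.09479 in the summary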
Tags: machine-learning, regression, statistics

asked Oct 8 '18 at 3:44 by Baktaawar, edited Oct 8 '18 at 21:21
I recommend these steps, and then post your results again: 1. remove outliers if any exist; 2. normalize the variables; 3. include the price and the percentage off as features. Also, a large bias term means your model cannot predict the target; you should think about feature engineering.
– parvij, Oct 9 '18 at 5:30
These are feature-engineering steps. What do you mean by normalization? I have already standardized the continuous features.
– Baktaawar, Oct 10 '18 at 17:22
The data for linear regression should be Gaussian. A Box-Cox transformation changes the distribution of your data; try it.
– parvij, Oct 11 '18 at 5:25
That is not true. The features can follow any distribution; they don't need to be Gaussian. It is the error term that is assumed to be Gaussian.
– Baktaawar, Oct 11 '18 at 23:00
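For reference, the Box-Cox transform suggested in the comments can be applied with SciPy; a minimal sketch, where the column name is copied from the question's own code and everything else is illustrative:

from scipy import stats

# Box-Cox requires strictly positive values; that holds here since the counts start at 1.
y = df["reservartions"].values                 # column name as used in the question's code
y_transformed, fitted_lambda = stats.boxcox(y)  # lambda chosen by maximum likelihood
print("fitted lambda:", fitted_lambda)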
1 Answer
Because your target is a count, its distribution is likely Poisson, so you should use Poisson regression.
I strongly recommend reading this article; it is in R, but it will still be helpful, and there are similar Python versions in the comments.

answered Oct 8 '18 at 7:29 by parvij, edited Oct 9 '18 at 6:13
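As an illustration only (not part of the original answer), here is a minimal sketch of fitting the same kind of Poisson model through the statsmodels formula/GLM interface; the DataFrame name df and the exact column names in the formula are assumptions based on the variables shown in the question:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Column names below are assumptions based on the variables shown in the question
# (reservations target, num_images dummies, price_diff, etc.).
formula = ("reservations ~ technology + street_parked + description"
           " + C(num_images) + price_diff")
poisson_glm = smf.glm(formula, data=df, family=sm.families.Poisson()).fit()
print(poisson_glm.summary())

# Rough overdispersion check: a value well above 1 means the Poisson
# assumption that the variance equals the mean is strained.
print(poisson_glm.pearson_chi2 / poisson_glm.df_resid)

Since the sm.Poisson fit in the question's edit already estimates this model by maximum likelihood, the main extra value here is the dispersion check at the end.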
Please check the edit showing the Poisson model I tried.
– Baktaawar, Oct 8 '18 at 21:20