Say I have some data, and I fit a model to it (a non-linear regression). Then I calculate the R-squared ($R^2$).
When R-squared is negative, what does that mean? Does it mean my model is bad? I know the range of $R^2$ can be [-1, 1]. When $R^2$ is 0, what does that mean as well?
regression
goodness-of-fit
r-squared
curve-fitting
RockTheStar
source
Answers:
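$R^2$ can be negative. It just means that the model fits the data worse than a horizontal line through the mean, which can happen when the model is fitted without an intercept.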
For the people who say that $R^2$ is between 0 and 1, this is not the case. While a negative value for something with the word 'squared' in it might sound like it breaks the rules of maths, it can happen for a model without an intercept. To understand why, we need to look at how $R^2$ is calculated.
This is a bit long. If you want the answer without understanding it, skip to the end. Otherwise, I've tried to write this in simple words.
First, let's define 3 variables: $RSS$, $TSS$ and $ESS$.
Calculating RSS:
For every value of the independent variable $x$, we have a value of the dependent variable $y$. We plot a linear line of best fit, which predicts a value of $y$ for each value of $x$. Let's call the values of $y$ that the line predicts $\hat{y}$. The error between what your line predicts and the actual $y$ value can be calculated by subtraction. All these differences are squared and added up, which gives the Residual Sum of Squares $RSS$.
Putting that into an equation, $RSS = \sum (y - \hat{y})^2$
Calculating TSS:
We can calculate the average value of $y$, which is called $\bar{y}$. If we plot $\bar{y}$, it is just a horizontal line through the data because it is constant. What we can do with it, though, is subtract $\bar{y}$ (the average value of $y$) from every actual value of $y$. The results are squared and added together, which gives the Total Sum of Squares $TSS$.
Putting that into an equation, $TSS = \sum (y - \bar{y})^2$
Calculating ESS:
The differences between $\hat{y}$ (the values of $y$ predicted by the line) and the average value $\bar{y}$ are squared and added. This is the Explained Sum of Squares, which equals $\sum (\hat{y} - \bar{y})^2$
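To make the three definitions concrete, here is a minimal sketch in plain Python. The data points are made up purely for illustration, and the line is an ordinary least-squares fit with an intercept; none of this comes from the original question.

```python
# Hypothetical toy data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares line with an intercept: y_hat = a + b * x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# The three sums of squares defined above.
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual
tss = sum((yi - y_bar) ** 2 for yi in y)               # total
ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained

print(f"RSS={rss:.3f} TSS={tss:.3f} ESS={ess:.3f}")
```

For this particular data set the values work out to roughly RSS = 2.4, TSS = 6.0 and ESS = 3.6.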
Remember, $TSS = \sum (y - \bar{y})^2$, but we can add a $+\hat{y} - \hat{y}$ into it, because it cancels itself out. Therefore, $TSS = \sum (y - \hat{y} + \hat{y} - \bar{y})^2$. Expanding these brackets, we get $TSS = \sum (y - \hat{y})^2 + 2\sum (y - \hat{y})(\hat{y} - \bar{y}) + \sum (\hat{y} - \bar{y})^2$
When the line is fitted by least squares with an intercept, the following is always true: $2\sum (y - \hat{y})(\hat{y} - \bar{y}) = 0$. Therefore, $TSS = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$, which you may notice just means that $TSS = RSS + ESS$. If we divide all terms by $TSS$ and rearrange, we get $1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}$.
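Here's the important part: $R^2$ is defined as $1 - \frac{RSS}{TSS}$. When the line is fitted with an intercept, this equals $\frac{ESS}{TSS}$, and since both the numerator and the denominator are sums of squares, $R^2$ cannot be negative.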
BUT
When we don't specify an intercept, $2\sum (y - \hat{y})(\hat{y} - \bar{y})$ does not necessarily equal $0$. This means that $TSS = RSS + ESS + 2\sum (y - \hat{y})(\hat{y} - \bar{y})$.
Dividing all terms by $TSS$, we get $1 - \frac{RSS}{TSS} = \frac{ESS + 2\sum (y - \hat{y})(\hat{y} - \bar{y})}{TSS}$.
Finally, we substitute to get $R^2 = \frac{ESS + 2\sum (y - \hat{y})(\hat{y} - \bar{y})}{TSS}$. This time, the numerator has a term in it which is not a sum of squares, so it can be negative. If it is negative enough to outweigh $ESS$, $R^2$ is negative. When would this happen? $2\sum (y - \hat{y})(\hat{y} - \bar{y})$ is negative when, for most of the points, $y - \hat{y}$ is negative where $\hat{y} - \bar{y}$ is positive, or vice versa. This occurs when the horizontal line at $\bar{y}$ actually explains the data better than the line of best fit.
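The decomposition can be checked numerically. The sketch below uses made-up data that sits well above the origin with a downward trend, and fits it by least squares twice: once with an intercept and once forced through the origin. The helper name `decompose` and the data are my own, for illustration only.

```python
def decompose(y, y_hat):
    """Return (RSS, ESS, cross term, TSS) for observed y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)
    cross = 2 * sum((yi - yh) * (yh - y_bar) for yi, yh in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return rss, ess, cross, tss

# Hypothetical data, far from the origin, trending downwards.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 8.0, 9.0, 7.0, 6.0]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Least-squares fit with an intercept: y_hat = a + b * x
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
with_intercept = decompose(y, [a + b * xi for xi in x])

# Least-squares fit without an intercept: y_hat = b0 * x
b0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
no_intercept = decompose(y, [b0 * xi for xi in x])

for name, (rss, ess, cross, tss) in (("with intercept", with_intercept),
                                     ("no intercept", no_intercept)):
    r2 = 1 - rss / tss
    print(f"{name}: RSS={rss:.3f} ESS={ess:.3f} cross={cross:.3f} "
          f"TSS={tss:.3f} R2={r2:.3f}")
```

With the intercept, the cross term comes out as zero up to rounding and $R^2$ is positive. Without it, the cross term is large and negative, $RSS + ESS + \text{cross}$ still adds up to $TSS$, and $R^2$ is well below zero.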
Here's an exaggerated example of when $R^2$ is negative (Source: University of Houston Clear Lake)
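Put simply: when $R^2 < 0$, a horizontal line through the mean explains the data better than your model.
You also asked about $R^2 = 0$. Since $R^2 = 1 - \frac{RSS}{TSS}$, that happens when $RSS = TSS$: the model explains the data exactly as well as a horizontal line through the mean, and no better.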
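As a quick sanity check of the $R^2 = 0$ case, here is a small sketch in which the "model" is literally the horizontal line at the mean. The data values are invented for the example.

```python
# Hypothetical observations.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_bar = sum(y) / len(y)

# A "model" that predicts the mean for every point.
y_hat = [y_bar] * len(y)

rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
tss = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - rss / tss

print(r2)  # 0.0, because RSS and TSS are the same sum here
```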
I commend you for making it through that. If you found this helpful, you should also upvote fcop's answer here which I had to refer to, because it's been a while.
source
Neither answer so far is entirely correct, so I will try to give my understanding of R-Squared. I have given a more detailed explanation of this on my blog post here "What is R-Squared"
Sum Squared Error
The objective of ordinary least squares regression is to get a line which minimizes the sum squared error. The default line with minimum sum squared error is a horizontal line through the mean. Basically, if you can't do better, you can just predict the mean value and that will give you the minimum sum squared error.
R-Squared is a way of measuring how much better than the mean line you have done based on summed squared error. The equation for R-Squared is $R^2 = 1 - \frac{SS_{Regression}}{SS_{Total}}$, where SS Regression is the sum squared error of the regression line and SS Total is the sum squared error of the horizontal line through the mean.
Now SS Regression and SS Total are both sums of squared terms, so neither can be negative. This means we are taking 1 and subtracting a non-negative value. So the maximum R-Squared value is positive 1, but there is no minimum: R-Squared can be arbitrarily negative. Yes, that is correct: the range of R-squared is between -infinity and 1, not -1 and 1 and not 0 and 1.
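To see that there is no lower bound, here is a rough sketch. The function `r_squared` and the data are my own illustration: the "predictions" are just the true values shifted by a constant, and the larger the shift, the more negative R-Squared gets.

```python
def r_squared(y, y_hat):
    """R-Squared as 1 - SS Regression / SS Total (error sums of squares)."""
    y_bar = sum(y) / len(y)
    ss_regression = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_regression / ss_total

# Hypothetical observations.
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Deliberately bad predictions: the true values shifted by a constant.
scores = {}
for shift in (0.0, 1.0, 10.0, 100.0):
    y_hat = [yi + shift for yi in y]
    scores[shift] = r_squared(y, y_hat)
    print(shift, scores[shift])
```

A shift of 0 is a perfect prediction and gives 1. Each larger shift gives a lower value, and nothing stops it from going as negative as you like.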
What Is Sum Squared Error
Sum squared error is taking the error at every point, squaring it, and adding all the squares. For total error, it uses the horizontal line through the mean, because that gives the lowest sum squared error if you don't have any other information, i.e. can't do a regression.
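As an equation it is this: $SS_{Total} = \sum_i (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of the $y$ values.
Now with regression, our objective is to do better than the mean. For instance, a regression line that follows the trend in the data will give a lower sum squared error than the horizontal line.
The equation for regression sum squared error is this: $SS_{Regression} = \sum_i (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the value the regression line predicts for point $i$.
Ideally, you would have zero regression error, i.e. your regression line would perfectly match the data. In that case you would get an R-Squared value of 1.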
Negative R Squared
All the information above is pretty standard. Now what about negative R-Squared ?
Well it turns out that there is no reason that your regression equation must give lower sum squared error than the mean value. It is generally thought that if you can't make a better prediction than the mean value, you would just use the mean value, but there is nothing forcing that to be the case. You could for instance predict the median for everything.
In actual practice, with ordinary least squares regression, the most common time to get a negative R-Squared value is when you force a point that the regression line must go through. This is typically done by setting the intercept, but you can force the regression line through any point.
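The median example is easy to try. This sketch uses invented data with one large outlier, so that the median is far from the mean, and predicts the median for every point.

```python
from statistics import median

# Hypothetical observations with one large outlier.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
y_bar = sum(y) / len(y)
y_med = median(y)

ss_total = sum((yi - y_bar) ** 2 for yi in y)

# Predict the median for everything.
ss_median = sum((yi - y_med) ** 2 for yi in y)
r2_median = 1 - ss_median / ss_total

# Predict the mean for everything, for comparison.
ss_mean = sum((yi - y_bar) ** 2 for yi in y)
r2_mean = 1 - ss_mean / ss_total

print(f"median prediction: R2={r2_median:.3f}")
print(f"mean prediction:   R2={r2_mean:.3f}")
```

The mean prediction scores 0 by construction. The median prediction scores below 0 here, because the mean is the constant that minimizes sum squared error.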
When you do that the regression line goes through that point, and attempts to get the minimum sum squared error while still going through that point.
By default, the regression equations use average x and average y as the point that the regression line goes through. But if you force it through a point that is far away from where the regression line would normally be, you can get a sum squared error that is higher than using the horizontal line.
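Here is a minimal sketch of forcing the line through a chosen point. The helper `slope_through_point` is a name I made up; it uses the least-squares slope for a line pinned at $(x_0, y_0)$, namely $\sum (x - x_0)(y - y_0) / \sum (x - x_0)^2$. The data is invented as well.

```python
def slope_through_point(x, y, x0, y0):
    """Least-squares slope of a line forced to pass through (x0, y0)."""
    num = sum((xi - x0) * (yi - y0) for xi, yi in zip(x, y))
    den = sum((xi - x0) ** 2 for xi in x)
    return num / den

def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_regression = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_regression / ss_total

# Hypothetical data, well above the origin and trending downwards.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 8.0, 9.0, 7.0, 6.0]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

pinned_points = {
    "mean point": (x_bar, y_bar),  # same as the ordinary fit
    "near the line": (0.0, 11.0),  # close to where the line wants to be
    "origin": (0.0, 0.0),          # far from where the line wants to be
}

results = {}
for name, (x0, y0) in pinned_points.items():
    m = slope_through_point(x, y, x0, y0)
    y_hat = [y0 + m * (xi - x0) for xi in x]
    results[name] = r_squared(y, y_hat)
    print(f"{name}: slope={m:.3f} R2={results[name]:.3f}")
```

Pinning the line at the mean point reproduces the ordinary fit, a nearby point costs only a little, and the origin gives a strongly negative R-Squared for this data.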
In the image below, both regression lines were forced to have a y intercept of 0. This caused a negative R-squared for the data that is far offset from the origin.
For the top set of points, the red ones, the regression line is the best possible regression line that also passes through the origin. It just happens that that regression line is worse than using a horizontal line, and hence gives a negative R-Squared.
Undefined R-Squared
There is one special case no one mentioned, where you can get an undefined R-Squared. If your data is completely horizontal (every y value is the same), then your total sum squared error is zero. As a result you would have a zero divided by zero in the R-squared equation, which is undefined.
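In code, this case needs explicit handling, because plain Python raises `ZeroDivisionError` on a zero denominator. The sketch below returns NaN instead. That is only one possible convention and my own choice here, not something the answer prescribes.

```python
import math

def r_squared(y, y_hat):
    """R-Squared, or NaN when the total sum squared error is zero."""
    y_bar = sum(y) / len(y)
    ss_regression = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    if ss_total == 0:
        # All y values are identical, so the ratio is undefined.
        return math.nan
    return 1 - ss_regression / ss_total

# Hypothetical completely horizontal data, predicted perfectly.
y = [3.0, 3.0, 3.0, 3.0]
flat = r_squared(y, y)
print(flat)  # nan
```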
source
As the previous commenter notes, r^2 is between [0,1], not [-1,+1], so it is impossible to be negative. You cannot square a value and get a negative number. Perhaps you are looking at r, the correlation? It can be between [-1,+1], where zero means there is no relationship between the variables, -1 means there is a perfect negative relationship (as one variable increases, the other decreases), and +1 is a perfect positive relationship (both variables go up or down concordantly).
If indeed you are looking at r^2, then, as the previous commenter describes, you are probably seeing the adjusted r^2, not the actual r^2. Consider what the statistic means: I teach behavioral science statistics, and the easiest way that I've learned to teach my students about the meaning of r^2 is " % variance explained." So if you have r^2=0.5, the model explains 50% of the variation of the dependent (outcome) variable. If you have a negative r^2, it would mean that the model explains a negative % of the outcome variable, which is not an intuitively reasonable suggestion. However, adjusted r^2 takes the sample size (n) and number of predictors (p) into consideration. A formula for calculating it is here. If you have a very low r^2, then it is reasonably easy to get negative values. Granted, a negative adjusted r^2 does not have any more intuitive meaning than regular r^2, but as the previous commenter says, it just means your model is very poor, if not just plain useless.
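For reference, here is a small sketch of adjusted r^2. The link to the formula did not survive, so I am assuming the usual textbook form, $1 - (1 - r^2)\frac{n - 1}{n - p - 1}$, which may not be exactly the one linked. The sample values are invented.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted r^2 for n observations and p predictors (textbook form)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical cases: 20 observations, 3 predictors.
weak = adjusted_r_squared(0.05, n=20, p=3)
decent = adjusted_r_squared(0.50, n=20, p=3)

print(f"r2=0.05 -> adjusted {weak:.4f}")
print(f"r2=0.50 -> adjusted {decent:.4f}")
```

With a very low r^2 of 0.05 the adjusted value drops below zero (about -0.128), while an r^2 of 0.5 stays positive (about 0.406).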
source