I want to fully grasp the notion of $r^2$ describing the amount of variation between variables. Every web explanation is a bit mechanical and obtuse. I want to "get" the concept, not just use the numbers mechanically.
For example: hours studied vs. test score
$r = .8$
$r^2 = .64$
- So, what does this mean?
- 64% of the variability of test scores can be explained by hours?
- How do we know that just by squaring?
regression
correlation
variance
JackOfAll
source
Answers:
Start with the basic idea of variation. Your baseline model is the mean of Y, and the total variation is the sum of the squared deviations from that mean. The R^2 value is the proportion of that variation that is accounted for by using an alternative model. For example, in regression, R^2 tells you how much of the variation in Y you can get rid of by summing up the squared distances from a regression line rather than from the mean.
I think this is made perfectly clear if we think about the simple regression problem plotted out. Consider a typical scatterplot where you have a predictor X along the horizontal axis and a response Y along the vertical axis.
The mean is a horizontal line on the plot where Y is constant. The total variation in Y is the sum of squared differences between the mean of Y and each individual data point. It's the distance between the mean line and every individual point squared and added up.
You can also calculate another measure of variability after you have the regression line from the model. This one is based on the vertical distance between each Y value and the regression line. Rather than summing each (Y - the mean) squared, we sum each (Y - the fitted point on the regression line) squared.
If the regression line is anything but horizontal, we get a smaller total squared distance when we use the fitted regression line rather than the mean; that is, there is less unexplained variation. The ratio of the extra variation explained to the original variation is your R^2. It's the proportion of the original variation in your response that is explained by fitting that regression line.
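To make the two sums of squares concrete, here is a minimal numeric sketch in plain Python. The data and every function name are invented for illustration and do not come from the original post. It compares the R^2 obtained from the two sums of squares with the squared Pearson correlation.

```python
# Hypothetical data, invented for illustration (not from the original post).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 60.0, 68.0, 70.0]


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def sums_of_squares(x, y):
    """Return (total, residual) sums of squares for y."""
    intercept, slope = fit_line(x, y)
    my = mean(y)
    # Squared distances from the horizontal mean line.
    ss_total = sum((yi - my) ** 2 for yi in y)
    # Squared distances from the fitted regression line.
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return ss_total, ss_resid


def r_squared(x, y):
    """Proportion of the total variation removed by using the fitted line."""
    ss_total, ss_resid = sums_of_squares(x, y)
    return (ss_total - ss_resid) / ss_total


def pearson_r(x, y):
    """Pearson correlation computed directly from centered sums."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


print("R^2 from sums of squares:", r_squared(hours, scores))
print("squared correlation:     ", pearson_r(hours, scores) ** 2)
```

The two printed values should agree up to floating-point rounding, which is the "just by squaring" relationship asked about in the question.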
Here is some R code for a graph with the mean, the regression line, and segments from the regression line to each point to help visualize:
source
A mathematical demonstration of the relationship between the two is here: Pearson's correlation and least squares regression analysis.
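For the simple one-predictor case the algebra is short enough to sketch here. The notation ($S_{xx}$, $S_{yy}$, $S_{xy}$) is an assumption made for this illustration and is not taken from the linked page.

```latex
\begin{align*}
S_{xx} &= \sum_i (x_i - \bar{x})^2, \quad
S_{yy} = \sum_i (y_i - \bar{y})^2, \quad
S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) \\
r &= \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad
\hat{\beta} = \frac{S_{xy}}{S_{xx}}, \qquad
\hat{y}_i - \bar{y} = \hat{\beta}\,(x_i - \bar{x}) \\
\sum_i (\hat{y}_i - \bar{y})^2 &= \hat{\beta}^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}} \\
R^2 &= \frac{\sum_i (\hat{y}_i - \bar{y})^2}{S_{yy}}
     = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2
\end{align*}
```

So the explained share of the variation in $y$ is exactly the square of the correlation, at least when there is a single predictor and an intercept.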
I am not sure if there is a geometric or any other intuition that can be offered apart from the math but if I can think of one I will update this answer.
Update: Geometric Intuition
Here is a geometric intuition I came up with. Suppose that you have two variables $x$ and $y$ which are mean centered. (Assuming mean-centered data lets us ignore the intercept, which simplifies the geometric intuition a bit.) Let us first consider the geometry of linear regression. In linear regression, we model $y$ as follows:
$y = x \beta + \epsilon$
Consider the situation when we have two observations from the above data generating process given by the pairs $(y_1, y_2)$ and $(x_1, x_2)$. We can view them as vectors in two-dimensional space as shown in the figure below:
[Figure: http://a.imageshack.us/img202/669/linearregression1.png]
Thus, in terms of the above geometry, our goal is to find a $\beta$ such that the vector $x \beta$ is as close as possible to the vector $y$. Note that different choices of $\beta$ scale $x$ accordingly. Let $\hat{\beta}$ be the value of $\beta$ that gives our best possible approximation of $y$, and denote $\hat{y} = x \hat{\beta}$. Thus,
$y = \hat{y} + \hat{\epsilon}$
From a geometrical perspective we have three vectors: $y$, $\hat{y}$ and $\hat{\epsilon}$. A little thought suggests that we must choose $\hat{\beta}$ such that the three vectors look like those in the figure below:
[Figure: http://a.imageshack.us/img19/9524/intuitionlinearregressi.png]
In other words, we need to choose $\beta$ such that the angle between $x \beta$ and $\hat{\epsilon}$ is $90^\circ$.
So, how much variation in $y$ have we explained with this projection of $y$ onto the vector $x$? Since the data is mean centered, the variation in $y$ (its variance, up to a constant factor) equals $y_1^2 + y_2^2$, which is the square of the distance between the point $y$ and the origin. The variation in $\hat{y}$ is similarly the squared distance between the point $\hat{y}$ and the origin, and so on.
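By the Pythagorean theorem, we have:
$\|y\|^2 = \|\hat{y}\|^2 + \|\hat{\epsilon}\|^2$
Therefore, the proportion of the variation explained by $x$ is $\|\hat{y}\|^2 / \|y\|^2$. Notice also that $\cos(\theta) = \|\hat{y}\| / \|y\|$, and the wiki tells us that the geometrical interpretation of correlation is that correlation equals the cosine of the angle between the mean-centered vectors (for a negative correlation that angle is obtuse and the cosine is $-\|\hat{y}\| / \|y\|$, but the sign vanishes on squaring).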
Therefore, we have the required relationship:
(Correlation)$^2$ = Proportion of variation in $y$ explained by $x$.
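The geometric argument can be checked numerically with a small plain-Python sketch. The data and helper names below are hypothetical, chosen only for illustration. It centers both variables, projects $y$ onto $x$, and then checks the right angle, the Pythagorean split, and the match between the squared cosine and the explained proportion.

```python
import math

# Hypothetical data, invented for illustration (not from the original answer).
x_raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_raw = [52.0, 55.0, 61.0, 60.0, 68.0, 70.0]


def center(values):
    """Subtract the mean so the vector is mean centered."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x = center(x_raw)
y = center(y_raw)

# Orthogonal projection of y onto x gives the least-squares coefficient.
beta_hat = dot(x, y) / dot(x, x)
y_hat = [beta_hat * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, y_hat)]

# Cosine of the angle between the centered vectors (the Pearson correlation).
cos_theta = dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# Share of the squared length of y captured by the projection.
explained = dot(y_hat, y_hat) / dot(y, y)

print("y_hat . resid (should be about 0):", dot(y_hat, resid))
print("|y|^2:", dot(y, y))
print("|y_hat|^2 + |resid|^2:", dot(y_hat, y_hat) + dot(resid, resid))
print("cos(theta)^2:", cos_theta ** 2)
print("explained proportion:", explained)
```

The sketch uses six observations rather than two, so the vectors live in six-dimensional space, but the right-angle argument is unchanged.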
Hope that helps.
source
The Regression By Eye applet could be of use if you're trying to develop some intuition.
It lets you generate data then guess a value for R, which you can then compare with the actual value.
source