Tidak yakin apakah normalisasi adalah kata yang tepat untuk digunakan di sini, tetapi saya akan mencoba yang terbaik untuk menggambarkan apa yang ingin saya tanyakan. Estimator yang digunakan di sini adalah kuadrat terkecil.
Misalkan Anda memiliki , Anda dapat memusatkannya di sekitar rata-rata dengan mana dan , sehingga tidak lagi memiliki pengaruh pada estimasi .
Dengan ini saya berarti β 1 di y = β 1 x ' 1 setara dengan ß 1 di y = β 0 + β 1 x 1 . Kami telah mengurangi persamaan untuk perhitungan kuadrat terkecil yang lebih mudah.
Bagaimana Anda menerapkan metode ini secara umum? Sekarang saya memiliki model , saya mencoba menguranginya menjadi .
Jawaban:
Meskipun saya tidak dapat melakukan keadilan terhadap pertanyaan di sini - yang akan membutuhkan monograf kecil - mungkin bermanfaat untuk merekapitulasi beberapa ide kunci.
Pertanyaan
Mari kita mulai dengan menyatakan kembali pertanyaan dan menggunakan terminologi yang jelas. The Data terdiri dari daftar pasangan memerintahkan . Konstanta yang diketahui α 1 dan α 2 menentukan nilai x 1 , i = exp ( α 1 t i ) dan x 2 , i = exp ( α 2 t i ) . Kami menempatkan model di mana(ti,yi) α1 α2 x1,i=exp(α1ti) x2,i=exp(α2ti)
untuk estimasi konstanta dan β 2 , ε i adalah acak, dan - to aproksimasi yang baik - independen dan memiliki varian yang sama (yang estimasi-nya juga menarik).β1 β2 εi
Latar belakang: linear "matching"
Mosteller and Tukey refer to the variablesx1 = (x1,1,x1,2,…) and x2 as "matchers." They will be used to "match" the values of y=(y1,y2,…) in a specific way, which I will illustrate. More generally, let y and x be any two vectors in the same Euclidean vector space, with y playing the role of "target" and x that of "matcher". We contemplate systematically varying a coefficient λ in order to approximate y by the multiple λx . The best approximation is obtained when λx is as close to y as possible. Equivalently, the squared length of y−λx is minimized.
Salah satu cara untuk memvisualisasikan proses pencocokan ini adalah dengan membuat sebar dan y yang menggambar grafik x → λ x . Jarak vertikal antara titik sebar dan grafik ini adalah komponen dari vektor sisa y - λ x ; jumlah kotak mereka harus dibuat sekecil mungkin. Hingga konstan proporsionalitas, kuadrat ini adalah area lingkaran yang berpusat pada titik ( x i , y i ) dengan jari-jari sama dengan residu: kami ingin meminimalkan jumlah area dari semua lingkaran ini.x y x→λx y−λx (xi,yi)
Here is an example showing the optimal value ofλ in the middle panel:
The points in the scatterplot are blue; the graph ofx→λx is a red line. This illustration emphasizes that the red line is constrained to pass through the origin (0,0) : it is a very special case of line fitting.
Multiple regression can be obtained by sequential matching
Kembali ke pengaturan pertanyaan, kami memiliki satu target dan dua pencocokan x 1 dan x 2 . Kami mencari angka b 1 dan b 2 yang y diperkirakan kira-kira sedekat mungkin dengan b 1 x 1 + b 2 x 2 , lagi-lagi dalam arti jarak paling rendah. Dimulai secara acak dengan x 1 , Mosteller & Tukey cocok dengan variabel yang tersisa x 2 dan y hingga x 1y x1 x2 b1 b2 y b1x1+b2x2 x1 x2 y x1 . Write the residuals for these matches as x2⋅1 and y⋅1 , respectively: the ⋅1 indicates that x1 has been "taken out of" the variable.
We can write
Having takenx1 out of x2 and y , we proceed to match the target residuals y⋅1 to the matcher residuals x2⋅1 . The final residuals are y⋅12 . Algebraically, we have written
This shows that theλ3 in the last step is the coefficient of x2 in a matching of x1 and x2 to y .
We could just as well have proceeded by first takingx2 out of x1 and y , producing x1⋅2 and y⋅2 , and then taking x1⋅2 out of y⋅2 , yielding a different set of residuals y⋅21 . This time, the coefficient of x1 found in the last step--let's call it μ3 --is the coefficient of x1 in a matching of x1 and x2 to y .
Finally, for comparison, we might run a multiple (ordinary least squares regression) ofy against x1 and x2 . Let those residuals be y⋅lm . It turns out that the coefficients in this multiple regression are precisely the coefficients μ3 and λ3 found previously and that all three sets of residuals, y⋅12 , y⋅21 , and y⋅lm , are identical.
Depicting the process
None of this is new: it's all in the text. I would like to offer a pictorial analysis, using a scatterplot matrix of everything we have obtained so far.
Because these data are simulated, we have the luxury of showing the underlying "true" values ofy on the last row and column: these are the values β1x1+β2x2 without the error added in.
The scatterplots below the diagonal have been decorated with the graphs of the matchers, exactly as in the first figure. Graphs with zero slopes are drawn in red: these indicate situations where the matcher gives us nothing new; the residuals are the same as the target. Also, for reference, the origin (wherever it appears within a plot) is shown as an open red circle: recall that all possible matching lines have to pass through this point.
Much can be learned about regression through studying this plot. Some of the highlights are:
The matching ofx2 to x1 (row 2, column 1) is poor. This is a good thing: it indicates that x1 and x2 are providing very different information; using both together will likely be a much better fit to y than using either one alone.
Once a variable has been taken out of a target, it does no good to try to take that variable out again: the best matching line will be zero. See the scatterplots forx2⋅1 versus x1 or y⋅1 versus x1 , for instance.
The valuesx1 , x2 , x1⋅2 , and x2⋅1 have all been taken out of y⋅lm .
Multiple regression ofy against x1 and x2 can be achieved first by computing y⋅1 and x2⋅1 . These scatterplots appear at (row, column) = (8,1) and (2,1) , respectively. With these residuals in hand, we look at their scatterplot at (4,3) . These three one-variable regressions do the trick. As Mosteller & Tukey explain, the standard errors of the coefficients can be obtained almost as easily from these regressions, too--but that's not the topic of this question, so I will stop here.
Code
These data were (reproducibly) created in
R
with a simulation. The analyses, checks, and plots were also produced withR
. This is the code.sumber
y.21
toy.12
in my code.