Penalised regression estimators such as the LASSO and ridge are said to correspond to Bayesian estimators with particular priors. I guess (as I do not know enough about Bayesian statistics) that for a fixed tuning parameter there exists a concrete corresponding prior.
Now, a frequentist would optimise the tuning parameter by cross-validation. Is there a Bayesian equivalent of doing so, and is it used at all? Or does the Bayesian approach effectively fix the tuning parameter before seeing the data? (I guess the latter would be detrimental to predictive performance.)
bayesian
lasso
ridge-regression
Richard Hardy
Answers:
Yes, that is correct. Whenever we have an optimisation problem involving maximisation of a log-likelihood function plus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the logarithm of a prior kernel. To see this, suppose we have a penalty function $w$ with tuning parameter $\lambda$. The objective function in these cases can be written as:

$$H_x(\theta|\lambda) = \ell_x(\theta) - w(\theta|\lambda) = \ln \big[ L_x(\theta) \, \pi(\theta|\lambda) \big] + \text{const} = \ln \pi(\theta|x,\lambda) + \text{const},$$
where we use the prior $\pi(\theta|\lambda) \propto \exp(-w(\theta|\lambda))$. Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking a Bayesian optimisation with a fixed hyperparameter. For LASSO and ridge regression, the penalty functions and the corresponding prior-equivalents are:

$$\text{LASSO: } \ w(\theta|\lambda) = \lambda \sum_{i} |\theta_i| \quad\longleftrightarrow\quad \pi(\theta|\lambda) \propto \exp\Big(-\lambda \sum_{i} |\theta_i|\Big),$$

$$\text{Ridge: } \ w(\theta|\lambda) = \lambda \sum_{i} \theta_i^2 \quad\longleftrightarrow\quad \pi(\theta|\lambda) \propto \exp\Big(-\lambda \sum_{i} \theta_i^2\Big).$$
The former method penalises the regression coefficients according to their absolute magnitude, which is equivalent to imposing a Laplace prior centred at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is equivalent to imposing a normal prior centred at zero.
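As a minimal numerical check of the ridge case (the synthetic data, design and known noise variance below are illustrative assumptions, not part of the derivation above), the ridge estimate at a fixed $\lambda$ coincides with the posterior mode under a zero-centred normal prior with variance $\sigma^2/\lambda$:

```python
import numpy as np

# Synthetic data; the design, coefficients and noise variance are illustrative.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
sigma2 = 1.0                                   # noise variance, assumed known here
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + np.sqrt(sigma2) * rng.normal(size=n)

lam = 3.0   # fixed tuning parameter (equivalently, a fixed prior hyperparameter)

# Frequentist ridge estimate: argmin_b ||y - X b||^2 + lam * ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian MAP estimate under y | b ~ N(X b, sigma2 I) and the zero-centred
# normal prior b ~ N(0, (sigma2 / lam) I): the posterior mode has the same
# closed form, so the two estimates coincide.
beta_map = np.linalg.solve(X.T @ X / sigma2 + (lam / sigma2) * np.eye(p),
                           X.T @ y / sigma2)

print(np.allclose(beta_ridge, beta_map))   # True
```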
So long as the frequentist method can be framed as an optimisation problem (rather than, say, including a hypothesis test or something like that), there will be a Bayesian analogy using an equivalent prior. Just as the frequentist may treat the tuning parameter $\lambda$ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:

$$H_x(\theta, \lambda) = \ell_x(\theta) - w(\theta|\lambda) - h(\lambda) = \ln \big[ L_x(\theta) \, \pi(\theta|\lambda) \, \pi(\lambda) \big] + \text{const} = \ln \pi(\theta, \lambda|x) + \text{const},$$

where we use the hyper-prior $\pi(\lambda) \propto \exp(-h(\lambda))$.
This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating it as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)
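As a small, hedged sketch of what such a treatment can look like in practice, the snippet below finds the joint posterior mode over $(\theta, \lambda)$ for a Gaussian linear model with a zero-centred normal conditional prior and a hypothetical exponential hyperprior on $\lambda$. The model, the hyperprior and the coordinate-ascent scheme are illustrative assumptions; the properly normalised conditional prior is used (it contributes a $(p/2)\ln\lambda$ term), so that the joint mode is well defined:

```python
import numpy as np

# Sketch of a joint MAP over (beta, lambda) for a Gaussian linear model with
#   beta_j | lambda ~ N(0, sigma2 / lambda)   (normalised conditional prior)
#   lambda ~ Exponential(a),  i.e.  pi(lambda) ∝ exp(-a * lambda)   (hypothetical hyperprior)
# Up to constants, the joint log-posterior is
#   -||y - X b||^2 / (2 sigma2) + (p/2) ln(lambda) - lambda ||b||^2 / (2 sigma2) - a * lambda,
# and each block update has a closed form, so coordinate ascent finds the joint mode.
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)
sigma2 = 1.0   # noise variance, assumed known for simplicity
a = 1.0        # hypothetical hyperprior rate

lam = 1.0
for _ in range(200):
    # beta-step: ridge fit at the current lambda
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    # lambda-step: maximise (p/2) ln(lam) - lam * ||beta||^2 / (2 sigma2) - a * lam
    lam = p / (beta @ beta / sigma2 + 2 * a)

print(beta.round(3), round(lam, 3))
```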
Before proceeding to look at $K$-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $x$. If you are willing to allow improper priors then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.
In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $x$ into $K$ sub-vectors $x_1, \dots, x_K$. For each sub-vector $k = 1, \dots, K$ you fit the model with the "training" data $x_{-k}$ and then measure the fit of the model with the "testing" data $x_k$. In each of these fits you obtain an estimator for the model parameters, which then gives you predictions for the testing data, and these can be compared with the actual testing data to give a measure of "loss":

$$\text{Estimator: } \hat{\theta}(x_{-k}, \lambda), \qquad \text{Testing loss: } \mathscr{L}_k\big(x_k \,\big|\, \hat{\theta}(x_{-k}, \lambda)\big).$$
The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

$$\mathscr{L}(x, \lambda) = \sum_{k=1}^{K} \mathscr{L}_k\big(x_k \,\big|\, \hat{\theta}(x_{-k}, \lambda)\big).$$
One then estimates the tuning parameter by minimising the overall loss measure:

$$\hat{\lambda}(x) = \underset{\lambda}{\arg\min} \ \mathscr{L}(x, \lambda).$$
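A minimal sketch of this grid-search minimisation for the ridge penalty, assuming synthetic data, squared-error loss and a simple grid over $\lambda$ (all illustrative choices rather than the only possible ones), might look like this:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 120, 5, 6
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

def ridge_fit(X_train, y_train, lam):
    # ridge estimate: argmin ||y - X b||^2 + lam * ||b||^2
    return np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
                           X_train.T @ y_train)

folds = np.array_split(rng.permutation(n), K)   # partition the data into K sub-vectors

def cv_loss(lam):
    # aggregate squared-error "testing" loss over the K folds
    loss = 0.0
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        beta_k = ridge_fit(X[train], y[train], lam)          # fit on the "training" data x_{-k}
        loss += np.sum((y[test] - X[test] @ beta_k) ** 2)    # loss on the "testing" data x_k
    return loss

lam_grid = np.logspace(-3, 3, 61)
lam_hat = lam_grid[np.argmin([cv_loss(lam) for lam in lam_grid])]
print(lam_hat)
```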
We can see that this is an optimisation problem, and so we now have two separate optimisation problems (i.e., the one described in the sections above for $\theta$, and the one described here for $\lambda$). Since the latter optimisation does not involve $\theta$, we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:

$$H_x^*(\theta, \lambda) = \ell_x(\theta) - w(\theta|\lambda) - \delta \, \mathscr{L}(x, \lambda),$$
where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$-fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.
From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$$H_x^*(\theta, \lambda) = \ell_x(\theta) - w(\theta|\lambda) - \delta \, \mathscr{L}(x, \lambda) = \ln \big[ L_x^*(\theta, \lambda) \, \pi(\theta, \lambda) \big] + \text{const},$$
where $L_x^*(\theta, \lambda) \propto \exp\big(\ell_x(\theta) - \delta \, \mathscr{L}(x, \lambda)\big)$ and $\pi(\theta, \lambda) \propto \exp(-w(\theta|\lambda))$, with a fixed (and very large) hyperparameter $\delta$.
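As a rough numerical illustration of the limiting argument (again with synthetic data, squared-error loss and a grid over $\lambda$ as illustrative assumptions), one can profile $\theta$ out of the combined objective and check that, for a very large $\delta$, its maximiser over $\lambda$ matches the cross-validation estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 120, 5, 6
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

def ridge_fit(X_train, y_train, lam):
    return np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
                           X_train.T @ y_train)

folds = np.array_split(rng.permutation(n), K)

def cv_loss(lam):
    # K-fold squared-error loss, playing the role of L(x, lambda) above
    loss = 0.0
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        beta_k = ridge_fit(X[train], y[train], lam)
        loss += np.sum((y[test] - X[test] @ beta_k) ** 2)
    return loss

def combined_objective(lam, delta):
    # H*_x(theta, lambda) = l_x(theta) - w(theta|lambda) - delta * L(x, lambda),
    # taking l_x(theta) = -0.5 ||y - X theta||^2 (sigma^2 = 1) and
    # w(theta|lambda) = 0.5 * lambda * ||theta||^2, and profiling theta out:
    # the inner maximiser over theta is then exactly the ridge fit at this lambda.
    b = ridge_fit(X, y, lam)
    return -0.5 * np.sum((y - X @ b) ** 2) - 0.5 * lam * np.sum(b ** 2) - delta * cv_loss(lam)

lam_grid = np.logspace(-3, 3, 61)
lam_cv = lam_grid[np.argmin([cv_loss(lam) for lam in lam_grid])]
lam_big_delta = lam_grid[np.argmax([combined_objective(lam, 1e6) for lam in lam_grid])]
print(lam_cv, lam_big_delta)   # the two coincide once delta is large enough
```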
Indeed, most penalized regression methods correspond to placing a particular type of prior on the regression coefficients. For example, you get the LASSO using a Laplace prior, and the ridge using a normal prior. The tuning parameters are the "hyperparameters" under the Bayesian formulation, for which you can place an additional prior in order to estimate them; for example, in the case of the ridge it is often assumed that the inverse variance of the normal distribution has a $\chi^2$ prior. However, as one would expect, the resulting inferences can be sensitive to the choice of the prior distributions for these hyperparameters. For example, for the horseshoe prior there are some theoretical results suggesting that the prior placed on the hyperparameters should reflect the number of non-zero coefficients you expect to have.
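For concreteness, one common way of writing this hierarchical formulation of the ridge (the notation and the degrees-of-freedom parameter $\nu$ here are illustrative assumptions) is:

$$y \mid \beta, \sigma^2 \sim \mathcal{N}(X\beta, \sigma^2 I_n), \qquad \beta_j \mid \tau^2 \sim \mathcal{N}(0, \tau^2), \qquad \tau^{-2} \sim \chi^2_{\nu},$$

so that, conditional on $\tau^2$, the posterior mode of $\beta$ is the ridge estimate with tuning parameter $\lambda = \sigma^2/\tau^2$.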
A nice overview of the links between penalized regression and Bayesian priors is given, for example, by Mallick and Yi.