Saya ingin tahu apakah hutan acak Breiman (hutan acak dalam paket R randomForest) digunakan sebagai kriteria pemisahan (kriteria untuk pemilihan atribut) informasi atau indeks Gini? Saya mencoba untuk mengetahuinya di http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm dan dalam dokumentasi untuk paket randomForest di R. Tetapi satu-satunya hal yang saya temukan adalah bahwa indeks Gini dapat digunakan untuk komputasi kepentingan variabel.
r
random-forest
entropy
gini
seseorang
sumber
sumber
Jawaban:
Paket randomForest dalam R oleh A. Liaw adalah port dari kode asli menjadi campuran kode-c (diterjemahkan) beberapa kode fortran yang tersisa dan kode wrapper R. Untuk memutuskan pemisahan terbaik secara keseluruhan di seluruh titik break dan di seluruh variabel mtry, kode ini menggunakan fungsi penilaian yang mirip dengan gini-gain:
WhereX is a given feature, N is the node on which the split is to be made, and N1 and N2 are the two child nodes created by splitting N . |.| is the number of elements in a node.
AndGini(N)=1−∑Kk=1p2k , where K is the number of categories in the node
But the applied scoring function is not the exactly same, but instead a equivalent more computational efficient version.Gini(N) and |N| are constant for all compared splits and thus omitted.
Also lets inspect the part if the sum of squared prevalence in a node(1) is computed as|N2||N|Gini(N2)∝|N2|Gini(N2)=|N2|(1−∑Kk=1p2k)=|N2|∑nclass22,k|N2|2
wherenclass1,k is the class count of target-class k in daughter node 1. Notice |N2| is placed both in nominator and denominator.
removing the trivial constant1− from equation such that best split decision is to maximize nodes size weighted sum of squared class prevalence...
score=|N1|∑Kk=1p21,k+|N2|∑Kk=1p22,k=|N1|∑Kk=1nclass21,k|N1|2+|N2|∑Kk=1nclass22,k|N2|2
=∑Kk=1nclass22,k1|N1|−1+∑Kk=1nclass22,k1|N1|−2
=nominator1/denominator1+nominator2/denominator2
The implementation also allows for classwise up/down weighting of samples. Also very important when the implementation update this modified gini-gain, moving a single sample from one node to the other is very efficient. The sample can be substracted from nominators/denominators of one node and added to the others. I wrote a prototype-RF some months ago, ignorantly recomputing from scratch gini-gain for every break-point and that was slower :)
If several splits scores are best, a random winner is picked.
This answer was based on inspecting source file "randomForest.x.x.tar.gz/src/classTree.c" line 209-250
sumber