Does Breiman's random forest use information gain or the Gini index?


I would like to know whether Breiman's random forest (the random forest in the R randomForest package) uses information gain or the Gini index as its splitting criterion (the criterion for attribute selection). I tried to find out at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and in the documentation of the randomForest package in R, but the only thing I found is that the Gini index can be used for computing variable importance.

I am also wondering whether the trees in the randomForest package are binary or not.

Answer:


The randomForest package in R by A. Liaw is a port of the original code: a mixture of C code (translated), some remaining Fortran code, and R wrapper code. To decide the overall best split across break points and across the mtry variables, the code uses a scoring function similar to the Gini gain:

$$\text{GiniGain}(N,X) = \text{Gini}(N) - \frac{|N_1|}{|N|}\,\text{Gini}(N_1) - \frac{|N_2|}{|N|}\,\text{Gini}(N_2)$$

where $X$ is a given feature, $N$ is the node on which the split is to be made, and $N_1$ and $N_2$ are the two child nodes created by splitting $N$. $|\cdot|$ is the number of elements in a node.

And $\text{Gini}(N) = 1 - \sum_{k=1}^{K} p_k^2$, where $K$ is the number of categories in the node.
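To make the formulas concrete, here is a minimal R sketch (illustrative only, not the package's C implementation; gini() and gini_gain() are hypothetical helper names) that computes the Gini gain of a candidate split of a class-label vector:

    # Gini impurity of a node, given its vector of class labels
    gini <- function(y) {
      p <- table(y) / length(y)   # class prevalences p_k
      1 - sum(p^2)
    }

    # Gini gain of splitting node N into children N1 (mask TRUE) and N2 (mask FALSE)
    gini_gain <- function(y, mask) {
      n  <- length(y)
      y1 <- y[mask]
      y2 <- y[!mask]
      gini(y) - (length(y1) / n) * gini(y1) - (length(y2) / n) * gini(y2)
    }

    # Example: score the split Petal.Length < 2.5 on the iris data
    gini_gain(iris$Species, iris$Petal.Length < 2.5)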

But the applied scoring function is not exactly the same; instead it is an equivalent, more computationally efficient version. $\text{Gini}(N)$ and $|N|$ are constant across all compared splits and are therefore omitted.

Let's also inspect how the weighted child impurity is computed, taking node 2 as an example. Since $|N|$ is constant, $\frac{|N_2|}{|N|}\,\text{Gini}(N_2)$ can be replaced by

$$|N_2|\,\text{Gini}(N_2) = |N_2|\left(1 - \sum_{k=1}^{K} p_{2,k}^2\right) = |N_2| - |N_2|\,\frac{\sum_{k=1}^{K} n_{class_{2,k}}^2}{|N_2|^2} = |N_2| - \frac{\sum_{k=1}^{K} n_{class_{2,k}}^2}{|N_2|},$$

where $n_{class_{2,k}}$ is the count of target class $k$ in daughter node 2 (and $n_{class_{1,k}}$ analogously in daughter node 1). Notice that $|N_2|$ appears in both the numerator and the denominator.

Removing the trivial constant terms from the equation, the best-split decision becomes: maximize the node-size-weighted sum of squared class prevalences,

$$\text{score} = |N_1|\sum_{k=1}^{K} p_{1,k}^2 + |N_2|\sum_{k=1}^{K} p_{2,k}^2 = |N_1|\sum_{k=1}^{K}\frac{n_{class_{1,k}}^2}{|N_1|^2} + |N_2|\sum_{k=1}^{K}\frac{n_{class_{2,k}}^2}{|N_2|^2} = \frac{\sum_{k=1}^{K} n_{class_{1,k}}^2}{|N_1|} + \frac{\sum_{k=1}^{K} n_{class_{2,k}}^2}{|N_2|} = \text{numerator}_1/\text{denominator}_1 + \text{numerator}_2/\text{denominator}_2$$
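As an illustration that the simplified score ranks splits of the same parent node the same way (again an R sketch with made-up names, assuming both children are non-empty):

    # Node-size-weighted sum of squared class prevalences, written as
    # sum(counts^2) / node_size for each child; larger is better.
    split_score <- function(y, mask) {
      score_child <- function(y_child) {
        counts <- table(y_child)           # n_class_{j,k}
        sum(counts^2) / length(y_child)    # numerator_j / denominator_j
      }
      score_child(y[mask]) + score_child(y[!mask])
    }

    # Ranks candidate splits of the same parent node identically to gini_gain()
    split_score(iris$Species, iris$Petal.Length < 2.5)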

The implementation also allows for class-wise up/down-weighting of samples. Also very importantly, when the implementation updates this modified Gini gain, moving a single sample from one node to the other is very efficient: the sample can be subtracted from the numerators/denominators of one node and added to the other's. I wrote a prototype RF some months ago, naively recomputing the Gini gain from scratch for every break point, and that was slower :)
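A rough R sketch of that idea (hypothetical function and variable names, not code from the package, and simplified in that it re-sums the squared class counts at each step rather than updating the running numerators in place as the C code does):

    # Sweep the break points of one feature left to right, moving one sample
    # at a time from the right child to the left child and updating running
    # class counts before rescoring the split.
    best_split_scan <- function(x, y) {
      ord <- order(x)
      x <- x[ord]; y <- y[ord]
      classes <- levels(factor(y))
      counts_left  <- setNames(numeric(length(classes)), classes)
      counts_right <- table(factor(y, levels = classes))
      n_left <- 0; n_right <- length(y)
      best <- list(score = -Inf, breakpoint = NA)
      for (i in seq_len(length(y) - 1)) {
        k <- as.character(y[i])
        counts_left[k]  <- counts_left[k] + 1    # add sample to left child
        counts_right[k] <- counts_right[k] - 1   # remove it from right child
        n_left <- n_left + 1; n_right <- n_right - 1
        if (x[i] == x[i + 1]) next               # cannot split between equal values
        score <- sum(counts_left^2) / n_left + sum(counts_right^2) / n_right
        if (score > best$score)
          best <- list(score = score, breakpoint = (x[i] + x[i + 1]) / 2)
      }
      best
    }

    best_split_scan(iris$Petal.Length, iris$Species)

Sorting the feature once and then sweeping the break points left to right is what makes the per-sample update cheap.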

If several split scores tie for best, a random winner is picked.

This answer is based on inspecting the source file "randomForest.x.x.tar.gz/src/classTree.c", lines 209-250.

Soren Havelund Welling