Cara mengetahui mengapa proses terbunuh di server

17

Saya menjalankan pekerjaan di server kami tempo hari yang mengambil 70% dari memori. Ketika saya login satu hari kemudian untuk memeriksanya, pekerjaan itu telah mati (katanya "terbunuh" di terminal).

Pertanyaan saya adalah, bisakah saya mencari tahu apa yang terjadi?

  1. Apakah itu karena pengguna lain memulai pekerjaan yang mengambil> 30% memori, jadi proses saya mati?
  2. Apakah seorang administrator membunuhnya?
  3. ?

Apakah ada cara untuk mengetahui apa yang sebenarnya terjadi?

BillyJean
sumber

Jawaban:

20

Jika suatu proses mengkonsumsi terlalu banyak memori maka pembunuh kernel "Out of Memory" (OOM) akan secara otomatis membunuh proses yang menyinggung. Sepertinya ini mungkin terjadi pada pekerjaan Anda. Log kernel harus menunjukkan tindakan pembunuh OOM, jadi gunakan perintah "dmesg" untuk melihat apa yang terjadi, misalnya

dmesg | less

Anda akan melihat pesan pembunuh OOM, seperti berikut ini:

[   54.125380] Out of memory: Kill process 8320 (stress-ng-brk) score 324 or sacrifice child
[   54.125382] Killed process 8320 (stress-ng-brk) total-vm:1309660kB, anon-rss:1287796kB, file-rss:76kB
[   54.522906] gmain invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[   54.522908] gmain cpuset=accounts-daemon.service mems_allowed=0
[   54.522912] CPU: 6 PID: 1032 Comm: gmain Not tainted 4.4.0-0-generic #3-Ubuntu
[   54.522913] Hardware name: Intel Corporation Skylake Client platform/Skylake DT DDR4 RVP8, BIOS SKLSE2R1.R00.B089.B00.1506160228 06/16/2015
[   54.522914]  0000000000000000 000000002d879fe9 ffff88016d727a58 ffffffff813d8604
[   54.522915]  ffff88016d727c50 ffff88016d727ac8 ffffffff8120272e 0000000000000015
[   54.522916]  0000000000000000 ffff880080ab3600 ffff880086725880 ffff88016d727ab8
[   54.522917] Call Trace:
[   54.522921]  [<ffffffff813d8604>] dump_stack+0x44/0x60
[   54.522924]  [<ffffffff8120272e>] dump_header+0x5a/0x1c5
[   54.522926]  [<ffffffff81376bd8>] ? apparmor_capable+0xb8/0x120
[   54.522928]  [<ffffffff8118b472>] oom_kill_process+0x202/0x3b0
[   54.522929]  [<ffffffff8118b885>] out_of_memory+0x215/0x460
[   54.522931]  [<ffffffff81191740>] __alloc_pages_nodemask+0x9b0/0xb40
[   54.522933]  [<ffffffff811da7cc>] alloc_pages_current+0x8c/0x110
[   54.522934]  [<ffffffff81187d75>] __page_cache_alloc+0xb5/0xc0
[   54.522935]  [<ffffffff81189f4a>] filemap_fault+0x14a/0x3f0
[   54.522937]  [<ffffffff811b6140>] __do_fault+0x50/0xe0
[   54.522938]  [<ffffffff811b9b82>] handle_mm_fault+0xf92/0x1840
[   54.522939]  [<ffffffff812526a7>] ? eventfd_ctx_read+0x67/0x210
[   54.522941]  [<ffffffff81068517>] __do_page_fault+0x197/0x400
[   54.522942]  [<ffffffff810687a2>] do_page_fault+0x22/0x30
[   54.522944]  [<ffffffff8180e2f8>] page_fault+0x28/0x30
[   54.522945] Mem-Info:
[   54.522947] active_anon:788399 inactive_anon:33532 isolated_anon:0
                active_file:83 inactive_file:37 isolated_file:0
                unevictable:1 dirty:10 writeback:0 unstable:0
                slab_reclaimable:5166 slab_unreclaimable:13868
                mapped:5646 shmem:9752 pagetables:4476 bounce:0
                free:7576 free_pcp:0 free_cma:0
[   54.522948] Node 0 DMA free:15476kB min:28kB low:32kB high:40kB active_anon:144kB inactive_anon:216kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:80kB slab_reclaimable:0kB slab_unreclaimable:48kB kernel_stack:0kB pagetables:4kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   54.522951] lowmem_reserve[]: 0 2072 3862 3862
[   54.522952] Node 0 DMA32 free:11220kB min:4204kB low:5252kB high:6304kB active_anon:1711968kB inactive_anon:80964kB active_file:236kB inactive_file:100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2206296kB managed:2125964kB mlocked:0kB dirty:36kB writeback:0kB mapped:17948kB shmem:26240kB slab_reclaimable:8988kB slab_unreclaimable:26036kB kernel_stack:2656kB pagetables:9348kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3776 all_unreclaimable? yes
[   54.522955] lowmem_reserve[]: 0 0 1790 1790
[   54.522956] Node 0 Normal free:3608kB min:3628kB low:4532kB high:5440kB active_anon:1441484kB inactive_anon:52948kB active_file:96kB inactive_file:48kB unevictable:4kB isolated(anon):0kB isolated(file):0kB present:1900544kB managed:1833172kB mlocked:4kB dirty:4kB writeback:0kB mapped:4556kB shmem:12688kB slab_reclaimable:11676kB slab_unreclaimable:29388kB kernel_stack:2448kB pagetables:8552kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:924 all_unreclaimable? yes
[   54.522958] lowmem_reserve[]: 0 0 0 0
[   54.522959] Node 0 DMA: 7*4kB (UME) 3*8kB (UM) 4*16kB (UME) 4*32kB (UME) 2*64kB (U) 4*128kB (UME) 1*256kB (E) 2*512kB (ME) 3*1024kB (UME) 1*2048kB (E) 2*4096kB (M) = 15476kB
[   54.522965] Node 0 DMA32: 118*4kB (UME) 36*8kB (UME) 62*16kB (UME) 94*32kB (UME) 34*64kB (UME) 24*128kB (UME) 5*256kB (UE) 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB = 11800kB
[   54.522969] Node 0 Normal: 151*4kB (UME) 39*8kB (UME) 77*16kB (UME) 38*32kB (UME) 9*64kB (ME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3940kB
[   54.522974] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   54.522974] Node 0 hugepages_total=256 hugepages_free=256 hugepages_surp=0 hugepages_size=2048kB
[   54.522975] 9932 total pagecache pages
[   54.522976] 0 pages in swap cache
[   54.522976] Swap cache stats: add 1831590, delete 1831590, find 5929/10969
[   54.522977] Free swap  = 0kB
[   54.522977] Total swap = 0kB
[   54.522978] 1030706 pages RAM
[   54.522978] 0 pages HighMem/MovableOnly
[   54.522979] 36950 pages reserved
[   54.522979] 0 pages cma reserved
[   54.522979] 0 pages hwpoisoned
[   54.522980] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   54.522986] [  285]     0   285    10173     1022      23       3        0             0 systemd-journal
[   54.522988] [  312]     0   312    11192      266      22       3        0         -1000 systemd-udevd
[   54.522989] [  623]   100   623    25590      569      20       4        6             0 systemd-timesyn
[   54.522990] [  823]     0   823     5859     1723      14       3        0             0 dhclient
[   54.522991] [  917]     0   917     7152       96      18       3        2             0 systemd-logind
[   54.522992] [  936]     0   936     6310      223      16       3        0             0 smartd
[   54.522993] [  943]     0   943   112847      523      72       3        9             0 NetworkManager
[   54.522993] [  952]     0   952    84334      421      68       4        0             0 ModemManager
[   54.522994] [  957]     0   957     4797       40      15       4        0             0 atd
[   54.522995] [  961]   115   961    93456      912      80       4        0             0 whoopsie
[   54.522996] [  963]     0   963     4865       65      13       3        0             0 irqbalance
[   54.522997] [  964]   104   964    65667      224      30       4        9             0 rsyslogd
[   54.522998] [  966]     0   966    23282       34      13       3        0             0 lxcfs
[   54.522999] [  971]   105   971    10926      318      26       3        8          -900 dbus-daemon
[   54.523000] [ 1008]     0  1008     9570       82      25       3        0             0 cgmanager
[   54.523001] [ 1016]     0  1016    70808      240      41       3        0             0 accounts-daemon
[   54.523002] [ 1019]     0  1019     1119       46       8       3        0             0 ondemand
[   54.523003] [ 1022]     0  1022     7233       68      20       3        0             0 cron
[   54.523004] [ 1028]   109  1028    11218       97      26       3        3             0 avahi-daemon
[   54.523005] [ 1030]     0  1030     1807       20      10       3        0             0 sleep
[   54.523006] [ 1037]   109  1037    11185       82      25       3        0             0 avahi-daemon
[   54.523007] [ 1047]     0  1047   141966     2188     156       4        3             0 libvirtd
[   54.523008] [ 1053]     0  1053    13902      163      33       3        0         -1000 sshd
[   54.523009] [ 1057]     0  1057    69683      586      40       3       12             0 polkitd
[   54.523010] [ 1072]     0  1072    10963      134      24       3        0             0 wpa_supplicant
[   54.523011] [ 1081]     0  1081    87582      696      39       3       23             0 lightdm
[   54.523012] [ 1088]     0  1088    99946     6138      97       3       15             0 Xorg
[   54.523012] [ 1111]     0  1111     1099       45       8       3        0             0 acpid
[   54.523013] [ 1125]     0  1125    56533      191      47       4       14             0 lightdm
[   54.523014] [ 1129]   114  1129    11957      850      27       3        0             0 systemd
[   54.523015] [ 1130]   114  1130    15825      501      33       3        0             0 (sd-pam)
[   54.523029] [ 1136]   114  1136    30728      108      26       4        0             0 gnome-keyring-d
[   54.523030] [ 1138]   114  1138     1119       20       8       3        0             0 lightdm-greeter
[   54.523031] [ 1143]   114  1143    10743      145      25       3       13             0 dbus-daemon
[   54.523032] [ 1144]   114  1144   227063     2039     170       4       17             0 unity-greeter
[   54.523032] [ 1146]   114  1146    84488      626      34       3        0             0 at-spi-bus-laun
[   54.523033] [ 1151]   114  1151    10680       97      27       4        0             0 dbus-daemon
[   54.523034] [ 1153]   114  1153    51706      157      37       3        3             0 at-spi2-registr
[   54.523035] [ 1159]   114  1159    68584      154      37       3        0             0 gvfsd
[   54.523036] [ 1164]   114  1164    85325      145      32       3        0             0 gvfsd-fuse
[   54.523037] [ 1174]   114  1174    44626      121      23       3        3             0 dconf-service
[   54.523038] [ 1197]     0  1197    20665      147      44       3        0             0 lightdm
[   54.523038] [ 1201]   114  1201    11465      160      27       3        0             0 upstart
[   54.523039] [ 1204]   114  1204   144936     1323     136       4        4             0 nm-applet
[   54.523040] [ 1206]   114  1206    88647      256      41       3       26             0 indicator-messa
[   54.523041] [ 1207]   114  1207    83323      127      31       3        0             0 indicator-bluet
[   54.523042] [ 1208]   114  1208   122044       98      37       4       12             0 indicator-power
[   54.523043] [ 1209]   114  1209   132868      439      75       3        0             0 indicator-datet
[   54.523044] [ 1210]   114  1210   140272     1504     127       4        1             0 indicator-keybo
[   54.523045] [ 1211]   114  1211   134142      426      68       4        8             0 indicator-sound
[   54.523045] [ 1212]   114  1212   189042      260      47       4        0             0 indicator-sessi
[   54.523046] [ 1218]   114  1218   117391      350      89       4        0             0 indicator-appli
[   54.523047] [ 1232]     0  1232     7973       81      20       3       11             0 bluetoothd
[   54.523048] [ 1238]   114  1238   152474     1084     129       3       15             0 unity-settings-
[   54.523049] [ 1261]   114  1261   104039      719      78       4        0             0 pulseaudio
[   54.523050] [ 1272]   120  1272    45874       77      24       3        1             0 rtkit-daemon
[   54.523051] [ 1293]     0  1293    68995      324      53       3       12             0 upowerd
[   54.523052] [ 1296]   114  1296    15493      366      33       3        0             0 gconfd-2
[   54.523053] [ 1342]   110  1342    75254     1170      49       3        0             0 colord
[   54.523054] [ 1429]   113  1429    12484       98      27       3        0             0 dnsmasq
[   54.523054] [ 1430]     0  1430    12477       94      27       3        0             0 dnsmasq
[   54.523055] [ 1514]     0  1514    22408      226      49       3        0             0 sshd
[   54.523056] [ 1570]  1000  1570    11958      853      26       3        0             0 systemd
[   54.523057] [ 1571]  1000  1571    15825      501      33       3        0             0 (sd-pam)
[   54.523058] [ 1631]  1000  1631    22408      244      46       3        0             0 sshd
[   54.523058] [ 1632]  1000  1632     5779      619      16       3        0             0 bash
[   54.523059] [ 1692]   118  1692    11320       77      25       3       14             0 kerneloops
[   54.523060] [ 1745]     0  1745     3964       41      13       3        0             0 agetty
[   54.523061] [ 1768]   125  1768    13192       98      27       3        0             0 dnsmasq
[   54.523062] [ 2276]   126  2276    32160      388      58       3        0             0 exim4
[   54.523062] [ 8310]  1000  8310     5508      661      14       3        0             0 stress-ng
[   54.523063] [ 8311]  1000  8311     5508       49      13       3        0             0 stress-ng-brk
[   54.523064] [ 8312]  1000  8312     5508       46      13       3        0             0 stress-ng-brk
[   54.523065] [ 8313]  1000  8313     5508       46      13       3        0             0 stress-ng-brk
[   54.523065] [ 8314]  1000  8314     5508       46      13       3        0             0 stress-ng-brk
[   54.523066] [ 8321]  1000  8321   365871   360407     717       4        0             0 stress-ng-brk
[   54.523067] [ 8322]  1000  8322   239424   233959     470       3        0             0 stress-ng-brk
[   54.523068] [ 8323]  1000  8323   143599   138152     283       3        0             0 stress-ng-brk
[   54.523069] [ 8324]  1000  8324    54613    49145     109       3        0             0 stress-ng-brk
[   54.523070] Out of memory: Kill process 8321 (stress-ng-brk) score 363 or sacrifice child
[   54.523072] Killed process 8321 (stress-ng-brk) total-vm:1463484kB, anon-rss:1441628kB, file-rss:0kB

Namun, pesan ini mungkin telah dihapus dari log kernel, jadi orang mungkin perlu memeriksa log kernel /var/log/kern.log*

Pengaturan memori virtual default untuk Linux adalah untuk melakukan terlalu banyak memori. Ini berarti kernel akan memungkinkan seseorang untuk mengalokasikan lebih banyak memori daripada yang tersedia, memungkinkan proses untuk memetakan wilayah memori besar karena biasanya tidak semua halaman dalam alokasi digunakan. Namun, kadang-kadang suatu proses akan membaca / menulis ke semua halaman yang terlalu berkomitmen dan kernel tidak dapat memberikan cukup memori fisik + swap, sehingga pembunuh OOM berusaha untuk menemukan kandidat terbaik proses overcommitted dan membunuhnya.

Colin Ian King
sumber
2
Alih-alih dmesg | less, dmesg | grep -i killmungkin lebih bermanfaat. Dan karenanya grep /var/log/kern.log* -ie kill,.
Alex
1
Memang, tapi kadang-kadang juga berguna untuk melihat seluruh konteks tempat pembuangan pembunuh OOM karena menunjukkan proses kandidat lain yang mungkin telah terbunuh (misalnya ukuran memori) dan karenanya mengapa proses tertentu terbunuh dari banyak yang berjalan.
Colin Ian King
Apakah mungkin melihat tanggal pada entri tertentu? Kalau tidak, saya tidak bisa menentukannya secara unik
BillyJean
Jika Anda dapat membaca log kernel di / var / log maka Anda harus dapat melihat nama proses yang dibunuh dengan pembunuh OOM dan tanggal dan waktu ketika ini dicatat.
Colin Ian King
@ColinIanKing Log kernel hanya tersedia untuk sudo, sayangnya saya tidak dapat mengaksesnya
BillyJean
1

Jadi, jika Anda ingin segera melihat log kernel, pekerjaan tersebut dimatikan, bungkus dengan skrip bash berikut:

#!/bin/bash
your_job_here
ret=$?
#
#  returns > 127 are a SIGNAL
#
if [ $ret -gt 127 ]; then
        sig=$((ret - 128))
        echo "Got SIGNAL $sig"
        if [ $sig -eq $(kill -l SIGKILL) ]; then
                echo "process was killed with SIGKILL"
                dmesg > $HOME/dmesg-kill.log
        fi
fi

Catatan: "your_job_here" adalah nama program / pekerjaan yang ingin Anda jalankan. Skrip ini memeriksa kode pengembalian program dan akan memeriksa apakah program tersebut dimatikan dengan SIGKILL dan jika demikian, akan langsung membuang dmesg ke direktori home Anda dalam file bernama dmesg-kill.log

Semoga itu bisa membantu

Colin Ian King
sumber