Kami memiliki beberapa server EC2 yang mengalami kebocoran memori selama berhari-hari atau berminggu-minggu. Akhirnya akan ada banyak GB memori yang digunakan (sesuai dengan alat suka free
dan htop
) dan, jika kita tidak me-restart server, proses kita mulai terbunuh OOM.
Salah satu server tersebut memiliki ram 15GB. Inilah output dari free -m
:
total used free shared buffers cached
Mem: 15039 3921 11118 0 0 7
-/+ buffers/cache: 3913 11126
Swap: 0 0 0
Server ini diam; Saya telah membunuh sebagian besar proses userland. Tidak ada proses di htop yang ditampilkan> 100k VIRT. Baru-baru ini saya berlari echo 3 > /proc/sys/vm/drop_caches
, tanpa hasil (itu sebabnya buffers
dan cached
sangat kecil). Selain itu:
- Menyodok masuk
/proc/slabinfo
danslabtop
tidak menunjukkan sesuatu yang menjanjikan - Tidak ada apa-apa di / run / shm
Inilah output dari cat /proc/meminfo
:
MemTotal: 15400880 kB
MemFree: 11385688 kB
Buffers: 564 kB
Cached: 7792 kB
SwapCached: 0 kB
Active: 27668 kB
Inactive: 2012 kB
Active(anon): 21368 kB
Inactive(anon): 380 kB
Active(file): 6300 kB
Inactive(file): 1632 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 21380 kB
Mapped: 7208 kB
Shmem: 380 kB
Slab: 39260 kB
SReclaimable: 16456 kB
SUnreclaim: 22804 kB
KernelStack: 1352 kB
PageTables: 2872 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7700440 kB
Committed_AS: 39072 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 30336 kB
VmallocChunk: 34359691552 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 36864 kB
DirectMap2M: 15822848 kB
Anda dapat melihat bahwa ada celah besar antara MemFree
dan MemTotal
itu tidak dijelaskan oleh metrik meminfo lainnya.
Tahu di mana memori ini pergi, atau bagaimana saya bisa men-debug lebih lanjut?
Lebih banyak info server, jika relevan:
$ lsb_release -d
Description: Ubuntu 14.04.1 LTS
$ uname -r
3.13.0-36-generic
Pembaruan: Berikut adalah beberapa perintah dan hasilnya:
# dmesg | fgrep 'Memory:'
[ 0.000000] Memory: 15389980K/15728244K available (7373K kernel code, 1144K rwdata, 3404K rodata, 1336K init, 1440K bss, 338264K reserved)
# awk '{print $2 " " $1}' /proc/modules | sort -nr | head -5
106678 psmouse
97812 raid6_pq
86484 raid456
69418 floppy
55624 aesni_intel
# cat /proc/mounts | grep tmp
udev /dev devtmpfs rw,relatime,size=7695004k,nr_inodes=1923751,mode=755 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=1540088k,mode=755 0 0
none /sys/fs/cgroup tmpfs rw,relatime,size=4k,mode=755 0 0
none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /run/user tmpfs rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
# df -h /dev /run /sys/fs/cgroup /run/lock /run/shm /run/user
Filesystem Size Used Avail Use% Mounted on
udev 7.4G 12K 7.4G 1% /dev
tmpfs 1.5G 368K 1.5G 1% /run
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 7.4G 0 7.4G 0% /run/shm
none 100M 0 100M 0% /run/user
Pembaruan 2 : Inilah keseluruhan output dari ps aux
:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 33636 2368 ? Ss 2015 0:03 /sbin/init
root 2 0.0 0.0 0 0 ? S 2015 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S 2015 0:11 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S 2015 1:31 [rcu_sched]
root 8 0.0 0.0 0 0 ? S 2015 0:30 [rcuos/0]
root 9 0.0 0.0 0 0 ? S 2015 0:25 [rcuos/1]
root 10 0.0 0.0 0 0 ? S 2015 0:33 [rcuos/2]
root 11 0.0 0.0 0 0 ? S 2015 0:25 [rcuos/3]
root 12 0.0 0.0 0 0 ? S 2015 0:14 [rcuos/4]
root 13 0.0 0.0 0 0 ? S 2015 0:14 [rcuos/5]
root 14 0.0 0.0 0 0 ? S 2015 0:14 [rcuos/6]
root 15 0.0 0.0 0 0 ? S 2015 0:33 [rcuos/7]
root 16 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/8]
root 17 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/9]
root 18 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/10]
root 19 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/11]
root 20 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/12]
root 21 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/13]
root 22 0.0 0.0 0 0 ? S 2015 0:00 [rcuos/14]
root 23 0.0 0.0 0 0 ? S 2015 0:00 [rcu_bh]
root 24 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/0]
root 25 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/1]
root 26 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/2]
root 27 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/3]
root 28 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/4]
root 29 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/5]
root 30 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/6]
root 31 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/7]
root 32 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/8]
root 33 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/9]
root 34 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/10]
root 35 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/11]
root 36 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/12]
root 37 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/13]
root 38 0.0 0.0 0 0 ? S 2015 0:00 [rcuob/14]
root 39 0.0 0.0 0 0 ? S 2015 0:01 [migration/0]
root 40 0.0 0.0 0 0 ? S 2015 0:06 [watchdog/0]
root 41 0.0 0.0 0 0 ? S 2015 0:05 [watchdog/1]
root 42 0.0 0.0 0 0 ? S 2015 0:00 [migration/1]
root 43 0.0 0.0 0 0 ? S 2015 0:08 [ksoftirqd/1]
root 45 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/1:0H]
root 46 0.0 0.0 0 0 ? S 2015 0:05 [watchdog/2]
root 47 0.0 0.0 0 0 ? S 2015 0:01 [migration/2]
root 48 0.0 0.0 0 0 ? S 2015 0:08 [ksoftirqd/2]
root 50 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/2:0H]
root 51 0.0 0.0 0 0 ? S 2015 0:06 [watchdog/3]
root 52 0.0 0.0 0 0 ? S 2015 0:01 [migration/3]
root 53 0.0 0.0 0 0 ? S 2015 0:17 [ksoftirqd/3]
root 55 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/3:0H]
root 56 0.0 0.0 0 0 ? S 2015 0:07 [watchdog/4]
root 57 0.0 0.0 0 0 ? S 2015 0:01 [migration/4]
root 58 0.0 0.0 0 0 ? S 2015 0:02 [ksoftirqd/4]
root 60 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/4:0H]
root 61 0.0 0.0 0 0 ? S 2015 0:06 [watchdog/5]
root 62 0.0 0.0 0 0 ? S 2015 0:01 [migration/5]
root 63 0.0 0.0 0 0 ? S 2015 0:07 [ksoftirqd/5]
root 65 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/5:0H]
root 66 0.0 0.0 0 0 ? S 2015 0:06 [watchdog/6]
root 67 0.0 0.0 0 0 ? S 2015 0:01 [migration/6]
root 68 0.0 0.0 0 0 ? S 2015 0:04 [ksoftirqd/6]
root 70 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/6:0H]
root 71 0.0 0.0 0 0 ? S 2015 0:06 [watchdog/7]
root 72 0.0 0.0 0 0 ? S 2015 0:02 [migration/7]
root 73 0.0 0.0 0 0 ? S 2015 0:17 [ksoftirqd/7]
root 74 0.0 0.0 0 0 ? S 2015 0:14 [kworker/7:0]
root 75 0.0 0.0 0 0 ? S< 2015 0:00 [kworker/7:0H]
root 76 0.0 0.0 0 0 ? S< 2015 0:00 [khelper]
root 77 0.0 0.0 0 0 ? S 2015 0:00 [kdevtmpfs]
root 78 0.0 0.0 0 0 ? S< 2015 0:00 [netns]
root 79 0.0 0.0 0 0 ? S 2015 0:00 [xenwatch]
root 80 0.0 0.0 0 0 ? S 2015 0:00 [xenbus]
root 81 0.0 0.0 0 0 ? S 2015 0:39 [kworker/0:1]
root 82 0.0 0.0 0 0 ? S< 2015 0:00 [writeback]
root 83 0.0 0.0 0 0 ? S< 2015 0:00 [kintegrityd]
root 84 0.0 0.0 0 0 ? S< 2015 0:00 [bioset]
root 86 0.0 0.0 0 0 ? S< 2015 0:00 [kblockd]
root 88 0.0 0.0 0 0 ? S< 2015 0:00 [ata_sff]
root 89 0.0 0.0 0 0 ? S 2015 0:00 [khubd]
root 90 0.0 0.0 0 0 ? S< 2015 0:00 [md]
root 91 0.0 0.0 0 0 ? S< 2015 0:00 [devfreq_wq]
root 92 0.0 0.0 0 0 ? S 2015 0:12 [kworker/1:1]
root 95 0.0 0.0 0 0 ? S 2015 0:10 [kworker/4:1]
root 97 0.0 0.0 0 0 ? S 2015 0:11 [kworker/6:1]
root 99 0.0 0.0 0 0 ? S 2015 0:00 [khungtaskd]
root 100 0.0 0.0 0 0 ? S 2015 7:26 [kswapd0]
root 101 0.0 0.0 0 0 ? SN 2015 0:00 [ksmd]
root 102 0.0 0.0 0 0 ? SN 2015 0:29 [khugepaged]
root 103 0.0 0.0 0 0 ? S 2015 0:00 [fsnotify_mark]
root 104 0.0 0.0 0 0 ? S 2015 0:00 [ecryptfs-kthrea]
root 105 0.0 0.0 0 0 ? S< 2015 0:00 [crypto]
root 117 0.0 0.0 0 0 ? S< 2015 0:00 [kthrotld]
root 119 0.0 0.0 0 0 ? S 2015 0:00 [scsi_eh_0]
root 120 0.0 0.0 0 0 ? S 2015 0:00 [scsi_eh_1]
root 141 0.0 0.0 0 0 ? S< 2015 0:00 [deferwq]
root 142 0.0 0.0 0 0 ? S< 2015 0:00 [charger_manager]
root 199 0.0 0.0 0 0 ? S< 2015 0:00 [kpsmoused]
root 223 0.0 0.0 0 0 ? S< 2015 0:00 [bioset]
root 265 0.0 0.0 0 0 ? S< 2015 0:00 [raid5wq]
root 291 0.0 0.0 0 0 ? S 2015 0:22 [jbd2/xvda1-8]
root 292 0.0 0.0 0 0 ? S< 2015 0:00 [ext4-rsv-conver]
root 445 0.0 0.0 0 0 ? S 2015 0:16 [jbd2/md0-8]
root 446 0.0 0.0 0 0 ? S< 2015 0:00 [ext4-rsv-conver]
root 516 0.0 0.0 19604 564 ? S 2015 0:00 upstart-udev-bridge --daemon
root 522 0.0 0.0 49864 1048 ? Ss 2015 0:00 /lib/systemd/systemd-udevd --daemon
root 671 0.0 0.0 15256 408 ? S 2015 0:00 upstart-socket-bridge --daemon
root 800 0.0 0.0 10220 2900 ? Ss 2015 0:00 dhclient -1 -v -pf /run/dhclient.eth0.pid -l
message+ 1048 0.0 0.0 39224 1048 ? Ss 2015 0:00 dbus-daemon --system --fork
root 1077 0.0 0.0 0 0 ? S Jan04 0:00 [kworker/u30:2]
root 1082 0.0 0.0 43448 1196 ? Ss 2015 0:00 /lib/systemd/systemd-logind
root 1116 0.0 0.0 15272 512 ? S 2015 0:00 upstart-file-bridge --daemon
root 1339 0.0 0.0 14536 412 tty4 Ss+ 2015 0:00 /sbin/getty -8 38400 tty4
root 1344 0.0 0.0 14536 416 tty5 Ss+ 2015 0:00 /sbin/getty -8 38400 tty5
root 1360 0.0 0.0 14536 408 tty2 Ss+ 2015 0:00 /sbin/getty -8 38400 tty2
root 1361 0.0 0.0 14536 416 tty3 Ss+ 2015 0:00 /sbin/getty -8 38400 tty3
root 1363 0.0 0.0 14536 404 tty6 Ss+ 2015 0:00 /sbin/getty -8 38400 tty6
root 1418 0.0 0.0 61364 1296 ? Ss 2015 0:07 /usr/sbin/sshd -D
root 1432 0.0 0.0 23652 552 ? Ss 2015 0:02 cron
daemon 1433 0.0 0.0 19136 180 ? Ss 2015 0:00 atd
root 1461 0.0 0.0 19316 644 ? Ss 2015 1:57 /usr/sbin/irqbalance
root 1518 0.0 0.0 4364 404 ? Ss 2015 0:00 acpid -c /etc/acpi/events -s /var/run/acpid.
root 1521 0.0 0.0 0 0 ? S 2015 0:00 [kworker/5:1]
root 1641 0.0 0.0 0 0 ? S Jan04 0:00 [kworker/u30:1]
root 1863 0.0 0.0 14536 404 tty1 Ss+ 2015 0:00 /sbin/getty -8 38400 tty1
root 1864 0.0 0.0 12784 388 ttyS0 Ss+ 2015 0:00 /sbin/getty -8 38400 ttyS0
ntp 2075 0.0 0.0 31448 1252 ? Ss 2015 1:17 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 10
root 2087 0.0 0.0 0 0 ? S 2015 0:00 [kauditd]
ubuntu 2393 0.0 0.0 105628 2028 ? S Jan04 0:00 sshd: ubuntu@notty
root 2496 0.0 0.0 0 0 ? S Jan04 0:00 [kworker/2:2]
root 2713 0.0 0.0 0 0 ? S 2015 0:00 [kworker/6:2]
root 2722 0.0 0.0 0 0 ? S 2015 0:12 [kworker/5:2]
root 3678 0.0 0.0 0 0 ? S Jan05 0:01 [kworker/0:0]
root 3716 0.0 0.0 0 0 ? S Jan05 0:00 [kworker/3:0]
root 3941 0.0 0.0 0 0 ? S Jan05 0:00 [kworker/2:0]
root 4732 0.0 0.0 0 0 ? S Jan05 0:00 [kworker/1:2]
root 6896 0.0 0.0 105628 4228 ? Ss 08:00 0:00 sshd: ubuntu [priv]
ubuntu 7008 0.0 0.0 105628 1876 ? S 08:00 0:00 sshd: ubuntu@pts/0
ubuntu 7014 0.0 0.0 21308 3908 pts/0 Ss 08:00 0:00 -bash
root 7234 0.0 0.0 63668 2096 pts/0 S 08:10 0:00 sudo su
root 7235 0.0 0.0 63248 1776 pts/0 S 08:10 0:00 su
root 7236 1.0 0.0 21088 3456 pts/0 S 08:10 0:00 bash
root 7248 0.0 0.0 17164 1320 pts/0 R+ 08:10 0:00 ps aux
root 13299 0.0 0.0 0 0 ? S 2015 0:19 [kworker/3:2]
root 19933 0.0 0.0 0 0 ? S 2015 0:00 [kworker/7:1]
root 20305 0.0 0.0 0 0 ? S 2015 0:00 [kworker/4:2]
root 29814 0.0 0.0 0 0 ? S< Jan04 0:00 [kworker/u31:2]
root 30693 0.0 0.0 0 0 ? S< Jan04 0:00 [kworker/u31:1]
dmesg | fgrep Memory:
ditunjukkan?df -h --type=tmpfs
.df -h | grep tmpfs
→tmpfs 1.5G 364K 1.5G 1% /run
. Adapun dmesg:dmesg | fgrep 'Memory:'
→[ 0.000000] Memory: 15389980K/15728244K available (7373K kernel code, 1144K rwdata, 3404K rodata, 1336K init, 1440K bss, 338264K reserved)
Jawaban:
Kebocoran ingatan bisa sangat menyakitkan dan cukup membuat frustrasi untuk dilacak pada sistem skala besar. Saya akan mencoba menyalin seluruh server ke lingkungan pengujian, memulai layanan satu per satu untuk mengisolasi masalah.
Setelah Anda memeriksa masing-masing layanan (proses mode pengguna) secara terpisah dan individual dan masih tidak dapat menemukan sumber kebocoran, Anda harus memeriksa kernel. Berurusan dengan kernel membutuhkan waktu dan tangan yang berpengalaman, saya akan merekomendasikan berkonsultasi dengan spesialis kernel.
Kemungkinan lain adalah adanya malware. Berurusan dengan malware adalah opera yang sama sekali berbeda.
Terkadang tidak ada jalan pintas: \
sumber