How to Quickly Analyze Performance Problems on a Linux Server

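When a system performance problem comes up, how do you use the first 60 seconds after logging in to get a quick overview of how the system is doing? The checklist consists of the 10 commands below, a very useful and effective list of tools. This article explains these commands and their options in detail, along with how they are used in practice. It also uses a system with a real problem to check whether this routine actually works; all screen output shown below comes from that system.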

# System load overview
uptime

# System log
dmesg | tail

# CPU
vmstat 1
mpstat -P ALL 1
pidstat 1

# Disk
iostat -xz 1

# Memory
free -m

# Network
sar -n DEV 1
sar -n TCP,ETCP 1

# System overview
top

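All of the tools above rely on statistics that the kernel exposes to user space and present them as counters, which makes them well suited to quick triage. Tracing applications and the system in more depth requires tools such as strace and systemtap, which are outside the scope of this article.

Note:

The grouping above reflects only each tool's default options. For example, pidstat shows per-process CPU statistics by default, but with -d it shows per-process I/O statistics. Likewise vmstat, despite being named as a virtual memory tool, by default shows load, memory, I/O, system and CPU information. Some of these tools require the sysstat package.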

1. uptime

[root@nginx1 ~]# uptime
 15:38:10 up 43 days, 3:54, 1 user, load average: 1.13, 0.41, 0.18

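uptime is a quick way to see the load average. On Linux, the load average counts processes in both the runnable and the uninterruptible state. Runnable processes are those running on a CPU plus those that are ready to run and waiting for CPU time; uninterruptible processes are waiting on I/O, for example waiting for a disk to respond. The load average is not normalized by the number of CPUs, so a load average of 1 means a single-CPU system was fully loaded over the corresponding period (1, 5 or 15 minutes), whereas on a 4-CPU system a load average of 1 means the system was idle 75% of the time.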
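Because the load average is not normalized, it can help to divide it by the CPU count before judging it. The following is a minimal sketch of that arithmetic; the function name `load_per_cpu` is made up for illustration, and the sample string is simply the three numbers from the uptime output above (on a live system they are the first three fields of /proc/loadavg).

```python
def load_per_cpu(loadavg_text, cpu_count):
    """Divide the 1-, 5- and 15-minute load averages by the CPU count."""
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    return tuple(round(v / cpu_count, 3) for v in (one, five, fifteen))


# The three load averages from the uptime output above; the example
# server has 2 CPUs.
sample = "1.13 0.41 0.18"
print(load_per_cpu(sample, cpu_count=2))

# A load average of 1 on 4 CPUs is 0.25 per CPU, i.e. roughly 75% idle.
print(load_per_cpu("1.00 1.00 1.00", cpu_count=4))
```

A per-CPU value close to or above 1 suggests the CPUs had no spare capacity during that window, though on Linux part of the load may be processes waiting on I/O rather than on a CPU.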

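The load average gives a high-level view of load, but you may need other tools to learn more. For example, the current number of runnable and uninterruptible processes can be seen with vmstat, described below. The 1-, 5- and 15-minute averages together also show how the load is changing. For example, if you are checking a problem server and the 1-minute average is already far below the 15-minute average, you may have logged in too late and missed the incident. The load average is also shown by top and w.

In the example above, the load over the last minute is much higher than over the last 15 minutes. (Because this is a test system, 1.13 can be read as clearly higher than 0.18; on a production system a difference like this would not mean much.)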
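The comparison between the 1-minute and 15-minute averages can be written down as a small rule. This is only a sketch: the function name `load_trend` and the factor of 2 used as a threshold are assumptions chosen for illustration, not standard values.

```python
def load_trend(one_min, fifteen_min, ratio=2.0):
    """Classify the load trend from the 1- and 15-minute averages.

    `ratio` is an arbitrary illustrative threshold, not a standard value.
    """
    if one_min > fifteen_min * ratio:
        return "rising"      # load increased recently
    if one_min * ratio < fifteen_min:
        return "falling"     # the busy period may already be over
    return "steady"


# Values from the uptime output above.
print(load_trend(1.13, 0.18))
```

A "falling" result corresponds to the case described above, where logging in late means the incident may already have passed.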

2. dmesg | tail

[root@nginx1 ~]# dmesg | tail
[3128052.929139] device eth0 left promiscuous mode
[3128104.794514] device eth0 entered promiscuous mode
[3128526.750271] device eth0 left promiscuous mode
[3537292.096991] device eth0 entered promiscuous mode
[3537295.941952] device eth0 left promiscuous mode
[3537306.450497] device eth0 entered promiscuous mode
[3537307.884028] device eth0 left promiscuous mode
[3668025.020351] bash (8290): drop_caches: 1
[3674191.126305] bash (8290): drop_caches: 2
[3675304.139734] bash (8290): drop_caches: 1

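dmesg shows the system messages held in the kernel ring buffer. Checking /var/log/messages may also reveal some system-level problems on the server.

The dmesg output in the example above contains no errors that deserve particular attention.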

3. vmstat 1

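About vmstat:

vmstat is short for virtual memory statistics. It reports information about processes, memory, paging, block IO, traps, disks and CPU activity. Its syntax is vmstat [options] [delay [count]]; the 1 in the command above is the delay in seconds. The first line of output shows averages since boot, and each following line shows a sample taken over the delay interval, that is, current values.

Meaning of the columns:

Procs (processes)

r: The number of runnable processes (running or waiting for run time).
b: The number of processes in uninterruptible sleep.

Note: r is the total number of processes running on a CPU or ready and waiting to run. Compared with the load average, it is a better indicator of CPU saturation because it does not include I/O. If r is greater than the number of CPUs, the CPUs are saturated.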

Memory

swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.

Swap

si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).

Note: these are the amounts of memory swapped in and swapped out. Non-zero values mean the system has run out of main memory.

IO

bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).

System (interrupts and context switches)

in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.

CPU

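These are percentages of total CPU time.

us: Time spent running non-kernel code. (user time, including nice time)
sy: Time spent running kernel code. (system time)
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.

The sum of user and system time shows whether the CPUs are busy. If I/O wait stays at a sustained level, the disk is the bottleneck; the CPUs are "idle" in that case because tasks are blocked waiting for disk I/O. I/O wait can be seen as another form of CPU idle, one that tells you the CPUs are idle because they are waiting for disk I/O to complete.

Handling I/O costs system time: before an I/O is submitted to the disk driver it may be remapped, split and merged, and the I/O scheduler places it on the request queue. If average system time is high while handling I/O, above 20%, it is worth investigating whether the kernel is handling I/O inefficiently.

User-space CPU utilization near 100% does not necessarily indicate a problem; check the r column to see how saturated the CPUs are.

The vmstat output from the example system shows a clear CPU problem: user plus system CPU stays at around 50%, and system time accounts for most of it.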
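The rules of thumb in this section can be applied mechanically to one line of vmstat output. The sketch below assumes the classic 17-column layout (`r b swpd free buff cache si so bi bo in cs us sy id wa st`); the function name `check_vmstat_row` is made up, and the sample row is hypothetical, with its CPU columns chosen to resemble the example system rather than copied from real output.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def check_vmstat_row(row, cpu_count):
    """Apply this section's rules of thumb to one vmstat data row."""
    values = dict(zip(VMSTAT_COLUMNS, (int(v) for v in row.split())))
    findings = []
    if values["r"] > cpu_count:
        findings.append("run queue exceeds CPU count (CPU saturation)")
    if values["si"] or values["so"]:
        findings.append("swapping in progress (main memory exhausted)")
    if values["sy"] > 20:
        findings.append("system time above 20%")
    return findings


# Hypothetical row: 1 runnable process, no swapping, 6% user, 44% system.
sample_row = "1 0 0 219000 0 7650000 0 0 0 0 1100 2000 6 44 50 0 0"
print(check_vmstat_row(sample_row, cpu_count=2))
```

Remember that the first row vmstat prints is an average since boot, so a check like this is more meaningful on the second and later rows.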

4. mpstat -P ALL 1

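mpstat prints a per-CPU breakdown, which is useful for spotting imbalance between CPUs.

The output from the example system confirms what vmstat showed. The server has 2 CPUs: CPU 1 stays at 100% utilization while CPU 0 carries almost no load, and CPU 1's time is spent mostly in kernel space rather than user space.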
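The per-CPU figures that mpstat prints come from counters in /proc/stat, so the imbalance check can be sketched by differencing two snapshots of that file. In the sketch below the function name `per_cpu_busy` is made up and both snapshots are hypothetical text shaped to resemble the example system (one busy CPU, one idle CPU); the field order (user, nice, system, idle, iowait, irq, softirq, steal, ...) is the documented /proc/stat layout.

```python
def per_cpu_busy(before, after):
    """Return {cpu_name: busy percent} between two /proc/stat snapshots."""
    def parse(text):
        rows = {}
        for line in text.splitlines():
            fields = line.split()
            # Per-CPU rows are cpu0, cpu1, ...; skip the aggregate "cpu" row.
            if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
                rows[fields[0]] = [int(x) for x in fields[1:]]
        return rows

    first, second = parse(before), parse(after)
    busy = {}
    for name, new in second.items():
        delta = [n - o for n, o in zip(new, first[name])]
        total = sum(delta[:8])          # user .. steal; guest is part of user
        idle = delta[3] + delta[4]      # idle + iowait
        busy[name] = round(100.0 * (total - idle) / total, 1) if total else 0.0
    return busy


# Hypothetical snapshots taken one second apart.
snap1 = """cpu  2200 0 1400 15000 22 0 13 0 0 0
cpu0 1000 0 500 8000 10 0 5 0 0 0
cpu1 1200 0 900 7000 12 0 8 0 0 0"""
snap2 = """cpu  2214 0 1489 15097 22 0 13 0 0 0
cpu0 1002 0 501 8097 10 0 5 0 0 0
cpu1 1212 0 988 7000 12 0 8 0 0 0"""

usage = per_cpu_busy(snap1, snap2)
print(usage)
print("spread:", max(usage.values()) - min(usage.values()))
```

A large spread between the busiest and the idlest CPU is the kind of imbalance described above, often caused by a single-threaded process.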

5. pidstat 1

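By default pidstat prints per-process statistics much like top, but as rolling output rather than clearing the screen as top does. -p prints information for a given process, and -p ALL prints information for all processes. With no process specified, the default is equivalent to -p ALL, except that only active processes (those with non-zero statistics) are printed.

pidstat can report more than per-process CPU usage, including memory and I/O. The following options are the most useful:

pidstat -d 1: shows which processes are reading and writing.
pidstat -r 1: shows page faults and memory usage per process. Processes with no page faults are not printed by default; use -p with a process ID to see their memory usage.
pidstat -t: shows thread information, a quick way to see how a process relates to its threads.
pidstat -w: shows context switches per process. Its output columns are:
cswch/s: voluntary context switches per second. A voluntary context switch happens when a process blocks on a resource that is not available and gives up the CPU.
nvcswch/s: non-voluntary context switches per second. A non-voluntary context switch happens when a process has used up its CPU time slice and is forcibly descheduled.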
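The two context-switch counters are also visible per process in /proc/<pid>/status, as the fields voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The sketch below turns two readings of that file into per-second rates in the spirit of cswch/s and nvcswch/s; the function names and the sample status text are hypothetical.

```python
def context_switches(status_text):
    """Extract the cumulative context-switch counters from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            counts[key] = int(value)
    return counts


def switch_rates(before_text, after_text, seconds):
    """Per-second rates between two readings taken `seconds` apart."""
    before, after = context_switches(before_text), context_switches(after_text)
    return {key: (after[key] - before[key]) / seconds for key in after}


# Hypothetical readings for one process, taken 2 seconds apart.
reading1 = "Name:\tnc\nState:\tR (running)\nvoluntary_ctxt_switches:\t150\nnonvoluntary_ctxt_switches:\t4200\n"
reading2 = "Name:\tnc\nState:\tR (running)\nvoluntary_ctxt_switches:\t152\nnonvoluntary_ctxt_switches:\t4500\n"
print(switch_rates(reading1, reading2, seconds=2))
```

A process dominated by non-voluntary switches is typically CPU-bound and being preempted, while one dominated by voluntary switches spends its time blocking on resources.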

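In the example system it is clear that the nc process is consuming 100% of CPU 1. Few processes use CPU on this test system, so the culprit is obvious at a glance; on a production system pidstat would show many more processes consuming CPU.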

6. iostat -zx 1

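iostat shows the load and performance of block devices (here, disks). The main metrics to look at are:

r/s, w/s, rkB/s, wkB/s: read requests completed per second (after merges), write requests completed per second (after merges), kilobytes read per second and kilobytes written per second. These show the load on the disk; a performance problem may simply be the result of too much load on the disk.
await: the average time per I/O, in milliseconds. It includes not only the time the device spends servicing the I/O but also the time spent waiting in kernel queues. The kernel statistics that iostat reads do not reveal the exact time the block device takes to service an I/O request; measuring that requires a tracing tool such as blktrace. In blktrace terms, the D2C interval is the hardware device's service time, and Q2C is the total time taken by the I/O request, which corresponds to iostat's await.
avgqu-sz: the average number of I/O requests in the queue (better understood as the average number of outstanding I/O requests). A value above 1 suggests the device is heading toward saturation, although devices can service requests in parallel, especially a virtual device that fronts multiple back-end disks.
%util: the percentage of time the device was busy handling I/O. It measures how often the device had any I/O in flight (was not idle), regardless of how much. Values above 60% can typically lead to performance problems (check await to confirm), and values near 100% usually indicate saturation.

If the storage device is a logical disk backed by multiple physical disks, 100% utilization may only mean that some I/O was being processed 100% of the time; the back-end disks are not necessarily saturated. Also note that poor disk I/O performance does not necessarily cause an application problem: many techniques use asynchronous I/O, so the application is not necessarily blocked or directly exposed to the latency.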
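The thresholds mentioned for avgqu-sz and %util can be collected into a small check. This is a sketch of the article's heuristics only: the function name `disk_flags` is made up, the 95% cut-off used for "near 100%" is an assumption, and the input values are hypothetical rather than taken from real iostat output.

```python
def disk_flags(avgqu_sz, util_pct, near_full=95.0):
    """Flag a device using the rules of thumb for avgqu-sz and %util.

    `near_full` is an assumed cut-off for "close to 100%".
    """
    flags = []
    if avgqu_sz > 1:
        flags.append("queue above 1: trending toward saturation")
    if util_pct >= near_full:
        flags.append("utilization near 100%: likely saturated")
    elif util_pct >= 60:
        flags.append("utilization above 60%: check await")
    return flags


print(disk_flags(avgqu_sz=0.01, util_pct=0.5))   # a nearly idle disk
print(disk_flags(avgqu_sz=3.2, util_pct=72.0))   # a hypothetical busy disk
```

For a logical device backed by several disks, these flags are only a hint, for the reasons given above.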

7. free -m

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7822         129         214           0        7478        7371
Swap:             0           0           0

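free shows memory usage. The second-to-last column, buff/cache, combines:

buffers: the buffer cache, used for block device I/O.
cached: the page cache, used by file systems.

Linux uses free memory for caching, and these caches can be reclaimed when applications need the memory. For example, the kswapd kernel thread may reclaim cache when it reclaims pages, and writing to /proc/sys/vm/drop_caches manually also causes cache to be reclaimed.

In the example above only 214 MB of memory is free and most memory is used by cache, but this is not a problem for the system.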
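Since cache can be reclaimed, the available column is a better guide than free. The sketch below parses the free -m output shown in this section and compares the two; the function name `parse_free_mem` is made up for illustration.

```python
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:           7822         129         214           0        7478        7371
Swap:             0           0           0
"""


def parse_free_mem(text):
    """Return the Mem: row of `free -m` output as {column: value in MB}."""
    lines = text.splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(v) for v in line.split()[1:]]
            return dict(zip(header, values))
    raise ValueError("no Mem: row found")


mem = parse_free_mem(FREE_OUTPUT)
print(mem["free"], mem["available"])                # 214 7371
print(round(100 * mem["available"] / mem["total"]))  # 94
```

Only about 3% of memory is strictly free here, yet roughly 94% is available to applications, which is why the low free figure is not a concern.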

8. sar -n DEV 1

The output columns mean the following:

rxpck/s: Total number of packets received per second.
txpck/s: Total number of packets transmitted per second.
rxkB/s: Total number of kilobytes received per second.
txkB/s: Total number of kilobytes transmitted per second.
rxcmp/s: Number of compressed packets received per second (for cslip etc.).
txcmp/s: Number of compressed packets transmitted per second.
rxmcst/s: Number of multicast packets received per second.
%ifutil: Utilization percentage of the network interface. For half-duplex interfaces, utilization is calculated using the sum of rxkB/s and txkB/s as a percentage of the interface speed. For full-duplex, this is the greater of rxkB/s or txkB/s.

This tool shows network interface throughput. rxkB/s and txkB/s in particular show the network load and whether an interface is approaching its limit.
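The %ifutil definition above can be worked through as a small calculation. This is a sketch under stated assumptions, not sar's actual implementation: it treats one kilobyte as 1024 bytes and the interface speed as megabits per second (10^6 bits), and the function name `ifutil_percent` and the traffic figures are hypothetical.

```python
def ifutil_percent(rx_kb_s, tx_kb_s, speed_mbps, full_duplex=True):
    """Interface utilization as described for %ifutil.

    Assumes 1 kB = 1024 bytes and speed_mbps in units of 10**6 bits/s.
    """
    # Full duplex: the busier direction; half duplex: both directions share the link.
    kb_s = max(rx_kb_s, tx_kb_s) if full_duplex else rx_kb_s + tx_kb_s
    bits_per_second = kb_s * 1024 * 8
    return 100.0 * bits_per_second / (speed_mbps * 1_000_000)


# Hypothetical traffic on a 100 Mbit/s interface.
print(ifutil_percent(6103.515625, 100.0, speed_mbps=100))
print(ifutil_percent(3051.7578125, 3051.7578125, speed_mbps=100, full_duplex=False))
```

Both calls describe a link carrying 50 Mbit/s in total on its limiting path, so both report the same utilization.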

9. sar -n TCP,ETCP 1

The output columns mean the following:

active/s: The number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state per second [tcpActiveOpens].
passive/s: The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state per second [tcpPassiveOpens].
iseg/s: The total number of segments received per second, including those received in error [tcpInSegs]. This count includes segments received on currently established connections.
oseg/s: The total number of segments sent per second, including those on current connections but excluding those containing only retransmitted octets [tcpOutSegs].
atmptf/s: The number of times per second TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times per second TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state [tcpAttemptFails].
estres/s: The number of times per second TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state [tcpEstabResets].
retrans/s: The total number of segments retransmitted per second - that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs].
isegerr/s: The total number of segments received in error (e.g., bad TCP checksums) per second [tcpInErrs].
orsts/s: The number of TCP segments sent per second containing the RST flag [tcpOutRsts].

Three of these metrics are the most telling: active/s, passive/s and retrans/s.

active/s and passive/s are the number of new TCP connections per second initiated locally and initiated remotely, respectively. They give a rough measure of the load on the server. You can treat active as outbound and passive as inbound, although that is not strictly accurate (consider a localhost-to-localhost connection). retrans indicates a problem with the network or the server: either the network is unreliable, for example an Internet connectivity problem, or the server is overloaded and dropping packets.
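The counters behind these columns are exposed as cumulative totals on the two Tcp: lines of /proc/net/snmp, and a per-second rate is the difference between two readings. The sketch below shows that calculation for the three key metrics; the function names and both sample readings are hypothetical, with the numbers chosen to resemble the example system.

```python
def tcp_counters(snmp_text):
    """Parse the Tcp: header and value lines of /proc/net/snmp into a dict."""
    tcp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    names, values = tcp_lines[0], tcp_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def tcp_rates(before_text, after_text, seconds):
    """Per-second active, passive and retransmit rates between two readings."""
    before, after = tcp_counters(before_text), tcp_counters(after_text)
    fields = {"active/s": "ActiveOpens", "passive/s": "PassiveOpens",
              "retrans/s": "RetransSegs"}
    return {label: (after[name] - before[name]) / seconds
            for label, name in fields.items()}


HEADER = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
          "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
          "InErrs OutRsts InCsumErrors")
# Hypothetical readings taken one second apart.
reading_a = HEADER + "\nTcp: 1 200 120000 -1 120 5400 3 10 8 900000 800000 400 0 25 0"
reading_b = HEADER + "\nTcp: 1 200 120000 -1 120 5415 3 10 8 900090 800080 400 0 25 0"
print(tcp_rates(reading_a, reading_b, seconds=1))
```

A retransmit rate that stays above zero over several intervals is the signal worth following up, as noted above.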

10. top

# top
Tasks: 79 total, 2 running, 77 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.0 us, 44.1 sy, 0.0 ni, 49.6 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 8010456 total, 7326348 free, 132296 used, 551812 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 7625940 avail Mem

  PID USER   PR  NI   VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
 4617 root   20   0  44064 2076 1544 R 100.0  0.0 16:27.23 nc
13634 nginx  20   0 121192 3864 1208 S   0.3  0.0 17:59.85 nginx
    1 root   20   0 125372 3740 2428 S   0.0  0.0  6:11.53 systemd
    2 root   20   0      0    0    0 S   0.0  0.0  0:00.60 kthreadd
    3 root   20   0      0    0    0 S   0.0  0.0  0:17.92 ksoftirqd/0
    5 root    0 -20      0    0    0 S   0.0  0.0  0:00.00 kworker/0:0H
    7 root   rt   0      0    0    0 S   0.0  0.0  0:03.21 migration/0
    8 root   20   0      0    0    0 S   0.0  0.0  0:00.00 rcu_bh
    9 root   20   0      0    0    0 S   0.0  0.0 31:47.62 rcu_sched
   10 root   rt   0      0    0    0 S   0.0  0.0  0:10.00 watchdog/0

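top is a commonly used command that covers many metrics. Its drawback is that it has no rolling output, so for a problem that cannot be reproduced it is hard to keep a record of what happened. For keeping a record, tools with rolling output such as vmstat or pidstat are a better choice.

What was the problem in the example?

Working through the tools above, we can reach the following conclusions in a very short time:

There are 2 CPUs; the nc process consumes 100% of CPU 1, and that time is spent in kernel (system) mode. Other processes use almost no CPU.
Free memory is low because most memory is in cache (not a problem).
Disk I/O is very low, with fewer than 1 read or write request on average.
Received traffic is in the single-digit KB/s range, and about 15 passively opened TCP connections are established per second; nothing is clearly abnormal.

The whole process narrowed the problem down to a single process and ruled out some possibilities (disk I/O and memory). The next step is to investigate at the process level, which is outside the scope of this article and will be demonstrated another time.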
