Tools and techniques for performance tuning in Linux
Tuning Toolbox
Tune up your systems and search out bottlenecks with these handy performance tools.
By Tim Chen, Alex Shi, and Yanmin Zhang
Over the past several years, the Linux Kernel Performance Project [1] has tracked the performance of Linux
and tuned it for throughput and power efficiency on Intel platforms. This experience has given us some
insights into the best tools and techniques for tuning Linux systems. In this article, we describe some of our
favorite Linux performance utilities and provide a real-world example that shows how the Kernel
Performance Project uses these tools to hunt down and solve a real Linux performance issue.
Finding Bottlenecks
The first task in performance tuning is to identify any bottlenecks that might be slowing down system
performance.
The most common bottlenecks occur in I/O, memory management, or the scheduler. Linux offers a suite of
tools for examining system use and searching out bottlenecks. Some tools reveal the general health of the
system, and other tools offer information about specific system components.
The vmstat utility offers a useful summary of overall system performance. Listing 1 shows vmstat data collected every two seconds for a CPU-intensive, multi-threaded Java workload. The first two columns (r, b) describe how many processes in the system can run if a CPU is available and how many are blocked. The presence of both blocked processes and idle time in the system is usually a sign of trouble.
Listing 1:
vmstat Output
01 #vmstat
02 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
03 r b swpd free buff cache si so bi bo in cs us sy id wa
04 7 0 34328 757464 2712 26416 0 0 0 0 12 616773 34 28 37 0
The next four columns under memory show how much memory space is used. Frequently swapping memory in and out of the disk swap space slows the system. The cache column gives the amount of memory used as a page cache; a bigger cache means more files cached in memory. The two columns under io, bi and bo, indicate the number of blocks received from and sent to block devices, respectively, which gives an idea of the level of disk activity. The two columns under system, in and cs, reveal the number of interrupts and context switches.
If the interrupt rate is too high, you can use an interrupt utility, like sar, to help uncover the cause. The command sar -I XALL 10 1000 will break down the sources of the interrupts every 10 seconds, for 1000 reports. A high number of context switches relative to the number of processes is undesirable because context switches flush cached data.
The next four columns in Listing 1, us, sy, id, and wa, indicate the percentage of time the CPU(s) spent in userspace applications, in the kernel, idle, or waiting for I/O, respectively. This output shows whether the CPUs are doing useful work or whether they are just idling or blocked. A high percentage of time spent in the OS could indicate a non-optimal system call. Idle time on a fully loaded system could point to lock contention.
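The snapshot in Listing 1 was taken at a two-second interval. To watch these columns evolve while a workload runs, give vmstat an interval in seconds and, optionally, a sample count (a minimal sketch; the exact column layout varies slightly between procps versions):
#vmstat 2
#vmstat 2 30
The first form prints a new line every two seconds until interrupted; the second stops after 30 samples.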
Disk Performance
Hdparm is a good tool for determining whether the disks are healthy and properly configured:
# hdparm -tT /dev/sda
/dev/sda:
Timing buffered disk reads: 184 MB in 3.02 seconds = 60.88 MB/sec
Timing cached reads: 11724 MB in 2.00 seconds = 5870.80 MB/sec
The preceding command displays the speed of reading through the buffer cache to the disk, with and without
any prior caching of data. The uncached speed should be somewhat close to the raw speed of the disk. If this
value is too low, you should check in your BIOS to see whether the disk mode is configured properly. You can also check the hard disk parameter settings for an IDE disk
# hdparm -I /dev/hda
or for a SCSI disk:
# sdparm /dev/sda
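If the buffered read rate is far below the drive's rated speed, it is also worth confirming that DMA and the drive's write cache are enabled. The following is a minimal sketch; the device names /dev/hda and /dev/sda are just examples, so adjust them for your system:
# hdparm -d /dev/hda
# hdparm -d1 /dev/hda
# hdparm -W /dev/sda
The first command reports whether DMA (using_dma) is enabled on an IDE drive, the second switches it on, and the third shows whether write caching is active.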
To study the health of a run-time workload's I/O, use iostat. For example, Listing 2 shows how to use iostat to dump extended statistics for a workload. If %iowait is high, the CPUs are idle while waiting for outstanding disk I/O requests to complete. In that case, try modifying the workload to use asynchronous I/O or dedicate a thread to file I/O so that workload execution doesn't stop.
Listing 2:
iostat
01 #iostat -x sda 1
02 avg-cpu: %user %nice %system %iowait %steal %idle
03 0.00 0.00 2.16 20.86 0.00 76.98
04
05 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
06 sda 17184.16 0.00 1222.77 0.00 147271.29 0.00 120.44 3.08 2.52 0.81 99.01
The other parameter to check is the number of queued I/O requests, avgqu-sz. This value should be less than 1, or disk I/O will significantly slow things down. The %util parameter indicates the percentage of time the disk has outstanding requests and is a good indication of how busy the disk is.
CPU Cycles
One important way to identify a performance problem is to determine how the system is spending its CPU
cycles. The oprofile utility can help you study the CPU to this end. Oprofile is usually enabled by default. If you compile your own kernel, you need to make sure that the kernel config options CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y are turned on.
The easiest way to invoke oprofile is with the oprofile GUI that wraps the command-line options. To do so,
use oprofile 0.9.3 or later for an Intel Core 2 processor and install the oprofile-gui package. Now invoke
#oprof_start
to bring up the Start profiler screen with Setup and Configuration tabs (Figure 1). First, select the Configuration tab. If you want to profile the kernel, enter the location of the kernel image file (that is, the uncompressed vmlinux file if you compile the kernel from source). Now return to the Setup tab.
Figure 1:
Profiling the kernel with oprofile.
In the Events table, select the CPU_CLK_UNHALTED event and the unit mask Unhalted core cycles. Note: Normally, you do not need to sample the system any more often than the setting listed in the Count field. A lower count means that fewer events need to happen before a sample is taken, thus increasing the sampling frequency. Now run the application you want to profile, and start oprofile by clicking the Start button. When the application has stopped running, click the Stop button.
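If you prefer the command line, opcontrol, which the GUI wraps, can drive the same session. The following is a sketch assuming oprofile 0.9.x; /path/to/vmlinux and ./my_workload are placeholders for your uncompressed kernel image and the application under test:
#opcontrol --vmlinux=/path/to/vmlinux --event=CPU_CLK_UNHALTED:1200000
#opcontrol --reset
#opcontrol --start
#./my_workload
#opcontrol --stop
#opcontrol --shutdown
The count of 1200000 matches the sampling interval shown in the header of Listing 3; opcontrol --list-events shows the events and default counts available on your CPU.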
To view the profile data, invoke:
#opreport -l
The output for this command is shown in Listing 3.
Listing 3 shows the percentage of CPU time spent in each application or in the kernel, and it also shows the functions that are being executed. This report reveals the code the system is spending the most time in, which makes it a good basis for deciding what to optimize.
Listing 3:
Viewing Profile Data with oprofile
01 CPU: Core 2, speed 2400 MHz (estimated)
02 Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
03 mask of 0x00 (Unhalted core cycles) count 1200000
04 samples % app name symbol name
05 295397 63.6911 cc1 (no symbols)
06 22861 4.9291 vmlinux-2.6.25-rc9 clear_page_c
07 11382 2.4541 libc-2.5.so memset
08 10959 2.3629 genksyms yylex
09 9256 1.9957 libc-2.5.so _int_malloc
10 6076 1.3101 vmlinux-2.6.25-rc9 page_fault
11 5378 1.1596 libc-2.5.so memcpy
12 5178 1.1164 vmlinux-2.6.25-rc9 handle_mm_fault
13 3857 0.8316 genksyms yyparse
14 3822 0.8241 libc-2.5.so strlen
15 ... ...
If you have collected call graph information, type the command
#opreport -c
to obtain the output shown in Listing 4. Listing 4 shows that this workload has some very heavy memory allocation activity associated with getting free memory pages and clearing them.
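Note that opreport -c only has data to show if call graph collection was enabled before the run. With the command-line interface, a hedged sketch would be to set opcontrol's --callgraph option before starting the profiler; the depth of 6 is just an example:
#opcontrol --callgraph=6
#opcontrol --start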
Listing 4:
opreport Output
01 CPU: Core 2, speed 2400 MHz (estimated)
02 Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 1200000
03 samples % image name app name symbol name
04 -------------------------------------------------------------------------------
05 295397 63.6911 cc1 cc1 (no symbols)
06 295397 100.000 cc1 cc1 (no symbols) [self]
07 -------------------------------------------------------------------------------
08 1 0.0044 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 path_walk
09 2 0.0087 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 __alloc_pages
10 2 0.0087 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 mntput_no_expire
11 22922 99.9782 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 get_page_from_freelist
12 22861 4.9291 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 clear_page_c
13 22861 99.7121 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 clear_page_c [self]
14 36 0.1570 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 apic_timer_interrupt
15 24 0.1047 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 ret_from_intr
16 3 0.0131 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 smp_apic_timer_interrupt
17 2 0.0087 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 mntput_no_expire
18 1 0.0044 vmlinux-2.6.25-rc9 vmlinux-2.6.25-rc9 __link_path_walk
19 -------------------------------------------------------------------------------
20 11382 2.4541 libc-2.5.so libc-2.5.so memset
21 11382 100.000 libc-2.5.so libc-2.5.so memset [self]
22 -------------------------------------------------------------------------------
23 10959 2.3629 genksyms genksyms yylex
24 10959 100.000 genksyms genksyms yylex [self]
25 ... ...
Too Many Cache Misses?
The performance of the system is highly dependent on the effectiveness of the cache. Any cache miss will
degrade performance and lead to a CPU stall.
Sometimes cache misses are caused by frequently used fields located in data structures that span cache lines. Oprofile can diagnose this kind of problem.
Again, using the Intel Core 2 processor as an example, choose the LLC_MISSES event to profile the cache requests that miss the L2 cache. For the exact event to use, you should invoke opcontrol --list-events to read about the details of each event type available for your CPU.
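On the command line, switching the profiler to a cache-miss event is simply a matter of re-running the setup with a different event specification. The following sketch reuses the count of 90050 from Listing 5; the event name, count, and unit mask that make sense depend on your CPU, so check opcontrol --list-events first:
#opcontrol --shutdown
#opcontrol --event=LLC_MISSES:90050
#opcontrol --start
The rest of the session (running the workload, stopping the profiler, and viewing the report with opreport -l) proceeds exactly as before.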
Listing 5 shows how to call up a cache miss profile.
Oprofile is a very versatile tool. By carefully choosing which events to monitor, you can zero in on the CPU
operation that is causing the problem.
Listing 5:
Cache Miss Profile
01 #opreport -l
02 CPU: Core 2, speed 1801 MHz (estimated)
03 Counted L2_RQSTS events (number of L2 cache requests) with a unit mask of 0x41
04 (multiple flags) count 90050
05 samples % app name symbol name
06 2803 63.4163 cc1 (no symbols)
07 190 4.2986 vmlinux-2.6.25-rc9-ltop get_page_from_freelist
08 102 2.3077 as (no symbols)
09 60 1.3575 vmlinux-2.6.25-rc9-ltop __lock_acquire
10 53 1.1991 libc-2.7.so strcmp
11 39 0.8824 vmlinux-2.6.25-rc9-ltop unmap_vmas
12 38 0.8597 vmlinux-2.6.25-rc9-ltop list_del
Locking Problems
A high context switching rate, relative to the number of running processes, is undesirable and could indicate a
lock contention problem. To determine the most contended locks, enable the lock statistics in the kernel,
which will give you insight into what is causing the contention. To do so, use the lock_stat feature in 2.6.23 or later kernels. First, you'll need to recompile the kernel with the CONFIG_LOCK_STAT=y option.
Then, before running the workloads, clear the statistics with:
#echo 0 > /proc/lock_stat
After running the workload, review the lock statistics with the following command:
#cat /proc/lock_stat
The output of the preceding command is a list of locks in the kernel sorted by the number of contentions. For
each lock, you will see the number of contentions, as well as the minimum, maximum, and cumulative wait times for a contention. In addition, you will see the number of acquisitions, as well as the minimum, maximum, and cumulative hold times for the lock. The top call sites of the lock are also given, so you can quickly locate where in the kernel the contention occurs.
It is worth noting that the lock statistics infrastructure incurs overhead. Once you have finished hunting for
locks, you should disable this feature to maximize performance.
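Putting these steps together, a typical lock_stat session looks something like the following, where ./my_workload is a placeholder for whatever you are measuring; because the entries are sorted by contention count, the worst offenders appear at the top of the file:
#echo 0 > /proc/lock_stat
#./my_workload
#head -50 /proc/lock_stat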
Excessive Latency
Program throughput that is inconsistent and sputters, applications that seem to go to sleep before coming alive, and a lot of processes under the blocked column in vmstat are often signs of latency in the system.
LatencyTOP is a new tool that helps diagnose latency issues.
Starting with the 2.6.25 kernel, you can compile LatencyTOP support into the kernel by enabling the CONFIG_HAVE_LATENCYTOP_SUPPORT=y and CONFIG_LATENCYTOP=y options in the kernel configuration. After booting up the kernel with LatencyTOP capability, you can trace latency in the workload with a userspace latency tracing tool from the LatencyTOP website [2]. To start, compile the tool, do a make install of the LatencyTOP program, and run the following as root:
#./latencytop
The LatencyTOP program's top screen (Figure 2) provides a periodic dump of the top causes that lead to
processes being blocked, sorted by the maximum blocked time for each cause. Also, you'll find information
on the percentage of time a particular cause contributed to the total blocked time. The bottom screen provides
similar information on a per-process basis.
Figure 2:
Studying system latency with LatencyTOP.
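Before building the userspace tool, it is worth confirming that the running kernel was compiled with the required options. One quick, if not universal, check is to grep the kernel config file that many distributions install under /boot:
#grep LATENCYTOP /boot/config-$(uname -r)
If both options show up as =y, the ./latencytop command above should start reporting data.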
An Example
Linux provides quick allocation and deallocation of frequently used objects in caches called "slabs." To provide better performance, Christoph Lameter introduced a new slab manager called SLUB.