Where to set readahead: LVM, RAID devices, device-mapper, block devices?

You want to tune the performance of your disk reads by setting readahead, and you find that your server has several layers of devices: block devices, RAID devices, then LVM on top of device-mapper, and so on.

You can set readahead on any of them. Which one is the right one?

I came across this Server Fault question: https://serverfault.com/questions/418352/readahead-settings-for-lvm-device-mapper-software-raid-and-block-devices-wha

So I decided to run some tests to check what wojciechz was saying there. He is right; let me show you.

My setup is a server with software RAID10 (md) and LVM on top of it, with /db mounted on a logical volume:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid10 sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0]
3906525184 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]

# pvdisplay
--- Physical volume ---
PV Name /dev/md127
VG Name vg1
PV Size 3.64 TiB / not usable 0
Allocatable yes
PE Size 4.00 MiB
Total PE 953741
Free PE 489746
Allocated PE 463995
PV UUID KH4RjS-lgAN-2OdI-hiYQ-HuR1-naDM-nSmc5S

# mount | grep db
/dev/mapper/vg1-db on /db type ext4 (rw,noatime,nodiratime,discard,stripe=512,data=ordered)

The current readahead settings (RA column) are the following:

# blockdev --report
RO RA SSZ BSZ StartSec Size Device
rw 256 512 4096 0 1000204886016 /dev/sda
rw 256 512 4096 2048 1023410176 /dev/sda1
rw 256 512 1024 2002942 1024 /dev/sda2
rw 256 512 4096 2002944 999178633216 /dev/sda5
rw 256 512 4096 0 1022820352 /dev/md0
rw 256 512 4096 0 999044218880 /dev/md1
rw 256 512 4096 0 1000204886016 /dev/sdb
rw 256 512 4096 2048 1023410176 /dev/sdb1
rw 256 512 1024 2002942 1024 /dev/sdb2
rw 256 512 4096 2002944 999178633216 /dev/sdb5
rw 256 512 4096 0 49996103680 /dev/dm-0
rw 256 512 4096 0 135094337536 /dev/dm-1
rw 256 512 4096 0 1000204886016 /dev/sdc
rw 8192 512 4096 0 4000281788416 /dev/md127
rw 256 512 4096 0 1000204886016 /dev/sdd
rw 256 512 4096 0 1000204886016 /dev/sde
rw 256 512 4096 0 1000204886016 /dev/sdf
rw 256 512 4096 0 1000204886016 /dev/sdg
rw 256 512 4096 0 1000204886016 /dev/sdh
rw 256 512 4096 0 1000204886016 /dev/sdi
rw 256 512 4096 0 1000204886016 /dev/sdj
rw 8192 512 4096 0 1946136084480 /dev/dm-2
rw 256 512 4096 0 214748364800 /dev/dm-3
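The RA column is expressed in 512-byte sectors, the same unit that blockdev --getra and --setra use, so 256 means 128 KiB and 8192 means 4 MiB. The kernel exposes the same setting in KiB as /sys/block/<dev>/queue/read_ahead_kb. A minimal sketch of the conversion, assuming bash; both function names are hypothetical helpers, not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: convert between blockdev's readahead unit
# (512-byte sectors) and KiB, the unit of /sys/block/<dev>/queue/read_ahead_kb.

ra_sectors_to_kib() {
    local sectors=$1
    echo $(( sectors * 512 / 1024 ))
}

ra_kib_to_sectors() {
    local kib=$1
    echo $(( kib * 1024 / 512 ))
}

# The two RA values that appear in the blockdev --report output above
for ra in 256 8192; do
    echo "RA=$ra sectors = $(ra_sectors_to_kib "$ra") KiB"
done
```

For this server that means the disks read ahead 128 KiB, while md127 and dm-2 read ahead 4096 KiB.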

With these settings, let’s run some tests with hdparm -t, which times sequential buffered reads from the device:

# hdparm -t /dev/md127

Timing buffered disk reads: 3004 MB in 3.00 seconds = 1001.14 MB/sec
Timing buffered disk reads: 2982 MB in 3.00 seconds = 992.78 MB/sec
Timing buffered disk reads: 3654 MB in 3.00 seconds = 1217.26 MB/sec
Timing buffered disk reads: 3030 MB in 3.00 seconds = 1009.46 MB/sec
Timing buffered disk reads: 3026 MB in 3.00 seconds = 1007.32 MB/sec

# hdparm -t /dev/dm-2

Timing buffered disk reads: 2784 MB in 3.00 seconds = 927.44 MB/sec
Timing buffered disk reads: 2508 MB in 3.00 seconds = 835.10 MB/sec
Timing buffered disk reads: 2532 MB in 3.00 seconds = 843.62 MB/sec
Timing buffered disk reads: 2792 MB in 3.00 seconds = 930.62 MB/sec
Timing buffered disk reads: 2836 MB in 3.00 seconds = 944.93 MB/sec

# hdparm -t /dev/sdc

Timing buffered disk reads: 1046 MB in 3.00 seconds = 348.13 MB/sec
Timing buffered disk reads: 1048 MB in 3.00 seconds = 348.91 MB/sec
Timing buffered disk reads: 1034 MB in 3.01 seconds = 344.00 MB/sec
Timing buffered disk reads: 1050 MB in 3.01 seconds = 349.25 MB/sec
Timing buffered disk reads: 1044 MB in 3.00 seconds = 347.97 MB/sec

(/dev/dm-2 is /dev/mapper/vg1-db)
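If you are not sure which dm-N sits behind a /dev/mapper name, you can follow the symlink. The sketch below assumes bash and a udev-managed /dev where /dev/mapper entries are symlinks to ../dm-N (the usual case on current distributions); the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any tool).

# Print the kernel device name behind a path, following symlinks,
# e.g. /dev/mapper/vg1-db -> dm-2 on the server above.
dev_kernel_name() {
    basename "$(readlink -f "$1")"
}

# Print the sysfs file that holds the readahead (in KiB) for that device.
# Only meaningful for whole devices (sdX, mdN, dm-N), not partitions.
ra_sysfs_path() {
    echo "/sys/block/$(dev_kernel_name "$1")/queue/read_ahead_kb"
}
```

On the server above, dev_kernel_name /dev/mapper/vg1-db should print dm-2, and reading the file given by ra_sysfs_path should show 4096, which is the 8192 sectors from the report.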

dm-2 and md127 show similar results, since both have the same readahead (8192). As expected, sdc is much slower, both because of its smaller readahead (256) and because a single disk cannot take advantage of RAID read parallelism. Let’s leave it out of the rest of the tests.
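Since single hdparm runs vary quite a bit (md127 ranges from about 993 to 1217 MB/sec above), it helps to compare averages. A minimal sketch, assuming bash and awk; avg_hdparm is a hypothetical helper that reads hdparm -t output on stdin. Averaging the five runs above this way gives 1045.59 MB/sec for md127 and 896.34 MB/sec for dm-2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average the MB/sec figures of repeated "hdparm -t"
# runs read from stdin. Lines other than "Timing buffered disk reads" are ignored.
avg_hdparm() {
    awk -F'= ' '/Timing buffered disk reads/ {
        split($2, f, " ")   # f[1] is the number in front of "MB/sec"
        sum += f[1]
        n++
    }
    END {
        if (n > 0) printf "%.2f MB/sec over %d runs\n", sum / n, n
    }'
}

# Example (needs root): five runs against the LV, then the average
#   for i in 1 2 3 4 5; do hdparm -t /dev/dm-2; done | avg_hdparm
```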

Test 1: Set readahead to 0 only on /dev/md127

blockdev --setra 0 /dev/md127

Results:

# hdparm -t /dev/dm-2

Timing buffered disk reads: 2780 MB in 3.00 seconds = 926.19 MB/sec
Timing buffered disk reads: 2900 MB in 3.00 seconds = 966.01 MB/sec
Timing buffered disk reads: 3000 MB in 3.00 seconds = 999.22 MB/sec
Timing buffered disk reads: 3112 MB in 3.00 seconds = 1037.30 MB/sec
Timing buffered disk reads: 2624 MB in 3.00 seconds = 873.89 MB/sec

# hdparm -t /dev/md127

Timing buffered disk reads: 278 MB in 3.00 seconds = 92.62 MB/sec
Timing buffered disk reads: 328 MB in 3.02 seconds = 108.78 MB/sec
Timing buffered disk reads: 278 MB in 3.00 seconds = 92.62 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 108.92 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 92.56 MB/sec

md127 drops to around 100 MB/sec, but dm-2, which sits on top of it, is not affected by the change.

Test 2: Set readahead to 0 only on /dev/dm-2

First restore the original readahead on /dev/md127, then set /dev/dm-2 to 0:

blockdev --setra 8192 /dev/md127

blockdev --setra 0 /dev/dm-2

Results:

# hdparm -t /dev/dm-2

Timing buffered disk reads: 2780 MB in 3.00 seconds = 89.15 MB/sec
Timing buffered disk reads: 2900 MB in 3.00 seconds = 101.55 MB/sec
Timing buffered disk reads: 3000 MB in 3.00 seconds = 98.21 MB/sec
Timing buffered disk reads: 3112 MB in 3.00 seconds = 87.87 MB/sec
Timing buffered disk reads: 2624 MB in 3.00 seconds = 91.80 MB/sec

# hdparm -t /dev/md127

Timing buffered disk reads: 278 MB in 3.00 seconds = 1123.22 MB/sec
Timing buffered disk reads: 328 MB in 3.02 seconds = 1120.78 MB/sec
Timing buffered disk reads: 278 MB in 3.00 seconds = 1002.34 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 1109.44 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 1008.32 MB/sec
Timing buffered disk reads: 328 MB in 3.01 seconds = 1050.65 MB/sec

Now only dm-2, the device with readahead 0, is affected (around 90 MB/sec), while md127 keeps its full speed.

Conclusion:

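No matter what sits below or on top of each device, readahead must be set on the device the operating system is actually reading from, that is, the one that is mounted. In these tests, the readahead of md127 had no effect on reads through dm-2, and vice versa.

So, in my case, I need to change the readahead on /dev/dm-2 (or, equivalently, /dev/mapper/vg1-db).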
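Note that blockdev --setra does not survive a reboot. One way to make the setting on the logical volume permanent is to store it in the LVM metadata with lvchange --readahead, which, like blockdev, takes a count of 512-byte sectors. The sketch below, assuming bash, is a hypothetical dry-run helper that only prints the two commands for a given VG and LV; it also shows how the /dev/mapper name is derived (any hyphen inside a VG or LV name is doubled). 8192 is simply the value already used on this server, not a recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the commands that would set readahead
# on an LVM logical volume. It only echoes them; nothing is executed.
#   $1 = volume group, $2 = logical volume, $3 = readahead in 512-byte sectors
plan_lv_readahead() {
    local vg=$1 lv=$2 ra=$3
    # device-mapper names double every hyphen inside the VG or LV name,
    # then join the two with a single hyphen (vg1 + db -> vg1-db).
    local dm_name="${vg//-/--}-${lv//-/--}"
    # Takes effect immediately, but is lost on reboot
    echo "blockdev --setra $ra /dev/mapper/$dm_name"
    # Stored in the LVM metadata and applied when the LV is activated
    echo "lvchange --readahead $ra $vg/$lv"
}

plan_lv_readahead vg1 db 8192
```

For this server it prints blockdev --setra 8192 /dev/mapper/vg1-db followed by lvchange --readahead 8192 vg1/db.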
