The Journey of I/O from Userspace to Device

When a Linux userspace application performs disk I/O – such as writing to an SD card, those requests go through a complex journey across multiple kernel components. These components aim to deliver optimal I/O performance whilst balancing a variety of needs such as latency and fairness. However, when things go wrong, it can be challenging to trace and understand the path that I/O takes. In this post, we will take a glimpse into the journey of I/O requests and unravel some the complexities of the Linux kernel components involved.

Let’s start by using the ‘dd‘ utility to perform the seemingly simple task of writing two 128KB blocks of zeros to an SD card:

$ dd if=/dev/zero of=/dev/mmcblk0 bs=128K count=2

In order to help us examine what IO requests are actually sent to the SD card, we’ll make use of the ‘btrace‘ script running in another terminal. The script, which utilises blktrace and blkparse extracts event data from the kernel via debugfs and can provide excellent insight. The command is shown below, note that we’ve used grep to filter the output to only show requests that have been “completed” by the device driver (SD/MMC).

$ btrace -t /dev/mmcblk0 | grep " C "

What are you expecting to see? Perhaps two 128K write requests? Let’s take a look at the output from btrace.

 179,0    1      325   171.158566907    22  C  WS 0 + 512 (22530758) [0]
 179,0    4      551   171.160517220    40  C   R 0 + 8 ( 1254058) [0]
 179,0    1      330   171.161858718    22  C   R 8 + 8 ( 1198855) [0]
 179,0    4      556   171.163031344    40  C   R 24 + 8 ( 1029382) [0]
 179,0    6      118   171.170218683    52  C  RA 15267712 + 8 (  802653) [0]
 179,0    1      337   171.171261676    22  C  RA 15267824 + 8 (  908070) [0]
 179,0    6      125   171.172299234    52  C  RA 15267832 + 8 (  922497) [0]
 179,0    1      344   171.173451941    22  C  RA 15267576 + 8 ( 1036266) [0]
 179,0    6      132   171.174496423    52  C  RA 15267776 + 8 (  878108) [0]
 179,0    1      351   171.175505872    22  C  RA 15267584 + 8 (  887605) [0]
 179,0    6      139   171.176480833    52  C  RA 15267440 + 8 (  861523) [0]
...

Each line represents an event that has occurred, you’ll notice a column with the values WS, R or RA – where “R” refers to a read and “W” refers to a write. The “S” indicates a synchronous operation and the “RA” refers to read-ahead which is when the kernel reads more than you asked for (as it assumes you’ll probably soon ask for it as well). You’ll also see next to it an offset and size, e.g. “WS 0 + 512 means a write starting at offset 0 and of size 512 sectors – where a sector is 512 bytes.

The output above shows that, in response to our ‘dd’ command, a single 256KB write (512 sectors of 512 bytes) was issued to the SD card, followed by 48 reads (not all shown above) totalling more than 1MB.

The unexpected reads are best explained by a thread on the systemd mailing list: “When a device node is closed for writing, the kernel generates a ‘change’ uevent for it. This prompts udev to re-read filesystems and update its cache of labels, UUIDs, and other information.” – In other words the partition table of the device is being read. These extra reads are inconvenient and add unnecessary ‘noise’ to our investigation. Rather than dig into systemd/udev code, we decided to make use of the flock system call, which allows a process to hold an exclusive lock on a file (in our case a device node) which locks out any other process (such as systemd/udev). To aid our testing, we wrote a simple utility that uses flock to request a lock on /dev/mmcblk0 and then starts a given command (e.g. dd) having previously forked blktrace. With this, when we re-run the ‘dd’ command we no longer see any reads. Phew.

However, how do we explain seeing one 256KB write when we asked for two 128KB writes? We asked the kernel, via ‘dd’ to perform two writes of 128K each and we can prove this by looking at strace output from dd (notice the size argument in the write system call):

$ sudo strace -e write dd if=/dev/zero of=/dev/mmcblk0 bs=128K count=2
 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 131072) = 131072
 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 131072) = 131072
 write(2, "2+0 records in\n2+0 records out\n", 312+0 records in
 2+0 records out
 ) = 31
 write(2, "262144 bytes (262 kB, 256 KiB) c"…, 61262144 bytes (262 kB, 256 KiB) copied, 0.00075298 s, 348 MB/s) = 61
 write(2, "\n", 1
 )                       = 1
 +++ exited with 0 +++

Let’s take another look at the full, unfiltered output from btrace to get some clues:

 179,0    4        1     0.000000000 120257  Q   W 0 + 8 [kworker/u16:0]
 179,0    4        2     0.000010144 120257  G   W 0 + 8 [kworker/u16:0]
 179,0    4        3     0.000011676 120257  P   N [kworker/u16:0]
 179,0    4        4     0.000014309 120257  Q   W 8 + 8 [kworker/u16:0]
 179,0    4        5     0.000015904 120257  M   W 8 + 8 [kworker/u16:0]
 179,0    4        6     0.000018056 120257  Q   W 16 + 8 [kworker/u16:0]
 179,0    4        7     0.000018342 120257  M   W 16 + 8 [kworker/u16:0]
 179,0    4        8     0.000019860 120257  Q   W 24 + 8 [kworker/u16:0]
 179,0    4        9     0.000020136 120257  M   W 24 + 8 [kworker/u16:0]
...
 179,0    4      125     0.000132941 120257  M   W 488 + 8 [kworker/u16:0]
 179,0    4      126     0.000134407 120257  Q   W 496 + 8 [kworker/u16:0]
 179,0    4      127     0.000134674 120257  M   W 496 + 8 [kworker/u16:0]
 179,0    4      128     0.000136394 120257  Q   W 504 + 8 [kworker/u16:0]
 179,0    4      129     0.000136661 120257  M   W 504 + 8 [kworker/u16:0]
 179,0    4      130     0.000140161 120257  U   N [kworker/u16:0] 1
 179,0    4      131     0.000141694 120257  I   W 0 + 512 (  131550) [kworker/u16:0]
 179,0    4      132     0.426671897   296  D   W 0 + 512 (426530203) [kworker/4:1H]
 179,0    4      133     0.449229017    40  C   W 0 + 512 (22557120) [0]
 CPU4 (mmcblk0):
  Reads Queued:           0,        0KiB  Writes Queued:          64,      256KiB
  Read Dispatches:        0,        0KiB  Write Dispatches:        1,      256KiB
  Reads Requeued:         0               Writes Requeued:         0
  Reads Completed:        0,        0KiB  Writes Completed:        1,      256KiB
  Read Merges:            0,        0KiB  Write Merges:           63,      252KiB
  Read depth:             0               Write depth:             1
  IO unplugs:             1               Timer unplugs:           0
 Throughput (R/W): 0KiB/s / 570KiB/s
 Events (mmcblk0): 133 entries
 Skips: 0 forward (0 -   0.0%)

The sixth column describes the trace action, we’ve previously seen ‘C’ for device completion – but now we also have Q, G, P, M, U, I and D. Let’s explain these by following the first lines above:

Q W 0 + 8: A write request of 4096 bytes to offset 0 would like to be queued
G W 0 + 8: A structure for the above request (struct request) has been allocated
P N: This is a plug event, another I/O optimisation – let’s ignore this for now
Q W 8 + 8: Another write request of 4096 bytes to offset 4096 would like to be queued
M W 8 + 8: A request to merge this write request with another request (turning these two requests into one)

Most of the output from btrace shows writes of 4096 bytes being queued (Q) and then immediately merged (M) with a previous request. Towards the end of the log we see a different pattern:

U N: This is an unplug event
I W 0 + 512: A write request of 256KB is sent to the I/O scheduler
D W 0 + 512: The write request of 256KB is sent to the device driver
C W 0 + 512: The device driver has completed the 256KB write request

This is another surprise, our user space utility ‘dd’ made two writes of 128KB, yet they appeared in btrace as 64 writes of 4K, which was then combined into a single write request of 256KB made to the driver. Why? This time we can blame Linux’s page cache.

The Linux page cache is a cache of pages that sits between the underlying storage (e.g. disks) and the Linux Virtual File System (VFS – which is Linux’s way of providing a common interface to filesystems). When data is read (e.g. you read a file), the data will only be retrieved from the underlying storage if it’s not already in the page cache. When data is written it is only written to the page cache (it’s a write-back cache), the kernel will write out this ‘dirty’ data to storage at a convenient time in the future. The cache makes full use of otherwise unused RAM and provides for improved system performance whilst reducing I/O operations.

But how does the kernel know when to write back these dirty pages from the page cache to the underlying block layer? That really depends – in our case, we were interacting with the /dev/mmcblk0 block device, which is really just an inode with a special set of generic ‘struct file_operations’ for block devices. The file_operations are a set of callbacks which are called when the file is opened, written to, etc. We can see that for /dev/mmcblk0 when all file handles to a file are closed then it calls ‘filemap_flush’ – this is a function that is used to tell the kernel to writeback dirty pages. We can see use of this function (and it’s synchronous version filemap_write_and_wait) in various places across the kernel, for example when closing a file on a vfat filesystem that was mounted with the ‘flush’ flag. For our /dev/mmcblk0 device, when we use the fsync system call, or use the fsync option in ‘dd’, then that too will use filemap_write_and_wait. However, failing explicit instructions to writeback, the kernel will periodically run a background thread to write out old data – this is very configurable via the files in /proc/sys/vm that are prefixed with dirty_.

It’s also worth noting that when the kernel decides to write back dirty data, the way it does that depends, once again, on the driver being used. Each inode has a set of callbacks named ‘struct address_space_operations’ and these describe how to read and write pages. Our /dev/mmcblk0 uses default ‘struct address_space_operations’, when the kernel wants to write back pages they are written page at a time. Referring back to our btrace output you’ll notice that the queue requests are all 8 sectors, or 4096 bytes in size – that’s because the size of a page on my system is 4096 bytes. But it’s interesting to note that most file systems use a more effecient method, in the case of extfat it uses mpage_writepages which attempts to combine pages into fewer requests (bio’s).

The final part of the mystery is understanding how these 8 sector (4096 bytes) requests end up as a single 512 sector (256KB) request to the SD card. This is an optimisation performed by the block layer, which we can see in our btrace output:

 179,0    4      125     0.000132941 120257  M   W 488 + 8 [kworker/u16:0]
 179,0    4      126     0.000134407 120257  Q   W 496 + 8 [kworker/u16:0]
 179,0    4      127     0.000134674 120257  M   W 496 + 8 [kworker/u16:0]
 179,0    4      128     0.000136394 120257  Q   W 504 + 8 [kworker/u16:0]

When the block layer sees that we have a write that continues from a previous write, it merges them together (shown by the ‘M’ action) which acts to optimise I/O performance. The result is a merge of 64 requests of 8 sectors (4096 bytes) – we see the outcome of this when the block layer issues the request to the I/O scheduler – a single request of 512 sectors (256KB)

179,0 4 131 0.000141694 120257 I W 0 + 512 ( 131550)

Incidentally, we can disable this merging behaviour via the nomerges sysfs file, for our device that’s found in /sys/block/mmcblk0/queue/nomerges – by writing a value of 2 no merges will be performed. Let’s try this and review the summary provided by blkparse:

CPU5 (mmcblk0):
  Reads Queued:           0,        0KiB  Writes Queued:          64,      256KiB
  Read Dispatches:        0,        0KiB  Write Dispatches:       64,      256KiB
  Reads Requeued:         0               Writes Requeued:         0
  Reads Completed:        0,        0KiB  Writes Completed:       64,      256KiB
  Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
  Read depth:             0               Write depth:             3
  IO unplugs:             1               Timer unplugs:           3
 Throughput (R/W): 0KiB/s / 323KiB/s
 Events (mmcblk0): 328 entries
 Skips: 0 forward (0 -   0.0%)

You can see that 64 write requests were queued (i.e. 256KB divided into 4096 byte pages), but unlike the first trace we see 64 write dispatches and completions – in other words, no merging occurred and the SD card was told to make 64 writes of 4096 bytes. However, let’s revert our ‘nomerges’ file back to the default merging behaviour.

To summarise what we’ve seen – we started in userspace with two write system calls, writing 128KB at a time. This made it to the page cache which was eventually written out to the block layer as 64 page size requests. The block layer merged these as it received them, into a single 256K write, which was issued to the I/O scheduler, and eventually passed to the underlying SD card.

Of course, we could have bypassed the page cache by updating our dd line as follows to include the ‘direct’ oflag as follows. This results in us using direct I/O instead of buffered I/O (buffered by the page cache). It’s interesting to note that the address_space_operations structure includes a callback for doing direct I/O, once again allowing the driver to determine the best way to write pages to disk.

$ dd if=/dev/zero of=/dev/mmcblk0 bs=128K count=2 oflag=direct

Let’s see what our btrace looks like:

 179,0    1        1     0.000000000 224632  Q  WS 0 + 256 [dd]
 179,0    1        2     0.000011317 224632  G  WS 0 + 256 [dd]
 179,0    1        3     0.000013021 224632  P   N [dd]
 179,0    1        4     0.000014960 224632 UT   N [dd] 1
 179,0    1        5     0.000016757 224632  I  WS 0 + 256 (    5440) [dd]
 179,0    1        6     0.454820951 120533  D  WS 0 + 256 (454804194) [kworker/1:2H]
 179,0    1        7     0.464689235    22  C  WS 0 + 256 ( 9868284) [0]
 179,0    2        1     0.464784982 224632  Q  WS 256 + 256 [dd]
 179,0    2        2     0.464788735 224632  G  WS 256 + 256 [dd]
 179,0    2        3     0.464789740 224632  P   N [dd]
 179,0    2        4     0.464790533 224632 UT   N [dd] 1
 179,0    2        5     0.464791361 224632  I  WS 256 + 256 (    2626) [dd]
 179,0    2        6     0.464811958 120532  D  WS 256 + 256 (   20597) [kworker/2:0H]
 179,0    2        7     0.476459250    28  C  WS 256 + 256 (11647292) [0]

This time we can see that that that a write request of 256 sectors (128KB) is queued, and we can see this is queued by the process ‘dd’ (Q). We then see a structure get allocated (G), a plug and unplug event (P, UT). Finally we see the 256 sector write being issued to the I/O scheduler where it goes on to being sent to the device driver which is shortly completed. We then see this happen again, but for the second 128KB block.

You may be wondering why the block layer didn’t merge the two 128KB blocks into one 256KB block. This is because the ‘write’ system call is synchronous, it will wait for the write to complete before returning – thus the block layer or I/O scheduler never gets a chance to merge them. (When we used the page cache the write is considered complete once it’s written to the page cache). By using the ‘-r’ flag in strace to add timestamps, we can compare the timestamps between the btrace and strace output and see that the second write request from ‘dd’ only arrives after the first request has been completed by the SD card.

By the way, you may recall plug and unplug events in the blktrace output – this is another optimisation that serves to delay issuing requests to the I/O scheduler to allow them to build up enough such that requests can be merged or sorted. Plugs are usually started when writing to an empty device queue, and are unplugged either explicitly by the task or implicitly when the task is scheduled out.

We’ve seen how simple merging of requests can potentially improve I/O performance – by having fewer more optimally sized requests. However, throughput isn’t the only factor in achieving acceptable system wide performance, other considerations such as latency and fairly sharing I/O resources between processes are also important. It’s for this reason that the kernel supports I/O schedulers – algorithms that take these factors into consideration and determine what requests are made to the underlying devices and when. We can see, via sysfs that we are using the mq-deadline scheduler:

$ cat /sys/block/mmcblk0/queue/scheduler 
 [mq-deadline] none

However, exploration of I/O schedulers is probably a topic for another blog post.

Finally, when merging I/O requests, how does the kernel know what an optimal I/O request is for a device? What would have happened if we requested 10 blocks of 128KB, would it have merged it to one 1280KB request? Let’s find out:

$ btrace -t /dev/mmcblk0 | grep " C "
 179,0    2      390     0.065672021    28  C  WS 0 + 1024 (65285549) [0]
 179,0    2      391     0.109364068    28  C  WS 1024 + 1024 (107779680) [0]
 179,0    2      392     0.132054879    28  C  WS 2048 + 512 (66514186) [0]

Ah – so it’s decided to merge the request into 3 requests of 512KB bytes, why? To find out more we need to look at the value of some sysfs files:

$ cd /sys/block/mmcblk0/queue/
$ cat hw_sector_size max_sectors_kb max_segments max_segment_size 
 512
 512
 256
 65536

This tells us that the maximum size the device supports is 512 KB (hw_sector_size). However this can be capped by max_sectors_kb which is a writeable file that controls the maximum size that the block layer will allow, in our case it matches the maximum supported size by the device. It’s also interesting to note that devices can report an optimal size (which may be smaller than the maximum size) via optimal_io_size.

As we’ve seen the initial 128KB buffers of memory we are trying to write to our SD card are broken down into 4KB pages, however these pages are unlikely to be physically contiguous in memory – as a result scatter gather DMA is often used with the request being broken down into ‘segments’ of contigious blocks of memory. The max_segments and max_segment_size describe the limits of this, in our case we can have a I/O request containing up to 256 segments of 64KB – that could allow us a maximum transfer (if the device supported it) of 16MB. However in practice, where the page cache is involved, the maximum would likely be 256 segments of 4KB so only 1MB.

As you can see there is much complexity in the journey of write requests from userspace through to an underlying storage medium – complexity which arises due to optimisations aimed at improving system wide performance (some which come from the context of minimising spinning-disk seek time). Thankfully with the help of btrace we can good visibility on what is going on under the hood, and by using sysfs controls and trivial invocations of dd we can see the effect of these optimisations. Understanding the path of requests from userspace to device and the optimisations along the way is really helpful when attempting to understand and improve work-load performance. However, we really have only scratched the surface!

The Journey of I/O from Userspace to Device

Popular Posts

Linaro Connect 2024: Madrid

Booting ARM without an ARM

Discovering CPU features from userspace with ELF_HWCAP

i.MX6 UL Bus Encryption Engine (BEE) in the Linux Kernel

You may also like...

Popular Posts