IO and Cache in Linux

Jason · Dec 5, 2023 13:49

In real-world development, we often find that data written to a file is lost after a power failure, so it is worth studying the root cause.

Cache

The figure shows the path user data takes to the disk when a program calls the various file operation functions. It describes the hierarchy of file operation functions under Linux and where each in-memory cache layer sits. The solid black line in the middle is the boundary between user mode and kernel mode.

Reading the figure from top to bottom, we start with the file operation functions defined by the C stdio library, which are cross-platform wrappers implemented in user mode. These functions have their own stdio buffer, a cache that lives in user space. The reason for it is simple: system calls are expensive. If user code repeatedly reads or writes a file in small pieces, the stdio library can aggregate those operations in its buffer and issue fewer system calls, which improves performance. The stdio library also provides fflush(3), which forces the buffered data to be passed to the underlying system call immediately. In addition, setbuf(3) can set the user-mode buffer used by the stdio library, or disable buffering altogether.
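
The same user-space buffering can be observed from Python, whose buffered file objects play a role comparable to the stdio buffer. The following is a minimal sketch, not C stdio itself: the helper names sizes_around_flush and size_unbuffered are made up for illustration, and the payload is assumed to be much smaller than the buffer. Data written through a buffered file object should not show up in the file size until flush() hands it to write(2), while buffering=0 removes the user-space buffer, much as disabling the buffer with setbuf(3) does in C.

```python
import os
import tempfile


def sizes_around_flush(payload: bytes):
    """Return (size before flush, size after flush) for a buffered write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:            # buffered by default, like a stdio FILE
            f.write(payload)                   # small payloads stay in the user-space buffer
            before = os.stat(path).st_size     # size as the kernel currently sees it
            f.flush()                          # comparable to fflush(3): issues write(2)
            after = os.stat(path).st_size
        return before, after
    finally:
        os.unlink(path)


def size_unbuffered(payload: bytes) -> int:
    """Return the file size right after a write with buffering disabled."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb", buffering=0) as f:   # no user-space buffer
            f.write(payload)                       # goes straight to write(2)
            return os.stat(path).st_size
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(sizes_around_flush(b"hello"))
    print(size_unbuffered(b"hello"))
```

Note that flush() only moves the data from user space into the kernel; it does not by itself guarantee that the data has reached the disk.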

There is also a buffer between the read(2)/write(2) system calls and the actual disk reads and writes, referred to here as the Kernel buffer cache. In Linux, the cache for file contents is customarily called the Page Cache, and the cache for lower-level devices is called the Buffer Cache. The two concepts are easy to confuse, so here is a brief summary of the difference. The Page Cache caches the contents of files and is more closely tied to the file system: file contents must be mapped to locations on the physical disk, and the file system maintains that mapping. The Buffer Cache caches blocks of the storage device (such as disk sectors), whether or not a file system is present; file system metadata is cached in the Buffer Cache.
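
On a running system, the sizes of the two caches can be roughly observed through the Buffers and Cached fields of /proc/meminfo. A small sketch, assuming a hypothetical helper named cache_sizes_kib that parses the text of that file:

```python
def cache_sizes_kib(meminfo_text: str) -> dict:
    """Extract the Buffers and Cached values (in KiB) from /proc/meminfo text."""
    wanted = {"Buffers", "Cached"}
    sizes = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted and rest.split():
            sizes[name] = int(rest.split()[0])   # lines look like "Cached:   123456 kB"
    return sizes


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(cache_sizes_kib(f.read()))
    except OSError:
        print("/proc/meminfo is not available on this system")
```

Matching on the exact field name keeps similarly named fields such as SwapCached out of the result.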

IO stack in the Linux kernel

From the diagram, it can be seen that there are roughly three levels in the IO stack of Linux, starting from the system call interface:

File system layer: Taking write(2) as an example, the kernel copies the user data specified by the write(2) arguments into the file system cache and synchronizes it to the lower layer at an appropriate time.
Block layer: Manages the IO queues of block devices, and merges and sorts IO requests.
Device layer: Transfers data between memory and the specific device, interacting with memory directly through DMA.

Combining this diagram, we can think about how the mechanisms used in Linux system programming, such as Buffered IO, mmap(2), and Direct IO, are related to the Linux IO stack. The diagram is a bit complex, so I will draw a simplified diagram and add the locations of these mechanisms to it.

What happens when a file is read with read(2) using traditional Buffered IO? Assume we need to read a cold file (one that is not in the cache). After the file is opened with open(2), the kernel sets up a series of data structures. Next, read(2) is called. When the request reaches the file system layer, the kernel finds that the requested data is not in the Page Cache, so it creates the corresponding Page Cache pages and associates them with the relevant sectors. The request then reaches the block device layer, where it waits in the IO queue and goes through scheduling before it reaches the device driver layer. At this point, the disk sectors are usually read into the cache using DMA, and read(2) then copies the data into the user-space buffer supplied by the caller (as specified by the read(2) arguments).

How many times is the data copied during the whole process? If we count disk to Page Cache as the first copy, the second copy is from the Page Cache to the user-space buffer.

What does mmap(2) do? It maps the Page Cache pages directly into the process's address space, so reading a file through mmap(2) involves no second copy.

What about Direct IO? This mechanism is even more aggressive: it connects user space directly to the block IO layer and bypasses the Page Cache, so data moves straight between the disk and the user-space buffer. What are the benefits? For writes, the process's buffer is mapped to the disk sectors and the data is transferred by DMA, which removes the copy through the Page Cache and improves write efficiency. For reads, the first read is certainly faster than with the traditional method, but subsequent reads are slower because nothing is cached (of course, you can do your own caching in user space, as some commercial databases do).
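
To make the difference between the first two paths concrete, here is a minimal sketch in Python that reads the same file once with read(2) and once through mmap(2). The function names read_with_read and read_with_mmap are hypothetical, and the file is assumed to be non-empty, since an empty file cannot be mapped.

```python
import mmap
import os
import tempfile


def read_with_read(path: str) -> bytes:
    """Buffered IO: each read(2) copies data from the Page Cache into a user-space buffer."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:              # an empty result means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def read_with_mmap(path: str) -> bytes:
    """mmap(2): the file's cached pages are mapped into the process address space."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]                # the slice copies into bytes only so it can be returned


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"page cache demo")
        os.close(fd)
        print(read_with_read(path) == read_with_mmap(path))
    finally:
        os.unlink(path)
```

In real code the benefit of mmap(2) comes from working on the mapping in place; copying it out with a slice, as done here for the comparison, gives that advantage away.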

Traditional Buffered IO can read and write files freely at any offset and length. In contrast, mmap(2) requires the file offset to be page-aligned, and Direct IO requires the buffer address, file offset, and transfer size to be multiples of the underlying storage device's block size (Linux 2.4 was stricter and required multiples of the file system's logical block size). So as the interface gets closer to the hardware, the apparent efficiency gains come at the cost of more work in the application layer.
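
The alignment work that Direct IO pushes onto the application can be sketched as follows. This is an illustration under assumptions, not a definitive recipe: BLOCK = 4096 is an assumed alignment (the real requirement depends on the device and file system), direct_read is a hypothetical helper, and an anonymous mmap is used only because it gives a page-aligned buffer. The sketch returns None when O_DIRECT is not defined on the platform or the file system rejects it.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment; check the actual device/file system requirement


def direct_read(path: str, length: int = BLOCK):
    """Read up to `length` bytes from offset 0 with O_DIRECT, or return None if unavailable."""
    o_direct = getattr(os, "O_DIRECT", None)        # not defined on every platform
    if o_direct is None:
        return None
    size = max(1, -(-length // BLOCK)) * BLOCK      # round the transfer size up to a block multiple
    buf = mmap.mmap(-1, size)                       # anonymous mapping: page-aligned buffer
    try:
        fd = os.open(path, os.O_RDONLY | o_direct)
        try:
            n = os.readv(fd, [buf])                 # offset 0 is trivially aligned
        finally:
            os.close(fd)
        return buf[:min(n, length)]
    except OSError:
        return None                                 # e.g. the file system does not support O_DIRECT
    finally:
        buf.close()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * BLOCK)
        os.close(fd)
        result = direct_read(path)
        print("Direct IO unavailable" if result is None else len(result))
    finally:
        os.unlink(path)
```

With Buffered IO none of this bookkeeping is needed, which is the trade-off the paragraph above describes.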

Synchronization of Page Cache

Broadly speaking, there are two ways to synchronize a cache with the layer below it: Write Through and Write Back. As the names suggest, both are defined by how write operations are handled (a purely read-only workload raises no cache consistency problem). For the Linux Page Cache, Write Through means that write(2) copies the data to the Page Cache, immediately writes it to the underlying layer, and returns only after that update completes. Write Back is the opposite: write(2) returns as soon as the data is in the Page Cache, and the update from the Page Cache to the underlying layer happens asynchronously.

By default, Linux Buffered IO uses Write Back: a write only copies data to the Page Cache and returns, and the Page Cache is written to disk asynchronously. Modified pages in the Page Cache are called dirty pages. They are written to disk at specific times by a kernel thread called pdflush (Page Dirty Flush); since Linux 2.6.32 this job is done by per-device flusher threads instead. The timing and conditions for writing are as follows:

When the free memory is lower than a specific threshold, the kernel must write the dirty pages back to the disk to release memory.

When dirty pages have stayed in memory longer than a specific time threshold, the kernel must write the expired dirty pages back to the disk.

When a user process calls the sync(2), fsync(2), or fdatasync(2) system call, the kernel performs the corresponding write-back operation.
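
The third condition is the one an application controls, and it is the usual remedy for the lost-write problem described at the start. A minimal sketch, assuming a hypothetical helper named durable_write: the data has to leave the user-space buffer first (flush) and then the Page Cache (fsync).

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and wait until the kernel reports it has been written back."""
    with open(path, "wb") as f:
        f.write(data)            # 1. into the user-space buffer
        f.flush()                # 2. user-space buffer -> Page Cache, via write(2)
        os.fsync(f.fileno())     # 3. Page Cache -> storage device; blocks until done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        durable_write(os.path.join(d, "record.bin"), b"important data")
```

For a newly created file, fsync(2) on the file does not necessarily make the directory entry durable; that requires an additional fsync(2) on a file descriptor for the containing directory.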

The flushing strategy is determined by the following parameters (the *_centisecs values are in units of 1/100 second; dirty_background_ratio is a percentage):

# The flusher wakes up every 5 seconds
$ cat /proc/sys/vm/dirty_writeback_centisecs
500
# Dirty data that has stayed in memory for more than 30 seconds is written to disk the next time the flusher runs
$ cat /proc/sys/vm/dirty_expire_centisecs
3000
# When dirty pages exceed 10% of available memory, background writeback to disk is triggered
$ cat /proc/sys/vm/dirty_background_ratio
10
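
The same parameters can be read from a program and converted into friendlier units. A small sketch, assuming hypothetical helper names (centisecs_to_seconds, read_writeback_settings) and made-up result keys; files that do not exist on the current system are simply skipped.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def centisecs_to_seconds(value: str) -> float:
    """Convert a value expressed in 1/100 second to seconds."""
    return int(value.strip()) / 100


def read_writeback_settings(vm_dir: Path = VM_DIR) -> dict:
    """Collect the writeback tunables that exist under vm_dir."""
    settings = {}
    for name in ("dirty_writeback_centisecs", "dirty_expire_centisecs"):
        entry = vm_dir / name
        if entry.exists():
            key = name.replace("_centisecs", "_seconds")
            settings[key] = centisecs_to_seconds(entry.read_text())
    ratio = vm_dir / "dirty_background_ratio"
    if ratio.exists():
        settings["dirty_background_percent"] = int(ratio.read_text().strip())
    return settings


if __name__ == "__main__":
    print(read_writeback_settings())
```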

The default is Write Back. What if a particular file should be Write Through, that is, when the reliability of writes outweighs efficiency? This is possible: besides calling fsync(2) and the related system calls mentioned earlier, you can pass the O_SYNC flag when opening the file with open(2).
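
Opening a file with O_SYNC can be sketched like this. The helper name write_through is hypothetical; with the flag set, each write(2) is expected to return only after the data has been handed to the storage device.

```python
import os
import tempfile


def write_through(path: str, data: bytes) -> int:
    """Write data through a descriptor opened with O_SYNC; return the byte count written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, data)    # may write fewer bytes than requested
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_through(os.path.join(d, "journal.bin"), b"entry"))
```

Compared with calling fsync(2) at chosen points, O_SYNC applies to every write on that descriptor, so it suits files where each individual write must be durable.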