CVE-2026-23157

Description

In the Linux kernel, the following vulnerability has been resolved:btrfs: do not strictly require dirty metadata threshold for metadata writepages[BUG]There is an internal report that over 1000 processes arewaiting at the io_schedule_timeout() of balance_dirty_pages(), causinga system hang and trigger a kernel coredump.The kernel is v6.4 kernel based, but the root problem still applies toany upstream kernel before v6.18.[CAUSE]From Jan Kara for his wisdom on the dirty page balance behavior first. This cgroup dirty limit was what was actually playing the role here because the cgroup had only a small amount of memory and so the dirty limit for it was something like 16MB. Dirty throttling is responsible for enforcing that nobody can dirty (significantly) more dirty memory than theres dirty limit. Thus when a task is dirtying pages it periodically enters into balance_dirty_pages() and we let it sleep there to slow down the dirtying. When the system is over dirty limit already (either globally or within a cgroup of the running task), we will not let the task exit from balance_dirty_pages() until the number of dirty pages drops below the limit. So in this particular case, as I already mentioned, there was a cgroup with relatively small amount of memory and as a result with dirty limit set at 16MB. A task from that cgroup has dirtied about 28MB worth of pages in btrfs btree inode and these were practically the only dirty pages in that cgroup.So that means the only way to reduce the dirty pages of that cgroup isto writeback the dirty pages of btrfs btree inode, and only after thatthose processes can exit balance_dirty_pages().Now back to the btrfs part, btree_writepages() is responsible forwriting back dirty btree inode pages.The problem here is, there is a btrfs internal threshold that if thebtree inodes dirty bytes are below the 32M threshold, it will notdo any writeback.This behavior is to batch as much metadata as possible so we wont writeback those tree blocks and then later re-COW them again for anothermodification.This internal 32MiB is higher than the existing dirty page size (28MiB),meaning no writeback will happen, causing a deadlock between btrfs andcgroup:- Btrfs doesnt want to write back btree inode until more dirty pages- Cgroup/MM doesnt want more dirty pages for btrfs btree inode Thus any process touching that btree inode is put into sleep until the number of dirty pages is reduced.Thanks Jan Kara a lot for the analysis of the root cause.[ENHANCEMENT]Since kernel commit b55102826d7d (btrfs: set AS_KERNEL_FILE on thebtree_inode), btrfs btree inode pages will only be charged to the rootcgroup which should have a much larger limit than btrfs 32MiBthreshold.So it should not affect newer kernels.But for all current LTS kernels, they are all affected by this problem,and backporting the whole AS_KERNEL_FILE may not be a good idea.Even for newer kernels I still think its a good idea to getrid of the internal threshold at btree_writepages(), since for most casescgroup/MM has a better view of full system memory usage than btrfs fixedthreshold.For internal callers using btrfs_btree_balance_dirty() since thatfunction is already doing internal threshold check, we dont need tobother them.But for external callers of btree_writepages(), just respect theirrequests and write back whatever they want, ignoring the internalbtrfs threshold to avoid such deadlock on btree inode dirty pagebalancing.

Risk Information

Base Score
5.5
MODERATE
Vector
CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H
EPSS Score
Exploitation Probability
0.007

Associated Vulnerability

No records found

Patch Details

No records found

References

https://nvd.nist.gov/vuln/detail/CVE-2023-1234
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-1234