-
Notifications
You must be signed in to change notification settings - Fork 15
/
delalloc.txt
153 lines (126 loc) · 8.3 KB
/
delalloc.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
Delayed allocation (delalloc)
=============================
Most filesystems resorts to all kinds of trickery to ensure performance. One of
those tricks is called delayed allocation. This mean the filesystem goes to
great lengths to try and delay as much as possible the moment calling into the
extent allocator. BRFS makes no exception. Due to the nature of implementation
the code involved in this operation is a bit involved. The idea of this document
is to illuminate how delayed allocations (delallocs) are implemented in btrfs.
The complexicty stems primarily from the fact that BTRFS relies on a lot of
call back calls and asynchornicity to it's hard to follow the code.
Buffered write path
-------------------
Delallocs are created in the buffered write path (btrfs_buffered_write). That
function is fairly involved but the gist really is:
1. Allocate enough space to satisfy the space being written. This is achieved
via btrfs_delalloc_reserve_metadata. This function ensures there is enough
space for metadata (checksum bytes and data structures describing the extents
being written).
2. A number of pages necessary to house the number of bytes being written are
prepared by means of two functions: prepare_pages allocates the pages in the
pagecache tree of the file and lock_and_cleanup_extent_if_need ensures that
if there are any pending writes to the range of the file being written they
are completed.
3. Those pages are filled in with the new data via btrfs_copy_from_user.
At this point we have an inode and a range of page cache pages, corresponding
to the range of the file being currently written. No btrfs data structures have
yet been created for those "dirty" pages. Since we are using delalloc we've
only reserved space for the data/metadata but still don't have an idea where
on disk the data is going to be persisted. So there needs to be a way to tell
btrfs "those pages need something done to them in order to actually persist" the
data. This final step of the process is implemented via:
4. btrfs_dirty_pages. What this function does is to ensure that for any
writes after EOF in-memory copy of the extent data structures are created
(btrfs_find_new_delalloc_bytes). After that btrfs_set_extent_delalloc is
called which sets the EXTENT_DELALLOC bit to the newly written range of the
file. Finally all the newly written pages have they dirty flag set. This implies
that they are going to be written back to disk when the writeback timer expires
or if the system is low on memory.
Despite the buffered writeback code being messy, it's really those 4 logical
steps which trigger the delalloc logic. Bear in mind that we still haven't
called into the allocator to persist data i.e stuff is delayed.
Until the pages are presisted any reads to the region which has been written
will be served directly from the pages currently present in memory.
Page writeback
--------------
When the system is either low on memory or the writeback time has expired
pag writeback is triggered. There is a substantial amount of generic code which
is invoked before the filesystem is eventually called. The entry point into the
fs is the ->writepages function pointer, which is set to btrfs_writepages, this
is really a thin wrapper over extent_write_cache_pages. The last function goes
through every page of the inode passed (i.e. the address_space argument) and
calls __extent_writepage on it. __extent_writepage is responsible for allocating
all the necessary on-disk data structures describing the to-be-written page
and submitting the page for writeback. This is performed in the following
sequence:
1. writepage_delalloc is called. This function goes through the io_tree of the
inode whose page we are writing and finds a range marked with DELALLOC starting
at the offset in the written page. This is done by calling find_lock_delalloc_range.
Remember that the range was initialised by the buffered write path. Once such
a range is found, as defined by delalloc_start/delalloc_end variables they are passed to the
btrfs_run_delalloc_range(). This function pointer really points to
run_delalloc_range, which is a simple dispatch function. The purpose of the
dispatch function is to decide which routine to call to "fill the delalloc
range". The decision is based on whether NODATACOW is enabled (run_delalloc_nocow
is called in this case) or whether the BTRFS_INODE_PREALLOC flag is set, signalling
there are preallocated extents (also handled by run_delalloc_nocow). Or whether
compression is needed (in this case cow_file_range_async is called). Finally if
none of the above is true - cow_file_range is called.
For the sake of simplicity only cow_file_range is going to be considered in this
article. The basic idea is to allocate extents via btrfs_reserve_extent on disk
for the range. Following a successful reservation ordered data struct is
allocated via btrfs_add_ordered_extent. It's used to track the state of
pending io for a particular range. More on this will be discussed in the End IO
path section.
So let's stop for a moment and consider where we are in the lifetime of the
data. So now we have dirty pages created by the buffered write. What just
happened is appropriate extents were allocated for the range of bytes which the
currently written page represents and appropriate ordered data structs were
allocated to track those extents. What remains is for actual IO on the page be
triggered.
2. __extent_writepage_io is called to trigger io. This function really loops
until we've submitted write requests for all extents within the requested range.
The sequence of submission consists of :
1. btrfs_get_extent - to get the newly created extents by btrfs_run_delalloc_range
and use that extent information to calculate the amount that needs
to be written.
2. Once all of this is done, submit_extent_page is called. This function
builds the require 'bio' structs and sets endio routine to
end_bio_extent_writepage.
<TODO: Discuss btrfs_writepage_cow_fixup and
btrfs_writepage_endio_finish_ordered>
End IO path
-----------
Upon completion of data write end_bio_extent_writepage end_io call back is
called. This function calls end_extent_writepage for every segment (page) in
this bio. end_extent_writepage is responsible for performing any fs-specific
tasks in order to complete the io. This is implemented mainly via the
writepage_end_io_hook, which is really btrfs_writepage_end_io_hook. This
function ensures that the appropriate metadata is persisted on disk. First
btrfs_dec_test_ordered_pending is called, its purpose is to search the
tree of ordered extents (added in "Page writeback" section) and subtract the
amount of io this write completed. Then based on the whether all the pending
io bytes are completed (btrfs_ordered_extent->bytes_left == 0) this function
returns 1 (all io completed) or 0 (io still pending).
So the reason why we need ordered extents is that if we want to write something
in the range of 2-4mb we could potentially queue multiple bios to fulfill
this request. Upon every bio completion the generic code will eventually end up
calling btrfs_dec_test_ordered_pending. Most of the time this function will
return 0, since there will be more pending writes. So by returning 0 from
btrfs_dec_test_ordered_pending we stop doing further end io processing and
short-circuit btrfs_writepage_end_io_hook execution.
Alternatively, when the current io completion is the last for this ordered
extent the final job is queued for asynchronous execution. This entails
executing finish_ordered_fn/btrfs_finish_ordered_io. The last function is the
one which does all the necessary changes to the btree in order to bring the
system to a consistent state and "publish" the newly written data. The 2 most
important things that this function performs is calling either
btrfs_mark_extent_written (in case we just wrote to a PREALLOC extent) or
calling insert_reserved_file_extent. And adding the necessary checksums to the
csum tree via add_pending_csums.
It's important to note that the end io context is already asynchronous and
arbitrary and initially it might seems queueing yet another asynchornous job
from there is redundant. The matter of fact is it's not, this is required since
modifying the btree is a pretty heavy operation and in order to avoid deadlocks
as well as not block further ios from being completed we need to do this
processing from a different thread than the end io routine.