rfc9766v1.txt | rfc9766.txt | |||
---|---|---|---|---|
Internet Engineering Task Force (IETF) T. Haynes | Internet Engineering Task Force (IETF) T. Haynes | |||
Request for Comments: 9766 T. Myklebust | Request for Comments: 9766 T. Myklebust | |||
Category: Standards Track Hammerspace | Category: Standards Track Hammerspace | |||
ISSN: 2070-1721 February 2025 | ISSN: 2070-1721 April 2025 | |||
Addition of LAYOUT_WCC to NFSv4.2's Flexible File Layout Type | Extensions for Weak Cache Consistency in NFSv4.2's Flexible File Layout | |||
Abstract | Abstract | |||
This document specifies extensions to the Parallel Network File | This document specifies extensions to NFSv4.2 for improving Weak | |||
System (NFS) version 4 (pNFS) for improving write cache consistency. | Cache Consistency (WCC). These extensions introduce mechanisms that | |||
These extensions introduce mechanisms that ensure partial writes | ensure partial writes performed under a Parallel NFS (pNFS) layout | |||
performed under a pNFS layout remain coherent and correctly tracked. | remain coherent and correctly tracked. The solution addresses | |||
The solution addresses concurrency and data integrity concerns that | concurrency and data integrity concerns that may arise when multiple | |||
may arise when multiple clients write to the same file through | clients write to the same file through separate data servers. By | |||
separate data servers. By defining additional interactions among | defining additional interactions among clients, metadata servers, and | |||
clients, metadata servers, and data servers, this specification | data servers, this specification enhances the reliability of NFSv4 in | |||
enhances the reliability of NFSv4 in parallel-access environments and | parallel-access environments and ensures consistency across diverse | |||
ensures consistency across diverse deployment scenarios. | deployment scenarios. | |||
Status of This Memo | Status of This Memo | |||
This is an Internet Standards Track document. | This is an Internet Standards Track document. | |||
This document is a product of the Internet Engineering Task Force | This document is a product of the Internet Engineering Task Force | |||
(IETF). It represents the consensus of the IETF community. It has | (IETF). It represents the consensus of the IETF community. It has | |||
received public review and has been approved for publication by the | received public review and has been approved for publication by the | |||
Internet Engineering Steering Group (IESG). Further information on | Internet Engineering Steering Group (IESG). Further information on | |||
Internet Standards is available in Section 2 of RFC 7841. | Internet Standards is available in Section 2 of RFC 7841. | |||
skipping to change at line 78 | skipping to change at line 78 | |||
5. Security Considerations | 5. Security Considerations | |||
6. IANA Considerations | 6. IANA Considerations | |||
7. References | 7. References | |||
7.1. Normative References | 7.1. Normative References | |||
7.2. Informative References | 7.2. Informative References | |||
Acknowledgments | Acknowledgments | |||
Authors' Addresses | Authors' Addresses | |||
1. Introduction | 1. Introduction | |||
In the Network File System version 4 (NFSv4) with a Parallel NFS | In the Parallel NFS (pNFS) flexible file layout (see [RFC8435]), | |||
(pNFS) flexible file layout (see Section 12 of [RFC8435]) server, | ||||
there is no mechanism for the data servers to update the metadata | there is no mechanism for the data servers to update the metadata | |||
servers when the data portion of the file is modified. The metadata | servers when the data portion of the file is modified. The metadata | |||
server needs this knowledge to correspondingly update the metadata | server needs this knowledge to correspondingly update the metadata | |||
portion of the file. If the client is using NFSv3 as the protocol | portion of the file. If the client is using NFSv3 as the protocol | |||
with the data server, it can leverage Weak Cache Consistency (WCC) to | with the data server, it can leverage Weak Cache Consistency (WCC) to | |||
update the metadata server of the attribute changes. In this | update the metadata server of the attribute changes. In this | |||
document, we introduce a new operation called LAYOUT_WCC to NFSv4.2, | document, we introduce a new operation called LAYOUT_WCC to NFSv4.2, | |||
which allows the client to periodically report the attributes of the | which allows the client to periodically report the attributes of the | |||
data files to the metadata server. | data files to the metadata server. | |||
skipping to change at line 121 | skipping to change at line 120 | |||
metadata server (MDS): the pNFS server that provides metadata | metadata server (MDS): the pNFS server that provides metadata | |||
information for a file system object. | information for a file system object. | |||
storage device: the target to which clients may direct I/O requests | storage device: the target to which clients may direct I/O requests | |||
when they hold an appropriate layout. Note that each data server | when they hold an appropriate layout. Note that each data server | |||
is a storage device but that some storage devices are not data | is a storage device but that some storage devices are not data | |||
servers. (See Section 2.1 of [RFC8434] for a discussion on the | servers. (See Section 2.1 of [RFC8434] for a discussion on the | |||
difference between a data server and a storage device.) | difference between a data server and a storage device.) | |||
weak cache consistency (WCC): In NFSv3, WCC allows the client to | weak cache consistency (WCC): the mechanism in NFSv3 that allows the | |||
check for file attribute changes before and after an operation | client to check for file attribute changes before and after an | |||
(see Section 2.6 of [RFC1813]). | operation (see Section 2.6 of [RFC1813]). | |||
1.2. Requirements Language | 1.2. Requirements Language | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
"OPTIONAL" in this document are to be interpreted as described in | "OPTIONAL" in this document are to be interpreted as described in | |||
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
capitals, as shown here. | capitals, as shown here. | |||
2. Weak Cache Consistency (WCC) | 2. Weak Cache Consistency (WCC) | |||
A pNFS layout type enables the metadata server to inform the client | A pNFS layout type enables the metadata server to inform the client | |||
of both the storage protocol and the locations of the data that the | of both the storage protocol and the locations of the data that the | |||
client should use when communicating with the storage devices. The | client should use when communicating with the storage devices. The | |||
flexible file layout type, as specified in [RFC8435], describes how | flexible file layout type, as specified in [RFC8435], describes how | |||
data servers using NFSv3 can be accessed. The client is restricted | data servers using NFSv3 can be accessed. The client is restricted | |||
to performing the following NFSv3 operations on the filehandles | to performing the following NFSv3 operations on the filehandles | |||
provided in the layout: READ (Section 3.3.6 of [RFC1813]), WRITE | provided in the layout: READ, WRITE, and COMMIT (see Sections 3.3.6, | |||
(Section 3.3.7 of [RFC1813]), and COMMIT (Section 3.3.21 of | 3.3.7, and 3.3.21 of [RFC1813], respectively). In other words, the | |||
[RFC1813]). In other words, the client may only use NFSv3 operations | client may only use NFSv3 operations that act directly on the data | |||
that act directly on the data portion of the file. | portion of the file. | |||
Because there is no control protocol (see [RFC8434]) possible with | Because there is no control protocol (see [RFC8434]) possible with | |||
all data servers, NFSv3 is used as the control protocol. As such, | all data servers, NFSv3 is used as the control protocol. As such, | |||
the following NFSv3 operations are commonly used by the metadata | the following NFSv3 operations are commonly used by the metadata | |||
server: CREATE (see Section 3.3.8 of [RFC1813]), GETATTR (see | server: CREATE, GETATTR, and SETATTR (see Sections 3.3.8, 3.3.1, and | |||
Section 3.3.1 of [RFC1813]), and SETATTR (see Section 3.3.2 of | 3.3.2 of [RFC1813], respectively). That is, the metadata server is | |||
[RFC1813]). That is, the metadata server is only allowed to use | only allowed to use NFSv3 operations that directly act on the | |||
NFSv3 operations that directly act on the metadata portion of the | metadata portion of the data file. GETATTR allows the metadata | |||
data file. GETATTR allows the metadata server to mainly retrieve the | server to mainly retrieve the mtime (modify time), ctime (change | |||
mtime (modify time), ctime (change time), and atime (access time). | time), and atime (access time). The metadata server can use this | |||
The metadata server can use this information to determine if the | information to determine if the client modified the file whilst it | |||
client modified the file whilst it held an iomode of LAYOUTIOMODE4_RW | held an iomode of LAYOUTIOMODE4_RW (see Section 3.3.20 of [RFC8881]). | |||
(see Section 3.3.20 of [RFC8881]). Then it can determine the | Then it can determine the following for the metadata file: | |||
following for the metadata file: time_modify (see Section 5.8.2.43 of | time_modify, time_metadata, and time_access (see Sections 5.8.2.43, | |||
[RFC8881]), time_metadata (see Section 5.8.2.42 of [RFC8881]), and | 5.8.2.42, and 5.8.2.37 of [RFC8881], respectively). That is, it can | |||
time_access (see Section 5.8.2.37 of [RFC8881]). That is, it can | ||||
determine the information to return to clients in an NFSv4.2 GETATTR | determine the information to return to clients in an NFSv4.2 GETATTR | |||
response. | response. | |||
For example, the metadata server might issue an NFSv3 GETATTR | For example, the metadata server might issue an NFSv3 GETATTR | |||
operation to the data server, which is typically triggered by a | operation to the data server, which is typically triggered by a | |||
client's NFSv4 GETATTR request to the metadata server. In addition | client's NFSv4 GETATTR request to the metadata server. In addition | |||
to the cost of each individual GETATTR operation, the data server can | to the cost of each individual GETATTR operation, the data server can | |||
be overwhelmed by a large volume of such requests. NFSv3 addressed a | be overwhelmed by a large volume of such requests. NFSv3 addressed a | |||
similar challenge by including a post-operation attribute in the READ | similar challenge by including a post-operation attribute in the READ | |||
and WRITE operations to report WCC data (see Section 2.6 of | and WRITE operations to report WCC data (see Section 2.6 of | |||
[RFC1813]). | [RFC1813]). | |||
Each NFSv3 operation entails a single round trip between the client | Each NFSv3 operation entails a single round trip between the client | |||
and server. Consequently, issuing a WRITE followed by a GETATTR | and server. Consequently, issuing a WRITE followed by a GETATTR | |||
would require two round trips. In that situation, the retrieved | would require two round trips. In that situation, the retrieved | |||
attribute information is regarded as strict server-client | attribute information is regarded as having strict server-client | |||
consistency. By contrast, NFSv4 enables a WRITE and GETATTR to be | consistency. By contrast, NFSv4 enables a WRITE and GETATTR to be | |||
combined within a compound operation, which requires only one round | combined within a compound operation, which requires only one round | |||
trip. This combined approach is likewise considered strict server- | trip. This combined approach is likewise considered to have strict | |||
client consistency. Essentially, NFSv4 READ and WRITE operations | server-client consistency. Essentially, NFSv4 READ and WRITE | |||
omit post-operation attributes, allowing the client to determine | operations omit post-operation attributes, allowing the client to | |||
whether it requires that information. | determine whether it requires that information. | |||
Whilst NFSv4 got rid of the requirement for WCC information to be | Whilst NFSv4 got rid of the requirement for WCC information to be | |||
supplied by the WRITE or READ operations, the introduction of pNFS | supplied by the WRITE or READ operations, the introduction of pNFS | |||
reintroduces the same problem. The metadata server has to | reintroduces the same problem. The metadata server has to | |||
communicate with the data server in order to get the data that could | communicate with the data server in order to get the data that could | |||
be provided by a WCC model. | be provided by a WCC model. | |||
With the flexible file layout type, the client can leverage the NFSv3 | With the flexible file layout type, the client can leverage the NFSv3 | |||
WCC to service the proxying of times (see Section 5 of [RFC9754]), | WCC to service the proxying of times (see Section 5 of [RFC9754]), | |||
but the granularity of this data is limited. With client-side | but the granularity of this data is limited. With client-side | |||
skipping to change at line 290 | skipping to change at line 288 | |||
- time_modify (see Section 5.8.2.43 of [RFC8881]) | - time_modify (see Section 5.8.2.43 of [RFC8881]) | |||
* Whenever it sends an NFS4ERR_ACCESS error via LAYOUTRETURN or | * Whenever it sends an NFS4ERR_ACCESS error via LAYOUTRETURN or | |||
LAYOUTERROR. It could have already gotten the NFSv3 uid and gid | LAYOUTERROR. It could have already gotten the NFSv3 uid and gid | |||
values back in the WCC of the WRITE, READ, or COMMIT operation | values back in the WCC of the WRITE, READ, or COMMIT operation | |||
that got the error. Thus, it could report that information back | that got the error. Thus, it could report that information back | |||
to the metadata server, saving it from querying that information | to the metadata server, saving it from querying that information | |||
via an NFSv3 GETATTR. | via an NFSv3 GETATTR. | |||
* Whenever it sends a SETATTR to refresh the proxied times (see | * Whenever it sends a SETATTR to refresh the proxied times (see | |||
Section 5 of [RFC9754]). The metadata server is going to want to | Section 5 of [RFC9754]). The metadata server will correlate these | |||
correlate these times in order to detect later modification to the | times in order to detect later modification to the data file. | |||
data file. | ||||
3.4.2. Examples of What to Send in LAYOUT_WCC | 3.4.2. Examples of What to Send in LAYOUT_WCC | |||
The NFSv3 attributes returned in the WCC of WRITE, READ, and COMMIT | The NFSv3 attributes returned in the WCC of WRITE, READ, and COMMIT | |||
operations are a smaller subset of what can be transmitted as an | operations are a smaller subset of what can be transmitted as an | |||
NFSv4 attribute. The mapping of NFSv3 to NFSv4 attributes is shown | NFSv4 attribute. The mapping of NFSv3 to NFSv4 attributes is shown | |||
in Table 1. The LAYOUT_WCC MUST provide all of these attributes to | in Table 1. The LAYOUT_WCC MUST provide all of these attributes to | |||
the metadata server. Both the uid and gid are stringified into their | the metadata server. Both the uid and gid are stringified into their | |||
respective attributes of owner and owner_group. In the case of | respective attributes of owner and owner_group. In the case of | |||
NFS4ERR_ACCESS, the reason to provide these two attributes is that | NFS4ERR_ACCESS, the reason to provide these two attributes is that | |||
skipping to change at line 416 | skipping to change at line 413 | |||
attributes present. Or it could decide to present only the two | attributes present. Or it could decide to present only the two | |||
mirrors that had been changed. | mirrors that had been changed. | |||
In either case, the combination of ffdsw_deviceid, ffdsw_stateid, and | In either case, the combination of ffdsw_deviceid, ffdsw_stateid, and | |||
ffdsw_fh_vers will uniquely identify the attributes to be updated. | ffdsw_fh_vers will uniquely identify the attributes to be updated. | |||
All three arguments are required. A layout might have multiple data | All three arguments are required. A layout might have multiple data | |||
files on the same storage device, in which case the ffdsw_deviceid | files on the same storage device, in which case the ffdsw_deviceid | |||
and ffdsw_stateid would match, but the ffdsw_fh_vers would not. | and ffdsw_stateid would match, but the ffdsw_fh_vers would not. | |||
The ffdsw_attributes are processed similar to the obj_attributes in | The ffdsw_attributes are processed similar to the obj_attributes in | |||
the SETATTR arguments (see Section 18.34 of [RFC8881]). | the SETATTR arguments (see Section 18.30 of [RFC8881]). | |||
4. Extraction of XDR | 4. Extraction of XDR | |||
This document contains the XDR [RFC4506] description of the new open | This document contains the XDR [RFC4506] description of the new | |||
flags for delegating the file to the client. The XDR description is | NFSv4.2 operation LAYOUT_WCC. The XDR description is embedded in | |||
embedded in this document in a way that makes it simple for the | this document in a way that makes it simple for the reader to extract | |||
reader to extract into a ready-to-compile form. The reader can feed | into a ready-to-compile form. The reader can feed this document into | |||
this document into the following shell script to produce the machine- | the following shell script to produce the machine-readable XDR | |||
readable XDR description of the new flags: | description of the new NFSv4.2 operation LAYOUT_WCC. | |||
<CODE BEGINS> | <CODE BEGINS> | |||
#!/bin/sh | #!/bin/sh | |||
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' | grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' | |||
<CODE ENDS> | <CODE ENDS> | |||
That is, if the above script is stored in a file called 'extract.sh', | That is, if the above script is stored in a file called 'extract.sh', | |||
and this document is in a file called 'spec.txt', then the reader can | and this document is in a file called 'spec.txt', then the reader can | |||
do: | do: | |||
<CODE BEGINS> | <CODE BEGINS> | |||
sh extract.sh < spec.txt > layout_wcc.x | sh extract.sh < spec.txt > layout_wcc.x | |||
<CODE ENDS> | <CODE ENDS> | |||
The effect of the script is to remove leading white space from each | The effect of the script is to remove leading blank space from each | |||
line, plus a sentinel sequence of '///'. XDR descriptions with the | line, plus a sentinel sequence of '///'. XDR descriptions with the | |||
sentinel sequence are embedded throughout the document. | sentinel sequence are embedded throughout the document. | |||
Note that the XDR code contained in this document depends on types | Note that the XDR code contained in this document depends on types | |||
from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]). This | from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]). This | |||
includes both nfs types that end with a 4 (such as offset4 and | includes both nfs types that end with a 4 (such as offset4 and | |||
length4) as well as more generic types (such as uint32_t and | length4) as well as more generic types (such as uint32_t and | |||
uint64_t). | uint64_t). | |||
While the XDR can be appended to that from [RFC7863], the various | While the XDR can be appended to that from [RFC7863], the various | |||
End of changes. 13 change blocks. | ||||
49 lines changed or deleted | 46 lines changed or added | |||
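
The extraction pipeline quoted in Section 4 (identical in both versions of the document) can be tried on a small input to see what it keeps and what it drops. The following is a minimal sketch, not part of either RFC text: the function name `extract_xdr`, the type `example4`, and the field `ex_value` are hypothetical and exist only for illustration. The sketch also quotes its arguments with `"$@"` where the original script uses an unquoted `$*`, and it reads from standard input when no file is given.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline from the Section 4 script.
# Step 1: keep only lines that begin with optional blanks and '///'.
# Step 2: strip the sentinel plus the single space that follows it.
# Step 3: turn a bare sentinel line into an empty line.
extract_xdr() {
    grep '^ *///' "$@" | sed 's?^ */// ??' | sed 's?^ *///$??'
}

# Illustrative input (assumed, not taken from the RFC): three sentinel
# lines carrying XDR text, one bare sentinel, and one line of prose.
printf '%s\n' \
    '   /// struct example4 {' \
    '   ///         uint32_t  ex_value;' \
    '   /// };' \
    '   ///' \
    'This prose line carries no sentinel and is dropped.' \
    | extract_xdr
```

Run on this input, the pipeline emits the three XDR lines with the sentinel removed and the inner indentation preserved, followed by one empty line for the bare sentinel; the prose line does not appear in the output.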