add documentation of v5 fields

Document the new fields and data structures added in XFS v5 filesystems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
author: Darrick J. Wong <darrick.wong@oracle.com> 2016-01-05 10:31:41 +1100
committer: Dave Chinner <david@fromorbit.com> 2016-01-05 10:31:41 +1100
commit: 372d53f774091bba060f811ea57bd8ea4d775c51
tree: 7750a256b4594831cb84c07bfb82298ee9fd56e0
parent: 13be7859678f0ea1c7bc9485127a3f3c20a5d6b9
8 files changed, 680 insertions, 33 deletions
diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 9ba26c2..5f091df 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -40,8 +40,6 @@ superblock is one sector in length.
 The superblock is defined by the following structure. The description of each
 field follows.
 
-TODO: update for v5 formats.
-
 [source, c]
 ----
 struct xfs_sb
@@ -91,6 +89,20 @@ struct xfs_sb
 	__uint16_t		sb_logsectsize;
 	__uint32_t		sb_logsunit;
 	__uint32_t		sb_features2;
+	__uint32_t		sb_bad_features2;
+
+	/* version 5 superblock fields start here */
+	__uint32_t		sb_features_compat;
+	__uint32_t		sb_features_ro_compat;
+	__uint32_t		sb_features_incompat;
+	__uint32_t		sb_features_log_incompat;
+
+	__uint32_t		sb_crc;
+	xfs_extlen_t		sb_spino_align;
+
+	xfs_ino_t		sb_pquotino;
+	xfs_lsn_t		sb_lsn;
+	uuid_t			sb_meta_uuid;
 };
 ----
 *sb_magicnum*::
@@ -152,8 +164,8 @@ Number of blocks for the journaling log.
 Filesystem version number. This is a bitmask specifying the features enabled
 when creating the filesystem. Any disk checking tools or drivers that do not
 recognize any set bits must not operate upon the filesystem. Most of the flags
-indicate features introduced over time. If the value of the lower nibble is 4,
-the higher bits indicate feature flags as follows:
+indicate features introduced over time. If the value of the lower nibble is >=
+4, the higher bits indicate feature flags as follows:
 
 .Version 4 Superblock version flags
 [options="header"]
@@ -178,13 +190,17 @@ Version 2 directories are used. This is always set.
 Set if the sb_features2 field in the superblock contains more flags.
 |=====
 
+If the lower nibble of this value is 5, then this is a v5 filesystem; the
++XFS_SB_VERSION2_CRCBIT+ feature must be set in +sb_features2+.
+
 *sb_sectsize*::
 Specifies the underlying disk sector size in bytes.  Typically this is 512 or
 4096 bytes. This determines the minimum I/O alignment, especially for direct I/O.
 
 *sb_inodesize*::
 Size of the inode in bytes. The default is 256 (2 inodes per standard sector)
-but can be made as large as 2048 bytes when creating the filesystem.
+but can be made as large as 2048 bytes when creating the filesystem.  On a v5
+filesystem, the default and minimum inode size are both 512 bytes.
 
 *sb_inopblock*::
 Number of inodes per block. This is equivalent to +sb_blocksize / sb_inodesize+.
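
The +sb_versionnum+ rule in this hunk can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper names are hypothetical, the constant values are believed to match the kernel's on-disk format header but should be verified there, and both arguments are assumed to be already converted to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check them against the kernel's on-disk format header. */
#define XFS_SB_VERSION_NUMBITS  0x000f      /* low nibble holds the version */
#define XFS_SB_VERSION_5        5
#define XFS_SB_VERSION2_CRCBIT  0x00000100  /* metadata checksumming */

/* Hypothetical helper: true if the fields describe a v5 filesystem. */
bool sb_is_v5(uint16_t sb_versionnum, uint32_t sb_features2)
{
	if ((sb_versionnum & XFS_SB_VERSION_NUMBITS) != XFS_SB_VERSION_5)
		return false;
	/* Per the text, the CRC feature must be set on a v5 filesystem. */
	return (sb_features2 & XFS_SB_VERSION2_CRCBIT) != 0;
}

/*
 * Hypothetical consistency check: the CRC bit "must be and can only be"
 * set when the low nibble of sb_versionnum is 5.
 */
bool sb_crc_bit_consistent(uint16_t sb_versionnum, uint32_t sb_features2)
{
	bool v5 = (sb_versionnum & XFS_SB_VERSION_NUMBITS) == XFS_SB_VERSION_5;
	bool crc = (sb_features2 & XFS_SB_VERSION2_CRCBIT) != 0;

	return v5 == crc;
}
```

A checking tool would typically treat a failed consistency check as superblock corruption rather than guessing which of the two fields is correct.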
@@ -273,7 +289,11 @@ Miscellaneous flags.
 Reserved and must be zero (``vn'' stands for version number).
 
 *sb_inoalignmt*::
-Inode chunk alignment in fsblocks.
+Inode chunk alignment in fsblocks.  Prior to v5, the default value gave inode
+chunks an 8KiB alignment.  Starting with v5, the default alignment is 8KiB
+multiplied by the ratio of the inode size to 256 bytes.  Concretely, this
+means an alignment of 16KiB for 512-byte inodes, 32KiB for 1024-byte inodes,
+and so on.
 
 *sb_unit*::
 Underlying stripe or raid unit in blocks.
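
The default +sb_inoalignmt+ scaling described in this hunk amounts to simple arithmetic. The sketch below uses hypothetical helper names and assumes the inode size is a power-of-two multiple of 256 bytes; it reproduces the figures quoted in the text and is not taken from mkfs.

```c
#include <stdint.h>

/* Hypothetical helper: default inode chunk alignment in bytes. */
uint32_t default_inode_align_bytes(int is_v5, uint32_t inodesize)
{
	const uint32_t base = 8 * 1024;	/* 8KiB default before v5 */

	if (!is_v5)
		return base;
	/* v5: scale by how many times larger the inode is than 256 bytes. */
	return base * (inodesize / 256);
}

/*
 * Hypothetical helper: the same alignment expressed in filesystem blocks,
 * which is the unit sb_inoalignmt uses.  Yields 0 if a block is larger
 * than the alignment.
 */
uint32_t default_inoalignmt(int is_v5, uint32_t inodesize, uint32_t blocksize)
{
	return default_inode_align_bytes(is_v5, inodesize) / blocksize;
}
```

For example, with 4096-byte blocks and 512-byte inodes on v5, the byte alignment is 16KiB, which is 4 filesystem blocks.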
@@ -324,12 +344,81 @@ its parent inode. The primary purpose for this information is in backup systems.
 can be used to enforce disk space usage quotas for a particular group of
 directories.  This flag indicates that project IDs can be 32 bits in size.
 
+| +XFS_SB_VERSION2_CRCBIT+	|
+Metadata checksumming.  All metadata blocks have an extended header containing
+the block checksum, a copy of the metadata UUID, the log sequence number of the
+last update to prevent stale replays, and a back pointer to the owner of the
+block.  This feature must be and can only be set if the lowest nibble of
++sb_versionnum+ is set to 5.
+
 | +XFS_SB_VERSION2_FTYPE+	|
 Directory file type.  Each directory entry records the type of the inode to
 which the entry points.  This speeds up directory iteration by removing the
 need to load every inode into memory.
 |=====
 
+*sb_bad_features2*::
+This field mirrors +sb_features2+, due to past 64-bit alignment errors.
+
+*sb_features_compat*::
+Read-write compatible feature flags.  The kernel can still read and write this
+FS even if it doesn't understand the flag.  Currently, there are no valid
+flags.
+
+*sb_features_ro_compat*::
+Read-only compatible feature flags.  The kernel can still read this FS even if
+it doesn't understand the flag.
+
+.Extended Version 5 Superblock Read-Only compatibility flags
+[options="header"]
+|=====
+| Flag				| Description
+| +XFS_SB_FEAT_RO_COMPAT_FINOBT+ |
+Free inode B+tree.  Each allocation group contains a B+tree to track inode chunks
+containing free inodes.  This is a performance optimization to reduce the time
+required to allocate inodes.
+|=====
+
+*sb_features_incompat*::
+Read-write incompatible feature flags.  The kernel cannot read or write this
+FS if it doesn't understand the flag.
+
+.Extended Version 5 Superblock Read-Write incompatibility flags
+[options="header"]
+|=====
+| Flag				| Description
+| +XFS_SB_FEAT_INCOMPAT_FTYPE+ |
+Directory file type.  Each directory entry tracks the type of the inode to
+which the entry points.  This is a performance optimization to remove the need
+to load every inode into memory to iterate a directory.
+
+| +XFS_SB_FEAT_INCOMPAT_META_UUID+ |
+Metadata UUID.  The UUID stamped into each metadata block must match the value
+in +sb_meta_uuid+.  This enables the administrator to change +sb_uuid+ at will
+without having to rewrite the entire filesystem.
+|=====
+
+*sb_features_log_incompat*::
+Read-write incompatible feature flags for the log.  The kernel cannot read or
+write this FS log if it doesn't understand the flag.  Currently, no flags are
+defined.
+
+*sb_crc*::
+Superblock checksum.
+
+*sb_spino_align*::
+Sparse inode alignment.
+
+*sb_pquotino*::
+Project quota inode.
+
+*sb_lsn*::
+Log sequence number of the last superblock update.
+
+*sb_meta_uuid*::
+If the +XFS_SB_FEAT_INCOMPAT_META_UUID+ feature is set, then the UUID field in
+all metadata blocks must match this UUID.  If not, the block header UUID field
+must match +sb_uuid+.
 
 === xfs_db Superblock Example
 
@@ -405,7 +494,7 @@ features2 = 8
 
 The XFS filesystem tracks free space in an allocation group using two B+trees.
 One B+tree tracks space by block number, the second by the size of the free
-space block. This scheme allows XFS to quickly find free space near a given
+space block. This scheme allows XFS to quickly locate free space near a given
 block or of a given size.
 
 All block numbers, indexes, and counts are AG relative.
@@ -434,6 +523,15 @@ struct xfs_agf {
      __be32              agf_freeblks;
      __be32              agf_longest;
      __be32              agf_btreeblks;
+
+     /* version 5 filesystem fields start here */
+     uuid_t              agf_uuid;
+     __be64              agf_spare64[16];
+
+     /* unlogged fields, written during buffer writeback. */
+     __be64              agf_lsn;
+     __be32              agf_crc;
+     __be32              agf_spare2;
 };
 ----
 
@@ -483,6 +581,22 @@ Specifies the number of blocks of longest contiguous free space in the AG.
 Specifies the number of blocks used for the free space B+trees. This is only
 used if the +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ bit is set in +sb_features2+.
 
+*agf_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*agf_spare64*::
+Empty space in the logged part of the AGF sector, reserved for future features.
+
+*agf_lsn*::
+Log sequence number of the last AGF write.
+
+*agf_crc*::
+Checksum of the AGF sector.
+
+*agf_spare2*::
+Empty space in the unlogged part of the AGF sector.
+
 [[Short_Format_Btrees]]
 === Short Format B+trees
 
@@ -499,6 +613,13 @@ struct xfs_btree_sblock {
      __be16                    bb_numrecs;
      __be32                    bb_leftsib;
      __be32                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be32                    bb_owner;
+     __le32                    bb_crc;
 };
 ----
 
@@ -519,6 +640,22 @@ AG block number of the left sibling of this B+tree node.
 *bb_rightsib*::
 AG block number of the right sibling of this B+tree node.
 
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
 [[AG_Free_Space_Btrees]]
 === AG Free Space B+trees
 
@@ -553,7 +690,9 @@ typedef __be32 xfs_alloc_ptr_t;
 * As the free space tracking is AG relative, all the block numbers are only
 32-bits.
 * The +bb_magic+ value depends on the B+tree: ``ABTB'' (0x41425442) for the block
-offset B+tree, ``ABTC'' (0x41425443) for the block count B+tree.
+offset B+tree, ``ABTC'' (0x41425443) for the block count B+tree.  On a v5
+filesystem, these are ``AB3B'' (0x41423342) and ``AB3C'' (0x41423343),
+respectively.
 * The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well
 as the leaves.
 * For a typical 4KB filesystem block size, the offset for the +xfs_alloc_ptr_t+
@@ -595,6 +734,38 @@ Active elements in the array are specified by the
 xref:AG_Free_Space_Block[AGF's] +agf_flfirst+, +agf_fllast+ and +agf_flcount+
 values. The array is managed as a circular list.
 
+On a v5 filesystem, the following header precedes the free list entries:
+
+[source, c]
+----
+struct xfs_agfl {
+     __be32              agfl_magicnum;
+     __be32              agfl_seqno;
+     uuid_t              agfl_uuid;
+     __be64              agfl_lsn;
+     __be32              agfl_crc;
+};
+----
+
+*agfl_magicnum*::
+Specifies the magic number for the AGFL sector: "XAFL" (0x5841464c).
+
+*agfl_seqno*::
+Specifies the AG number for the sector.
+
+*agfl_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*agfl_lsn*::
+Log sequence number of the last AGFL write.
+
+*agfl_crc*::
+Checksum of the AGFL sector.
+
+On a v4 filesystem there is no header; the array of free block numbers begins
+at the beginning of the sector.
+
 .AG Free List layout
 image::images/16.png[]
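
The magic numbers in this document are four ASCII characters stored as a big-endian 32-bit value, so "XAFL" reads back as 0x5841464c. The sketch below shows one way to decode and compare such a value from a raw sector buffer. The helper names are hypothetical. A real implementation would most likely decide whether the AGFL header is present from the superblock version rather than from the sector contents, since on a v4 filesystem the first four bytes are a free block number.

```c
#include <stdint.h>

#define XFS_AGFL_MAGIC 0x5841464cu	/* "XAFL" */

/* Read a big-endian 32-bit value from a raw byte buffer. */
uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: does this sector begin with the v5 AGFL magic? */
int agfl_has_v5_header(const unsigned char *sector)
{
	return get_be32(sector) == XFS_AGFL_MAGIC;
}
```

The same decoding applies to the other four-character magic numbers listed here, such as ``AB3B'' and ``IAB3''.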
 
@@ -739,6 +910,18 @@ struct xfs_agi {
      __be32              agi_newino;
      __be32              agi_dirino;
      __be32              agi_unlinked[64];
+
+     /*
+      * v5 filesystem fields start here; this marks the end of logging region 1
+      * and start of logging region 2.
+      */
+     uuid_t              agi_uuid;
+     __be32              agi_crc;
+     __be32              agi_pad32;
+     __be64              agi_lsn;
+
+     __be32              agi_free_root;
+     __be32              agi_free_level;
 }
 ----
 *agi_magicnum*::
@@ -775,19 +958,45 @@ Deprecated and not used, this is always set to NULL (-1).
 Hash table of unlinked (deleted) inodes that are still being referenced. Refer
 to xref:Unlinked_Pointer[unlinked list pointers] for more information.
 
+*agi_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*agi_crc*::
+Checksum of the AGI sector.
+
+*agi_pad32*::
+Padding field, otherwise unused.
+
+*agi_lsn*::
+Log sequence number of the last write to this block.
+
+*agi_free_root*::
+Specifies the block number in the AG containing the root of the free inode
+B+tree.
+
+*agi_free_level*::
+Specifies the number of levels in the free inode B+tree.
 
 [[Inode_Btrees]]
 == Inode B+trees
 
 Inodes are allocated in chunks of 64, and a B+tree is used to track these chunks
 of inodes as they are allocated and freed. The block containing root of the
-B+tree is defined by the AGI's +agi_root+ value.
+B+tree is defined by the AGI's +agi_root+ value.  If the
++XFS_SB_FEAT_RO_COMPAT_FINOBT+ feature is enabled, a second B+tree is used to
+track the chunks containing free inodes; this is an optimization to speed up
+inode allocation.
 
 The B+tree header for the nodes and leaves use the +xfs_btree_sblock+ structure
 which is the same as the header used in the xref:AG_Free_Space_Btrees[AGF
 B+trees].
 
-The magic number of the inode B+tree is ``IABT'' (0x49414254).
+The magic number of the inode B+tree is ``IABT'' (0x49414254).  On a v5
+filesystem, the magic number is ``IAB3'' (0x49414233).
+
+The magic number of the free inode B+tree is ``FIBT'' (0x46494254).  On a v5
+filesystem, the magic number is ``FIB3'' (0x46494233).
 
 Leaves contain an array of the following structure:
 
diff --git a/design/XFS_Filesystem_Structure/data_extents.asciidoc b/design/XFS_Filesystem_Structure/data_extents.asciidoc
index af9ba44..a39045d 100644
--- a/design/XFS_Filesystem_Structure/data_extents.asciidoc
+++ b/design/XFS_Filesystem_Structure/data_extents.asciidoc
@@ -94,8 +94,9 @@ image::images/32.png[]
 
 The number of extents that can fit in the inode depends on the inode size and
 +di_forkoff+. For a default 256 byte inode with no extended attributes, a file
-can have up to 9 extents with this format. Beyond this, extents have to use the
-B+tree format.
+can have up to 9 extents with this format.  On a default v5 filesystem with 512
+byte inodes, a file can have up to 21 extents with this format.  Beyond that,
+extents have to use the B+tree format.
 
 === xfs_db Inode Data Fork Extents Example
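
The extent counts quoted in this hunk (9 for a 256-byte inode, 21 for a 512-byte v5 inode) follow from dividing the data fork size by the size of one extent record. The sketch below is illustrative only. It assumes a 16-byte packed extent record, literal area offsets of 100 and 176 bytes, and that a non-zero +di_forkoff+ gives the data fork size in units of 8 bytes; the function name is hypothetical.

```c
#include <stdint.h>

#define EXTENT_REC_BYTES  16u	/* one packed 128-bit extent record */
#define LITERAL_START_V2  100u	/* literal area offset, v1/v2 inode */
#define LITERAL_START_V3  176u	/* literal area offset, v3 inode */

/*
 * Hypothetical helper: how many extent records fit in the data fork.
 * di_forkoff == 0 means there is no attribute fork, so the data fork
 * takes the whole literal area.
 */
uint32_t max_inline_extents(uint32_t inodesize, uint32_t literal_start,
			    uint8_t di_forkoff)
{
	uint32_t dfork_bytes = inodesize - literal_start;

	if (di_forkoff != 0)
		dfork_bytes = (uint32_t)di_forkoff * 8;
	return dfork_bytes / EXTENT_REC_BYTES;
}
```

With no attribute fork, a 256-byte inode leaves 156 bytes (9 records) and a 512-byte v3 inode leaves 336 bytes (21 records).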
 
@@ -242,7 +243,7 @@ and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes
 are in use on the inode).
 
 [[Long_Format_Btrees]]
-== Long Format B+trees
+=== Long Format B+trees
 
 The subsequent nodes and leaves of the B+tree use the +xfs_btree_lblock+
 declaration:
@@ -255,11 +256,20 @@ struct xfs_btree_lblock {
      __be16                    bb_numrecs;
      __be64                    bb_leftsib;
      __be64                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be64                    bb_owner;
+     __le32                    bb_crc;
+     __be32                    bb_pad;
 };
 ----
 
 *bb_magic*::
 Specifies the magic number for the BMBT block: ``BMAP'' (0x424d4150).
+On a v5 filesystem, this is ``BMA3'' (0x424d4133).
 
 *bb_level*::
 The level of the tree in which this block is found.  If this value is 0, this
@@ -275,6 +285,25 @@ FS block number of the left sibling of this B+tree node.
 *bb_rightsib*::
 FS block number of the right sibling of this B+tree node.
 
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The inode number that this B+tree block belongs to.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+*bb_pad*::
+Padding to maintain 64-bit alignment.
+
 // force-split the lists
 
 * For intermediate nodes, the data following +xfs_btree_lblock+ is the same as
diff --git a/design/XFS_Filesystem_Structure/directories.asciidoc b/design/XFS_Filesystem_Structure/directories.asciidoc
index b539535..bccf912 100644
--- a/design/XFS_Filesystem_Structure/directories.asciidoc
+++ b/design/XFS_Filesystem_Structure/directories.asciidoc
@@ -358,7 +358,7 @@ typedef struct xfs_dir2_block {
 ----
 
 *hdr*::
-Directory block header.
+Directory block header.  On a v5 filesystem this is +xfs_dir3_data_hdr_t+.
 
 *u*::
 Union of directory and unused entries.
@@ -383,8 +383,62 @@ Magic number for this directory block.
 *bestfree*::
 An array pointing to free regions in the directory block.
 
+On a v5 filesystem, directory and attribute blocks are formatted with v3
+headers, which contain extra data:
+
 [source, c]
 ----
+struct xfs_dir3_blk_hdr {
+     __be32                     magic;
+     __be32                     crc;
+     __be64                     blkno;
+     __be64                     lsn;
+     uuid_t                     uuid;
+     __be64                     owner;
+};
+----
+
+*magic*::
+Magic number for this directory block.
+
+*crc*::
+Checksum of the directory block.
+
+*blkno*::
+Block number of this directory block.
+
+*lsn*::
+Log sequence number of the last write to this block.
+
+*uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*owner*::
+The inode number that this directory block belongs to.
+
+[source, c]
+----
+struct xfs_dir3_data_hdr {
+     struct xfs_dir3_blk_hdr    hdr;
+     xfs_dir2_data_free_t       best_free[XFS_DIR2_DATA_FD_COUNT];
+     __be32                     pad;
+};
+----
+
+*hdr*::
+The v5 directory/attribute block header.
+
+*best_free*::
+An array pointing to free regions in the directory block.
+
+*pad*::
+Padding to maintain a 64-bit alignment.
+
+Within the block, data structures are as follows:
+
+[source, c]
+----
 typedef struct xfs_dir2_data_free {
      xfs_dir2_data_off_t        offset;
      xfs_dir2_data_off_t        length;
@@ -494,7 +548,8 @@ Following is a diagram of how these pieces fit together for a block directory.
 .Block directory layout
 image::images/43.png[]
 
-* The magic number in the header is ``XD2B'' (0x58443242).
+* The magic number in the header is ``XD2B'' (0x58443242), or ``XDB3'' (0x58444233)
+on a v5 filesystem.
 
 * The +tag+ in the +xfs_dir2_data_entry_t+ structure stores its offset from the
 start of the block.
@@ -736,7 +791,7 @@ Currently, this is 32GB and in the extent view, a block offset of
 decimal).
 
 * Blocks with directory entries (``data'' extents) have the magic number ``X2D2''
-(0x58443244).
+(0x58443244), or ``XDD3'' (0x58444433) on a v5 filesystem.
 
 * The ``data'' extents have a new header (no ``leaf'' data):
 
@@ -749,7 +804,7 @@ typedef struct xfs_dir2_data {
 ----
 
 *hdr*::
-Data block header.
+Data block header.  On a v5 filesystem, this field is +struct xfs_dir3_data_hdr+.
 
 *u*::
 Union of directory and unused entries, exactly the same as in a block directory.
@@ -769,7 +824,8 @@ typedef struct xfs_dir2_leaf {
 ----
 
 *hdr*::
-Directory leaf header.
+Directory leaf header.  On a v5 filesystem this is
++struct xfs_dir3_leaf_hdr+.
 
 *ents*::
 Hash values of the entries in this block.
@@ -800,6 +856,28 @@ Number of stale/zeroed leaf entries.
 
 [source, c]
 ----
+struct xfs_dir3_leaf_hdr {
+     struct xfs_da3_blkinfo    info;
+     __uint16_t                count;
+     __uint16_t                stale;
+     __be32                    pad;
+};
+----
+
+*info*::
+Leaf B+tree block header.
+
+*count*::
+Number of leaf entries.
+
+*stale*::
+Number of stale/zeroed leaf entries.
+
+*pad*::
+Padding to maintain alignment rules.
+
+[source, c]
+----
 typedef struct xfs_dir2_leaf_tail {
      __uint32_t                bestcount;
 } xfs_dir2_leaf_tail_t;
@@ -839,7 +917,58 @@ Padding to maintain alignment.
 
 // split lists
 
-* The magic number of the leaf block is +XFS_DIR2_LEAF1_MAGIC+ (0xd2f1).
+* On a v5 filesystem, the leaves use the +struct xfs_da3_blkinfo+ filesystem
+block header. This header is used in the same place as +xfs_da_blkinfo_t+:
+
+[source, c]
+----
+struct xfs_da3_blkinfo {
+     /* these values are inside xfs_da_blkinfo */
+     __be32                     forw;
+     __be32                     back;
+     __be16                     magic;
+     __be16                     pad;
+
+     __be32                     crc;
+     __be64                     blkno;
+     __be64                     lsn;
+     uuid_t                     uuid;
+     __be64                     owner;
+};
+----
+
+*forw*::
+Logical block offset of the next B+tree block at this level.
+
+*back*::
+Logical block offset of the previous B+tree block at this level.
+
+*magic*::
+Magic number for this directory/attribute block.
+
+*pad*::
+Padding to maintain alignment.
+
+*crc*::
+Checksum of the directory/attribute block.
+
+*blkno*::
+Block number of this directory/attribute block.
+
+*lsn*::
+Log sequence number of the last write to this block.
+
+*uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*owner*::
+The inode number that this directory/attribute block belongs to.
+
+// split lists
+
+* The magic number of the leaf block is +XFS_DIR2_LEAF1_MAGIC+ (0xd2f1); on a
+v5 filesystem it is +XFS_DIR3_LEAF1_MAGIC+ (0x3df1).
 
 * The size of the +ents+ array is specified by +hdr.count+.
 
@@ -1107,13 +1236,15 @@ each ``data'' block. This is not possible with more than one leaf.
 
 * After the ``freeindex'' data moves to its own block, it is possible for the
 leaf data to fit within a single leaf block.  This single leaf block has a
-magic number of +XFS_DIR2_LEAFN_MAGIC+ (0xd2ff).
+magic number of +XFS_DIR2_LEAFN_MAGIC+ (0xd2ff), or on a v5 filesystem,
++XFS_DIR3_LEAFN_MAGIC+ (0x3dff).
 
 * The ``leaf'' blocks eventually change into a B+tree with the generic B+tree
 header pointing to directory ``leaves'' as described in
 xref:Leaf_Directories[Leaf Directories]. Blocks with leaf data still have the
 +LEAFN_MAGIC+ magic number as outlined above.  The top-level tree blocks are
-called ``nodes'' and have a magic number of +XFS_DA_NODE_MAGIC+ (0xfebe).
+called ``nodes'' and have a magic number of +XFS_DA_NODE_MAGIC+ (0xfebe), or on
+a v5 filesystem, +XFS_DA3_NODE_MAGIC+ (0x3ebe).
 
 * Distinguishing between a combined leaf/freeindex block (+LEAF1_MAGIC+), a
 leaf-only block (+LEAFN_MAGIC+), and a btree node block (+NODE_MAGIC+) can only
@@ -1161,6 +1292,50 @@ An array specifying the best free counts in each directory data block.
 
 // split lists
 
+* On a v5 filesystem, the freeindex block uses the following structures:
+
+[source, c]
+----
+struct xfs_dir3_free_hdr {
+     struct xfs_dir3_blk_hdr   hdr;
+     __int32_t                 firstdb;
+     __int32_t                 nvalid;
+     __int32_t                 nused;
+     __int32_t                 pad;
+};
+----
+
+*hdr*::
+v3 directory block header.  The magic number is "XDF3" (0x58444633).
+
+*firstdb*::
+The starting directory block number for the bests array.
+
+*nvalid*::
+Number of elements in the bests array.
+
+*nused*::
+Number of valid elements in the bests array.
+
+*pad*::
+Padding to maintain alignment.
+
+[source, c]
+----
+struct xfs_dir3_free {
+     xfs_dir3_free_hdr_t       hdr;
+     __be16                    bests[1];
+};
+----
+
+*hdr*::
+Free block header.
+
+*bests*::
+An array specifying the best free counts in each directory data block.
+
+// split lists
+
 * The location of the leaf blocks can be in any order, the only way to determine
 the appropriate is by the node block hash/before values. Given a hash to look up,
 you read the node's +btree+ array and first +hashval+ in the array that exceeds
@@ -1205,6 +1380,45 @@ The hash value of a particular record.
 The directory/attribute logical block containing all entries up to the
 corresponding hash value.
 
+* On a v5 filesystem, the directory/attribute node blocks have the following
+structure:
+
+[source, c]
+----
+struct xfs_da3_intnode {
+     struct xfs_da3_node_hdr {
+           struct xfs_da3_blkinfo    info;
+           __uint16_t                count;
+           __uint16_t                level;
+           __uint32_t                pad32;
+     } hdr;
+     struct xfs_da_node_entry {
+           xfs_dahash_t              hashval;
+           xfs_dablk_t               before;
+     } btree[1];
+};
+----
+
+*info*::
+Directory/attribute block info.  The magic number is +XFS_DA3_NODE_MAGIC+
+(0x3ebe).
+
+*count*::
+Number of node entries in this block.
+
+*level*::
+The level of this block in the B+tree.
+
+*pad32*::
+Padding to maintain alignment.
+
+*hashval*::
+The hash value of a particular record.
+
+*before*::
+The directory/attribute logical block containing all entries up to the
+corresponding hash value.
+
 * The freeindex's +bests+ array starts from the end of the block and grows to the
 start of the block.
 
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 9bcecad..8ed38d9 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -90,4 +90,20 @@
 			</simplelist>
 		</revdescription>
 	</revision>
+	<revision>
+		<revnumber>3.1</revnumber>
+		<date>October 2015</date>
+		<author>
+			<firstname>Darrick</firstname>
+			<surname>Wong</surname>
+			<email></email>
+		</author>
+		<revdescription>
+			<simplelist>
+				<member>Add v5 fields.</member>
+				<member>Discuss metadata integrity.</member>
+				<member>Document the free inode B+tree.</member>
+			</simplelist>
+		</revdescription>
+	</revision>
 </revhistory>
diff --git a/design/XFS_Filesystem_Structure/extended_attributes.asciidoc b/design/XFS_Filesystem_Structure/extended_attributes.asciidoc
index f268d66..bb773d5 100644
--- a/design/XFS_Filesystem_Structure/extended_attributes.asciidoc
+++ b/design/XFS_Filesystem_Structure/extended_attributes.asciidoc
@@ -322,7 +322,8 @@ with the flags stored as well. The remaining part of the leaf block contains the
 array name/value pairs, where each element varies in length.
 
 Each leaf is based on the +xfs_da_blkinfo_t+ block header declared in the
-section about xref:Directory_Attribute_Block_Header[directories]. The structure
+section about xref:Directory_Attribute_Block_Header[directories].  On a v5
+filesystem, the block header is +xfs_da3_blkinfo_t+.  The structure
 encapsulating all other structures in the attribute block is
 +xfs_attr_leafblock_t+.
 
@@ -459,7 +460,32 @@ size of these entries is determined dynamically.
 A variable-length array of descriptors of remote attributes.  The location and
 size of these entries is determined dynamically.
 
-Each leaf header uses the magic number +XFS_ATTR_LEAF_MAGIC+ (0xfbee).
+On a v5 filesystem, the header becomes +xfs_da3_blkinfo_t+ to accommodate the
+extra metadata integrity fields:
+
+[source, c]
+----
+typedef struct xfs_attr3_leaf_hdr {
+     xfs_da3_blkinfo_t          info;
+     __be16                     count;
+     __be16                     usedbytes;
+     __be16                     firstused;
+     __u8                       holes;
+     __u8                       pad1;
+     xfs_attr_leaf_map_t        freemap[3];
+} xfs_attr3_leaf_hdr_t;
+
+
+typedef struct xfs_attr3_leafblock  {
+     xfs_attr3_leaf_hdr_t          hdr;
+     xfs_attr_leaf_entry_t         entries[1];
+     xfs_attr_leaf_name_local_t    namelist;
+     xfs_attr_leaf_name_remote_t   valuelist;
+} xfs_attr3_leafblock_t;
+----
+
+Each leaf header uses the magic number +XFS_ATTR_LEAF_MAGIC+ (0xfbee).  On a
+v5 filesystem, the magic number is +XFS_ATTR3_LEAF_MAGIC+ (0x3bee).
 
 The hash/index elements in the +entries[]+ array are packed from the top of the
 block. Name/values grow from the bottom but are not packed. The freemap contains
@@ -474,7 +500,8 @@ For attributes with small values (ie. the value can be stored within the leaf),
 the +XFS_ATTR_LOCAL+ flag is set for the attribute. The entry details are stored
 using the +xfs_attr_leaf_name_local_t+ structure. For large attribute values
 that cannot be stored within the leaf, separate filesystem blocks are allocated
-to store the value. They use the +xfs_attr_leaf_name_remote_t+ structure.
+to store the value. They use the +xfs_attr_leaf_name_remote_t+ structure.  See
+xref:Remote_Values[Remote Values] for more information.
 
 .Leaf attribute layout
 image::images/69.png[]
@@ -629,6 +656,7 @@ that exceeds the given hash.  The entry is in the block pointed to by the
 +before+ value. 
 
 Each attribute node block has a magic number of +XFS_DA_NODE_MAGIC+ (0xfebe).
+On a v5 filesystem this is +XFS_DA3_NODE_MAGIC+ (0x3ebe).
 
 .Node attribute layout
 image::images/72.png[]
@@ -834,3 +862,50 @@ is two levels deep. The two blocks at offset 513 and 512 (ie. access using the
 +ablock+ command) are intermediate +xfs_da_intnode_t+ nodes that index all the
 attribute leaves.
 
+[[Remote_Values]]
+== Remote Attribute Values
+
+On a v5 filesystem, all remote value blocks start with this header:
+
+[source, c]
+----
+struct xfs_attr3_rmt_hdr {
+	__be32	rm_magic;
+	__be32	rm_offset;
+	__be32	rm_bytes;
+	__be32	rm_crc;
+	uuid_t	rm_uuid;
+	__be64	rm_owner;
+	__be64	rm_blkno;
+	__be64	rm_lsn;
+};
+----
+
+
+*rm_magic*::
+Specifies the magic number for the remote value block: "XARM" (0x5841524d).
+
+*rm_offset*::
+Offset of the remote value data, in bytes.
+
+*rm_bytes*::
+Number of bytes used to contain the remote value data.
+
+*rm_crc*::
+Checksum of the remote value block.
+
+*rm_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*rm_owner*::
+The inode number that this remote value block belongs to.
+
+*rm_blkno*::
+Disk block number of this remote value block.
+
+*rm_lsn*::
+Log sequence number of the last write to this block.
+
+Filesystems formatted prior to v5 do not have this header in the remote block.
+Value data begins immediately at offset zero.
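
Every v5 header in this patch carries a UUID that "must match either +sb_uuid+ or +sb_meta_uuid+ depending on which features are set". A minimal sketch of that selection rule follows. The structure and function names are hypothetical, the flag value is an assumption to be checked against the kernel headers, and UUIDs are treated as opaque 16-byte arrays.

```c
#include <stdint.h>
#include <string.h>

/* Assumed flag value; verify against the kernel's on-disk format header. */
#define XFS_SB_FEAT_INCOMPAT_META_UUID (1u << 2)

/* Hypothetical in-memory view of the superblock fields involved. */
struct sb_uuid_view {
	uint32_t      features_incompat;	/* sb_features_incompat */
	unsigned char uuid[16];			/* sb_uuid */
	unsigned char meta_uuid[16];		/* sb_meta_uuid */
};

/* Pick the UUID that metadata block headers are expected to carry. */
const unsigned char *expected_block_uuid(const struct sb_uuid_view *sb)
{
	if (sb->features_incompat & XFS_SB_FEAT_INCOMPAT_META_UUID)
		return sb->meta_uuid;
	return sb->uuid;
}

/* Returns non-zero if a block header UUID (e.g. rm_uuid) is acceptable. */
int block_uuid_ok(const struct sb_uuid_view *sb, const unsigned char *blk_uuid)
{
	return memcmp(blk_uuid, expected_block_uuid(sb), 16) == 0;
}
```

Keeping the choice in one helper means the same check can be applied to +agf_uuid+, +bb_uuid+, +rm_uuid+ and the other per-block UUID fields.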
diff --git a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
index c21f8b4..9ace3ea 100644
--- a/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
+++ b/design/XFS_Filesystem_Structure/internal_inodes.asciidoc
@@ -60,6 +60,11 @@ struct xfs_disk_dquot {
 struct xfs_dqblk {
      struct xfs_disk_dquot dd_diskdq;
      char                  dd_fill[32];
+
+     /* version 5 filesystem fields begin here */
+     __be32                dd_crc;
+     __be64                dd_lsn;
+     uuid_t                dd_uuid;
 };
 ----
 
@@ -150,6 +155,16 @@ soft limit will turn into a hard limit after the elapsed time exceeds ID zero's
 +d_rtbtimer+ value. When +d_rtbcount+ goes back below +d_rtb_softlimit+,
 +d_rtbtimer+ is reset back to zero.
 
+*dd_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*dd_lsn*::
+Log sequence number of the last DQ block write.
+
+*dd_crc*::
+Checksum of the DQ block.
+
 
 [[Real-time_Inodes]]
 == Real-time Inodes
diff --git a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
index da6281b..4aabc55 100644
--- a/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
+++ b/design/XFS_Filesystem_Structure/ondisk_inode.asciidoc
@@ -55,7 +55,7 @@ explain the various structures in use within the inode.
 
 The remaining space in the inode after +di_next_unlinked+ where the two forks
 are located is called the inode's ``literal area''. This starts at offset 100
-(0x64) in the inode.
+(0x64) in a version 1 or 2 inode, and offset 176 (0xb0) in a version 3 inode.
 
 The space for each of the two forks in the literal area is determined by the
 inode size, and +di_core.di_forkoff+. The data fork is located between the start
@@ -99,6 +99,20 @@ struct xfs_dinode_core {
      __uint16_t                di_dmstate;
      __uint16_t                di_flags;
      __uint32_t                di_gen;
+
+     /* di_next_unlinked is the only non-core field in the old dinode */
+     __be32                    di_next_unlinked;
+
+     /* version 5 filesystem (inode version 3) fields start here */
+     __le32                    di_crc;
+     __be64                    di_changecount;
+     __be64                    di_lsn;
+     __be64                    di_flags2;
+     __u8                      di_pad2[16];
+     xfs_timestamp_t           di_crtime;
+     __be64                    di_ino;
+     uuid_t                    di_uuid;
+
 };
 ----
 
@@ -110,10 +124,11 @@ Specifies the mode access bits and type of file using the standard S_Ixxx values
 defined in stat.h.
 
 *di_version*::
-Specifies the inode version which currently can only be 1 or 2. The inode
+Specifies the inode version which currently can only be 1, 2, or 3. The inode
 version specifies the usage of the +di_onlink+, +di_nlink+ and +di_projid+
 values in the inode core. Initially, inodes are created as v1 but can be
-converted on the fly to v2 when required.
+converted on the fly to v2 when required.  v3 inodes are created only for v5
+filesystems.
 
 *di_format*::
 Specifies the format of the data fork in conjunction with the +di_mode+ type.
@@ -284,6 +299,35 @@ A generation number used for inode identification. This is used by tools that do
 inode scanning such as backup tools and xfsdump. An inode's generation number
 can change by unlinking and creating a new file that reuses the inode.  
 
+*di_next_unlinked*::
+See the section on xref:Unlinked_Pointer[unlinked inode pointers] for more
+information.
+
+*di_crc*::
+Checksum of the inode.
+
+*di_changecount*::
+Counts the number of changes made to the attributes in this inode.
+
+*di_lsn*::
+Log sequence number of the last inode write.
+
+*di_flags2*::
+Specifies extended flags associated with a v3 inode.  There are no flags defined
+currently.
+
+*di_pad2*::
+Padding for future expansion of the inode.
+
+*di_crtime*::
+Specifies the time when this inode was created.
+
+*di_ino*::
+The full inode number of this inode.
+
+*di_uuid*::
+The UUID of this inode, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
 
 [[Unlinked_Pointer]]
 == Unlinked Pointer
@@ -311,12 +355,12 @@ image::images/28.png[]
 == Data Fork
 
 The structure of the inode's data fork based is on the inode's type and
-+di_format+. It always starts at offset 100 (0x64) in the inode's space which is
-the start of the inode's ``literal area''. The size of the data fork is determined
-by the type and format. The maximum size is determined by the inode size and
-+di_forkoff+. In code, use the +XFS_DFORK_PTR+ macro specifying +XFS_DATA_FORK+
-for the ``which'' parameter. Alternatively, the +XFS_DFORK_DPTR+ macro can be
-used.
++di_format+. The data fork begins at the start of the inode's ``literal area''.
+This area starts at offset 100 (0x64), or offset 176 (0xb0) in a v3 inode. The
+size of the data fork is determined by the type and format. The maximum size is
+determined by the inode size and +di_forkoff+. In code, use the +XFS_DFORK_PTR+
+macro specifying +XFS_DATA_FORK+ for the ``which'' parameter. Alternatively,
+the +XFS_DFORK_DPTR+ macro can be used.
 
 Each of the following sub-sections summarises the contents of the data fork
 based on the inode type.
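The two literal-area offsets quoted in the hunks above follow from the field sizes in +struct xfs_dinode_core+. The sketch below reproduces that arithmetic and derives the maximum data fork size. It rests on assumptions not stated in this patch: +xfs_timestamp_t+ is taken to be 8 bytes, +uuid_t+ 16 bytes, and +di_forkoff+ is taken to be expressed in 8-byte units with zero meaning "no attribute fork". The helper names are hypothetical and stand in for the kernel's +XFS_DFORK_*+ macros rather than reproducing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	/* 96-byte original core plus the 4-byte di_next_unlinked */
	V2_LITERAL_OFFSET = 100,
	/* di_crc, di_changecount, di_lsn, di_flags2, di_pad2[16],
	 * di_crtime, di_ino, di_uuid */
	V3_EXTRA_BYTES = 4 + 8 + 8 + 8 + 16 + 8 + 8 + 16,
	V3_LITERAL_OFFSET = V2_LITERAL_OFFSET + V3_EXTRA_BYTES,
};

static_assert(V3_LITERAL_OFFSET == 176, "v3 literal area starts at 0xb0");

/* Hypothetical helper: byte offset of the literal area in the inode. */
static size_t literal_area_offset(unsigned di_version)
{
	return di_version >= 3 ? V3_LITERAL_OFFSET : V2_LITERAL_OFFSET;
}

/*
 * Hypothetical helper: maximum size of the data fork in bytes.
 * Assumes di_forkoff counts 8-byte units from the start of the
 * literal area, and that zero means the data fork may use all of it.
 */
static size_t data_fork_max_size(size_t inode_size, unsigned di_version,
				 unsigned di_forkoff)
{
	if (di_forkoff != 0)
		return (size_t)di_forkoff * 8;
	return inode_size - literal_area_offset(di_version);
}
```

Under these assumptions, a 512-byte v3 inode with no attribute fork leaves 512 - 176 = 336 bytes for the data fork, while a 256-byte v2 inode leaves 156.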
diff --git a/design/XFS_Filesystem_Structure/symbolic_links.asciidoc b/design/XFS_Filesystem_Structure/symbolic_links.asciidoc
index 5d2c4e8..bfe5eb9 100644
--- a/design/XFS_Filesystem_Structure/symbolic_links.asciidoc
+++ b/design/XFS_Filesystem_Structure/symbolic_links.asciidoc
@@ -63,6 +63,51 @@ by the data fork's +di_bmx[]+ array. In the significant majority of cases, this
 will be in one filesystem block as a symlink cannot be longer than 1024
 characters.
 
+On a v5 filesystem, the first block of each extent starts with the following
+header structure:
+
+[source, c]
+----
+struct xfs_dsymlink_hdr {
+     __be32                    sl_magic;
+     __be32                    sl_offset;
+     __be32                    sl_bytes;
+     __be32                    sl_crc;
+     uuid_t                    sl_uuid;
+     __be64                    sl_owner;
+     __be64                    sl_blkno;
+     __be64                    sl_lsn;
+};
+----
+
+*sl_magic*::
+Specifies the magic number for the symlink block: "XSLM" (0x58534c4d).
+
+*sl_offset*::
+Offset of this block's data within the symbolic link target, in bytes.
+
+*sl_bytes*::
+Number of bytes of link target data stored in this block.
+
+*sl_crc*::
+Checksum of the symlink block.
+
+*sl_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*sl_owner*::
+The inode number that this symlink block belongs to.
+
+*sl_blkno*::
+Disk block number of this symlink block.
+
+*sl_lsn*::
+Log sequence number of the last write to this block.
+
+Filesystems formatted prior to v5 do not have this header in the remote block.
+Symlink data begins immediately at offset zero.
+
 .Symbolic link extent layout
 image::images/62.png[]
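The +xfs_dsymlink_hdr+ layout added in the hunk above can be decoded from a raw block buffer as sketched below. This is a minimal illustration under stated assumptions: the struct `symlink_hdr`, the function `symlink_hdr_decode`, and the byte-order helpers are hypothetical names, the header is assumed to be packed with no padding (4 x 4 + 16 + 3 x 8 = 56 bytes), and the checksum is kept as raw bytes because verifying it is out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SYMLINK_MAGIC    0x58534c4dU	/* "XSLM" */
#define SYMLINK_HDR_SIZE 56		/* assumed packed size of the header */

/* Hypothetical host-endian copy of the on-disk header. */
struct symlink_hdr {
	uint32_t magic;
	uint32_t offset;	/* sl_offset */
	uint32_t bytes;		/* sl_bytes */
	uint8_t  crc_raw[4];	/* sl_crc, left undecoded */
	uint8_t  uuid[16];	/* sl_uuid */
	uint64_t owner;		/* sl_owner */
	uint64_t blkno;		/* sl_blkno */
	uint64_t lsn;		/* sl_lsn */
};

/* Read a big-endian 32-bit value from an unaligned buffer. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value from an unaligned buffer. */
static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Returns 0 on success, -1 if the buffer is short or the magic is wrong. */
static int symlink_hdr_decode(const uint8_t *buf, size_t len,
			      struct symlink_hdr *out)
{
	if (len < SYMLINK_HDR_SIZE)
		return -1;

	out->magic = get_be32(buf);
	if (out->magic != SYMLINK_MAGIC)
		return -1;

	out->offset = get_be32(buf + 4);
	out->bytes  = get_be32(buf + 8);
	memcpy(out->crc_raw, buf + 12, 4);
	memcpy(out->uuid, buf + 16, 16);
	out->owner  = get_be64(buf + 32);
	out->blkno  = get_be64(buf + 40);
	out->lsn    = get_be64(buf + 48);
	return 0;
}
```

On a pre-v5 filesystem there is no header to decode, so a caller would skip this step and read the link target from offset zero of the block.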