ext4
Why a new filesystem?
The kernel maintainers of ext3 wanted to make major changes to the ext3 filesystem, but got a lot of resistance and didn't convince Linus that it would be a good idea.
They discussed it amongst themselves some more and came back with a proposal to put the new code into a new experimental filesystem "ext4" while keeping ext3 stable and maintained until ext4 could take over. This is like what happened in the move from ext2 to ext3.
Size limits
The ext3 filesystem is limited to 2 TiB with 1 kB blocks and 8 TiB with 4 kB blocks. This is because ext2/ext3 use 31-bit block addressing (the block number is signed...), so the maximum address space is 2^31 * 1 kB = 2 TiB with 1 kB blocks and 2^31 * 4 kB = 8 TiB with 4 kB blocks.
The extents feature gives ext4 48-bit block addressing; with 4 kB blocks that is 2^60 bytes = 1024 PiB (pebibytes).
There is also another feature that switches ext4 to 64-bit block addressing; with 4 kB blocks that is 2^76 bytes, which is a-very-large-number!
The size limit of a single file is more complex to compute because of the direct, indirect, double indirect and triple indirect index blocks ... but extents replace that scheme, so the limit imposed by the index blocks no longer applies.
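The arithmetic behind these limits can be checked with a small sketch. It works in exponents (log2 of the size in bytes) so that the 64-bit case does not overflow an integer. The function names are made up for this illustration and are not part of any ext4 code.

```c
#include <stdint.h>

/* Return log2 of a power-of-two block size (e.g. 4096 -> 12). */
unsigned log2_u32(uint32_t v)
{
    unsigned n = 0;

    while (v > 1) {
        v >>= 1;
        n++;
    }
    return n;
}

/*
 * Return log2 of the maximum filesystem size in bytes, given the number
 * of usable block-address bits and a power-of-two block size.
 * Hypothetical helper, for illustration only.
 */
unsigned max_fs_size_log2(unsigned addr_bits, uint32_t block_size)
{
    return addr_bits + log2_u32(block_size);
}
```

For example, max_fs_size_log2(31, 4096) gives 43 (2^43 bytes = 8 TiB) and max_fs_size_log2(48, 4096) gives 60 (2^60 bytes = 1024 PiB).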
Features
EXT4_FEATURE_COMPAT_DIR_PREALLOC | 0x0001
EXT4_FEATURE_COMPAT_IMAGIC_INODES | 0x0002
EXT4_FEATURE_COMPAT_HAS_JOURNAL | 0x0004
EXT4_FEATURE_COMPAT_EXT_ATTR | 0x0008
EXT4_FEATURE_COMPAT_RESIZE_INODE | 0x0010
EXT4_FEATURE_COMPAT_DIR_INDEX | 0x0020
EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER | 0x0001
EXT4_FEATURE_RO_COMPAT_LARGE_FILE | 0x0002
EXT4_FEATURE_RO_COMPAT_BTREE_DIR | 0x0004
**EXT4_FEATURE_RO_COMPAT_HUGE_FILE** | 0x0008
**EXT4_FEATURE_RO_COMPAT_GDT_CSUM** | 0x0010
**EXT4_FEATURE_RO_COMPAT_DIR_NLINK** | 0x0020
**EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE** | 0x0040
EXT4_FEATURE_INCOMPAT_COMPRESSION | 0x0001
EXT4_FEATURE_INCOMPAT_FILETYPE | 0x0002
EXT4_FEATURE_INCOMPAT_RECOVER (Needs recovery) | 0x0004
EXT4_FEATURE_INCOMPAT_JOURNAL_DEV | 0x0008
**EXT4_FEATURE_INCOMPAT_META_BG** | 0x0010
**EXT4_FEATURE_INCOMPAT_EXTENTS** | 0x0040
**EXT4_FEATURE_INCOMPAT_64BIT** | 0x0080
**EXT4_FEATURE_INCOMPAT_MMP** | 0x0100
**EXT4_FEATURE_INCOMPAT_FLEX_BG** | 0x0200
New features
huge_file (HUGE_FILE)
In ext3 an inode's block count was stored as a number of sectors, but in ext4 the block count can be stored as the number of filesystem blocks. This is triggered automatically when a file grows too large to be represented in the old ext3 way. The filesystem then has the huge_file feature marked and the inode has the HUGE_FILE flag set.
uninit_bg (GDT_CSUM)
Used to be called uninit_groups.
Create a filesystem without initializing all of the block groups. This feature also enables checksums and highest-inode-used statistics in each block group. This feature can speed up filesystem creation time noticeably (if lazy_itable_init is enabled), and can also reduce e2fsck time dramatically. It is only supported by the ext4 filesystem in recent Linux kernels.
mke2fs extended option (-E): lazy_itable_init[= <0 to disable, 1 to enable>]
If this option and the uninit_bg feature are both enabled, the inode table will not be fully initialized by mke2fs. This speeds up filesystem initialization noticeably, but it requires the kernel to finish initializing the filesystem in the background when the filesystem is first mounted. If the option value is omitted, it defaults to 1 to enable lazy inode table initialization.
__le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
                            struct ext4_group_desc *gdp)
{
    __u16 crc = 0;

    if (sbi->s_es->s_feature_ro_compat &
        cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
        int offset = offsetof(struct ext4_group_desc, bg_checksum);
        __le32 le_group = cpu_to_le32(block_group);

        crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
        crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
        crc = crc16(crc, (__u8 *)gdp, offset);
        offset += sizeof(gdp->bg_checksum); /* skip checksum */
        /* for checksum of struct ext4_group_desc do the rest...*/
        if ((sbi->s_es->s_feature_incompat &
             cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT)) &&
            offset < le16_to_cpu(sbi->s_es->s_desc_size))
            crc = crc16(crc, (__u8 *)gdp + offset,
                        le16_to_cpu(sbi->s_es->s_desc_size) -
                        offset);
    }

    return cpu_to_le16(crc);
}
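To experiment with this checksum outside the kernel, here is a minimal userspace sketch. It assumes that the kernel's crc16() is the common reflected CRC-16 with polynomial 0xA001, and it only covers the part of the descriptor before the checksum field (the non-64bit path above). The names crc16_update and group_desc_csum are made up for this sketch, and the checksum offset is passed in by the caller rather than taken from a real struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 using the reflected polynomial 0xA001 (x^16 + x^15 + x^2 + 1). */
uint16_t crc16_update(uint16_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
                            : (uint16_t)(crc >> 1);
    }
    return crc;
}

/*
 * Hypothetical helper: checksum the filesystem UUID, the little-endian
 * group number, and the descriptor bytes that precede the checksum field.
 * csum_offset is an assumption supplied by the caller.
 */
uint16_t group_desc_csum(const uint8_t uuid[16], uint32_t group,
                         const uint8_t *desc, size_t csum_offset)
{
    uint8_t le_group[4] = {
        (uint8_t)(group & 0xff),
        (uint8_t)((group >> 8) & 0xff),
        (uint8_t)((group >> 16) & 0xff),
        (uint8_t)((group >> 24) & 0xff),
    };
    uint16_t crc = crc16_update(0xFFFF, uuid, 16); /* ~0 as the initial value */

    crc = crc16_update(crc, le_group, sizeof(le_group));
    crc = crc16_update(crc, desc, csum_offset);
    return crc;
}
```

Serializing the group number byte by byte keeps the result independent of the host's endianness, which is what cpu_to_le32() achieves in the kernel version.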
dir_nlink (DIR_NLINK)
In ext3, if there are too many hard links to a file or directory, the error EMLINK is returned: /usr/src/linux-2.4.19/include/linux/ext3_fs.h defines EXT3_LINK_MAX as 32000. This can be caused by a directory having too many subdirectories, because each subdirectory has .. as a hard link to its parent directory, which increases the parent's hard link count by one. So yes, this does mean that you are limited to about 32000 subdirectories in one directory in ext3 (31998, since a directory starts out with two links), even if you have hashdirs enabled. As a consequence, you can stat(2) a directory and subtract two from its link count (one for its own . entry and one for its name in the parent directory) to find out how many subdirectories it has.
ext4.h defines EXT4_LINK_MAX as 65000, and the DIR_NLINK feature is set if the directory has a dirindex and 1) nlinks > EXT4_LINK_MAX or 2) nlinks == 2, since this indicates that the nlinks count was previously 1 (the value a directory is given once its real link count is no longer tracked). In English: directories are no longer limited to a maximum number of subdirectories.
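The link-count arithmetic can be tried from userspace with stat(2). The sketch below is only an illustration: the function name is made up, and it treats a link count below 2 as "unknown", which is what a directory reports once its count is no longer tracked.

```c
#include <sys/stat.h>

/*
 * Return the number of subdirectories implied by a directory's link
 * count, or -1 if it cannot be determined (stat() failed, the path is
 * not a directory, or the link count is below 2).
 */
long subdir_count_from_nlink(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
        return -1;
    if (st.st_nlink < 2)
        return -1;
    /* One link for ".", one for the name in the parent directory. */
    return (long)st.st_nlink - 2;
}
```

Not every filesystem maintains directory link counts this way, so a caller that needs an exact answer should fall back to reading the directory.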
extra_isize (EXTRA_ISIZE)
Determine the minimum size of new large inodes, if present, via the i_extra_isize field in the inode.
meta_bg (META_BG)
After extending the limit created by 32-bit block numbers, the filesystem capacity is still restricted by the number of block groups in the filesystem. In ext3, for safety concerns all block group descriptor copies are kept in the first block group. With the new uninitialized block group feature discussed in section 4.1 the new block group descriptor size is 64 bytes. Given the default 128 MB (2^27 bytes) block group size, ext4 can have at most 2^27/64 = 2^21 block groups. This limits the entire filesystem size to 2^21 * 2^27 = 2^48 bytes or 256 TB.
The solution to this problem is to use the metablock group feature (META_BG), which is already in ext3 for all 2.6 releases. With the META_BG feature, ext4 filesystems are partitioned into many metablock groups. Each metablock group is a cluster of block groups whose group descriptor structures can be stored in a single disk block. For ext4 filesystems with 4 KB block size, a single metablock group partition includes 64 block groups, or 8 GB of disk space. The metablock group feature moves the location of the group descriptors from the congested first block group of the whole filesystem into the first group of each metablock group itself. The backups are in the second and last group of each metablock group. This increases the 2^21 maximum block groups limit to the hard limit 2^32, allowing support for the full 1 EB filesystem.
The change in the filesystem format replaces the current scheme where the superblock is followed by a variable-length set of block group descriptors. Instead, the superblock and a single block group descriptor block are placed at the beginning of the first, second, and last block groups in a meta-block group. A meta-block group is a collection of block groups which can be described by a single block group descriptor block. Since the size of the block group descriptor structure is 32 bytes here (the original descriptor size, as opposed to the 64-byte descriptor assumed above), a meta-block group contains 32 block groups for filesystems with a 1KB block size, and 128 block groups for filesystems with a 4KB blocksize. Filesystems can either be created using this new block group descriptor layout, or existing filesystems can be resized on-line, and a new field in the superblock will indicate the first block group using this new layout.
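The figures quoted above follow from two divisions and a multiplication, sketched here. The helper names are made up for illustration, and the block group size assumes the default layout in which one bitmap block (block_size * 8 bits) covers one group.

```c
#include <stdint.h>

/* Number of group descriptors that fit in a single filesystem block. */
uint32_t groups_per_meta_bg(uint32_t block_size, uint32_t desc_size)
{
    return block_size / desc_size;
}

/*
 * Bytes covered by one block group, assuming the default layout where a
 * single bitmap block (block_size * 8 bits) maps the whole group.
 */
uint64_t block_group_bytes(uint32_t block_size)
{
    return (uint64_t)block_size * 8 * block_size;
}

/* Bytes covered by one meta-block group. */
uint64_t meta_bg_bytes(uint32_t block_size, uint32_t desc_size)
{
    return (uint64_t)groups_per_meta_bg(block_size, desc_size) *
           block_group_bytes(block_size);
}
```

With 4 KB blocks and 64-byte descriptors this gives 64 groups of 128 MB each, or 8 GB per meta-block group. With 32-byte descriptors it gives 32 groups per meta-block group at 1 KB blocks and 128 at 4 KB blocks.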
NB: The original intention was that META_BG would place the bitmaps and inode tables at the beginning of each metagroup by default, but that the constraints about where to put the bitmaps and inode tables would be completely relaxed from the point of view of requirements by the kernel and e2fsck. Unfortunately while I had patches which removed the constraints checking, they never made it into mainline of either the kernel or e2fsprogs.
**meta_bg is incompatible with resize_inode and afaik cannot be set through tune2fs, so you have to set it at mkfs time**
- mkfs.ext4dev -O flex_bg,meta_bg,^resize_inode ...
I believe that you no longer need to add the ^resize_inode option since current e2fsprogs removes the option if meta_bg is specified.
extents (EXTENTS)
Instead of using the indirect block scheme for storing the location of data blocks in an inode, use extents. This is a much more efficient encoding which speeds up filesystem access, especially for large files.
An extent is a contiguous area of storage in a computer file system, reserved for a file. When starting to write to a file, a whole extent is allocated. When writing to the file again, possibly after doing other write operations, the data continues where the previous write left off. This reduces or eliminates file fragmentation.
Extents replace the inode's block array with an array of extents. Each extent gives the start and length of a run of blocks containing the file's data. Since filesystems go to a lot of effort to make files contiguous on the disk anyway, this saves time in finding out where the files are.
ext4 will still support files stored in the old way, using block pointers, but new files will be created using extents.
64bit (64BIT)
/* 64bit support valid if EXT4_FEATURE_COMPAT_64BIT */
/*150*/ __le32 s_blocks_count_hi; /* Blocks count */
__le32 s_r_blocks_count_hi; /* Reserved blocks count */
__le32 s_free_blocks_count_hi; /* Free blocks count */
multi mount protection (MMP)
MMP was added by Kalpak Shah.
He said "There have been reported instances of a filesystem having been mounted at 2 places at the same time causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection(MMP) support within the ext4 filesystem itself. The superblock will have a block number (s_mmp_block) which will hold a MMP structure which has a sequence number which will be periodically updated every 5 seconds by a mounted filesystem. Whenever a filesystem will be mounted it will wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To further make sure, we write a random sequence number into the MMP block and wait for another s_mmp_interval secs. If the sequence no. doesn't change then the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.
It will also protect against running e2fsck on a mounted filesystem by adding similar logic to ext2fs_open(). "
Ted Ts'o wasn't too keen on it ... "So aside from being !@#!@ annoying (which is why it will never be the default), it does work" ... and ... " I'm on the fence on this."
... but it has been merged into ext4 (but not e2fsprogs?).
flex_bg (FLEX_BG)
Allow bitmaps and inode tables for a block group to be placed anywhere on the storage media (use with -G option to group meta-data in order to create a large virtual block group).
- ability to pack bitmaps and inode tables into larger virtual groups via the flex_bg feature
- Inode allocation using large virtual block groups via flex_bg
Old features
(Had to look these up to make sure they weren't new)
dir_prealloc
All it does is pre-allocate empty directory blocks for directories growing over 1 block.
imagic_inodes
Added in 1999-05-29, some feature of the Andrew Filesystem (AFS).
inode
Extra fields in an inode....
struct {
    __le32 l_i_version;
} linux1;
...
struct {
    __le16 l_i_blocks_high; /* were l_i_reserved1 */
    __le16 l_i_file_acl_high;
    __le16 l_i_uid_high; /* these 2 fields */
    __le16 l_i_gid_high; /* were reserved2[0] */
    __u32 l_i_reserved2;
} linux2;
...
//0x80
__le16 i_extra_isize;
__le16 i_pad1;
__le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */
__le32 i_mtime_extra; /* extra Modification time (nsec << 2 | epoch) */
__le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
__le32 i_crtime; /* File Creation time */
__le32 i_crtime_extra; /* extra File Creation time (nsec << 2 | epoch) */
__le32 i_version_hi; /* high 32 bits for 64-bit version */
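The "(nsec << 2 | epoch)" comments describe how the *_extra fields are packed: the low 2 bits extend the seconds counter and the remaining 30 bits hold nanoseconds. A minimal decoding sketch, assuming the fields have already been converted to host byte order (the struct and function names are made up for this illustration):

```c
#include <stdint.h>

#define EPOCH_BITS 2
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1)

/* Decoded timestamp; a hypothetical type for this sketch. */
struct decoded_time {
    int64_t sec;   /* seconds, extended past 32 bits by the epoch bits */
    uint32_t nsec; /* nanoseconds */
};

/*
 * Combine the signed 32-bit seconds field (e.g. i_ctime) with its
 * *_extra companion (e.g. i_ctime_extra).
 */
struct decoded_time decode_extra_time(int32_t base_sec, uint32_t extra)
{
    struct decoded_time t;

    t.sec = (int64_t)base_sec + ((int64_t)(extra & EPOCH_MASK) << 32);
    t.nsec = extra >> EPOCH_BITS;
    return t;
}
```

With the epoch bits set to zero this degrades to the plain 32-bit timestamp, so inodes without the extra fields decode the same way as before.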
superblock
Extra fields in the superblock....
__le32 s_hash_seed[4]; /* HTREE hash seed */
__u8 s_def_hash_version; /* Default hash version to use */
__u8 s_reserved_char_pad;
__le16 s_desc_size; /* size of group descriptor */
/*100*/ __le32 s_default_mount_opts;
__le32 s_first_meta_bg; /* First metablock block group */
__le32 s_mkfs_time; /* When the filesystem was created */
__le32 s_jnl_blocks[17]; /* Backup of the journal inode */
/* 64bit support valid if EXT4_FEATURE_COMPAT_64BIT */
/*150*/ __le32 s_blocks_count_hi; /* Blocks count */
__le32 s_r_blocks_count_hi; /* Reserved blocks count */
__le32 s_free_blocks_count_hi; /* Free blocks count */
__le16 s_min_extra_isize; /* All inodes have at least # bytes */
__le16 s_want_extra_isize; /* New inodes should reserve # bytes */
__le32 s_flags; /* Miscellaneous flags */
__le16 s_raid_stride; /* RAID stride */
__le16 s_mmp_interval; /* # seconds to wait in MMP checking */
__le64 s_mmp_block; /* Block for multi-mount protection */
__le32 s_raid_stripe_width; /* blocks on all data disks (N*stride)*/
__u8 s_log_groups_per_flex; /* FLEX_BG group size */
__u8 s_reserved_char_pad2;
__le16 s_reserved_pad;
group descriptor
Extra fields in a group descriptor....
__u32 bg_reserved[2]; /* Likely block/inode bitmap checksum */
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
__le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
__le32 bg_inode_table_hi; /* Inodes table block MSB */
__le16 bg_free_blocks_count_hi;/* Free blocks count MSB */
__le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
__le16 bg_used_dirs_count_hi; /* Directories count MSB */
__le16 bg_itable_unused_hi; /* Unused inodes count MSB */
__u32 bg_reserved2[3];
more
There is an ext4_extent_header, followed by an array of the ext4_extent struct....
/*
 * Each block (leaves and indexes), even inode-stored has header.
 */
struct ext4_extent_header {
    __le16 eh_magic; /* probably will support different formats */
    __le16 eh_entries; /* number of valid entries */
    __le16 eh_max; /* capacity of store in entries */
    __le16 eh_depth; /* has tree real underlying blocks? */
    __le32 eh_generation; /* generation of the tree */
};
/*
 * This is the extent on-disk structure.
 * It's used at the bottom of the tree.
 */
struct ext4_extent {
    __le32 ee_block; /* first logical block extent covers */
    __le16 ee_len; /* number of blocks covered by extent */
    __le16 ee_start_hi; /* high 16 bits of physical block */
    __le32 ee_start; /* low 32 bits of physical block */
};
Here, ee_block is the index (within the file, not on disk) of the first block covered by this extent. The number of blocks in the extent is stored in ee_len, and the pointer to the first of those blocks (on disk, now) lives in the combination of ee_start and ee_start_hi. By storing physical block numbers this way, ext4 can handle 48-bit block numbers - enough to index a 1024 PiB device. That should be enough to last for a couple years or so.
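As a rough sketch of how such an array is used, the following maps a logical block number within a file to a physical block number by scanning a flat array of extents. It uses a host-endian copy of the struct with plain integer types, ignores the extent tree and any special meaning of large ee_len values, and returns 0 for "not mapped"; all of these are simplifying assumptions, and the names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-endian copy of the on-disk ext4_extent fields (sketch only). */
struct extent {
    uint32_t ee_block;    /* first logical block extent covers */
    uint16_t ee_len;      /* number of blocks covered by extent */
    uint16_t ee_start_hi; /* high 16 bits of physical block */
    uint32_t ee_start;    /* low 32 bits of physical block */
};

/* Combine the two halves into the 48-bit physical start block. */
uint64_t extent_start(const struct extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/*
 * Return the physical block backing logical block lblk, or 0 if no
 * extent in the array covers it (a hole, in this simplified model).
 */
uint64_t map_logical_block(const struct extent *ex, size_t n, uint32_t lblk)
{
    for (size_t i = 0; i < n; i++) {
        if (lblk >= ex[i].ee_block && lblk - ex[i].ee_block < ex[i].ee_len)
            return extent_start(&ex[i]) + (lblk - ex[i].ee_block);
    }
    return 0;
}
```

A file stored as one long run needs a single entry here, whereas the indirect block scheme needs one pointer per block.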
For files with few extents, all of the information can be stored within the on-disk inode itself. As the number of extents grows, however, the available space runs out. In that case, a form of indirect blocks is used; the in-inode extents array describes ranges of blocks holding extents arrays of their own. The tree of indirect extents blocks can grow to an essentially unlimited depth, allowing the filesystem to represent even very large, highly-fragmented files.
Beyond extents, relatively little had to be done to prepare ext4 for 48-bit block addressing. The signed, 32-bit block numbers are gone, having been converted to the larger sector_t type. Some reserved space in the ext4 superblock has been grabbed to store the high 16 bits of some global block counts. Much of the tracking of free blocks within the filesystem is done using block numbers relative to the beginning of the block group, so that code did not need to change much at all. A few tweaks to the journaling code were required for it to be able to handle the larger block numbers.
hi words have been added to ext4_group_desc
struct ext4_group_desc
{
    __le32 bg_block_bitmap; /* Blocks bitmap block */
    __le32 bg_inode_bitmap; /* Inodes bitmap block */
    __le32 bg_inode_table; /* Inodes table block */
    __le16 bg_free_blocks_count; /* Free blocks count */
    __le16 bg_free_inodes_count; /* Free inodes count */
    __le16 bg_used_dirs_count; /* Directories count */
    __u16 bg_flags;
    __u32 bg_reserved[3];
    __le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
    __le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
    __le32 bg_inode_table_hi; /* Inodes table block MSB */
};

/* data type for filesystem-wide blocks number */
typedef unsigned long long ext4_fsblk_t;
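The _hi words are combined with the original 32-bit fields to form a full block number. A minimal sketch of that combination, assuming both halves are already in host byte order and that the caller knows whether the larger descriptor is in use (the function name and the has_64bit parameter are made up for this illustration):

```c
#include <stdint.h>

/* data type for filesystem-wide blocks number */
typedef unsigned long long ext4_fsblk_t;

/*
 * Build a filesystem-wide block number from the low and high words of a
 * group descriptor field. When the larger descriptor is not in use, the
 * high word is ignored.
 */
ext4_fsblk_t combine_block_number(uint32_t lo, uint32_t hi, int has_64bit)
{
    return (ext4_fsblk_t)lo | (has_64bit ? (ext4_fsblk_t)hi << 32 : 0);
}
```

Ignoring the high word on small descriptors matters because those bytes lie outside the descriptor on such filesystems.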
More Info
http://kernelnewbies.org/Linux_2_6_19 -- ext4 was introduced in this version from Linus
http://linux.inet.hr/first_benchmarks_of_the_ext4_file_system.html
---
Ext4: The Fourth Extended Filesystem for Linux
In ordinary UNIX/BSD/ext1-3 and other traditional filesystems, disk space is allocated one block at a time, or whenever a write() requests it. Ext4 breaks with this tradition because allocating space that way becomes a bottleneck. Instead, ext4 allocates space by:
- Allocating several blocks at once in extents, which are ranges of contiguous physical blocks. In ext4, a single extent can map up to 128 MB of contiguous space with a 4 KB block size.
    o + This results in batching during allocation, which saves valuable CPU time.
    o + This allows us to get contiguous data instead of having data strewn all over the disk.
- Delayed allocation where extents are not created until the data is going to be written to the disk.
    o + The data can be laid out more intelligently because the size of the data is known.
    o - This type of allocation requires more RAM.
    o - This type of allocation results in more fragmentation.
    o - If the extent is large, searching for a larger portion of free space can take more CPU time.
    o - The delay provides a longer window of time in which you can lose data (race conditions).
        + Example of race condition:
            fd = open("f.new", O_WRONLY);
            write(fd, ...);
            rename("f.new", f);
          Because of delayed allocation, the disk blocks for the data written to f.new might not be allocated before the program renames the file, so a crash shortly after the rename can leave f without its data. The solution to this problem is to add "fdatasync(fd)" after the write. fdatasync() forces all currently queued I/O operations associated with the file indicated by the file descriptor to the synchronized I/O completion state. This, of course, makes the program slower but forces the write to happen instead of delaying it. Developers of ext4 are trying to detect this pattern and add fdatasync() automatically.
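- Ext4 introduces a new syscall, fallocate(), which tries to allocate space of the requested size (for example 128*1024*1024 bytes) to fd in the form of extents. The real call also takes a mode and an offset: fallocate(fd, mode, offset, len). Preallocated blocks read back as zeros until they are written, so the call does not change the file contents you see in the file system.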
- Ext4's extents bring little benefit on a flash drive, because contiguous data does not help there: flash has no seek penalty. Extents are mainly useful for rotating disks.
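The write-then-rename pattern and the preallocation call discussed in the list above can be sketched as follows. This is an illustration under stated assumptions, not code from ext4 or from any particular program: the function names are made up, error handling is reduced to a single return code, and posix_fallocate() (the portable wrapper) is swapped in for the Linux-specific fallocate() syscall.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write len bytes to tmp_path, force them to stable storage with
 * fdatasync(), then rename tmp_path over final_path.
 * Returns 0 on success, -1 on error.
 */
int write_then_rename(const char *tmp_path, const char *final_path,
                      const void *data, size_t len)
{
    const char *p = data;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Without this, delayed allocation may still hold the data in RAM. */
    if (fdatasync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(tmp_path, final_path);
}

/*
 * Reserve len bytes of disk space for fd, starting at offset 0.
 * Returns 0 on success or an errno value, as posix_fallocate() does.
 */
int preallocate(int fd, off_t len)
{
    return posix_fallocate(fd, 0, len);
}
```

Calling fdatasync() before rename() trades some speed for the guarantee that the new name never points at a file whose data has not reached the disk.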
