ZFS Fun
Introduction
ZFS has been available on Linux for quite some time as a userland implementation (via FUSE). It seems to support most of the features available on Solaris (up to which ZFS version?). RAID-Z and RAID-Z2 are also supported.
ZFS features and limitations
ZFS offers an impressive number of features, even putting aside its hybrid nature (both a filesystem and a volume manager, zvol), which is covered in detail on Wikipedia. One of the most fundamental points to keep in mind about ZFS is that it aims for legendary reliability in terms of preserving data integrity. ZFS uses several techniques to detect and repair (self-heal) corrupted data: simply put, it makes aggressive use of checksums and relies on data redundancy. The price to pay is that it requires a bit more CPU processing power than traditional filesystems and RAID solutions. However, the Wikipedia article about ZFS also mentions that using ZFS on top of classic RAID arrays is strongly discouraged, as ZFS then cannot control the data redundancy, which ruins most of its benefits.
In short, ZFS has the following features (not exhaustive):
- Storage pools (if you are used to BTRFS, its volumes are the familiar equivalent)
- Plenty of space:
- 2^128 bytes per storage pool (2^64 storage pools max in a system).
- 16 exabytes max for a single file
- 2^48 entries max per directory
- Virtual block device support over a ZFS pool (zvol) (extremely cool when used together with a RAID-Z volume)
- Read-only snapshot support (a read-write copy of a snapshot can be created; such copies are called clones)
- Encryption support (only at ZFS version 30 and later; ZFS version 31 ships with Oracle Solaris 11, so that release is mandatory if you plan to encrypt your ZFS datasets/pools)
- Built-in RAID-5-like-on-steroids capabilities known as RAID-Z and RAID-6-like-on-steroids capabilities known as RAID-Z2. RAID-Z3 (triple parity) also exists.
- Copy-on-Write transactional filesystem
- Meta-attribute support (properties) allowing you to easily drive the show, like "that directory is encrypted", "that directory is limited to 5 GiB", "that directory is exported via NFS" and so on. Depending on what you define, ZFS takes the appropriate actions!
- Dynamic striping to optimize data throughput
- Variable block length
- Data duplication
- Automatic pool re-silvering
- Transparent data compression / encryption (the latter requires Solaris 11)
Most notable limitations are:
- Lack of a feature ZFS developers know as "Block Pointer rewrite functionality" (planned to be developed). Without it, ZFS currently does not support:
- Pool defragmentation (COW techniques used in ZFS mitigates the problem)
- Pool resizing
- Data compression (re-applying)
- Adding a device to a RAID-Z/Z2/Z3 pool to increase its size (however, it is possible to replace, one after another, each of the disks composing a RAID-Z/Z2/Z3)
- NOT A CLUSTERED FILESYSTEM like Lustre, GFS or OCFS2
- No data healing if used on a single device (corruption can still be detected); the workaround is to force data duplication on the drive
- No support of TRIMming (SSD devices)
ZFS on well known operating systems
Linux
Although the source code of ZFS is open, its license (Sun CDDL) is incompatible with the license governing the Linux kernel (GNU GPL v2), which prevents its direct integration. A couple of ports exist, but they suffer from immaturity and a lack of features. As of writing (September 2011), two implementations are known:
- ZFS-fuse: a totally userland implementation relying on FUSE. Funtoo provides version 0.7.0 in its portage tree. Worth mentioning about it:
- It supports zpool version 23
- It has improved robustness and stability
- It does not support zvols (feature not planned in the near future according to the project roadmap)
- ZFS on Linux: a native implementation of ZFS in kernel mode. The project claims to have "a functional and stable SPA, DMU, ZVOL, and Posix Layer (ZPL)". The current upstream version is 0.6.0-rc5 (it can mount ZFS filesystems and supports zpool version 28); however, neither Gentoo nor Funtoo has ebuilds for this port (yet). As ZFS on Linux is an out-of-tree kernel implementation, patches must be awaited after each Linux kernel release. As of September 2011, the project claims to support Linux 2.6.26 up to Linux 3.0.0; Linux 3.1 series kernels are not officially supported, and ZFS on Linux is far from being mature and usable on production systems. It suffers from a couple of major issues like:
Solaris/OpenIndiana
- Oracle Solaris: remains the de facto reference platform for ZFS: on this platform ZFS is now considered mature and usable on production systems. Solaris 11 uses ZFS even for its "system" pool (aka rpool). A great advantage of this: it is now quite easy to revert the effect of a patch, provided a snapshot was taken just before applying it. In the "good old" days of Solaris 10 and before, reverting a patch could be tricky and complex, when it was possible at all. ZFS is far from new in Solaris: it has its roots in 2005 and was then integrated into Solaris 10 6/06, released in June 2006.
- OpenIndiana: is based on the illumos kernel (a derivative of the now defunct OpenSolaris), which aims to provide absolute binary compatibility with Sun/Oracle Solaris. Worth mentioning: the Solaris kernel and the illumos kernel used to share the same code base, but they have followed different paths since Oracle announced the discontinuation of OpenSolaris (August 13th 2010). Like Oracle Solaris, OpenIndiana uses ZFS for its system pool. The illumos kernel's ZFS support lags a bit behind Oracle's: it supports zpool version 28, whereas Oracle Solaris 11 supports zpool version 31, data encryption being supported from zpool version 30.
*BSD
- FreeBSD: ZFS has been present in FreeBSD since FreeBSD 7 (zpool version 6) and FreeBSD can boot from a ZFS volume (zfsboot). ZFS support has been vastly enhanced in FreeBSD 8.x (8.0 supports zpool version 13) and FreeBSD 9 (currently at beta-1, the latter supporting zpool version 28). ZFS in FreeBSD is now considered fully functional and mature. FreeBSD derivatives such as the popular FreeNAS take advantage of ZFS and have integrated it in their tools. FreeNAS, for example, supports zvols through its Web management interface (FreeNAS >= 8.0.1).
- NetBSD: the ZFS port was started as a GSoC project in 2007 and ZFS has been present in the NetBSD mainline since 2009 (zpool version 13).
- OpenBSD: No ZFS support yet and not planned until Oracle changes some policies according to the project FAQ.
ZFS alternatives
- WAFL seems to have severe limitations [1] (the document is not dated); another interesting article lies here
- BTRFS is advancing every week but it still lacks features such as the capability of emulating a virtual block device over a storage pool (zvol), and it has built-in support for RAID-0/1 only. At the time of writing, it is still experimental, whereas ZFS is used on big production servers.
- VxFS has also been targeted by comparisons like this one (a bit controversial). VxFS has been known in the industry since 1993 and has a legendary flexibility. Symantec acquired VxFS and now offers a basic version of it (no clustering, for example) under the name Veritas Storage Foundation Basic
- An interesting discussion about modern filesystems can be found on OSNews.com
ZFS vs BTRFS
BTRFS and ZFS are siblings in their concepts and, of course, have differences:
- both are transactional filesystems (in BTRFS a transaction is a sequence of low-level operations)
- both implement, for example, the pool concept (called a "volume" in BTRFS)
- both can do snapshots, although in ZFS a snapshot is read-only and its attributes can't be modified. BTRFS, for its part, has writable snapshots (known as clones in ZFS)
- both can organize their storage pool in several logical divisions (called datasets in ZFS and subvolumes in BTRFS).
- Like their BTRFS equivalent (subvolumes), ZFS datasets appear as directories
- Whereas a ZFS snapshot is "hidden" in a sub-directory (named .zfs), BTRFS snapshots appear as visible directories
- While ZFS manages rollback in a transparent manner (the filesystem knows where and how to roll back the data), rolling back data in BTRFS requires a bit more work, as the system administrator must unmount/remount a BTRFS subvolume.
- ZFS has a kind of sophisticated RAID-5 called RAID-Z (and now RAID-Z2 ~ RAID-6); similar capabilities are planned for BTRFS but not yet available as of September 2011
- A ZFS filesystem can be snapshotted and sent through the network; BTRFS has not yet reached that level of integration
- Whereas ZFS makes aggressive use of properties to govern the behaviour of the different datasets (quotas, sharing over NFS, encryption, compression and so on), BTRFS does not use this notion, or only in a much lighter manner and only through the mount command.
- ZFS has no journal (!); this is not a design flaw but an interesting intrinsic feature :) See page 7 of "ZFS The last word on filesystems". Also worth mentioning: BTRFS still lacks a viable filesystem checking tool (announced in August 2011) and sometimes crashes when an invalid log is encountered. BTRFS tools present in experimental branches can however mitigate the problem by allowing the system administrator to clear the BTRFS log in case of a disaster (see our article BTRFS Fun).
ZFS resource naming restrictions
Before going further, you must be aware of restrictions concerning the names you can use on a ZFS filesystem. The general rule is: you can use all of the alphanumeric characters plus the following special characters:
- Underscore (_)
- Hyphen (-)
- Colon (:)
- Period (.)
The name used to designate a ZFS pool has no particular restriction except:
- it can't be one of the reserved words, in particular:
- mirror
- raidz (raidz2, raidz3 and so on)
- spare
- cache
- log
- names must begin with an alphanumeric character (same for ZFS datasets).
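The rules above can be turned into a quick pre-flight check. The following is only a sketch: is_valid_pool_name is a hypothetical helper name, and it encodes just the restrictions listed in this section, not the complete set of checks zpool itself performs.

```shell
#!/bin/sh
# Hypothetical helper: checks a candidate pool name against the rules
# listed in this section only (not the full set enforced by zpool).
is_valid_pool_name() {
    name=$1
    case $name in
        # Reserved words (raidz* also covers raidz2, raidz3 and so on)
        mirror|raidz*|spare|cache|log) return 1 ;;
    esac
    case $name in
        # Empty names and names not starting with an alphanumeric character
        ''|[!A-Za-z0-9]*) return 1 ;;
        # Any character outside alphanumerics, '_', '-', ':' and '.'
        *[!A-Za-z0-9_:.-]*) return 1 ;;
    esac
    return 0
}

# Try a few candidate names
for candidate in myfirstpool mirror raidz2 _tank my:pool.1; do
    if is_valid_pool_name "$candidate"; then
        echo "$candidate: accepted"
    else
        echo "$candidate: rejected"
    fi
done
```

Treating everything that starts with raidz as reserved is a deliberately cautious reading of "raidz (raidz2, raidz3 and so on)".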
Playing with ZFS
Requirements
- Kernel with FUSE stuff enabled
- sys-fs/zfs-fuse installed
- /etc/init.d/zfs started (automatically detects and mounts pools)
- Disk size of 64 Mbytes as a bare minimum (128 Mbytes is the minimum size of a pool). Multiple disks will be simulated through the use of several raw images accessed via the Linux loopback devices.
- At least 512 MB of RAM
Your first ZFS pool
To start with, four raw disks (2 GB each) are created:
# for i in 0 1 2 3; do dd if=/dev/zero of=/tmp/zfs-test-disk0${i}.img bs=2G count=1; done
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 40.3722 s, 53.2 MB/s
...
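The dd transcript above reports a short read (0+1 records, 2147479552 bytes), so each image ends up slightly under 2 GiB and takes a while to write. A possible alternative, assuming the truncate utility from GNU coreutils is available, is to create sparse files instead. The helper name make_disk_images is hypothetical; the file names follow the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper: create COUNT sparse image files of SIZE each in DIR,
# named zfs-test-disk00.img, zfs-test-disk01.img and so on.
make_disk_images() {
    dir=$1
    count=$2
    size=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # truncate only sets the file length, no data blocks are written
        truncate -s "$size" "$dir/zfs-test-disk0$i.img" || return 1
        i=$((i + 1))
    done
}

# Usage (commented out so that sourcing this file has no side effects):
# make_disk_images /tmp 4 2G
```

Sparse images consume disk space only as the pool writes to them, which is convenient for a throwaway test setup.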
Then let's see what loopback devices are in use and which is the first free:
# losetup -a
# losetup -f
/dev/loop0
In the above example nothing is in use and the first available loopback device is /dev/loop0. Now associate each of the disks with a loopback device (/tmp/zfs-test-disk00.img -> /dev/loop0, /tmp/zfs-test-disk01.img -> /dev/loop1 and so on):
# for i in 0 1 2 3; do losetup /dev/loop${i} /tmp/zfs-test-disk0${i}.img; done
# losetup -a
/dev/loop0: [000c]:781455 (/tmp/zfs-test-disk00.img)
/dev/loop1: [000c]:806903 (/tmp/zfs-test-disk01.img)
/dev/loop2: [000c]:807274 (/tmp/zfs-test-disk02.img)
/dev/loop3: [000c]:781298 (/tmp/zfs-test-disk03.img)
Pool creation
It is now time to create our first ZFS data pool, and this is accomplished by one of the two commands you have to remember: zpool. For now, we will ask it to do a simple job: take all of the devices we just created and build an aggregated pool:
# zpool create myfirstpool /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
# mount
...
kstat on /zfs-kstat type fuse (rw,nosuid,nodev,allow_other)
myfirstpool on /myfirstpool type fuse (rw,allow_other,default_permissions)
Note that the pool has also been mounted on /myfirstpool! Forget kstat for now; it is mounted automatically by zfs-fuse and contains some performance statistics. Oh, by the way, we have used block devices (loopback devices are block devices) to create our ZFS pool; however, ZFS can also deal directly with files, and the ZFS terminology uses the term vdev (virtual device). Let's be a bit curious and see what df reports:
# df -h
myfirstpool 7.9G 21K 7.9G 1% /myfirstpool
Cool! About 8 GB are reported, which is roughly the sum of our four vdevs minus some metadata. What can we do with 8 GB of free storage space? Copy some files into it, of course!
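As a quick sanity check of that figure, the raw capacity can be worked out from the image size printed by dd earlier (2147479552 bytes per image). This is only arithmetic on numbers already shown in this article; the variable names are illustrative.

```shell
#!/bin/sh
# Size of one image, taken from the dd transcript earlier in this article.
image_bytes=2147479552
image_count=4

# The raw capacity of a plain (non-redundant) pool is the sum of its vdevs.
total_bytes=$((image_bytes * image_count))
echo "raw bytes: $total_bytes"

# Convert to GiB (2^30 bytes); awk is used because sh arithmetic is integer-only.
total_gib=$(LC_ALL=C awk -v b="$total_bytes" 'BEGIN { printf "%.2f", b / (1024 * 1024 * 1024) }')
echo "raw GiB: $total_gib"
```

The gap between this raw total and the 7.9G reported by df is the metadata overhead mentioned above.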
Some file operations
# cp -a /usr/src/linux-3.1-rc4 /myfirstpool
# df -h
myfirstpool 7.9G 662M 7.2G 9% /myfirstpool
# cd /myfirstpool
# ls -l /myfirstpool
total 3
drwxrwxr-x 24 root root 56 Aug 29 08:41 linux-3.1-rc4
# ls -l /myfirstpool/linux-3.1-rc4
total 29
-rw-rw-r-- 1 root root 18693 Aug 29 00:16 COPYING
-rw-rw-r-- 1 root root 94790 Aug 29 00:16 CREDITS
drwxrwxr-x 94 root root 222 Aug 29 00:16 Documentation
-rw-rw-r-- 1 root root 2464 Aug 29 00:16 Kbuild
-rw-rw-r-- 1 root root 252 Aug 29 00:16 Kconfig
-rw-rw-r-- 1 root root 200918 Aug 29 00:16 MAINTAINERS
-rw-rw-r-- 1 root root 53537 Aug 29 00:16 Makefile
-rw-r--r-- 1 root root 364907 Aug 29 08:41 Module.symvers
-rw-rw-r-- 1 root root 17459 Aug 29 00:16 README
....
drwxrwxr-x 22 root root 41 Aug 29 08:41 sound
drwxrwxr-x 9 root root 9 Aug 29 00:16 tools
drwxrwxr-x 2 root root 11 Aug 29 08:38 usr
drwxrwxr-x 3 root root 3 Aug 29 00:16 virt
-rwxr-xr-x 1 root root 13126551 Aug 29 08:41 vmlinux
-rw-r--r-- 1 root root 14771911 Aug 29 08:41 vmlinux.o
# make clean
# df -h
Filesystem Size Used Avail Use% Mounted on
...
myfirstpool 7.9G 444M 7.4G 6% /myfirstpool
Nothing magic in fact: a ZFS pool acts just like any other filesystem :)
Unmounting/remounting the pool
If ZFS behaves just like any other filesystem, can we unmount it?
# umount /myfirstpool
# mount | grep myfirstpool
#
No more /myfirstpool in our line of sight. So yes, it is possible to unmount a ZFS pool just like any other filesystem. But... how can we remount it then? Simple! First check the list of all ZFS pools known by the system:
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
myfirstpool 7.94G 444M 7.50G 5% 1.00x ONLINE -
Then mount it again:
# zfs mount myfirstpool
Oh! Did you notice? We used the zfs command instead of the zpool command. You will understand the reason for using zfs instead of zpool a bit later; for now just remember that zfs and zpool are the only two commands used to interact with the ZFS universe. Also note that zfs mount is the one and only way to remount a ZFS pool in the VFS tree, so you cannot get confused or make mistakes.
The missing leading / ahead of myfirstpool is not a typo. When a pool is created, ZFS writes in the pool metadata where it must be mounted. Unless overridden, it is assumed that the pool is to be mounted directly under the VFS root, in a mountpoint which has the same name as the pool.
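The default rule described here is simple enough to express in a line of shell. The helper below, default_mountpoint, is a hypothetical name used for illustration and only models the default case; it knows nothing about a mountpoint that has been overridden. On a real system the effective value can be read with zfs get mountpoint followed by the pool or dataset name.

```shell
#!/bin/sh
# Hypothetical helper: derive the default mountpoint from a ZFS name.
# By default the mountpoint is the VFS root followed by the ZFS name.
default_mountpoint() {
    printf '/%s\n' "$1"
}

default_mountpoint myfirstpool
```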
Let's check what happened:
# mount | grep myfirstpool
myfirstpool on /myfirstpool type fuse (rw,allow_other,default_permissions)
# ls -l /myfirstpool
total 3
drwxrwxr-x 23 root root 33 Sep 4 18:18 linux-3.1-rc4
Everything is back again!
ZFS datasets
Just as your house is a kind of big container subdivided into many other containers (rooms), a ZFS pool can be divided into several logical containers known as datasets. Basically, the role of a dataset is to fulfill the well-known adage "divide and conquer": datasets define the boundaries within which all ZFS operations take place. For example, it is only possible to take a snapshot of, or roll back, a dataset as a whole.
Creating and destroying datasets
Creating a dataset in a pool is pretty easy: you invoke the zfs command and give it the name of the pool to divide and the name of the dataset to create. To create three datasets named myfirstDS, mysecondDS and mythirdDS in myfirstpool (again, the missing / ahead of myfirstpool is not a typo):
# zfs create myfirstpool/myfirstDS
# zfs create myfirstpool/mysecondDS
# zfs create myfirstpool/mythirdDS
# ls -l /myfirstpool
total 7
drwxrwxr-x 23 root root 33 Sep 4 18:18 linux-3.1-rc4
drwxr-xr-x 2 root root 2 Sep 4 23:34 myfirstDS
drwxr-xr-x 2 root root 2 Sep 4 23:34 mysecondDS
drwxr-xr-x 2 root root 2 Sep 4 23:34 mythirdDS
Datasets are appearing just as if they were regular directories. Are they? Try to remove one of those:
# rmdir /myfirstpool/myfirstDS
rmdir: failed to remove `/myfirstpool/myfirstDS': Device or resource busy
This behavior is absolutely normal: datasets are special entities and must be managed via ZFS commands. One problem: how can a regular directory with files opened by a running process be distinguished from a ZFS dataset? Both look similar! Here again, the zfs command comes to the rescue:
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 444M 7.38G 444M /myfirstpool
myfirstpool/myfirstDS 21K 7.38G 21K /myfirstpool/myfirstDS
myfirstpool/mysecondDS 21K 7.38G 21K /myfirstpool/mysecondDS
myfirstpool/mythirdDS 21K 7.38G 21K /myfirstpool/mythirdDS
It is not obvious, but zfs list also reveals a great secret: we lied to you in the previous paragraphs. It is not possible to mount a ZFS pool in the VFS tree, as only datasets can be mounted. So where is the trick? Our myfirstpool was mounted in the VFS and you never defined any dataset in it. How is that possible? Is there some ZFS black magic behind it? No. When you created the ZFS pool myfirstpool, a special dataset was also created automatically in the pool for you: the root dataset. When you typed zfs mount myfirstpool, you in fact interacted with this root dataset and not with the pool itself. The operation was transparent and you never noticed its presence, although using the zfs command instead of zpool could have given you a hint about what lies under the hood. You can see that root dataset in the first line of what zfs list reported in the example above.
So the root dataset (myfirstpool) is mounted on /myfirstpool, myfirstDS is then mounted inside (/myfirstpool/myfirstDS) ditto for mysecondDS and mythirdDS. Mounted is the exact term because if we have a look at what the mount command reports we can see that those datasets have been effectively mounted:
# mount
rootfs on / type rootfs (rw)
...
myfirstpool on /myfirstpool type fuse (rw,allow_other,default_permissions)
myfirstpool/myfirstDS on /myfirstpool/myfirstDS type fuse (rw,allow_other,default_permissions)
myfirstpool/mysecondDS on /myfirstpool/mysecondDS type fuse (rw,allow_other,default_permissions)
myfirstpool/mythirdDS on /myfirstpool/mythirdDS type fuse (rw,allow_other,default_permissions)
As we did before, we can copy some files in the newly created datasets just like they were regular directories:
# cp -a /usr/portage /myfirstpool/mythirdDS
# ls -l /myfirstpool/mythirdDS/*
total 438
drwxr-xr-x 45 root root 46 Aug 31 07:37 app-accessibility
drwxr-xr-x 202 root root 203 Sep 2 07:21 app-admin
drwxr-xr-x 3 root root 4 Aug 18 18:13 app-antivirus
drwxr-xr-x 93 root root 94 Aug 18 18:13 app-arch
drwxr-xr-x 38 root root 39 Aug 18 18:13 app-backup
drwxr-xr-x 30 root root 31 Aug 18 18:13 app-benchmarks
drwxr-xr-x 66 root root 67 Aug 18 18:13 app-cdr
drwxr-xr-x 96 root root 97 Aug 18 18:13 app-crypt
drwxr-xr-x 358 root root 359 Aug 18 18:13 app-dicts
...
# df -h | grep DS
myfirstpool/myfirstDS 5.6G 21K 5.6G 1% /myfirstpool/myfirstDS
myfirstpool/mysecondDS 5.6G 21K 5.6G 1% /myfirstpool/mysecondDS
myfirstpool/mythirdDS 7.4G 1.9G 5.6G 25% /myfirstpool/mythirdDS
Notice what df returns: our four datasets (don't forget the root dataset!) share the same storage capacity. Logical indeed: as they are all contained in the same pool, they cannot exceed its storage capacity. Is it possible to cap the maximum capacity of a dataset? Yes; for now just remember that datasets:
- are logical containers where ZFS operations take place
- are affected as a whole by ZFS operations (again: you cannot snapshot/roll back a particular directory located in a dataset, you can only operate at the dataset level)
We have three datasets, but the third is pretty useless and contains a lot of garbage. Is it possible to remove it with a simple rm -rf? Let's try:
# rm -rf /myfirstpool/mythirdDS
rm: cannot remove `/myfirstpool/mythirdDS': Device or resource busy
This is perfectly normal: remember that datasets are special entities that require special care and cannot be deleted with regular shell commands. However, it is possible to destroy them and here again, the zfs command comes to our rescue:
# zfs destroy myfirstpool/mythirdDS
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 444M 7.38G 444M /myfirstpool
myfirstpool/myfirstDS 21K 7.38G 21K /myfirstpool/myfirstDS
myfirstpool/mysecondDS 21K 7.38G 21K /myfirstpool/mysecondDS
Et voila! No more third dataset. :)
A slightly more subtle case: let's recreate mythirdDS, put a nested dataset in it, then try to destroy mythirdDS again:
# zfs create myfirstpool/mythirdDS
# zfs create myfirstpool/mythirdDS/nestedDS
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 444M 7.38G 444M /myfirstpool
myfirstpool/myfirstDS 21K 7.38G 21K /myfirstpool/myfirstDS
myfirstpool/mysecondDS 21K 7.38G 21K /myfirstpool/mysecondDS
myfirstpool/mythirdDS 42K 7.38G 21K /myfirstpool/mythirdDS
myfirstpool/mythirdDS/nestedDS 21K 7.38G 21K /myfirstpool/mythirdDS/nestedDS
# zfs destroy myfirstpool/mythirdDS
cannot destroy 'myfirstpool/mythirdDS': filesystem has children
use '-r' to destroy the following datasets:
myfirstpool/mythirdDS/nestedDS
zfs tells us it has found other datasets located in mythirdDS and is therefore unable to delete it unless you consent to a recursive destruction (-r parameter). Before trying to destroy the dataset again, let's create some more nested datasets plus a couple of directories inside mythirdDS:
# zfs create myfirstpool/mythirdDS/mynestedDS
# zfs create myfirstpool/mythirdDS/mynestedDS2
# zfs create myfirstpool/mythirdDS/mynestedDS3
# mkdir /myfirstpool/mythirdDS/dir1
# mkdir /myfirstpool/mythirdDS/dir2
# mkdir /myfirstpool/mythirdDS/dir3
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 444M 7.38G 444M /myfirstpool
myfirstpool/myfirstDS 21K 7.38G 21K /myfirstpool/myfirstDS
myfirstpool/mysecondDS 21K 7.38G 21K /myfirstpool/mysecondDS
myfirstpool/mythirdDS 84K 7.38G 21K /myfirstpool/mythirdDS
myfirstpool/mythirdDS/mynestedDS 21K 7.38G 21K /myfirstpool/mythirdDS/mynestedDS
myfirstpool/mythirdDS/mynestedDS2 21K 7.38G 21K /myfirstpool/mythirdDS/mynestedDS2
myfirstpool/mythirdDS/mynestedDS3 21K 7.38G 21K /myfirstpool/mythirdDS/mynestedDS3
Now what happens if we try to destroy mythirdDS again, this time with '-r'?
# zfs destroy -r myfirstpool/mythirdDS
cannot destroy 'myfirstpool/mythirdDS/mynestedDS': dataset is busy
This is not quite the normal behavior and seems to be a bug in zfs-fuse: the expected behavior is to automatically unmount every dataset contained inside mythirdDS and then destroy them all, including mythirdDS itself. The same kind of operation on a Solaris machine with a similar dataset structure gives:
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
....
rpool1/swap 4.04G 23.2G 123M -
testpool/test 55.4K 3.76T 55.4K /testpool/test
testpool/test/ds1 44.9K 3.76T 44.9K /testpool/test/ds1
testpool/test/ds2 44.9K 3.76T 44.9K /testpool/test/ds2
testpool/test/ds3 44.9K 3.76T 44.9K /testpool/test/ds3
testpool/test2 44.9K 3.76T 44.9K /testpool/test2
# mkdir /testpool/test/dir1
# mkdir /testpool/test/dir2
# mkdir /testpool/test/dir3
# zfs destroy -r testpool/test
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
....
rpool1/swap 4.04G 23.2G 123M -
testpool/test2 44.9K 3.76T 44.9K /testpool/test2
Back on ZFS Fuse, just make a few attempts and mythirdDS should vanish (you may also have to do an explicit zfs destroy myfirstpool/mythirdDS at the end).
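Since the expected behavior is to unmount the nested datasets before destroying them, one thing that might be worth trying instead of blind retries is to unmount the children by hand, deepest first. Whether this actually avoids the zfs-fuse error has not been verified here. The sketch below only covers the ordering step: deepest_first is a hypothetical helper that sorts dataset names read from standard input so that children come before their parents.

```shell
#!/bin/sh
# Hypothetical helper: reorder dataset names (one per line on stdin)
# so that the most deeply nested ones come first.
deepest_first() {
    # Prefix each name with its depth (number of /-separated components),
    # sort by depth in descending order, then strip the prefix again.
    awk -F/ '{ printf "%d\t%s\n", NF, $0 }' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2 |
        cut -f2-
}

# Possible usage on a machine with ZFS (commented out, untested on zfs-fuse):
# zfs list -H -o name -r myfirstpool/mythirdDS | deepest_first |
#     while read -r ds; do zfs unmount "$ds"; done
# zfs destroy -r myfirstpool/mythirdDS
```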
Snapshotting and rolling back a dataset
This is, by far, one of the coolest features of ZFS: you can literally take a photograph of a dataset, do whatever you want with the dataset, then restore it to the exact same state, just as if nothing had ever happened in between. To start with, let's copy some files into mysecondDS:
# cp -a /usr/portage /myfirstpool/mysecondDS
# ls -l /myfirstpool/mysecondDS/portage
total 200
drwxr-xr-x 45 root root 46 Aug 31 07:37 app-accessibility
drwxr-xr-x 202 root root 203 Sep 2 07:21 app-admin
drwxr-xr-x 3 root root 4 Aug 18 18:13 app-antivirus
drwxr-xr-x 93 root root 94 Aug 18 18:13 app-arch
...
drwxr-xr-x 57 root root 58 Aug 22 08:56 x11-wm
drwxr-xr-x 16 root root 17 Aug 18 18:13 xfce-base
drwxr-xr-x 54 root root 55 Aug 18 18:13 xfce-extra
Now, let's take a snapshot of mysecondDS. Because we manipulate a dataset and not the pool, we rely on the zfs command:
# zfs snapshot myfirstpool/mysecondDS@Charlie
The syntax is always pool/dataset@snapshot-name. The name of the snapshot is left to your discretion; however, you must use an at sign (@) to separate the snapshot name from the rest of the path.
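To make that structure concrete, here is a small sketch that splits a snapshot name into its parts using nothing but POSIX parameter expansion. The function name parse_snapshot_name and its output format are made up for this illustration; they are not part of the ZFS tools.

```shell
#!/bin/sh
# Hypothetical helper: split pool/dataset@snapshot-name into its parts.
parse_snapshot_name() {
    full=$1
    case $full in
        *@*) ;;    # a snapshot name must contain the @ separator
        *) echo "not a snapshot name: $full" >&2; return 1 ;;
    esac
    snap=${full##*@}       # everything after the last @
    dataset=${full%@*}     # everything before the last @
    pool=${dataset%%/*}    # first component of the dataset path
    printf 'pool=%s dataset=%s snapshot=%s\n' "$pool" "$dataset" "$snap"
}

parse_snapshot_name myfirstpool/mysecondDS@Charlie
```

Note that the dataset part keeps the pool name as its first component, which matches what zfs list prints.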
After running that command, let's see what the dataset looks like:
# ls -la /myfirstpool/mysecondDS
total 9
drwxr-xr-x 3 root root 3 Sep 5 16:49 .
drwxr-xr-x 6 root root 6 Sep 5 15:43 ..
drwxr-xr-x 164 root root 169 Aug 18 18:25 portage
You did not expect to see something like @Charlie or Charlie lying in /myfirstpool/mysecondDS, did you? Of course not, this is obvious ;-) Can zfs be of any help this time? It has rescued us several times in the past:
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 2.27G 5.54G 444M /myfirstpool
myfirstpool/myfirstDS 21K 5.54G 21K /myfirstpool/myfirstDS
myfirstpool/mysecondDS 1.84G 5.54G 1.84G /myfirstpool/mysecondDS
#
So where the heck is Charlie? And how on earth can we use it if *nothing* is visible to us? Again the answer is: zfs! This time we invoke it with the -t parameter set to 'all', meaning "list all datasets including snapshots":
# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 2.27G 5.54G 444M /myfirstpool
myfirstpool/myfirstDS 21K 5.54G 21K /myfirstpool/myfirstDS
myfirstpool/mysecondDS 1.84G 5.54G 1.84G /myfirstpool/mysecondDS
myfirstpool/mysecondDS@Charlie 37K - 1.84G -
#
Notice that Charlie is not mounted and that, although mysecondDS holds nearly 2 GB of data, Charlie takes only 37K in the dataset. This is a consequence of ZFS being a copy-on-write filesystem: duplicating all of the data blocks is not required. When a data block changes, ZFS writes the new version elsewhere, leaving intact the original block referenced by the snapshot. At the time they are taken, snapshots occupy very little space in the dataset; however, as time goes on, they tend to keep more and more data blocks in use. It is wise to delete snapshots once they are no longer needed.
OpenIndiana and Oracle Solaris support an interesting feature not available in ZFS Fuse: a kind of secret door in the form of a virtual directory named .zfs (notice the leading dot). "Secret door" because it really is secret! You cannot see it even with ls -la; however, .zfs is present in every one of your datasets and holds some very interesting clues:
# zfs list -t all
...
testpool/test2 205K 3.76T 70.3K /testpool/test2
testpool/test2@snap1 0 - 70.3K -
# cd /testpool/test2
# ls -la
total 22
drwxr-xr-x 11 root root 11 2011-09-05 17:34 .
drwxr-xr-x 6 root root 6 2011-09-05 16:13 ..
drwxr-xr-x 2 root root 2 2011-09-05 17:34 .sometest
drwxr-xr-x 2 root root 2 2011-09-05 17:34 .xyz
drwxr-xr-x 2 root root 2 2011-09-05 16:13 dir1
drwxr-xr-x 2 root root 2 2011-09-05 16:13 dir2
...
# cd /testpool/test2/.zfs
# pwd
/testpool/test2/.zfs
# ls -l
total 2
dr-xr-xr-x 2 root root 2 2011-09-05 16:13 shares
dr-xr-xr-x 3 root root 3 2011-09-05 17:19 snapshot
# cd snapshot
# ls -l
total 2
drwxr-xr-x 9 root root 9 2011-09-05 17:19 snap1
# cd snap1
# ls -l
total 22
drwxr-xr-x 11 root root 11 2011-09-05 17:34 .
drwxr-xr-x 6 root root 6 2011-09-05 16:13 ..
drwxr-xr-x 2 root root 2 2011-09-05 17:34 .sometest
drwxr-xr-x 2 root root 2 2011-09-05 17:34 .xyz
drwxr-xr-x 2 root root 2 2011-09-05 16:13 dir1
drwxr-xr-x 2 root root 2 2011-09-05 16:13 dir2
...
Although you cannot change the contents of a snapshot, you can examine them without having to roll the snapshot back. An extremely nifty design choice from the ZFS designers!
Now that we have found Charlie, let's make some changes in mysecondDS:
# rm -rf /myfirstpool/mysecondDS/portage
# echo "Hello, world" > /myfirstpool/mysecondDS/hello.txt
# ls -l /myfirstpool/mysecondDS
total 1
-rw-r--r-- 1 root root 13 Sep 5 18:07 hello.txt
# cat /myfirstpool/mysecondDS/hello.txt
Hello, world
Whoops... removing portage was not the best idea, and we do not care about hello.txt. We will have to go back to checkpoint Charlie!
# zfs rollback myfirstpool/mysecondDS@Charlie
# ls -l /myfirstpool/mysecondDS
total 6
drwxr-xr-x 164 root root 169 Aug 18 18:25 portage
Again, ZFS handled everything for you and you now have the contents of mysecondDS exactly as they were at the time the snapshot Charlie was taken. It is no more complicated than that. Hang on to your hat, we have not finished.
Dealing with several snapshots (time-traveling machine)
So far we have only used a single snapshot, to keep things simple. However, a dataset can hold several snapshots, and you can even compute a delta between two snapshots. None of this is much more complicated than what you have seen so far.
Let's consider myfirstDS this time. This dataset should be empty as we did nothing in it so far:
# ls -la /myfirstpool/myfirstDS
total 3
drwxr-xr-x 2 root root 2 Sep 4 23:34 .
drwxr-xr-x 6 root root 6 Sep 5 15:43 ..
Now generate some contents, take a snapshot (snapshot-1), add more content, take a snapshot again (snapshot-2), do some more modifications and take a third snapshot (snapshot-3):
# echo "Hello, world" > /myfirstpool/myfirstDS/hello.txt
# cp /usr/src/linux-3.1-rc4.tar.bz2 /myfirstpool/myfirstDS
# ls -l /myfirstpool/myfirstDS
total 75580
-rw-r--r-- 1 root root 13 Sep 5 22:38 hello.txt
-rw-r--r-- 1 root root 77220912 Sep 5 22:38 linux-3.1-rc4.tar.bz2
# zfs snapshot myfirstpool/myfirstDS@snapshot-1
# echo "Goodbye, world" > /myfirstpool/myfirstDS/goodbye.txt
# echo "Are you there?" >> /myfirstpool/myfirstDS/hello.txt
# cp /usr/src/linux-3.0.tar.bz2 /myfirstpool/myfirstDS
# rm /myfirstpool/myfirstDS/linux-3.1-rc4.tar.bz2
# zfs snapshot myfirstpool/myfirstDS@snapshot-2
# echo "Still there?" >> /myfirstpool/myfirstDS/goodbye.txt
# rm /myfirstpool/myfirstDS/hello.txt
# cp /proc/config.gz /myfirstpool/myfirstDS
# zfs snapshot myfirstpool/myfirstDS@snapshot-3
# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 2.41G 5.40G 444M /myfirstpool
myfirstpool/myfirstDS 147M 5.40G 73.3M /myfirstpool/myfirstDS
myfirstpool/myfirstDS@snapshot-1 73.8M - 73.8M -
myfirstpool/myfirstDS@snapshot-2 20K - 73.3M -
myfirstpool/myfirstDS@snapshot-3 0 - 73.3M -
Wow, a nice demonstration of how a copy-on-write filesystem like ZFS works. What do we observe? First, snapshot-1 is obviously quite big. Is it possible that such a big snapshot is the consequence of removing /myfirstpool/myfirstDS/linux-3.1-rc4.tar.bz2? Absolutely. Remember that a snapshot is a photograph of what a dataset contains at a given time: the original data is retained by the snapshot even if you later delete it from the dataset or change it. If you look again at the command history between snapshot-2 and snapshot-3, you will notice that we removed a small file and slightly changed another small file, so the delta between what the dataset contained at that time and what it contains now is small, leading to a very small snapshot. The third snapshot is an exact copy of what the dataset currently contains, thus its size is very close to zero (truncated to zero in the listing).
$100 question: "Can I see what changed between snapshots?" Answer: yes, you can! The catch: ZFS Fuse does not support it yet :( Nevertheless, here is what snapshot diffing looks like on an OpenIndiana/Solaris machine:
# zfs create san/test2
# cd /san/test2
# wget http://www.kernel.org/pub/linux/kernel/v3.0/testing/patch-3.1-rc4.bz2
# echo "Hello,world" > hello.txt
# zfs snapshot san/test2@s1
# rm patch-3.1-rc4.bz2
# echo 'Goodbye!' > goodbye.txt
# echo 'Still there?' >> hello.txt
# zfs snapshot san/test2@s2
# echo 'Hello, again' >> hello.txt
# ln -s goodbye.txt goodbye2.txt
# mv hello.txt hello-new.txt
# zfs snapshot san/test2@s3
# zfs list -t all | grep test2
san/test2      8.49M  3.76T  47.9K  /san/test2
san/test2@s1   8.41M      -  8.42M  -
san/test2@s2   29.2K      -  46.4K  -
san/test2@s3       0      -  47.9K  -
# zfs diff san/test2@s1 san/test2@s2
M       /san/test2/
-       /san/test2/patch-3.1-rc4.bz2
M       /san/test2/hello.txt
+       /san/test2/goodbye.txt
# zfs diff san/test2@s2 san/test2@s3
M       /san/test2/
R       /san/test2/hello.txt -> /san/test2/hello-new.txt
+       /san/test2/goodbye2.txt
# zfs diff san/test2@s1 san/test2@s3
M       /san/test2/
-       /san/test2/patch-3.1-rc4.bz2
R       /san/test2/hello.txt -> /san/test2/hello-new.txt
+       /san/test2/goodbye.txt
+       /san/test2/goodbye2.txt
Where M, R, + and - stand for:
- M: item has been modified
- R: item has been renamed
- +: item has been added
- -: item has been removed
Observe the output of each diff and draw your own conclusions about what we did at each step and what appears in the diff.
Although ZFS-Fuse does not (yet) implement a diffing capability, it is still handy and it can jump several steps backwards in time. Note that we could also jump back just one step (i.e. roll back to snapshot-2), but you would see nothing more interesting than what you have seen in the previous section.
Streaming datasets over the network (full/incremental)
Govern by attributes
Each dataset has its own properties (aka attributes) like:
- size limit
- compression (on/off)
- encryption (on/off)
- quota per user/group
- checksum usage (never turn that property off unless you have very good reasons, which you are unlikely to ever have: doing so prevents ZFS from detecting data corruption)
Not all of a dataset's properties are editable; some of them are set by the operating system for you and can't be modified.
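As a minimal sketch of how such properties are typically read and changed, the script below reuses the myfirstpool/myfirstDS dataset from the earlier examples. The run wrapper and the DRY_RUN variable are illustrative helpers introduced here, not part of ZFS: by default each command is only printed, so the script is safe to paste on a machine without any pool.

```sh
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set DRY_RUN=0 to really apply the changes (needs root and an existing dataset).
# "run" and DRY_RUN are illustrative helpers, not ZFS features.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

DS=myfirstpool/myfirstDS    # dataset used in the earlier examples

run zfs set compression=on "$DS"              # enable compression for new writes
run zfs set quota=5G "$DS"                    # limit the dataset size
run zfs get compression,quota,checksum "$DS"  # display the selected properties
run zfs get -H -o value used "$DS"            # "used" is maintained by ZFS itself
```

Running it with DRY_RUN=0 as root applies the two settings and then displays the resulting values. Whether every property is available depends on the ZFS version in use.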
Data redundancy with ZFS
ZFS/RAID-Z vs RAID-5
RAID-5 is very commonly used nowadays because of its simplicity, efficiency and fault tolerance. Although the technology has proven itself over decades, it has a major drawback known as "the RAID-5 write hole". If you are familiar with RAID-5, you already know that it consists of spreading the stripes across all of the disks within the array and interleaving them with a special stripe called the parity. Several schemes for spreading stripes and parity across the disks exist in the wild, each one with its own pros and cons; however, the "standard" one (also known as left-asymmetric) is:
Disk_0  | Disk_1  | Disk_2  | Disk_3
[D0_S0] | [D0_S1] | [D0_S2] | [D0_P]
[D1_S0] | [D1_S1] | [D1_P]  | [D1_S2]
[D2_S0] | [D2_P]  | [D2_S1] | [D2_S2]
[D3_P]  | [D3_S0] | [D3_S1] | [D3_S2]
The parity is simply computed by XORing the stripes of the same "row", thus giving the general equation:
- [Dn_S0] XOR [Dn_S1] XOR ... XOR [Dn_Sm] XOR [Dn_P] = 0
This equation can be rewritten in several ways:
- [Dn_S0] XOR [Dn_S1] XOR ... XOR [Dn_Sm] = [Dn_P]
- [Dn_S1] XOR [Dn_S2] XOR ... XOR [Dn_Sm] XOR [Dn_P] = [Dn_S0]
- [Dn_S0] XOR [Dn_S2] XOR ... XOR [Dn_Sm] XOR [Dn_P] = [Dn_S1]
- ...and so on!
Because the equations are combinations of exclusive-ors, it is easy to compute a term if it is missing. Let's say we have 3 stripes plus one parity, each composed of 4 bits, but one of them is missing due to a disk failure:
- D0_S0 = 1011
- D0_S1 = 0010
- D0_S2 = <missing>
- D0_P = 0110
However we know that:
- D0_S0 XOR D0_S1 XOR D0_S2 XOR D0_P = 0000, which can be rewritten as:
- D0_S2 = D0_S0 XOR D0_S1 XOR D0_P
Applying boolean algebra gives: D0_S2 = 1011 XOR 0010 XOR 0110 = 1111. Proof: 1011 XOR 0010 XOR 1111 = 0110, which is the same as D0_P.
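The same reconstruction can be replayed in a few lines of POSIX shell. This is only a sketch of the arithmetic, not of what a real controller does: xor_bits is a helper written for this example, and it works on strings of 0 and 1 characters so that the values look like the ones above.

```sh
#!/bin/sh
# XOR two equal-length strings of 0/1 characters, bit by bit.
# (Illustrative helper, written for this example.)
xor_bits() {
    a=$1; b=$2; out=
    while [ -n "$a" ]; do
        rest_a=${a#?}; rest_b=${b#?}
        bit_a=${a%"$rest_a"}      # first character of $a
        bit_b=${b%"$rest_b"}      # first character of $b
        out=$out$(( $bit_a ^ $bit_b ))
        a=$rest_a; b=$rest_b
    done
    echo "$out"
}

D0_S0=1011
D0_S1=0010
D0_P=0110

# D0_S2 = D0_S0 XOR D0_S1 XOR D0_P
D0_S2=$(xor_bits "$(xor_bits "$D0_S0" "$D0_S1")" "$D0_P")
echo "rebuilt D0_S2     = $D0_S2"

# Cross-check: XORing the three data stripes must give back the parity.
check=$(xor_bits "$(xor_bits "$D0_S0" "$D0_S1")" "$D0_S2")
echo "recomputed parity = $check"
```

The rebuilt stripe is 1111 and the recomputed parity is 0110, matching the manual calculation.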
'So what's the deal?' Okay, now the funny part: forget the above hypothesis and imagine we have this:
- D0_S0 = 1011
- D0_S1 = 0010
- D0_S2 = 1101
- D0_P = 0110
Applying boolean algebra magic gives 1011 XOR 0010 XOR 1101 = 0100. Problem: this is different from D0_P (0110). Can you tell which one (or which ONES) of the four terms lies? If you find a mathematically acceptable solution, start your own company, because you have just solved a big computer science problem. If humans can't answer the question, imagine how hard it is for the poor little RAID-5 controller to determine which stripe is right and which one lies, and picture the resulting "datageddon" (i.e. massive data corruption on the RAID-5 array) when the controller detects an error and starts to rebuild the array.
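To see why parity alone cannot name the culprit, the sketch below takes the four values above and computes, for each term in turn, the value it would need to have if the three others were trusted. The helpers xor_bits and xor3 are written for this example only. Every one of the four "repairs" yields a consistent row, so nothing in the parity says which one is the right one.

```sh
#!/bin/sh
# XOR two equal-length strings of 0/1 characters, bit by bit.
# (Illustrative helper, written for this example.)
xor_bits() {
    a=$1; b=$2; out=
    while [ -n "$a" ]; do
        rest_a=${a#?}; rest_b=${b#?}
        bit_a=${a%"$rest_a"}      # first character of $a
        bit_b=${b%"$rest_b"}      # first character of $b
        out=$out$(( $bit_a ^ $bit_b ))
        a=$rest_a; b=$rest_b
    done
    echo "$out"
}

# XOR of three bit strings.
xor3() { xor_bits "$(xor_bits "$1" "$2")" "$3"; }

S0=1011; S1=0010; S2=1101; P=0110

# XOR of all four terms: 0000 would mean the row is consistent.
syndrome=$(xor_bits "$(xor3 "$S0" "$S1" "$S2")" "$P")
echo "syndrome = $syndrome"

# Four candidate repairs, each one trusting the three other terms.
fix_S0=$(xor3 "$S1" "$S2" "$P")
fix_S1=$(xor3 "$S0" "$S2" "$P")
fix_S2=$(xor3 "$S0" "$S1" "$P")
fix_P=$(xor3 "$S0" "$S1" "$S2")
echo "if S0 lies, it should be $fix_S0"
echo "if S1 lies, it should be $fix_S1"
echo "if S2 lies, it should be $fix_S2"
echo "if P  lies, it should be $fix_P"
```

The syndrome is 0010 (not zero), and the four candidates are 1001, 0000, 1111 and 0100. Telling them apart requires extra information that plain RAID-5 does not store, such as a per-block checksum.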
This is not science fiction, this is pure reality, and the weakness lies in RAID-5's simplicity. Here is how it can happen: an urban legend about RAID-5 arrays is that they update stripes in an atomic transaction (either all of the stripes plus the parity are written, or none of them). Too bad, this is just not true: the data is written on the fly, and if for one reason or another the machine hosting the RAID-5 array suffers a power outage or a crash, the RAID-5 controller will simply have no idea what it was doing, which stripes are up to date and which ones are not. Of course, RAID controllers in servers do have a replaceable on-board battery, and most of the time the server they reside in is connected to an auxiliary power source like a battery-based UPS or a diesel/gas generator. However, Murphy's law and unpredictable hazards do sometimes strike...
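The write hole can be illustrated with a toy simulation. The values and the sequence of events below are made up for the illustration, and xor_bits and xor3 are helpers written for this example. A data stripe is updated, the machine "crashes" before the matching parity is written, and a later rebuild of a different stripe, one that was never touched, produces garbage.

```sh
#!/bin/sh
# XOR two equal-length strings of 0/1 characters, bit by bit.
# (Illustrative helper, written for this example.)
xor_bits() {
    a=$1; b=$2; out=
    while [ -n "$a" ]; do
        rest_a=${a#?}; rest_b=${b#?}
        bit_a=${a%"$rest_a"}      # first character of $a
        bit_b=${b%"$rest_b"}      # first character of $b
        out=$out$(( $bit_a ^ $bit_b ))
        a=$rest_a; b=$rest_b
    done
    echo "$out"
}

# XOR of three bit strings.
xor3() { xor_bits "$(xor_bits "$1" "$2")" "$3"; }

# 1. A consistent row: parity matches the three data stripes.
S0=1011; S1=0010; S2=1111
P=$(xor3 "$S0" "$S1" "$S2")
echo "initial parity      = $P"

# 2. S1 is overwritten on disk...
S1=0111
needed_P=$(xor3 "$S0" "$S1" "$S2")
echo "parity now required = $needed_P"
# ...but the crash happens here: P on disk is still the old value.

# 3. Later, the disk holding S2 dies and S2 is rebuilt from S0, S1 and stale P.
rebuilt_S2=$(xor3 "$S0" "$S1" "$P")
echo "original S2         = $S2"
echo "rebuilt  S2         = $rebuilt_S2"
```

With these values the stale parity is 0110 while 0011 was required, and the rebuilt stripe comes out as 1010 instead of 1111.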
Another funny scenario: imagine a machine with a RAID-5 array (on a UPS this time) but with non-ECC memory. The RAID-5 controller splits the data buffer into stripes, computes the parity stripe and starts to write them to the different disks of the array. But... but... but... for some odd reason, a single bit in one of the stripes flips (cosmic rays, RFI...) after the parity calculation. Too bad, too sad: one of the written stripes contains corrupted data and it is silently written to the array. Datageddon in sight!
No need to freak out: storage units have sophisticated error correction capabilities (a magnetic or optical recording surface is not perfect, and read/write errors do occur) that mask most of the cases. However, some established statistics estimate that even with error correction mechanisms, one bit out of every 10^16 bits transferred is incorrect. 10^16 is really huge, but unfortunately, in this beginning of the 21st century, with datacenters brewing massive amounts of data across several hundreds if not thousands of servers, this number starts to give headaches: a big datacenter can face silent data corruption every 15 minutes (Wikipedia). No typo here: a potential disaster may silently appear 4 times an hour, every single day of the year. Detection techniques exist, but traditional RAID-5 arrays can themselves be a problem. Ironic for such a popular and widely used solution :)
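A quick back-of-the-envelope calculation shows what these two figures imply when taken together. It assumes, for simplicity, that errors are spread evenly and that the rate really is one bad bit per 10^16 bits transferred. The variable names are chosen for this example, and the shell is assumed to use 64-bit integer arithmetic.

```sh
#!/bin/sh
# Figures taken from the text above; everything else is derived from them.
bits_per_error=10000000000000000      # 10^16 bits transferred per bad bit
interval_s=$(( 15 * 60 ))             # one silent corruption every 15 minutes

bytes_per_error=$(( bits_per_error / 8 ))
bytes_per_second=$(( bytes_per_error / interval_s ))
errors_per_hour=$(( 3600 / interval_s ))
errors_per_year=$(( errors_per_hour * 24 * 365 ))

echo "data moved per error : $(( bytes_per_error / 1000000000000 )) TB"
echo "implied throughput   : $(( bytes_per_second / 1000000000 )) GB/s"
echo "errors per hour      : $errors_per_hour"
echo "errors per year      : $errors_per_year"
```

In other words, 10^16 bits is about 1250 TB of traffic, and moving that much every 15 minutes corresponds to an aggregate throughput in the order of 1.4 TB/s across the whole datacenter, for 4 events an hour and roughly 35,000 a year.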
If RAID-5 was an acceptable trade-off in past decades, it has simply had its day. RAID-5 is dead? *Hooray!*
Final words and lessons learned
Source: solaris-zfs-administration-guide