{{Article
|Author=Drobbins
}}

{{Article}}
{{fancyimportant|BTRFS is still '''experimental''', even with the latest Linux kernels (3.4-rc at the time of writing), so be prepared to lose some data sooner or later, or to hit severe issues, regressions, or "itchy" bugs. Subliminal message: '''Do not put critical data on BTRFS partitions'''.}}

== Introduction ==
  
<tt>Keychain</tt> helps you to manage SSH and GPG keys in a convenient and secure manner. It acts as a frontend to <tt>ssh-agent</tt> and <tt>ssh-add</tt>, but allows you to easily have one long running <tt>ssh-agent</tt> process per system, rather than the norm of one <tt>ssh-agent</tt> per login session.
= Introduction =
__TOC__
This dramatically reduces the number of times you need to enter your passphrase. With <tt>keychain</tt>, you only need to enter a passphrase once every time your local machine is rebooted. <tt>Keychain</tt> also makes it easy for remote cron jobs to securely "hook in" to a long-running <tt>ssh-agent</tt> process, allowing your scripts to take advantage of key-based logins.
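The cron "hook in" works by sourcing the small environment file that keychain writes under <tt>~/.keychain</tt>. A minimal sketch of such a script follows; the exact file name depends on your machine's hostname, and the trailing commands are placeholders:

```shell
# Sketch of a cron-driven script "hooking in" to the long-running agent.
# keychain writes the agent's environment variables to ~/.keychain/<hostname>-sh;
# sourcing that file lets this non-interactive shell reach the cached keys.
KEYCHAIN_ENV="$HOME/.keychain/$(hostname)-sh"
if [ -r "$KEYCHAIN_ENV" ]; then
    . "$KEYCHAIN_ENV"
fi
# ...ssh/scp/rsync commands relying on key-based login would follow here...
```
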
{{#seo:
|title=Keychain (SSH/GPG Key Management)
|keywords=keychain,ssh,gpg,funtoo,linux,gentoo,Daniel Robbins
|description=Keychain is a shell script that helps you to manage your SSH and GPG keys more easily.
}}
== Download and Resources ==
  
The latest release of keychain is version <tt>2.7.2_beta1</tt>, and was released on July 7, 2014. The current version of keychain supports <tt>gpg-agent</tt> as well as <tt>ssh-agent</tt>.
BTRFS is an advanced filesystem, mostly contributed by Oracle, whose origins go back to 2007. A good summary is given in [http://lwn.net/Articles/342892/]. BTRFS aims to provide a modern answer for making storage more flexible and efficient. According to its main contributor, Chris Mason, the goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable." (Ref. [http://en.wikipedia.org/wiki/Btrfs http://en.wikipedia.org/wiki/Btrfs]).
  
Keychain is compatible with many operating systems, including <tt>AIX</tt>, <tt>*BSD</tt>, <tt>Cygwin</tt>, <tt>MacOS X</tt>, <tt>Linux</tt>, <tt>HP/UX</tt>, <tt>Tru64 UNIX</tt>, <tt>IRIX</tt>, <tt>Solaris</tt> and <tt>GNU Hurd</tt>.
Btrfs, often compared to ZFS, offers interesting features like:
  
=== Download ===
* Very few fixed-location metadata structures, allowing an existing ext2/ext3 filesystem to be "upgraded" in-place to BTRFS
* Transactional operations
* Online volume defragmentation (an online filesystem check is on the radar but is not yet implemented)
* Built-in storage pool capabilities (no need for LVM)
* Built-in RAID capabilities (for both the data and the filesystem metadata); RAID-5/6 is planned for 3.5 kernels
* The ability to grow and shrink the volume
* Subvolumes and snapshots (extremely powerful: you can "roll back" to a previous filesystem state as if nothing had happened)
* Copy-on-write
* B-trees for the internal filesystem structures (B-trees are known to grow logarithmically in depth, making them more efficient to scan)
  
* ''Release Archive''
** [http://www.funtoo.org/distfiles/keychain/keychain-2.7.2_beta1.tar.bz2 keychain 2.7.2_beta1]
** [http://www.funtoo.org/distfiles/keychain/keychain-2.7.1.tar.bz2 keychain 2.7.1]
* ''Apple MacOS X Packages''
** [http://www.funtoo.org/distfiles/keychain/keychain-2.7.1-macosx.tar.gz keychain 2.7.1 MacOS X package]

Keychain development sources can be found in the [http://www.github.com/funtoo/keychain keychain git repository]. Please use the [https://bugs.funtoo.org Funtoo Linux bug tracker] and [irc://irc.freenode.net/funtoo #funtoo IRC channel] for keychain support questions as well as bug reports.

= Requirements =

A recent Linux kernel is required (the BTRFS metadata format evolves from time to time, and mounting with a recent Linux kernel can make a BTRFS volume unreadable with an older kernel revision, e.g. Linux 2.6.31 vs. Linux 2.6.30). You must also use sys-fs/btrfs-progs (0.19, or better -9999, which points to the git repository).

= Playing with BTRFS storage pool capabilities =
  
=== Project History ===
Whereas it would be possible to use btrfs just as you are used to under a non-LVM system, it shines when using its built-in storage pool capabilities. Tired of playing with LVM? :-) Good news: you do not need it anymore with btrfs.
  
Daniel Robbins originally wrote <tt>keychain</tt> 1.0 through 2.0.3. 1.0 was written around June 2001, and 2.0.3 was released in late August, 2002.
== Setting up a storage pool ==
  
After 2.0.3, <tt>keychain</tt> was maintained by various Gentoo developers, including Seth Chandler, Mike Frysinger and Robin H. Johnson, through July 3, 2003.
BTRFS terminology is a bit confusing. If you have already used another 'advanced' filesystem like ZFS, or a mechanism like LVM, it is good to know that there are many correlations. In the BTRFS world, the word ''volume'' corresponds to a storage ''pool'' (ZFS) or a ''volume group'' (LVM). Ref. [http://www.rkeene.org/projects/info/wiki.cgi/165 http://www.rkeene.org/projects/info/wiki.cgi/165]
  
On April 21, 2004, Aron Griffis committed a major rewrite of <tt>keychain</tt> which was released as 2.2.0. Aron continued to actively maintain and improve <tt>keychain</tt> through October 2006 and the <tt>keychain</tt> 2.6.8 release. He also made a few commits after that date, up through mid-July, 2007. At this point, <tt>keychain</tt> had reached a point of maturity.
The test bench uses disk images through loopback devices. Of course, in a real-world case you would use local drives or units through a SAN. To start with, 5 devices of 1 GiB each are allocated:
  
In mid-July, 2009, Daniel Robbins migrated Aron's mercurial repository to git and set up a new project page on funtoo.org, and made a few bug fix commits to the git repo that had been collecting in [http://bugs.gentoo.org bugs.gentoo.org]. Daniel continues to maintain <tt>keychain</tt> and supporting documentation on funtoo.org, and plans to make regular maintenance releases of <tt>keychain</tt> as needed.
<console>
###i## dd if=/dev/zero of=/tmp/btrfs-vol0.img bs=1G count=1
###i## dd if=/dev/zero of=/tmp/btrfs-vol1.img bs=1G count=1
###i## dd if=/dev/zero of=/tmp/btrfs-vol2.img bs=1G count=1
###i## dd if=/dev/zero of=/tmp/btrfs-vol3.img bs=1G count=1
###i## dd if=/dev/zero of=/tmp/btrfs-vol4.img bs=1G count=1
</console>
Then attached:

<console>
###i## losetup /dev/loop0 /tmp/btrfs-vol0.img
###i## losetup /dev/loop1 /tmp/btrfs-vol1.img
###i## losetup /dev/loop2 /tmp/btrfs-vol2.img
###i## losetup /dev/loop3 /tmp/btrfs-vol3.img
###i## losetup /dev/loop4 /tmp/btrfs-vol4.img
</console>

== Creating the initial volume (pool) ==

BTRFS uses different strategies to store data and filesystem metadata (ref. [https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices]).
  
== Quick Setup ==
By default the behavior is:
* metadata is '''replicated''' on all of the devices. If a single device is used, the metadata is duplicated inside that single device (useful in case of corruption or a bad sector: there is a higher chance that one of the two copies is clean). To tell btrfs to maintain a single copy of the metadata, just use ''single''. Remember: '''dead metadata = dead volume with no chance of recovery.'''
* data is '''spread''' amongst all of the devices (this means no redundancy; any data block left on a defective device will be inaccessible)
  
=== Linux ===
To create a BTRFS volume made of multiple devices with default options, use:
  
<console>
###i## mkfs.btrfs /dev/loop0 /dev/loop1 /dev/loop2
</console>

To install under Gentoo or Funtoo Linux, type:

<console>
###i## emerge keychain
</console>
  
For other Linux distributions, use your distribution's package manager, or download and install using the source tarball above. Then generate RSA/DSA keys if necessary. The quick install docs assume you have an RSA key pair named <tt>id_rsa</tt> and <tt>id_rsa.pub</tt> in your <tt>~/.ssh/</tt> directory. Add the following to your <tt>~/.bash_profile</tt>:
{{file|name=~/.bash_profile|body=
eval `keychain --eval --agents ssh id_rsa`
}}

To create a BTRFS volume made of a single device with a single copy of the metadata (dangerous!), use:

<console>
###i## mkfs.btrfs -m single /dev/loop0
</console>
  
If you want to take advantage of GPG functionality, ensure that GNU Privacy Guard is installed and omit the <tt>--agents ssh</tt> option above.
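For example, a combined SSH/GPG setup might look like the following fragment, assuming a keychain build with <tt>gpg-agent</tt> support (as the release above has). <tt>ABCD1234</tt> is a placeholder GPG key ID, not a real one:

```shell
# ~/.bash_profile fragment: manage an SSH key and a GPG key together.
# With no --agents restriction, keychain can drive both ssh-agent and gpg-agent.
# "ABCD1234" is a hypothetical GPG key ID; substitute your own.
eval `keychain --eval id_rsa ABCD1234`
```
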
To create a BTRFS volume made of multiple devices, with metadata spread amongst all of the devices, use:
  
<console>
###i## mkfs.btrfs -m raid0 /dev/loop0 /dev/loop1 /dev/loop2
</console>

=== Apple MacOS X ===
  
To install under MacOS X, install the MacOS X package for keychain. Assuming you have an <tt>id_dsa</tt> and <tt>id_dsa.pub</tt> key pair in your <tt>~/.ssh/</tt> directory, add the following to your <tt>~/.bash_profile</tt>:
{{file|name=~/.bash_profile|body=
eval `keychain --eval --agents ssh --inherit any id_dsa`
}}

To create a BTRFS volume made of multiple devices, with metadata spread amongst all of the devices and data mirrored on all of the devices (you probably don't want this in a real setup), use:

<console>
###i## mkfs.btrfs -m raid0 -d raid1 /dev/loop0 /dev/loop1 /dev/loop2
</console>
  
{{Fancynote|The <tt>--inherit any</tt> option above causes keychain to inherit any ssh key passphrases stored in your Apple MacOS Keychain. If you would prefer for this to not happen, then this option can be omitted.}}
To create a fully redundant BTRFS volume (data and metadata mirrored amongst all of the devices), use:
  
<console>
###i## mkfs.btrfs -d raid1 /dev/loop0 /dev/loop1 /dev/loop2
</console>

== Background ==
  
You're probably familiar with <tt>ssh</tt>, which has become a secure replacement for the venerable <tt>telnet</tt> and <tt>rsh</tt> commands.
{{Fancynote|Technically you can use anything as a physical volume: a volume composed of 2 local hard drives, 3 USB keys, 1 loopback device pointing to a file on an NFS share and 3 logical devices accessed through your SAN (you would be an idiot, but you can, nevertheless). Having different physical volume sizes can lead to issues, but it works :-).}}
  
Typically, when one uses <tt>ssh</tt> to connect to a remote system, one supplies a secret passphrase to <tt>ssh</tt>, which is then passed in encrypted form over the network to the remote server. This passphrase is used by the remote <tt>sshd</tt> server to determine if you should be granted access to the system.
== Checking the initial volume ==
  
However, OpenSSH and nearly all other SSH clients and servers have the ability to perform another type of authentication, called asymmetric public key authentication, using the RSA or DSA authentication algorithms. They are very useful, but can also be complicated to use. <tt>keychain</tt> has been designed to make it easy to take advantage of the benefits of RSA and DSA authentication.
To verify which devices a BTRFS volume is composed of, use '''btrfs-show ''device'' ''' (old style) or '''btrfs filesystem show ''device'' ''' (new style). You only need to specify one of the devices (the metadata has been designed to keep track of which devices are linked together). If the initial volume was set up like this:
  
<console>
###i## mkfs.btrfs /dev/loop0 /dev/loop1 /dev/loop2

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

adding device /dev/loop1 id 2
adding device /dev/loop2 id 3
fs created label (null) on /dev/loop0
        nodesize 4096 leafsize 4096 sectorsize 4096 size 3.00GB
Btrfs Btrfs v0.19
</console>

== Generating a Key Pair ==

To use RSA and DSA authentication, first you use a program called <tt>ssh-keygen</tt> (included with OpenSSH) to generate a ''key pair'' -- two small files. One of the files is the ''public key''. The other small file contains the ''private key''. <tt>ssh-keygen</tt> will ask you for a passphrase, and this passphrase will be used to encrypt your private key. You will need to supply this passphrase to use your private key. If you wanted to generate a DSA key pair, you would do this:

<console># ##i##ssh-keygen -t dsa
Generating public/private dsa key pair.</console>

You would then be prompted for a location to store your key pair. If you do not have one currently stored in <tt>~/.ssh</tt>, it is fine to accept the default location:
  
<console>Enter file in which to save the key (/root/.ssh/id_dsa): </console>
It can be checked with one of these commands (they are equivalent):
Then, you are prompted for a passphrase. This passphrase is used to encrypt the ''private key'' on disk, so even if it is stolen, it will be difficult for someone else to use it to successfully authenticate as you with any accounts that have been configured to recognize your public key.
  
Note that conversely, if you '''do not''' provide a passphrase for your private key file, then your private key file '''will not''' be encrypted. This means that if someone steals your private key file, ''they will have the full ability to authenticate with any remote accounts that are set up with your public key.''
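If you already generated a key without a passphrase, you do not have to regenerate it: <tt>ssh-keygen -p</tt> can encrypt the existing file in place. A sketch follows, run on a throwaway key in a temporary directory rather than your real <tt>~/.ssh</tt> key (on a real system you would point <tt>-f</tt> at your actual private key file):

```shell
# Add a passphrase to an existing, unencrypted private key.
# A scratch key is generated first so nothing real is touched.
DIR=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$DIR/demo_key"                        # no passphrase yet
ssh-keygen -q -p -P '' -N 'correct horse battery staple' -f "$DIR/demo_key"
# The key can no longer be loaded without the passphrase:
ssh-keygen -y -P '' -f "$DIR/demo_key" >/dev/null 2>&1 || echo "key is now encrypted"
```
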
<console>
###i## btrfs filesystem show /dev/loop0
###i## btrfs filesystem show /dev/loop1
###i## btrfs filesystem show /dev/loop2
</console>
  
Below, I have supplied a passphrase so that my private key file will be encrypted on disk:
<console>Enter passphrase (empty for no passphrase): ##i#########
Enter same passphrase again: ##i#########
Your identification has been saved in /var/tmp/id_dsa.
Your public key has been saved in /var/tmp/id_dsa.pub.
The key fingerprint is:
5c:13:ff:46:7d:b3:bf:0e:37:1e:5e:8c:7b:a3:88:f4 root@devbox-ve
The key's randomart image is:
+--[ DSA 1024]----+
|          .      |
|           o   . |
|          o . ..o|
|       . . . o  +|
|        S    o.  |
|            . o. |
|         .   ..++|
|        . o . =o*|
|         . E .+*.|
+-----------------+</console>

The result is the same for all commands:

<pre>
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 3 FS bytes used 28.00KB
        devid    3 size 1.00GB used 263.94MB path /dev/loop2
        devid    1 size 1.00GB used 275.94MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1
</pre>
  
== Setting up Authentication ==
To show all of the volumes that are present:
  
Here's how you use these files to authenticate with a remote server. On the remote server, you would append the contents of your ''public key'' to the <tt>~/.ssh/authorized_keys</tt> file, if such a file exists. If it doesn't exist, you can simply create a new <tt>authorized_keys</tt> file in the remote account's <tt>~/.ssh</tt> directory that contains the contents of your local <tt>id_dsa.pub</tt> file.
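The append step can be scripted. The sketch below imitates it locally: a temporary directory stands in for the remote account's <tt>~/.ssh</tt>, and the key line (illustrative, not a real key) is only added when it is not already present, so repeated runs are harmless:

```shell
# Install a public key line into an authorized_keys file, idempotently.
# $REMOTE_SSH stands in for the remote account's ~/.ssh directory.
PUBKEY_LINE='ssh-dss AAAAB3NzaC1kc3MAAACBDEMO... user@devbox-ve'   # illustrative key
REMOTE_SSH=$(mktemp -d)                    # on a real host: the remote ~/.ssh
touch "$REMOTE_SSH/authorized_keys"
chmod 600 "$REMOTE_SSH/authorized_keys"
# Append only when the exact line is absent.
grep -qxF "$PUBKEY_LINE" "$REMOTE_SSH/authorized_keys" || \
    printf '%s\n' "$PUBKEY_LINE" >> "$REMOTE_SSH/authorized_keys"
```

On a real system the same append is usually run over ssh, or handled by the <tt>ssh-copy-id</tt> helper shipped with OpenSSH.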
<console>
###i## btrfs filesystem show
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 3 FS bytes used 28.00KB
        devid    3 size 1.00GB used 263.94MB path /dev/loop2
        devid    1 size 1.00GB used 275.94MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1

Label: none  uuid: 1701af39-8ea3-4463-8a77-ec75c59e716a
        Total devices 1 FS bytes used 944.40GB
        devid    1 size 1.42TB used 1.04TB path /dev/sda2

Label: none  uuid: 01178c43-7392-425e-8acf-3ed16ab48813
        Total devices 1 FS bytes used 180.14GB
        devid    1 size 406.02GB used 338.54GB path /dev/sda4
</console>

Then, if you weren't going to use <tt>keychain</tt>, you'd perform the following steps. On your local client, you would start a program called <tt>ssh-agent</tt>, which runs in the background. Then you would use a program called <tt>ssh-add</tt> to tell <tt>ssh-agent</tt> about your secret private key. Then, if you've set up your environment properly, the next time you run <tt>ssh</tt>, it will find <tt>ssh-agent</tt> running, grab the private key that you added to <tt>ssh-agent</tt> using <tt>ssh-add</tt>, and use this key to authenticate with the remote server.

Again, the steps in the previous paragraph are what you'd do if <tt>keychain</tt> weren't around to help. If you are using <tt>keychain</tt>, and I hope you are, you would simply add the following line to your <tt>~/.bash_profile</tt> (or, for a regular user, to <tt>~/.bashrc</tt>):
  
{{file|name=~/.bash_profile|body=
eval `keychain --eval id_dsa`
}}

{{Fancywarning|The BTRFS wiki mentions that '''btrfs device scan''' should be performed; the consequence of not doing so may be that the volume is not seen.}}
  
The next time you log in or source your <tt>~/.bash_profile</tt> (or <tt>~/.bashrc</tt>), <tt>keychain</tt> will start, launch <tt>ssh-agent</tt> for you if it has not yet been started, use <tt>ssh-add</tt> to add your <tt>id_dsa</tt> private key file to <tt>ssh-agent</tt>, and set up your shell environment so that <tt>ssh</tt> will be able to find <tt>ssh-agent</tt>. If <tt>ssh-agent</tt> is already running, <tt>keychain</tt> will ensure that your <tt>id_dsa</tt> private key has been added to <tt>ssh-agent</tt> and then set up your environment so that <tt>ssh</tt> can find the already-running <tt>ssh-agent</tt>.
== Mounting the initial volume ==
  
Note that when <tt>keychain</tt> runs for the first time after your local system has booted, you will be prompted for a passphrase for your private key file if it is encrypted. But here's the nice thing about using <tt>keychain</tt> -- even if you are using an encrypted private key file, you will only need to enter your passphrase when your system first boots (or, in the case of a server, when you first log in). After that, <tt>ssh-agent</tt> is already running and has your decrypted private key cached in memory, so opening a new shell will not prompt you again.
BTRFS volumes can be mounted like any other filesystem. The cherry on top of the sundae is that the design of the BTRFS metadata makes it possible to mount using any of the volume's devices. The following commands are equivalent:
This means that you can now <tt>ssh</tt> to your heart's content, without supplying a passphrase.
<console>
###i## mount /dev/loop0 /mnt
###i## mount /dev/loop1 /mnt
###i## mount /dev/loop2 /mnt
</console>
  
You can also execute batch <tt>cron</tt> jobs and scripts that need to use <tt>ssh</tt> or <tt>scp</tt>, and they can take advantage of passwordless RSA/DSA authentication as well. To do this, you would add the following line to the top of a bash script:
Whichever physical device is used to mount the BTRFS volume, <tt>df -h</tt> reports the same (in all cases 3 GiB of "free" space is reported):
  
{{file|name=example-script.sh|body=
eval `keychain --noask --eval id_dsa` || exit 1
}}

<console>
###i## df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1      3.0G   56K  1.8G   1% /mnt
</console>
  
The extra <tt>--noask</tt> option tells <tt>keychain</tt> that it should not prompt for a passphrase if one is needed. Since it is not running interactively, it is better for the script to fail if the decrypted private key isn't cached in memory via <tt>ssh-agent</tt>.
The following command prints very useful information (like how the BTRFS volume has been created):

<console>
###i## btrfs filesystem df /mnt
Data, RAID0: total=409.50MB, used=0.00
Data: total=8.00MB, used=0.00
System, RAID1: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=204.75MB, used=28.00KB
Metadata: total=8.00MB, used=0.00
</console>

By the way, as you can see, for this btrfs command the mount point must be specified, not one of the physical devices.
  
== Keychain Options ==
== Shrinking the volume ==
  
=== Specifying Agents ===
A common practice in system administration is to leave some headroom instead of using the whole capacity of a storage pool (just in case). With btrfs one can easily shrink volumes. Let's shrink the volume a bit (by about 25%):

<console>
###i## btrfs filesystem resize -500m /mnt
###i## df -h
/dev/loop1      2.6G   56K  1.8G   1% /mnt
</console>
And yes, it is an online resize; there is no need to umount/shrink/mount, so no downtime! :-) However, a BTRFS volume requires a minimal size... if the shrink is too aggressive, the volume won't be resized:

<console>
###i## btrfs filesystem resize -1g /mnt
Resize '/mnt' of '-1g'
ERROR: unable to resize '/mnt'
</console>
== Growing the volume ==

This is the opposite operation: you can make a BTRFS volume grow by a particular amount (e.g. 150 more megabytes):

<console>
###i## btrfs filesystem resize +150m /mnt
Resize '/mnt' of '+150m'
###i## df -h
/dev/loop1      2.7G   56K  1.8G   1% /mnt
</console>
You can also take an ''"all you can eat"'' approach via the '''max''' option, meaning all of the possible space will be used for the volume:

<console>
###i## btrfs filesystem resize max /mnt
Resize '/mnt' of 'max'
###i## df -h
/dev/loop1      3.0G   56K  1.8G   1% /mnt
</console>
== Adding a new device to the BTRFS volume ==

To add a new device to the volume:

<console>
###i## btrfs device add /dev/loop4 /mnt
###i## btrfs filesystem show /dev/loop4
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 4 FS bytes used 28.00KB
        devid    3 size 1.00GB used 263.94MB path /dev/loop2
        devid    4 size 1.00GB used 0.00 path /dev/loop4
        devid    1 size 1.00GB used 275.94MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1
</console>
Again, there is no need to umount the volume first, as adding a device is an online operation (the device has no space used yet, hence the 0.00). The operation is not finished, however: we must tell btrfs to prepare the new device (i.e. rebalance/mirror the metadata and the data between all of the devices):

<console>
###i## btrfs filesystem balance /mnt
###i## btrfs filesystem show /dev/loop4
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 4 FS bytes used 28.00KB
        devid    3 size 1.00GB used 110.38MB path /dev/loop2
        devid    4 size 1.00GB used 366.38MB path /dev/loop4
        devid    1 size 1.00GB used 378.38MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1
</console>
{{Fancynote|Depending on the sizes and contents of the volume, a balancing operation can take several minutes or hours.}}
== Removing a device from the BTRFS volume ==

<console>
###i## btrfs device delete /dev/loop2 /mnt
###i## btrfs filesystem show /dev/loop0
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 4 FS bytes used 28.00KB
        devid    4 size 1.00GB used 264.00MB path /dev/loop4
        devid    1 size 1.00GB used 268.00MB path /dev/loop0
        devid    2 size 1.00GB used 0.00 path /dev/loop1
        *** Some devices missing
###i## df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1      3.0G   56K  1.5G   1% /mnt
</console>
Here again, removing a device is totally dynamic and can be done as an online operation! Note that when a device is removed, its content is transparently redistributed among the other devices.

Obvious points:
* '''DO NOT UNPLUG THE DEVICE BEFORE THE END OF THE OPERATION, DATA LOSS WILL RESULT!'''
* If you used raid0 for either metadata or data at BTRFS volume creation, you will end up with an unusable volume if one of the devices fails before being properly removed from the volume, as some stripes will be lost.
Once you add a new device to the BTRFS volume as a replacement for a removed one, you can clean up the references to the missing device:

<console>
###i## btrfs device delete missing /mnt
</console>
== Using a BTRFS volume in degraded mode ==

{{fancywarning|It is not possible to use a volume in degraded mode if raid0 has been used for data/metadata and the device has not been properly removed with '''btrfs device delete''' (some stripes will be missing). The situation is even worse if RAID0 is used for the metadata: trying to mount a BTRFS volume in read/write mode while not all of the devices are accessible '''will simply kill the remaining metadata, making the BTRFS volume totally unusable'''... you have been warned! :-)}}

If you use raid1 or raid10 for data AND metadata, and a usable submirror is accessible (consisting of 1 drive in the case of RAID1, or the two drives of the same RAID0 array in the case of RAID10), you can mount the volume in degraded mode when some devices are missing (e.g. dead SAN link or dead drive):

<console>
###i## mount -o degraded /dev/loop0 /mnt
</console>
If you used RAID0 for the metadata (and one of your drives is inaccessible), or RAID10 but without enough drives online for even a degraded mode to be possible, btrfs will refuse to mount the volume:

<console>
###i## mount /dev/loop0 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
</console>
The situation is no better if you used RAID1 for the metadata and RAID0 for the data: you can mount the volume in degraded mode, but you will encounter problems while accessing your files:

<console>
###i## cp /mnt/test.dat /tmp
cp: reading `/mnt/test.dat': Input/output error
cp: failed to extend `/tmp/test.dat': Input/output error
</console>
= Playing with subvolumes and snapshots =

== A story of boxes.... ==

When you think about subvolumes in BTRFS, think about boxes. Each box can contain items and other, smaller boxes ("sub-boxes"), which in turn can also contain items and boxes (sub-sub-boxes), and so on. Each box and item has a number and a name, except for the top-level box, which has only a number (zero). Now imagine that all of the boxes are semi-opaque: you can see what they contain if you are outside the box, but you can't see outside when you are inside the box. Thus, depending on the box you are in, you can see either all of the items and sub-boxes (top-level box) or only a part of them (any box but the top-level one). To give you a better idea of this somewhat abstract explanation, let's illustrate a bit:

<pre>
(0) --+-> Item A (1)
      |
      +-> Item B (2)
      |
      +-> Sub-box 1 (3) --+-> Item C (4)
      |                   |
      |                   +-> Sub-sub-box 1.1 (5) --+-> Item D (6)
      |                   |                         |
      |                   |                         +-> Item E (7)
      |                   |                         |
      |                   |                         +-> Sub-sub-sub-box 1.1.1 (8) ---> Item F (9)
      |                   +-> Item F (10)
      |
      +-> Sub-box 2 (11) --> Item G (12)
</pre>
What you see in the hierarchy depends on where you are (note that the top-level box, numbered 0, doesn't have a name; you will see why later). So:
* If you are in the top-level box (numbered 0), you see everything, i.e. things numbered 1 to 12
* If you are in "Sub-sub-box 1.1" (numbered 5), you see only things 6 to 9
* If you are in "Sub-box 2" (numbered 11), you only see what is numbered 12

Did you notice? We have two items named 'F' (respectively numbered 9 and 10). This is not a typographic error; it illustrates the fact that every item lives its own peaceful existence in its own box. Although they have the same name, 9 and 10 are two distinct and unrelated objects (of course it is impossible to have two objects named 'F' in the same box, even though they would be numbered differently).
== ... applied to BTRFS! (or, "What is a volume/subvolume?") ==

BTRFS subvolumes work in the exact same manner, with some nuances:

* First, imagine a frame that surrounds the whole hierarchy (represented by dots below). This is your BTRFS '''volume'''. A bit abstract at first glance, but BTRFS volumes have no tangible existence; they are just an ''aggregation'' of devices tagged as being clustered together (that fellowship is created when you invoke '''mkfs.btrfs''' or '''btrfs device add''').
* Second, the first level of the hierarchy contains '''only''' a single box, numbered zero, which can never be destroyed (because everything it contains would also be destroyed).

Where our analogy of nested boxes used the word '''"box"''', the real BTRFS world uses the word '''"subvolume"''' (box => subvolume). As in the boxes analogy, all subvolumes hold a unique number greater than zero and a name, with the exception of the root subvolume located at the very first level of the hierarchy, which is ''always'' numbered zero and has no name (BTRFS tools destroy subvolumes by their name, not their number, so '''no name = no possible destruction'''. This is a totally intentional architectural choice, not a flaw).

Here is a typical hierarchy:
<pre>
.....BTRFS Volume................................................................................................................
.
.  Root subvolume (0) --+-> Subvolume SV1 (258) ---> Directory D1 --+-> File F1
.                       |                                           |
.                       |                                           +-> File F2
.                       |
.                       +-> Directory D1 --+-> File F1
.                       |                  |
.                       |                  +-> File F2
.                       |                  |
.                       |                  +-> File F3
.                       |                  |
.                       |                  +-> Directory D11 ---> File F4
.                       +-> File F1
.                       |
.                       +-> Subvolume SV2 (259) --+-> Subvolume SV21 (260)
.                                                 |
.                                                 +-> Subvolume SV22 (261) --+-> Directory D2 ---> File F4
.                                                                            |
.                                                                            +-> Directory D3 --+-> Subvolume SV221 (262) ---> File F5
.                                                                            |                  |
.                                                                            |                  +-> File F6
.                                                                            |                  |
.                                                                            |                  +-> File F7
.                                                                            |
.                                                                            +-> File F8
.
.................................................................................................................................
</pre>
 +
 
 +
Maybe you have a question: "Okay, What is the difference between a directory and a subvolume? Both can can contain something!". To further confuse you, here is what users get if they reproduce the first level hierarchy on a real machine:
 +
 
 +
<console>
###i## ls -l
total 0
drwx------ 1 root root 0 May 23 12:48 SV1
drwxr-xr-x 1 root root 0 May 23 12:48 D1
-rw-r--r-- 1 root root 0 May 23 12:48 F1
drwx------ 1 root root 0 May 23 12:48 SV2
</console>

Although subvolumes SV1 and SV2 have been created with special BTRFS commands, they appear just as if they were ordinary directories! A subtle nuance exists, however: think back to the boxes analogy and map its concepts in the following manner:

* a subvolume : the semi-opaque '''box'''
* a directory : a ''sort of'' '''item''' (that can contain something, even another subvolume)
* a file : ''another sort of'' '''item'''

So, in the internal filesystem metadata, SV1 and SV2 are stored in a different manner than D1 (although this is transparently handled for users). You can, however, see SV1 and SV2 for what they are (subvolumes) by running the following command (here the root subvolume, numbered 0, has been mounted on /mnt):

<console>
###i## btrfs subvolume list /mnt
ID 258 top level 5 path SV1
ID 259 top level 5 path SV2
</console>
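The output of '''btrfs subvolume list''' is line-oriented and easy to post-process in scripts. Below is a minimal sketch: the <tt>subvol_id</tt> helper is hypothetical (not part of btrfs-progs), and the sample listing is hardcoded so the snippet is self-contained; in real use you would pipe the live command output into it.

```shell
# Illustrative helper (not a btrfs-progs tool): extract the numeric ID of a
# subvolume from `btrfs subvolume list` output by matching the last column.
# The sample output below is hardcoded; in practice you would use:
#   btrfs subvolume list /mnt | awk -v name=SV2 '$NF == name { print $2 }'
list_output='ID 258 top level 5 path SV1
ID 259 top level 5 path SV2'

subvol_id() {
    # $1 = subvolume path exactly as printed in the last column
    printf '%s\n' "$list_output" | awk -v name="$1" '$NF == name { print $2 }'
}

subvol_id SV2    # prints: 259
```
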

What would we get if we created SV21 and SV22 inside SV2? Let's try! Before going further, you should be aware that a subvolume is created by invoking the magic command '''btrfs subvolume create''':

<console>
###i## cd /mnt/SV2
###i## btrfs subvolume create SV21
Create subvolume './SV21'
###i## btrfs subvolume create SV22
Create subvolume './SV22'
###i## btrfs subvolume list /mnt
ID 258 top level 5 path SV1
ID 259 top level 5 path SV2
ID 260 top level 5 path SV2/SV21
ID 261 top level 5 path SV2/SV22
</console>

Again, invoking '''ls''' in /mnt/SV2 reports the subvolumes as being directories:

<console>
###i## ls -l
total 0
drwx------ 1 root root 0 May 23 13:15 SV21
drwx------ 1 root root 0 May 23 13:15 SV22
</console>

== Changing the point of view on the subvolumes hierarchy ==

At some point in our boxes analogy we talked about what we see and what we don't see depending on our location in the hierarchy. Here lies a big important point: whereas most BTRFS users mount the root subvolume (subvolume id = 0; we will retain the ''root subvolume'' terminology) in their VFS hierarchy, thus making the whole hierarchy contained in the BTRFS volume visible, it is absolutely possible to mount only a ''subset'' of it. How is that possible? Simple: just specify the subvolume number when you invoke mount. For example, to mount the hierarchy in the VFS starting at subvolume SV22 (261), do the following:

<console>
###i## mount -o subvolid=261 /dev/loop0 /mnt
</console>

Here lies an important notion not disclosed in the previous paragraph: although both directories and subvolumes can act as containers, '''only subvolumes can be mounted in a VFS hierarchy'''. This is a fundamental aspect to remember: you cannot mount a sub-part of a subvolume in the VFS; you can only mount the subvolume itself. Considering the hierarchy schema in the previous section, if you want to access the directory D3 you have three possibilities:

# Mount the non-named subvolume (numbered 0) and access D3 through /mnt/SV2/SV22/D3 if the non-named subvolume is mounted in /mnt
# Mount the subvolume SV2 (numbered 259) and access D3 through /mnt/SV22/D3 if the subvolume SV2 is mounted in /mnt
# Mount the subvolume SV22 (numbered 261) and access D3 through /mnt/D3 if the subvolume SV22 is mounted in /mnt

This is accomplished by the following commands, respectively:

<console>
###i## mount -o subvolid=0 /dev/loop0 /mnt
###i## mount -o subvolid=259 /dev/loop0 /mnt
###i## mount -o subvolid=261 /dev/loop0 /mnt
</console>

{{fancynote|When a subvolume is mounted in the VFS, everything located "above" the subvolume is hidden. Concretely, if you mount the subvolume numbered 261 in /mnt, you only see what is under SV22; you won't see anything located outside SV22, such as SV21, SV2, D1, SV1, etc. }}

== The default subvolume ==

$100 questions:
1. "If I don't specify 'subvolid' on the command line, how does the kernel know which one of the subvolumes it has to mount?"
2. "Does omitting 'subvolid' automatically mean 'mount the subvolume numbered 0'?"
Answers:
1. BTRFS magic! ;-)
2. No, not necessarily; you can choose something other than the non-named subvolume.

When you create a brand new BTRFS filesystem, the system not only creates the initial root subvolume (numbered 0) but also tags it as being the '''default subvolume'''. When you ask the operating system to mount a subvolume contained in a BTRFS volume without specifying a subvolume number, it determines which of the existing subvolumes has been tagged as the "default subvolume" and mounts it. If none of the existing subvolumes carries the "default subvolume" tag (e.g. because the default subvolume has been deleted), the mount command gives up with a rather cryptic message:

<console>
###i## mount /dev/loop0 /mnt
mount: No such file or directory
</console>

It is also possible to change at any time which subvolume contained in a BTRFS volume is considered the default subvolume. This is accomplished with '''btrfs subvolume set-default'''. The following tags the subvolume 261 as being the default:

<console>
###i## btrfs subvolume set-default 261 /mnt
</console>

After that operation, the two following commands are exactly equivalent:

<console>
###i## mount /dev/loop0 /mnt
###i## mount -o subvolid=261 /dev/loop0 /mnt
</console>
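The same choice can be made permanent in <tt>/etc/fstab</tt>. Below is an illustrative sketch only (the device name and mount point are placeholders): btrfs accepts the subvolume either by number with <tt>subvolid=</tt> or by its path relative to the root subvolume with <tt>subvol=</tt>.

```text
# /etc/fstab excerpt (illustrative; adapt the device and mount point)
/dev/loop0   /mnt   btrfs   subvolid=261       0 0
# ...or, equivalently, referencing the same subvolume by its path:
#/dev/loop0  /mnt   btrfs   subvol=SV2/SV22    0 0
```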

{{fancynote|The chosen new default subvolume must be visible in the VFS when you invoke '''btrfs subvolume set-default'''. }}

== Deleting subvolumes ==

Question: "As subvolumes appear like directories, can I delete a subvolume by doing an rm -rf on it?"
Answer: Yes, you ''can'', but that way is not the most elegant, especially when the subvolume contains several gigabytes of data scattered over thousands of files, directories and maybe other subvolumes located inside the one you want to remove. It isn't elegant because ''rm -rf'' could take several minutes (or even hours!) to complete, whereas something else can do the same job in a fraction of a second.

"Huh?" Yes, perfectly possible, and here is the cool goodie for the readers who arrived at this point: when you want to remove a subvolume, use '''btrfs subvolume delete''' instead of '''rm -rf'''. That btrfs command will remove the subvolume in a fraction of a second, even if it contains several gigabytes of data!

{{fancywarning|* You can '''never''' remove the root subvolume of a BTRFS volume as '''btrfs subvolume delete''' expects a subvolume name (again: this is not a flaw in the design of BTRFS; removing the subvolume numbered 0 would destroy the entirety of a BTRFS volume... too dangerous).
* If the subvolume you delete was tagged as the default subvolume, you will have to designate another default subvolume or explicitly tell the system which one of the subvolumes has to be mounted. }}

An example: considering our initial example given [[BTRFS_Fun#..._applied_to_BTRFS.21_.28or_what_is_a_volume.2Fsubvolume.29|above]] and supposing you have mounted the non-named subvolume numbered 0 in /mnt, you can remove SV22 by doing:

<console>
###i## btrfs subvolume delete /mnt/SV2/SV22
</console>

Obviously, the BTRFS volume will look like this after the operation:

<pre>
.....BTRFS Volume................................................................................................................................
.
.  (0) --+-> Subvolume SV1 (258) ---> Directory D1 --+-> File F1
.        |                                          |
.        |                                          +-> File F2
.        |
.        +-> Directory D1 --+-> File F1
.        |                  |
.        |                  +-> File F2
.        |                  |
.        |                  +-> File F3
.        |                  |
.        |                  +-> Directory D11 ---> File F4
.        +-> File F1
.        |
.        +-> Subvolume SV2 (259) --+-> Subvolume SV21 (260)
.....................................................................................................................................
</pre>

== Snapshots and subvolumes ==

If you have a good comprehension of what a subvolume is, understanding what a snapshot is won't be a problem: a snapshot is a subvolume with some initial contents. "Some initial contents" here means an exact copy of another subvolume at the moment the snapshot was taken.

When you think about snapshots, think about copy-on-write: the data blocks are not duplicated between a mounted subvolume and its snapshot unless you start to make changes to the files (a snapshot can occupy nearly zero extra space on the disk). As time goes on, more and more data blocks will be changed, thus making snapshots "occupy" more and more space on the disk. It is therefore recommended to keep only a minimal set of them and remove unnecessary ones to avoid wasting space on the volume.

The following illustrates how to take a snapshot of the VFS root:
<console>
###i## btrfs subvolume snapshot / /snap-2011-05-23
Create a snapshot of '/' in '//snap-2011-05-23'
</console>

Once created, the snapshot will persist in /snap-2011-05-23 as long as you don't delete it. Note that the snapshot contents will remain exactly as they were at the time it was taken (as long as you don't make changes... BTRFS snapshots are writable!). A drawback of having snapshots: if you delete some files in the original filesystem, the snapshot still contains them and the disk blocks can't be reclaimed as free space. Remember to remove unwanted snapshots and keep a bare minimal set of them.

== Listing and deleting snapshots ==

As there is no distinction between a snapshot and a subvolume, snapshots are managed with the exact same commands, especially when the time has come to delete some of them. An interesting feature in BTRFS is that snapshots are writable: you can take a snapshot and make changes in the files/directories it contains. A word of caution: there are no undo capabilities! What has been changed has been changed forever... If you need to do several tests, just take several snapshots or, better yet, snapshot your snapshot, then do whatever you need in this copy-of-the-copy :-).

== Using snapshots for system recovery (aka Back to the Future) ==

Here is where BTRFS can literally be a lifeboat. Suppose you want to apply some updates via '''emerge -uaDN @world''' but you want to be sure that you can jump back into the past in case something goes seriously wrong after the system update (does libpng14 remind you of anything?!). Here is the "putting-things-together" part of the article!

The following only applies if your VFS root and the system directories containing '''/sbin, /bin, /usr, /etc....''' are located on a BTRFS volume. To make things simple, the whole structure is supposed to be located in the SAME subvolume of the same BTRFS volume.

To jump back into the past you have at least two options:

# Fiddle with the default subvolume numbers
# Use the kernel command line parameters in the bootloader configuration files

In all cases you must take a snapshot of your VFS root '''before''' updating the system:

<console>
###i## btrfs subvolume snapshot / /before-updating-2011-05-24
Create a snapshot of '/' in '//before-updating-2011-05-24'
</console>
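The dated snapshot name used above is just a naming convention; a tiny helper can generate it consistently. This is an illustrative sketch (the <tt>snap_name</tt> function is hypothetical, not a btrfs tool):

```shell
# Hypothetical helper: build a dated snapshot path following the
# /before-updating-YYYY-MM-DD convention used in this article.
snap_name() {
    printf '/before-updating-%s\n' "$(date +%Y-%m-%d)"
}

snap_name
# then, as root, something like:
#   btrfs subvolume snapshot / "$(snap_name)"
```
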

{{fancynote|Hint: You can create an empty file at the root of your snapshot with the name of your choice to help you easily identify which subvolume is the currently mounted one (e.g. if the snapshot has been named '''before-updating-2011-05-24''', you can use a slightly different name like '''current-is-before-updating-2011-05-24''' <nowiki>=></nowiki> '''touch /before-updating-2011-05-24/current-is-before-updating-2011-05-24'''). This is extremely useful if you are dealing with several snapshots.}}

There is no "better" way; it's just a question of personal preference.

=== Way #1: Fiddle with the default subvolume number ===

'''Hypothesis:'''
* Your "production" VFS root partition resides in the root subvolume (subvolid=0)
* Your /boot partition (where the bootloader configuration files are stored) is on another standalone partition

First search for the newly created subvolume number:

<console>
###i## btrfs subvolume list /
'''ID 256''' top level 5 path before-updating-2011-05-24
</console>

'256' is the ID to be retained (of course, this ID will differ in your case).

Now, change the default subvolume of the BTRFS volume to designate the subvolume (snapshot) ''before-updating'' and not the root subvolume, then reboot:

<console>
###i## btrfs subvolume set-default 256 /
</console>

Once the system has rebooted, and if you followed the advice in the previous paragraph that suggests creating an empty file with the same name as the snapshot, you should be able to see that the mounted VFS root is the copy held by the snapshot ''before-updating-2011-05-24'':

<console>
###i## ls -l /
...
-rw-rw-rw-  1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
</console>

The correct subvolume has been used for mounting the VFS! Excellent! This is now the time to mount your "production" VFS root (remember the root subvolume can only be accessed via its identification number, i.e. ''0''):

<console>
###i## mount -o subvolid=0 /dev/sda2 /mnt
###i## mount
...
/dev/sda2 on /mnt type btrfs (rw,subvolid=0)
</console>

Oh, by the way, as the root subvolume is now mounted in <tt>/mnt</tt>, let's try something, just for the sake of the demonstration:

<console>
###i## ls /mnt
...
drwxr-xr-x  1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
###i## btrfs subvolume list /mnt
ID 256 top level 5 path before-updating-2011-05-24
</console>

No doubt possible :-)

Time to rollback! For this, '''rsync''' will be used in the following way:

<console>
###i## rsync --progress -aHAX --exclude=/proc --exclude=/dev --exclude=/sys --exclude=/mnt / /mnt
</console>

Basically we are asking rsync to:
* preserve timestamps, hard and symbolic links, owner/group IDs, ACLs and any extended attributes (refer to the rsync manual page for further details on the options used) and to report its progress
* ignore the mount points where virtual filesystems are mounted (procfs, sysfs...)
* avoid an endless recursion by excluding /mnt itself (you can speed up the process by excluding some extra directories if you are sure they don't hold any important changes, or any changes at all, like /var/tmp/portage for example)
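If you want the exclusion list to be easy to extend (for example with /var/tmp/portage as suggested above), the command line can be assembled from a list. The <tt>build_rsync_cmd</tt> helper below is a hypothetical sketch; it only prints the command and copies nothing until you actually run the printed line as root:

```shell
# Hypothetical helper: assemble the rollback command from a list of
# exclusions so extra directories are easy to add. It only prints the
# resulting command line; nothing is copied by the helper itself.
build_rsync_cmd() {
    excludes="/proc /dev /sys /mnt"
    cmd="rsync --progress -aHAX"
    for e in $excludes; do
        cmd="$cmd --exclude=$e"
    done
    printf '%s / /mnt\n' "$cmd"
}

build_rsync_cmd
```
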

Be patient! The rsync may take several minutes or hours depending on the amount of data to process...

Once finished, you will have to set the default subvolume back to the root subvolume:

<console>
###i## btrfs subvolume set-default 0 /mnt
</console>
  
{{fancywarning|'''DO NOT ENTER / instead of /mnt in the above command; it won't work and you will be under the snapshot before-updating-2011-05-24 the next time the machine reboots.'''

The reason is that the subvolume number must be "visible" from the path given at the end of the '''btrfs subvolume set-default''' command line. Again, refer to the boxes analogy: in our context we are in a sub-box numbered 256 which is located *inside* the box numbered 0 (so it can neither see nor interfere with it).
}}
  
Now just reboot and you should be in business again! Once you have rebooted, just check that you are really under the right subvolume:

<console>
###i## ls /
...
drwxr-xr-x  1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
###i## btrfs subvolume list /
ID 256 top level 5 path before-updating-2011-05-24
</console>
  
At the right place? Excellent! You can now delete the snapshot if you wish, or better: keep it as a lifeboat of the "last good known system state."

=== Way #2: Change the kernel command line in the bootloader configuration files ===

First search for the newly created subvolume number:

<console>
###i## btrfs subvolume list /
'''ID 256''' top level 5 path before-updating-2011-05-24
</console>
  
'256' is the ID to be retained (it can differ in your case).

Now, with your favourite text editor, edit the adequate kernel command line in your bootloader configuration (<tt>/etc/boot.conf</tt>). This file is typically organized in several sections (one per kernel present on the system plus some global settings), like the excerpt below:

<pre>
set timeout=5
set default=0

# Production kernel
menuentry "Funtoo Linux production kernel (2.6.39-gentoo x86/64)" {
  insmod part_msdos
  insmod ext2
  ...
  set root=(hd0,1)
  linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2
  initrd /initramfs-x86_64-2.6.39-gentoo
}
...
</pre>
  
Find the correct kernel line and add one of the following statements after root=/dev/sdX:

<pre>
rootflags=subvol=before-updating-2011-05-24
  - Or -
rootflags=subvolid=256
</pre>

{{fancywarning|If the kernel you want to use has been generated with Genkernel, you '''MUST''' use ''real_rootflags<nowiki>=</nowiki>subvol<nowiki>=</nowiki>''... instead of ''rootflags<nowiki>=</nowiki>subvol''<nowiki>=</nowiki>..., at the penalty of not having your rootflags taken into consideration by the kernel on reboot. }}

Applied to the previous example, you will get the following if you referred to the subvolume by its name:

<pre>
set timeout=5
set default=0

# Production kernel
menuentry "Funtoo Linux production kernel (2.6.39-gentoo x86/64)" {
  insmod part_msdos
  insmod ext2
  ...
  set root=(hd0,1)
  linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2 rootflags=subvol=before-updating-2011-05-24
  initrd /initramfs-x86_64-2.6.39-gentoo
}
...
</pre>

Or the following if you referred to the subvolume by its identification number:

<pre>
set timeout=5
set default=0

# Production kernel
menuentry "Funtoo Linux production kernel (2.6.39-gentoo x86/64)" {
  insmod part_msdos
  insmod ext2
  ...
  set root=(hd0,1)
  linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2 rootflags=subvolid=256
  initrd /initramfs-x86_64-2.6.39-gentoo
}
...
</pre>
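If you prefer to script the change rather than use a text editor, a sed substitution can append the parameter. The sketch below applies it to a copy of the example kernel line (the pattern assumes the <tt>root=/dev/sda2</tt> shown above; adapt it, and keep a backup before touching the real configuration file):

```shell
# Sketch: append rootflags to a kernel command line with sed. The edit is
# demonstrated on a shell variable holding the example line; for a real
# system you would run a similar expression against /etc/boot.conf.
kernel_line='  linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2'
printf '%s\n' "$kernel_line" | sed 's|root=/dev/sda2|& rootflags=subvolid=256|'
# prints:   linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2 rootflags=subvolid=256
```
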

Once the modifications are done, save your changes and take the necessary extra steps to commit the configuration changes to the first sectors of the disk if needed (this mostly applies to users of LILO; GRUB and SILO do not need to be refreshed), then reboot.

Once the system has rebooted, and if you followed the advice in the previous paragraph that suggests creating an empty file with the same name as the snapshot, you should be able to see that the mounted VFS root is the copy held by the snapshot ''before-updating-2011-05-24'':
  
<console>
###i## ls -l /
...
-rw-rw-rw-   1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
</console>
  
The correct subvolume has been used for mounting the VFS! Excellent! This is now the time to mount your "production" VFS root (remember the root subvolume can only be accessed via its identification number 0):

<console>
###i## mount -o subvolid=0 /dev/sda2 /mnt
###i## mount
...
/dev/sda2 on /mnt type btrfs (rw,subvolid=0)
</console>
  
Time to rollback! For this, '''rsync''' will be used in the following way:
<console>
###i## rsync --progress -aHAX --exclude=/proc --exclude=/dev --exclude=/sys --exclude=/mnt / /mnt
</console>
  
Here, please refer to what has been said in [[BTRFS_Fun#Way_.231:_Fiddle_with_the_default_subvolume_number|Way #1]] concerning the rsync options used. Once everything is in place again, edit your bootloader configuration to remove the rootflags/real_rootflags kernel parameter, reboot, and check that you are really under the right subvolume:

<console>
###i## ls /
...
drwxr-xr-x  1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
###i## btrfs subvolume list /
ID 256 top level 5 path before-updating-2011-05-24
</console>

At the right place? Excellent! You can now delete the snapshot if you wish, or better: keep it as a lifeboat of the "last good known system state."

= Some BTRFS practices / returns of experience / gotchas =

* Although BTRFS is still evolving, at the date of writing it (still) is '''an experimental filesystem and should not be used for production systems or for storing critical data''' (even if the data is non-critical, having backups on a partition formatted with a "stable" filesystem like ReiserFS or ext3/4 is recommended).
* From time to time some changes are brought to the metadata (the BTRFS format is not definitive at the date of writing) and a BTRFS partition might not be usable with older Linux kernels (this happened with Linux 2.6.31).
* More and more Linux distributions are proposing the filesystem as an alternative to ext4.
* Some reported gotchas: [https://btrfs.wiki.kernel.org/index.php/Gotchas https://btrfs.wiki.kernel.org/index.php/Gotchas]
* Playing around with BTRFS can be a bit tricky, especially when dealing with default volumes and mount points (again: the boxes analogy).
* Using compression (e.g. LZO => mount -o compress=lzo) on the filesystem can improve throughput performance; however, many files nowadays are already compressed at the application level (music, pictures, videos...).
* Using the space caching capabilities (mount -o space_cache) seems to bring some slight extra performance improvements.
* There is a very [https://lkml.org/lkml/2010/6/18/144 interesting discussion on BTRFS design limitations with B-Trees] on LKML. We ''strongly'' encourage you to read it.

== Deploying a Funtoo instance in a subvolume other than the root subvolume ==

Some Funtoo core devs have used BTRFS for many months and no major glitches have been reported so far (except a non-aligned memory access trap on SPARC64 in a checksum calculation routine, which recent kernels may have corrected), apart from an issue a long time ago that was more related to a kernel crash caused by a bug corrupting some internal data than to the filesystem code itself.

The following can simplify your life in case of recovery '''(not tested)''':

When you prepare the disk space that will hold the root of your future Funtoo instance (and so will hold /usr /bin /sbin /etc etc...), don't use the root subvolume but take an extra step to define a subvolume, as illustrated below:

<console>
###i## fdisk /dev/sda2
....
###i## mkfs.btrfs /dev/sda2
###i## mount /dev/sda2 /mnt/funtoo
###i## btrfs subvolume create /mnt/funtoo/live-vfs-root-20110523
###i## chroot /mnt/funtoo/live-vfs-root-20110523 /bin/bash
</console>

Then either:

* Set /live-vfs-root-20110523 as being the default subvolume (btrfs subvolume set-default... remember to inspect the subvolume identification number)
* Use rootflags / real_rootflags (always use real_rootflags for a kernel generated with Genkernel) on the kernel command line in your bootloader configuration file

Technically speaking, it won't change your life BUT it will at system recovery time: when you want to roll back to a functional VFS root copy because something happened (buggy system package, too aggressive cleanup that removed Python, dead compiling toolchain...), you can avoid a time-costly rsync, at the cost of a bit of overhead when taking a snapshot.

Here again you have two ways to recover the system:

* '''fiddling with the default subvolume:'''
** Mount the non-named volume somewhere (e.g. '''mount -o subvolid=0 /dev/sdX /mnt''')
** Take a snapshot (remember to check its identification number) of your current subvolume and store it under the root subvolume you have just mounted ('''btrfs subvolume snapshot / /mnt/before-updating-20110524''')
** Update your system or do whatever other "dangerous" operation
** If you need to return to the latest good known system state, just set the default subvolume to the snapshot just taken ('''btrfs subvolume set-default ''<snapshot number here>'' /mnt''')
** Reboot
** Once you have rebooted, just mount the root subvolume again and delete the subvolume that corresponds to the failed system update ('''btrfs subvolume delete /mnt/<buggy VFS root snapshot name here>''')

* '''fiddling with the kernel command line:'''
** Mount the non-named volume somewhere (e.g. '''mount -o subvolid=0 /dev/sdX /mnt''')
** Take a snapshot (remember to check its identification number) of your current subvolume and store it under the root subvolume you have just mounted ('''btrfs subvolume snapshot / /mnt/before-updating-20110524''')
** Update your system or do whatever other "dangerous" operation
** If you need to return to the latest good known system state, just set the rootflags/real_rootflags as demonstrated in previous paragraphs in your bootloader configuration file
** Reboot
** Once you have rebooted, just mount the root subvolume again and delete the subvolume that corresponds to the failed system update ('''btrfs subvolume delete /mnt/<buggy VFS root snapshot name here>''')

== Space recovery / defragmenting the filesystem ==

{{Fancytip|From time to time it is advised to re-optimize the filesystem structures and data blocks in a subvolume. In BTRFS terminology this is called a defragmentation and it can only be performed when the subvolume is mounted in the VFS (online defragmentation):}}

<console>
###i## btrfs filesystem defrag /mnt
</console>

You can still access the subvolume, even change its contents, while a defragmentation is running.

It is also a good idea to remove the snapshots you don't use anymore, especially if huge files and/or lots of files are changed, because snapshots will still hold some blocks that could be reused.

== SSE 4.2 boost ==

If your CPU supports hardware calculation of CRC32 (e.g. the Intel Nehalem series and later, and the AMD Bulldozer series), you are encouraged to enable that support in your kernel, since BTRFS makes aggressive use of those checksums. Just check that you have enabled ''CRC32c INTEL hardware acceleration'' in ''Cryptographic API'', either as a module or as a built-in feature.
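To see whether a given x86 CPU advertises the instruction, and which kernel options are involved, the following sketch may help (the CONFIG names are the usual ones under ''Cryptographic API''; verify them against your kernel version):

```shell
# Quick check (Linux): does this CPU advertise SSE 4.2 (which carries the
# CRC32 instruction)? Reads the flags exposed in /proc/cpuinfo.
grep -q sse4_2 /proc/cpuinfo && echo "sse4_2: yes" || echo "sse4_2: no"

# Kernel configuration fragment enabling the accelerated driver
# (found under "Cryptographic API" in menuconfig):
#   CONFIG_CRYPTO_CRC32C=y
#   CONFIG_CRYPTO_CRC32C_INTEL=y
```
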

= Recovering an apparently dead BTRFS filesystem =

Keeping filesystem metadata coherent is critical in filesystem design. Losing some data blocks (i.e. having some corrupted files) is less critical than having a screwed-up and unmountable filesystem, especially if you do backups on a regular basis '''(the rule with BTRFS is *do backups*; BTRFS has no mature filesystem repair tool and you *will* end up having to re-create your filesystem from scratch again sooner or later).'''

== Mounting with the recovery option (Linux 3.2 and beyond) ==

If you are using '''Linux 3.2 and later (only!)''', you can use the ''recovery'' option to make BTRFS seek a usable copy of the tree root (several copies of it exist on the disk). Just mount your filesystem as:

<console>
###i## mount -o recovery /dev/yourBTRFSvolume /mount/point
</console>
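A recovery attempt naturally forms a fallback chain: try a normal mount first, then retry with the ''recovery'' option. The <tt>try_mount</tt> helper below is a hypothetical sketch, not a btrfs tool (and the recovery option still requires Linux 3.2 or later):

```shell
# Hypothetical helper: try a normal mount first, then retry with
# -o recovery, reporting which attempt (if any) succeeded.
try_mount() {
    dev="$1"; mnt="$2"
    if mount "$dev" "$mnt" 2>/dev/null; then
        echo "mounted normally"
    elif mount -o recovery "$dev" "$mnt" 2>/dev/null; then
        echo "mounted with -o recovery"
    else
        echo "mount failed"
    fi
}

# usage (as root): try_mount /dev/yourBTRFSvolume /mount/point
```
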

== btrfs-select-super / btrfs-zero-log ==

Two other handy tools exist but they are not deployed by default by the ''sys-fs/btrfs-progs'' ebuilds (even ''btrfs-progs-9999''):

* btrfs-select-super
* btrfs-zero-log

=== Building the btrfs-progs goodies ===

The two tools this section is about are not built by default, and the Funtoo ebuilds do not build them either for the moment. So you must build them manually:

<console>
###i## mkdir ~/src
###i## cd ~/src
###i## git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
###i## cd btrfs-progs
###i## make && make btrfs-select-super && make btrfs-zero-log
</console>

{{fancynote|In the past, ''btrfs-select-super'' and ''btrfs-zero-log'' were lying in the "next" branch; this is no longer the case and those tools are available in the master branch. }}
 +
 
 +
=== Fixing dead superblock ===
 +
 
 +
In case of a corrupted superblock, start by asking btrfsck to use an alternate copy of the superblock instead of the superblock #0. This is achieved via the -s option followed by the number of the alternate copy you wish to use. In the following example we ask for using the superblock copy #2 of /dev/sda7:
 +
 
 +
<console>
 +
###i## ./btrfsck --s 2 /dev/sd7
 +
</console>
 +
 
 +
When btrfsck is happy, use btrfs-super-select to restore the default superblock (copy #0) with a clean copy.  In the following example we ask for restoring the superblock of /dev/sda7 with its copy #2:
 +
 
 +
<console>
 +
###i## ./btrfs-super-select -s 2  /dev/sda7
 +
</console>
 +
 
 +
Note that this will overwrite all the other supers on the disk, which means you really only get one shot at it. 
 +
 
 +
'''If you run btrfs-super-select prior prior to figuring out which one is good, you've lost your chance to find a good one.'''
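Since you only get that one shot, it is worth probing several superblock copies read-only before committing. A small sketch (the device path is an example, and the btrfsck invocation is only echoed here as a dry run — remove the echo to really probe):

```shell
#!/bin/sh
# Sketch: list the btrfsck probe commands for the first superblock copies
# before committing to btrfs-select-super. /dev/sda7 is an example device;
# each command is printed, not executed.
DEV=/dev/sda7
for copy in 0 1 2; do
    echo "probe copy $copy: ./btrfsck -s $copy $DEV"
done
```

Run the probes one by one and note which copy btrfsck is happy with; only then feed that copy number to btrfs-select-super.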

=== Clearing the BTRFS journal ===

'''This will only help with one specific problem!'''

If you are unable to mount a BTRFS partition after a hard shutdown, crash or power loss, it may be due to faulty log playback in kernels prior to 3.2. The first thing to try is updating your kernel and mounting again. If this isn't possible, an alternate solution lies in truncating the BTRFS journal, but only if you see "replay_one_*" functions in the oops callstack.
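The check for "replay_one_*" frames can be scripted. A sketch (TRACE is a canned example standing in for real dmesg output):

```shell
#!/bin/sh
# Sketch: decide whether btrfs-zero-log is indicated by looking for
# replay_one_* frames in the oops callstack. TRACE is a canned example;
# in real use, feed it the output of `dmesg` instead.
TRACE='Call Trace:
 replay_one_buffer+0x3c/0x90 [btrfs]
 walk_log_tree+0x1a0/0x2b0 [btrfs]'

if printf '%s\n' "$TRACE" | grep -q 'replay_one_'; then
    echo "replay_one_* found: truncating the log may help"
else
    echo "no replay_one_* frame: btrfs-zero-log is not indicated"
fi
```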

To truncate the journal of a BTRFS partition (and thereby lose any changes that only exist in the log!), just give the filesystem to ''btrfs-zero-log'':

<console>
###i## ./btrfs-zero-log /dev/sda7
</console>

This is not a generic technique; it works by permanently throwing away a small amount of potentially good data.

== Using btrfsck ==

{{fancywarning|Extremely experimental...}}

If one thing is famous in the BTRFS world, it is the long-wished-for fully functional ''btrfsck''. A read-only version of the tool existed for years; however, at the beginning of 2012, BTRFS developers made a public and very experimental release: the secret jewel lies in the branch ''dangerdonteveruse'' of the BTRFS Git repository held by Chris Mason on kernel.org.

<console>
###i## git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
###i## cd btrfs-progs
###i## git checkout dangerdonteveruse
###i## make
</console>

So far the tool can:

* Fix errors in the extents tree and in block group accounting
* Wipe the CRC tree and create a brand new one (you can then mount the filesystem with CRC checking disabled)

To repair:

<console>
###i## btrfsck --repair /dev/''yourBTRFSvolume''
</console>

To wipe the CRC tree:

<console>
###i## btrfsck --init-csum-tree /dev/''yourBTRFSvolume''
</console>
  
Two other options exist in the source code: ''--super'' (equivalent of btrfs-select-super?) and ''--init-extent-tree'' (clears out any extent?)

= Final words =

We have only given the broad lines here, but BTRFS can be very tricky, especially when several subvolumes coming from several BTRFS volumes are used. And remember: BTRFS is still experimental at the date of writing :)
== Lessons learned ==

* Very interesting, but still lacks some important features present in ZFS like RAID-Z, virtual volumes, management by attributes, filesystem streaming, etc.
* Extremely interesting for Gentoo/Funtoo system partitions (snapshot/rollback capabilities). However, not integrated in portage yet.
* If possible, use a file monitoring tool like TripWire; this is handy to see which files have been corrupted once the filesystem is recovered, or if a bug happens.
* '''It is highly advised to not use the root subvolume when deploying a new Funtoo instance''' or, in the more general case, to not put any kind of data on it. Rolling back a data snapshot will be much easier and much less error prone (no copy process, just a matter of 'swapping' the subvolumes).
* Backup, backup, backup your data! ;)

[[Category:Labs]]
[[Category:First Steps]]
[[Category:Articles]]
[[Category:Featured]]
[[Category:Filesystems]]

{{ArticleFooter}}

Latest revision as of 09:41, December 28, 2014

Important

BTRFS is still experimental even with latest Linux kernels (3.4-rc at date of writing) so be prepared to lose some data sooner or later or hit a severe issue/regressions/"itchy" bugs. Subliminal message: Do not put critical data on BTRFS partitions.


Introduction

BTRFS is an advanced filesystem mostly contributed by Sun/Oracle whose origins take place in 2007. A good summary is given in [1]. BTRFS aims to provide a modern answer for making storage more flexible and efficient. According to its main contributor, Chris Mason, the goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable." (Ref. http://en.wikipedia.org/wiki/Btrfs).

Btrfs, often compared to ZFS, offers some interesting features like:

  • Using very little fixed-location metadata, thus allowing an existing ext2/ext3 filesystem to be "upgraded" in-place to BTRFS.
  • Operations are transactional
  • Online volume defragmentation (online filesystem check is on the radar but is not yet implemented).
  • Built-in storage pool capabilities (no need for LVM)
  • Built-in RAID capabilities (both for the data and filesystem metadata). RAID-5/6 is planned for 3.5 kernels
  • Capabilities to grow/shrink the volume
  • Subvolumes and snapshots (extremely powerful, you can "rollback" to a previous filesystem state as if nothing had happened).
  • Copy-On-Write
  • Usage of B-Trees to store the internal filesystem structures (B-Trees are known to have a logarithmic growth in depth, thus making them more efficient when scanning)

Requirements

A recent Linux kernel (the BTRFS metadata format evolves from time to time, and mounting with a recent Linux kernel can make the BTRFS volume unreadable with an older kernel revision, e.g. Linux 2.6.31 vs Linux 2.6.30). You must also use sys-fs/btrfs-progs (0.19, or better, -9999, which points to the git repository).

Playing with BTRFS storage pool capabilities

Whereas it would be possible to use btrfs just as you are used to under a non-LVM system, it shines when using its built-in storage pool capabilities. Tired of playing with LVM? :-) Good news: you do not need it anymore with btrfs.

Setting up a storage pool

BTRFS terminology is a bit confusing. If you already have used another 'advanced' filesystem like ZFS or some mechanism like LVM, it's good to know that there are many correlations. In the BTRFS world, the word volume corresponds to a storage pool (ZFS) or a volume group (LVM). Ref. http://www.rkeene.org/projects/info/wiki.cgi/165

The test bench uses disk images through loopback devices. Of course, in a real world case, you will use local drives or units through a SAN. To start with, 5 devices of 1 GiB are allocated:

# dd if=/dev/zero of=/tmp/btrfs-vol0.img bs=1G count=1
# dd if=/dev/zero of=/tmp/btrfs-vol1.img bs=1G count=1
# dd if=/dev/zero of=/tmp/btrfs-vol2.img bs=1G count=1
# dd if=/dev/zero of=/tmp/btrfs-vol3.img bs=1G count=1
# dd if=/dev/zero of=/tmp/btrfs-vol4.img bs=1G count=1
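The five dd invocations can also be scripted. The sketch below is a variant using sparse files (count=0 with seek=), which cost no real disk space until written to — handy for test benches. The directory and size mirror the example above:

```shell
#!/bin/sh
# Sketch: create the five 1 GiB backing images in a loop. Using count=0
# with seek= makes the files sparse (no space consumed up front).
DIR=${DIR:-/tmp}
SIZE=${SIZE:-1G}
for i in 0 1 2 3 4; do
    dd if=/dev/zero of="$DIR/btrfs-vol$i.img" bs=1 count=0 seek="$SIZE" 2>/dev/null
done
ls "$DIR"/btrfs-vol?.img
```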

Then attached:

# losetup /dev/loop0 /tmp/btrfs-vol0.img
# losetup /dev/loop1 /tmp/btrfs-vol1.img
# losetup /dev/loop2 /tmp/btrfs-vol2.img
# losetup /dev/loop3 /tmp/btrfs-vol3.img
# losetup /dev/loop4 /tmp/btrfs-vol4.img

Creating the initial volume (pool)

BTRFS uses different strategies to store data and filesystem metadata (ref. https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices).

By default the behavior is:

  • metadata is replicated on all of the devices. If a single device is used, the metadata is duplicated inside this single device (useful in case of corruption or bad sectors: there is a higher chance that one of the two copies is clean). To tell btrfs to maintain only a single copy of the metadata, use -m single. Remember: dead metadata = dead volume with no chance of recovery.
  • data is spread amongst all of the devices (this means no redundancy; any data block left on a defective device will be inaccessible)

To create a BTRFS volume made of multiple devices with default options, use:

# mkfs.btrfs /dev/loop0 /dev/loop1 /dev/loop2 

To create a BTRFS volume made of a single device with a single copy of the metadata (dangerous!), use:

# mkfs.btrfs -m single /dev/loop0

To create a BTRFS volume made of multiple devices with metadata spread amongst all of the devices, use:

# mkfs.btrfs -m raid0 /dev/loop0 /dev/loop1 /dev/loop2 

To create a BTRFS volume made of multiple devices, with metadata spread amongst all of the devices and data mirrored on all of the devices (you probably don't want this in a real setup), use:

# mkfs.btrfs -m raid0 -d raid1 /dev/loop0 /dev/loop1 /dev/loop2 

To create a fully redundant BTRFS volume (data and metadata mirrored amongst all of the devices), use:

# mkfs.btrfs -d raid1 /dev/loop0 /dev/loop1 /dev/loop2 
Note

Technically you can use anything as a physical volume: you can have a volume composed of 2 local hard drives, 3 USB keys, 1 loopback device pointing to a file on a NFS share and 3 logical devices accessed through your SAN (you would be an idiot, but you can, nevertheless). Having different physical volume sizes would lead to issues, but it works :-).

Checking the initial volume

To verify the devices of which a BTRFS volume is composed, just use btrfs-show device (old style) or btrfs filesystem show device (new style). You need to specify only one of the devices (the metadata has been designed to keep track of which device is linked to which other device). If the initial volume was set up like this:

# mkfs.btrfs /dev/loop0 /dev/loop1 /dev/loop2

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

adding device /dev/loop1 id 2
adding device /dev/loop2 id 3
fs created label (null) on /dev/loop0
        nodesize 4096 leafsize 4096 sectorsize 4096 size 3.00GB
Btrfs Btrfs v0.19

It can be checked with one of these commands (They are equivalent):

# btrfs filesystem show /dev/loop0
# btrfs filesystem show /dev/loop1
# btrfs filesystem show /dev/loop2

The result is the same for all commands:

Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 3 FS bytes used 28.00KB
        devid    3 size 1.00GB used 263.94MB path /dev/loop2
        devid    1 size 1.00GB used 275.94MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1

To show all of the volumes that are present:

# btrfs filesystem show
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 3 FS bytes used 28.00KB
        devid    3 size 1.00GB used 263.94MB path /dev/loop2
        devid    1 size 1.00GB used 275.94MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1

Label: none  uuid: 1701af39-8ea3-4463-8a77-ec75c59e716a
        Total devices 1 FS bytes used 944.40GB
        devid    1 size 1.42TB used 1.04TB path /dev/sda2

Label: none  uuid: 01178c43-7392-425e-8acf-3ed16ab48813
        Total devices 1 FS bytes used 180.14GB
        devid    1 size 406.02GB used 338.54GB path /dev/sda4
Warning

The BTRFS wiki mentions that btrfs device scan should be performed. Consequence of not doing the incantation? Volume not seen?

Mounting the initial volume

BTRFS volumes can be mounted like any other filesystem. The cherry on top of the sundae is that the design of the BTRFS metadata makes it possible to mount the volume using any of its devices. The following commands are equivalent:

# mount /dev/loop0 /mnt
# mount /dev/loop1 /mnt
# mount /dev/loop2 /mnt

For every physical device used for mounting the BTRFS volume, df -h reports the same (in all cases 3 GiB of "free" space is reported):

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1      3.0G   56K  1.8G   1% /mnt

The following command prints very useful information (like how the BTRFS volume has been created):

# btrfs filesystem df /mnt      
Data, RAID0: total=409.50MB, used=0.00
Data: total=8.00MB, used=0.00
System, RAID1: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=204.75MB, used=28.00KB
Metadata: total=8.00MB, used=0.00

By the way, as you can see, for the btrfs command the mount point should be specified, not one of the physical devices.

Shrinking the volume

A common practice in system administration is to leave some head space, instead of using the whole capacity of a storage pool (just in case). With btrfs one can easily shrink volumes. Let's shrink the volume a bit (about 25%):

# btrfs filesystem resize -500m /mnt
# df -h
/dev/loop1      2.6G   56K  1.8G   1% /mnt

And yes, it is an on-line resize; there is no need to umount/shrink/mount. So no downtime! :-) However, a BTRFS volume requires a minimum size... if the shrink is too aggressive, the volume won't be resized:

# btrfs filesystem resize -1g /mnt  
Resize '/mnt' of '-1g'
ERROR: unable to resize '/mnt'
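The "about 25%" above can be computed instead of guessed. A tiny sketch with shell arithmetic (the 3072 MiB starting size mirrors the 3 GiB example volume; the resize command is only printed here, not executed):

```shell
#!/bin/sh
# Sketch: compute a resize argument that shrinks a volume by 25%.
# 3072 MiB is the example volume size; the command is only printed.
size_mib=3072
shrink_mib=$(( size_mib / 4 ))
echo "btrfs filesystem resize -${shrink_mib}m /mnt"
```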

Growing the volume

This is the opposite operation; you can make a BTRFS volume grow to reach a particular size (e.g. 150 more megabytes):

# btrfs filesystem resize +150m /mnt
Resize '/mnt' of '+150m'
# df -h
/dev/loop1      2.7G   56K  1.8G   1% /mnt

You can also take an "all you can eat" approach via the max option, meaning all of the possible space will be used for the volume:

# btrfs filesystem resize max /mnt
Resize '/mnt' of 'max'
# df -h
/dev/loop1      3.0G   56K  1.8G   1% /mnt

Adding a new device to the BTRFS volume

To add a new device to the volume:

# btrfs device add /dev/loop4 /mnt 
# btrfs filesystem show /dev/loop4 
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 4 FS bytes used 28.00KB
        devid    3 size 1.00GB used 263.94MB path /dev/loop2
        devid    4 size 1.00GB used 0.00 path /dev/loop4
        devid    1 size 1.00GB used 275.94MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1 

Again, no need to umount the volume first, as adding a device is an on-line operation (the device has no space used yet, hence the '0.00'). The operation is not finished, however: we must tell btrfs to prepare the new device (i.e. rebalance/mirror the metadata and the data between all of the devices):

# btrfs filesystem balance /mnt
# btrfs filesystem show /dev/loop4
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 4 FS bytes used 28.00KB
        devid    3 size 1.00GB used 110.38MB path /dev/loop2
        devid    4 size 1.00GB used 366.38MB path /dev/loop4
        devid    1 size 1.00GB used 378.38MB path /dev/loop0
        devid    2 size 1.00GB used 110.38MB path /dev/loop1
Note

Depending on the sizes and what is in the volume a balancing operation could take several minutes or hours.

Removing a device from the BTRFS volume

# btrfs device delete /dev/loop2 /mnt
# btrfs filesystem show /dev/loop0   
Label: none  uuid: 0a774d9c-b250-420e-9484-b8f982818c09
        Total devices 4 FS bytes used 28.00KB
        devid    4 size 1.00GB used 264.00MB path /dev/loop4
        devid    1 size 1.00GB used 268.00MB path /dev/loop0
        devid    2 size 1.00GB used 0.00 path /dev/loop1
        *** Some devices missing
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1      3.0G   56K  1.5G   1% /mnt

Here again, removing a device is totally dynamic and can be done as an on-line operation! Note that when a device is removed, its content is transparently redistributed among the other devices.

Obvious points:

  • ** DO NOT UNPLUG THE DEVICE BEFORE THE END OF THE OPERATION, DATA LOSS WILL RESULT**
  • If you have used raid0 for either the metadata or the data at BTRFS volume creation, you will end up with an unusable volume if one of the devices fails before being properly removed from the volume, as some stripes will be lost.

Once you add a new device to the BTRFS volume as a replacement for a removed one, you can clean up the references to the missing device:

# btrfs device delete missing /mnt

Using a BTRFS volume in degraded mode

Warning

It is not possible to use a volume in degraded mode if raid0 has been used for data/metadata and the device has not been properly removed with btrfs device delete (some stripes will be missing). The situation is even worse if RAID0 is used for the metadata: trying to mount a BTRFS volume in read/write mode while not all of the devices are accessible will simply kill the remaining metadata, hence making the BTRFS volume totally unusable... you have been warned! :-)

If you use raid1 or raid10 for data AND metadata, and you have a usable submirror accessible (consisting of 1 drive in the case of RAID1, or the two drives of the same RAID0 array in the case of RAID10), you can mount the array in degraded mode in case some devices are missing (e.g. dead SAN link or dead drive):

# mount -o degraded /dev/loop0 /mnt

If you used RAID0 for the metadata (and one of your drives is inaccessible), or RAID10 but with too few drives on-line to even make a degraded mode possible, btrfs will refuse to mount the volume:

# mount /dev/loop0 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

The situation is no better if you have used RAID1 for the metadata and RAID0 for the data: you can mount the volume in degraded mode, but you will encounter problems while accessing your files:

# cp /mnt/test.dat /tmp 
cp: reading `/mnt/test.dat': Input/output error
cp: failed to extend `/tmp/test.dat': Input/output error

Playing with subvolumes and snapshots

A story of boxes....

When you think about subvolumes in BTRFS, think about boxes. Each one of those can contain items and other smaller boxes ("sub-boxes"), which in turn can also contain items and boxes (sub-sub-boxes), and so on. Each box and item has a number and a name, except for the top level box, which has only a number (zero). Now imagine that all of the boxes are semi-opaque: you can see what they contain if you are outside the box, but you can't see outside when you are inside the box. Thus, depending on the box you are in, you can view either all of the items and sub-boxes (top level box) or only a part of them (any other box but the top level one). To give you a better idea of this somewhat abstract explanation, let's illustrate a bit:

(0) --+-> Item A (1)
      |
      +-> Item B (2)
      |
      +-> Sub-box 1 (3) --+-> Item C (4)
      |                   |
      |                   +-> Sub-sub-box 1.1 (5) --+-> Item D (6)
      |                   |                         | 
      |                   |                         +-> Item E (7)
      |                   |                         |
      |                   |                         +-> Sub-Sub-sub-box 1.1.1 (8) ---> Item F (9)
      |                   +-> Item F (10)
      |
      +-> Sub-box 2 (11) --> Item G (12)                    

What you see in the hierarchy depends on where you are (note that the top level box numbered 0 doesn't have a name, you will see why later). So:

  • If you are in the top level box (numbered 0) you see everything, i.e. things numbered 1 to 12
  • If you are in "Sub-sub-box 1.1" (numbered 5), you see only things 6 to 9
  • If you are in "Sub-box 2" (numbered 11), you only see what is numbered 12

Did you notice? We have two items named 'F' (respectively numbered 9 and 10). This is not a typographic error; this is just to illustrate the fact that every item lives its own peaceful existence in its own box. Although they have the same name, 9 and 10 are two distinct and unrelated objects (of course it is impossible to have two objects named 'F' in the same box, even though they would be numbered differently).

... applied to BTRFS! (or, "What is a volume/subvolume?")

BTRFS subvolumes work in the exact same manner, with some nuances:

  • First, imagine a frame that surrounds the whole hierarchy (represented in dots below). This is your BTRFS volume. A bit abstract at first glance, but BTRFS volumes have no tangible existence, they are just an aggregation of devices tagged as being clustered together (that fellowship is created when you invoke mkfs.btrfs or btrfs device add).
  • Second, the first level of hierarchy contains only a single box numbered zero which can never be destroyed (because everything it contains would also be destroyed).

If in our analogy of a nested box structure we used the word "box", in the real BTRFS world we use the word "subvolume" (box => subvolume). Like in our boxes analogy, all subvolumes hold a unique number greater than zero and a name, with the exception of the root subvolume located at the very first level of the hierarchy, which is always numbered zero and has no name (BTRFS tools destroy subvolumes by their name, not their number, so no name = no possible destruction. This is a totally intentional architectural choice, not a flaw).

Here is a typical hierarchy:

.....BTRFS Volume................................................................................................................................
.
.  Root subvolume (0) --+-> Subvolume SV1 (258) ---> Directory D1 --+-> File F1
.                       |                                           |
.                       |                                           +-> File F2
.                       |
.                       +-> Directory D1 --+-> File F1
.                       |                  |
.                       |                  +-> File F2
.                       |                  |
.                       |                  +-> File F3
.                       |                  |
.                       |                  +-> Directory D11 ---> File F4
.                       +-> File F1
.                       |
.                       +-> Subvolume SV2 (259) --+-> Subvolume SV21 (260)
.                                                 |
.                                                 +-> Subvolume SV22 (261) --+-> Directory D2 ---> File F4
.                                                                            |
.                                                                            +-> Directory D3 --+-> Subvolume SV221 (262) ---> File F5
.                                                                            |                  |
.                                                                            |                  +-> File F6
.                                                                            |                  |
.                                                                            |                  +-> File F7
.                                                                            |
.                                                                            +-> File F8
.
.....................................................................................................................................

Maybe you have a question: "Okay, what is the difference between a directory and a subvolume? Both can contain something!". To further confuse you, here is what users get if they reproduce the first level hierarchy on a real machine:

# ls -l
total 0
drwx------ 1 root root 0 May 23 12:48 SV1
drwxr-xr-x 1 root root 0 May 23 12:48 D1
-rw-r--r-- 1 root root 0 May 23 12:48 F1
drwx------ 1 root root 0 May 23 12:48 SV2

Although subvolumes SV1 and SV2 have been created with special BTRFS commands, they appear just as if they were ordinary directories! A subtle nuance exists, however: think again of the boxes analogy we made before and map the concepts in the following manner:

  • a subvolume : the semi-opaque box
  • a directory : a sort of item (that can contain something even another subvolume)
  • a file : another sort of item

So, in the internal filesystem metadata, SV1 and SV2 are stored in a different manner than D1 (although this is transparently handled for users). You can, however, see SV1 and SV2 for what they are (subvolumes) by running the following command (subvolume numbered (0) has been mounted on /mnt):

# btrfs subvolume list /mnt
ID 258 top level 5 path SV1
ID 259 top level 5 path SV2

What would we get if we create SV21 and SV22 inside of SV2? Let's try! Before going further you should be aware that a subvolume is created by invoking the magic command btrfs subvolume create:

# cd /mnt/SV2
# btrfs subvolume create SV21
Create subvolume './SV21'
# btrfs subvolume create SV22
Create subvolume './SV22'
# btrfs subvolume list /mnt  
ID 258 top level 5 path SV1
ID 259 top level 5 path SV2
ID 260 top level 5 path SV2/SV21
ID 261 top level 5 path SV2/SV22

Again, invoking ls in /mnt/SV2 will report the subvolumes as being directories:

# ls -l
total 0
drwx------ 1 root root 0 May 23 13:15 SV21
drwx------ 1 root root 0 May 23 13:15 SV22

Changing the point of view on the subvolumes hierarchy

At some point in our boxes analogy we talked about what we see and what we don't see depending on our location in the hierarchy. Here lies a big important point: whereas most BTRFS users mount the root subvolume (subvolume id = 0; we will retain the root subvolume terminology) in their VFS hierarchy, thus making visible the whole hierarchy contained in the BTRFS volume, it is absolutely possible to mount only a subset of it. How could that be possible? Simple: just specify the subvolume number when you invoke mount. For example, to mount the hierarchy in the VFS starting at subvolume SV22 (261), do the following:

# mount -o subvolid=261 /dev/loop0 /mnt

Here lies an important notion not disclosed in the previous paragraph: although both directories and subvolumes can act as containers, only subvolumes can be mounted in a VFS hierarchy. It is a fundamental aspect to remember: you cannot mount a sub-part of a subvolume in the VFS; you can only mount the subvolume itself. Considering the hierarchy schema in the previous section, if you want to access the directory D3 you have three possibilities:

  1. Mount the non-named subvolume (numbered 0) and access D3 through /mnt/SV2/SV22/D3 if the non-named subvolume is mounted in /mnt
  2. Mount the subvolume SV2 (numbered 259) and access D3 through /mnt/SV22/D3 if the subvolume SV2 is mounted in /mnt
  3. Mount the subvolume SV22 (numbered 261) and access D3 through /mnt/D3 if the subvolume SV22 is mounted in /mnt

This is accomplished by the following commands, respectively:

# mount -o subvolid=0 /dev/loop0 /mnt
# mount -o subvolid=259 /dev/loop0 /mnt
# mount -o subvolid=261 /dev/loop0 /mnt
Note

When a subvolume is mounted in the VFS, everything located "above" the subvolume is hidden. Concretely, if you mount the subvolume numbered 261 in /mnt, you only see what is under SV22, you won't see what is located above SV22 like SV21, SV2, D1, SV1, etc.
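Mounting by number means first finding the number. The lookup can be scripted by parsing btrfs subvolume list output; a sketch (LIST is canned output copied from the earlier example — in real use you would populate it with LIST=$(btrfs subvolume list /mnt)):

```shell
#!/bin/sh
# Sketch: resolve a subvolume path to its id so it can be passed to
# mount -o subvolid=. LIST is canned example output; in real use,
# populate it with: LIST=$(btrfs subvolume list /mnt)
LIST='ID 258 top level 5 path SV1
ID 259 top level 5 path SV2
ID 260 top level 5 path SV2/SV21
ID 261 top level 5 path SV2/SV22'

subvol_id() {
    printf '%s\n' "$LIST" | awk -v p="$1" '$NF == p { print $2 }'
}

echo "mount -o subvolid=$(subvol_id SV2/SV22) /dev/loop0 /mnt"
```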

The default subvolume

$100 questions: 1. "If I don't put 'subvolid' in the command line, how does the kernel know which one of the subvolumes it has to mount?" 2. "Does omitting 'subvolid' automatically mean 'mount subvolume numbered 0'?". Answers: 1. BTRFS magic! ;-) 2. No, not necessarily; you can choose something other than the non-named subvolume.

When you create a brand new BTRFS filesystem, the system not only creates the initial root subvolume (numbered 0) but also tags it as being the default subvolume. When you ask the operating system to mount a subvolume contained in a BTRFS volume without specifying a subvolume number, it determines which of the existing subvolumes has been tagged as "default subvolume" and mounts it. If none of the existing subvolumes has the tag "default subvolume" (e.g. because the default subvolume has been deleted), the mount command gives up with a rather cryptic message:

# mount /dev/loop0 /mnt
mount: No such file or directory

It is also possible to change at any time which subvolume contained in a BTRFS volume is considered the default volume. This is accomplished with btrfs subvolume set-default. The following tags the subvolume 261 as being the default:

# btrfs subvolume set-default 261 /mnt

After that operation, doing the following is exactly the same:

# mount /dev/loop0 /mnt
# mount -o subvolid=261 /dev/loop0 /mnt
Note

The chosen new default subvolume must be visible in the VFS when you invoke btrfs subvolume set-default.

Deleting subvolumes

Question: "As subvolumes appear like directories, can I delete a subvolume by doing an rm -rf on it?". Answer: yes, you can, but that way is not the most elegant, especially when it contains several gigabytes of data scattered over thousands of files, directories and maybe other subvolumes located in the one you want to remove. It isn't elegant because rm -rf could take several minutes (or even hours!) to complete, whereas something else can do the same job in a fraction of a second.

"Huh?" Yes, perfectly possible, and here is the cool goodie for the readers who arrived at this point: when you want to remove a subvolume, use btrfs subvolume delete instead of rm -rf. That btrfs command will remove the subvolume in a fraction of a second, even if it contains several gigabytes of data!

Warning
  • You can never remove the root subvolume of a BTRFS volume, as btrfs subvolume delete expects a subvolume name (again: this is not a flaw in the design of BTRFS; removing the subvolume numbered 0 would destroy the entirety of a BTRFS volume... too dangerous).
  • If the subvolume you delete was tagged as the default subvolume, you will have to designate another default subvolume or explicitly tell the system which one of the subvolumes has to be mounted.

An example: considering our initial example given above and supposing you have mounted the non-named subvolume numbered 0 in /mnt, you can remove SV22 by doing:

# btrfs subvolume delete /mnt/SV2/SV22

Obviously the BTRFS volume will look like this after the operation:

.....BTRFS Volume................................................................................................................................
.
.  (0) --+-> Subvolume SV1 (258) ---> Directory D1 --+-> File F1
.        |                                           |
.        |                                           +-> File F2
.        |
.        +-> Directory D1 --+-> File F1
.        |                  |
.        |                  +-> File F2
.        |                  |
.        |                  +-> File F3
.        |                  |
.        |                  +-> Directory D11 ---> File F4
.        +-> File F1
.        |
.        +-> Subvolume SV2 (259) --+-> Subvolume SV21 (260)
.....................................................................................................................................

Snapshot and subvolumes

If you have a good comprehension of what a subvolume is, understanding what a snapshot is won't be a problem: a snapshot is a subvolume with some initial contents, namely an exact copy of the subvolume it was taken from.

When you think about snapshots, think copy-on-write: data blocks are not duplicated between a mounted subvolume and its snapshot until you start making changes to the files (so a fresh snapshot occupies nearly zero extra space on disk). As time goes on, more and more data blocks change, making snapshots "occupy" more and more space on the disk. It is therefore recommended to keep only a minimal set of them and to remove unnecessary ones, to avoid wasting space on the volume.


The following illustrates how to take a snapshot of the VFS root:

# btrfs subvolume snapshot / /snap-2011-05-23
Create a snapshot of '/' in '//snap-2011-05-23'

Once created, the snapshot will persist in /snap-2011-05-23 for as long as you don't delete it. Note that its contents will remain exactly as they were at the time the snapshot was taken (as long as you don't make changes... BTRFS snapshots are writable!). A drawback of keeping snapshots: if you delete files in the original filesystem, the snapshot still holds them, so the corresponding disk blocks cannot be reclaimed as free space. Remember to remove unwanted snapshots and keep only a bare minimal set of them.

Listing and deleting snapshots

As there is no distinction between a snapshot and a subvolume, snapshots are managed with the exact same commands, in particular when the time comes to delete some of them. An interesting feature of BTRFS is that snapshots are writable: you can take a snapshot and make changes to the files and directories it contains. A word of caution: there is no undo capability! What has been changed has been changed forever... If you need to run several tests, take several snapshots or, better yet, snapshot your snapshot and then do whatever you need in that copy of the copy :-).

Using snapshots for system recovery (aka Back to the Future)

Here is where BTRFS can literally be a lifeboat. Suppose you want to apply some updates via emerge -uaDN @world but want to be sure you can jump back into the past in case something goes seriously wrong after the system update (does libpng14 ring a bell?!). Here is the "putting-things-together" part of the article!

The following only applies if your VFS root and the system directories /sbin, /bin, /usr, /etc... are located on a BTRFS volume. To keep things simple, the whole structure is assumed to reside in the SAME subvolume of the same BTRFS volume.

To jump back into the past you have at least two options:

  1. Fiddle with the default subvolume numbers
  2. Use the kernel command line parameters in the bootloader configuration files

Whichever option you choose, you must take a snapshot of your VFS root *before* updating the system:

# btrfs subvolume snapshot / /before-updating-2011-05-24
Create a snapshot of '/' in '//before-updating-2011-05-24'
Note

Hint: You can create an empty file at the root of your snapshot, with a name of your choice, to help you easily identify which subvolume is currently mounted (e.g. if the snapshot has been named before-updating-2011-05-24, you can use a slightly different name like current-is-before-updating-2011-05-24 => touch /before-updating-2011-05-24/current-is-before-updating-2011-05-24). This is extremely useful if you are dealing with several snapshots.

Neither way is "better"; it is just a question of personal preference.
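Whichever way you choose, the snapshot-plus-marker-file step can be wrapped in a small helper. A sketch, assuming the dated name prefix used in this article (the naming is an example convention, not anything BTRFS requires):

```shell
# Pure helper: compute the marker file name for a given snapshot name.
marker_name() {
    echo "current-is-$1"
}

# Snapshot the VFS root under a dated name and drop the marker file
# inside it, so the mounted copy is identifiable later (requires root).
snapshot_root() {
    local name="before-updating-$(date +%F)"
    btrfs subvolume snapshot / "/$name" &&
    touch "/$name/$(marker_name "$name")"
}
```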

Way #1: Fiddle with the default subvolume number

Hypothesis:

  • Your "production" VFS root partition resides in the root subvolume (subvolid=0),
  • Your /boot partition (where the bootloader configuration files are stored) is on another standalone partition

First search for the newly created subvolume number:

# btrfs subvolume list /
ID 256 top level 5 path before-updating-2011-05-24

'256' is the ID to retain (of course, this ID will differ in your case).

Now change the default subvolume of the BTRFS volume to designate the snapshot before-updating-2011-05-24 rather than the root subvolume, then reboot:

# btrfs subvolume set-default 256 /

Once the system has rebooted, and if you followed the advice in the previous paragraph about creating an empty marker file named after the snapshot, you should be able to check that the mounted VFS root is the copy held by the snapshot before-updating-2011-05-24:

# ls -l /
...
-rw-rw-rw-   1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...

The correct subvolume has been used to mount the VFS! Excellent! Now is the time to mount your "production" VFS root (remember that the root subvolume can only be accessed via its identification number, i.e. 0):

# mount -o subvolid=0 /dev/sda2 /mnt
# mount
...
/dev/sda2 on /mnt type btrfs (rw,subvolid=0)

Oh, by the way: since the root subvolume is now mounted on /mnt, let's try something, just for the sake of demonstration:

# ls /mnt
...
drwxr-xr-x   1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
# btrfs subvolume list /mnt
ID 256 top level 5 path before-updating-2011-05-24

No doubt possible :-) Time to roll back! For this, rsync will be used in the following way:

# rsync --progress -aHAX --exclude=/proc --exclude=/dev --exclude=/sys --exclude=/mnt / /mnt

Basically we are asking rsync to:

  • preserve timestamps, hard and symbolic links, owner/group IDs, ACLs and any extended attributes (refer to the rsync manual page for further details on the options used), and report its progress
  • ignore the mount points where virtual filesystems are mounted (procfs, sysfs...)
  • avoid recursing into /mnt itself (you can speed up the process by excluding extra directories if you are sure they hold no important changes, or no changes at all, such as /var/tmp/portage)
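Before running the real copy, it can be reassuring to preview what rsync would change. A sketch using the same options plus --dry-run:

```shell
# Preview the rollback: --dry-run (-n) lists what rsync would
# transfer without actually touching anything under /mnt.
preview_rollback() {
    rsync --dry-run --progress -aHAX \
          --exclude=/proc --exclude=/dev --exclude=/sys --exclude=/mnt \
          / /mnt
}
# usage (as root): preview_rollback | less
```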

Be patient! The rsync may take several minutes or hours depending on the amount of data to process...

Once it has finished, you will have to set the default subvolume back to the root subvolume:

# btrfs subvolume set-default 0 /mnt
ID 256 top level 5 path before-updating-2011-05-24
Warning

DO NOT ENTER / instead of /mnt in the above command; it won't work, and you will land in the snapshot before-updating-2011-05-24 again the next time the machine reboots.

The reason is that the subvolume number must be "visible" from the path given at the end of the btrfs subvolume set-default command line. Recall the boxes analogy: here we are inside the sub-box numbered 256, which is located *inside* the box numbered 0, so from / we can neither see nor address box 0; from /mnt, where box 0 is mounted, we can.

Now just reboot and you should be in business again! Once you have rebooted just check if you are really under the right subvolume:

# ls / 
...
drwxr-xr-x   1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
# btrfs subvolume list /
ID 256 top level 5 path before-updating-2011-05-24

At the right place? Excellent! You can now delete the snapshot if you wish, or better, keep it as a lifeboat representing the "last known good system state."

Way #2: Change the kernel command line in the bootloader configuration files

First search for the newly created subvolume number:

# btrfs subvolume list /
ID 256 top level 5 path before-updating-2011-05-24

'256' is the ID to retain (it may differ in your case).

Now, with your favourite text editor, edit the appropriate kernel command line in your bootloader configuration (/etc/boot.conf). This file is typically organized in several sections (one per kernel present on the system, plus some global settings), like the excerpt below:

set timeout=5
set default=0

# Production kernel
menuentry "Funtoo Linux production kernel (2.6.39-gentoo x86/64)" {
   insmod part_msdos
   insmod ext2
   ...
   set root=(hd0,1)
   linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2 
   initrd /initramfs-x86_64-2.6.39-gentoo
}
...

Find the correct kernel line and add one of the following statements after root=/dev/sdX:

rootflags=subvol=before-updating-2011-05-24
   - Or -
rootflags=subvolid=256
Warning

If the kernel you want to use has been generated with Genkernel, you MUST use real_rootflags=subvol=... instead of rootflags=subvol=..., otherwise your rootflags will not be taken into account by the kernel on reboot.


Applied to the previous example, you get the following if you refer to the subvolume by name:

set timeout=5
set default=0

# Production kernel
menuentry "Funtoo Linux production kernel (2.6.39-gentoo x86/64)" {
   insmod part_msdos
   insmod ext2
   ...
   set root=(hd0,1)
   linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2 rootflags=subvol=before-updating-2011-05-24
   initrd /initramfs-x86_64-2.6.39-gentoo
}
...

Or the following if you refer to the subvolume by its identification number:

set timeout=5
set default=0

# Production kernel
menuentry "Funtoo Linux production kernel (2.6.39-gentoo x86/64)" {
   insmod part_msdos
   insmod ext2
   ...
   set root=(hd0,1)
   linux /kernel-x86_64-2.6.39-gentoo root=/dev/sda2 rootflags=subvolid=256
   initrd /initramfs-x86_64-2.6.39-gentoo
}
...

Once the modifications are done, save your changes, take any extra steps needed to commit the configuration to the first sectors of the disk (this mostly applies to LILO users; GRUB and SILO do not need to be refreshed), and reboot.

Once the system has rebooted, and if you followed the advice in the previous paragraph about creating an empty marker file named after the snapshot, you should be able to check that the mounted VFS root is the copy held by the snapshot before-updating-2011-05-24:

# ls -l /
...
-rw-rw-rw-   1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...

The correct subvolume has been used to mount the VFS! Excellent! Now is the time to mount your "production" VFS root (remember that the root subvolume can only be accessed via its identification number 0):

# mount -o subvolid=0 /dev/sda2 /mnt
# mount
...
/dev/sda2 on /mnt type btrfs (rw,subvolid=0)

Time to roll back! For this, rsync will be used in the following way:

# rsync --progress -aHAX --exclude=/proc --exclude=/dev --exclude=/sys --exclude=/mnt / /mnt

Here, please refer to what was said in Way #1 about the rsync options used. Once everything is in place again, edit your bootloader configuration to remove the rootflags/real_rootflags kernel parameter, reboot, and check that you are really in the right subvolume:

# ls / 
...
drwxr-xr-x   1 root root    0 May 24 20:33 current-is-before-updating-2011-05-24
...
# btrfs subvolume list /
ID 256 top level 5 path before-updating-2011-05-24

At the right place? Excellent! You can now delete the snapshot if you wish, or better, keep it as a lifeboat representing the "last known good system state."

Some BTRFS practices / field notes / gotchas

  • Although BTRFS is still evolving, at the time of writing it is (still) an experimental filesystem and should not be used on production systems or for storing critical data (even if the data is non-critical, keeping backups on a partition formatted with a "stable" filesystem like ReiserFS or ext3/4 is recommended).
  • From time to time changes are made to the metadata (the BTRFS on-disk format is not final at the time of writing), and a BTRFS partition may become unusable with older Linux kernels (this happened with Linux 2.6.31).
  • More and more Linux distributions are offering BTRFS as an alternative to ext4.
  • Some reported gotchas: https://btrfs.wiki.kernel.org/index.php/Gotchas
  • Playing around with BTRFS can be a bit tricky, especially when dealing with default subvolumes and mount points (again: the box analogy).
  • Using compression (e.g. LZO => mount -o compress=lzo) can improve throughput performance; however, many files nowadays are already compressed at the application level (music, pictures, videos...).
  • Using the space caching capability (mount -o space_cache) seems to bring a further slight performance improvement.
  • There is a very interesting discussion on LKML about BTRFS design limitations with B-trees; we strongly encourage you to read it.
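As a concrete illustration of the compress=lzo and space_cache options mentioned above, a minimal sketch (device and mount point are examples):

```shell
# Mount a BTRFS volume with LZO compression and the free-space cache
# enabled. The equivalent /etc/fstab line would be:
#   /dev/sda2  /data  btrfs  compress=lzo,space_cache  0 0
mount_btrfs_tuned() {
    mount -t btrfs -o compress=lzo,space_cache "$1" "$2"
}
# usage (as root): mount_btrfs_tuned /dev/sda2 /data
```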

Deploying a Funtoo instance in a subvolume other than the root subvolume

Some Funtoo core devs have used BTRFS for many months and no major glitches have been reported so far. The exceptions: an unaligned memory access trap on SPARC64 in a checksum calculation routine (recent kernels may have brought a fix), and, a long time ago, a corruption caused by a kernel crash mangling some internal data, rather than by the filesystem code itself.

The following can simplify your life in case of recovery (not tested):

When you prepare the disk space that will hold the root of your future Funtoo instance (and so will hold /usr, /bin, /sbin, /etc...), don't use the root subvolume; take the extra step of creating a dedicated subvolume, as illustrated below:

# fdisk /dev/sda2
....
# mkfs.btrfs /dev/sda2
# mount /dev/sda2 /mnt/funtoo
# btrfs subvolume create /mnt/funtoo/live-vfs-root-20110523
# chroot /mnt/funtoo/live-vfs-root-20110523 /bin/bash

Then either:

  • Set the subvolume live-vfs-root-20110523 as the default subvolume (btrfs subvolume set-default...; remember to look up its identification number first)
  • Use rootflags / real_rootflags (always use real_rootflags for a kernel generated with Genkernel) on the kernel command line in your bootloader configuration file

Technically speaking, this won't change your day-to-day life, BUT it pays off at system recovery time: when you want to roll back to a functional VFS root copy because something happened (buggy system package, an overly aggressive cleanup that removed Python, a broken compiler toolchain...), you can avoid a costly rsync, at the price of a little extra overhead each time you take a snapshot.

Here again you have two ways to recover the system:

  • fiddling with the default subvolume:
    • Mount the unnamed root subvolume somewhere (e.g. mount -o subvolid=0 /dev/sdX /mnt)
    • Take a snapshot (remember to check its identification number) of your current subvolume and store it under the root subvolume you have just mounted (btrfs subvolume snapshot / /mnt/before-updating-20110524)
    • Update your system, or perform whatever other "dangerous" operation
    • If you need to return to the last known good system state, just set the default subvolume to the snapshot you took (btrfs subvolume set-default <snapshot number here> /mnt)
    • Reboot
    • Once you have rebooted, mount the root subvolume again and delete the subvolume corresponding to the failed system update (btrfs subvolume delete /mnt/<buggy VFS root snapshot name here>)
  • fiddling with the kernel command line:
    • Mount the unnamed root subvolume somewhere (e.g. mount -o subvolid=0 /dev/sdX /mnt)
    • Take a snapshot (remember to check its identification number) of your current subvolume and store it under the root subvolume you have just mounted (btrfs subvolume snapshot / /mnt/before-updating-20110524)
    • Update your system, or perform whatever other "dangerous" operation
    • If you need to return to the last known good system state, set the rootflags/real_rootflags kernel parameter in your bootloader configuration file, as demonstrated in the previous paragraphs
    • Reboot
    • Once you have rebooted, mount the root subvolume again and delete the subvolume corresponding to the failed system update (btrfs subvolume delete /mnt/<buggy VFS root snapshot name here>)
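The "default subvolume" variant of the steps above can be sketched as two small helpers (device, mount point and name prefix are examples, not fixed conventions):

```shell
# Step 1, before the risky operation: mount the root subvolume and
# snapshot the currently running VFS root under it (requires root).
take_pre_update_snapshot() {
    mount -o subvolid=0 /dev/sda2 /mnt &&
    btrfs subvolume snapshot / "/mnt/before-updating-$(date +%F)"
}

# Step 2, only if the update went wrong: point the default subvolume
# at the snapshot ID shown by 'btrfs subvolume list /mnt', then reboot.
rollback_to_snapshot() {
    btrfs subvolume set-default "$1" /mnt &&
    echo "Default subvolume set to $1; reboot to activate it."
}
```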

Space recovery / defragmenting the filesystem

Tip

From time to time it is advisable to re-optimize the filesystem structures and data blocks in a subvolume. In BTRFS terminology this is called a defragmentation, and it can only be performed while the subvolume is mounted in the VFS (online defragmentation):

# btrfs filesystem defrag /mnt

You can still access the subvolume, even change its contents, while a defragmentation is running.

It is also a good idea to remove the snapshots you no longer use, especially when huge files and/or lots of files change, because old snapshots pin blocks that could otherwise be reused.
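Removing stale snapshots can be scripted. A sketch under a hypothetical layout (snapshots carrying a sortable date suffix, all living directly under one directory) that keeps only the newest N:

```shell
# Pure helper: read snapshot paths on stdin (one per line, names
# sorting chronologically) and print all but the newest $1 of them.
snapshots_to_prune() {
    sort | head -n -"$1"
}

# Delete every snapshot under directory $1 matching prefix $2, except
# the newest $3 (requires root; GNU head for the negative -n count).
prune_snapshots() {
    ls -d "$1/$2"* 2>/dev/null | snapshots_to_prune "$3" |
    while read -r snap; do
        btrfs subvolume delete "$snap"
    done
}
# usage (as root): prune_snapshots /mnt before-updating- 3
```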

SSE 4.2 boost

If your CPU supports hardware CRC32C calculation (e.g. Intel Nehalem and later, AMD Bulldozer and later), you are encouraged to enable that support in your kernel, since BTRFS makes aggressive use of CRC32C checksums. Just check that you have enabled "CRC32c INTEL hardware acceleration" under "Cryptographic API", either as a module or as a built-in feature.
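Both ends can be checked from a running system. A sketch using the standard Linux procfs files:

```shell
# Does the CPU advertise SSE 4.2? (the flag appears in /proc/cpuinfo)
cpu_has_sse42() {
    grep -q '\bsse4_2\b' /proc/cpuinfo
}

# Which crc32c implementations has the kernel registered? The
# accelerated one typically reports a driver such as crc32c-intel.
kernel_crc32c_drivers() {
    grep -B1 -A2 '^name.*crc32c' /proc/crypto
}
```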

Recovering an apparently dead BTRFS filesystem

Keeping filesystem metadata coherent is critical in filesystem design. Losing some data blocks (i.e. ending up with some corrupted files) is less severe than ending up with a screwed-up, unmountable filesystem, especially if you do backups on a regular basis (the rule with BTRFS is: *do backups*. BTRFS has no mature filesystem repair tool yet, and sooner or later you *will* end up having to re-create your filesystem from scratch).

Mounting with recovery option (Linux 3.2 and beyond)

If you are using Linux 3.2 or later (only!), you can use the recovery mount option to make BTRFS look for a usable copy of the tree root (several copies of it exist on the disk). Just mount your filesystem with:

# mount -o recovery /dev/yourBTRFSvolume /mount/point

btrfs-select-super / btrfs-zero-log

Two other handy tools exist, but they are not deployed by default by the sys-fs/btrfs-progs ebuilds (not even btrfs-progs-9999), because for a long time they only lived in the "next" branch of the btrfs-progs Git repository:

  • btrfs-select-super
  • btrfs-zero-log

Building the btrfs-progs goodies

The two tools this section covers are not built by default, and Funtoo ebuilds do not build them either for the moment, so you must build them manually:

# mkdir ~/src
# cd ~/src
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git 
# cd btrfs-progs
# make && make btrfs-select-super && make btrfs-zero-log
Note

In the past, btrfs-select-super and btrfs-zero-log lived in the git "next" branch; this is no longer the case, and those tools are now available in the master branch.

Fixing dead superblock

In case of a corrupted superblock, start by asking btrfsck to use an alternate copy of the superblock instead of superblock #0. This is achieved via the -s option followed by the number of the alternate copy you wish to use. In the following example we ask it to use superblock copy #2 of /dev/sda7:

# ./btrfsck -s 2 /dev/sda7

When btrfsck is happy, use btrfs-select-super to overwrite the default superblock (copy #0) with a clean copy. In the following example we restore the superblock of /dev/sda7 from its copy #2:

# ./btrfs-select-super -s 2 /dev/sda7

Note that this will overwrite all the other superblocks on the disk, which means you really only get one shot at it: if you run btrfs-select-super before figuring out which copy is good, you have lost your chance to find a good one.

Clearing the BTRFS journal

This will only help with one specific problem!

If you are unable to mount a BTRFS partition after a hard shutdown, crash or power loss, it may be due to faulty log replay in kernels prior to 3.2. The first thing to try is updating your kernel and mounting again. If that is not possible, an alternative is to truncate the BTRFS journal, but only if you see "replay_one_*" functions in the oops call stack.

To truncate the journal of a BTRFS partition (and thereby lose any changes that exist only in the log!), just pass the filesystem to btrfs-zero-log:

# ./btrfs-zero-log /dev/sda7

This is not a generic technique, and works by permanently throwing away a small amount of potentially good data.

Using btrfsck

Warning

Extremely experimental...

If one thing is famous in the BTRFS world, it is the long-wished-for fully functional btrfsck. A read-only version of the tool existed for years; at the beginning of 2012, however, the BTRFS developers made a public and very experimental release: the secret jewel lies in the branch dangerdonteveruse of the BTRFS Git repository held by Chris Mason on kernel.org.

# git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
# cd btrfs-progs
# git checkout dangerdonteveruse
# make

So far the tool can:

  • Fix errors in the extent tree and in block group accounting
  • Wipe the CRC tree and create a brand new one (you can then mount the filesystem with CRC checking disabled)

To repair:

# btrfsck --repair /dev/yourBTRFSvolume

To wipe the CRC tree:

# btrfsck --init-csum-tree /dev/yourBTRFSvolume

Two other options exist in the source code: --super (equivalent of btrfs-select-super ?) and --init-extent-tree (clears out any extent?)

Final words

We have only covered the broad lines here; BTRFS can be very tricky, especially when several subvolumes from several BTRFS volumes are used. And remember: BTRFS is still experimental at the time of writing :)

Lessons learned

  • Very interesting, but still lacks some important features present in ZFS such as RAID-Z, virtual volumes, management by attributes, filesystem streaming, etc.
  • Extremely interesting for Gentoo/Funtoo system partitions (snapshot/rollback capabilities); however, not integrated with Portage yet.
  • If possible, use a file integrity tool like Tripwire; it is handy for seeing which files were corrupted once the filesystem has been recovered, or after a bug strikes.
  • It is highly advisable not to use the root subvolume when deploying a new Funtoo instance or, more generally, not to put any data directly in it. Rolling back to a data snapshot will be much easier and much less error prone (no copy process, just a matter of "swapping" the subvolumes).
  • Backup, backup, backup your data! ;)
