Difference between pages "Traffic Control" and "Linux Containers"

From Funtoo
== Introduction ==
Linux Containers, or LXC, is a Linux feature that allows Linux to run one or more isolated virtual systems (with their own network interfaces, process namespace, user namespace, and power state) using a single Linux kernel on a single server.


Linux's traffic control functionality offers many capabilities for influencing the rate of flow, as well as the latency, of primarily outgoing but in some cases also incoming network traffic. It is designed to be a "construction kit" rather than a turn-key system, where complex network traffic policing and shaping decisions can be made using a variety of algorithms. The Linux traffic control code is also often used in academia for research purposes, where it can be a useful mechanism to simulate and explore the impact of a variety of network behaviors. See [http://www.linuxfoundation.org/collaborate/workgroups/networking/netem netem] for an example of a simulation framework that can be used for this purpose.
== Status ==


Of course, Linux traffic control can also be extremely useful in an IT context, and this document is intended to focus on the practical, useful applications of Linux traffic control, where these capabilities can be applied to solve problems that are often experienced on modern networks.
As of Linux kernel 3.1.5, LXC is usable for isolating your own private workloads from one another. It is not yet ready to isolate potentially malicious users from one another or the host system. For a more mature containers solution that is appropriate for hosting environments, see [[OpenVZ]].


== Incoming and Outgoing Traffic ==
LXC containers don't yet have their own system uptime, and they see everything that's in the host's <tt>dmesg</tt> output, among other things. But in general, the technology works.


One common use of Linux traffic control is to configure a Linux system as a router or bridge, so that it sits between two networks, or between the "inside" of the network and the real router, and can shape traffic going to local machines as well as out to the Internet. This provides a way to prioritize, shape and police both incoming (from the Internet) and outgoing (from local machines) network traffic. It is easiest to create traffic control rules for traffic flowing ''out'' of an interface, since we can control when the system ''sends'' data, whereas controlling when we ''receive'' data requires an additional ''intermediate queue'' to be created to buffer incoming data. When a Linux system is configured as a firewall or router with a physical interface for each part of the network, traffic in either direction leaves through one of the interfaces, so we can avoid using intermediate queues.
== Basic Info ==


A simple way to set up a layer 2 bridge using Linux involves creating a bridge device with <tt>brctl</tt>, adding two Ethernet ports to this bridge (again using <tt>brctl</tt>), and then applying prioritization, shaping and policing rules to both interfaces. The rules will apply to ''outgoing'' traffic on each interface. One physical interface will be connected to an upstream router on the same network, while the other network port will be connected to a layer 2 access switch to which local machines are connected. This allows powerful egress shaping policies to be created on both interfaces, to control the flows in and out of the network.


== Recommended Resources ==
* Linux Containers are based on:
** Kernel namespaces for resource isolation
** CGroups for resource limitation and accounting


Resources you should take a look at, in order:
{{Package|app-emulation/lxc}} is the userspace tool for Linux containers


* [http://luxik.cdi.cz/~devik/qos/htb/manual/userg.htm HTB documentation] by Martin Devera. Best way to create different priority classes and bandwidth allocations.
== Control groups ==
* [http://www.opalsoft.net/qos/DS.htm Differentiated Services On Linux HOWTO] by Leonardo Balliache. Good general docs.
* [http://blog.edseek.com/~jasonb/articles/traffic_shaping/index.html A Practical Guide to Linux Traffic Control] by Jason Boxman. Good general docs.
* [http://www.linuxfoundation.org/collaborate/workgroups/networking/ifb IFB - replacement for Linux IMQ], with examples. This is the official best way to do ''inbound'' traffic control, when you don't have dedicated in/out interfaces.
* [http://seclists.org/fulldisclosure/2006/Feb/702 Use of iptables hashlimit] - Great functionality in iptables. There's a hashlimit example below as well.


Related Interesting Links:
* Control groups (cgroups) in kernel since 2.6.24
** Allows aggregation of tasks and their children
** Subsystems (cpuset, memory, blkio,...)
** accounting - to measure how many resources certain systems use
** resource limiting - groups can be set to not exceed a set memory limit
** prioritization - some groups may get a larger share of CPU
** control - freezing/unfreezing of cgroups, checkpointing and restarting
** No disk quota limitation ( -> image file, LVM, XFS, directory tree quota,...)


* [http://wiki.secondlife.com/wiki/BLT Second Life Bandwidth Testing Protocol] - example of Netem
== Subsystems ==
* [http://www.29west.com/docs/THPM/udp-buffer-sizing.html UDP Buffer Sizing], part of [http://www.29west.com/docs/THPM/index.html Topics in High Performance Messaging]
<br>
<console>
###i## cat /proc/cgroups
subsys_name hierarchy num_cgroups enabled
cpuset
cpu
cpuacct
memory
devices
freezer
blkio
perf_event
hugetlb
</console>


== Recommended Approaches ==
#cpuset    -> limits tasks to specific CPU/CPUs
#cpu        -> CPU shares
#cpuacct    -> CPU accounting
#memory    -> memory and swap limitation and accounting
#devices    -> device allow deny list
#freezer    -> suspend/resume tasks
#blkio      -> I/O prioritization (weight, throttle, ...)
#perf_event -> support for per-cpu per-cgroup monitoring [http://lwn.net/Articles/421574/ perf_events]
#hugetlb    -> cgroup resource controller for HugeTLB pages  [http://lwn.net/Articles/499255/ hugetlb]
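
To see which of the subsystems listed above are actually enabled on a running kernel, the <tt>enabled</tt> column of <tt>/proc/cgroups</tt> can be filtered. The following is only a sketch: the function name <tt>enabled_cgroups</tt> is made up for this example, and it assumes the four-column layout shown in the header above (<tt>subsys_name hierarchy num_cgroups enabled</tt>). The optional file argument exists only so that the function can be tried against a saved copy of the file.

```shell
# Print the name of every cgroup subsystem whose "enabled" column is 1.
# Reads /proc/cgroups unless another file is given as the first argument
# (the override is an assumption made for testing, not an LXC feature).
enabled_cgroups() {
    # Skip the header line, which starts with "#", and keep rows whose
    # fourth column (enabled) equals 1.
    awk '!/^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}
```

Running <tt>enabled_cgroups</tt> without arguments on the host prints one subsystem name per line.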


Daniel Robbins has had very good results with the [http://luxik.cdi.cz/~devik/qos/htb/ HTB queuing discipline] - it has very good features, and also has [http://luxik.cdi.cz/~devik/qos/htb/manual/userg.htm very good documentation], which is just as important, and is designed to deliver useful results in a production environment. And it works. If you use traffic control under Funtoo Linux, please use the HTB queuing discipline as the root queuing discipline because you will get good results in very little time. Avoid using any other queuing discipline under Funtoo Linux as the ''root'' queuing discipline on any interface. If you are creating a tree of classes and qdiscs, HTB should be at the top, and you should avoid hanging classes under any other qdisc unless you have plenty of time to experiment and verify that your QoS rules are working as expected. Please see [[#State_of_the_Code|State of the Code]] for more info on what Daniel Robbins considers to be the current state of the traffic control implementation in Linux.
== Configuring the Funtoo Host System ==


== State of the Code ==
=== Install LXC kernel ===
Any kernel beyond 3.1.5 will probably work. Personally I prefer {{Package|sys-kernel/gentoo-sources}}, as these have support for all the namespaces without sacrificing XFS, FUSE or NFS support, for example. The dependency checks described below were introduced starting with kernel 3.5; this could also mean that the user namespace is not working optimally.


If you are using enterprise kernels, especially any RHEL5-based kernels, you must be aware that the traffic control code in these kernels is about 5 years old and contains many significant bugs. In general, it is possible to avoid these bugs by using HTB as your root queueing discipline and testing things carefully to ensure that you are getting the proper behavior. The <tt>prio</tt> queueing discipline is known to not work reliably in RHEL5 kernels. See [[Broken Traffic Control]] for more information on known bugs with older kernels.

If you are using a more modern kernel, Linux traffic control should be fairly robust. The examples below should work with RHEL5 as well as newer kernels.

* User namespace (EXPERIMENTAL) depends on EXPERIMENTAL and on UIDGID_CONVERTED
** config UIDGID_CONVERTED
*** True if all of the selected software components are known to have uid_t and gid_t converted to kuid_t and kgid_t where appropriate and are otherwise safe to use with the user namespace.
**** Networking - depends on NET_9P = n
**** Filesystems - 9P_FS = n, AFS_FS = n, AUTOFS4_FS = n, CEPH_FS = n, CIFS = n, CODA_FS = n, FUSE_FS = n, GFS2_FS = n, NCP_FS = n, NFSD = n, NFS_FS = n, OCFS2_FS = n, XFS_FS = n
**** Security options - Grsecurity - GRKERNSEC = n (if applicable)
** As of the 3.10.xx kernels, all of the above options are safe to use with user namespaces except for XFS_FS. With kernel >=3.10.xx you should therefore answer XFS_FS = n if you want user namespace support.
** In your kernel source directory, check <tt>init/Kconfig</tt> to find out what UIDGID_CONVERTED depends on.


== Inspect Your Rules ==
==== Kernel configuration ====
These options should be enabled in your kernel to take full advantage of LXC.


If you are implementing Linux traffic control, you should be running these commands frequently to monitor the behavior of your queuing discipline. Replace <tt>$wanif</tt> with the actual network interface name.
* General setup
** CONFIG_NAMESPACES
*** CONFIG_UTS_NS
*** CONFIG_IPC_NS
*** CONFIG_PID_NS
*** CONFIG_NET_NS
*** CONFIG_USER_NS
** CONFIG_CGROUPS
*** CONFIG_CGROUP_DEVICE
*** CONFIG_CGROUP_SCHED
*** CONFIG_CGROUP_CPUACCT
*** CONFIG_CGROUP_MEM_RES_CTLR (in 3.6+ kernels it's called CONFIG_MEMCG)
*** CONFIG_CGROUP_MEM_RES_CTLR_SWAP (in 3.6+ kernels it's called CONFIG_MEMCG_SWAP)
*** CONFIG_CPUSETS (on multiprocessor hosts)
* Networking support
** Networking options
*** CONFIG_VLAN_8021Q
* Device Drivers
** Character devices
*** Unix98 PTY support
**** CONFIG_DEVPTS_MULTIPLE_INSTANCES
** Network device support
*** Network core driver support
**** CONFIG_VETH
**** CONFIG_MACVLAN


Once you have lxc installed, you can then check your kernel config with:
<console>
# ##i##CONFIG=/path/to/config /usr/sbin/lxc-checkconfig
</console>

<source lang="bash">
# show per-qdisc and per-class statistics for the interface
tc -s qdisc ls dev $wanif
tc -s class ls dev $wanif
</source>


== Matching ==

Here are some examples you can use as the basis for your own filters/classifiers:

# <tt>protocol arp u32 match u32 0 0</tt> - match ARP packets
# <tt>protocol ip u32 match ip protocol 0x11 0xff</tt> - match UDP packets
# <tt>protocol ip u32 match ip protocol 17 0xff</tt> - (also) match UDP packets
# <tt>protocol ip u32 match ip protocol 0x6 0xff</tt> - match TCP packets
# <tt>protocol ip u32 match ip protocol 1 0xff</tt> - match ICMP (ping) packets
# <tt>protocol ip u32 match ip dst 4.3.2.1/32</tt> - match all IP traffic headed for IP 4.3.2.1
# <tt>protocol ip u32 match ip src 4.3.2.1/32 match ip sport 80 0xffff</tt> - match all IP traffic from 4.3.2.1 port 80
# <tt>protocol ip u32 match ip sport 53 0xffff</tt> - match originating DNS (both TCP and UDP)
# <tt>protocol ip u32 match ip dport 53 0xffff</tt> - match response DNS (both TCP and UDP)
# <tt>protocol ip u32 match ip protocol 6 0xff match u8 0x10 0xff at nexthdr+13</tt> - match packets with ACK bit set
# <tt>protocol ip u32 match ip protocol 6 0xff match u8 0x10 0xff at nexthdr+13 match u16 0x0000 0xffc0 at 2</tt> - packets less than 64 bytes in size with ACK bit set
# <tt>protocol ip u32 match ip tos 0x10 0xff</tt> - match IP packets with "type of service" set to "Minimize delay"/"Interactive"
# <tt>protocol ip u32 match ip tos 0x08 0xff</tt> - match IP packets with "type of service" set to "Maximize throughput"/"Bulk" (see "QDISC PARAMETERS" in <tt>tc-prio</tt> man page)
# <tt>protocol ip u32 match tcp dport 53 0xffff match ip protocol 0x6 0xff</tt> - match TCP packets heading for dest. port 53 (may not work)

== Sample Traffic Control Code ==

<source lang="bash">
# outbound interface to shape
modemif=eth4

# send interactive (Minimize-Delay), DNS, HTTP and HTTPS TCP traffic to high-priority class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp -m tos --tos Minimize-Delay -j CLASSIFY --set-class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp --dport 53 -j CLASSIFY --set-class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp --dport 80 -j CLASSIFY --set-class 1:10
iptables -t mangle -A POSTROUTING -o $modemif -p tcp --dport 443 -j CLASSIFY --set-class 1:10

# root HTB qdisc; unclassified traffic defaults to class 1:12
tc qdisc add dev $modemif root handle 1: htb default 12
# parent class capped at the 1500kbit link rate
tc class add dev $modemif parent 1: classid 1:1 htb rate 1500kbit ceil 1500kbit burst 10k
# high-priority class (1:10) and low-priority class (1:12)
tc class add dev $modemif parent 1:1 classid 1:10 htb rate 700kbit ceil 1500kbit prio 1 burst 10k
tc class add dev $modemif parent 1:1 classid 1:12 htb rate 800kbit ceil 800kbit prio 2
# put all UDP traffic into the high-priority class
tc filter add dev $modemif protocol ip parent 1:0 prio 1 u32 match ip protocol 0x11 0xff flowid 1:10
# give each flow a fair share within its class
tc qdisc add dev $modemif parent 1:10 handle 20: sfq perturb 10
tc qdisc add dev $modemif parent 1:12 handle 30: sfq perturb 10
</source>

The code above is a working traffic control script that is even compatible with RHEL5 kernels, for a 1500kbps outbound link (T1, Cable or similar.) In this example, <tt>eth4</tt> is part of a bridge. The code above should work regardless of whether <tt>eth4</tt> is in a bridge or not -- just make sure that <tt>modemif</tt> is set to the interface on which traffic is flowing ''out'' and you wish to apply traffic control.

=== Emerge lxc ===
<console>
# ##i##emerge app-emulation/lxc
</console>

=== Configure Networking For Container ===

Typically, one uses a bridge to allow containers to connect to the network. This is how to do it under Funtoo Linux:

# Create a bridge using the Funtoo network configuration scripts. Name the bridge something like <tt>brwan</tt> (using <tt>/etc/init.d/netif.brwan</tt>). Configure your bridge to have an IP address.
# Make your physical interface, such as <tt>eth0</tt>, an interface with no IP address (use the Funtoo <tt>interface-noip</tt> template.)
# Make <tt>netif.eth0</tt> a slave of <tt>netif.brwan</tt> in <tt>/etc/conf.d/netif.brwan</tt>.
# Enable your new bridged network and make sure it is functioning properly on the host.

You will now be able to configure LXC to automatically add your container's virtual ethernet interface to the bridge when it starts, which will connect it to your network.

== Setting up a Funtoo Linux LXC Container ==

Here are the steps required to get Funtoo Linux running <i>inside</i> a container. The steps below show you how to set up a container using an existing Funtoo Linux OpenVZ template. It is now also possible to use [[Metro]] to build an lxc container tarball directly, which will save you manual configuration steps and will provide an <tt>/etc/fstab.lxc</tt> file that you can use for your host container config. See [[Metro Recipes]] for info on how to use Metro to generate an lxc container.
=== Create and Configure Container Filesystem ===


=== <tt>tc</tt> code walkthrough ===
# Start with a Funtoo LXC template, and unpack it to a directory such as <tt>/lxc/funtoo0/rootfs/</tt>
# Create an empty <tt>/lxc/funtoo0/fstab</tt> file
# Ensure <tt>c1</tt> line is uncommented (enabled) and <tt>c2</tt> through <tt>c6</tt> lines are disabled in <tt>/lxc/funtoo0/rootfs/etc/inittab</tt>


This script uses the <tt>tc</tt> command to create two priority classes - 1:10 and 1:12. By default, all traffic goes into the low-priority class, 1:12. 1:10 has priority over 1:12 (<tt>prio 1</tt> vs. <tt>prio 2</tt>,) so if there is any traffic in 1:10 ready to be sent, it will be sent ahead of 1:12. 1:10 has a rate of 700kbit but can use up to the full outbound bandwidth of 1500kbit by borrowing from 1:12.  
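
The rates in this walkthrough follow a simple budget: the two child classes are guaranteed 700kbit and 800kbit, which together equal the 1500kbit of the parent class 1:1. As a rough sanity check before loading a modified class tree, a helper like the following sketch can confirm that the guaranteed child rates do not oversubscribe the parent. The function name <tt>htb_rates_fit</tt> is made up for this example, and all rates are assumed to be plain integers in kbit.

```shell
# Succeed only if the guaranteed child rates add up to no more than the
# parent rate. All values are integers in kbit.
# Usage: htb_rates_fit <parent-rate> <child-rate>...
htb_rates_fit() {
    parent=$1
    shift
    total=0
    for rate in "$@"; do
        total=$((total + rate))
    done
    [ "$total" -le "$parent" ]
}
```

With the numbers from the script, <tt>htb_rates_fit 1500 700 800</tt> succeeds, while raising the low-priority class to 900kbit would make the check fail.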
That's almost all you need to get the container filesystem ready to start.


UDP traffic (traffic that matches <tt>ip protocol 0x11 0xff</tt>) will be put in the high priority class 1:10. This can be good for things like FPS games, to ensure that latency is low and not drowned out by lower-priority traffic.
=== Create Container Configuration Files ===


If we stopped here, however, we would get a bit worse results than if we didn't use <tt>tc</tt> at all. We have basically created two outgoing sub-channels of different priorities. The higher priority class ''can'' drown out the lower-priority class, and this is intentional so it isn't the issue -- in this case we ''want'' that functionality. The problem is that the high priority and low priority classes can both be dominated by high-bandwidth flows, causing other traffic flows of the same priority to be drowned out. To fix this, two <tt>sfq</tt> queuing disciplines are added to the high and low priority classes and will ensure that individual traffic flows are identified and each given a fair shot at sending data out of their respective classes. This should prevent starvation within the classes themselves.
Create the following files:


=== <tt>iptables</tt> code walkthrough ===
==== <tt>/lxc/funtoo0/config</tt> ====


First note that we are adding netfilter rules to the <tt>POSTROUTING</tt> chain, in the <tt>mangle</tt> table. This table allows us to modify the packets ''right before'' they are queued to be sent out of an interface, which is exactly what we want. At this point, these packets could have been locally-generated or forwarded -- as long as they are on their way to going out of <tt>modemif</tt> (eth4 in this case), the <tt>mangle</tt> <tt>POSTROUTING</tt> chain will see them and we can classify them and perform other useful tweaks.


The iptables code puts all traffic with the "minimize-delay" flag (interactive ssh traffic, for example) in the high priority traffic class. In addition, all HTTP, HTTPS and DNS TCP traffic will be classified as high-priority. Remember that all UDP traffic is being classified as high priority via the <tt>tc</tt> rule described above, so this will take care of DNS UDP traffic automatically.
Also create a symlink to <tt>/lxc/funtoo0/config</tt> in <tt>/etc/lxc/funtoo0/</tt>:
<console>
###i## mkdir /etc/lxc/funtoo0
###i## ln -s /lxc/funtoo0/config /etc/lxc/funtoo0/config
</console>


=== Further optimizations ===

==== SSH ====

<source lang="bash">
iptables -t mangle -N tosfix
# leave small packets (individual keystrokes) as Minimize-Delay:
iptables -t mangle -A tosfix -p tcp -m length --length 0:512 -j RETURN
#allow screen redraws under interactive SSH sessions to be fast:
iptables -t mangle -A tosfix -m hashlimit --hashlimit 20/sec --hashlimit-burst 20 \
--hashlimit-mode srcip,srcport,dstip,dstport --hashlimit-name minlat -j RETURN
# anything over the burst limit is reclassified as bulk traffic:
iptables -t mangle -A tosfix -j TOS --set-tos Maximize-Throughput
iptables -t mangle -A tosfix -j RETURN

iptables -t mangle -A POSTROUTING -p tcp -m tos --tos Minimize-Delay -j tosfix
</source>

To use this code, place it ''near the top of the file'', just below the <tt>modemif="eth4"</tt> line, but ''before'' the main <tt>iptables</tt> and <tt>tc</tt> rules. These rules will apply to ''all'' packets about to get queued to any interface, but this is not necessarily a bad thing, since the TOS values being set are not just specific to our traffic control functionality. To make these rules specific to <tt>modemif</tt>, add "-o $modemif" after "-A POSTROUTING" on the last line, above. As-is, the rules above will set the TOS field on all packets flowing out of all interfaces, but the traffic control rules will only take effect for <tt>modemif</tt>, because they are only configured for that interface.

SSH is a tricky protocol. By default, all the outgoing SSH traffic is classified as "minimize-delay" traffic, which will cause it to all flow into our high-priority class, even if it is a bulk <tt>scp</tt> transfer running in the background. This code will grab all "minimize-delay" traffic such as SSH and telnet and route it through some special rules. Any individual keystrokes (small packets) will be left as "minimize-delay" packets. For anything else, we will run the <tt>hashlimit</tt> iptables module, which will identify individual outbound flows and allow small bursts of traffic (even big packets) to remain "minimize-delay" packets. These settings have been specifically tuned so that most <tt>GNU screen</tt> screen changes (^A^N) when logging into your server(s) remotely will be fast. Any traffic over these burst limits will be reclassified as "maximize-throughput" and thus will drop to our lower-priority class 1:12. Combined with the traffic control rules, this will allow you to have very responsive SSH sessions into your servers, even if they are doing some kind of bulk outbound copy, like rsync over SSH.

Code in our main <tt>iptables</tt> rules will ensure that any "minimize-delay" traffic is tagged to be in the high-priority 1:10 class.

What this does is keep interactive SSH and telnet keystrokes in the high-priority class, allow GNU screen full redraws and reasonable full-screen editor scrolling to remain in the high-priority class, while forcing bulk transfers into the lower-priority class.

==== ACKs ====

<source lang="bash">
iptables -t mangle -N ack
# only touch packets that still have the default TOS:
iptables -t mangle -A ack -m tos ! --tos Normal-Service -j RETURN
iptables -t mangle -A ack -p tcp -m length --length 0:128 -j TOS --set-tos Minimize-Delay
# start at 129 so that 128-byte packets are not re-marked by this rule:
iptables -t mangle -A ack -p tcp -m length --length 129: -j TOS --set-tos Maximize-Throughput
iptables -t mangle -A ack -j RETURN

iptables -t mangle -A POSTROUTING -p tcp -m tcp --tcp-flags SYN,RST,ACK ACK -j ack
</source>

To use this code, place it ''near the top of the file'', just below the <tt>modemif="eth4"</tt> line, but ''before'' the main <tt>iptables</tt> and <tt>tc</tt> rules.

ACK optimization is another useful thing to do. If we prioritize small ACKs heading out to the modem, it will allow TCP traffic to flow more smoothly without unnecessary delay. The lines above accomplish this.

This code basically sets the "minimize-delay" flag on small ACKs. Code in our main <tt>iptables</tt> rules will then tag these packets so they enter high-priority traffic class 1:10.

== Other Links of Interest ==
* http://manpages.ubuntu.com/manpages/maverick/en/man8/ufw.8.html
* https://help.ubuntu.com/community/UFW

[[Category:Investigations]]
[[Category:Articles]]
[[Category:Featured]]
[[Category:Networking]]

{{Fancynote| Daniel Robbins needs to update this config to be more in line with http://wiki.progress-linux.org/software/lxc/ -- this config appears to have nice, refined device node permissions and other goodies. // note by Havis to Daniel, this config is already superior.}}

Read "man 5 lxc.conf" for more information about the Linux container configuration file. The contents of <tt>/lxc/funtoo0/config</tt>:
<pre>
## Container
lxc.utsname = funtoo0
lxc.rootfs = /lxc/funtoo0/rootfs/
lxc.arch = x86_64
#lxc.console = /var/log/lxc/funtoo0.console  # uncomment if you want to log the container's console
lxc.tty = 6  # if you plan to use the container with physical terminals (e.g. F1..F6)
#lxc.tty = 0  # set to 0 if you don't plan to use the container with a physical terminal; also comment out the c1 to c6 respawns in your container's /etc/inittab (e.g. c1:12345:respawn:/sbin/agetty 38400 tty1 linux)
lxc.pts = 1024

## Capabilities
lxc.cap.drop = audit_control
lxc.cap.drop = audit_write
lxc.cap.drop = mac_admin
lxc.cap.drop = mac_override
lxc.cap.drop = mknod
lxc.cap.drop = setfcap
lxc.cap.drop = setpcap
lxc.cap.drop = sys_admin
#lxc.cap.drop = sys_boot # capability to reboot the container
#lxc.cap.drop = sys_chroot # required by SSH
lxc.cap.drop = sys_module
#lxc.cap.drop = sys_nice
lxc.cap.drop = sys_pacct
lxc.cap.drop = sys_rawio
lxc.cap.drop = sys_resource
lxc.cap.drop = sys_time
#lxc.cap.drop = sys_tty_config # required by getty

## Devices
#lxc.cgroup.devices.allow = a # Allow access to all devices
lxc.cgroup.devices.deny = a # Deny access to all devices

# Allow to mknod all devices (but not using them)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m

lxc.cgroup.devices.allow = c 1:3 rwm # /dev/null
lxc.cgroup.devices.allow = c 1:5 rwm # /dev/zero
lxc.cgroup.devices.allow = c 1:7 rwm # /dev/full
lxc.cgroup.devices.allow = c 1:8 rwm # /dev/random
lxc.cgroup.devices.allow = c 1:9 rwm # /dev/urandom
#lxc.cgroup.devices.allow = c 4:0 rwm # /dev/tty0 ttys not required if you have lxc.tty = 0
#lxc.cgroup.devices.allow = c 4:1 rwm # /dev/tty1 devices with major number 4 are "real" tty devices
#lxc.cgroup.devices.allow = c 4:2 rwm # /dev/tty2
#lxc.cgroup.devices.allow = c 4:3 rwm # /dev/tty3
lxc.cgroup.devices.allow = c 5:0 rwm # /dev/tty
lxc.cgroup.devices.allow = c 5:1 rwm # /dev/console
lxc.cgroup.devices.allow = c 5:2 rwm # /dev/ptmx
lxc.cgroup.devices.allow = c 10:229 rwm # /dev/fuse
lxc.cgroup.devices.allow = c 136:* rwm # /dev/pts/* devices with major number 136 are pts
lxc.cgroup.devices.allow = c 254:0 rwm # /dev/rtc0

## Limits
lxc.cgroup.cpu.shares = 1024
lxc.cgroup.cpuset.cpus = 0        # limits container to CPU0
lxc.cgroup.memory.limit_in_bytes = 512M
lxc.cgroup.memory.memsw.limit_in_bytes = 1G
#lxc.cgroup.blkio.weight = 500      # requires cfq block scheduler

## Filesystem
# the container's fstab should be outside its rootfs dir (e.g. /lxc/funtoo0/fstab is ok, but /lxc/funtoo0/rootfs/etc/fstab is wrong!!!)
#lxc.mount = /lxc/funtoo0/fstab

# lxc.mount.entry is preferred, because it supports relative paths
lxc.mount.entry = proc proc proc nosuid,nodev,noexec  0 0
lxc.mount.entry = sysfs sys sysfs nosuid,nodev,noexec,ro 0 0
lxc.mount.entry = devpts dev/pts devpts nosuid,noexec,mode=0620,ptmxmode=000,newinstance 0 0
lxc.mount.entry = tmpfs dev/shm tmpfs nosuid,nodev,mode=1777 0 0
lxc.mount.entry = tmpfs run tmpfs nosuid,nodev,noexec,mode=0755,size=128m 0 0
lxc.mount.entry = tmpfs tmp tmpfs nosuid,nodev,noexec,mode=1777,size=1g 0 0

## Example of having /var/tmp/portage as tmpfs in container
#lxc.mount.entry = tmpfs var/tmp/portage tmpfs defaults,size=8g,uid=250,gid=250,mode=0775 0 0
## Example of bind mount
#lxc.mount.entry = /srv/funtoo0 /lxc/funtoo0/rootfs/srv/funtoo0 none defaults,bind 0 0

## Network
lxc.network.type = veth
lxc.network.flags = up
lxc.network.hwaddr = #put your MAC address here, otherwise you will get a random one
lxc.network.link = br0
lxc.network.name = eth0
#lxc.network.veth.pair = veth-example
</pre>

Read "man 7 capabilities" to get more information about Linux capabilities.

Use the following command to generate a random MAC address for <tt>lxc.network.hwaddr</tt> in the config above:

<console>
###i## openssl rand -hex 6 | sed 's/\(..\)/\1:/g; s/.$//'
</console>
 
It is a very good idea to assign a static MAC address to your container using <tt>lxc.network.hwaddr</tt>. If you don't, LXC will auto-generate a new random MAC every time your container starts, which may confuse network equipment that expects MAC addresses to remain constant.
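
One thing to keep in mind when generating a MAC address from purely random bytes: if the lowest bit of the first octet happens to be set, the result is a multicast address, which cannot be assigned to a network interface. The sketch below shows one way to avoid this by forcing the first octet to be unicast and locally administered. The function name <tt>random_unicast_mac</tt> is invented for this example, and it reads from <tt>/dev/urandom</tt> rather than calling <tt>openssl</tt>.

```shell
# Print a random MAC address whose first octet has the multicast bit
# cleared and the "locally administered" bit set.
random_unicast_mac() {
    # Six random bytes rendered as twelve hex digits.
    hex=$(od -An -N6 -tx1 /dev/urandom | tr -d ' \n')
    # First octet: clear bit 0 (multicast) and set bit 1 (locally administered).
    first=$(( (0x${hex%??????????} & 0xfc) | 0x02 ))
    # Remaining five octets, each prefixed with a colon.
    rest=$(printf '%s' "${hex#??}" | sed 's/\(..\)/:\1/g')
    printf '%02x%s\n' "$first" "$rest"
}
```

The output can be pasted into <tt>lxc.network.hwaddr</tt> in the same way as the output of the <tt>openssl</tt> command.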
 
Occasionally a container will fail to start with a MAC address generated by the <tt>openssl</tt> command above. If you run into this problem, the following small script derives the MAC address from the container's IP address instead. Save the code as <tt>/etc/lxc/hwaddr.sh</tt>, make it executable, and run it as <tt>/etc/lxc/hwaddr.sh xxx.xxx.xxx.xxx</tt>, where xxx.xxx.xxx.xxx is your container's IP address. <br><tt>/etc/lxc/hwaddr.sh</tt>:
 
<pre>
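#!/bin/bash
# Derive a locally administered MAC address from an IPv4 address.
IP=$1
# ${IP//./ } is a bash substitution that turns the dots into spaces,
# so each octet becomes one printf argument (zero-padded to two hex digits).
HA=$(printf "02:00:%02x:%02x:%02x:%02x" ${IP//./ })
echo "$HA"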
</pre>
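
If the script has to run under a plain POSIX <tt>/bin/sh</tt>, or should reject malformed input instead of printing a broken address, the same idea can be written as a function. This is only a sketch under those assumptions. The name <tt>ip_to_mac</tt> is made up for this example, and octets are expected in plain decimal without leading zeros.

```shell
# Convert a dotted-quad IPv4 address into a MAC address of the form
# 02:00:aa:bb:cc:dd, where aa..dd are the four octets in hex.
ip_to_mac() {
    # Split the argument on dots into the positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    if [ "$#" -ne 4 ]; then
        echo "expected a dotted-quad IPv4 address" >&2
        return 1
    fi
    for octet in "$@"; do
        # Reject empty or non-numeric fields, then check the range.
        case $octet in
            ''|*[!0-9]*) echo "bad octet: $octet" >&2; return 1 ;;
        esac
        if [ "$octet" -gt 255 ]; then
            echo "octet out of range: $octet" >&2
            return 1
        fi
    done
    printf '02:00:%02x:%02x:%02x:%02x\n' "$1" "$2" "$3" "$4"
}
```

For example, <tt>ip_to_mac 192.168.1.10</tt> prints <tt>02:00:c0:a8:01:0a</tt>.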
 
==== <tt>/lxc/funtoo0/fstab</tt> ====
{{fancynote| It is now preferable to have mount entries directly in the config file instead of a separate fstab.}}
If you use a separate fstab anyway, edit the file <tt>/lxc/funtoo0/fstab</tt>:
<pre>
none /lxc/funtoo0/rootfs/dev/pts devpts defaults 0 0
none /lxc/funtoo0/rootfs/proc proc defaults 0 0
none /lxc/funtoo0/rootfs/sys sysfs defaults 0 0
none /lxc/funtoo0/rootfs/dev/shm tmpfs nodev,nosuid,noexec,mode=1777,rw 0 0
</pre>
 
== LXC Networking ==
*veth - Virtual Ethernet (bridge)
*vlan - vlan interface (requires device able to do vlan tagging)
*macvlan (mac-address based virtual lan tagging) has 3 modes:
**private
**vepa (Virtual Ethernet Port Aggregator)
**bridge
*phys - dedicated host NIC
[https://blog.flameeyes.eu/2010/09/linux-containers-and-networking Linux Containers and Networking]
 
By default, Linux workstations and servers have IPv4 forwarding disabled. Enable routing on the host:
<console>
###i## echo "1" > /proc/sys/net/ipv4/ip_forward
###i## cat /proc/sys/net/ipv4/ip_forward
# 1
</console>
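
Writing to <tt>/proc</tt> as shown above only changes the running kernel, so the setting is lost at the next reboot. The usual way to make it permanent is a <tt>net.ipv4.ip_forward = 1</tt> line in <tt>/etc/sysctl.conf</tt>. The small sketch below checks the current state from a script. The function name <tt>ip_forward_enabled</tt> is made up for this example, and the optional file argument exists only so that the check can be tried without touching the real <tt>/proc</tt> entry.

```shell
# Succeed if IPv4 forwarding is currently enabled.
# Reads /proc/sys/net/ipv4/ip_forward unless another file is given
# (the override is an assumption made for testing).
ip_forward_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

# To keep forwarding enabled across reboots, /etc/sysctl.conf would
# typically contain the line:
#   net.ipv4.ip_forward = 1
```

A start-up script could then use <tt>ip_forward_enabled || echo "IPv4 forwarding is off"</tt> to warn before launching containers that rely on routing.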
 
== Initializing and Starting the Container ==
 
You will probably need to set the root password for the container before you can log in. You can use chroot to do this quickly:
 
<console>
###i## chroot /lxc/funtoo0/rootfs
(chroot) ###i## passwd
New password: XXXXXXXX
Retype new password: XXXXXXXX
passwd: password updated successfully
(chroot) ###i## exit
</console>
 
Now that the root password is set, run:
 
<console>
###i## lxc-start -n funtoo0 -d
</console>
 
The <tt>-d</tt> option will cause it to run in the background.
 
To attach to the console:
 
<console>
###i## lxc-console -n funtoo0
</console>
 
You should now be able to log in and use the container. In addition, the container should now be accessible on the network.
 
To attach directly to the container:
 
<console>
###i## lxc-attach -n funtoo0
</console>
 
To stop the container:
 
<console>
###i## lxc-stop -n funtoo0
</console>
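
When starting containers from a script, it can be useful to wait until the container has actually reached the running state before continuing. The following sketch combines <tt>lxc-start</tt> with <tt>lxc-wait</tt>, which is part of the same lxc userspace tools. The function name <tt>start_and_wait</tt> is made up for this example.

```shell
# Start a container in the background and block until it reports RUNNING.
# Usage: start_and_wait <container-name>
start_and_wait() {
    name=$1
    # -d detaches, exactly as in the manual command above.
    lxc-start -n "$name" -d || return 1
    # lxc-wait returns once the container has entered the given state.
    lxc-wait -n "$name" -s RUNNING
}
```

For the container used throughout this page, the call would be <tt>start_and_wait funtoo0</tt>.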
 
Ensure that networking is working from within the container while it is running, and you're good to go!
 
== Starting LXC container during host boot ==
 
# Create a symlink to <tt>/etc/init.d/lxc</tt> in <tt>/etc/init.d/</tt>, named after your container: <tt>ln -s /etc/init.d/lxc /etc/init.d/lxc.funtoo0</tt>
# Add <tt>lxc.funtoo0</tt> to the default runlevel: <tt>rc-update add lxc.funtoo0 default</tt>
<console>
###i## rc
* Starting funtoo0 ...                  [ ok ]
</console>
 
== LXC Bugs/Missing Features ==
 
This section is devoted to documenting issues with the current implementation of LXC and its associated tools. We will be gradually expanding this section with detailed descriptions of problems, their status, and proposed solutions.
 
=== reboot ===
 
* By default, lxc does not support rebooting a container from within. The container will simply stop, and the host will not know to start it again.
* If you want your container to reboot gracefully, it needs the sys_boot capability (comment out <tt>lxc.cap.drop = sys_boot</tt> in your container config).
 
=== PID namespaces ===
 
Process ID namespaces are functional, but the container can still see the CPU utilization of the host via the system load (i.e. in <tt>top</tt>).
 
=== /dev/pts newinstance ===
 
* Some changes may be required to the host to properly implement "newinstance" <tt>/dev/pts</tt>. See [https://bugzilla.redhat.com/show_bug.cgi?id=501718 This Red Hat bug].
 
=== lxc-create and lxc-destroy ===
 
* LXC's shell scripts are badly designed and are a sure path to destruction. Avoid using lxc-create and lxc-destroy.
 
=== network initialization and cleanup ===
 
* If lxc.network.type = phys is used, the interface is renamed to the value of lxc.network.link after lxc-stop. This was supposed to be fixed in 0.7.4, but still happens on 0.7.5: http://www.mail-archive.com/lxc-users@lists.sourceforge.net/msg01760.html
 
* Restarting a container can fail because network resources are still tied up by the already-defunct instance: [http://www.mail-archive.com/lxc-devel@lists.sourceforge.net/msg00824.html]
 
=== graceful shutdown ===
 
* To shut down a container gracefully, its init system needs to handle the signal sent by kill -PWR properly.
* For funtoo/gentoo, make sure that your container's /etc/inittab contains:
** pf:12345:powerwait:/sbin/halt
* For debian/ubuntu, make sure that your container's /etc/inittab contains:
** pf::powerwait:/sbin/shutdown -t1 -a -h now
** Also comment out any other powerfail line starting with pf (such as pf::powerwait:/etc/init.d/powerfail start). These are used if you have a UPS monitoring daemon installed.
* /etc/init.d/lxc seems to have broken support for graceful shutdown (it sends the proper signal, but then also tries to kill init with lxc-stop).
 
=== funtoo ===
 
* Our udev should be updated to contain <tt>-lxc</tt> in scripts. (This has been done as of 02-Nov-2011, so it should be resolved. It is not yet fixed in our OpenVZ templates, so they need to be regenerated in a few days.)
* Our openrc should be patched to gracefully handle the case where it cannot mount tmpfs. (The work-around in our docs above is to mount tmpfs to <tt>/libexec/rc/init.d</tt> using the container-specific <tt>fstab</tt> file on the host.)
* Emerging udev within a container can fail when realdev is run, because a device node (such as /dev/console) cannot be created if the container lacks the mknod capability. This should be fixed.
 
== References ==
 
* <tt>man 7 capabilities</tt>
* <tt>man 5 lxc.conf</tt>
 
== Links ==
 
* There are a number of additional lxc features that can be enabled via patches: [http://lxc.sourceforge.net/patches/linux/3.0.0/3.0.0-lxc1/]
* [https://wiki.ubuntu.com/UserNamespace Ubuntu User Namespaces page]
* lxc-gentoo setup script [https://github.com/globalcitizen/lxc-gentoo on GitHub]
 
* '''IBM developerWorks'''
** [http://www.ibm.com/developerworks/linux/library/l-lxc-containers/index.html LXC: Linux Container Tools]
** [http://www.ibm.com/developerworks/linux/library/l-lxc-security/ Secure Linux Containers Cookbook]
 
* '''Linux Weekly News'''
** [http://lwn.net/Articles/244531/ Smack for simplified access control]
 
[[Category:Labs]]
[[Category:HOWTO]]
[[Category:Virtualization]]

Revision as of 18:52, January 28, 2015

Linux Containers, or LXC, is a Linux feature that allows Linux to run one or more isolated virtual systems (with their own network interfaces, process namespace, user namespace, and power state) using a single Linux kernel on a single server.

Status

As of Linux kernel 3.1.5, LXC is usable for isolating your own private workloads from one another. It is not yet ready to isolate potentially malicious users from one another or the host system. For a more mature containers solution that is appropriate for hosting environments, see OpenVZ.

LXC containers don't yet have their own system uptime, and they see everything that's in the host's dmesg output, among other things. But in general, the technology works.

Basic Info

  • Linux Containers are based on:
    • Kernel namespaces for resource isolation
    • CGroups for resource limitation and accounting

app-emulation/lxc is the userspace tool for Linux containers

Control groups

  • Control groups (cgroups) in kernel since 2.6.24
    • Allows aggregation of tasks and their children
    • Subsystems (cpuset, memory, blkio,...)
    • accounting - to measure how much resources certain systems use
    • resource limiting - groups can be set to not exceed a set memory limit
    • prioritization - some groups may get a larger share of CPU
    • control - freezing/unfreezing of cgroups, checkpointing and restarting
    • No disk quota limitation ( -> image file, LVM, XFS, directory tree quota,...)

Subsystems


root # cat /proc/cgroups 
subsys_name	hierarchy	num_cgroups	enabled
cpuset	
cpu	
cpuacct	
memory	
devices	
freezer	
blkio	
perf_event
hugetlb
  1. cpuset -> limits tasks to specific CPU/CPUs
  2. cpu -> CPU shares
  3. cpuacct -> CPU accounting
  4. memory -> memory and swap limitation and accounting
  5. devices -> device allow deny list
  6. freezer -> suspend/resume tasks
  7. blkio -> I/O prioritization (weight, throttle, ...)
  8. perf_event -> support for per-cpu per-cgroup monitoring perf_events
  9. hugetlb -> cgroup resource controller for HugeTLB pages
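
To see at a glance which of these subsystems the running kernel has enabled, the table can be filtered on its last column. This is a minimal sketch, assuming the usual four-column layout of /proc/cgroups (name, hierarchy, number of cgroups, enabled); the function name enabled_cgroup_subsystems is hypothetical and not part of LXC.

```shell
# Hypothetical helper: print the names of cgroup subsystems that are enabled.
# Reads /proc/cgroups by default; another file can be passed for testing.
enabled_cgroup_subsystems() {
    # Skip the header line and keep rows whose 4th field ("enabled") is 1.
    awk '$1 !~ /^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}
```

Calling enabled_cgroup_subsystems without arguments lists one subsystem per line for the running kernel.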

Configuring the Funtoo Host System

Install LXC kernel

Any kernel beyond 3.1.5 will probably work. Personally I prefer No results as these have support for all the namespaces without sacrificing the xfs, FUSE or NFS support for example. These checks were introduced later starting from kernel 3.5, this could also mean that the user namespace is not working optimally.

  • User namespace (EXPERIMENTAL) depends on EXPERIMENTAL and on UIDGID_CONVERTED
    • config UIDGID_CONVERTED
      • True if all of the selected software components are known to have uid_t and gid_t converted to kuid_t and kgid_t where appropriate and are otherwise safe to use with the user namespace.
        • Networking - depends on NET_9P = n
        • Filesystems - 9P_FS = n, AFS_FS = n, AUTOFS4_FS = n, CEPH_FS = n, CIFS = n, CODA_FS = n, FUSE_FS = n, GFS2_FS = n, NCP_FS = n, NFSD = n, NFS_FS = n, OCFS2_FS = n, XFS_FS = n
        • Security options - Grsecurity - GRKERNSEC = n (if applicable)
    • As of 3.10.xx kernel, all of the above options are safe to use with User namespaces, except for XFS_FS, therefore with kernel >=3.10.xx, you should answer XFS_FS = n, if you want User namespaces support.
    • in your kernel source directory, you should check init/Kconfig and find out what UIDGID_CONVERTED depends on

Kernel configuration

These options should be enabled in your kernel to take full advantage of LXC.

  • General setup
    • CONFIG_NAMESPACES
      • CONFIG_UTS_NS
      • CONFIG_IPC_NS
      • CONFIG_PID_NS
      • CONFIG_NET_NS
      • CONFIG_USER_NS
    • CONFIG_CGROUPS
      • CONFIG_CGROUP_DEVICE
      • CONFIG_CGROUP_SCHED
      • CONFIG_CGROUP_CPUACCT
      • CONFIG_CGROUP_MEM_RES_CTLR (in 3.6+ kernels it's called CONFIG_MEMCG)
      • CONFIG_CGROUP_MEM_RES_CTLR_SWAP (in 3.6+ kernels it's called CONFIG_MEMCG_SWAP)
      • CONFIG_CPUSETS (on multiprocessor hosts)
  • Networking support
    • Networking options
      • CONFIG_VLAN_8021Q
  • Device Drivers
    • Character devices
      • Unix98 PTY support
        • CONFIG_DEVPTS_MULTIPLE_INSTANCES
    • Network device support
      • Network core driver support
        • CONFIG_VETH
        • CONFIG_MACVLAN

Once you have lxc installed, you can then check your kernel config with:

root # CONFIG=/path/to/config /usr/sbin/lxc-checkconfig
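
If lxc is not installed yet, a rough check of the options listed above can be done with grep alone. The following is only a sketch, not a replacement for lxc-checkconfig: the function name missing_lxc_options is hypothetical, it assumes an uncompressed kernel config file, and it leaves out the memory controller options because their names differ between kernel versions.

```shell
# Hypothetical helper: print every listed option that is not set to y or m
# in the kernel config file given as $1 (e.g. /usr/src/linux/.config).
missing_lxc_options() {
    config_file=$1
    for opt in CONFIG_NAMESPACES CONFIG_UTS_NS CONFIG_IPC_NS CONFIG_PID_NS \
               CONFIG_NET_NS CONFIG_USER_NS CONFIG_CGROUPS CONFIG_CGROUP_DEVICE \
               CONFIG_CGROUP_SCHED CONFIG_CGROUP_CPUACCT CONFIG_VLAN_8021Q \
               CONFIG_DEVPTS_MULTIPLE_INSTANCES CONFIG_VETH CONFIG_MACVLAN; do
        # An option counts as present when it is built in (=y) or a module (=m).
        grep -Eq "^${opt}=[ym]$" "$config_file" || echo "$opt"
    done
}
```

An empty result from missing_lxc_options /usr/src/linux/.config means all of the checked options are set.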

Emerge lxc

root # emerge app-emulation/lxc

Configure Networking For Container

Typically, one uses a bridge to allow containers to connect to the network. This is how to do it under Funtoo Linux:

  1. Create a bridge using the Funtoo network configuration scripts. Name the bridge something like brwan (using /etc/init.d/netif.brwan). Configure your bridge to have an IP address.
  2. Make your physical interface, such as eth0, an interface with no IP address (use the Funtoo interface-noip template.)
  3. Make netif.eth0 a slave of netif.brwan in /etc/conf.d/netif.brwan.
  4. Enable your new bridged network and make sure it is functioning properly on the host.

You will now be able to configure LXC to automatically add your container's virtual ethernet interface to the bridge when it starts, which will connect it to your network.

Setting up a Funtoo Linux LXC Container

Here are the steps required to get Funtoo Linux running inside a container. The steps below show you how to set up a container using an existing Funtoo Linux OpenVZ template. It is now also possible to use Metro to build an lxc container tarball directly, which will save you manual configuration steps and will provide an /etc/fstab.lxc file that you can use for your host container config. See Metro Recipes for info on how to use Metro to generate an lxc container.

Create and Configure Container Filesystem

  1. Start with a Funtoo LXC template, and unpack it to a directory such as /lxc/funtoo0/rootfs/
  2. Create an empty /lxc/funtoo0/fstab file
  3. Ensure c1 line is uncommented (enabled) and c2 through c6 lines are disabled in /lxc/funtoo0/rootfs/etc/inittab

That's almost all you need to get the container filesystem ready to start.

Create Container Configuration Files

Create the following files:

/lxc/funtoo0/config

Also create a symlink to /lxc/funtoo0/config at /etc/lxc/funtoo0/config:

root # mkdir /etc/lxc/funtoo0
root # ln -s /lxc/funtoo0/config /etc/lxc/funtoo0/config
   Note
Daniel Robbins needs to update this config to be more in line with http://wiki.progress-linux.org/software/lxc/ -- this config appears to have nice, refined device node permissions and other goodies. // note by Havis to Daniel, this config is already superior.


Read "man 5 lxc.conf" for more information about the Linux container configuration file.

## Container
lxc.utsname                             = funtoo0
lxc.rootfs                              = /lxc/funtoo0/rootfs/
lxc.arch                                = x86_64
#lxc.console                            = /var/log/lxc/funtoo0.console  # uncomment if you want to log the container's console
lxc.tty                                 = 6  # if you plan to use the container with physical terminals (e.g. F1..F6)
#lxc.tty                                = 0  # set to 0 if you don't plan to use the container with a physical terminal; also comment out the c1 to c6 respawn lines in your container's /etc/inittab (e.g. c1:12345:respawn:/sbin/agetty 38400 tty1 linux)
lxc.pts                                 = 1024


## Capabilities
lxc.cap.drop                            = audit_control
lxc.cap.drop                            = audit_write
lxc.cap.drop                            = mac_admin
lxc.cap.drop                            = mac_override
lxc.cap.drop                            = mknod
lxc.cap.drop                            = setfcap
lxc.cap.drop                            = setpcap
lxc.cap.drop                            = sys_admin
#lxc.cap.drop                            = sys_boot # capability to reboot the container
#lxc.cap.drop                            = sys_chroot # required by SSH
lxc.cap.drop                            = sys_module
#lxc.cap.drop                            = sys_nice
lxc.cap.drop                            = sys_pacct
lxc.cap.drop                            = sys_rawio
lxc.cap.drop                            = sys_resource
lxc.cap.drop                            = sys_time
#lxc.cap.drop                            = sys_tty_config # required by getty

## Devices
#lxc.cgroup.devices.allow               = a # Allow access to all devices
lxc.cgroup.devices.deny                 = a # Deny access to all devices

# Allow to mknod all devices (but not using them)
lxc.cgroup.devices.allow                = c *:* m
lxc.cgroup.devices.allow                = b *:* m

lxc.cgroup.devices.allow                = c 1:3 rwm # /dev/null
lxc.cgroup.devices.allow                = c 1:5 rwm # /dev/zero
lxc.cgroup.devices.allow                = c 1:7 rwm # /dev/full
lxc.cgroup.devices.allow                = c 1:8 rwm # /dev/random
lxc.cgroup.devices.allow                = c 1:9 rwm # /dev/urandom
#lxc.cgroup.devices.allow                = c 4:0 rwm # /dev/tty0 ttys not required if you have lxc.tty = 0
#lxc.cgroup.devices.allow                = c 4:1 rwm # /dev/tty1 devices with major number 4 are "real" tty devices
#lxc.cgroup.devices.allow                = c 4:2 rwm # /dev/tty2
#lxc.cgroup.devices.allow                = c 4:3 rwm # /dev/tty3
lxc.cgroup.devices.allow                = c 5:0 rwm # /dev/tty
lxc.cgroup.devices.allow                = c 5:1 rwm # /dev/console
lxc.cgroup.devices.allow                = c 5:2 rwm # /dev/ptmx
lxc.cgroup.devices.allow                = c 10:229 rwm # /dev/fuse
lxc.cgroup.devices.allow                = c 136:* rwm # /dev/pts/* devices with major number 136 are pts
lxc.cgroup.devices.allow                = c 254:0 rwm # /dev/rtc0

## Limits
lxc.cgroup.cpu.shares                  = 1024
lxc.cgroup.cpuset.cpus                 = 0        # limits container to CPU0
lxc.cgroup.memory.limit_in_bytes       = 512M
lxc.cgroup.memory.memsw.limit_in_bytes = 1G
#lxc.cgroup.blkio.weight                = 500      # requires cfq block scheduler

## Filesystem
#The container's fstab should be outside its rootfs dir (e.g. /lxc/funtoo0/fstab is ok, but /lxc/funtoo0/rootfs/etc/fstab is wrong!)
#lxc.mount                               = /lxc/funtoo0/fstab       

#lxc.mount.entry is preferred, because it supports relative paths
lxc.mount.entry                         = proc proc proc nosuid,nodev,noexec  0 0
lxc.mount.entry                         = sysfs sys sysfs nosuid,nodev,noexec,ro 0 0
lxc.mount.entry                         = devpts dev/pts devpts nosuid,noexec,mode=0620,ptmxmode=000,newinstance 0 0
lxc.mount.entry                         = tmpfs dev/shm tmpfs nosuid,nodev,mode=1777 0 0
lxc.mount.entry                         = tmpfs run tmpfs nosuid,nodev,noexec,mode=0755,size=128m 0 0
lxc.mount.entry                         = tmpfs tmp tmpfs nosuid,nodev,noexec,mode=1777,size=1g 0 0

##Example of having /var/tmp/portage as tmpfs in container 
#lxc.mount.entry                         = tmpfs var/tmp/portage tmpfs defaults,size=8g,uid=250,gid=250,mode=0775 0 0
##Example of bind mount
#lxc.mount.entry                        = /srv/funtoo0 /lxc/funtoo0/rootfs/srv/funtoo0 none defaults,bind 0 0

## Network
lxc.network.type                        = veth
lxc.network.flags                       = up
lxc.network.hwaddr                      = #put your MAC address here, otherwise you will get a random one
lxc.network.link                        = br0
lxc.network.name                        = eth0
#lxc.network.veth.pair                   = veth-example

Read "man 7 capabilities" for more information about Linux capabilities.

To generate a random MAC address for lxc.network.hwaddr in the config above, use the following command:

root # openssl rand -hex 6 | sed 's/\(..\)/\1:/g; s/.$//'

It is a very good idea to assign a static MAC address to your container using lxc.network.hwaddr. If you don't, LXC will auto-generate a new random MAC every time your container starts, which may confuse network equipment that expects MAC addresses to remain constant.
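
The openssl command above leaves the first octet fully random. An address whose first octet has its lowest bit set is a multicast address, which the kernel does not accept as an interface address, so this is one possible cause of a container failing to start with a generated MAC. The sketch below fixes the first octet to 02 (locally administered, unicast) and randomizes only the remaining five octets. The function name gen_lxc_mac is hypothetical, and reading /dev/urandom with od is a substitution for the openssl call, not part of the original instructions.

```shell
# Hypothetical helper: print a random locally administered, unicast MAC address.
gen_lxc_mac() {
    # Read 5 random bytes as hex pairs (e.g. " 3f a2 07 c1 9e"), trim the
    # surrounding blanks and join the pairs with colons.
    tail=$(od -An -N5 -tx1 /dev/urandom | tr -d '\n' \
        | sed 's/^ *//; s/ *$//; s/  */:/g')
    # Prefix the fixed first octet 02.
    printf '02:%s\n' "$tail"
}
```

Run gen_lxc_mac once and paste the result into lxc.network.hwaddr, so that the address stays constant across container restarts.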

In some cases the container will not start with a MAC address generated by the command above. If you run into that problem, you can instead derive the MAC address from the container's IP address with a small script. Save the following code as /etc/lxc/hwaddr.sh, make it executable, and run it as /etc/lxc/hwaddr.sh xxx.xxx.xxx.xxx, where xxx.xxx.xxx.xxx is your container's IP address.
/etc/lxc/hwaddr.sh:

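#!/bin/bash
# Derive a locally administered MAC address from the container's IPv4 address.
# Uses bash's ${var//pattern/replacement}, so plain POSIX sh is not enough.
IP=$1
# Split the dotted quad into four numbers and print each as two hex digits.
HA=$(printf "02:00:%02x:%02x:%02x:%02x" ${IP//./ })
echo "$HA"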

/lxc/funtoo0/fstab

   Note
It is now preferable to have mount entries directly in the config file instead of a separate fstab.

Edit the file /lxc/funtoo0/fstab:

none /lxc/funtoo0/rootfs/dev/pts devpts defaults 0 0
none /lxc/funtoo0/rootfs/proc proc defaults 0 0
none /lxc/funtoo0/rootfs/sys sysfs defaults 0 0
none /lxc/funtoo0/rootfs/dev/shm tmpfs nodev,nosuid,noexec,mode=1777,rw 0 0

LXC Networking

  • veth - Virtual Ethernet (bridge)
  • vlan - vlan interface (requires device able to do vlan tagging)
  • macvlan (mac-address based virtual lan tagging) has 3 modes:
    • private
    • vepa (Virtual Ethernet Port Aggregator)
    • bridge
  • phys - dedicated host NIC

Linux Containers and Networking

Enable routing on the host: By default Linux workstations and servers have IPv4 forwarding disabled.

root # echo "1" > /proc/sys/net/ipv4/ip_forward
root # cat /proc/sys/net/ipv4/ip_forward
1
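
The check and the change can be wrapped in two small helpers so that the value is only written when it is not already set. This is a sketch under stated assumptions: the function names and the IP_FORWARD_FILE variable are hypothetical, and the variable exists only so the helpers can be tried against a scratch file.

```shell
# Path of the kernel switch; can be overridden to try the helpers on a scratch file.
: "${IP_FORWARD_FILE:=/proc/sys/net/ipv4/ip_forward}"

# Succeed if IPv4 forwarding is currently enabled.
forwarding_enabled() {
    [ "$(cat "$IP_FORWARD_FILE")" = "1" ]
}

# Enable IPv4 forwarding unless it is already on (needs root for the real file).
enable_forwarding() {
    forwarding_enabled || echo 1 > "$IP_FORWARD_FILE"
}
```

Writing to /proc only lasts until the next reboot. To make the setting permanent, add net.ipv4.ip_forward = 1 to /etc/sysctl.conf on the host.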

Initializing and Starting the Container

You will probably need to set the root password for the container before you can log in. You can use chroot to do this quickly:

root # chroot /lxc/funtoo0/rootfs
(chroot) # passwd
New password: XXXXXXXX
Retype new password: XXXXXXXX
passwd: password updated successfully
(chroot) # exit

Now that the root password is set, run:

root # lxc-start -n funtoo0 -d

The -d option will cause it to run in the background.

To attach to the console:

root # lxc-console -n funtoo0

You should now be able to log in and use the container. In addition, the container should now be accessible on the network.

To attach directly to the container:

root # lxc-attach -n funtoo0

To stop the container:

root # lxc-stop -n funtoo0

Ensure that networking is working from within the container while it is running, and you're good to go!

Starting LXC container during host boot

  1. Create a symlink to /etc/init.d/lxc in /etc/init.d/, named after your container:
     ln -s /etc/init.d/lxc /etc/init.d/lxc.funtoo0
  2. Add lxc.funtoo0 to the default runlevel:
     rc-update add lxc.funtoo0 default
root # rc
 * Starting funtoo0 ...                  [ ok ]

LXC Bugs/Missing Features

This section is devoted to documenting issues with the current implementation of LXC and its associated tools. We will be gradually expanding this section with detailed descriptions of problems, their status, and proposed solutions.

reboot

  • By default, lxc does not support rebooting a container from within. The container will simply stop, and the host will not know to start it again.
  • If you want your container to reboot gracefully, it needs the sys_boot capability (comment out lxc.cap.drop = sys_boot in your container config).

PID namespaces

Process ID namespaces are functional, but the container can still see the CPU utilization of the host via the system load (i.e. in top).

/dev/pts newinstance

  • Some changes may be required to the host to properly implement "newinstance" /dev/pts. See This Red Hat bug.

lxc-create and lxc-destroy

  • LXC's shell scripts are badly designed and are a sure path to destruction. Avoid using lxc-create and lxc-destroy.

network initialization and cleanup

  • Restarting a container can fail because network resources are still tied up by the already-defunct instance: [1]

graceful shutdown

  • To shut down a container gracefully, its init system needs to handle the signal sent by kill -PWR properly.
  • For funtoo/gentoo, make sure that your container's /etc/inittab contains:
    • pf:12345:powerwait:/sbin/halt
  • For debian/ubuntu, make sure that your container's /etc/inittab contains:
    • pf::powerwait:/sbin/shutdown -t1 -a -h now
    • Also comment out any other powerfail line starting with pf (such as pf::powerwait:/etc/init.d/powerfail start). These are used if you have a UPS monitoring daemon installed.
  • /etc/init.d/lxc seems to have broken support for graceful shutdown (it sends the proper signal, but then also tries to kill init with lxc-stop).
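
Before relying on a graceful shutdown, it can help to confirm that the container's inittab really has an active powerwait entry. A minimal sketch, assuming the sysvinit inittab format shown above; the function name has_powerwait_entry is hypothetical.

```shell
# Hypothetical helper: succeed if the inittab file given as $1 contains an
# uncommented "pf" entry with the powerwait action.
has_powerwait_entry() {
    # Matches both pf:12345:powerwait:... and pf::powerwait:... lines.
    grep -Eq '^pf:[0-9]*:powerwait:' "$1"
}
```

For the container used in this document, the check would be run on the host as has_powerwait_entry /lxc/funtoo0/rootfs/etc/inittab.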

funtoo

  • Our udev should be updated to contain -lxc in scripts. (This has been done as of 02-Nov-2011, so it should be resolved. It is not yet fixed in our OpenVZ templates, so they need to be regenerated in a few days.)
  • Our openrc should be patched to gracefully handle the case where it cannot mount tmpfs. (The work-around in our docs above is to mount tmpfs to /libexec/rc/init.d using the container-specific fstab file on the host.)
  • Emerging udev within a container can fail when realdev is run, because a device node (such as /dev/console) cannot be created if the container lacks the mknod capability. This should be fixed.

References

  • man 7 capabilities
  • man 5 lxc.conf

Links