Control Groups in Linux

Control groups (cgroups) are a kernel mechanism for grouping tasks.

Cgroups started as "process containers", developed by two Google engineers in 2006. In 2007 the work was renamed "control groups" and merged into the Linux kernel.

When talking about control groups, the term task is synonymous with a (system) process.

Overview

Hierarchy

Control groups are grouped into a tree-like hierarchy. For example, you might have two control groups users and system, and the users control group might have admin and normal sub-control groups. Every hierarchy stems from a single root control group.

For our example above, the group hierarchy would look like this:

(root)
├── system
└── users
    ├── admin
    └── normal

A task can belong to one, and only one, group within the same hierarchy. This means a task cannot be in both the system group and the normal group.
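
One way to observe this rule on a running system is to read /proc/self/cgroup, which holds one line per hierarchy in the form hierarchy-ID:subsystem-list:group-path. The snippet below is a minimal sketch rather than a canonical tool; the function name show_membership is made up for this example.

```shell
#!/bin/sh
# show_membership [FILE]
# Print "<subsystems> -> <group>" for every hierarchy a process belongs to.
# FILE defaults to /proc/self/cgroup (the process reading the file).
# show_membership is a hypothetical helper name, not part of any package.
show_membership() (
    file=${1:-/proc/self/cgroup}
    # Each line reads hierarchy-ID:subsystem-list:group-path. The last
    # variable receives the rest of the line, so colons in paths survive.
    while IFS=: read -r _id subsystems path; do
        printf '%s -> %s\n' "${subsystems:-(none)}" "$path"
    done < "$file"
)

# Only query the live system when the file is available.
if [ -r /proc/self/cgroup ]; then
    show_membership
fi
```

Each hierarchy appears on exactly one line with exactly one group path, which is the "one group per hierarchy" rule in practice.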

Subsystems

This grouping is useful because it allows subsystems to attach onto the groups in order to track and limit the resource usage of the tasks within those control groups.

For example, the cpu subsystem can attach onto the hierarchy above and limit the usage of the users control group to 40% of the total CPU time available, ensuring the system control group has enough resources to run its processes.
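
As a rough sketch of how such a limit might be expressed: in cgroups v1 the cpu subsystem accepts a quota and a period, both in microseconds, through the files cpu.cfs_quota_us and cpu.cfs_period_us inside a group's directory. The helper names, the 100 ms period, and reading "40% of the total CPU time" as 40% of all CPUs combined are assumptions made for illustration.

```shell
#!/bin/sh
# cfs_quota PERCENT NCPUS [PERIOD_US]
# Compute the CFS quota (microseconds of CPU time per period) that
# corresponds to PERCENT of the combined time of NCPUS processors.
cfs_quota() (
    percent=$1
    ncpus=$2
    period=${3:-100000}   # assumed period of 100 ms
    echo $(( period * ncpus * percent / 100 ))
)

# limit_cpu GROUP_DIR PERCENT NCPUS
# Write the period and quota into an existing control group directory.
limit_cpu() (
    group_dir=$1
    period=100000
    echo "$period" > "$group_dir/cpu.cfs_period_us"
    cfs_quota "$2" "$3" "$period" > "$group_dir/cpu.cfs_quota_us"
)

# Example (needs root and an existing "users" group in the cpu hierarchy):
#   limit_cpu /sys/fs/cgroup/cpu/users 40 "$(nproc)"
```

A quota larger than the period is valid: it simply means the group may use more than one CPU's worth of time per period.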

There are many subsystems, each managing a different set of resources:

  • blkio - Block devices (e.g. hard drives) input / output
  • cpu - ability to schedule tasks
  • cpuacct - CPU usage accounting
  • cpuset - CPUs and memory nodes
  • devices - ability of tasks to create or use device nodes
  • freezer - activity of a control group. Tasks in frozen groups are not scheduled
  • hugetlb - Large Page support (HugeTLB) usage
  • memory - Memory, kernel memory, swap memory
  • net_cls - ability to tag packets based on control group. These tags can be used by a traffic controller to assign priorities
  • net_prio - ability to set network traffic priority
  • perf_event - ability to monitor threads

There can be multiple control group hierarchies in a system, and different subsystems can attach onto them. For example, you may have a hierarchy that you use to control CPU and memory resources, whilst having another hierarchy to allocate network bandwidth.

In our example above, this might be useful because our system tasks require a lot of CPU and memory, but very little bandwidth. User-initiated processes in the users control group require little CPU and memory, but a lot of bandwidth. By attaching different subsystems to different control group hierarchies, the system can better manage resources.

There are several rules dictating the relationship between subsystems, hierarchies, control groups and tasks. The best explanation we've found is in the Red Hat Enterprise Linux documentation.

N.B. Although subsystems and control groups are usually related to each other, there are no inherent relations. You can have a control group hierarchy that is not bound to any subsystems.

Real-World Example

There aren't really control groups called users and normal; we just made those up so the concepts are easier to understand. So let's go over a real example to solidify our understanding.

Control groups have been implemented in the Linux kernel since 2007, and all major distributions support them. We will be using Ubuntu from here on; if you use another distribution, the commands are likely to be different.

For CentOS, Fedora and RHEL, there's the libcgroup package. You can install it using:

# yum install libcgroup

Although control groups are an internal feature of the Linux kernel, there are packages which allow you to manipulate, control, administer and monitor control groups. For Ubuntu, there was the libcgroup1 package, which has been superseded by cgmanager since 14.04.

Let's install them now.

$ sudo apt install libcgroup1 cgmanager cgroup-tools

First, let's take a look at the available subsystems in our system:

$ lssubsys
cpuset  
cpu,cpuacct  
blkio  
memory  
devices  
freezer  
net_cls,net_prio  
perf_event  
hugetlb  
pids  
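
If lssubsys is not installed, the kernel exposes similar information in /proc/cgroups, a table whose columns are the subsystem name, hierarchy ID, number of control groups, and an enabled flag. The following is a minimal sketch built on that file; list_subsystems is a hypothetical helper name.

```shell
#!/bin/sh
# list_subsystems [FILE]
# Print the name of every enabled subsystem, one per line.
# FILE defaults to /proc/cgroups.
list_subsystems() {
    # Skip the "#subsys_name ..." header; column 4 is the enabled flag.
    awk '!/^#/ && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}

# Only query the live system when the file is available.
if [ -r /proc/cgroups ]; then
    list_subsystems
fi
```

Unlike lssubsys, this prints co-mounted subsystems such as cpu and cpuacct on separate lines; they can be recognised by their shared hierarchy ID in the second column of /proc/cgroups.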

Likewise, we can list out all the control groups in our system:

$ lscgroup
cpu,cpuacct:/  
freezer:/  
net_cls,net_prio:/  
devices:/  
devices:/init.scope  
devices:/system.slice  
devices:/system.slice/avahi-daemon.service  
devices:/system.slice/dev-sda5.swap  
devices:/system.slice/thermald.service  
...
devices:/user.slice  
hugetlb:/  
memory:/  
cpuset:/  
blkio:/  
perf_event:/  
pids:/  
pids:/init.scope  
pids:/system.slice  
pids:/user.slice  
pids:/user.slice/user-1000.slice  

You may notice that the first part of each line, before the colon, corresponds to a subsystem, and the / immediately after the colon corresponds to the root of the control group hierarchy to which the subsystem is attached.

For example, if you look at the devices entries, you can see that we have the devices:/ root control group, which has the system.slice control group as a child, which, in turn, has avahi-daemon.service, dev-sda5.swap, thermald.service, etc. as further child control groups.

The root control group has all resources of that type available to it. For example, the cpuset:/ group has all the CPUs and memory nodes available to the system.

If, for instance, we want to run an application and reserve 50% of the system's CPUs and memory nodes for it, we can create a new control group cpuset:/app and allocate it half of the CPUs and memory nodes. Then, all tasks attached to the cpuset:/app group cannot (cumulatively) use more than that 50% of the CPUs and memory nodes.

We can create more child control groups within the cpuset:/app group (e.g. cpuset:/app/system and cpuset:/app/cleanup). Because cpuset:/app only has access to 50% of the resources, that 50% is split between the child groups. Therefore, cpuset:/app/system and cpuset:/app/cleanup cannot, cumulatively, use more than 50% of the host system's resources.
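
To make the cpuset:/app idea concrete, here is a hedged sketch that gives a group the first half of the CPUs. The cpuset.cpus and cpuset.mems files are part of the cpuset subsystem's interface; the helper names, the choice of the lowest-numbered CPUs, and the use of memory node 0 are assumptions for illustration.

```shell
#!/bin/sh
# half_cpu_range NCPUS
# Print a cpuset-style range covering the first half of NCPUS processors
# (at least one CPU), e.g. "0-3" for 8 CPUs.
half_cpu_range() (
    half=$(( $1 / 2 ))
    [ "$half" -lt 1 ] && half=1
    if [ "$half" -eq 1 ]; then
        echo 0
    else
        echo "0-$(( half - 1 ))"
    fi
)

# reserve_half GROUP_DIR NCPUS
# Create the group directory and assign it half of the CPUs and memory
# node 0. A cpuset group needs both files set before tasks can join it.
reserve_half() (
    group_dir=$1
    mkdir -p "$group_dir"
    half_cpu_range "$2" > "$group_dir/cpuset.cpus"
    echo 0 > "$group_dir/cpuset.mems"
)

# Example (needs root): reserve_half /sys/fs/cgroup/cpuset/app "$(nproc)"
```

Note that cpuset hands out whole CPUs and whole memory nodes rather than percentages, so "50%" here means half of the available CPUs.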

Diving Deeper

So far, we understand:

  • Control groups are a mechanism for host systems to group tasks (processes)
  • A subsystem is a part of the host system responsible for tracking and allocating particular resources, such as CPU, memory and network bandwidth
  • A subsystem can attach onto a control group and use the control group hierarchy to segregate resources
  • A control group only has access to resources that are allocated to it
  • A child control group can never access resources that its parent cannot access

Now that we understand what control groups are, let's dive a little deeper and understand how they are implemented, and how you can create a new control group.

Control Groups are implemented as a Filesystem

Control groups are exposed through the filesystem. A temporary file storage (tmpfs) filesystem is mounted at /sys/fs/cgroup/, and each subsystem is mounted under /sys/fs/cgroup/ as a cgroup filesystem.

We can check this in two ways:

  1. Simply checking for files and directories located at /sys/fs/cgroup/

    $ cd /sys/fs/cgroup/; ls -ahl
    total 0
    drwxr-xr-x 14 root root 360 Feb  3 12:07 .
    drwxr-xr-x 10 root root   0 Feb  3 13:30 ..
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 blkio
    drwxr-xr-x  2 root root  60 Feb  3 12:07 cgmanager
    lrwxrwxrwx  1 root root  11 Jan 31 09:53 cpu -> cpu,cpuacct
    lrwxrwxrwx  1 root root  11 Jan 31 09:53 cpuacct -> cpu,cpuacct
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 cpu,cpuacct
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 cpuset
    dr-xr-xr-x  5 root root   0 Feb  3 13:30 devices
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 freezer
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 hugetlb
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 memory
    lrwxrwxrwx  1 root root  16 Jan 31 09:53 net_cls -> net_cls,net_prio
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 net_cls,net_prio
    lrwxrwxrwx  1 root root  16 Jan 31 09:53 net_prio -> net_cls,net_prio
    dr-xr-xr-x  2 root root   0 Feb  3 13:30 perf_event
    dr-xr-xr-x  5 root root   0 Feb  3 13:30 pids
    dr-xr-xr-x  5 root root   0 Feb  3 13:30 systemd
    
  2. Run mount to see a list of mounted filesystems (irrelevant lines were omitted)

    $ mount
    tmpfs on /sys/fs/cgroup type tmpfs (rw,mode=755)
    cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
    cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
    cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
    cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
    cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
    cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-agent.hugetlb)
    cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
    cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset,clone_children)
    cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
    cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event)
    cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids,release_agent=/run/cgmanager/agents/cgm-release-agent.pids)
    

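The first part of each line is the name of the mounted device; for these virtual filesystems it is just a label (e.g. tmpfs or cgroup). Next, after "on", we get the location it is mounted at (e.g. /sys/fs/cgroup/pids). After "type", we get the type of the filesystem (e.g. cgroup). Lastly, we get a list of options (e.g. rw,nosuid,nodev,noexec,relatime,memory).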
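
The same information can be read from /proc/mounts, where each line holds the device, mount point, filesystem type and options as separate whitespace-delimited fields. The sketch below filters that file for cgroup hierarchies; cgroup_mounts is a hypothetical helper name.

```shell
#!/bin/sh
# cgroup_mounts [FILE]
# Print "<mount point> <options>" for every mounted cgroup hierarchy.
# FILE uses the /proc/mounts layout: device, mount point, type, options.
cgroup_mounts() {
    awk '$3 == "cgroup" { print $2, $4 }' "${1:-/proc/mounts}"
}

# Only query the live system when the file is available.
if [ -r /proc/mounts ]; then
    cgroup_mounts
fi
```

Matching on the third field (the filesystem type) rather than on the first avoids relying on the device label, which is arbitrary for virtual filesystems.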

Although we can manage control groups by manually manipulating the filesystem using shell commands, we should use the tools provided by existing packages.

Inside the Control Group

Control groups mounted with different subsystems use different sets of files to manage their resources. For example, the freezer control group only requires 6 files:

# cd /sys/fs/cgroup/freezer; ls -ahl
total 0  
dr-xr-xr-x  2 root root   0 Feb 15 11:29 ./  
drwxr-xr-x 15 root root 380 Feb 15 11:46 ../  
-rw-r--r--  1 root root   0 Feb 15 09:28 cgroup.clone_children
-rw-r--r--  1 root root   0 Feb 15 09:28 cgroup.procs
-r--r--r--  1 root root   0 Feb 15 09:28 cgroup.sane_behavior
-rw-r--r--  1 root root   0 Feb 15 09:28 notify_on_release
-rw-r--r--  1 root root   0 Feb 15 09:28 release_agent
-rw-r--r--  1 root root   0 Feb 15 09:08 tasks

The cpuset control group, on the other hand, has many more files:

# cd /sys/fs/cgroup/cpuset; ls -ahl
total 0  
dr-xr-xr-x  2 root root   0 Feb 15 11:29 ./  
drwxr-xr-x 15 root root 380 Feb 15 11:46 ../  
-rw-r--r--  1 root root   0 Feb 15 09:08 cgroup.clone_children
-rw-r--r--  1 root root   0 Feb 15 09:28 cgroup.procs
-r--r--r--  1 root root   0 Feb 15 09:28 cgroup.sane_behavior
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.cpu_exclusive
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.cpus
-r--r--r--  1 root root   0 Feb 15 09:28 cpuset.effective_cpus
-r--r--r--  1 root root   0 Feb 15 09:28 cpuset.effective_mems
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.mem_exclusive
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.mem_hardwall
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.memory_migrate
-r--r--r--  1 root root   0 Feb 15 09:28 cpuset.memory_pressure
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.memory_pressure_enabled
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.memory_spread_page
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.memory_spread_slab
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.mems
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.sched_load_balance
-rw-r--r--  1 root root   0 Feb 15 09:28 cpuset.sched_relax_domain_level
-rw-r--r--  1 root root   0 Feb 15 09:28 notify_on_release
-rw-r--r--  1 root root   0 Feb 15 09:28 release_agent
-rw-r--r--  1 root root   0 Feb 15 09:08 tasks

In Ubuntu, there are 6 files which are used for every control group:

  • tasks - a list of tasks' PIDs attached to this control group
  • release_agent - command to run when all tasks in the group, and all of its child groups, have been removed, subject to the notify_on_release flag
  • notify_on_release - Whether to run commands specified in release_agent when all tasks and child groups have been removed
  • cgroup.sane_behavior - was used to implement new behaviours whilst maintaining backwards compatibility[1]
  • cgroup.procs - list of thread group IDs in the control group
  • cgroup.clone_children - a flag to indicate that children should copy their parent's configuration during initialization
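
Because these are ordinary readable files, a group can be inspected with standard tools. The following is a small sketch that summarises a group from its tasks and notify_on_release files; group_summary is a hypothetical helper name, and the output format is invented for this example.

```shell
#!/bin/sh
# group_summary GROUP_DIR
# Report how many tasks a control group holds and whether its release
# agent would be notified once the group becomes empty.
group_summary() (
    group_dir=$1
    # The tasks file lists one PID per line, so counting lines counts tasks.
    # The arithmetic expansion strips any padding that wc may add.
    count=$(( $(wc -l < "$group_dir/tasks") ))
    notify=$(cat "$group_dir/notify_on_release")
    echo "tasks=$count notify_on_release=$notify"
)

# Example: group_summary /sys/fs/cgroup/freezer
```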

The files inside a control group depend on the subsystems attached to its hierarchy. For example, if we mount a new hierarchy brew_test with the cpuset subsystem, there will be cpuset-related files in the /sys/fs/cgroup/brew_test/ directory.

# mkdir /sys/fs/cgroup/brew_test/
# mount -t cgroup -o cpuset brew_test /sys/fs/cgroup/brew_test/
# ls -ahl /sys/fs/cgroup/brew_test/
total 0  
dr-xr-xr-x  2 root root   0 Feb 15 11:29 ./  
drwxr-xr-x 15 root root 380 Feb 15 11:46 ../  
-rw-r--r--  1 root root   0 Feb 15 09:08 cgroup.clone_children
-rw-r--r--  1 root root   0 Feb 15 12:06 cgroup.procs
-r--r--r--  1 root root   0 Feb 15 12:06 cgroup.sane_behavior
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.cpu_exclusive
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.cpus
-r--r--r--  1 root root   0 Feb 15 12:06 cpuset.effective_cpus
-r--r--r--  1 root root   0 Feb 15 12:06 cpuset.effective_mems
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.mem_exclusive
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.mem_hardwall
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.memory_migrate
-r--r--r--  1 root root   0 Feb 15 12:06 cpuset.memory_pressure
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.memory_pressure_enabled
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.memory_spread_page
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.memory_spread_slab
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.mems
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.sched_load_balance
-rw-r--r--  1 root root   0 Feb 15 12:06 cpuset.sched_relax_domain_level
-rw-r--r--  1 root root   0 Feb 15 12:06 notify_on_release
-rw-r--r--  1 root root   0 Feb 15 12:06 release_agent
-rw-r--r--  1 root root   0 Feb 15 09:08 tasks
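
Since subsystem-specific files carry the subsystem's name as a prefix (cpuset.cpus, cpuset.mems and so on), the file names alone reveal which subsystems are attached to a group. The sketch below relies on that naming pattern; subsystem_prefixes is a hypothetical helper name.

```shell
#!/bin/sh
# subsystem_prefixes GROUP_DIR
# Print the distinct subsystem prefixes of the files in a control group,
# ignoring the common "cgroup." files and files without a prefix.
subsystem_prefixes() (
    cd "$1" || exit 1
    for f in *.*; do
        [ -e "$f" ] || continue      # directory had no dotted files
        prefix=${f%%.*}              # text before the first dot
        [ "$prefix" = cgroup ] || echo "$prefix"
    done | sort -u
)

# Example: subsystem_prefixes /sys/fs/cgroup/brew_test
```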

Creating a New Control Group

We can create new control groups by creating a new directory inside the directory of the parent control group.

# lscgroup | grep memory
memory:/  
# mkdir /sys/fs/cgroup/memory/brew_mem_test
# lscgroup | grep memory
memory:/  
memory:/brew_mem_test  
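
The create step, and its counterpart for removal, can be wrapped in two small helpers. This is only a sketch; create_group and remove_group are hypothetical names, not commands shipped by any package.

```shell
#!/bin/sh
# create_group PARENT_DIR NAME
# A control group is created by making a directory under its parent.
create_group() {
    mkdir "$1/$2"
}

# remove_group PARENT_DIR NAME
# rmdir is used rather than rm -r: the kernel owns the files inside the
# directory, and it only lets go of a group with no tasks and no children.
remove_group() {
    rmdir "$1/$2"
}

# Example (needs root):
#   create_group /sys/fs/cgroup/memory brew_mem_test
#   remove_group /sys/fs/cgroup/memory brew_mem_test
```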

Adding a process to a control group

Since the tasks file holds the PIDs of all tasks that belong to a particular control group, we can add or move a process to a control group by simply writing its PID to the tasks file.

# echo 5465 > /sys/fs/cgroup/memory/brew_mem_test/tasks

For example, we can create a dummy process that gets placed into the brew_mem_test control group.

# sleep 100 & echo $! | tee -a /sys/fs/cgroup/memory/brew_mem_test/tasks 
[1] 12086
12086  
# grep 12086 /sys/fs/cgroup/memory/brew_mem_test/tasks
12086  
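
The write-then-check pattern above can be captured in two helpers. This is a minimal sketch; add_task and in_group are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# add_task GROUP_DIR PID
# Move a task into a control group by writing its PID to the tasks file.
# One PID is written per call, since the kernel expects one PID per write.
add_task() {
    echo "$2" >> "$1/tasks"
}

# in_group GROUP_DIR PID
# Succeed if PID is listed in the group's tasks file. The -x flag demands
# a whole-line match, so PID 120 does not match a line reading 12086.
in_group() {
    grep -qx "$2" "$1/tasks"
}

# Example (needs root):
#   sleep 100 &
#   add_task /sys/fs/cgroup/memory/brew_mem_test "$!"
#   in_group /sys/fs/cgroup/memory/brew_mem_test "$!" && echo "moved"
```

The whole-line match matters because a plain grep for a PID also matches any longer PID that contains it.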

Daniel Li

Full-stack Web Developer in Hong Kong. Founder of Brew.

Hong Kong http://danyll.com