ubuntu:kernel:control_groups
ubuntu:kernel:control_groups [2019/11/30 12:30] – created peter | ubuntu:kernel:control_groups [2019/12/15 21:46] (current) – removed peter | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Ubuntu - Kernel - Control Groups ====== | ||
- | |||
- | Control Groups provide a mechanism for aggregating/ | ||
- | |||
- | Definitions: | ||
- | |||
- | * A **cgroup** associates a set of tasks with a set of parameters for one or more subsystems. | ||
- | |||
- | * A **subsystem** is a module that makes use of the task grouping facilities provided by cgroups to treat groups of tasks in particular ways. A subsystem is typically a " | ||
- | |||
- | * A **hierarchy** is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of the cgroups in the hierarchy, and a set of subsystems; each subsystem has system-specific state attached to each cgroup in the hierarchy. | ||
- | |||
- | At any one time there may be multiple active hierarchies of task cgroups. | ||
- | |||
- | User level code may create and destroy cgroups by name in an instance of the cgroup virtual file system, specify and query to which cgroup a task is assigned, and list the task pids assigned to a cgroup. Those creations and assignments only affect the hierarchy associated with that instance of the cgroup file system. | ||
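The operations described above map onto ordinary file operations. The following is a minimal sketch, not a definitive tool: the `CGROOT` variable and the `cg_*` function names are hypothetical, and `/dev/cgroup` is only the mount point used elsewhere on this page.

```shell
# Sketch: manage cgroups with plain file operations.
# CGROOT is a hypothetical variable naming the hierarchy's mount point.
CGROOT=${CGROOT:-/dev/cgroup}

# Create a cgroup by name (a directory below the mount point).
cg_create() { mkdir -p "$CGROOT/$1"; }

# Destroy a cgroup by name; rmdir refuses if the cgroup is still in use.
cg_destroy() { rmdir "$CGROOT/$1"; }

# Assign a task to a cgroup by writing its PID to the cgroup's tasks file.
cg_assign() { echo "$2" > "$CGROOT/$1/tasks"; }

# List the PIDs assigned to a cgroup.
cg_tasks() { cat "$CGROOT/$1/tasks"; }
```

For example, `cg_create my_cgroup` followed by `cg_assign my_cgroup $$` would move the current shell into the new cgroup, and `cg_tasks my_cgroup` would list it. Only the hierarchy mounted at `$CGROOT` is affected.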
- | |||
- | On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/ | ||
- | |||
- | |||
- | ===== Why are cgroups needed? ===== | ||
- | |||
- | There are multiple efforts to provide process aggregations in the Linux kernel, mainly for resource tracking purposes. Such efforts include [[Kernel: | ||
- | |||
- | The kernel cgroup patch provides the minimum essential kernel mechanisms required to efficiently implement such groups. | ||
- | |||
- | Multiple hierarchy support is provided to allow for situations where the division of tasks into cgroups is distinctly different for different subsystems - having parallel hierarchies allows each hierarchy to be a natural division of tasks, without having to handle complex combinations of tasks that would be present if several unrelated subsystems needed to be forced into the same tree of cgroups. | ||
- | |||
- | At one extreme, each resource controller or subsystem could be in a separate hierarchy; at the other extreme, all subsystems would be attached to the same hierarchy. | ||
- | |||
- | As an example of a scenario (originally proposed by vatsa@in.ibm.com) that can benefit from multiple hierarchies, | ||
- | |||
- | < | ||
- | CPU : Top cpuset | ||
- | / | ||
- | | ||
- | | | | ||
- | | ||
- | |||
- | In addition (system tasks) are attached to topcpuset (so | ||
- | that they can run anywhere) with a limit of 20% | ||
- | |||
- | | ||
- | |||
- | Disk : Prof (50%), students (30%), system (20%) | ||
- | |||
- | | ||
- | / \ | ||
- | Prof (15%) students (5%) | ||
- | </ | ||
- | |||
- | Browsers like Firefox/ | ||
- | |||
- | At the same time Firefox/ | ||
- | |||
- | With the ability to classify tasks differently for different resources (by putting those resource subsystems in different hierarchies), the admin can easily set up a script that receives exec notifications and, depending on who is launching the browser, runs: | ||
- | |||
- | < | ||
- | echo browser_pid > / | ||
- | </ | ||
- | |||
- | With only a single hierarchy, the admin would potentially have to create a separate cgroup for every browser launched and associate it with the appropriate network and other resource classes. | ||
- | |||
- | Also lets say that the administrator would like to give enhanced network access temporarily to a student' | ||
- | |||
- | <code bash> | ||
- | echo pid > / | ||
- | |||
- | (after some time) | ||
- | |||
- | echo pid > / | ||
- | </ | ||
- | |||
- | Without this ability, they would have to split the cgroup into multiple separate ones and then associate the new cgroups with the new resource classes. | ||
- | |||
- | ===== How are cgroups implemented? ===== | ||
- | |||
- | Control Groups extend the kernel as follows: | ||
- | |||
- | * Each task in the system has a reference-counted pointer to a **css_set**. | ||
- | |||
- | * A **css_set** contains a set of reference-counted pointers to **cgroup_subsys_state** objects, one for each cgroup subsystem registered in the system. | ||
- | |||
- | * A cgroup hierarchy filesystem can be mounted | ||
- | |||
- | * You can list all the tasks (by pid) attached to any cgroup. | ||
- | |||
- | The implementation of cgroups requires a few, simple hooks into the rest of the kernel, none in performance critical paths: | ||
- | |||
- | * in init/ | ||
- | |||
- | * in fork and exit, to attach and detach a task from its css_set. | ||
- | |||
- | In addition a new file system, of type " | ||
- | |||
- | If an active hierarchy with exactly the same set of subsystems already exists, it will be reused for the new mount. | ||
- | If no existing hierarchy matches, and any of the requested subsystems are in use in an existing hierarchy, the mount will fail with **-EBUSY**. | ||
- | |||
- | It's not currently possible to bind a new subsystem to an active cgroup hierarchy, or to unbind a subsystem from an active cgroup hierarchy. | ||
- | |||
- | When a cgroup filesystem is unmounted, if there are any child cgroups created below the top-level cgroup, that hierarchy will remain active even though unmounted; if there are no child cgroups then the hierarchy will be deactivated. | ||
- | |||
- | No new system calls are added for cgroups - all support for querying and modifying cgroups is via this cgroup file system. | ||
- | |||
- | Each task under /proc has an added file named ' | ||
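As a rough illustration of reading that per-task information, the sketch below assumes the file is `/proc/<pid>/cgroup` and that each line has the form `hierarchy-ID:subsystem-list:cgroup-path`, as described in cgroups(7). The function name `cg_path_for` is hypothetical.

```shell
# Sketch: print the cgroup path of a task in the hierarchy that has a
# given subsystem attached. Reads /proc/<pid>/cgroup-style lines on stdin.
# Returns non-zero if no hierarchy lists the subsystem.
cg_path_for() {
    subsys=$1
    # Each line is "hierarchy-ID:subsystem-list:cgroup-path".
    while IFS=: read -r id subsystems path; do
        # Surround the comma-separated list with commas for exact matching.
        case ",$subsystems," in
            *",$subsys,"*) printf '%s\n' "$path"; return 0 ;;
        esac
    done
    return 1
}
```

For the current shell this could be called as `cg_path_for cpuset < /proc/self/cgroup`.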
- | |||
- | Each cgroup is represented by a directory in the cgroup file system containing the following files describing that cgroup: | ||
- | |||
- | * **tasks**: list of tasks (by pid) attached to that cgroup. | ||
- | * **cgroup.procs**: | ||
- | * **notify_on_release** flag: run the release agent on exit? | ||
- | * **release_agent**: | ||
- | |||
- | Other subsystems such as [[Kernel: | ||
- | |||
- | New cgroups are created using the mkdir system call or shell command. | ||
- | |||
- | The named hierarchical structure of nested cgroups allows partitioning a large system into nested, dynamically changeable, " | ||
- | |||
- | The attachment of each task, automatically inherited at fork by any children of that task, to a cgroup allows organizing the work load on a system into related sets of tasks. | ||
- | |||
- | When a task is moved from one cgroup to another, it gets a new **css_set** pointer - if there' | ||
- | |||
- | To allow access from a cgroup to the css_sets (and hence tasks) that comprise it, a set of **cg_cgroup_link** objects form a lattice; each cg_cgroup_link is linked into a list of **cg_cgroup_links** for a single cgroup on its **cgrp_link_list** field, and a list of cg_cgroup_links for a single css_set on its **cg_link_list**. | ||
- | |||
- | Thus the set of tasks in a cgroup can be listed by iterating over each css_set that references the cgroup, and sub-iterating over each css_set' | ||
- | |||
- | The use of a Linux virtual file system (vfs) to represent the cgroup hierarchy provides for a familiar permission and name space for cgroups, with a minimum of additional kernel code. | ||
- | |||
- | |||
- | ===== What does notify_on_release do? ===== | ||
- | |||
- | If the **notify_on_release** flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the **" | ||
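A release agent is often used to remove abandoned cgroups automatically. The sketch below assumes the kernel passes the agent the path of the abandoned cgroup relative to the hierarchy's mount point. The function name `release_agent_remove` and the idea of passing the mount point as a first argument are illustrative assumptions, not part of the cgroup interface.

```shell
# Sketch of what a release agent might do: remove the abandoned cgroup.
# $1: mount point of the hierarchy (assumed to be supplied by a wrapper script)
# $2: path of the abandoned cgroup relative to that mount point
release_agent_remove() {
    # Strip a leading slash so the two parts join cleanly, then remove
    # the directory. rmdir fails if the cgroup has been reused meanwhile.
    rmdir "$1/${2#/}"
}
```

A wrapper script installed as the release agent could call `release_agent_remove /dev/cgroup "$1"`, where `/dev/cgroup` is the mount point used on this page.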
- | |||
- | |||
- | ===== What does clone_children do? ===== | ||
- | |||
- | If the **clone_children** flag is enabled (1) in a cgroup, then all cgroups created beneath it will call the post_clone callbacks for each subsystem of the newly created cgroup. When a subsystem implements this callback, it usually copies the parent's values; this is the case for cpuset. | ||
- | |||
- | |||
- | ===== How do I use cgroups? ===== | ||
- | |||
- | To start a new job that is to be contained within a cgroup, using the " | ||
- | |||
- | - mkdir /dev/cgroup | ||
- | - mount -t cgroup -ocpuset cpuset /dev/cgroup | ||
- | - Create the new cgroup by doing mkdir' | ||
- | - Start a task that will be the " | ||
- | - Attach that task to the new cgroup by writing its pid to the /dev/cgroup tasks file for that cgroup. | ||
- | - fork, exec or clone the job tasks from this founding father task. | ||
- | |||
- | For example, the following sequence of commands will set up a cgroup named " | ||
- | |||
- | <code bash> | ||
- | mount -t cgroup cpuset -ocpuset /dev/cgroup | ||
- | cd /dev/cgroup | ||
- | mkdir Charlie | ||
- | cd Charlie | ||
- | /bin/echo 2-3 > cpuset.cpus | ||
- | /bin/echo 1 > cpuset.mems | ||
- | /bin/echo $$ > tasks | ||
- | sh | ||
- | # The subshell ' | ||
- | # The next line should display '/ | ||
- | cat / | ||
- | </ | ||
- | |||
- | |||
- | ===== Usage Examples and Syntax ===== | ||
- | |||
- | ==== Basic Usage ==== | ||
- | |||
- | Creating, modifying, and using cgroups is done through the cgroup virtual filesystem. | ||
- | |||
- | To mount a cgroup hierarchy with all available subsystems, type: | ||
- | |||
- | <code bash> | ||
- | mount -t cgroup xxx /dev/cgroup | ||
- | </ | ||
- | |||
- | The " | ||
- | |||
- | To mount a cgroup hierarchy with just the cpuset and memory subsystems, type: | ||
- | |||
- | <code bash> | ||
- | mount -t cgroup -o cpuset, | ||
- | </ | ||
- | |||
- | To change the set of subsystems bound to a mounted hierarchy, just remount with different options: | ||
- | |||
- | <code bash> | ||
- | mount -o remount, | ||
- | </ | ||
- | |||
- | Now memory is removed from the hierarchy and ns is added. | ||
- | |||
- | Note this will add ns to the hierarchy but won't remove memory or cpuset, because the new options are appended to the old ones: | ||
- | |||
- | <code bash> | ||
- | mount -o remount,ns /dev/cgroup | ||
- | </ | ||
- | |||
- | To specify a hierarchy' | ||
- | |||
- | <code bash> | ||
- | mount -t cgroup -o cpuset, | ||
- | xxx /dev/cgroup | ||
- | </ | ||
- | |||
- | Note that specifying ' | ||
- | |||
- | Note that changing the set of subsystems is currently only supported when the hierarchy consists of a single (root) cgroup. | ||
- | |||
- | Then under **/ | ||
- | |||
- | If you want to change the value of release_agent: | ||
- | |||
- | <code bash> | ||
- | echo "/ | ||
- | </ | ||
- | |||
- | It can also be changed via remount. | ||
- | |||
- | If you want to create a new cgroup under / | ||
- | |||
- | <code bash> | ||
- | cd /dev/cgroup | ||
- | mkdir my_cgroup | ||
- | </ | ||
- | |||
- | Now you want to do something with this cgroup. | ||
- | |||
- | <code bash> | ||
- | cd my_cgroup | ||
- | </ | ||
- | |||
- | In this directory you can find several files: | ||
- | |||
- | <code bash> | ||
- | ls | ||
- | |||
- | cgroup.procs notify_on_release tasks | ||
- | (plus whatever files are added by the attached subsystems) | ||
- | </ | ||
- | |||
- | Now attach your shell to this cgroup: | ||
- | |||
- | <code bash> | ||
- | /bin/echo $$ > tasks | ||
- | </ | ||
- | |||
- | You can also create cgroups inside your cgroup by using mkdir in this directory. | ||
- | |||
- | <code bash> | ||
- | mkdir my_sub_cs | ||
- | </ | ||
- | |||
- | To remove a cgroup, just use rmdir: | ||
- | |||
- | <code bash> | ||
- | rmdir my_sub_cs | ||
- | </ | ||
- | |||
- | This will fail if the cgroup is in use (has cgroups inside, or has processes attached, or is held alive by another subsystem-specific reference). | ||
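Two of those conditions can be checked from user space before calling rmdir. The following is a sketch under the assumption that child cgroups appear as subdirectories and attached tasks as lines in the `tasks` file. The function name `cg_can_remove` is hypothetical, and subsystem-specific references cannot be detected this way.

```shell
# Sketch: succeed only if a cgroup directory has no child cgroups and
# no attached tasks. $1 is the cgroup's directory.
cg_can_remove() {
    # Any subdirectory is a child cgroup.
    for child in "$1"/*/; do
        [ -d "$child" ] && return 1
    done
    # A readable first line in the tasks file means a task is attached.
    if [ -f "$1/tasks" ] && read -r _pid < "$1/tasks"; then
        return 1
    fi
    return 0
}
```

A typical use would be `cg_can_remove /dev/cgroup/my_cgroup && rmdir /dev/cgroup/my_cgroup`. The check is advisory only, since a task may be attached between the check and the rmdir.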
- | |||
- | |||
- | |||
- | |||
- | ==== Attaching processes ==== | ||
- | |||
- | <code bash> | ||
- | /bin/echo PID > tasks | ||
- | </ | ||
- | |||
- | <WRAP info> | ||
- | **NOTE: | ||
- | </ | ||
- | |||
- | <code bash> | ||
- | /bin/echo PID1 > tasks | ||
- | /bin/echo PID2 > tasks | ||
- | ... | ||
- | /bin/echo PIDn > tasks | ||
- | </ | ||
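The one-PID-per-write pattern above can be wrapped in a small loop. This is a minimal sketch: the function name `cg_attach_all` is hypothetical, and `>>` is used instead of `>` only so that the sketch can also be tried against a scratch file.

```shell
# Sketch: attach several PIDs to a cgroup, one PID per write.
# $1 is the path of the cgroup's tasks file; remaining arguments are PIDs.
cg_attach_all() {
    tasks_file=$1
    shift
    for pid in "$@"; do
        # A separate /bin/echo per PID, so a failed write identifies the
        # PID that could not be attached. Stop at the first failure.
        /bin/echo "$pid" >> "$tasks_file" || return 1
    done
}
```

For example, `cg_attach_all /dev/cgroup/my_cgroup/tasks 1201 1202 1203`, where the PIDs are placeholders.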
- | |||
- | You can attach the current shell task by echoing 0: | ||
- | |||
- | <code bash> | ||
- | echo 0 > tasks | ||
- | </ | ||
- | |||
- | |||
- | ==== Mounting hierarchies by name ==== | ||
- | |||
- | Passing the **name=< | ||
- | |||
- | The name should match [\w.-]+ | ||
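A name can be checked against that pattern before mounting. The sketch below treats `\w` as ASCII letters, digits, and underscore, which is an assumption about the intended character set. The function name `cg_valid_name` is hypothetical.

```shell
# Sketch: succeed if a hierarchy name matches [\w.-]+, taking \w to mean
# ASCII letters, digits, and underscore.
cg_valid_name() {
    case $1 in
        # Reject the empty string and any name containing another character.
        ''|*[!A-Za-z0-9_.-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, `cg_valid_name my-hier.1` succeeds, while `cg_valid_name "bad name"` fails.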
- | |||
- | When passing a name=< | ||
- | |||
- | The name of the subsystem appears as part of the hierarchy description in **/ | ||
- | |||
- | ==== Notification API ==== | ||
- | |||
- | There is a mechanism for receiving notifications about changes in the status of a cgroup. | ||
- | |||
- | To register a new notification handler, you need to: | ||
- | |||
- | * Create a file descriptor for event notification using eventfd(2); | ||
- | * Open a control file to be monitored (e.g. memory.usage_in_bytes); | ||
- | * Write "< | ||
- | * Interpretation of args is defined by control file implementation; | ||
- | |||
- | The eventfd will be signaled by the control file implementation or when the cgroup is removed. | ||
- | |||
- | To unregister a notification handler, just close the eventfd. | ||
- | |||
- | <WRAP info> | ||
- | **NOTE**: | ||
- | </ | ||
- | |||
- | ===== Kernel API ===== | ||
- | |||
- | ==== Overview ==== | ||
- | |||
- | Each kernel subsystem that wants to hook into the generic cgroup system needs to create a **cgroup_subsys** object. | ||
- | |||
- | Other fields in the cgroup_subsys object include: | ||
- | |||
- | * **subsys_id**: | ||
- | * **name**: should be initialized to a unique subsystem name. Should be no longer than **MAX_CGROUP_TYPE_NAMELEN**. | ||
- | * **early_init**: | ||
- | |||
- | Each cgroup object created by the system has an array of pointers, indexed by subsystem id; this pointer is entirely managed by the subsystem; the generic cgroup code will never touch this pointer. | ||
- | |||
- | |||
- | ==== Synchronization ==== | ||
- | |||
- | There is a global mutex, **cgroup_mutex**, | ||
- | |||
- | See kernel/ | ||
- | |||
- | Subsystems can take/ | ||
- | |||
- | Accessing a task's cgroup pointer may be done in the following ways: | ||
- | |||
- | * while holding cgroup_mutex | ||
- | * while holding the task's alloc_lock (via task_lock()) | ||
- | * inside an rcu_read_lock() section via rcu_dereference() | ||
- | |||
- | |||
- | ==== Subsystem API ==== | ||
- | |||
- | Each subsystem should: | ||
- | |||
- | * add an entry in linux/ | ||
- | * define a cgroup_subsys object called < | ||
- | |||
- | If a subsystem can be compiled as a module, it should also have in its module initcall a call to **cgroup_load_subsys()**, | ||
- | |||
- | Each subsystem may export the following methods. | ||
- | |||
- | <code c> | ||
- | struct cgroup_subsys_state *create(struct cgroup_subsys *ss, | ||
- | | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | Called to create a subsystem state object for a cgroup. | ||
- | |||
- | <WRAP info> | ||
- | **NOTE: | ||
- | </ | ||
- | |||
- | <code c> | ||
- | void destroy(struct cgroup_subsys *ss, struct cgroup *cgrp) | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | The cgroup system is about to destroy the passed cgroup; the subsystem should do any necessary cleanup and free its subsystem state object. | ||
- | |||
- | <WRAP info> | ||
- | **NOTE: | ||
- | </ | ||
- | |||
- | <code c> | ||
- | int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp); | ||
- | </ | ||
- | |||
- | Called before checking the reference count on each subsystem. | ||
- | |||
- | <code c> | ||
- | int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | ||
- | | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | Called prior to moving a task into a cgroup; if the subsystem returns an error, this will abort the attach operation. | ||
- | |||
- | <code c> | ||
- | void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | ||
- | | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | Called when a task attach operation has failed after **can_attach()** has succeeded. | ||
- | |||
- | <code c> | ||
- | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | ||
- | struct cgroup *old_cgrp, struct task_struct *task, | ||
- | bool threadgroup) | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | Called after the task has been attached to the cgroup, to allow any post-attachment activity that requires memory allocations or blocking. | ||
- | |||
- | <code c> | ||
- | void fork(struct cgroup_subsys *ss, struct task_struct *task) | ||
- | </ | ||
- | |||
- | Called when a task is forked into a cgroup. | ||
- | |||
- | <code c> | ||
- | void exit(struct cgroup_subsys *ss, struct task_struct *task) | ||
- | </ | ||
- | |||
- | Called during task exit. | ||
- | |||
- | <code c> | ||
- | int populate(struct cgroup_subsys *ss, struct cgroup *cgrp) | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | Called after creation of a cgroup to allow a subsystem to populate the cgroup directory with file entries. | ||
- | |||
- | <code c> | ||
- | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) | ||
- | (cgroup_mutex held by caller) | ||
- | </ | ||
- | |||
- | Called at the end of **cgroup_clone()** to do any parameter initialization which might be required before a task could attach. | ||
- | |||
- | <code c> | ||
- | void bind(struct cgroup_subsys *ss, struct cgroup *root) | ||
- | (cgroup_mutex and ss-> | ||
- | </ | ||
- | |||
- | Called when a cgroup subsystem is rebound to a different hierarchy and root cgroup. | ||
- | |||
- | |||
- | ===== Questions ===== | ||
- | |||
- | Q: What's up with this '/ | ||
- | |||
- | A: bash's builtin ' | ||
- | |||
- | Q: When I attach processes, only the first PID on the line actually gets attached! | ||
- | |||
- | A: We can only return one error code per call to write(). So you should also put only ONE pid. | ||
- | |||
- | |||
- | ===== References ===== | ||
- | |||
- | https:// | ||
ubuntu/kernel/control_groups.1575117052.txt.gz · Last modified: 2020/07/15 09:30 (external edit)