Differences

This shows you the differences between two versions of the page.

--- kernel:control_groups [2020/07/22 18:05] – old revision restored (2017/04/06 10:22) 207.244.157.10
+++ kernel:control_groups [2020/07/22 18:05] (current) – old revision restored (2020/07/20 16:09) 207.244.157.10
@@ Line 15: / Line 15: @@
 User level code may create and destroy cgroups by name in an instance of the cgroup virtual file system, specify and query to which cgroup a task is assigned, and list the task pids assigned to a cgroup. Those creations and assignments only affect the hierarchy associated with that instance of the cgroup file system.
-On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access.  For example, [[Kernel:CPU Sets|cpusets]] allows you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.
+On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access.  For example, [[Kernel:CPU Sets|cpusets]] allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.
 ===== Why are cgroups needed? =====
-There are multiple efforts to provide process aggregations in the Linux kernel, mainly for resource tracking purposes. Such efforts include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server namespaces. These all require the basic notion of a grouping/partitioning of processes, with newly forked processes ending in the same group (cgroup) as their parent process.
+There are multiple efforts to provide process aggregations in the Linux kernel, mainly for resource tracking purposes. Such efforts include [[Kernel:CPU Sets|cpusets]], CKRM/ResGroups, UserBeanCounters, and virtual server namespaces. These all require the basic notion of a grouping/partitioning of processes, with newly forked processes ending in the same group (cgroup) as their parent process.
-The kernel cgroup patch provides the minimum essential kernel mechanisms required to efficiently implement such groups.  It has minimal impact on the system fast paths, and provides hooks for specific subsystems such as cpusets to provide additional behaviour as desired.
+The kernel cgroup patch provides the minimum essential kernel mechanisms required to efficiently implement such groups.  It has minimal impact on the system fast paths, and provides hooks for specific subsystems such as [[Kernel:CPU Sets|cpusets]] to provide additional behaviour as desired.
 Multiple hierarchy support is provided to allow for situations where the division of tasks into cgroups is distinctly different for different subsystems - having parallel hierarchies allows each hierarchy to be a natural division of tasks, without having to handle complex combinations of tasks that would be present if several unrelated subsystems needed to be forced into the same tree of cgroups.
@@ Line 60: / Line 60: @@
 </code>
-With only a single hierarchy, he now would potentially have to create a separate cgroup for every browser launched and associate it with approp network and other resource class.  This may lead to proliferation of such cgroups.
+With only a single hierarchy, the admin would potentially have to create a separate cgroup for every browser launched and associate it with the appropriate network and other resource classes.  This may lead to a proliferation of such cgroups.
-Also lets say that the administrator would like to give enhanced network access temporarily to a student's browser (since it is night and the user wants to do online gaming :))  OR give one of the students simulation apps enhanced CPU power,
+Also, let's say that the administrator would like to give enhanced network access temporarily to a student's browser (since it is night and the user wants to do online gaming :)) OR give one of the student's simulation apps enhanced CPU power. With the ability to write pids directly to resource classes, it's just a matter of:
-With ability to write pids directly to resource classes, it's just a matter of :
 <code bash>
@@ Line 74: / Line 72: @@
 </code>
-Without this ability, he would have to split the cgroup into multiple separate ones and then associate the new cgroups with the new resource classes.
+Without this ability, they would have to split the cgroup into multiple separate ones and then associate the new cgroups with the new resource classes.
 ===== How are cgroups implemented? =====
@@ Line 81: / Line 78: @@
 Control Groups extends the kernel as follows:
-  * Each task in the system has a reference-counted pointer to a css_set.
+  * Each task in the system has a reference-counted pointer to a **css_set**.
-  * A css_set contains a set of reference-counted pointers to cgroup_subsys_state objects, one for each cgroup subsystem registered in the system.  There is no direct link from a task to the cgroup of which it's a member in each hierarchy, but this can be determined by following pointers through the cgroup_subsys_state objects. This is because accessing the subsystem state is something that's expected to happen frequently and in performance-critical code, whereas operations that require a task's actual cgroup assignments (in particular, moving between cgroups) are less common. A linked list runs through the cg_list field of each task_struct using the css_set, anchored at css_set->tasks.
+  * A **css_set** contains a set of reference-counted pointers to **cgroup_subsys_state** objects, one for each cgroup subsystem registered in the system.  There is no direct link from a task to the cgroup of which it's a member in each hierarchy, but this can be determined by following pointers through the cgroup_subsys_state objects. This is because accessing the subsystem state is something that's expected to happen frequently and in performance-critical code, whereas operations that require a task's actual cgroup assignments (in particular, moving between cgroups) are less common. A linked list runs through the cg_list field of each task_struct using the css_set, anchored at css_set->tasks.
   * A cgroup hierarchy filesystem can be mounted for browsing and manipulation from user space.
@@ Line 98: / Line 95: @@
 If an active hierarchy with exactly the same set of subsystems already exists, it will be reused for the new mount.  If no existing hierarchy
-matches, and any of the requested subsystems are in use in an existing hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
+matches, and any of the requested subsystems are in use in an existing hierarchy, the mount will fail with **-EBUSY**.  Otherwise, a new hierarchy is activated, associated with the requested subsystems.
-is activated, associated with the requested subsystems.
-It's not currently possible to bind a new subsystem to an active cgroup hierarchy, or to unbind a subsystem from an active cgroup hierarchy. This may be possible in future, but is fraught with nasty error-recovery issues.
+It's not currently possible to bind a new subsystem to an active cgroup hierarchy, or to unbind a subsystem from an active cgroup hierarchy.  This may be possible in future, but is fraught with nasty error-recovery issues.
 When a cgroup filesystem is unmounted, if there are any child cgroups created below the top-level cgroup, that hierarchy will remain active even though unmounted; if there are no child cgroups then the hierarchy will be deactivated.
@@ Line 114: / Line 110: @@
   * **cgroup.procs**: list of tgids in the cgroup.  This list is not guaranteed to be sorted or free of duplicate tgids, and userspace should sort/uniquify the list if this property is required.  This is a read-only file, for now.
   * **notify_on_release** flag: run the release agent on exit?
-  * **release_agent**: the path to use for release notifications (this file exists in the top cgroup only)
+  * **release_agent**: the path to use for release notifications (this file exists in the top cgroup only).
-Other subsystems such as cpusets may add additional files in each cgroup dir.
+Other subsystems such as [[Kernel:CPU Sets|cpusets]] may add additional files in each cgroup dir.
 New cgroups are created using the mkdir system call or shell command.  The properties of a cgroup, such as its flags, are modified by writing to the appropriate file in that cgroup's directory, as listed above.
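The **cgroup.procs** entry above notes that the list is not guaranteed to be sorted or free of duplicate tgids. A minimal sketch of the userspace clean-up it suggests, assuming the file holds one tgid per line (the function name and the example path are hypothetical, not part of the cgroup interface):

```shell
#!/bin/sh
# Sketch: print the tgids from a cgroup.procs-style file, sorted
# numerically with duplicates removed.
# unique_tgids is a hypothetical helper name, not a cgroup tool.
unique_tgids() {
    # $1: path to a file containing one tgid per line
    sort -n -u -- "$1"
}

# Example (path assumed; it depends on where the hierarchy is mounted):
#   unique_tgids /dev/cgroup/Charlie/cgroup.procs
```

Because the file can change between reads, the output is only a snapshot of the cgroup's membership at the time it was read.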
@@ Line 124: / Line 120: @@
 The attachment of each task, automatically inherited at fork by any children of that task, to a cgroup allows organizing the work load on a system into related sets of tasks.  A task may be re-attached to any other cgroup, if allowed by the permissions on the necessary cgroup file system directories.
-When a task is moved from one cgroup to another, it gets a new css_set pointer - if there's an already existing css_set with the desired collection of cgroups then that group is reused, else a new css_set is allocated. The appropriate existing css_set is located by looking into a hash table.
+When a task is moved from one cgroup to another, it gets a new **css_set** pointer - if there's an already existing css_set with the desired collection of cgroups then that group is reused, else a new css_set is allocated. The appropriate existing css_set is located by looking into a hash table.
-To allow access from a cgroup to the css_sets (and hence tasks) that comprise it, a set of cg_cgroup_link objects form a lattice; each cg_cgroup_link is linked into a list of cg_cgroup_links for a single cgroup on its cgrp_link_list field, and a list of cg_cgroup_links for a single css_set on its cg_link_list.
+To allow access from a cgroup to the css_sets (and hence tasks) that comprise it, a set of **cg_cgroup_link** objects form a lattice; each cg_cgroup_link is linked into a list of **cg_cgroup_links** for a single cgroup on its **cgrp_link_list** field, and a list of cg_cgroup_links for a single css_set on its **cg_link_list**.
 Thus the set of tasks in a cgroup can be listed by iterating over each css_set that references the cgroup, and sub-iterating over each css_set's task set.
@@ Line 135: / Line 131: @@
 ===== What does notify_on_release do? =====
-If the notify_on_release flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, then the kernel runs the command specified by the contents of the "release_agent" file in that hierarchy's root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup.  This enables automatic removal of abandoned cgroups.  The default value of notify_on_release in the root cgroup at system boot is disabled (0).  The default value of other cgroups at creation is the current value of their parents notify_on_release setting. The default value of a cgroup hierarchy's release_agent path is empty.
+If the **notify_on_release** flag is enabled (1) in a cgroup, then whenever the last task in the cgroup leaves (exits or attaches to some other cgroup) and the last child cgroup of that cgroup is removed, the kernel runs the command specified by the contents of the **"release_agent"** file in that hierarchy's root directory, supplying the pathname (relative to the mount point of the cgroup file system) of the abandoned cgroup.  This enables automatic removal of abandoned cgroups.  The default value of notify_on_release in the root cgroup at system boot is disabled (0).  The default value of other cgroups at creation is the current value of their parent's notify_on_release setting. The default value of a cgroup hierarchy's release_agent path is empty.
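As a rough illustration of the "automatic removal" use case, a release agent could be little more than a wrapper around rmdir. The sketch below is an assumption-laden example, not the agent any distribution ships: the helper name is hypothetical, and the mount point must be supplied by the script itself, since the kernel only passes the path relative to it.

```shell
#!/bin/sh
# Hypothetical release-agent helper: remove an abandoned cgroup directory.
# The kernel supplies the cgroup path relative to the mount point; the
# mount point (e.g. /dev/cgroup in this page's examples) is an assumption
# that the agent script has to know on its own.
remove_abandoned_cgroup() {
    mount_point=$1   # where the hierarchy is mounted
    rel_path=$2      # path handed to the release agent
    # Strip a leading slash, if any, so the two parts join cleanly.
    rmdir -- "${mount_point}/${rel_path#/}"
}

# A real agent script would end with something like:
#   remove_abandoned_cgroup /dev/cgroup "$1"
```

rmdir fails if the directory cannot be removed, which keeps the helper from deleting anything it should not.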
 ===== What does clone_children do? =====
-If the clone_children flag is enabled (1) in a cgroup, then all cgroups created beneath will call the post_clone callbacks for each subsystem of the newly created cgroup. Usually when this callback is implemented for a subsystem, it copies the values of the parent subsystem, this is the case for the cpuset.
+If the **clone_children** flag is enabled (1) in a cgroup, then all cgroups created beneath will call the post_clone callbacks for each subsystem of the newly created cgroup. Usually when this callback is implemented for a subsystem, it copies the values of the parent subsystem; this is the case for cpuset.
@@ Line 147: / Line 143: @@
 To start a new job that is to be contained within a cgroup, using the "cpuset" cgroup subsystem, the steps are something like:
-) mkdir /dev/cgroup
+  - mkdir /dev/cgroup
-) mount -t cgroup -ocpuset cpuset /dev/cgroup
+  - mount -t cgroup -ocpuset cpuset /dev/cgroup
-) Create the new cgroup by doing mkdir's and write's (or echo's) in
+  - Create the new cgroup by doing mkdir's and write's (or echo's) in the /dev/cgroup virtual file system.
-    the /dev/cgroup virtual file system.
+  - Start a task that will be the "founding father" of the new job.
-) Start a task that will be the "founding father" of the new job.
+  - Attach that task to the new cgroup by writing its pid to the /dev/cgroup tasks file for that cgroup.
-) Attach that task to the new cgroup by writing its pid to the
+  - fork, exec or clone the job tasks from this founding father task.
-    /dev/cgroup tasks file for that cgroup.
-) fork, exec or clone the job tasks from this founding father task.
 For example, the following sequence of commands will set up a cgroup named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, and then start a subshell 'sh' in that cgroup:
@@ Line 185: / Line 179: @@
 </code>
-The "xxx" is not interpreted by the cgroup code, but will appear in /proc/mounts so may be any useful identifying string that you like.
+The "xxx" is not interpreted by the cgroup code, but will appear in **/proc/mounts** so may be any useful identifying string that you like.
 To mount a cgroup hierarchy with just the cpuset and memory subsystems, type:
@@ Line 218: / Line 212: @@
 Note that changing the set of subsystems is currently only supported when the hierarchy consists of a single (root) cgroup.  Supporting the ability to arbitrarily bind/unbind subsystems from an existing cgroup hierarchy is intended to be implemented in the future.
-Then under /dev/cgroup you can find a tree that corresponds to the tree of the cgroups in the system. For instance, /dev/cgroup is the cgroup that holds the whole system.
+Then under **/dev/cgroup** you can find a tree that corresponds to the tree of the cgroups in the system. For instance, /dev/cgroup is the cgroup that holds the whole system.
 If you want to change the value of release_agent:
@@ Line 245: / Line 239: @@
 <code bash>
 ls
-<
 cgroup.procs notify_on_release tasks
 (plus whatever files added by the attached subsystems)
@@ Line 279: / Line 273: @@
 </code>
-Note that it is PID, not PIDs.  You can only attach ONE task at a time.  If you have several tasks to attach, you have to do it one after another:
+<WRAP info>
+**NOTE:**  It is PID, not PIDs.  You can only attach ONE task at a time.  If you have several tasks to attach, you have to do it one after another:
+</WRAP>
 <code bash>
@@ Line 303: / Line 299: @@
 When passing a name=<x> option for a new hierarchy, you need to specify subsystems manually; the legacy behaviour of mounting all subsystems when none are explicitly specified is not supported when you give a subsystem a name.
-The name of the subsystem appears as part of the hierarchy description in /proc/mounts and /proc/<pid>/cgroups.
+The name of the subsystem appears as part of the hierarchy description in **/proc/mounts** and **/proc/<pid>/cgroup**.
 ==== Notification API ====
@@ Line 320: / Line 316: @@
 To unregister a notification handler, just close the eventfd.
-NOTE: Support of notifications should be implemented for the control file.  See documentation for the subsystem.
+<WRAP info>
+**NOTE:**  Support of notifications should be implemented for the control file.  See documentation for the subsystem.
+</WRAP>
 ===== Kernel API =====
@@ Line 327: / Line 324: @@
 ==== Overview ====
-Each kernel subsystem that wants to hook into the generic cgroup system needs to create a cgroup_subsys object.  This contains various methods, which are callbacks from the cgroup system, along with a subsystem id which will be assigned by the cgroup system.
+Each kernel subsystem that wants to hook into the generic cgroup system needs to create a **cgroup_subsys** object.  This contains various methods, which are callbacks from the cgroup system, along with a subsystem id which will be assigned by the cgroup system.
 Other fields in the cgroup_subsys object include:
@@ Line 340: / Line 337: @@
 ==== Synchronization ====
-There is a global mutex, cgroup_mutex, used by the cgroup system.  This should be taken by anything that wants to modify a cgroup.  It may also be taken to prevent cgroups from being modified, but more specific locks may be more appropriate in that situation.
+There is a global mutex, **cgroup_mutex**, used by the cgroup system.  This should be taken by anything that wants to modify a cgroup.  It may also be taken to prevent cgroups from being modified, but more specific locks may be more appropriate in that situation.
 See kernel/cgroup.c for more details.
-Subsystems can take/release the cgroup_mutex via the functions cgroup_lock()/cgroup_unlock().
+Subsystems can take/release the cgroup_mutex via the functions **cgroup_lock()**/**cgroup_unlock()**.
 Accessing a task's cgroup pointer may be done in the following ways: