====== Ubuntu - Kernel - Huge Page Table ======

Here is a brief summary of **hugetlbpage** support in the Linux kernel.

Operating systems try to make the best use of the limited number of TLB resources.

Users can use the huge page support in the Linux kernel by either using the **mmap** system call or the standard SYSV shared memory system calls (shmget, shmat).

First the Linux kernel needs to be built with the **CONFIG_HUGETLBFS** (present under "File systems") and **CONFIG_HUGETLB_PAGE** (selected automatically when CONFIG_HUGETLBFS is selected) configuration options.
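
To confirm that a running kernel was built with these options, one quick check (assuming the config file is installed as /boot/config-$(uname -r), as on stock Ubuntu kernels) is:

<code bash>
# Check the running kernel's build configuration for hugetlbfs support.
grep -E 'CONFIG_HUGETLBFS|CONFIG_HUGETLB_PAGE' /boot/config-$(uname -r)
</code>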

The **/proc/meminfo** file provides information about the total number of persistent hugetlb pages in the kernel's huge page pool. It also displays information about the number of free, reserved and surplus huge pages and the default huge page size.

The output of "cat /proc/meminfo" will include lines like:

<code bash>
.....
HugePages_Total:   uuu
HugePages_Free:    vvv
HugePages_Rsvd:    www
HugePages_Surp:    xxx
Hugepagesize:      yyy kB
.....
</code>

where:

  * **HugePages_Total** is the size of the pool of huge pages.
  * **HugePages_Free** is the number of huge pages in the pool that are not yet allocated.
  * **HugePages_Rsvd** is short for "reserved," and is the number of huge pages for which a commitment to allocate from the pool has been made, but no allocation has yet been made. Reserved huge pages guarantee that an application will be able to allocate a huge page from the pool at fault time.
  * **HugePages_Surp** is short for "surplus," and is the number of huge pages in the pool above the value in /proc/sys/vm/nr_hugepages. The maximum number of surplus huge pages is controlled by /proc/sys/vm/nr_overcommit_hugepages.
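
To view just these hugetlb counters without the rest of /proc/meminfo, a simple filter can be used, for example:

<code bash>
# Show only the hugetlb-related lines from /proc/meminfo.
grep -i huge /proc/meminfo
</code>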

**/proc/filesystems** should also show a filesystem of type "hugetlbfs" configured in the kernel.

**/proc/sys/vm/nr_hugepages** indicates the current number of "persistent" huge pages in the kernel's huge page pool. "Persistent" huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages by increasing or decreasing the value of nr_hugepages.

Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page pool, a user with appropriate privilege can use either the mmap system call or shared memory system calls to use the huge pages.

The administrator can allocate persistent huge pages on the kernel boot command line by specifying the "hugepages=N" parameter, where 'N' = the number of huge pages requested. This is the most reliable method of allocating huge pages, as memory has not yet become fragmented at that point.
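
On an Ubuntu system that boots via GRUB, one way to pass this parameter is to add it to the kernel command line in /etc/default/grub. A sketch, assuming the standard Ubuntu GRUB layout and a hypothetical value of 512 pages:

<code bash>
# In /etc/default/grub, extend the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash hugepages=512"
sudo editor /etc/default/grub

# Regenerate the GRUB configuration and reboot for the setting to take effect.
sudo update-grub
sudo reboot
</code>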

Some platforms support multiple huge page sizes. To pre-allocate huge pages of a specific size at boot, the "hugepages=N" parameter can be preceded by a size selection parameter "hugepagesz=<size>".

When multiple huge page sizes are supported, **/proc/sys/vm/nr_hugepages** indicates the current number of pre-allocated huge pages of the default size. Thus, one can use the following command to dynamically allocate/deallocate default sized persistent huge pages:

<code bash>
echo 20 > /proc/sys/vm/nr_hugepages
</code>

This command will try to adjust the number of default sized huge pages in the huge page pool to 20, allocating or freeing huge pages, as required.

On a NUMA platform, the kernel will attempt to distribute the huge page pool over the set of allowed nodes specified by the NUMA memory policy of the task that modifies nr_hugepages.

The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in the system at the time of the allocation attempt.

System administrators may want to put this command in one of the local rc init files, so that the kernel can allocate huge pages early in the boot process, when the chance of finding physically contiguous pages is still very high. To check the per-node distribution of huge pages in a NUMA system, use:

<code bash>
cat /sys/devices/system/node/node*/meminfo | fgrep Huge
</code>
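
Instead of an rc script, the pool size can also be made persistent through the sysctl mechanism. A minimal sketch, using the vm.nr_hugepages sysctl and an arbitrary drop-in file name:

<code bash>
# Make a pool of 20 default sized huge pages persistent across reboots.
echo 'vm.nr_hugepages = 20' | sudo tee /etc/sysctl.d/80-hugepages.conf

# Apply the setting immediately without rebooting.
sudo sysctl -p /etc/sysctl.d/80-hugepages.conf
</code>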

**/proc/sys/vm/nr_overcommit_hugepages** specifies how large the pool of huge pages can grow, if more huge pages than **nr_hugepages** are requested by applications. Writing any non-zero value into this file indicates that the hugetlb subsystem is allowed to try to obtain that number of "surplus" huge pages from the kernel's normal page pool when the persistent huge page pool is exhausted. As these surplus huge pages become unused, they are freed back to the kernel's normal page pool.

When increasing the huge page pool size via **nr_hugepages**, any existing surplus pages will first be promoted to persistent huge pages. Then, additional huge pages will be allocated, if necessary and if possible, to fulfill the new persistent huge page pool size.

The administrator may shrink the pool of persistent huge pages for the default huge page size by setting the **nr_hugepages** sysctl to a smaller value.

Caveat: Shrinking the persistent huge page pool via nr_hugepages such that it becomes less than the number of huge pages in use will convert the balance of the in-use huge pages to surplus huge pages.
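
As a sketch of how the overcommit limit is used (the value 20 is just an example), write the desired maximum number of surplus pages to the file described above and watch the counters:

<code bash>
# Allow the hugetlb subsystem to allocate up to 20 surplus huge pages on demand.
echo 20 > /proc/sys/vm/nr_overcommit_hugepages

# Watch surplus pages appear and disappear as applications use and release them.
grep -E 'HugePages_(Total|Free|Surp)' /proc/meminfo
</code>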

With support for multiple huge page pools at run-time available, much of the huge page userspace interface in **/proc/sys/vm** has been duplicated in sysfs. The /proc interfaces discussed above have been retained for backwards compatibility. The root huge page control directory in sysfs is:

<code bash>
/sys/kernel/mm/hugepages
</code>

For each huge page size supported by the running kernel, a subdirectory will exist, of the form:

<code bash>
hugepages-${size}kB
</code>

Inside each of these directories, the same set of files will exist:

<code bash>
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
free_hugepages
resv_hugepages
surplus_hugepages
</code>

which function as described above for the default huge page-sized case.
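
For instance, on an x86_64 system whose default huge page size is 2 MB (adjust the hugepages-2048kB directory name to match the sizes your kernel actually reports), the per-size pool can be inspected and resized like this:

<code bash>
# List the per-size control directories the running kernel exposes.
ls /sys/kernel/mm/hugepages/

# Resize the 2 MB pool to 20 pages and check how many are currently free.
echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
</code>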

----

===== Interaction of Task Memory Policy with Huge Page Allocation/Freeing =====

Whether huge pages are allocated and freed via the /proc interface or the sysfs interface using the **nr_hugepages_mempolicy** attribute, the NUMA nodes from which huge pages are allocated or freed are controlled by the NUMA memory policy of the task that modifies the nr_hugepages_mempolicy sysctl or attribute.

The recommended method to allocate or free huge pages to/from the kernel huge page pool, using the nr_hugepages example above, is:

<code bash>
numactl --interleave <node-list> echo 20 > /proc/sys/vm/nr_hugepages_mempolicy
</code>

or, more succinctly:

<code bash>
numactl -m <node-list> echo 20 > /proc/sys/vm/nr_hugepages_mempolicy
</code>

This will allocate or free abs(20 - nr_hugepages) huge pages to or from the nodes specified in <node-list>, depending on whether the number of persistent huge pages is initially less than or greater than 20, respectively. No huge pages will be allocated nor freed on any node not included in the specified <node-list>.

When adjusting the persistent hugepage count via **nr_hugepages_mempolicy**, any memory policy mode (bind, preferred, local or interleave) may be used. The resulting effect on persistent huge page allocation is as follows:

1) Regardless of mempolicy mode [see the NUMA memory policy documentation], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if "interleave" had been specified. However, if a node in the policy does not contain sufficient contiguous memory for a huge page, the allocation will not fall back to the nearest neighbor node with sufficient contiguous memory.

2) One or more nodes may be specified with the bind or interleave policy.

3) The nodes allowed mask will be derived from any non-default task mempolicy, whether this policy was set explicitly by the task itself or one of its ancestors, such as numactl.

4) Any task mempolicy specified--e.g., using numactl--will be constrained by the resource limits of any cpuset in which the task runs. Thus, a task with non-default policy running in a cpuset with a subset of the system nodes cannot allocate huge pages outside the cpuset without first moving to a cpuset that contains all of the desired nodes.

5) Boot-time huge page allocation attempts to distribute the requested number of huge pages over all on-line nodes with memory.

----

===== Per Node Hugepages Attributes =====

A subset of the contents of the root huge page control directory in sysfs, described above, will be replicated under the system device of each NUMA node with memory, in:

<code bash>
/sys/devices/system/node/node[0-9]*/hugepages/
</code>

Under this directory, the subdirectory for each supported huge page size contains the following attribute files:

<code bash>
nr_hugepages
free_hugepages
surplus_hugepages
</code>

The **free_** and **surplus_** attribute files are read-only. They return the number of free and surplus (overcommitted) huge pages, respectively, on the parent node.

The **nr_hugepages** attribute returns the total number of huge pages on the specified node. When this attribute is written, the number of persistent huge pages on the parent node will be adjusted to the specified value, if sufficient resources exist, regardless of the task's mempolicy or cpuset constraints.

Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted.
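
As a sketch (again assuming a 2 MB default huge page size, and node0 as an example node), the per-node pool can be adjusted and verified directly:

<code bash>
# Reserve 8 huge pages specifically on NUMA node 0.
echo 8 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# Confirm the per-node totals reported by that node's meminfo.
grep Huge /sys/devices/system/node/node0/meminfo
</code>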

----

===== Using Huge Pages =====

If user applications are going to request huge pages using the **mmap** system call, then it is required that the system administrator mount a file system of type hugetlbfs:

<code bash>
mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,min_size=<value>,nr_inodes=<value> \
  none /mnt/huge
</code>

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory /mnt/huge. Any file created on /mnt/huge uses huge pages. The uid and gid options set the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of the root of the file system to value & 01777. This value is given in octal.

By default the value 0755 is picked. The size option sets the maximum amount of memory (huge pages) allowed for that filesystem (/mnt/huge). The size is specified in bytes or as a percentage of the specified huge page pool (nr_hugepages), and is rounded down to the nearest huge page boundary.

While read system calls are supported on files that reside on hugetlb file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) can be used to change the file attributes on hugetlbfs.
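
For example (hypothetical user and group names, assuming the /mnt/huge mount point from above), granting an application account access to the mount might look like:

<code bash>
# Give a hypothetical application user/group ownership of the hugetlbfs root
# so it can create files (and therefore huge page mappings) there.
chown appuser:appgroup /mnt/huge
chmod 770 /mnt/huge
</code>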

Also, it is important to note that no such mount command is required if the applications are going to use only shmat/shmget system calls or mmap with MAP_HUGETLB.

----

===== hugepage-shm =====

hugepage-shm.c

<code c>
/*
 * hugepage-shm:
 *
 * Example of using huge page memory in a user application using Sys V shared
 * memory system calls. In this example the app is requesting 256MB of
 * memory that is backed by huge pages. The application uses the flag
 * SHM_HUGETLB in the shmget system call to inform the kernel that it is
 * requesting huge pages.
 *
 * For the ia64 architecture, the Linux kernel reserves Region number 4 for
 * huge pages. That means that if one requires a fixed address, a huge page
 * aligned address starting with 0x800000... will be required. If a fixed
 * address is not required, the kernel will select an address in the proper
 * range.
 * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
 *
 * Note: The default shared memory limit is quite low on many kernels,
 * you may need to increase it via:
 *
 * echo 268435456 > /proc/sys/kernel/shmmax
 *
 * This will increase the maximum size per shared memory segment to 256MB.
 * The other limit that you will hit eventually is shmall which is the
 * total amount of shared memory in pages. To set it to 16GB on a system
 * with a 4kB pagesize do:
 *
 * echo 4194304 > /proc/sys/kernel/shmall
 */

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define LENGTH (256UL*1024*1024)

#define dprintf(x)  printf(x)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define SHMAT_FLAGS (SHM_RND)
#else
#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)
#endif

int main(void)
{
    int shmid;
    unsigned long i;
    char *shmaddr;

    /* Create a 256MB SysV shared memory segment backed by huge pages. */
    if ((shmid = shmget(2, LENGTH,
                        SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
        perror("shmget");
        exit(1);
    }
    printf("shmid: 0x%x\n", shmid);

    /* Attach the segment into this process's address space. */
    shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
    if (shmaddr == (char *)-1) {
        perror("Shared memory attach failure");
        shmctl(shmid, IPC_RMID, NULL);
        exit(2);
    }
    printf("shmaddr: %p\n", shmaddr);

    /* Touch every byte so the kernel faults in the huge pages. */
    dprintf("Starting the writes:\n");
    for (i = 0; i < LENGTH; i++) {
        shmaddr[i] = (char)(i);
        if (!(i % (1024 * 1024)))
            dprintf(".");
    }
    dprintf("\n");

    /* Read everything back and verify the pattern. */
    dprintf("Starting the Check...");
    for (i = 0; i < LENGTH; i++)
        if (shmaddr[i] != (char)i)
            printf("\nIndex %lu mismatched\n", i);
    dprintf("Done.\n");

    /* Detach and remove the segment so the huge pages return to the pool. */
    if (shmdt((const void *)shmaddr) != 0) {
        perror("Detach failure");
        shmctl(shmid, IPC_RMID, NULL);
        exit(3);
    }

    shmctl(shmid, IPC_RMID, NULL);

    return 0;
}
</code>

----

===== hugepage-mmap =====

hugepage-mmap.c

<code c>
/*
 * hugepage-mmap:
 *
 * Example of using huge page memory in a user application using the mmap
 * system call. Before running this application, make sure that the
 * administrator has mounted the hugetlbfs filesystem (on some directory
 * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
 * example, the app is requesting memory of size 256MB that is backed by
 * huge pages.
 *
 * For the ia64 architecture, the Linux kernel reserves Region number 4 for
 * huge pages. That means that if one requires a fixed address, a huge page
 * aligned address starting with 0x800000... will be required. If a fixed
 * address is not required, the kernel will select an address in the proper
 * range.
 * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
 */

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

#define FILE_NAME "/mnt/hugepagefile"
#define LENGTH (256UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)

/* Only ia64 requires this */
#ifdef __ia64__
#define ADDR (void *)(0x8000000000000000UL)
#define FLAGS (MAP_SHARED | MAP_FIXED)
#else
#define ADDR (void *)(0x0UL)
#define FLAGS (MAP_SHARED)
#endif

static void check_bytes(char *addr)
{
    printf("First hex is %x\n", *((unsigned int *)addr));
}

/* Fill the mapping with a repeating byte pattern. */
static void write_bytes(char *addr)
{
    unsigned long i;

    for (i = 0; i < LENGTH; i++)
        *(addr + i) = (char)i;
}

/* Verify that the mapping still holds the pattern written above. */
static void read_bytes(char *addr)
{
    unsigned long i;

    check_bytes(addr);
    for (i = 0; i < LENGTH; i++)
        if (*(addr + i) != (char)i) {
            printf("Mismatch at %lu\n", i);
            break;
        }
}

int main(void)
{
    void *addr;
    int fd;

    /* Creating a file on a hugetlbfs mount makes it backed by huge pages. */
    fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
    if (fd < 0) {
        perror("Open failed");
        exit(1);
    }

    /* Map 256MB of the file; the mapping uses huge pages. */
    addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        unlink(FILE_NAME);
        exit(1);
    }

    printf("Returned address is %p\n", addr);
    check_bytes(addr);
    write_bytes(addr);
    read_bytes(addr);

    munmap(addr, LENGTH);
    close(fd);
    unlink(FILE_NAME);

    return 0;
}
</code>
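
These examples are plain C programs and can be built with the usual toolchain. A possible build-and-run sequence (a sketch, assuming the source file names above, a hugetlbfs mount as the mmap example expects, a 2 MB huge page size, and a root shell) might be:

<code bash>
# Build both examples (gcc is provided by the build-essential package on Ubuntu).
gcc -o hugepage-shm  hugepage-shm.c
gcc -o hugepage-mmap hugepage-mmap.c

# Pre-allocate enough huge pages for 256MB (128 x 2MB); adjust for other page sizes.
echo 128 > /proc/sys/vm/nr_hugepages

# Run as root, since SHM_HUGETLB and a root-owned hugetlbfs mount typically require it.
./hugepage-shm
./hugepage-mmap
</code>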