Analysis of Linux Memory Management Mechanism

This article offers a simple analysis of the Linux memory management mechanism, aiming to help you quickly understand the main Linux memory management concepts and make effective use of some of its management tools.

Linux 2.6 introduced support for the NUMA (Non-Uniform Memory Access) memory management model. In a multi-CPU system, memory is divided into Nodes, one per CPU: accessing the local Node is much faster than accessing the Nodes of other CPUs.

You can view the NUMA hardware information with numactl -H: it shows the size of each node, the CPU cores belonging to each, and the distances between CPUs and nodes. In the example below, the distance from a CPU to the remote node is more than twice the distance to its local node.
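
A minimal illustration (the node sizes, CPU lists and distances below are from a hypothetical two-node machine):

    $ numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7
    node 0 size: 32143 MB
    node 0 free: 26681 MB
    node 1 cpus: 8 9 10 11 12 13 14 15
    node 1 size: 32254 MB
    node 1 free: 27918 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10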

You can view the NUMA statistics with numastat, including the number of memory allocation hits and misses and the number of local and remote allocations.
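
An illustrative run (the counter values are hypothetical; the fields are those reported by numastat):

    $ numastat
                               node0           node1
    numa_hit              8559622130      6276557516
    numa_miss                      0               0
    numa_foreign                   0               0
    interleave_hit             41952           41925
    local_node            8559574278      6276441719
    other_node                 47852          115797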

Zone

Each Node is further divided into one or more Zones, for two reasons: 1. DMA devices can only access a limited range of memory (ISA devices can only access the first 16MB); 2. the x86-32bit address space is limited (32 bits can address at most 4GB), so the HIGHMEM mechanism is needed to use more memory.

ZONE_DMA

The lowest memory area in the address space, used for DMA (Direct Memory Access) access by legacy ISA (Industry Standard Architecture) devices. On the x86 architecture, this Zone is limited to the first 16MB.

ZONE_DMA32

This Zone is used for DMA devices that support a 32-bit address bus, and it only exists on 64-bit systems.

ZONE_NORMAL

This Zone's memory is directly mapped by the kernel to linear addresses and can be used directly. On the x86-32 architecture, the Zone covers the 16MB~896MB address range. On the x86-64 architecture, all memory beyond DMA and DMA32 is managed in the NORMAL Zone.

ZONE_HIGHMEM

This Zone only exists on 32-bit systems and maps the memory above 896MB by creating temporary page tables: the mapping between address space and memory is established when access is needed, and released once the access ends, so the address space can be reused to map other HIGHMEM pages.

Zone-related information can be viewed through /proc/zoneinfo. In the x86-64 example below there are two Nodes: Node0 has three Zones (DMA, DMA32 and Normal), while Node1 only has a Normal Zone.
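
A quick way to list the Zones per Node (output from a hypothetical two-node x86-64 machine):

    $ grep zone /proc/zoneinfo
    Node 0, zone      DMA
    Node 0, zone    DMA32
    Node 0, zone   Normal
    Node 1, zone   Normal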

Page

The Page is the basic unit of low-level Linux memory management, and its default size is 4KB. A Page maps to a contiguous piece of physical memory, and memory allocation and release must be done in Page units. The mapping of process virtual addresses to physical addresses is also performed through the page table: each page table entry records the physical address corresponding to the virtual address of one Page.
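
You can verify the Page size from userspace (4096 bytes on typical x86 systems):

    $ getconf PAGE_SIZE
    4096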

TLB

Every memory access needs to find the Page corresponding to the address, and this mapping is recorded in the page table. Since every memory access must first consult the page table, the page table is the most frequently accessed data structure.

To speed up page table lookups, the TLB (Translation Lookaside Buffer) mechanism was introduced: a cache inside the CPU that holds recently used page table entries. An important item in CPU performance statistics is therefore the TLB miss count. In a large-memory system the page table becomes huge: for example, 256GB of memory requires 256GB/4KB = 67,108,864 page table entries.

If each entry occupies 16 bytes, the page table needs 1GB; clearly the CPU cannot cache it all. When memory accesses are spread across a wide range (poor locality), TLB misses rise easily and increase access latency.
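
As a sketch, TLB misses can be observed with perf; the generic dTLB/iTLB events below are aliases that may not exist on every CPU model, and ./your_app is a placeholder for the program under test:

    $ perf stat -e dTLB-loads,dTLB-load-misses,iTLB-load-misses ./your_app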

Hugepages

To reduce the probability of TLB misses, Linux introduced the Hugepages mechanism, which allows the Page size to be set to 2MB or 1GB. With 2MB Hugepages, the same 256GB of memory needs only 256GB/2MB = 131,072 page table entries, i.e. just 2MB, so the Hugepages page table can fit in the CPU cache.

With sysctl -w vm.nr_hugepages=1024 you can set the number of hugepages to 1024, for a total of 2GB. Note that setting hugepages makes the system reserve 2MB memory blocks and hold on to them (they cannot be used for normal memory requests). If the system has been running for a while and memory is heavily fragmented, allocating hugepages may fail.

The settings and mount method for hugepages are shown below. After mounting, an application must mmap a file under the mount path in order to use these hugepages.
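
A minimal sketch, assuming the 2MB default hugepage size and an arbitrary example mount point /mnt/huge:

    # reserve 1024 x 2MB hugepages (2GB total)
    sysctl -w vm.nr_hugepages=1024
    # mount the hugetlbfs pseudo-filesystem
    mkdir -p /mnt/huge
    mount -t hugetlbfs nodev /mnt/huge
    # verify the reservation
    grep Huge /proc/meminfo

An application then creates a file under /mnt/huge and maps it with mmap() to obtain hugepage-backed memory.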

Buddy System

The Linux Buddy System is designed to solve the memory fragmentation caused by allocating memory in Page units: the system may run out of consecutive Pages, so requests that need contiguous Pages cannot be satisfied.

The principle is very simple: different numbers of contiguous Pages are combined into Blocks for allocation, and the Blocks are divided into 11 Block lists by powers of two, corresponding to 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 consecutive Pages. When the Buddy System is called for a memory allocation, it finds the most suitable Block for the requested size.

The following shows the basic Buddy System information for each Zone; the last 11 columns are the numbers of available Blocks in the 11 Block lists.
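
For example (the Block counts below are illustrative):

    $ cat /proc/buddyinfo
    Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
    Node 0, zone    DMA32   3198   4028   3021   2226   1346    628    240     82     21      5      1
    Node 0, zone   Normal  84258  47574  21574   9645   3349    831    161     11      2      0      0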

Slab

The Buddy System hands out memory in large units, but most requests are for very little memory, such as the common data structures of a few hundred bytes; allocating a whole Page for each of them would be very wasteful. To satisfy small and irregular memory allocation requirements, Linux designed the Slab allocator.

The principle is simply to create a memcache for each specific data structure, request Pages from the Buddy System, and divide each Page into multiple Objects according to the size of the data structure. When the user requests that data structure, the memcache hands out one Object.

The following shows how to view slab information in Linux:
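
Slab statistics live in /proc/slabinfo (root access is usually required; the dentry row below is illustrative):

    $ sudo cat /proc/slabinfo | head -2
    slabinfo - version: 2.1
    # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables ...
    $ sudo grep '^dentry' /proc/slabinfo
    dentry            255024 255024    192   21    1 : tunables    0    0    0 : slabdata  12144  12144      0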

Usually we use the slabtop command to view the sorted slab information:
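
For instance, sorting by total cache size with the -s option:

    $ slabtop -s c
    # or print once, non-interactively:
    $ slabtop -s c -o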

kmalloc

Like glibc's malloc(), the kernel provides kmalloc() for allocating memory of arbitrary size. Similarly, if applications were allowed to request arbitrary sizes from within a Page at random, it would cause memory fragmentation inside the Page.

To solve this internal fragmentation problem, Linux uses the Slab mechanism to implement kmalloc allocations. The principle is similar to that of the Buddy System: a set of power-of-two-sized Slab pools is created, and kmalloc allocates from the best-fitting Slab.

The following are the Slabs for kmalloc allocation:
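
A hedged example; the cache names follow /proc/slabinfo, with the per-cache counters omitted:

    $ sudo grep '^kmalloc' /proc/slabinfo
    kmalloc-8192   ...
    kmalloc-4096   ...
    kmalloc-2048   ...
    kmalloc-1024   ...
    kmalloc-512    ...
    kmalloc-256    ...
    kmalloc-128    ...
    kmalloc-64     ...
    kmalloc-32     ...
    kmalloc-16     ...
    kmalloc-8      ...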

Kernel parameters

Linux provides several memory-management-related kernel parameters, which can be viewed in the /proc/sys/vm directory or via sysctl -a | grep vm:

vm.drop_caches

vm.drop_caches is the most commonly used parameter, because Linux's page cache mechanism uses a large amount of memory for file system caching, including both data and metadata (dentry, inode) caches. When memory runs short, this parameter lets us quickly release the file system cache; the values below are those documented for /proc/sys/vm/drop_caches (run sync first so that dirty pages are written back):

To free pagecache:
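
    echo 1 > /proc/sys/vm/drop_caches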

To free reclaimable slab objects (includes dentries and inodes):
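
    echo 2 > /proc/sys/vm/drop_caches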

To free slab objects and pagecache:
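
    echo 3 > /proc/sys/vm/drop_caches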

vm.min_free_kbytes

vm.min_free_kbytes determines the free-memory threshold below which the memory reclaim mechanism kicks in (reclaiming, among other things, the file system cache mentioned above and the reclaimable Slabs discussed below). Its default is rather small; on systems with plenty of memory, setting it to a larger value (such as 1GB) makes reclaim trigger automatically before memory gets critically low.

However, it must not be set too large, or applications will frequently be killed by the OOM killer.
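
For example, to reserve 1GB (the value is in KB; 1GB is an illustrative figure, tune it to your system):

    sysctl -w vm.min_free_kbytes=1048576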

vm.min_slab_ratio

vm.min_slab_ratio determines the percentage of a Zone's pages occupied by reclaimable Slab at which Slab reclaim is triggered. The default is 5%. However, in the author's experiments, Slab reclaim is not triggered while memory is plentiful, but only once the memory watermark reaches min_free_kbytes. The minimum value it can be set to is 1%:
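
    sysctl -w vm.min_slab_ratio=1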

Conclusion

This article briefly described the Linux memory management mechanism and several commonly used memory-related kernel parameters. We hope the concepts are now clear. If you have any questions, drop us a comment in the box below: we will get back to you as soon as possible.

Happy learning!
