This article gives a brief analysis of the Linux memory management mechanism, aiming to help you quickly understand the main Linux memory management concepts and make effective use of some of its management methods.
Linux has supported the NUMA (Non-Uniform Memory Access) memory management mode since version 2.6. In a multi-CPU system, memory is divided into Nodes by CPU: each CPU has its own Node, and accessing its local Node is much faster than accessing a Node attached to another CPU.
NUMA hardware information can be viewed with numactl -H: it shows the size of the two nodes, the CPU cores belonging to each, and the distances for accessing each node. As shown below, the distance from a CPU to the remote node is more than twice that to its local node.
[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15870 MB
node 0 free: 13780 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15542 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
NUMA statistics can be viewed with numastat, including the number of memory allocation hits and misses, and the number of local and remote allocations.
[root@localhost ~]# numastat
                           node0           node1
numa_hit              2351854045      3021228076
numa_miss               22736854         2976885
numa_foreign             2976885        22736854
interleave_hit             14144           14100
local_node            2351844760      3021220020
other_node              22746139         2984941
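To benefit from the faster local access, an application can allocate memory on a specific node with libnuma. Below is a minimal sketch, not from the original article; it assumes libnuma is installed (link with -lnuma) and uses node 0 purely as an example.

/* Minimal libnuma sketch (link with -lnuma): allocate 4MB on node 0 so the
 * memory stays local to that node's CPUs. Node 0 is only an example. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t size = 4UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(size, 0);   /* allocate on node 0 */
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... use buf from threads running on node 0's CPUs ... */

    numa_free(buf, size);
    return 0;
}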
Zone
Each Node is further divided into one or more Zones. There are two reasons for Zones: 1. DMA devices can only access a limited range of memory (ISA devices can only access the first 16MB); 2. the address space of x86-32bit systems is limited (32 bits can address at most 4GB), so the HIGHMEM mechanism is needed in order to use more memory.
ZONE_DMA
The lowest memory area in the address range, used for DMA access by legacy ISA (Industry Standard Architecture) devices. On the x86 architecture, this Zone is limited to 16MB.
ZONE_DMA32
This Zone is used for DMA devices that support a 32-bit address bus, and it exists only on 64-bit systems.
ZONE_NORMAL
The memory in this Zone is directly mapped into the kernel's linear address space and can be used directly. On the x86-32 architecture, this zone covers the range 16MB~896MB. On the x86-64 architecture, all memory other than DMA and DMA32 is managed in the NORMAL zone.
ZONE_HIGHMEM
This Zone exists only on 32-bit systems. It maps the memory above 896MB by creating temporary page tables: the mapping between an address range and the memory is established when access is needed and released when the access ends, so that the address range can be reused for mapping other HIGHMEM memory.
Zone-related information can be viewed through /proc/zoneinfo. As shown below, on this x86-64 system there are two Nodes: Node0 has three zones (DMA, DMA32, and Normal), while Node1 has only a Normal zone.
[root@localhost ~]# cat /proc/zoneinfo |grep -E "zone| free|managed"
Node 0, zone      DMA
  pages free     3700
        managed  3975
Node 0, zone    DMA32
  pages free     291250
        managed  326897
Node 0, zone   Normal
  pages free     3232166
        managed  3604347
Node 1, zone   Normal
  pages free     3980110
        managed  4128056
Page
The Page is the basic unit of low-level memory management in Linux, and its size is 4KB. A Page is mapped to a contiguous piece of physical memory, and memory allocation and release are always done in units of Pages. The mapping from process virtual addresses to physical addresses is also done through Pages: each entry in the page table records the physical address corresponding to the virtual address of one Page.
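As a quick check, the base page size can be queried from user space; a minimal sketch:

/* Minimal sketch: print the base page size used by the kernel (typically 4096 bytes). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}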
TLB
Every memory access has to find the Page corresponding to the address, and this mapping is recorded in the page table. Since every memory access must first look up the page table, the page table is the most frequently accessed data of all.
To speed up page table lookups, the TLB (Translation Lookaside Buffer) mechanism was introduced to cache page table entries close to the CPU. An important item in CPU performance statistics is therefore the L1/L2 TLB miss count. On a large-memory system, for example with 256GB of memory, there are 256GB/4KB = 67,108,864 page table entries; if each entry occupies 16 bytes, the page table needs 1GB, which obviously cannot be fully cached by the CPU. If the accessed memory is spread widely, TLB misses occur easily and increase access latency.
Hugepages
To reduce the probability of TLB misses, Linux introduced the Hugepages mechanism, which allows the Page size to be set to 2MB or 1GB. With 2MB Hugepages, the same 256GB of memory needs only 256GB/2MB = 131,072 page table entries, which take up only about 2MB, so the hugepage page table can be cached in the CPU cache.
With sysctl -w vm.nr_hugepages=1024 you can set the number of hugepages to 1024, for a total of 2GB. Note that setting hugepages reserves 2MB blocks of memory from the system and keeps them (they cannot be used for normal memory requests). If the system has been running for a while and memory fragmentation is high, the hugepage reservation may fail.
The settings and mount method for hugepages are shown below. After mounting, an application uses mmap to map a file under the mount path in order to use these hugepages (a sketch follows the commands).
sysctl -w vm.nr_hugepages=1024
mkdir -p /mnt/hugepages
mount -t hugetlbfs hugetlbfs /mnt/hugepages
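A minimal sketch of the mmap step, not from the original article: the file name /mnt/hugepages/example is made up, and the mapping length must be a multiple of the 2MB hugepage size.

/* Minimal sketch: map one 2MB hugepage by creating a file on the hugetlbfs
 * mount configured above. The path /mnt/hugepages/example is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (2UL * 1024 * 1024)   /* must match the hugepage size */

int main(void) {
    int fd = open("/mnt/hugepages/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    void *addr = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    memset(addr, 0, HUGEPAGE_SIZE);          /* touching the mapping consumes one hugepage */

    munmap(addr, HUGEPAGE_SIZE);
    close(fd);
    unlink("/mnt/hugepages/example");
    return 0;
}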
Buddy System
The Linux Buddy System is designed to solve the fragmentation problem caused by allocating memory in units of Pages: the situation where the system lacks physically contiguous Pages, so that requests which need contiguous Pages cannot be satisfied.
The principle is very simple: different numbers of contiguous Pages are combined into Blocks for allocation, and the Blocks are organized into 11 lists by powers of two, corresponding to 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous Pages. When the Buddy System is called to allocate memory, it finds the most suitable Block for the requested size.
The following shows the basic Buddy System information for each Zone. The last 11 columns are the numbers of available Blocks in the 11 Block lists.
[root@localhost ~]# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      1      0      1      1      1      0      0      1      3
Node 0, zone    DMA32    102     79    179    229    230    166    251    168    107     78    169
Node 0, zone   Normal   1328    900   1985   1920   2261   1388    798    972    539    324   2578
Node 1, zone   Normal    466   1476   2133   7715   6026   4737   2883   1532    778    490   2760
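Inside the kernel, page-level allocations go through the buddy allocator via alloc_pages(). Below is a minimal kernel-module sketch, not from the original article, that requests an order-2 Block (4 contiguous Pages) and frees it again.

/* Minimal kernel-module sketch: request 2^2 = 4 contiguous pages from the
 * buddy allocator and release them when the module is unloaded. */
#include <linux/module.h>
#include <linux/gfp.h>

static struct page *pages;

static int __init buddy_demo_init(void)
{
    pages = alloc_pages(GFP_KERNEL, 2);   /* order 2 -> 4 contiguous 4KB pages */
    if (!pages)
        return -ENOMEM;
    pr_info("buddy_demo: allocated 4 contiguous pages\n");
    return 0;
}

static void __exit buddy_demo_exit(void)
{
    __free_pages(pages, 2);
}

module_init(buddy_demo_init);
module_exit(buddy_demo_exit);
MODULE_LICENSE("GPL");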
Slab
The Buddy System hands out memory in large units, but most requests need very little memory, for example common data structures of a few hundred bytes; allocating a whole Page for each of these would be very wasteful. To meet these small and irregular memory allocation needs, Linux designed the Slab allocator.
The principle, simply put, is to create a memcache for a specific data structure, request Pages from the Buddy System, and divide each Page into multiple Objects according to the size of the data structure. When the user requests that data structure, an Object is allocated from the memcache.
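As an illustration, here is a minimal kernel-module sketch, not from the original article, that creates a dedicated slab cache for a made-up structure and allocates one Object from it.

/* Minimal kernel-module sketch: a dedicated slab cache for a made-up structure. */
#include <linux/module.h>
#include <linux/slab.h>

struct my_item {          /* hypothetical data structure */
    int id;
    char name[100];
};

static struct kmem_cache *my_cache;
static struct my_item *item;

static int __init slab_demo_init(void)
{
    my_cache = kmem_cache_create("my_item_cache", sizeof(struct my_item),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    item = kmem_cache_alloc(my_cache, GFP_KERNEL);   /* one Object from the cache */
    if (!item) {
        kmem_cache_destroy(my_cache);
        return -ENOMEM;
    }
    return 0;
}

static void __exit slab_demo_exit(void)
{
    kmem_cache_free(my_cache, item);
    kmem_cache_destroy(my_cache);
}

module_init(slab_demo_init);
module_exit(slab_demo_exit);
MODULE_LICENSE("GPL");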
The following shows how to view slab information in Linux:
[root@localhost ~]# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fat_inode_cache       90     90    720   45    8 : tunables    0    0    0 : slabdata      2      2      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
kvm_vcpu               0      0  16576    1    8 : tunables    0    0    0 : slabdata      0      0      0
kvm_mmu_page_header    0      0    168   48    2 : tunables    0    0    0 : slabdata      0      0      0
ext4_groupinfo_4k   4440   4440    136   30    1 : tunables    0    0    0 : slabdata    148    148      0
ext4_inode_cache   63816  65100   1032   31    8 : tunables    0    0    0 : slabdata   2100   2100      0
ext4_xattr          1012   1012     88   46    1 : tunables    0    0    0 : slabdata     22     22      0
ext4_free_data     16896  17600     64   64    1 : tunables    0    0    0 : slabdata    275    275      0
Usually we use the slabtop command to view the sorted slab information:
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
352014 352014 100%    0.10K   9026       39     36104K buffer_head
 93492  93435  99%    0.19K   2226       42     17808K dentry
 65100  63816  98%    1.01K   2100       31     67200K ext4_inode_cache
 48128  47638  98%    0.06K    752       64      3008K kmalloc-64
 47090  43684  92%    0.05K    554       85      2216K shared_policy_node
 44892  44892 100%    0.11K   1247       36      4988K sysfs_dir_cache
 43624  43177  98%    0.07K    779       56      3116K Acpi-ParseExt
 43146  42842  99%    0.04K    423      102      1692K ext4_extent_status
kmalloc
Like glibc's malloc(), the kernel provides kmalloc() for allocating memory of arbitrary size. Similarly, if callers were allowed to carve arbitrary sizes out of a Page at random, the Page would suffer from internal fragmentation.
To solve this internal fragmentation problem, Linux implements kmalloc on top of the Slab mechanism. The idea is similar to the Buddy System: a set of power-of-two-sized Slab pools is created, and a kmalloc request is served from the best-fitting Slab pool.
The following are the Slabs for kmalloc allocation:
[root@localhost ~]# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-8192         196    200   8192    4    8 : tunables    0    0    0 : slabdata     50     50      0
kmalloc-4096        1214   1288   4096    8    8 : tunables    0    0    0 : slabdata    161    161      0
kmalloc-2048        2861   2928   2048   16    8 : tunables    0    0    0 : slabdata    183    183      0
kmalloc-1024        7993   8320   1024   32    8 : tunables    0    0    0 : slabdata    260    260      0
kmalloc-512         6030   6144    512   32    4 : tunables    0    0    0 : slabdata    192    192      0
kmalloc-256         7813   8576    256   32    2 : tunables    0    0    0 : slabdata    268    268      0
kmalloc-192        15542  15750    192   42    2 : tunables    0    0    0 : slabdata    375    375      0
kmalloc-128        16814  16896    128   32    1 : tunables    0    0    0 : slabdata    528    528      0
kmalloc-96         17507  17934     96   42    1 : tunables    0    0    0 : slabdata    427    427      0
kmalloc-64         48590  48704     64   64    1 : tunables    0    0    0 : slabdata    761    761      0
kmalloc-32          7296   7296     32  128    1 : tunables    0    0    0 : slabdata     57     57      0
kmalloc-16         14336  14336     16  256    1 : tunables    0    0    0 : slabdata     56     56      0
kmalloc-8          21504  21504      8  512    1 : tunables    0    0    0 : slabdata     42     42      0
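A minimal kernel-side sketch of kmalloc()/kfree(), not from the original article: a 100-byte request like the one below is rounded up and served from the kmalloc-128 pool listed above.

/* Minimal kernel-code sketch: a 100-byte kmalloc() is served from kmalloc-128. */
#include <linux/slab.h>
#include <linux/errno.h>

static int kmalloc_demo(void)
{
    char *buf = kmalloc(100, GFP_KERNEL);   /* rounded up to the kmalloc-128 pool */
    if (!buf)
        return -ENOMEM;

    /* ... use the buffer ... */

    kfree(buf);
    return 0;
}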
Kernel parameters
Linux provides a number of memory-management-related kernel parameters, which can be viewed in the /proc/sys/vm directory or via sysctl -a | grep vm:
[root@localhost vm]# sysctl -a |grep vm
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 1
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 1024000
vm.min_slab_ratio = 1
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
vm.drop_caches
vm.drop_caches is the most commonly used parameter, because Linux's Page Cache mechanism causes a large amount of memory to be used for file system caching, including both data and metadata (dentry, inode) caches. When memory runs low, this parameter lets us quickly release the file system cache:
To free pagecache:

echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes):

echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache:

echo 3 > /proc/sys/vm/drop_caches
vm.min_free_kbytes
vm.min_free_kbytes determines the free-memory threshold below which memory reclaim starts (reclaiming the file system cache mentioned above as well as the reclaimable Slab discussed below). Its default value is fairly small; on systems with a lot of memory, setting it to a larger value (such as 1GB) triggers memory reclaim automatically before free memory gets too low. However, it must not be set too large, otherwise applications will frequently be killed by the OOM killer.
sysctl -w vm.min_free_kbytes=1024000
vm.min_slab_ratio
vm.min_slab_ratio determines at what percentage of a Zone the reclaimable Slab space will be reclaimed. The default is 5%. However, in the author's experiments, Slab reclaim is not triggered while memory is still sufficient; it only happens once free memory drops to the min_free_kbytes watermark mentioned above. The minimum value that can be set is 1%:
sysctl -w vm.min_slab_ratio=1
Conclusion
This article has briefly described the Linux memory management mechanism and several commonly used memory management kernel parameters. We hope the concepts are now clear. If you have any questions, please leave them in the comment box below and we will get back to you as soon as possible.
Happy learning!