2014/03/22

Detecting Memory Leaks in C with Valgrind

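Writing C demands precise memory management; unlike higher-level languages, there is no garbage collection. Good habits reduce mistakes: give every malloc() a matching free(), and write the pair at the same level of the code. Still, slips are inevitable. I write daemons, which are long-running service programs, and when one of them has a memory leak you can watch system memory being eaten away bit by bit until a restart is unavoidable. On a production server, that means interrupting the service!

Fortunately, the powerful Valgrind can help us track down these problems.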


Case 1: Memory Leak

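The most common mistake is forgetting to free allocated memory once no pointer refers to it any longer.
Here is a simple example, with valgrind used to help detect the problem: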
#include <stdlib.h>

void func (void)
{
    char *buff = malloc(10);   /* never freed: this is the leak */
}

int main (void)
{
    func();
    return 0;
}

Compile and run (remember to compile with -g so that valgrind's report can point to the offending line numbers):
# gcc -g -o mem_test mem_test.c

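Valgrind is easy to use and requires no changes to your source code: just launch your program through valgrind, append any program arguments after it, and a report is produced when the program exits.
valgrind [valgrind options] ./my_program [program arguments]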
# valgrind --leak-check=full ./mem_test

==59095== Memcheck, a memory error detector
==59095== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==59095== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==59095== Command: ./mem_test
==59095==
==59095==
==59095== HEAP SUMMARY:
==59095==     in use at exit: 10 bytes in 1 blocks
==59095==   total heap usage: 1 allocs, 0 frees, 10 bytes allocated
==59095==
==59095== 10 bytes in 1 blocks are definitely lost in loss record 1 of 1
==59095==    at 0x4C2C857: malloc (vg_replace_malloc.c:291)
==59095==    by 0x400505: func (mem_test.c:5)
==59095==    by 0x400514: main (mem_test.c:10)
==59095==
==59095== LEAK SUMMARY:
==59095==    definitely lost: 10 bytes in 1 blocks
==59095==    indirectly lost: 0 bytes in 0 blocks
==59095==      possibly lost: 0 bytes in 0 blocks
==59095==    still reachable: 0 bytes in 0 blocks
==59095==         suppressed: 0 bytes in 0 blocks
==59095==
==59095== For counts of detected and suppressed errors, rerun with: -v
==59095== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
The report shows one 10-byte block detected as "definitely lost"; it was allocated at line 5 of mem_test.c.

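As you can see, lost memory is classified into several kinds:
  • definitely lost: a real memory leak; no pointer to the block remains.
  • indirectly lost: an indirect leak. When a structure itself is leaked, any memory that was allocated for its members leaks as well. Once the first problem is fixed, these usually go away too.
  • possibly lost: a block was allocated and stored in a pointer ptr, but ptr was later changed to point into the middle of the block. (I am not entirely clear on this one myself; I suggest reading the original Valgrind documentation.)
  • still reachable: memory that was not freed when the program exited but is still pointed to by some pointer; this usually happens with global variables.
Even memory allocated inside a library, for example by evbuffer_new(), can be detected.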



Case 2: Invalid Memory Access

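An invalid memory access does not always cause an immediate segmentation fault, so there may be no core dump to inspect; you need a tool like valgrind to detect it.
Typical cases are touching memory outside the allocated block, or using memory that has already been freed. See the examples below: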
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
    // 1. Invalid write
    char *str = malloc(4);
    strcpy(str, "Brian");
    free(str);
    
    // 2. Invalid read
    int *arr = malloc(3);
    printf("%d", arr[4]);
    free(arr);
    
    // 3. Invalid read
    printf("%d", arr[0]);

    // 4. Invalid free
    free(arr);

    return 0;
}
Compile and run:
==60019== Memcheck, a memory error detector
==60019== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==60019== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==60019== Command: ./mem_test
==60019==
==60019== Invalid write of size 2
==60019==    at 0x4005AB: main (mem_test.c:9)
==60019==  Address 0x51f4044 is 0 bytes after a block of size 4 alloc'd
==60019==    at 0x4C2C857: malloc (vg_replace_malloc.c:291)
==60019==    by 0x400595: main (mem_test.c:8)
==60019==
==60019== Invalid read of size 4
==60019==    at 0x4005D1: main (mem_test.c:14)
==60019==  Address 0x51f40a0 is 13 bytes after a block of size 3 alloc'd
==60019==    at 0x4C2C857: malloc (vg_replace_malloc.c:291)
==60019==    by 0x4005C4: main (mem_test.c:13)
==60019==
==60019== Invalid read of size 4
==60019==    at 0x4005F7: main (mem_test.c:18)
==60019==  Address 0x51f4090 is 0 bytes inside a block of size 3 free'd
==60019==    at 0x4C2B75D: free (vg_replace_malloc.c:468)
==60019==    by 0x4005F2: main (mem_test.c:15)
==60019==
==60019== Invalid free() / delete / delete[] / realloc()
==60019==    at 0x4C2B75D: free (vg_replace_malloc.c:468)
==60019==    by 0x400618: main (mem_test.c:21)
==60019==  Address 0x51f4090 is 0 bytes inside a block of size 3 free'd
==60019==    at 0x4C2B75D: free (vg_replace_malloc.c:468)
==60019==    by 0x4005F2: main (mem_test.c:15)
==60019==
00==60019==
==60019== HEAP SUMMARY:
==60019==     in use at exit: 0 bytes in 0 blocks
==60019==   total heap usage: 2 allocs, 3 frees, 7 bytes allocated
==60019==
==60019== All heap blocks were freed -- no leaks are possible
==60019==
==60019== For counts of detected and suppressed errors, rerun with: -v
==60019== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
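Error 1: Invalid write of size 2. The program tried to write to an illegal region; valgrind helpfully tells you that the address lies 0 bytes after the 4-byte block allocated at mem_test.c:8, that is, immediately past its end. This usually happens when a buffer is used without checking its size first.
Error 2: Invalid read of size 4. The program tried to read from an illegal region. Note that malloc(3) reserves 3 bytes, not 3 ints, so arr[4] lies well beyond the block.
Error 3: Invalid read of size 4. The region being read has already been freed; valgrind also points out where it was freed: mem_test.c:15.
Error 4: Invalid free. The program freed an address that is not a live allocation, or freed the same block twice (double free).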



Other notes

If you run into the error "Conditional jump or move depends on uninitialised value(s)",
a possible cause is using a string that has no terminating null character (that is, not a null-terminated string).

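One kind of error is hard to track down: for example, function A ends up using memory that was produced by function B. As far as the system is concerned nothing illegal has happened, yet the program produces unexpected values.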

Accessing a local variable out of bounds can also corrupt the stack, which then produces an error like this:
*** stack smashing detected ***: ./mem_test terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x64f8f47]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x0)[0x64f8f10]
./mem_test(func+0x2db)[0x407f22]
======= Memory map: ========
00400000-00420000 r-xp 00000000 08:01 280804      /tmp/mem_test
0061f000-00620000 r--p 0001f000 08:01 280804      /tmp/mem_test
00620000-00622000 rw-p 00020000 08:01 280804      /tmp/mem_test
00622000-00623000 rw-p 00000000 00:00 0
04000000-04022000 r-xp 00000000 08:01 160861      /lib/x86_64-linux-gnu/ld-2.15.so

If you do not want possibly lost blocks to be shown, add one of the following options:
Version 3.9: --show-leak-kinds=definite
Version 3.7: --show-possibly-lost=no

File descriptors that were opened but never closed can be detected as well; just add --track-fds=yes

I once had valgrind keep printing error messages on Ubuntu 12.04; remember to install an extra package: sudo apt-get install libc6-dbg
12.04 - Valgrind does debug error - Ask Ubuntu