Fast capture stack trace on windows / 64-bit / mixed mode

落花浮王杯 提交于 2019-12-18 13:28:10

问题


Like most of you probably know there exists plenty of different mechanisms to walk stack traces, starting from windows api, and continuing further into depth of magical assembly world - let me list here some of links I've already studied.

For of all let me mention that I want to have mechanism of memory leak analysis of mixed mode (managed and unmanaged) / 64-bit + AnyCPU application and from all windows api's CaptureStackBackTrace is most suitable for my needs, but as I have analyzed - it does not support managed code stack walking. But that function API is closest to what I need (Since it also calculates back trace hash - unique identifier of particular call stack).

I've excluded different approaches of locating memory leaks - most of software which I have tried is either crashing or does not work unreliably, or gives bad results.

Also I don't want to recompile existing software and override malloc / new other mechanism - because it's heavy task (we have huge code base with a lot of dll's). Also I suspect this is not one time job which I need to perform - issue comes back with 1-2 year cycle, depending of who and what was coding, so I would prefer to have built-in memory leaks detection in application itself (memory api hooking) instead of fighting with this problem over and over again.

http://www.codeproject.com/Articles/11132/Walking-the-callstack

Uses StackWalk64 windows API function, but does not work with managed code. Also 64 bit support is not fully clear to me - I have seen some walkaround for 64 bit problem - I suspect that this code does not fully work when stack walk is done within same thread.

Then there exists process hacker: http://processhacker.sourceforge.net/

Which also use StackWalk64, but extends it's call back function (7th and 8-th parameters) to support mixed mode stack walking. After a lot of complexities with 7/8 call back functions I've managed to reach also support of StackWalk64 with mixed mode support (Catching stack trace as vector - where each pointer refers to assembly / dll location where call went by). But as you may guess - the performance of StackWalk64 is insufficient for my needs - even with simple Message box from C# side, application simply "hangs" for a while until it starts up correctly.

I haven't seen such heavy delays with CaptureStackBackTrace function call, so I assume that performance of StackWalk64 is insufficient for my needs.

There exists also COM based approach of stack trace determination - like this: http://www.codeproject.com/Articles/371137/A-Mixed-Mode-Stackwalk-with-the-IDebugClient-Inter

http://blog.steveniemitz.com/building-a-mixed-mode-stack-walker-part-1/

but what I'm afraid of - it requires COM , and thread needs to be com initialized, and due to memory api hooking I should not touch com state within any thread, because it can results in heavier problems (e.g. incorrect apartment initialization, other glitches)

Now I've reached the point when windows API becomes insufficient for my own needs, and I need to walk through call stack manually. Such examples can be found for example:

http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks See function FillStackInfo / 32 bit only, does not support managed code.

There are couple of mentions about reversing stack trace - for example on following links:

  1. http://blog.airesoft.co.uk/2009/02/grabbing-kernel-thread-contexts-the-process-explorer-way/
  2. http://cbloomrants.blogspot.fi/2009/01/01-30-09-stack-tracing-on-windows.html
  3. http://www.gamedev.net/topic/364861-stack-dump-on-win32-how-to-get-api-addresses/
  4. http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/

Especially 1, 3, 4 links give some interesting night reading. :-)

But even thus they are rather interesting mechanisms, there is no fully working demo example on any of them.

I guess one of examples is Wine's dbghelp implementation (Windows "emulator" for linux), which also shows how exactly StackWalk64 works at the end, but I suspect that it's heavily bound to DWARF2 file format executable, so it's not identical to current windows PE executable file format.

Can someone pintpoint me to good implementation of stack walking, working on 64-bit architecture, with mixed mode support (can track native and managed memory allocations), which is bound purely in register / call stack / code analysis. (Combined implementations of 1, 3, 4)

Does someone has any good contact from Microsoft development team, who could potentially answer to this question ?


回答1:


9-1-2015 - I've located original function which gets called by process hacker, and that one was

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll OutOfProcessFunctionTableCallback

it's source code - which was here: https://github.com/dotnet/coreclr/blob/master/src/debug/daccess/fntableaccess.cpp

From there I have owner of most of changes in that source code - Jan Kotas (jkotas@microsoft.com) and contacted him about this problem.

From: Jan Kotas <jkotas@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com> 
Sent: Friday, January 8, 2016 3:27 PM
Subject: RE: Fast capture stack trace on windows 64 bit / mixed mode...

...

The mscordacwks.dll is called mscordaccore.dll in CoreCLR / github repro. The VS project 
files are auto-generated for it during the build 
(\coreclr\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\mscordaccore.vcxproj).
You should be able to build and debug CoreCLR to understand how it works.
...

From: Jan Kotas <jkotas@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com> 
Sent: Saturday, January 9, 2016 2:02 AM
Subject: RE: Fast capture stack trace on windows 64 bit / mixed mode...

> I've tried to replace 
> C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll dll loading 
> with C:\Prototyping\dotNet\coreclr-master\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\Debug\mscordaccore.dll
> loading (just compiled), but if previously I could get mixed mode stack trace correctly:
> ...

mscordacwks.dll is tightly coupled with the runtime. You cannot mix and match them between runtimes.
What I meant is that you can use CoreCLR to understand how this works.

But then he recommended this solution which was working for me:

int CaptureStackBackTrace3(int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash)
{
    CONTEXT ContextRecord;
    RtlCaptureContext(&ContextRecord);

    UINT iFrame;
    for (iFrame = 0; iFrame < nFrames; iFrame++)
    {
        DWORD64 ImageBase;
        PRUNTIME_FUNCTION pFunctionEntry = RtlLookupFunctionEntry(ContextRecord.Rip, &ImageBase, NULL);

        if (pFunctionEntry == NULL)
            break;

        PVOID HandlerData;
        DWORD64 EstablisherFrame;
        RtlVirtualUnwind(UNW_FLAG_NHANDLER,
            ImageBase,
            ContextRecord.Rip,
            pFunctionEntry,
            &ContextRecord,
            &HandlerData,
            &EstablisherFrame,
            NULL);

        BackTrace[iFrame] = (PVOID)ContextRecord.Rip;
    }

    return iFrame;
}

This code snipet still is missing backtrace hash calculation, but this is something can can be added afterwards.

It's very import also to note that when debugging this code snipet you should use native debugging, not mixed mode (C# project by default use mixed mode), because it somehow disturbs stack trace in debugger. (Something to figure out how and why such distortion happens)

There is still one missing piece of puzzle - how to make symbol resolution fully resistant to FreeLibrary / Jit code dispose, but this is something I need to figure out still.

Please note that RtlVirtualUnwind will most probably work only on 64-bit architecture, not on arm or 32-bit.

One more funny thing is that there exists function RtlCaptureStackBackTrace which somehow resembles windows api function CaptureStackBackTrace - but they somehow differ - at least by naming. Also if you check RtlCaptureStackBackTrace - it calls eventually RtlVirtualUnwind - you can check it from Windows Research Kernel source codes

RtlCaptureStackBackTrace
>
RtlWalkFrameChain
>
RtlpWalkFrameChain
>
RtlVirtualUnwind

But what I have tested RtlCaptureStackBackTrace does not works correctly. Unlike function RtlVirtualUnwind above.

It's a kinda magic. :-)

I'll continue this questionnaire with phase 2 question - in here:

Resolve managed and native stack trace - which API to use?




回答2:


x64 Stack walking is complicated as you have already found out. A simple alternative is to simply not do it but leave the hard things to the OS ETW stackwalker. That works and it is much faster than you will ever get.

You can take advantage of it by emitting your own ETW event. Before that you need to start an ETW session for your event provider and enable stack walking for your provider. There is a catch on Windows 7 where it does not work unless the managed stack frames are all NGenned because the x64 ETW Stackwalker will stop if he finds a stack frame which is not in any loaded module which is true for JITed code.

Starting with Windows 8 the ETW Stackwalker will always walk the first MB of the stack for stack frames which fixes the JIT problem. The JIT compiler emits Unwind Infos for the generated code if ETW tracing is on and registers it via RtlAddGrowableFunctionTable which makes it possible to walk the stack fast from within the kernel in the first place. Things work differently when ETW tracing is not enabled for compatibility reasons.

If you are after malloc/free new/delete memory leaks you can also use the OS bultin capabilities of Heap allocation tracing which already exists since Windows 7. See xperf -help start and https://randomascii.wordpress.com/2015/04/27/etw-heap-tracingevery-allocation-recorded/ for more information about heap allocation tracing. You can enable it for an already running process with no problems. The downside is that for any real world application the generated data is huge. But if you are after big allocations only then it can help to track only VirtualAlloc calls which can also be enabled with minimal overhead.

Managed code since .NET 4.5 has also its own ETW allocation tracing provider with full stack walking even on x64 Windows 7 because it does a full managed stackwalk by itself. More infos can be found in the CoreClr Sources at: ETW::SamplingLog::SendStackTrace in https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.h for many more details.

That is only a rough outline what is possible. To really get all of the necessary details would take a whole book I fear. And I am still learning new things every day.

Here is a heapalloc.cmd script which you can use to track heap allocations. By default it records into a 500MB ring buffer if your leak builds up over longer periods of time recording all allocation stacks without condensing them at runtime will not work out with WPA. But you could post process a huge ETL file and write your own viewer for it.

@echo off 
setlocal enabledelayedexpansion
REM consider using a different drive for ETL output to prevent slowing down 
REM your application and to prevent lost buffers
set OUTDIR=C:\TEMP
set OUTFILENAME=HeapTracing.etl
REM Final output file
set OUTFILE=!OUTDIR!\!OUTFILENAME!
set CLRUNDOWNFILE=!OUTDIR!\clr_HeapDCend.etl
set KERNELFILE=!OUTDIR!\kernel.etl
set CLRSESSIONFILE=!OUTDIR!\clrHeapSession.etl
set HEAPUSERFILE=!OUTDIR!\HeapUserSession.etl
REM Default is allocation and realloc to track memory leaks
REM HeapFree is the other option to track double free calls
set HEAPTRACINGFLAGS=HeapAlloc+HeapRealloc 

if "%3" NEQ "" (
echo Overriding Heap Tracing Flags with: %3
set HEAPTRACINGFLAGS=%3
)


if "%1" EQU "-start" ( 
    call :StartTracing -PidNewProcess %2
    goto :Exit 
) 

if "%1" EQU "-attachPid" ( 
    call :StartTracing -Pids %2
    goto :Exit 
) 

if "%1" EQU "-startNext" (
    reg add "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /t REG_DWORD /d 1 /f
    if not %errorlevel% == 0 goto failure
    call :StartTracing -Pids 0
    goto :Exit
)

if "%1" EQU "-stop" ( 
    set XPERF_CreateNGenPdbs=1
    xperf -start ClrRundownSession -on e13c0d23-ccbc-4e12-931b-d9cc2eee27e4:0x118:5+a669021c-c450-4609-a035-5af59af4df18:0x118:5 -f "!CLRUNDOWNFILE!" -buffersize 256 -minbuffers 256 -maxbuffers 512 
    call :WaitUntilRundownCompleted "!CLRUNDOWNFILE!"
    xperf -stop -stop ClrSession ClrRundownSession HeapSession | findstr /V identifiable 2> NUL

    echo Merging profiles
    REM Reset symbol path to create the pdbs files in the output directory with in the directory with the same name like our etl file
    set TMPSYMBOLPATH=!_NT_SYMBOL_PATH!
    REM Each tool is using a different pdb cache folder. If you are using them side by side 
    REM you have to wait a long time to refresh the pdb cache. To spare the waiting time we use 
    REM the pdb cache folder from WPR

    mkdir C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 2> NUL
    set _NT_SYMBOL_PATH=srv*C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 
    mklink /D "!OUTFILE!.NGENPDB" C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache  2> NUL

    echo Managed PDBs are stored at: !OUTFILE!.NGENPDB. If you want to transfer the etl do not forget to copy this directory with the pdbs as well. 
    echo Merging ETL files and generating native pdbs

    xperf -merge  "!KERNELFILE!" "!CLRSESSIONFILE!" "!CLRUNDOWNFILE!" "!HEAPUSERFILE!" "!OUTFILE!"
    set _NT_SYMBOL_PATH=!TMPSYMBOLPATH!
    echo !OUTFILE! was created

    if "%2" NEQ "" reg delete "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /f 2> NUL
    goto :Exit 
) 

goto Usage:

:StartTracing
xperf -start ClrSession -on  Microsoft-Windows-DotNETRuntime:5 -f "!CLRSESSIONFILE!" -buffersize 128 -minbuffers 256 -maxbuffers 512 
xperf -on PROC_THREAD+LOADER+latency+virt_alloc -stackwalk VirtualAlloc  -f "%KERNELFILE%"
xperf -start HeapSession -heap %1 %2 -BufferSize 1024 -MinBuffers 128 -MaxBuffers 1024 -stackwalk %HEAPTRACINGFLAGS% -f "!HEAPUSERFILE!" -FileMode Circular -MaxFile 1024
exit /B

REM Wait until writing to ETL file has stopped by checking its file size
:WaitUntilRundownCompleted
:StillWriting
    for %%F in (%1) do set "size=%%~zF"
    timeout /T 1  > nul
    for %%F in (%1) do set "size2=%%~zF"
    if "!size!" EQU "" goto :EndWriting
    if "!size!" NEQ "!size2!" goto StillWriting
:EndWriting
timeout /T 1  > nul
exit /B


:Usage
    echo Usage: 
    echo HeapAlloc.cmd -start [executable] or -stop
    echo               -start [executable] Start a trace session 
    echo               -startNext [executable] Start heap tracing for all subsequent calls to executable.
    echo               -attachPid ddd Start a trace session for specified process
    echo               -stop  [executable] Stop a trace session 
    echo Examples
    echo     HeapAlloc.cmd -startNext devenv.exe
    echo     HeapAlloc.cmd -stop      devenv.exe
    echo To attach to a running process
    echo     HeapAlloc.cmd -attachPid dddd
    echo     HeapAlloc.cmd -stop 
    echo You must call -stop for your executable if you have used -start or startNext because heap allocation tracing will enabled until you stop it!
goto :Exit 

:failure
    echo Error occured
goto :Exit

:Exit



回答3:


25.1.2016 Writing as separate issue, as complementary information.

For stack unique id CaptureStackBackTrace uses simple sum of all instruction pointers - idea is borrowed from: "Windows_Research_Kernel(sources)\WRK-v1.2\base\ntos\rtl\amd64\stkwalk.c":

    size_t hashValue = 0;

    for (int i = 0; i < nFrames; i++)
        hashValue += PtrToUlong(BackTrace[i]);

    *pBackTraceHash = (DWORD)hashValue;

I'm not sure about last conversion - some specify last parameter as DWORD, some as ulong64, but it's not relevant. The main problem with this calculation is that it's not unique enough. For case of recursive function calls - if you have call order:

func1
func2
func3

Stack trace for:

func1
func3
func2

Will be identical.

What I have debugged - for memory leak detection I'm getting 62876 false hits - unique stack id calculation is not reliable enough.

I've bit switched formula to:

static DWORD crc32_tab[] =
{
    0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
    0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
    0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
    0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
    0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
    0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
    0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
    0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
    0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
    0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
    0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
    0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
    0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
    0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
    0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
    0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
    0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
    0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
    0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
    0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
    0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
    0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
    0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
    0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
    0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
    0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
    0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
    0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
    0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
    0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
    0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
    0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
    0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
    0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
    0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
    0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
    0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
    0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
    0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
    0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
    0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
    0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
    0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};

if (pBackTraceHash)
{

    size_t hashValue = 0;
    for( int idxFrame = 0; idxFrame < (int)iFrame; idxFrame++ )
    {
        unsigned char* p = (unsigned char*)&BackTrace[idxFrame];
        for( int i = 0; i < sizeof(void*); i++ )
            hashValue = crc32_tab[ ((hashValue ^ *p++) & 0xFF) ] ^ (hashValue >> 8);
    }
    *pBackTraceHash = (DWORD)hashValue;
}

This algorithm does not give false hits, but slows down execution a little bit.

Also memory leak statistics differs: Unreliable algorithm: Total amount of leaked memory: 48'874'764 / in 371 allocation pools Crc32 based algorithm: Total amount of leaked memory: 48'874'764 / in 614 allocation pools

Like you see - statistics combines (pools) similar call stack together - less fragmentation, but original call stack is lost. (Incorrect statistics)

May be someone could give me some faster algorithm for this ?




回答4:


Just notes to myself:

Apparently CaptureStackBackTrace is probably calls directly or indirectly RtlCaptureStackBackTrace, and source code of that function is apparently is open source code currently - can be searched using "windows research kernel".

Code which I've accidentally found by harvesting https://github.com/dotnet/coreclr/blob/master/src/unwinder/amd64/unwinder_amd64.cpp

where there was reference in code that's borrowed from windows kernel:

Everything below is borrowed from minkernel\ntos\rtl\amd64\exdsptch.c file from Windows

and by googling bit more I've located windows kernel itself.

May be I can upgrade that function to support managed stack as well (using information from process hacker).

[4.1.2015] By analyzing deeper it looks like main performance bottleneck is not CaptureStackBackTrace itself - because it's simple iteration, structure lookup, but managed mode stack walk, where I call C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll / OutOfProcessFunctionTableCallback - you can find it's source code in .net distribution, and apparently its allocating memory for analyzing JIT compiled structures. But problem is that JIT compilation can change whenever, and only way to have reliable stack trace is to re-query same information over and over again, which can cause overhead in memory allocation. I guess code needs to be changed so that mscordacwks similar code will not allocate memory by itself, but use run-time structures to determine call stack and function table / function entries.

P.S. if you vote down this answer, I would like to know reason why, what is the alternative. And better if you have tried alternative by yourself.




回答5:


Btw - if anyone is missing original implementation of StackWalk for windows, it's located over here:

https://github.com/dotnet/coreclr/blob/master/src/utilcode/stacktrace.cpp




回答6:


27.1.2016 And may be out of direct question - is 32-bit call stack determination. I've asked about which API to use - at least CaptureStackBackTrace produces incomplete traversals (Only native code), and also RtlVirtualUnwind api function does not exists for 32-bit windows.

From: Noah Falk <noahfalk@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com>; Mike McLaughlin <mikem@microsoft.com> 
Cc: Jan Kotas <jkotas@microsoft.com>
Sent: Tuesday, January 26, 2016 1:34 AM
Subject: RE: Resolving managed call stack from void*

Hi Tarmo, hope the exploration of stackwalking has been interesting. 
If I followed you correctly you’ve been successful on x64 but hoping you can extend your technique to 32 bit. 
Indeed the RtlCaptureVirtualUnwind techniques don’t work here, and the fundamental reason behind it is that 
while x64 defines a specific calling convention that all code on Windows is forced to use, x86 does not. 
This means that there is no algorithm the OS could implement which guarantees correct unwinding when PDBs are 
unavailable. However you do have some options:

1)      You can use simple heuristics that work for certain kinds of code. 
Unoptimzed code on x86 often uses EBP chaining, in which ESP in the current frame points to EBP, and EBP points 
to the parent frame’s EBP, and so on down the stack. The return address is stored on the stack adjacent to EBP. 
As I recall all jitted code produced by recent versions of .Net follows these conventions, including optimized 
jitted code. However when a compiler performs inlining these conventions will be unable to detect it, and optimized 
code that does not follow this convention could easily cause the stack to become unwalkable.

2)      If you are willing to load PDBs you can use the DIA APIs to walk the stack: 
https://msdn.microsoft.com/en-us/library/dt06fh94.aspx. The PDB contains additional data about optimized code 
which allows frames that do not follow the EBP chaining convention to be correctly unwound. 
This is the stack walk API that Visual Studio is using when it debugs 32 bit native code on Windows.

3)      The ICorDebug APIs (https://msdn.microsoft.com/en-us/library/dd646502(v=vs.110).aspx) are a set of 
APIs that are designed to support managed code debuggers. Starting in .Net 4.0 the ICorDebug API supports 
dump debugging, however the API is designed in such a way that you don’t have to serialize a dump file. 
This is likely to be more complicated than you would want, but its supported to the use the Windows process 
snapshot APIs to take a snapshot of the memory space and then direct the ICorDebug API to read from this 
snapshot as if it was a dump. One advantage of the ICorDebug API is that not only will it give you managed 
stack frames, it also allows exporing all the other kinds of data debuggers would expose such as parameters, 
local values, fields of objects, types of the values, etc.

The MDbg tool (https://www.microsoft.com/en-us/download/details.aspx?id=2282) is a complete sample debugger 
with source included. It supports dump debugging and displaying callstacks, though it won’t have any specific 
example about using the process snapshot APIs in place of using a dump. The main change would be replacing 
the implementation of ICorDebugDataTarget. MDbg has an implementation that reads from a dump file and you 
would need to create a new implementation that reads from a process snapshot using the windows APIs 
(https://msdn.microsoft.com/en-us/library/dn457825(v=vs.85).aspx). I’ve never written the code myself and 
I’ve heard from other tool authors that they found using the windows snapshot APIs more difficult than expected,
 but eventually they were successful.

And I was bit inspired by approach 1, since already saw similar approach being done in another project, so I've wrote my own implementation for 32-bit stack traversal:

int CaptureStackBackTracePro( int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash )
{
    //
    //  This approach was taken from StackInfoManager.cpp / FillStackInfo
    //  http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks
    //  - slightly simplified the function itself.
    //
    int regEBP;
    __asm mov regEBP, ebp;

    long *pFrame = (long*) regEBP;              // pointer to current function frame
    void* pNextInstruction;
    int iFrame = 0;

    //
    // Using __try/_catch is faster than using ReadProcessMemory or VirtualProtect.
    // We return whatever frames we have collected so far after exception was encountered.
    //
    __try {
        for( ; iFrame < nFrames; iFrame++ )
        {
            pNextInstruction = (void*)(*(pFrame + 1));

            if( !pNextInstruction )     // Last frame
                break;

            BackTrace[iFrame] = pNextInstruction;
            pFrame = (long*)(*pFrame);
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER) 
    {
    }

    // pBackTraceHash fillout is missing, see in another answer code snipet.

    return iFrame;

} //CaptureStackBackTracePro

Brief tests indicate that this function is able to capture native and managed stack frames.

Optimized code I guess requires bit deeper analysis. Better to leave out optimizations or only optimize relevant parts of code - for better diagnostics ?!



来源:https://stackoverflow.com/questions/34501602/fast-capture-stack-trace-on-windows-64-bit-mixed-mode

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!