Memory profiling with data.table

Submitted by 别等时光非礼了梦想 on 2021-01-27 04:48:20

Question


What is the correct way to profile memory in R code that contains calls to data.table functions? Let's say I want to determine the maximum memory usage during an expression.

This reference indicates that Rprofmem may not be the right choice: https://cran.r-project.org/web/packages/profmem/vignettes/profmem.html

"All memory allocations that are done via the native allocVector3() part of R's native API are logged, which means that nearly all memory allocations are logged. Any objects allocated this way are automatically deallocated by R's garbage collector at some point. Garbage collection events are not logged by profmem(). Allocations not logged are those done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects. Such objects are not handled by the R garbage collector."

The data.table source code contains plenty of calls to Calloc() and malloc(), which suggests that Rprofmem will not measure all memory allocated by data.table functions. If Rprofmem is not the right tool, why does Matthew Dowle use it in his answer to "R: loop over columns in data.table"?

I've found a reference suggesting similar potential issues for gc() (which can be used to measure maximum memory usage between two calls to gc()): https://r.789695.n4.nabble.com/Determining-the-maximum-memory-usage-of-a-function-td4669977.html

"gc() is a good start. Call gc(reset = TRUE) before and gc() after your task, and you will see the maximum extra memory used by R in the interim. (This does not include memory malloced by compiled code, which is much harder to measure as it gets re-used.)"
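
As a minimal sketch of the gc() approach quoted above (the helper name peak_extra_mb is hypothetical, not part of any package), the two calls can be wrapped in a function. It assumes the column layout that gc() returns, where a "max used" column is followed by its "(Mb)" column, and, as the quote warns, it only sees memory managed by R's own allocator:

```r
# Hypothetical helper: extra R-heap memory (in Mb) at the peak of evaluating
# `expr`, as reported by gc(). Memory obtained through malloc()/Calloc() in
# compiled code is NOT included.
peak_extra_mb <- function(expr) {
  base <- gc(reset = TRUE)   # reset the "max used" statistics, record baseline
  force(expr)                # evaluate the lazily passed expression
  after <- gc()              # read the statistics after the task
  mb_col <- which(colnames(after) == "max used") + 1L  # Mb column for "max used"
  sum(after[, mb_col]) - sum(base[, 2L])               # Ncells + Vcells, in Mb
}

# Usage: a numeric vector of 1e7 doubles needs roughly 76 Mb
extra <- peak_extra_mb(numeric(1e7))
```

For data.table code this number is a lower bound at best, which is exactly the concern raised in this question.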

Nothing I've found suggests that similar issues exist with Rprof(memory.profiling=TRUE). Does this mean that the Rprof approach will work for data.table even though it doesn't always use the R API to allocate memory?

If Rprof(memory.profiling=TRUE) in fact is not the right tool for the job, what is?

Would ssh.utils::mem.usage work?


Answer 1:


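This is not specific to data.table. There was recently a discussion on Twitter about the same behaviour in dplyr: https://mobile.twitter.com/healthandstats/status/1182840075001819136

On Linux, you can measure the peak resident memory of the whole R process from outside R with GNU time (the |& redirection requires bash):

    /usr/bin/time -v Rscript -e 'library(data.table); CJ(1:1e4, 1:1e4)' |& grep resident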

There is also the interesting cgmemtime project, but it requires a bit more setup.
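
The /usr/bin/time call can also be driven from R itself. The following is only a sketch: the helper names parse_max_rss_kb and peak_rss_kb are hypothetical, and it assumes GNU time is installed at /usr/bin/time and reports a "Maximum resident set size (kbytes)" line in its -v output.

```r
# Hypothetical helper: pull the peak RSS (in kB) out of `/usr/bin/time -v` output.
parse_max_rss_kb <- function(lines) {
  hit <- grep("Maximum resident set size", lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0L) return(NA_real_)
  as.numeric(sub(".*:\\s*", "", hit[1]))   # keep only the number after the colon
}

# Hypothetical helper: run `code` in a fresh Rscript process under GNU time
# and return that process's peak resident set size in kB.
peak_rss_kb <- function(code, time_bin = "/usr/bin/time") {
  out <- suppressWarnings(
    system2(time_bin, c("-v", "Rscript", "-e", shQuote(code)),
            stdout = TRUE, stderr = TRUE)   # GNU time writes its report to stderr
  )
  parse_max_rss_kb(out)
}

# Usage (not run here; needs GNU time and data.table):
# peak_rss_kb("library(data.table); invisible(CJ(1:1e4, 1:1e4))")
```

Because the measurement covers the whole process, it includes memory allocated by compiled code as well as the baseline cost of starting R and loading packages.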

If you are on Windows, I suggest moving to Linux.




Answer 2:


If you are using Windows, you can query the PowerShell process objects for RGui and Memory Compression through system commands from R and read their various memory counters. I don't yet have a way to store the PowerShell objects in R. PowerShell code for RGui and 'Memory Compression', which Windows uses to store frequently used objects:

    $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t2 = $t1 | Select { $_.Id;
    [math]::Round($_.WorkingSet64/1MB);
    [math]::Round($_.PrivateMemorySize64/1MB);
    [math]::Round($_.VirtualMemorySize64/1MB) };
    $t2 | ft * 

    $t1 | gm -View All
    $t1.Modules
    $t1.MaxWorkingSet

PowerShell embedded in R:

    ps_f <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t2 = $t1 | Select { 
     $_.Id;
     [math]::Round($_.WorkingSet64/1MB);
     [math]::Round($_.PrivateMemorySize64/1MB);
     [math]::Round($_.VirtualMemorySize64/1MB) };
    $t2 | ft * "); }

    ps_f()

     $_.Id;                                                                                                                
     [math]::Round($_.WorkingSet64/1MB);                                                                                   
     [math]::Round($_.PrivateMemorySize64/1MB);                                                                            
     [math]::Round($_.VirtualMemorySize64/1MB)                                                                             
    -----------------------------------------------------------------------------------------------------------------------
    {2264, 1076, 3, 1401}                                                                                                  
    {15832, 3544, 6691, 11965}   
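
The output above is only printed to the console. One possible way to get the counters into R as a data frame, sketched here under assumptions, is to have PowerShell emit CSV and capture it with system2(). The helper names parse_ps_csv and ps_mem_df are hypothetical, the process name "Rgui" is an assumption (it differs under RStudio or Rterm), and the second function only works on Windows.

```r
# Hypothetical helper: turn CSV lines (as produced by ConvertTo-Csv) into a data frame.
parse_ps_csv <- function(lines) {
  read.csv(text = paste(lines, collapse = "\n"), stringsAsFactors = FALSE)
}

# Hypothetical helper (Windows only): memory counters of a process, in bytes.
ps_mem_df <- function(name = "Rgui") {
  cmd <- paste0(
    "Get-Process -Name ", name,
    " | Select-Object Id,WorkingSet64,PrivateMemorySize64,PeakWorkingSet64",
    " | ConvertTo-Csv -NoTypeInformation"
  )
  out <- system2("powershell", c("-NoProfile", "-Command", shQuote(cmd)),
                 stdout = TRUE)             # capture stdout as a character vector
  parse_ps_csv(out)
}

# Usage on Windows (not run here):
# before <- ps_mem_df(); <some data.table operation>; after <- ps_mem_df()
# after$PeakWorkingSet64 - before$WorkingSet64
```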



    ps_mem <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t1 | Select ProcessName,MaxWorkingSet,MinWorkingSet,PagedMemorySize64,NonpagedSystemMemorySize64;")} 

    > ps_mem()

    ProcessName                : Memory Compression
    MaxWorkingSet              : 
    MinWorkingSet              : 
    PagedMemorySize64          : 3411968
    NonpagedSystemMemorySize64 : 0

    ProcessName                : Rgui
    MaxWorkingSet              : 1413120
    MinWorkingSet              : 204800
    PagedMemorySize64          : 7014719488
    NonpagedSystemMemorySize64 : 6662736

    # run some data.table operation here

    > ps_mem()
    ProcessName                : Memory Compression
    MaxWorkingSet              : 
    MinWorkingSet              : 
    PagedMemorySize64          : 3411968
    NonpagedSystemMemorySize64 : 0

    ProcessName                : Rgui
    MaxWorkingSet              : 1413120
    MinWorkingSet              : 204800
    PagedMemorySize64          : 7015915520
    NonpagedSystemMemorySize64 : 6662736

PowerShell code to get the maximum of every memory counter for Rgui:

    $t1 | where {$_.ProcessName -eq "Rgui"} | Measure-Object -Maximum *memory* | ft  Property,Maximum

PowerShell embedded in R:

    ps_mem_ <- function() { system("powershell -ExecutionPolicy Bypass -command $t1 = ps | where {$_.Name -EQ 'RGui' -or $_.Name -EQ 'Memory Compression'};
    $t2 = $t1 | where {$_.ProcessName -eq 'Rgui'}; 
    $t2 | Measure-Object -Maximum *memory* | ft  Property,Maximum ")} 

    # the 32-bit counters (names without the 64 suffix) roll over and show
    # negative values; read the *64 properties instead

    > ps_mem_()

    Property                       Maximum
    --------                       -------
    NonpagedSystemMemorySize       6662736
    NonpagedSystemMemorySize64     6662736
    PagedMemorySize            -1570734080
    PagedMemorySize64           7019200512
    PagedSystemMemorySize           680240
    PagedSystemMemorySize64         680240
    PeakPagedMemorySize        -1260961792
    PeakPagedMemorySize64      11623940096
    PeakVirtualMemorySize       -161009664
    PeakVirtualMemorySize64    17018859520
    PrivateMemorySize          -1570734080
    PrivateMemorySize64         7019200512
    VirtualMemorySize           -339103744
    VirtualMemorySize64        12545798144

    # run some data.table operation here

    > ps_mem_()

    Property                       Maximum
    --------                       -------
    NonpagedSystemMemorySize       6662736
    NonpagedSystemMemorySize64     6662736
    PagedMemorySize            -1570734080
    PagedMemorySize64           7019200512
    PagedSystemMemorySize           680240
    PagedSystemMemorySize64         680240
    PeakPagedMemorySize        -1260961792
    PeakPagedMemorySize64      11623940096
    PeakVirtualMemorySize       -161009664
    PeakVirtualMemorySize64    17018859520
    PrivateMemorySize          -1570734080
    PrivateMemorySize64         7019200512
    VirtualMemorySize           -339103744
    VirtualMemorySize64        12545798144
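
The negative values are the 64-bit counters wrapped into a signed 32-bit integer. A small sketch (wrap_int32 is a hypothetical helper written for this illustration) reproduces the numbers shown above:

```r
# Hypothetical helper: what a byte count looks like after being truncated
# to a signed 32-bit integer (two's complement wrap-around).
wrap_int32 <- function(x) {
  r <- x %% 2^32                    # keep the low 32 bits
  ifelse(r >= 2^31, r - 2^32, r)    # reinterpret the top bit as the sign
}

wrap_int32(7019200512)    # PagedMemorySize64       -> PagedMemorySize
wrap_int32(17018859520)   # PeakVirtualMemorySize64 -> PeakVirtualMemorySize
```

So the 32-bit properties carry no extra information, and only the *64 values are meaningful once R uses more than 2 GB.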

To list all members of the Rgui process object:

    $t1 | gm -View All


       TypeName: System.Diagnostics.Process

    Name                       MemberType     Definition
    ----                       ----------     ----------
    Handles                    AliasProperty  Handles = Handlecount
    Name                       AliasProperty  Name = ProcessName
    NPM                        AliasProperty  NPM = NonpagedSystemMemorySize64
    PM                         AliasProperty  PM = PagedMemorySize64
    SI                         AliasProperty  SI = SessionId
    VM                         AliasProperty  VM = VirtualMemorySize64
    WS                         AliasProperty  WS = WorkingSet64
    Disposed                   Event          System.EventHandler Disposed(System.Object, System.EventArgs)
    ErrorDataReceived          Event          System.Diagnostics.DataReceivedEventHandler ErrorDataReceived(System.Object, System.Diagnostics.DataReceivedEventArgs)
    ...


Source: https://stackoverflow.com/questions/58278838/memory-profiling-with-data-table
