Reshape alternating columns in less time and with less memory

有刺的猬  2021-01-21 12:11

How can I do this reshape faster and so that it uses less memory? My aim is to reshape a dataframe that is 500,000 rows by 500 columns on a machine with 4 GB of RAM.

Here's a function to build example data:
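Since the listing was truncated, here is a minimal sketch of what make_example() could look like, reconstructed from the printout in the answer below; the random-filename scheme and the 1-3 topic range are guesses from that output, and the exact original may differ.

    # Hypothetical reconstruction (mine, not the OP's) of the truncated
    # make_example(): n rows of an id and a random 5-character filename,
    # followed by k alternating pairs of ntop_i / ptop_i columns.
    make_example <- function(n, k) {
      dat <- data.frame(
        docnum   = seq_len(n),
        filename = replicate(n, paste(sample(c(letters, LETTERS, 0:9), 5,
                                             replace = TRUE), collapse = "")),
        stringsAsFactors = FALSE
      )
      for (i in seq_len(k)) {
        dat[[paste0("ntop_", i)]] <- sample(1:3, n, replace = TRUE)  # topic id
        dat[[paste0("ptop_", i)]] <- runif(n)                        # topic weight
      }
      dat
    }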

1 Answer
  夕颜  2021-01-21 12:56

    I doubt very much that this will succeed in that small amount of RAM with a 500,000 x 500 dataframe; I wonder whether you could do even simple operations on it in that little space. Buy more RAM. Furthermore, reshape2 is slow, so for big jobs use stats::reshape instead, and give it a hint about the separator via the sep argument:

    > set.seed(007)
    > dat <- make_example(5, 3)
    > dat
      docnum filename ntop_1     ptop_1 ntop_2    ptop_2 ntop_3    ptop_3
    1      1    y8214      3 0.06564574      1 0.6799935      2 0.8470244
    2      2    e6x39      2 0.62703876      1 0.2637199      3 0.4980761
    3      3    34c19      3 0.49047504      3 0.1857143      3 0.7905856
    4      4    1H0y6      2 0.97102441      3 0.1851432      2 0.8384639
    5      5    P6zqy      3 0.36222085      3 0.3792967      3 0.4569039
    
    > reshape(dat, direction="long", varying=3:8, sep="_")
        docnum filename time ntop       ptop id
    1.1      1    y8214    1    3 0.06564574  1
    2.1      2    e6x39    1    2 0.62703876  2
    3.1      3    34c19    1    3 0.49047504  3
    4.1      4    1H0y6    1    2 0.97102441  4
    5.1      5    P6zqy    1    3 0.36222085  5
    1.2      1    y8214    2    1 0.67999346  1
    2.2      2    e6x39    2    1 0.26371993  2
    3.2      3    34c19    2    3 0.18571426  3
    4.2      4    1H0y6    2    3 0.18514322  4
    5.2      5    P6zqy    2    3 0.37929675  5
    1.3      1    y8214    3    2 0.84702439  1
    2.3      2    e6x39    3    3 0.49807613  2
    3.3      3    34c19    3    3 0.79058557  3
    4.3      4    1H0y6    3    2 0.83846387  4
    5.3      5    P6zqy    3    3 0.45690386  5
    
    > system.time( dat <- make_example(5000,100) )
       user  system elapsed 
      2.925   0.131   3.043 
    > system.time( dat2 <-  reshape(dat, direction="long", varying=3:202, sep="_"))
       user  system elapsed 
     16.766   8.608  25.272 
    

    I'd say that around 1/5 of my 32 GB of memory was used during that process, on a problem roughly 250 times smaller than your goal, so I'm not surprised that your machine hung. (It should not have "crashed", though. The authors of R would prefer that you describe the behavior accurately, and I suspect the R process simply stopped responding once it started paging into virtual memory.) Even with 32 GB, I have performance issues to work around on a dataset of 7 million records x 100 columns.
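    One quick sanity check you can run yourself (my sketch, not part of the original answer): reshape a scaled-down sample, measure the result with object.size(), and extrapolate to the full problem, which is roughly 250 times larger (100x the rows, ~2.5x the columns). Keep in mind that peak memory during reshape() is higher than the size of the finished object.

    # Rough memory estimate for the 500,000 x 500 goal, assuming the
    # make_example() sketch above; the extrapolation is approximate.
    dat  <- make_example(5000, 100)
    long <- reshape(dat, direction = "long", varying = 3:202, sep = "_")
    sz   <- object.size(long)
    print(sz, units = "Mb")              # size of the small long result
    as.numeric(sz) * 250 / 1024^3        # approximate GB at full scale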
