Iterate over vectors from an imported dataframe row-wise

Submitted by 跟風遠走 on 2021-01-05 07:47:10

Question


I'm trying to make the switch from R to C++ coding. If you choose to downvote this question, at the very least patronize me with an answer so I can learn something. My question is: how am I supposed to approach row-wise calculations in C++ once I have passed C++ a data frame? Conceptually, I understand that once I pass C++ a data frame, C++ will treat each column as its own vector that I have to explicitly name. Where I am having trouble is setting up a for loop to iterate through the same position of all of the vectors at once, thus functionally emulating a row-wise function in R. I would like to extend this question to the following applications as well:

  1. How would I set up a loop that iterates across a row and returns a vector, like rowSums in R? There is an example of this in Advanced R using a matrix, but the nomenclature doesn't translate to a pile of vectors from a data frame.
  2. How would I set up a loop that iterates across a row, changes the values in each row, and returns the modified vectors?
  3. How would I set up a loop that iterates through a range of rows at once, thus making a sliding window function? Like this:

    ## an example of a for loop in R that I want to recapitulate in c++
    output <- list() 
    
    for(i in 1:nrow(df)){
      end_row <- i+3
      df_tmp <- df[i:end_row, ]
      ## do some function here
      output[[i]] <- list(df_tmp)
    }
    
  4. How would I set up the same rolling function as in question 3, but in a way that allows me to conditionally extend the vector lengths? In R, I've written functions using apply that iterate over a range of rows and then return a list of new data frames that I then turn into one large data frame. Doing this one vector at a time is conceptually perplexing to me at the moment.

Let's say I have this data frame in R:

#example data    
a <- c(0, 2, 4, 6, 8, 10)
b <- c(1, 3, 5, 7, 9, 11)
c <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
d <- c(10.2, 10.2, 4.3, 4.3, 3.4, 7.9)
e <- c("a", "t", "t", "g", "c", "a")

df <- data.frame(a, b, c, d, e)

In c++, I have gotten this far:

#include <algorithm>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {

  // access the columns
  IntegerVector a = df["a"];
  IntegerVector b = df["b"];
  CharacterVector c = df["c"];
  IntegerVector d = df["d"];
  CharacterVector e = df["e"];

  // write the for loop. I'm attempting to define a single
  // position and then apply it to all vectors...
  // but no versions of this approach have worked.

  for(int i=0; i < a.length(); ++i){

  // do some function
  }
  // return a new data frame
  return DataFrame::create(_["a"]= a, _["b"]= b, _["c"]= c, _["d"]= d, _["e"]=e);
}

I've been following the Advanced R section on this. The part I'm struggling to grasp is the multiple-vector for loop construction, and how to define my range iterators. Based on my code, is that your interpretation too? Do I need to create an iterator for each vector, or can I simply define one position based on the length of one vector and then apply it to all vectors?

The easiest way for me to move past this is to see an example. Once I see an example of functional code, I'll be able to apply the concepts I've been reading about.

Edit: would it be possible to add some examples like this to the Rcpp documentation? I imagine many people struggle at this step. Considering the data frame is one of the most common R data containers, I think the Rcpp documentation would be greatly strengthened by a couple more data frame examples. The conceptual switch is not trivial at first glance.


Answer 1:


I am not convinced that you will gain performance from going to C++ here. However, if you have a set of vectors of equal length (data.frame guarantees that), then you can simply iterate with one index:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {

  // access the columns
  IntegerVector a = df["a"];
  IntegerVector b = df["b"];
  CharacterVector c = df["c"];
  NumericVector d = df["d"];
  CharacterVector e = df["e"];

  for(int i=0; i < df.nrow(); ++i){
    a(i) += 1;
    b(i) += 2;
    c(i) += "c";
    d(i) += 3;
    e(i) += "e";
  }
  // return a new data frame
  return DataFrame::create(_["a"]= a, _["b"]= b, _["c"]= c, _["d"]= d, _["e"]=e);
}
/*** R
a <- c(0, 2, 4, 6, 8, 10)
b <- c(1, 3, 5, 7, 9, 11)
c <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
d <- c(10.2, 10.2, 4.3, 4.3, 3.4, 7.9)
e <- c("a", "t", "t", "g", "c", "a")

df <- data.frame(a, b, c, d, e)
modifyDataFrame(df)  
*/

Result:

> modifyDataFrame(df)  
   a  b     c    d  e
1  1  3 chr1c 13.2 ae
2  3  5 chr1c 13.2 te
3  5  7 chr1c  7.3 te
4  7  9 chr1c  7.3 ge
5  9 11 chr1c  6.4 ce
6 11 13 chr1c 10.9 ae

Here I am using the nrow() method of the DataFrame class, cf. the Rcpp API. This uses R's C API, just like the length() method. I just find it more logical to use a DataFrame method than to single out one of the vectors to retrieve the length. The result would be the same.
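
For the first sub-question (a row-wise sum returned as a vector), the same single-index idea can be sketched in plain C++. This is only a sketch under assumptions: std::vector stands in for the Rcpp column vectors so that it compiles without R, and rowSums2 is a hypothetical name, not part of Rcpp.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not an Rcpp function): row-wise sum of two columns.
// std::vector<double> stands in for Rcpp::NumericVector in this sketch.
std::vector<double> rowSums2(const std::vector<double>& a,
                             const std::vector<double>& b) {
  // A data.frame guarantees equal column lengths; plain vectors do not,
  // so the assumption is checked explicitly here.
  if (a.size() != b.size()) {
    throw std::invalid_argument("columns must have equal length");
  }
  std::vector<double> out(a.size());
  // One index i addresses "row i" in every column at once.
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = a[i] + b[i];
  }
  return out;
}
```

With the columns a and b from the example data, this would give 1, 5, 9, 13, 17, 21. Inside an exported Rcpp function the loop body should look much the same, with the columns pulled out of the DataFrame as shown above.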

As for a sliding window, I would look into the RcppRoll package first.
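
If a hand-written window is wanted instead, the single index can mark the start of each window. The sketch below is plain C++ under several assumptions: std::vector stands in for an Rcpp column, windowMeans is a hypothetical name, the mean is just a placeholder for "do some function here", and windows that would run past the last row are skipped rather than padded.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Rcpp or RcppRoll): mean of column d
// over every full window of `width` consecutive rows.
std::vector<double> windowMeans(const std::vector<double>& d,
                                std::size_t width) {
  std::vector<double> out;
  if (width == 0 || d.size() < width) {
    return out;  // no full window fits
  }
  // i is the first row of the window; the window covers rows i..i+width-1.
  for (std::size_t i = 0; i + width <= d.size(); ++i) {
    double sum = 0.0;
    // The same j could index any other column to read the same rows there.
    for (std::size_t j = i; j < i + width; ++j) {
      sum += d[j];
    }
    // The output grows one element per window, so its final length
    // does not have to be known in advance.
    out.push_back(sum / static_cast<double>(width));
  }
  return out;
}
```

A width of 4 mirrors the i:(i+3) range in the question's R loop. Growing the output with push_back is one way to approach the fourth sub-question, since each output column can be extended conditionally inside the loop and assembled into a data frame at the end.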



Source: https://stackoverflow.com/questions/54080116/iterate-over-vectors-from-an-imported-dataframe-row-wise
