问题

My original data.table consists of three columns.
site, observation_number and id.

E.g. the following which is all the observations for id = z

|site|observation_number|id
|a   |                 1| z                 
|b   |                 2| z
|c   |                 3| z

Which means that ID z has traveled from a to b to c.

There is no fixed number of sites per id.

I wish to transform the data to an edge list like this

|from |to||id|
|a    | b| z |
|b    | c| z |

mock data

sox <- data.table(site =  c('a','b','c','a','c','c','a','d','e'),
       obsnum =c(1,2,3,1,2,1,2,3,4),
       id     =c('z','z','z','y','y','k','k','k','k'))

The way I am currently doing this, feels convoluted and is very slow (sox has 1.5 mio rows and dt_out has ca. 7.5 mio. rows). I basically use a for loop over observation_number to split the data in to chunks where each ID is only present once (that is - only one journey, to - from). Then I cast data, and rind all the chunks to a new data.table.

dt_out <- data.table()
maksimum = sox[,max(observation_number)]
for (i in 1:maksimum-1) {
  i=1
  mini = i
  maxi = i+1
  sox_t <- sox[observation_number ==maxi | observation_number ==mini, ]
  temp_dt <- dcast(sox_t[id %in% sox_t[, .N, by = id][N>=2]$id,
                             .SD[, list(site, observation_number, a=rep(c('from', 'to')))] ,by=id],
                       id='id', value.var='site', formula=id~a)
  dt_out <- rbind(dt_out, temp_dt)
  i=max
  }

I hope someone can help me optimize this, and preferable create a function where I can input the data.table, the site id, observationnumber id, and the id. For some reason I can't create a function regardless that works.

UPDATE

Using sytem time (and running system time a few times):

                             User - System - Elapsed
make_edgelist (data.table):  5.38     0.00      5.38
Data.table. with shift:     13.96     0.06     14.08 
dplyr, with arrange:         6.06     0.36      6.44

p.s. make_edgelist was updated to order the data.table

make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var   = "rn1") {
    DT[order(get(obsnum_var)),
    list(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}

I was surprised that dplyr (with lead) was almost as fast as make_edgelist and much faster than data.table with shift. I guess this means that dplyr will actually be faster with more complex lead/lags/shift.

Also I find it puzzling - but don't know enough to know if it has any significance, that dplyr used more 'system' time than any of the two data.table solutions.

Input data: 1.5 million rows. Result: 0.6 million rows.

回答1:

Is this what you are looking for?

sox[, .(from = site[-.N], to = site[-1]), by = id]

#    id from to
# 1:  z    a  b
# 2:  z    b  c
# 3:  y    a  c
# 4:  k    c  a
# 5:  k    a  d
# 6:  k    d  e

Wrapped in a function:

make_edgelist <- function(DT, site_var = "site", id_var = "id") {
  DT[, .(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}

Note: This solution assumes the data is already ordered by observation number. To avoid this assumptions add order(obsnum) before the first comma.

回答2:

With dplyr, you can try:

sox %>%
 group_by(id) %>%
 transmute(from = site,
           to = lead(from)) %>%
 na.omit()

  id    from  to   
  <chr> <chr> <chr>
1 z     a     b    
2 z     b     c    
3 y     a     c    
4 k     c     a    
5 k     a     d    
6 k     d     e

As @Sotos noted, it could be useful to arrange the data first:

sox %>%
 arrange(id, obsnum) %>%
 group_by(id) %>%
 transmute(from = site,
           to = lead(from)) %>%
 na.omit()

回答3:

Using data.table, in case it's faster than the dplyr solution above, you have:

sox <- sox[order(id, obsnum)]
sox[, from := shift(site), by = "id"]
sox <- sox[!is.na(from)]
setnames(sox, "site", "to")
sox[, obsnum := NULL]
setcolorder(sox, c("id", "from", "to"))
sox
#>    id from to
#> 1:  k    c  a
#> 2:  k    a  d
#> 3:  k    d  e
#> 4:  y    a  c
#> 5:  z    a  b
#> 6:  z    b  c

来源：https://stackoverflow.com/questions/58372627/how-to-construct-an-edgeliste-from-a-list-of-visited-places-effectively

标签