r - store a vector in each cell by group -


hi i'm searching way store vector in each cell, here's sample dataset.

dt1 <- data.table(id = rep(1:2, each = 3), set.a = c(5,1,3,10,4,7)) dt1    id set.a 1:  1     5 2:  1     1 3:  1     3 4:  2    10 5:  2     4 6:  2     7 

now want method convert dt1 this:

   id    set.a 1:  1    5,1,3 2:  2 10, 4, 7 

this question may elementary did consume me hour. in fact it's part of job. suppose have dataset looks like:

dt2 <- data.table(id = rep(1:2, each = 3), set.b = c(3,5,9,8,10,4)) dt    id set.b 1:  1     3 2:  1     5 3:  1     9 4:  2     8 5:  2    10 6:  2     4 

what i'm looking how compute overlap of each id, i.e, overlap variable indicate intersetion of set.a , set.b each id looks like

   id overlap 1:  1     5,3 2:  2   10, 4 

to implement this, plan first aggregate set variable each id vector , calculate intersection, failed store vector in each cell, mentioned @ beginning. there can me out? in advance.

update

i experimented both akrun's , frank's way , found data.table merge way more efficient. here little benchmark, both of :)

dt1 <- data.table(id = rep(1:10000, each = 10), set1 = sample(letters[1:24], 100000, replace = t)) dt2 <- data.table(id = rep(1:10000, each = 10), set2 = sample(letters[1:24], 100000, replace = t))  system.time({ re1 <- rbindlist(list(dt1, dt2), idcol=true)[,   .(overlap=tostring(intersect(set1[.id==1], set1[.id==2]))) , =id] })  # 0.25s  system.time({ re2 <- dt1[dt2, on = c(id = "id", set1 = "set2"), nomatch = 0][, .(ovlp = list(unique(set1))), = "id"] })   # 0.07s  system.time({ dt3 <- dt1[, .(set1 = list(set1)), = id] dt4 <- dt2[, .(set2 = list(set2)), = id] re3 <- dt3[dt4, nomatch = 0, on = "id"][, .(ov = list(intersect(unlist(set1), unlist(set2)))), = id] })   # 0.21s 

i use merge:

res <- merge(   dt1[, .(a = list(set.a)), by=id], # <- answer first question   dt2[, .(b = list(set.b)), by=id],    by="id" )[,   overlap := .(map(intersect, a, b)) ]  #    id               b overlap # 1:  1    5,1,3    3,5,9     5,3 # 2:  2 10, 4, 7  8,10, 4   10, 4 

if speed concern, suspect faster:

dt1[dt2, on = c(id = "id", set.a = "set.b"), nomatch=0][,   .(overlap = list(set.a)) , by=id]  #    id overlap # 1:  1     3,5 # 2:  2   10, 4 

dt1[dt2,...] kind of merge/join.


Comments