grep - Match a column of one file to a column of another using awk when one column contains commas


I have 2 files. The first is a large file containing variants in genes, with multiple tab-separated columns. The column containing gene names may contain a single name, or multiple names separated by commas (the gene names in this example are samd11 and noc2l):

1   874816  874816  -   t   rs200996316 samd11  exonic  ensg00000187634 frameshift insertion
1   878331  878331  c   t   rs148327885 samd11  exonic  ensg00000187634 nonsynonymous snv
1   879676  879676  g     rs6605067   noc2l,samd11    utr3    ensg00000187634,ensg00000188976
1   879687  879687  t   c   rs2839  noc2l,samd11    utr3    ensg00000187634,ensg00000188976
1   881918  881918  g     rs35471880  noc2l   exonic  ensg00000188976 nonsynonymous snv
1   888659  888659  t   c   rs3748597   noc2l   exonic  ensg00000188976 nonsynonymous snv

The second file is a single-column list of gene names, such as this:

evc2
samd11
comt

I want to match the gene names in the second file against the gene column of the first file. I am using awk:

awk -F $'\t' 'BEGIN { while(getline <"secondfile.txt") gene[$0]=1; } gene[$7]' firstfile.txt > newfile.txt

However, this only prints exact matches: it doesn't print the lines containing noc2l,samd11. For the above example, the expected output would be the first 4 lines of the first file:

1   874816  874816  -   t   rs200996316 samd11  exonic  ensg00000187634 frameshift insertion
1   878331  878331  c   t   rs148327885 samd11  exonic  ensg00000187634 nonsynonymous snv
1   879676  879676  g     rs6605067   noc2l,samd11    utr3    ensg00000187634,ensg00000188976
1   879687  879687  t   c   rs2839  noc2l,samd11    utr3    ensg00000187634,ensg00000188976

I still want exact matches, because some of the gene names can be similar. For example, there may be a gene called samd1, and a fuzzy match on samd1 would also hit samd11, and so on. So I need an exact match that ignores the comma in the gene name column, or treats it as a field delimiter or similar.

Thanks in advance.

$ cat tst.awk
# First file (gene list): store each name as an array key.
NR==FNR { genes[$0]; next }
{
    # Second file: split the gene column on commas and test each name exactly.
    split($7,a,/,/)
    for (i in a) {
        if (a[i] in genes) {
            print
            next
        }
    }
}

$ awk -f tst.awk secondfile.txt firstfile.txt
1   874816  874816  -   t   rs200996316 samd11  exonic  ensg00000187634 frameshift insertion
1   878331  878331  c   t   rs148327885 samd11  exonic  ensg00000187634 nonsynonymous snv
1   879676  879676  g     rs6605067   noc2l,samd11    utr3    ensg00000187634,ensg00000188976
1   879687  879687  t   c   rs2839  noc2l,samd11    utr3    ensg00000187634,ensg00000188976
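For a quick check that splitting the gene column really gives exact matches, here is a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the file names genes.txt and variants.tsv, the shortened four-column layout (so the gene column is $4 rather than $7), and the last row with the near-miss name samd1, which is made up and should not be printed.

```shell
# Hypothetical sample files in a temporary directory (names are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' evc2 samd11 comt > "$tmpdir/genes.txt"

# Shortened tab-separated rows: chrom, pos, id, gene(s). Gene column is $4 here.
printf '1\t874816\trs200996316\tsamd11\n'     >  "$tmpdir/variants.tsv"
printf '1\t879676\trs6605067\tnoc2l,samd11\n' >> "$tmpdir/variants.tsv"
printf '1\t881918\trs35471880\tnoc2l\n'       >> "$tmpdir/variants.tsv"
printf '1\t900000\tmade_up_id\tsamd1\n'       >> "$tmpdir/variants.tsv"  # made-up near-miss row

matched=$(awk -F '\t' '
    NR == FNR { genes[$0]; next }        # first file: remember each gene name
    {
        n = split($4, names, ",")        # break the gene column on commas
        for (i = 1; i <= n; i++)
            if (names[i] in genes) {     # exact lookup, no substring matching
                print
                next
            }
    }
' "$tmpdir/genes.txt" "$tmpdir/variants.tsv")

printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

Only the samd11 and noc2l,samd11 rows are printed. Setting the field separator to a tab also keeps multi-word values such as "frameshift insertion" in a single field, which may matter if columns after the gene name are used.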

This would also work:

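$ cat tst.awk
# First file (gene list): store each name as an array key.
NR==FNR { genes[$0]; next }
{
    # Second file: test each gene name as a whole comma-delimited entry of $7.
    for (gene in genes) {
        if ($7 ~ "(^|,)"gene"(,|$)") {
            print
            next
        }
    }
}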
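One thing to keep in mind with the regex version is that each gene name is interpreted as a regular expression, so a name containing a metacharacter such as "." could match more than intended. The sketch below is an assumption-laden illustration, not part of the answer above: the name sam.11 is made up, and the index() comparison is a swapped-in alternative that wraps both the field and the name in commas so the test is a literal string comparison.

```shell
# Made-up inputs: "sam.11" contains a dot to show the regex pitfall.
gene='sam.11'
field='noc2l,samd11'

# Dynamic regex: "." matches any character, so sam.11 matches samd11.
regex_hit=$(awk -v g="$gene" -v f="$field" 'BEGIN {
    hit = (f ~ "(^|,)" g "(,|$)") ? "yes" : "no"
    print hit
}')

# Literal comparison: surround both sides with commas and use index().
literal_hit=$(awk -v g="$gene" -v f="$field" 'BEGIN {
    hit = (index("," f ",", "," g ",") > 0) ? "yes" : "no"
    print hit
}')

echo "regex: $regex_hit, literal: $literal_hit"
```

If the gene list is known to contain only letters and digits, the regex form behaves the same as the literal one.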
