grep - Match column of one file to column of another using awk when second file column contains commas
I have 2 files. The first is a large file containing variants in genes, with multiple columns separated by tabs. The column containing gene names may contain a single name, or multiple names separated by commas (the gene names in the example are samd11 and noc2l):
1 874816 874816 - t rs200996316 samd11 exonic ensg00000187634 frameshift insertion
1 878331 878331 c t rs148327885 samd11 exonic ensg00000187634 nonsynonymous snv
1 879676 879676 g a rs6605067 noc2l,samd11 utr3 ensg00000187634,ensg00000188976
1 879687 879687 t c rs2839 noc2l,samd11 utr3 ensg00000187634,ensg00000188976
1 881918 881918 g a rs35471880 noc2l exonic ensg00000188976 nonsynonymous snv
1 888659 888659 t c rs3748597 noc2l exonic ensg00000188976 nonsynonymous snv
The second file is a single-column list of gene names, such as this:
evc2
samd11
comt
I want to match the gene names in the second file against the gene column of the first file. I am using awk:
awk -F $'\t' 'BEGIN { while(getline <"secondfile.txt") gene[$0]=1; } gene[$7]' firstfile.txt > newfile.txt
However, this only prints lines where column 7 is an exact match, so it doesn't print the lines containing noc2l,samd11. For the example above, the expected output is the first 4 lines of the first file:
1 874816 874816 - t rs200996316 samd11 exonic ensg00000187634 frameshift insertion
1 878331 878331 c t rs148327885 samd11 exonic ensg00000187634 nonsynonymous snv
1 879676 879676 g a rs6605067 noc2l,samd11 utr3 ensg00000187634,ensg00000188976
1 879687 879687 t c rs2839 noc2l,samd11 utr3 ensg00000187634,ensg00000188976
I still want exact matches, because some of the gene names can be similar. E.g. there may be a gene called samd1, and if I did a fuzzy match on samd1 I would also get samd11 and so on. So I need an exact match that ignores the comma in the gene name column, or treats it as a field delimiter or similar.
Thanks in advance.
$ cat tst.awk
NR==FNR { genes[$0]; next }
{
    split($7,a,/,/)
    for (i in a) {
        if (a[i] in genes) {
            print
            next
        }
    }
}

$ awk -f tst.awk secondfile.txt firstfile.txt
1 874816 874816 - t rs200996316 samd11 exonic ensg00000187634 frameshift insertion
1 878331 878331 c t rs148327885 samd11 exonic ensg00000187634 nonsynonymous snv
1 879676 879676 g a rs6605067 noc2l,samd11 utr3 ensg00000187634,ensg00000188976
1 879687 879687 t c rs2839 noc2l,samd11 utr3 ensg00000187634,ensg00000188976
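To check how the split-based script behaves with a near-miss name such as samd1, it can be run against a small throwaway data set. This is only a minimal sketch: the temporary file names, the shortened rows, the rs1 to rs4 IDs, the samd1 row and the `-F '\t'` option are all assumptions made up for illustration and are not taken from the post.

```shell
#!/bin/sh
# Hypothetical demo data; positions, rs IDs and the samd1 row are invented.
tmp=$(mktemp -d)

# Gene list: one name per line.
printf '%s\n' evc2 samd11 comt > "$tmp/genes.txt"

# Variant file: tab-separated, gene names in column 7.
printf '1\t100\t100\tc\tt\trs1\tsamd11\texonic\n'     >  "$tmp/variants.txt"
printf '1\t200\t200\tg\ta\trs2\tnoc2l,samd11\tutr3\n' >> "$tmp/variants.txt"
printf '1\t300\t300\tt\tc\trs3\tnoc2l\texonic\n'      >> "$tmp/variants.txt"
printf '1\t400\t400\tt\tc\trs4\tsamd1\texonic\n'      >> "$tmp/variants.txt"

result=$(awk -F '\t' '
    NR == FNR { genes[$0]; next }       # first file: remember each gene name
    {
        n = split($7, names, ",")       # break column 7 on commas
        for (i = 1; i <= n; i++)
            if (names[i] in genes) {    # whole-name lookup: samd1 is not samd11
                print $6                # print only the rs ID to keep output short
                next
            }
    }
' "$tmp/genes.txt" "$tmp/variants.txt")

echo "$result"
rm -rf "$tmp"
```

With this data only the rs1 and rs2 rows are printed. The samd1 row is skipped because `in` tests for a whole array key rather than a substring.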
This would also work:
$ cat tst.awk
NR==FNR { genes[$0]; next }
{
    for (gene in genes) {
        if ($7 ~ "(^|,)"gene"(,|$)") {
            print
            next
        }
    }
}
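One caveat with the regex variant is that each gene name is spliced into a dynamic regular expression, so a name containing a metacharacter such as `.` would be interpreted as a pattern rather than literally. A possible alternative, sketched here and not part of the original answers, is to pad the column with commas and use a literal `index()` test. The two-column layout, the `col` variable, the row labels r1 to r5 and the made-up name `a.b` are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical demo data; "a.b" is a made-up name with a regex metacharacter.
tmp=$(mktemp -d)

printf '%s\n' 'a.b' samd11 > "$tmp/genes.txt"

# Two tab-separated columns: a row label and the gene-name column.
printf 'r1\tsamd11\n'       >  "$tmp/rows.txt"
printf 'r2\tnoc2l,samd11\n' >> "$tmp/rows.txt"
printf 'r3\tsamd1\n'        >> "$tmp/rows.txt"
printf 'r4\taxb\n'          >> "$tmp/rows.txt"   # would match /a.b/ as a regex
printf 'r5\tnoc2l,a.b\n'    >> "$tmp/rows.txt"

matched=$(awk -F '\t' -v col=2 '
    NR == FNR { genes[$0]; next }                   # first file: gene names
    {
        padded = "," $col ","                       # e.g. ",noc2l,samd11,"
        for (gene in genes)
            if (index(padded, "," gene ",") > 0) {  # literal test, no regex
                print $1
                next
            }
    }
' "$tmp/genes.txt" "$tmp/rows.txt")

echo "$matched"
rm -rf "$tmp"
```

Here r1, r2 and r5 are printed, while the samd1 and axb rows are not. Like the regex version, this loops over every gene name for each input line, so for a long gene list the split-based lookup is likely to be faster.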