indexing - Strange "NA.123" rows in an R data frame


I am having a strange issue with R. Some actions produce NA records in a data frame that aren't "real" NAs: they don't correspond to a row in the original dataset, the row ID shows a strange NA.123 instead of a real row number, and they match a test of ==1.

It's hard to describe what is happening, so I'll let the heavily commented code do the talking. The data file referenced here is a small (187 K), publicly available file from the NHANES 2005 datasets, available at http://wwwn.cdc.gov/nchs/nhanes/2005-2006/cot_d.xpt if anyone wants to try to replicate the problem.

I am creating a yes/no variable to assess whether a cotinine blood test is positive or negative, using a cutoff of 10 to define a positive test. In the code below I do this two different ways, creating "cotpos1" and "cotpos2", to illustrate some of what I have found while troubleshooting this issue.

For the purpose of this post, a "good NA" is one that should be NA because the original blood test result is missing, and a "bad NA" is one of the mystery rows that wasn't part of the original data: every value is NA (including SEQN, which isn't missing in any row of the original data), the row number shows as NA.123, and the NAs in each column match ==1.

This dataset uses a field called SEQN to identify every record. At the start, there are no records without a SEQN, so when the "bad NAs" appear later and their SEQNs are NA (as is everything else in the row), it suggests to me that rows are being added.

There are other ways I can do this that don't produce "bad NAs", such as using ifelse() or a package that recodes, so my question isn't how to make this work. It's "why do the methods used in the code below produce the strange NA.123 rows?"

    library(foreign) # open SAS XPT files

    # read in the data file
    testdata <- read.xport('cot_d.xpt')

    ################# cotpos1, set to 0 or 1 #################

    testdata$cotpos1[testdata$LBXCOT >= 10] <- 1 # positive cotinine test
    testdata$cotpos1[testdata$LBXCOT < 10] <- 0 # negative cotinine test

    testdata$cotpos1[testdata$cotpos1==1] # we have NAs that match ==1
    testdata[testdata$cotpos1==1,c("SEQN","cotpos1")] # the bad NAs have no SEQN, and row numbers like NA.988
    testdata[is.na(testdata$cotpos1),c("SEQN","cotpos1")] # the good NAs (NA because LBXCOT is NA, and they match is.na()) have a SEQN and row numbers

    ################# cotpos2, initialized to 0 #################

    testdata$cotpos2 <- 0 # assume negative until found otherwise
    testdata$cotpos2[testdata$LBXCOT >= 10] <- 1 # positive cotinine test

    # these 3 tests show that we have no "bad NAs" at this point
    testdata$cotpos2[testdata$cotpos2==1] # no NAs match ==1
    testdata[testdata$cotpos2==1,c("SEQN","cotpos2")] # no lines with missing SEQN values or strange row IDs like NA.988
    testdata[is.na(testdata$cotpos2),c("SEQN","cotpos2")] # no NAs either, because we initialized to 0

    # let's try finding the "good NAs" and setting them to NA
    # (since we initialized to 0, that's not accurate if the blood test results are missing)
    testdata$cotpos2[is.na(testdata$LBXCOT)] <- NA

    # re-run the 3 tests, which now show bad NAs
    testdata$cotpos2[testdata$cotpos2==1] # there are NAs that match ==1
    testdata[testdata$cotpos2==1,c("SEQN","cotpos2")] # there are lines with NA SEQN values and strange row IDs like NA.988
    testdata[is.na(testdata$cotpos2),c("SEQN","cotpos2")] # these are the "good NAs" only, the bad ones don't show up here

Again, there are other ways I can do this that don't produce "bad NAs", such as using ifelse() or a package that recodes, so my question isn't how to make this work. It's "why do the methods used in the code above produce the strange NA.988 rows?"

Further information in response to BondedDust: Thank you for your reply. Can you please clarify which quirk of [] you are referring to?

I am aware of the quirk that if you feed it an NA, you get an NA row, e.g.:

    b = testdata$cotpos1==1
    b
    testdata[b,c("SEQN","cotpos1")]

Then anywhere b is NA, I should expect the last line to return an NA row. Is that the one you are referring to? Unfortunately, in my code, the problem is weird NA rows showing up in places where b wasn't NA, so this quirk does not explain it.

Here are the last lines of b:

    [8725]  TRUE    NA FALSE FALSE    NA  TRUE FALSE FALSE FALSE FALSE FALSE    NA
    [8737] FALSE    NA FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
    [8749]  TRUE FALSE  TRUE FALSE FALSE

Here are the last lines of testdata[b,c("SEQN","cotpos1")]:

    8711   41422       1
    NA.986    NA      NA
    NA.987    NA      NA
    8722   41437       1
    8725   41440       1
    NA.988    NA      NA
    NA.989    NA      NA
    8730   41447       1
    NA.990    NA      NA
    NA.991    NA      NA
    8742   41461       1
    8748   41468       1
    8749   41469       1
    8751   41472       1

The strange NAs appear to be showing up in places where b is not NA.

Final edit: BondedDust's reply is correct. When I said that b and the strange NAs don't match (above), I was failing to account for the fact that [] doesn't return rows corresponding to FALSE. Once you take the FALSEs out, they match perfectly.
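The match can be checked on a small scale. The sketch below uses a made-up five-row data frame (the values and the names toy, b, rows, and kept are illustrative assumptions, not taken from the NHANES file) and shows that, once the FALSE entries of b are dropped, the remaining NA entries line up one-to-one with the all-NA rows:

```r
# Toy stand-in for the real data; all values here are invented for illustration
toy <- data.frame(SEQN   = c(101, 102, 103, 104, 105),
                  cotpos = c(1, NA, 0, NA, 1))

b <- toy$cotpos == 1   # TRUE NA FALSE NA TRUE

# Logical indexing: TRUE keeps the row, FALSE drops it,
# and NA yields a row in which every column is NA
rows <- toy[b, c("SEQN", "cotpos")]
print(rows)

# Drop only the FALSE entries of b, keeping TRUE and NA
kept <- b[is.na(b) | b]

# The NA positions in what is left of b coincide with the all-NA rows
print(identical(is.na(kept), is.na(rows$SEQN)))
```

In the real output above, the same reasoning accounts for NA.988 and NA.989 sitting between rows 8725 and 8730: they come from the NA entries of b at positions 8726 and 8729, with the FALSE entries in between simply not shown.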

If you look at the values of testdata$cotpos2 you see:

    > table( testdata$cotpos2==1, useNA="always")

    FALSE  TRUE  <NA> 
     6346  1415   992 

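Read the help page for the "[" function. It bears reading 10 times. You should find the section that describes the behavior of "[" when it is given an NA value: each NA in a logical index returns a row of NAs rather than being dropped, which is where the 992 "bad NA" rows come from. Understanding these rules and subtleties is key to effective data management in R. (I would have designed it differently with respect to the handling of NA values.)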
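As a rough illustration of that behavior and of a few common ways around it, here is a sketch on made-up data (the toy data frame, its values, and the result names are assumptions for illustration only):

```r
# Invented values; two missing test results stand in for the "good NAs"
toy <- data.frame(SEQN   = c(101, 102, 103, 104, 105),
                  LBXCOT = c(12.5, NA, 0.3, NA, 48))
toy$cotpos <- as.numeric(toy$LBXCOT >= 10)   # 1 NA 0 NA 1

# A logical index containing NA returns one all-NA row per NA
with_na_rows <- toy[toy$cotpos == 1, ]

# Three alternatives that return only the rows where the test is TRUE
via_which  <- toy[which(toy$cotpos == 1), ]   # integer positions of TRUE only
via_in     <- toy[toy$cotpos %in% 1, ]        # %in% gives FALSE for NA, never NA
via_subset <- subset(toy, cotpos == 1)        # subset() treats NA as FALSE

print(nrow(with_na_rows))
print(via_which)
```

which() returns only the positions where the condition is TRUE, so both FALSE and NA are dropped before "[" ever sees them. Which of the three alternatives to prefer is mostly a matter of style.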

