I am having a strange issue in R. Some actions produce NA records in a data frame that aren't "real" NAs: they don't have a row in the original dataset, the row ID is something strange like NA.123 instead of a real row number, and they match a test of ==1.
It's hard to describe what is happening, so I'll let the heavily commented code do most of the talking. The data file referenced here is a small (187 K), publicly available file from the NHANES 2005 datasets, available at http://wwwn.cdc.gov/nchs/nhanes/2005-2006/cot_d.xpt if anyone wants to try to replicate the problem.
I am creating a yes/no variable to assess whether a cotinine blood test is positive or negative, using a cutoff of 10 to define a positive test. In the code below I do this two different ways, creating "cotpos1" and "cotpos2", to illustrate some of what I have found when troubleshooting this issue.
For the purpose of this post, a "good NA" is one that should be NA because the original blood test results are missing, and a "bad NA" is one of the mystery rows that wasn't part of the original data: every value is NA (including SEQN, which isn't missing for any row in the original data), the row number shows as something like NA.123, and the NAs in each column match ==1.
This dataset uses a field called SEQN to identify every record. At the start, there are no records without a SEQN, so when the "bad NAs" appear later and their SEQNs are NA (as is everything else in the row), it suggests to me that rows are being added.
There are other ways I can do this that don't produce "bad NAs", such as using ifelse() or using a package that recodes, so my question isn't how to make this work. It's "why do the methods used in the code below produce these strange NA.123 rows?"
    library(foreign)  # to open SAS XPT files

    # read in the data file
    testdata <- read.xport('cot_d.xpt')

    #################  cotpos1, set to 0 or 1  #################
    testdata$cotpos1[testdata$LBXCOT >= 10] <- 1  # positive cotinine test
    testdata$cotpos1[testdata$LBXCOT < 10] <- 0   # negative cotinine test

    # these have NAs that match ==1
    testdata$cotpos1[testdata$cotpos1==1]
    # the bad NAs have no SEQN, and row numbers like NA.988
    testdata[testdata$cotpos1==1,c("SEQN","cotpos1")]
    # the good NAs (the ones that are NA because LBXCOT is NA, and that match is.na())
    # have a SEQN and real row numbers
    testdata[is.na(testdata$cotpos1),c("SEQN","cotpos1")]

    #################  cotpos2, initialized to 0  #################
    testdata$cotpos2 <- 0                         # assume negative until found otherwise
    testdata$cotpos2[testdata$LBXCOT >= 10] <- 1  # positive cotinine test

    # these 3 tests show that we have no "bad NAs" at this point
    testdata$cotpos2[testdata$cotpos2==1]                  # no NAs match ==1
    testdata[testdata$cotpos2==1,c("SEQN","cotpos2")]      # no lines with missing SEQN or strange row IDs like NA.988
    testdata[is.na(testdata$cotpos2),c("SEQN","cotpos2")]  # no NAs at all, because we initialized to 0

    # let's try finding the "good NA"s and setting them to NA
    # (since we initialized to 0, that is not accurate if the blood test results are missing)
    testdata$cotpos2[is.na(testdata$LBXCOT)] <- NA

    # re-run the 3 tests: now they show bad NAs
    testdata$cotpos2[testdata$cotpos2==1]                  # there are NAs that match ==1
    testdata[testdata$cotpos2==1,c("SEQN","cotpos2")]      # there are lines with NA SEQN values and strange row IDs like NA.988
    testdata[is.na(testdata$cotpos2),c("SEQN","cotpos2")]  # these are the "good NAs" only; the bad ones don't show up here
Again, there are other ways to do this that don't produce "bad NAs", such as using ifelse() or a package that recodes, so my question isn't how to make this work. It's "why do the methods used in the code above produce these strange NA.988 rows?"
Further information in response to BondedDust: Thank you for the reply. Can you please clarify which quirk of [] you are referring to?
I am aware of the quirk where, if you feed it an NA, you get an NA row back, e.g.:
    b = testdata$cotpos1==1
    b
    testdata[b,c("SEQN","cotpos1")]
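Then anywhere that b is NA, you should expect the last line to return an NA row. Is that the one you are referring to? Unfortunately, with this code, the problem is that the weird NA rows are showing up in places where b wasn't NA, so that quirk does not explain it.
Here are the last lines of b: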
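For concreteness, here is a minimal sketch of that quirk on a toy data frame. The data frame `df`, its column `cot`, and all of its values are made up for illustration; they are not taken from the NHANES file.

```r
# Toy stand-in for testdata (hypothetical values, not the NHANES data)
df <- data.frame(SEQN = 101:106,
                 cot  = c(25, NA, 3, 0.5, NA, 40))

# Comparing against NA gives NA, so the logical index has three states
b <- df$cot >= 10            # TRUE NA FALSE FALSE NA TRUE

# TRUE keeps the row, FALSE drops it, and each NA yields a row of all NAs
res <- df[b, ]
print(res)

# Row names of the all-NA rows start with "NA", like the NA.988 rows above
print(rownames(res))
```

Only four rows come back here: two real ones and two all-NA ones. The FALSE positions leave no trace in the result.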
    [8725]  TRUE    NA FALSE FALSE    NA  TRUE FALSE FALSE FALSE FALSE FALSE    NA
    [8737] FALSE    NA FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
    [8749]  TRUE FALSE  TRUE FALSE FALSE
Here are the last lines of testdata[b,c("SEQN","cotpos1")]:
    8711   41422       1
    NA.986    NA      NA
    NA.987    NA      NA
    8722   41437       1
    8725   41440       1
    NA.988    NA      NA
    NA.989    NA      NA
    8730   41447       1
    NA.990    NA      NA
    NA.991    NA      NA
    8742   41461       1
    8748   41468       1
    8749   41469       1
    8751   41472       1
The strange NAs appear to be showing up in places where b is not NA.
Final edit: BondedDust's reply is correct. When I was saying that b and the strange NAs don't match (above), I was failing to account for the fact that [] doesn't print the rows corresponding to FALSE. Once you take the FALSEs out, they match perfectly.
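A small sketch of that check, again on made-up data (the data frame `df` and the helper vector `kept` are hypothetical names, not part of the original code): drop the FALSE entries from b and line up what remains against the rows that [] returned.

```r
# Toy stand-in for testdata (hypothetical values)
df <- data.frame(SEQN    = 101:108,
                 cotpos1 = c(1, NA, 0, 0, NA, 1, 0, 1))

b <- df$cotpos1 == 1                 # TRUE NA FALSE FALSE NA TRUE FALSE TRUE

# Keep only the TRUE and NA entries of b: these are the positions
# for which [ returns a row at all
kept <- b[is.na(b) | b]

res <- df[b, c("SEQN", "cotpos1")]

# Side by side: every NA left in b lines up with an all-NA row
print(data.frame(b = kept, row_id = rownames(res), SEQN = res$SEQN))
```

With the FALSEs removed, the NA entries of b and the all-NA rows sit in exactly the same positions.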
If you look at the values of testdata$cotpos2 you see:
    > table( testdata$cotpos2==1, useNA="always")

    FALSE  TRUE  <NA>
     6346  1415   992
Read the help page for the "[" function. It bears reading 10 times. You should find the section that describes the behavior of "[" when given an NA value. Understanding these rules and subtleties is key to effective data management in R. (I would have designed it differently with respect to the handling of NA values.)
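As an illustration of those rules, the sketch below shows three common ways to subset so that an NA in the comparison is treated as "not selected" rather than returned as an all-NA row. It uses a made-up data frame (`df` and its values are hypothetical), and it is only one way to sidestep the behavior, not a statement about how the original code ought to be written.

```r
# Toy stand-in for testdata (hypothetical values)
df <- data.frame(SEQN    = 101:106,
                 cotpos2 = c(1, NA, 0, 0, NA, 1))

# Plain logical indexing: each NA in the index yields an all-NA row
with_na <- df[df$cotpos2 == 1, ]

# which() returns only the positions that are TRUE, so NAs are dropped
via_which <- df[which(df$cotpos2 == 1), ]

# %in% never returns NA: NA %in% 1 is FALSE
via_in <- df[df$cotpos2 %in% 1, ]

# subset() treats NA in the condition as FALSE
via_subset <- subset(df, cotpos2 == 1)

print(nrow(with_na))     # includes the all-NA rows
print(nrow(via_which))   # real matches only
```

All three alternatives return the same two real rows, while the plain logical index returns those two plus one all-NA row for each NA in the comparison.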