I want to extract attributes from XML using Pig Latin.
This is a sample of the XML file:
<catalog>
  <book>
    <title test="test1">hadoop defnitive guide</title>
    <author>tom white</author>
    <country>us</country>
    <company>cloudera</company>
    <price>24.90</price>
    <year>2012</year>
  </book>
</catalog>
I used this script, but it didn't work:
REGISTER ./piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD './books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('book') AS (x:chararray);
B = FOREACH A GENERATE XPath(x, 'book/title/@test'), XPath(x, 'book/price');
DUMP B;
The output was:
(,24.90)
I hope someone can help me with this. Thanks.
There are 2 bugs in piggybank's XPath class:
The ignoreNamespace logic breaks searches for XML attributes: https://issues.apache.org/jira/browse/pig-4751
The ignoreNamespace parameter defaults to true and cannot be overridden: https://issues.apache.org/jira/browse/pig-4752
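Here is a workaround using XPathAll, passing false as the last argument to turn off ignoreNamespace:
XPathAll(x, 'book/title/@test', true, false).$0 AS (test:chararray)
Also, if you still need to ignore namespaces, match on local-name() in the XPath expression itself:
XPathAll(x, '//*[local-name()=\'book\']//*[local-name()=\'title\']/@test', true, false).$0 AS (test:chararray)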
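For context, this is roughly how the first XPathAll expression would fit into the script from the question. It is only a sketch: it assumes piggybank.jar contains org.apache.pig.piggybank.evaluation.xml.XPathAll, that its third and fourth arguments are the cache and ignoreNamespace flags, and that an alias is defined for it (the alias name XPathAll is my own choice). I have not verified the output against every Pig version.

```pig
REGISTER ./piggybank.jar
-- Aliases for the two piggybank UDFs; the alias names are arbitrary.
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

-- Load each book element as one chararray record.
A = LOAD './books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('book') AS (x:chararray);

-- XPathAll returns a tuple of matches; .$0 takes the first match.
-- Third argument: cache flag (assumed). Fourth argument: ignoreNamespace = false (assumed).
B = FOREACH A GENERATE
        XPathAll(x, 'book/title/@test', true, false).$0 AS (test:chararray),
        XPath(x, 'book/price') AS price;
DUMP B;
```

The price lookup stays on the plain XPath UDF because element lookups already worked in the question's output; only the attribute lookup needs the workaround.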