xml parsing - How to extract XML attributes using XPath in Pig?


I want to extract attributes from XML using Pig Latin.

This is a sample of the XML file:

<catalog>
  <book>
    <title test="test1">hadoop defnitive guide</title>
    <author>tom white</author>
    <country>us</country>
    <company>cloudera</company>
    <price>24.90</price>
    <year>2012</year>
  </book>
</catalog>

I used this script, but it didn't work:

REGISTER ./piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

A = LOAD './books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('book') AS (x:chararray);

B = FOREACH A GENERATE XPath(x, 'book/title/@test'), XPath(x, 'book/price');
DUMP B;

The output was (the attribute value is missing from the first field):

(,24.90) 

I hope someone can help me with this. Thanks.

There are 2 bugs in Piggybank's XPath class:

  1. The ignoreNamespace logic breaks searching for XML attributes: https://issues.apache.org/jira/browse/pig-4751

  2. The ignoreNamespace parameter defaults to true and cannot be overridden: https://issues.apache.org/jira/browse/pig-4752

Here is a workaround using XPathAll, with the ignoreNamespace argument (the fourth one) set to false:

XPathAll(x, 'book/title/@test', true, false).$0 AS (test:chararray)

Also, if you still need to ignore namespaces, match elements by local name in the XPath expression itself:

XPathAll(x, '//*[local-name()=\'book\']//*[local-name()=\'title\']/@test', true, false).$0 AS (test:chararray)
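
For context, here is a rough sketch of how the workaround might slot into the script from the question. It is untested and rests on a few assumptions: that your piggybank.jar is recent enough to ship XPathAll, that the class lives next to XPath as org.apache.pig.piggybank.evaluation.xml.XPathAll, and that the relation names A and B are the ones used in the question.

```pig
REGISTER ./piggybank.jar;

-- Assumption: XPathAll sits in the same package as XPath.
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

-- Load each <book> element as one chararray record.
A = LOAD './books.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('book')
    AS (x:chararray);

-- XPathAll returns a tuple of matches; .$0 takes the first one.
-- Arguments: xml, xpath, cache flag, ignoreNamespace flag (false here).
B = FOREACH A GENERATE
        XPathAll(x, 'book/title/@test', true, false).$0 AS (test:chararray),
        XPath(x, 'book/price');

DUMP B;
```

Only the attribute lookup is switched to XPathAll; the element lookup for book/price already worked with XPath in the question, so it is left unchanged.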
