python - Getting name of attributes with Scrapy XPath


I'm trying to get both the keys and the values of the attributes of a tag in an XML file (using Scrapy and XPath).

The tag looks something like:

    <element attr1="value1" attr2="value2" ...>

I don't know the keys "attr1", "attr2" and so on, and they can change between two elements. I didn't figure out how to get both keys and values with XPath. Is there any other good practice for doing that?

Short version

    >>> for element in selector.xpath('//element'):
    ...     attributes = []
    ...     # loop over all attribute nodes of the element
    ...     for index, attribute in enumerate(element.xpath('@*'), start=1):
    ...         # use XPath's name() string function on each attribute,
    ...         # using its position
    ...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
    ...         # Scrapy's extract() on an attribute returns its value
    ...         attributes.append((attribute_name, attribute.extract()))
    ...
    >>> attributes # list of (attribute name, attribute value) tuples
    [(u'attr1', u'value1'), (u'attr2', u'value2')]
    >>> dict(attributes)
    {u'attr2': u'value2', u'attr1': u'value1'}

Long version

XPath has a name(node-set?) function to get node names (an attribute is a node, an attribute node):

The name function returns a string containing a QName representing the expanded-name of the node in the argument node-set that is first in document order. (...) If the argument is omitted, it defaults to a node-set with the context node as its only member.

(Source: http://www.w3.org/TR/xpath/#function-name)

    >>> import scrapy
    >>> selector = scrapy.Selector(text='''
    ...     <html>
    ...     <element attr1="value1" attr2="value2">some text</element>
    ...     </html>''')
    >>> selector.xpath('//element').xpath('name()').extract()
    [u'element']

(Here, I chained name() on the result of the //element selection, to apply the function to all selected element nodes. It's a handy feature of Scrapy selectors.)

One would like to do the same with attribute nodes, right? But it does not work:

    >>> selector.xpath('//element/@*').extract()
    [u'value1', u'value2']
    >>> selector.xpath('//element/@*').xpath('name()').extract()
    []

Note: I don't know if it's a limitation of lxml/libxml2, which Scrapy uses under the hood, or if the XPath specs disallow it. (I don't see why they would.)

What you can do, though, is use the name(node-set) form, i.e. with a non-empty node-set as parameter. If you read closely the part of the XPath 1.0 specs I pasted above, as with other string functions, name(node-set) only takes into account the first node in the node-set (in document order):

    >>> selector.xpath('//element').xpath('@*').extract()
    [u'value1', u'value2']
    >>> selector.xpath('//element').xpath('name(@*)').extract()
    [u'attr1']

Attribute nodes also have positions, so you can loop over all attributes by their position. Here we have 2 (the result of count(@*) on the context node):

    >>> for element in selector.xpath('//element'):
    ...     print(element.xpath('count(@*)').extract_first())
    ...
    2.0
    >>> for element in selector.xpath('//element'):
    ...     for i in range(1, 2+1):
    ...         print(element.xpath('@*[%d]' % i).extract_first())
    ...
    value1
    value2

Now, you can guess what we can do: call name() for each @*[i]

    >>> for element in selector.xpath('//element'):
    ...     for i in range(1, 2+1):
    ...         print(element.xpath('name(@*[%d])' % i).extract_first())
    ...
    attr1
    attr2

If you put all this together, and assume that @* will get you the attributes in document order (not said in the XPath 1.0 specs I think, but it's what I see happening with lxml), you end up with this:

    >>> attributes = []
    >>> for element in selector.xpath('//element'):
    ...     for index, attribute in enumerate(element.xpath('@*'), start=1):
    ...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
    ...         attributes.append((attribute_name, attribute.extract()))
    ...
    >>> attributes
    [(u'attr1', u'value1'), (u'attr2', u'value2')]
    >>> dict(attributes)
    {u'attr2': u'value2', u'attr1': u'value1'}
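As for the "other practice" part of the question: if XPath isn't a hard requirement, one possible alternative is to skip name() entirely and read each element's attribute mapping from a parsed tree. The sketch below is not Scrapy code. It swaps in the standard library's xml.etree.ElementTree, and the sample document, the XML constant and the attributes_per_element helper are made up for illustration. It keeps one list of pairs per element, which may be handier when the attribute names change from one element to the next.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two <element> tags whose attribute names differ.
XML = """
<root>
    <element attr1="value1" attr2="value2">some text</element>
    <element other="x">more text</element>
</root>
"""


def attributes_per_element(xml_text, tag="element"):
    """Return one list of (name, value) tuples per matching element."""
    root = ET.fromstring(xml_text.strip())
    # Element.items() yields the attributes as (name, value) pairs.
    # Sorting by name keeps the result deterministic across Python versions.
    return [sorted(element.items()) for element in root.iter(tag)]


pairs = attributes_per_element(XML)
for element_pairs in pairs:
    print(dict(element_pairs))
```

Each inner list can be passed to dict() exactly like the attributes list above. Elements without attributes simply produce an empty list.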
