I'm trying to get both the keys and the values of the attributes of a tag in an XML file (using Scrapy and XPath).
The tag looks like:
<element attr1="value1" attr2="value2" ...>
I don't know the keys "attr1", "attr2" and so on, and they can change between 2 elements. I didn't figure out how to get both keys and values with XPath. Is there any other good practice for doing that?
Short version
>>> for element in selector.xpath('//element'):
...     attributes = []
...     # loop over all attribute nodes of the element
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         # use XPath's name() string function on each attribute,
...         # using their position
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         # Scrapy's extract() on an attribute returns its value
...         attributes.append((attribute_name, attribute.extract()))
...
>>> attributes # list of (attribute name, attribute value) tuples
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>>
Long version
XPath has a name(node-set?) function to get node names (an attribute is a node, an attribute node):
The name function returns a string containing a QName representing the expanded-name of the node in the argument node-set that is first in document order. (...) If the argument is omitted, it defaults to a node-set with the context node as its only member.
(source: http://www.w3.org/tr/xpath/#function-name)
>>> import scrapy
>>> selector = scrapy.Selector(text='''
... <html>
... <element attr1="value1" attr2="value2">some text</element>
... </html>''')
>>> selector.xpath('//element').xpath('name()').extract()
[u'element']
(Here, I chained name() on the result of the //element selection, to apply the function to all selected element nodes. This is a handy feature of Scrapy selectors.)
One would like to do the same with attribute nodes, right? But it does not work:
>>> selector.xpath('//element/@*').extract()
[u'value1', u'value2']
>>> selector.xpath('//element/@*').xpath('name()').extract()
[]
>>>
Note: I don't know if it's a limitation of lxml/libxml2, which Scrapy uses under the hood, or if the XPath specs disallow it. (I don't see why they would.)
What you can do, though, is use the name(node-set) form, i.e. with a non-empty node-set as parameter. If you read closely the part of the XPath 1.0 specs I pasted above, you'll see that, as with other string functions, name(node-set) only takes into account the first node in the node-set (in document order):
>>> selector.xpath('//element').xpath('@*').extract()
[u'value1', u'value2']
>>> selector.xpath('//element').xpath('name(@*)').extract()
[u'attr1']
>>>
Attribute nodes also have positions, so you can loop over all attributes by their position. Here we have 2 (the result of count(@*) on the context node):
>>> for element in selector.xpath('//element'):
...     print element.xpath('count(@*)').extract_first()
...
2.0
>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print element.xpath('@*[%d]' % i).extract_first()
...
value1
value2
>>>
Now, you can guess what we can do: call name() for each @*[i]:
>>> for element in selector.xpath('//element'):
...     for i in range(1, 2+1):
...         print element.xpath('name(@*[%d])' % i).extract_first()
...
attr1
attr2
>>>
If you put all this together, and assume that @* will get you the attributes in document order (not said in the XPath 1.0 specs I think, but it's what I see happening with lxml), you end up with this:
>>> attributes = []
>>> for element in selector.xpath('//element'):
...     for index, attribute in enumerate(element.xpath('@*'), start=1):
...         attribute_name = element.xpath('name(@*[%d])' % index).extract_first()
...         attributes.append((attribute_name, attribute.extract()))
...
>>> attributes
[(u'attr1', u'value1'), (u'attr2', u'value2')]
>>> dict(attributes)
{u'attr2': u'value2', u'attr1': u'value1'}
>>>
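Since the question mentions that attribute names can differ between elements, it may help to collect one dict per element rather than a single shared list. The following is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors (so it is not the XPath name() technique described above), and the sample document and the helper name attributes_per_element are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: two elements with different attribute sets.
XML = """<root>
  <element attr1="value1" attr2="value2">some text</element>
  <element other="x">more text</element>
</root>"""


def attributes_per_element(xml_text, tag):
    """Return one {name: value} dict per element with the given tag."""
    root = ET.fromstring(xml_text)
    # Element.attrib is a plain dict mapping attribute names to values;
    # copy it so callers do not mutate the parsed tree.
    return [dict(element.attrib) for element in root.iter(tag)]


result = attributes_per_element(XML, "element")
print(result)
```

This can serve as a cross-check of what the XPath-based loop returns, since neither approach needs the attribute names to be known in advance.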