I'm getting an error when running flatMap() on a list of objects of a class. It works fine with regular Python data types (int, list, etc.), but I'm facing the error when the list contains objects of a class. Here's the entire code:
    from pyspark import SparkContext

    sc = SparkContext("local", "WordCountBySparkKeyword")

    def func(x):
        if x == 2:
            return [2, 3, 4]
        return [1]

    rdd = sc.parallelize([2])
    rdd = rdd.flatMap(func)  # rdd.collect() has [2, 3, 4]
    rdd = rdd.flatMap(func)  # rdd.collect() has [2, 3, 4, 1, 1]

    print rdd.collect()  # gives expected output
    # marker 1

    # the class I'm defining
    class node(object):
        def __init__(self, value):
            self.value = value

        # representation, for printing a node
        def __repr__(self):
            return self.value

    def foo(x):
        if x.value == 2:
            return [node(2), node(3), node(4)]
        return [node(1)]

    rdd = sc.parallelize([node(2)])
    rdd = rdd.flatMap(foo)  # marker 2

    print rdd.collect()  # should contain the nodes' values [2, 3, 4, 1, 1]

The code works fine up to marker 1 (commented in the code). The problem arises after marker 2. The specific error message I'm getting is:

    AttributeError: 'module' object has no attribute 'node'

How do I resolve this error?
I'm working on Ubuntu, running PySpark 1.4.1.
The error you see is unrelated to flatMap. If you define the node class in the main script it is accessible on the driver, but it is not distributed to the workers. To make it work you should place the node definition inside a separate module and make sure it is distributed to the workers:
- Create a separate module with the node definition, let's call it node.py.
- Import the node class inside the main script: from node import node
- Make sure the module is distributed to the workers: sc.addPyFile("node.py")
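A minimal sketch of how the two pieces could look after the split. The file names (node.py, main.py) and the single-node "local" master are just assumptions carried over from the question; the print statement keeps the Python 2 style used there:

    # node.py -- module holding only the class definition
    class node(object):
        def __init__(self, value):
            self.value = value

        def __repr__(self):
            return "node({0})".format(repr(self.value))


    # main.py -- driver script (hypothetical name)
    from pyspark import SparkContext
    from node import node  # import the class from the separate module

    sc = SparkContext("local", "WordCountBySparkKeyword")
    sc.addPyFile("node.py")  # ship node.py so workers can unpickle node objects

    def foo(x):
        if x.value == 2:
            return [node(2), node(3), node(4)]
        return [node(1)]

    rdd = sc.parallelize([node(2)])
    rdd = rdd.flatMap(foo)
    print rdd.collect()  # expected: [node(2), node(3), node(4)]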
Now it should work as expected.
On a side note:
- PEP 8 recommends CapWords for class names. It is not a hard requirement, but it makes life easier.
- The __repr__ method should return a string representation of the object. At the very least make sure it is a string; a proper representation is even better (see the quick illustration below):

      def __repr__(self):
          return "node({0})".format(repr(self.value))
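To see why returning a string matters, here is a quick illustration in plain Python 2, independent of Spark. The original __repr__ returns self.value (an int), so printing a list of nodes raises a TypeError, while the version above prints cleanly (the exact error wording may vary slightly between interpreter versions):

    class node(object):
        def __init__(self, value):
            self.value = value
        def __repr__(self):
            return self.value  # returns an int, not a string

    print [node(2)]  # raises TypeError: __repr__ returned non-string

    class node(object):
        def __init__(self, value):
            self.value = value
        def __repr__(self):
            return "node({0})".format(repr(self.value))

    print [node(2)]  # prints [node(2)]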