I'm getting an error when running flatMap() on a list of objects of a class. It works fine for regular Python data types such as int and list, but fails when the list contains objects of my own class. Here's the entire code:
    from pyspark import SparkContext

    sc = SparkContext("local", "WordCountBySparkKeyword")

    def func(x):
        if x == 2:
            return [2, 3, 4]
        return [1]

    rdd = sc.parallelize([2])
    rdd = rdd.flatMap(func)  # rdd.collect() now has [2, 3, 4]
    rdd = rdd.flatMap(func)  # rdd.collect() now has [2, 3, 4, 1, 1]

    print rdd.collect()  # marker 1: gives the expected output

    # Class I'm defining
    class node(object):
        def __init__(self, value):
            self.value = value

        # Representation, for printing a node
        def __repr__(self):
            return self.value

    def foo(x):
        if x.value == 2:
            return [node(2), node(3), node(4)]
        return [node(1)]

    rdd = sc.parallelize([node(2)])
    rdd = rdd.flatMap(foo)  # marker 2

    print rdd.collect()  # rdd.collect should contain nodes with values [2, 3, 4, 1, 1]
The code works fine up to marker 1 (commented in the code). The problem arises after marker 2. The specific error message I'm getting is:

    AttributeError: 'module' object has no attribute 'node'

How do I resolve this error?

I'm working on Ubuntu, running PySpark 1.4.1.
The error is unrelated to flatMap. If you define the node class in your main script, it is accessible on the driver but it is not distributed to the workers. To make it work, you should place the node definition inside a separate module and make sure that module is distributed to the workers.
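To make the reasoning above concrete, here is a rough analogy that uses only the standard pickle module and no Spark at all. It assumes a stand-in module named driver_script (a hypothetical name) playing the role of the main script. The point it illustrates is that a pickled instance carries only the module and class name, not the class body, so a process whose copy of that module lacks the class fails with an AttributeError much like the one in the question. Spark's real serialization path has more moving parts, so treat this as a sketch rather than a description of its internals.

```python
import pickle
import sys
import types

# Stand-in for the main script: a module that defines the class.
# "driver_script" is a made-up name used only for this illustration.
driver = types.ModuleType("driver_script")
exec(
    "class Node(object):\n"
    "    def __init__(self, value):\n"
    "        self.value = value\n",
    driver.__dict__,
)
sys.modules["driver_script"] = driver

# Pickling an instance stores a reference (module name + class name) and the
# instance state, but not the source of the class itself.
payload = pickle.dumps(driver.Node(2), protocol=2)

# While the class is still reachable, unpickling works.
restored = pickle.loads(payload)

# Simulate a worker whose module of the same name does not define the class.
del driver.Node
try:
    pickle.loads(payload)
    worker_error = None
except AttributeError as exc:
    worker_error = exc
```

Here restored.value is 2, while worker_error holds the AttributeError raised once the class can no longer be found by name.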
- Create a separate module with the node definition; let's call it node.py.
- Import the node class inside your main script:

      from node import node

- Make sure the module is distributed to the workers:

      sc.addPyFile("node.py")

Now everything should work as expected.
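The steps above can be sketched in a single file as follows. This is only an illustration under a few assumptions: the module is written to a temporary directory by a hypothetical helper, write_node_module, purely so the example is self-contained (in a real project node.py would simply sit next to the main script), and run_job, also a hypothetical name, needs a working PySpark installation and is therefore defined but not called here.

```python
import os
import sys
import tempfile

# Contents of the separate module. In a real project this is just node.py.
NODE_SOURCE = '''
class node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "node({0})".format(repr(self.value))
'''


def write_node_module(directory):
    """Write node.py into `directory` and return its path (hypothetical helper)."""
    path = os.path.join(directory, "node.py")
    with open(path, "w") as handle:
        handle.write(NODE_SOURCE)
    return path


def foo(x):
    # Imported inside the function so the name is resolved by whichever
    # process (driver or worker) actually runs it.
    from node import node
    if x.value == 2:
        return [node(2), node(3), node(4)]
    return [node(1)]


def run_job(node_path):
    """Spark part of the sketch; requires PySpark to be installed."""
    from pyspark import SparkContext
    from node import node

    sc = SparkContext("local", "WordCountBySparkKeyword")
    sc.addPyFile(node_path)  # ship node.py to the workers
    try:
        return sc.parallelize([node(2)]).flatMap(foo).collect()
    finally:
        sc.stop()


module_dir = tempfile.mkdtemp()
node_path = write_node_module(module_dir)
sys.path.insert(0, module_dir)  # lets the driver run `from node import node`
```

Calling run_job(node_path) on a machine with PySpark available would then run the flatMap from the question against a class that the workers can import.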
On a side note:
- PEP 8 recommends CapWords for class names. It is not a hard requirement, but it makes life easier.
- The __repr__ method should return a string representation of an object. At the very least make sure it returns a string, but a proper representation is even better:

      def __repr__(self):
          return "node({0})".format(repr(self.value))
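As a small illustration of the __repr__ point, the sketch below contrasts a class that returns the raw value with one that returns a proper string. BadNode and Node are names made up for this example; Node also follows the CapWords convention mentioned above.

```python
class BadNode(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        # Returns whatever `value` is; for an int this is not a string.
        return self.value


class Node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        # Always a string, and it shows how to recreate the object.
        return "Node({0})".format(repr(self.value))


# repr() insists on a string result, so the first class fails for int values.
try:
    repr(BadNode(2))
    bad_repr_error = None
except TypeError as exc:
    bad_repr_error = exc

good_repr = repr([Node(2), Node(3), Node(4)])
```

With the second class, good_repr is the readable string "[Node(2), Node(3), Node(4)]", which is what printing the result of rdd.collect() relies on.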