python - flatMap over list of custom objects in PySpark


I'm getting an error when running flatMap() on a list of objects of a class. It works fine with regular Python data types like int, list etc., but fails when the list contains objects of a class. Here's the entire code:

    from pyspark import SparkContext

    sc = SparkContext("local", "WordCountBySparkKeyword")

    def func(x):
        if x == 2:
            return [2, 3, 4]
        return [1]

    rdd = sc.parallelize([2])
    rdd = rdd.flatMap(func)  # rdd.collect() has [2, 3, 4]
    rdd = rdd.flatMap(func)  # rdd.collect() has [2, 3, 4, 1, 1]

    print rdd.collect()  # gives expected output  # marker 1

    # class I'm defining
    class Node(object):
        def __init__(self, value):
            self.value = value

        # representation, for printing a Node
        def __repr__(self):
            return self.value

    def foo(x):
        if x.value == 2:
            return [Node(2), Node(3), Node(4)]
        return [Node(1)]

    rdd = sc.parallelize([Node(2)])
    rdd = rdd.flatMap(foo)

    # marker 2

    print rdd.collect()  # rdd.collect() should contain the Node values [2, 3, 4, 1, 1]

The code works fine till marker 1 (commented in the code). The problem arises after marker 2. The specific error message I'm getting is:

    AttributeError: 'module' object has no attribute 'Node'

How do I resolve this error?

I'm working on Ubuntu, running PySpark 1.4.1.

The error is unrelated to flatMap. If you define the Node class in the main script, it is accessible on the driver but is not distributed to the workers. To make it work, you should place the Node definition in a separate module and make sure it is distributed to the workers:

  1. Create a separate module with the Node definition, let's call it node.py
  2. Import the Node class inside your main script:

    from node import Node
  3. Make sure the module is distributed to the workers:

    sc.addPyFile("node.py")

Now it should work as expected.
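The steps above work because PySpark serializes objects with pickle, which stores a class by its module-qualified name: a class defined in the main script is recorded as `__main__.Node`, a name the worker processes cannot resolve, hence the AttributeError. Here is a minimal sketch of the mechanism, independent of Spark; the temporary directory merely stands in for shipping node.py to the workers with addPyFile:

```python
import os
import pickle
import sys
import tempfile

# Step 1: put the Node definition in its own module, node.py.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "node.py"), "w") as f:
    f.write(
        "class Node(object):\n"
        "    def __init__(self, value):\n"
        "        self.value = value\n"
        "    def __repr__(self):\n"
        "        return 'Node({0})'.format(repr(self.value))\n"
    )

# Step 2: import the class from the module instead of defining it inline.
sys.path.insert(0, module_dir)
from node import Node

# The pickle payload now records the class as node.Node, so any process
# that can import node.py (e.g. a worker that received it via
# sc.addPyFile) can deserialize the object.
payload = pickle.dumps(Node(2))
restored = pickle.loads(payload)
print(restored)  # Node(2)
```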

On a side note:

  • PEP 8 recommends CapWords for class names. It is not a hard requirement, but it makes life easier
  • The __repr__ method should return a string representation of the object. At the very least make sure it returns a string, but a proper representation is even better:

    def __repr__(self):
        return "Node({0})".format(repr(self.value))
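A quick check of the corrected method (using the Node class from the question): the original version returned an int, which raises a TypeError as soon as the object is printed, while the string version also prints nicely inside containers:

```python
class Node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "Node({0})".format(repr(self.value))

print(repr(Node(2)))       # Node(2)
print([Node(1), Node(2)])  # repr is also used inside containers: [Node(1), Node(2)]
```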
