We love IPython.parallel (now ipyparallel).
Something bugs us, though: when sending a ~1.5 GB pandas DataFrame to a bunch of workers, we get a MemoryError if the cluster has many nodes. It looks like there are as many copies of the DataFrame as there are engines (or some number proportional to that). Is there a way to avoid these copies?
Example:
In []: direct_view.push({'xy': xy}, block=True)  # or direct_view['xy'] = xy
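For context, a minimal self-contained version of what we run looks roughly like this (the sizes and the Client/DirectView wiring are illustrative; xy stands in for our real DataFrame):

import numpy as np
import pandas as pd
import ipyparallel as ipp

rc = ipp.Client()        # connect to the running cluster
direct_view = rc[:]      # DirectView over all engines

# roughly 1.5 GB of float64 data (10**8 rows x 2 columns x 8 bytes)
xy = pd.DataFrame(np.random.rand(10**8, 2), columns=['x', 'y'])

direct_view.push({'xy': xy}, block=True)  # client memory keeps growing here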
For a small cluster (e.g. 30 nodes), memory grows and grows, but the data eventually goes through and everything is fine. For a larger cluster, e.g. 80 nodes (all r3.4xlarge with 1 engine each, not n_core engines), htop reports memory growing up to the maximum (123 GB) and we get:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-120-f6a9a69761db> in <module>()
----> 1 get_ipython().run_cell_magic(u'time', u'', u"ipc.direct_view.push({'xy':xy}, block=True)")

/opt/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_cell_magic(self, magic_name, line, cell)
   2291         magic_arg_s = self.var_expand(line, stack_depth)
   2292         with self.builtin_trap:
-> 2293             result = fn(magic_arg_s, cell)
   2294         return result
   2295
(...)
Note: after looking at https://ipyparallel.readthedocs.org/en/latest/details.html, we also tried sending the underlying NumPy array (xy.values) in an attempt to get a "non-copying send", but we still hit the MemoryError.
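Concretely, that attempt looked roughly like this (the variable name xy_values and the reconstruction step on the engines are ours, just for illustration):

# push only the underlying numpy array, hoping the buffer is sent without copying
direct_view.push({'xy_values': xy.values}, block=True)

# rebuild the DataFrame on each engine (column names are illustrative)
direct_view.execute("import pandas as pd; xy = pd.DataFrame(xy_values, columns=['x', 'y'])",
                    block=True)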
Versions:
- Jupyter Notebook v4.0.4
- Python 2.7.10
- ipyparallel.__version__: 4.0.2