python - How to asynchronously apply a function via Spark to subsets of a dataframe?


I've written a program in Python and pandas that takes a very large dataset (~4 million rows per month, for 6 months), groups it by 2 of the columns (date and label), and then applies a function to each group of rows. There is a variable number of rows in each grouping - anywhere from a handful of rows to thousands of rows. There are thousands of groups per month (label-date combos).

My current program uses multiprocessing and it's pretty efficient, and I thought it would map well to Spark. I've worked with map-reduce before, but I'm having trouble implementing this in Spark. I'm sure I'm missing some concept in the pipelining; everything I've read appears to focus on key-value processing, or on splitting a distributed dataset into arbitrary partitions, rather than what I'm trying to do. Is there a simple example or paradigm for doing this? Any help would be greatly appreciated.

Edit: here's some pseudo-code for what I'm currently doing:

import multiprocessing as mp
import pandas as pd

reader = pd.read_csv("data.csv")    # placeholder path
pool = mp.Pool(processes=4)
labels = reader.label.unique()      # list of unique labels
for label in labels:
    dates = reader[reader.label == label].date.unique()
    for date in dates:
        df = reader[(reader.label == label) & (reader.date == date)]
        # process() and callbackfunc() are defined elsewhere
        pool.apply_async(process, (df,), callback=callbackfunc)
pool.close()
pool.join()

When I say asynchronous, I mean something analogous to pool.apply_async().

As for now (PySpark 1.5.0) I see two or three options:

  1. You can try to express your logic using SQL operations and UDFs. Unfortunately the Python API doesn't support UDAFs (user defined aggregate functions), but it is still expressive enough, especially with window functions, to cover a wide range of scenarios (see the first sketch after this list).

    Access to external data sources can be handled in a couple of ways including:

    • access inside the UDF with optional memoization
    • loading to a data frame and using a join operation
    • using a broadcast variable
  2. Converting the data frame to a PairRDD and using one of the following (see the second sketch after this list):

    • partitionBy + mapPartitions
    • reduceByKey / aggregateByKey

If Python is not a strong requirement, the Scala API > 1.5.0 supports UDAFs, which enable something like this:

df.groupBy(some_columns: _*).agg(some_udaf)
  3. Partitioning the data by key and using local pandas data frames per partition (the second sketch below also illustrates this pattern).
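To make option 1 concrete, here is a minimal sketch. The column names (label, date, value), the toy data, and the stand-in logic (a trivial row-wise UDF plus window aggregates) are all assumptions rather than the question's actual function, and the SparkSession entry point comes from later releases; on 1.5.0 you would build the DataFrame through SQLContext instead:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()  # on 1.5.0: SQLContext(sc)

# Toy data standing in for the real dataset (column names are assumptions).
sdf = spark.createDataFrame(
    [("2015-01-01", "a", 1.0), ("2015-01-01", "a", 3.0), ("2015-01-02", "b", 2.0)],
    ["date", "label", "value"])

# A plain row-wise UDF (no UDAFs are available from Python here).
double_value = F.udf(lambda x: x * 2.0, DoubleType())

# Window functions give per-(label, date) aggregates without leaving the SQL API.
w = Window.partitionBy("label", "date")
result = (sdf
    .withColumn("doubled", double_value("value"))
    .withColumn("group_mean", F.avg("value").over(w))
    .withColumn("rank_in_group", F.rank().over(w.orderBy("value"))))

result.show()

The same window spec covers most per-group aggregates, ranks and lags that would otherwise call for a UDAF.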
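Options 2 and 3 reduce to the same pattern: key each row by (label, date), partition by that key, and rebuild a local pandas DataFrame inside each partition so the existing per-group function can run unchanged. A rough sketch under the same assumptions as above, with process() as a hypothetical stand-in for the question's per-group function and an arbitrary partition count:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("2015-01-01", "a", 1.0), ("2015-01-01", "a", 3.0), ("2015-01-02", "b", 2.0)],
    ["date", "label", "value"])

def process(pdf):
    # Hypothetical stand-in for the per-group pandas function from the question.
    return [(pdf["label"].iat[0], pdf["date"].iat[0], float(pdf["value"].mean()))]

def process_partition(rows):
    # Several keys may land in the same partition, so group again locally.
    pdf = pd.DataFrame([row.asDict() for row in rows])
    if pdf.empty:
        return
    for _, group in pdf.groupby(["label", "date"]):
        for out in process(group):
            yield out

results = (sdf.rdd
    .keyBy(lambda row: (row["label"], row["date"]))  # PairRDD keyed by group
    .partitionBy(8)                                  # partition count is arbitrary
    .values()
    .mapPartitions(process_partition)
    .collect())

print(results)

Calling partitionBy before mapPartitions keeps every row for a given (label, date) in the same partition, which is what makes the local re-grouping safe.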
