'PipelinedRDD' object has no attribute 'toDF' in PySpark


I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured).

My my_script.py is:

from pyspark.mllib.util import MLUtils
from pyspark import SparkContext

sc = SparkContext("local", "Teste Original")
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

And I'm running it with: ./spark-submit my_script.py

And I get the error:

Traceback (most recent call last):
  File "/home/fred-spark/spark-1.5.0-bin-hadoop2.6/pipeline_teste_original.py", line 34, in <module>
    data = MLUtils.loadLibSVMFile(sc, "/home/fred-spark/svm_capture").toDF()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'

What I can't understand is that if I run:

data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

directly inside the PySpark shell, it works.

The toDF method is a monkey patch executed inside the SparkSession constructor (the SQLContext constructor in 1.x), so to be able to use it you have to create a SQLContext (or SparkSession) first:

# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext()

rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False

spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True

rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## |  a|  1|
## +---+---+
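For completeness, in Spark 2.x you would normally create the session through the builder rather than by calling the constructor directly. A minimal sketch (the app name here is just an example):

from pyspark.sql import SparkSession

# getOrCreate() reuses an existing session or creates a new one;
# constructing the session is what attaches toDF to RDDs.
spark = SparkSession.builder.appName("toDF example").getOrCreate()

rdd = spark.sparkContext.parallelize([("a", 1)])
rdd.toDF().show()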

Not to mention that you need an SQLContext (or SparkSession) to work with DataFrames anyway.
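Applied to the script from the question, a minimal fix under Spark 1.5 (keeping the same paths and app name as above) is to construct the SQLContext before calling toDF:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.util import MLUtils

sc = SparkContext("local", "Teste Original")
sqlContext = SQLContext(sc)  # constructing it monkey patches toDF onto RDDs

data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()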

