I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured).

My my_script.py is:
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext

sc = SparkContext("local", "Teste Original")
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
and I'm running it with: ./spark-submit my_script.py

And I get this error:
Traceback (most recent call last):
  File "/home/fred-spark/spark-1.5.0-bin-hadoop2.6/pipeline_teste_original.py", line 34, in <module>
    data = MLUtils.loadLibSVMFile(sc, "/home/fred-spark/svm_capture").toDF()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
What I can't understand is that if I run:

data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

directly inside the PySpark shell, it works.
The toDF method is a monkey patch executed inside the SparkSession constructor (the SQLContext constructor in Spark 1.x), so to be able to use it you have to create a SQLContext (or SparkSession) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext()

rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False

spark = SparkSession(sc)

hasattr(rdd, "toDF")
## True

rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## |  a|  1|
## +---+---+
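This also explains why it works inside the PySpark shell: the shell creates a SQLContext at startup (exposed as sqlContext in 1.x, and a SparkSession as spark in 2.x), so the monkey patch is already in place by the time you call toDF.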
Not to mention that you need a SQLContext to work with DataFrames anyway.
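Applied to the original script, a minimal fix for Spark 1.5.0 might look like this (a sketch, assuming the same local path as in the question; constructing the SQLContext before calling toDF is the only real change):

from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "Teste Original")
sqlContext = SQLContext(sc)  # constructing it installs the toDF monkey patch on RDDs

# loadLibSVMFile returns an RDD of LabeledPoint; toDF now works
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
data.printSchema()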