I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured).
My my_script.py is:
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext

sc = SparkContext("local", "Teste Original")
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

and I'm running it using:

./spark-submit my_script.py
And I get this error:
Traceback (most recent call last):
  File "/home/fred-spark/spark-1.5.0-bin-hadoop2.6/pipeline_teste_original.py", line 34, in <module>
    data = MLUtils.loadLibSVMFile(sc, "/home/fred-spark/svm_capture").toDF()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'

What I can't understand is that if I run:
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()

directly inside the PySpark shell, it works.
The toDF method is a monkey patch executed inside the SparkSession (SQLContext in Spark 1.x) constructor, so to be able to use it you have to create a SQLContext (or SparkSession) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext()

rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False

spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True

rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## |  a|  1|
## +---+---+

Not to mention you need a SQLContext to work with DataFrames anyway.
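Applied to the script from the question, here is a minimal sketch of the fix on Spark 1.5, assuming the same "/home/svm_capture" path as in the question (SQLContext is the 1.x entry point; SparkSession only exists in 2.x):

# my_script.py, adjusted for Spark 1.5
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.util import MLUtils

sc = SparkContext("local", "Teste Original")
sqlContext = SQLContext(sc)  # constructing it monkey-patches toDF onto RDD

# loadLibSVMFile returns an RDD of LabeledPoint; toDF is now available
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
data.show()

It can then be submitted exactly as before with ./spark-submit my_script.py.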