I am a Python programmer, and since the Python API is too slow for my Spark application, I decided to port my code to the Spark Scala API to compare the computation time.
I am trying to filter out the lines that start with numeric characters from a huge file using the Scala API in Spark. In my file, some lines have numbers and some have words, and I want only the lines that have numbers.
So, in my Python application, I have these lines.
l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())
Which works exactly as I want.
This is what I have tried so far.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))
This throws an error saying that Char does not have a forall() function.
I also tried taking the first character of each line using s.take(1) and applying the isDigit() function to it in the following way.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)
And this too...
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).character.isDigit)
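This also throws an error.
This is probably a small error, but since I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.
Edit: As answered in this question, I tried writing a function, but I am unable to use it in the filter() function in my application. I want to apply the function to all lines in the file.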
In Scala, indexing syntax uses parentheses () instead of brackets []. The exact translation of your Python code would be this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)
A more idiomatic way is to extract the first symbol using the head method:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)
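To see that the two forms behave the same way without starting a Spark job, here is a minimal sketch that runs the same predicates on a plain Scala Seq[String]. The object name FirstCharFilter and the sample lines are made up for illustration; the local collection stands in for the RDD only to show what the predicates do.

```scala
// Hypothetical stand-in for the RDD: a local collection of non-empty lines.
object FirstCharFilter {
  val lines: Seq[String] = Seq("123 apples", "pears 7", "42", "x1")

  // Index with parentheses: x(0) is the first Char of the String.
  val byIndex: Seq[String] = lines.filter(_(0).isDigit)

  // Same test, written with head.
  val byHead: Seq[String] = lines.filter(_.head.isDigit)
}
```

Both byIndex and byHead keep only the lines whose first character is a digit, here "123 apples" and "42".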
Both of these methods would fail if your file contains empty lines. If that's the case, you probably want this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))
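The empty-line problem can be reproduced locally. The sketch below is only an illustration on a plain Seq[String] (the object name EmptyLineCheck and the sample data are assumptions, not part of your application): head throws on the empty string, while the headOption version simply treats it as a non-match.

```scala
import scala.util.Try

// Hypothetical sample with an empty line in the middle.
object EmptyLineCheck {
  val lines: Seq[String] = Seq("12 abc", "", "abc 12")

  // head on "" throws, so the whole filter fails; Try captures the exception.
  val unsafe: Try[Seq[String]] = Try(lines.filter(_.head.isDigit))

  // headOption yields None for "", which falls back to false.
  val safe: Seq[String] =
    lines.filter(_.headOption.map(_.isDigit).getOrElse(false))
}
```

Here unsafe is a Failure, and safe contains only "12 abc".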
UPD.
As curious noted, map(predicate).getOrElse(false) on an Option can be shortened to exists(predicate):
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))
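Regarding the edit about using your own function in filter(): one possible way is to define a named method and pass it by name. This is a sketch under assumptions, with startsWithDigit and DigitLines being hypothetical names, shown on a local Seq[String]; an RDD[String] accepts a String => Boolean in the same way, so l.filter(DigitLines.startsWithDigit) should work there too.

```scala
object DigitLines {
  // Hypothetical helper: true when the line is non-empty and starts with a digit.
  def startsWithDigit(line: String): Boolean =
    line.headOption.exists(_.isDigit)

  // Local sample data standing in for the lines of the file.
  val sample: Seq[String] = Seq("2021 report", "", "notes", "7")

  // The method is converted to a function when passed to filter.
  val kept: Seq[String] = sample.filter(startsWithDigit)
}
```

With this sample, kept holds "2021 report" and "7"; the empty line and "notes" are dropped.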