Here's the problem statement:
In a folder in HDFS there are a few CSV files, where each row is a record with the schema (id, attribute1, attribute2, attribute3).
Some of the columns (other than id) may be null or empty strings, and no two records with the same id have conflicting non-empty values for the same column.
We'd like to merge the records that share the same id and write the merged records back to HDFS. For example:
Record r1: id = 1, attribute1 = "hello", attribute2 = null, attribute3 = ""
Record r2: id = 1, attribute1 = null, attribute2 = null, attribute3 = "testa"
Record r3: id = 1, attribute1 = null, attribute2 = "okk", attribute3 = "testa"
The merged record should be: id = 1, attribute1 = "hello", attribute2 = "okk", attribute3 = "testa"
I'm starting to learn Spark. Could you share your thoughts on how to write this in Java Spark? Thanks!
Here are the sample CSV files:
file1.csv:
id,str1,str2,str3
1,hello,,
file2.csv:
id,str1,str2,str3
1,,,testa
file3.csv:
id,str1,str2,str3
1,,okk,testa
The merged file should be:
id,str1,str2,str3
1,hello,okk,testa
It's known beforehand that there won't be any conflicts on these fields.
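From what I've read so far, something like the sketch below might work, using the Spark DataFrame API: read every CSV in the input folder, treat empty strings as nulls, group by id, and take the first non-null value of each attribute with first(..., ignoreNulls). The HDFS paths and the class name are just placeholders, and this is only a rough attempt, so corrections are welcome.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergeRecordsById {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MergeRecordsById")
                .getOrCreate();

        // Read all CSV files in the input folder (placeholder path).
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .csv("hdfs:///path/to/input");

        // Normalize empty strings to null so they can be skipped during the
        // merge (depending on CSV options they may already be read as null;
        // this just makes it explicit).
        Dataset<Row> normalized = input;
        for (String c : new String[]{"str1", "str2", "str3"}) {
            normalized = normalized.withColumn(c,
                    when(col(c).equalTo(""), lit(null)).otherwise(col(c)));
        }

        // For each id, keep the first non-null value of each attribute.
        // Since the fields never conflict, "first non-null" is the merged value.
        Dataset<Row> merged = normalized.groupBy("id")
                .agg(first("str1", true).as("str1"),
                     first("str2", true).as("str2"),
                     first("str3", true).as("str3"));

        // Write the merged records back to HDFS (placeholder path).
        // coalesce(1) is optional; it just produces a single output file.
        merged.coalesce(1)
                .write()
                .option("header", "true")
                .csv("hdfs:///path/to/output");

        spark.stop();
    }
}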
Thanks!