java - Merge records from csv files in HDFS using Spark


Here's the problem statement:

In a folder in HDFS, there are a few CSV files, with each row being a record with the schema (id, attribute1, attribute2, attribute3).

Some of the columns (except id) could be null or empty strings, and no two records with the same id have conflicting non-empty values for the same column.

We'd like to merge all records with the same id and write the merged records to HDFS. For example:

Record R1: id = 1, attribute1 = "hello", attribute2 = null, attribute3 = ""

Record R2: id = 1, attribute1 = null, attribute2 = null, attribute3 = "testa"

Record R3: id = 1, attribute1 = null, attribute2 = "okk", attribute3 = "testa"

The merged record should be: id = 1, attribute1 = "hello", attribute2 = "okk", attribute3 = "testa"
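To make the merge rule concrete, here is a minimal plain-Java sketch of how two records with the same id could be combined field by field. The class name `RecordMerger` and the representation of a record as a `String[]` are assumptions for illustration only; nothing here is Spark-specific yet.

```java
import java.util.Arrays;

// Hypothetical helper: a record is modelled as a String[] {id, attribute1, attribute2, attribute3}.
final class RecordMerger {
    // Null and "" are both treated as "no value", as in the problem statement.
    public static boolean isBlank(String s) {
        return s == null || s.isEmpty();
    }

    // Field-wise merge of two records with the same id (and the same length):
    // keep the left value unless it is blank, otherwise take the right one.
    public static String[] merge(String[] a, String[] b) {
        String[] out = new String[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = isBlank(a[i]) ? b[i] : a[i];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] r1 = {"1", "hello", null, ""};
        String[] r2 = {"1", null, null, "testa"};
        String[] r3 = {"1", null, "okk", "testa"};
        String[] merged = merge(merge(r1, r2), r3);
        System.out.println(Arrays.toString(merged)); // prints [1, hello, okk, testa]
    }
}
```

Because there are no conflicting values, the order in which the records are merged should not change the non-empty fields of the result, which is what a pairwise reduce step relies on.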

I'm just starting to learn Spark. Could anybody share some thoughts on how to write this in Java with Spark? Thanks!

Here are some sample CSV files:

file1.csv:

id,str1,str2,str3
1,hello,,

file2.csv:

id,str1,str2,str3
1,,,testa

file3.csv:

id,str1,str2,str3
1,,okk,testa

The merged file should be:

id,str1,str2,str3
1,hello,okk,testa

It's known beforehand that there won't be any conflicts on the fields.
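The overall shape of the job can be sketched locally in plain Java: drop header rows, key each line by id, and reduce the lines that share a key. This is only an illustration under assumptions: every file starts with the header `id,str1,str2,str3`, each data row has four comma-separated fields with no quoting, and the class name `CsvMergeSketch` is made up. In Spark's Java API the same steps would presumably map onto `JavaSparkContext.textFile`, `filter`, `mapToPair` and `JavaPairRDD.reduceByKey`, followed by `saveAsTextFile`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in for the distributed job; not Spark code.
final class CsvMergeSketch {
    static final String HEADER = "id,str1,str2,str3"; // assumed header of every file

    // Keep the left value unless it is null or empty, otherwise take the right one.
    public static String[] merge(String[] a, String[] b) {
        String[] out = new String[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (a[i] == null || a[i].isEmpty()) ? b[i] : a[i];
        }
        return out;
    }

    // Groups data lines by id (first column) and merges each group pairwise.
    public static List<String> mergeLines(List<String> lines) {
        Map<String, String[]> byId = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.equals(HEADER)) {
                continue; // skip each file's header row
            }
            // A limit of -1 keeps trailing empty fields: "1,hello,," gives 4 fields.
            String[] fields = line.split(",", -1);
            byId.merge(fields[0], fields, CsvMergeSketch::merge);
        }
        List<String> out = new ArrayList<>();
        out.add(HEADER);
        for (String[] fields : byId.values()) {
            out.add(String.join(",", fields));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "id,str1,str2,str3", "1,hello,,",
            "id,str1,str2,str3", "1,,,testa",
            "id,str1,str2,str3", "1,,okk,testa");
        for (String line : mergeLines(lines)) {
            System.out.println(line);
        }
    }
}
```

A pairwise reduce of this kind needs the merge function to be associative and commutative; the guarantee that fields never conflict is what makes that a reasonable assumption here. A plain `split(",")` would not handle quoted fields that contain commas, so real data might need a proper CSV parser.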

Thanks!

