Remove duplicate rows in CSV comparing data in only two columns with Python


There are many ways to go about this, but here's the gist of what it comes down to:

I have 2 databases full of people, both exported as CSV files. One of the databases is being decommissioned. I need to compare each CSV file (or a combined version of the two) and filter out the non-unique people in the soon-to-be decommissioned server. That way I can import only the unique people from the decommissioned database into the current database.

I need to compare firstname and lastname (which are 2 separate columns). Part of the problem is that these are not precise duplicates: names are capitalized in one database and not in the other.
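Since the only mismatch is capitalization, a case-insensitive comparison key is enough. A minimal sketch (the `name_key` helper is mine, not part of either export):

```python
def name_key(first: str, last: str) -> tuple:
    """Build a case-insensitive (firstname, lastname) comparison key."""
    return (first.strip().lower(), last.strip().lower())

# rows that differ only in case (or stray whitespace) map to the same key
assert name_key("John", "Doe") == name_key("JOHN ", "doe")
```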

Here is an example of the data when I combine the 2 CSV files into one. The capitalized names are from the current database (which is how the CSV is formatted):

firstname,lastname,id,id2,id3
john,doe,123,432,645
jacob,smith,456,372,383
susy,saucy,9999,12,8r83
contractor ,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283
john,doe,999,888,999
susy,saucy,8373,08j,9023

It would be parsed into:

jacob,smith,456,372,383
contractor,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283

Parsing the other columns is irrelevant, but the data in them is relevant and must remain untouched. (There are dozens of other columns, not three.)

To get an idea of how many duplicates I had, I ran this script (taken from a previous post):

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)

Too simple for my needs though.
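To see why on toy data (rows invented here): the set approach keeps the first copy of each duplicate instead of dropping every copy.

```python
lines = ["john,doe,123\n", "jacob,smith,456\n", "john,doe,999\n"]

seen = set()
kept = []
for line in lines:
    # key on the first two columns, lowercased
    key = tuple(s.lower() for s in line.split(",")[:2])
    if key in seen:
        continue  # this skips only the *second* john,doe row
    seen.add(key)
    kept.append(line)

# the first john,doe row survives, but the goal is to drop both copies
```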

Using a set is no good unless you want to keep one copy of each recurring line. Since you instead want to keep only the lines that are unique, you need to find the unique values by looking through the whole file first; a collections.Counter dict will do:

from collections import Counter
from csv import reader, writer

with open("test.csv", encoding="utf-8") as f, open("file_out.csv", "w") as out:
    wr = writer(out)
    header = next(f)  # skip the header
    # count each (firstname, lastname) pair, lowercasing both strings
    counts = Counter((a.lower(), b.lower()) for a, b, *_ in reader(f))
    f.seek(0)  # reset the file pointer
    out.write(next(f))  # write the original header
    # iterate over the file again, keeping only the rows whose
    # first and last names appear exactly once
    wr.writerows(row for row in reader(f)
                 if counts[row[0].lower(), row[1].lower()] == 1)

Input:

firstname,lastname,id,id2,id3
john,doe,123,432,645
jacob,smith,456,372,383
susy,saucy,9999,12,8r83
contractor,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283
john,doe,999,888,999
susy,saucy,8373,08j,9023

file_out:

firstname,lastname,id,id2,id3
jacob,smith,456,372,383
contractor,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283

counts tallies how many times each name pair appears after being lowercased. We then reset the file pointer and write only the lines whose first two column values are seen exactly once in the whole file.
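The core idea in miniature, on made-up name pairs:

```python
from collections import Counter

pairs = [("john", "doe"), ("jacob", "smith"), ("john", "doe")]
counts = Counter(pairs)  # ("john", "doe") counted twice, the other once
# keep only the pairs that occur exactly once in the whole list
unique = [p for p in pairs if counts[p] == 1]
```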

Or, without the csv module, which may be faster if you have many columns:

from collections import Counter

with open("test.csv") as f, open("file_out.csv", "w") as out:
    header = next(f)  # header
    next(f)  # skip blank line
    counts = Counter(tuple(map(str.lower, line.split(",", 2)[:2])) for line in f)
    f.seek(0)  # back to the start of the file
    next(f), next(f)  # skip header and blank line again
    out.write(header)  # write the original header
    out.writelines(line for line in f
                   if counts[tuple(map(str.lower, line.split(",", 2)[:2]))] == 1)
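One gotcha worth noting in the no-csv version: the Counter lookup key must be a tuple. A bare map object is hashed by identity, so it never equals the tuple keys stored in the Counter; the lookup silently returns 0 and every row would be treated as a duplicate. A small demonstration (the data is made up):

```python
from collections import Counter

counts = Counter({("john", "doe"): 2})
key = map(str.lower, ["John", "Doe"])

missed = counts[key]        # a map object never matches a tuple key -> 0
hit = counts[tuple(key)]    # converting to a tuple matches as intended
```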
