There are many ways to go about this; here's the gist of it:
I have two databases full of people, both exported to CSV files. One of the databases is being decommissioned. I need to compare the two CSV files (or a combined version of the two) and filter out the non-unique people from the soon-to-be-decommissioned server, so that I can import only the unique people from the decommissioned database into the current database.
I need to compare firstname and lastname (which are two separate columns). Part of the problem is that these are not precise duplicates: the names are capitalized in one database but not in the other.
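A minimal sketch of that comparison, using hypothetical rows (the row values and the `name_key` helper are illustrative, not from either export): lowercasing both name columns gives a key that matches across the two capitalization styles.

```python
# Hypothetical rows: the same person exported by the two databases
# with different capitalization and different IDs.
row_current = ["JOHN", "DOE", "123", "432", "645"]
row_old = ["john", "doe", "999", "888", "999"]

def name_key(row):
    # Case-insensitive (firstname, lastname) key for comparing rows.
    return (row[0].lower(), row[1].lower())

print(name_key(row_current) == name_key(row_old))  # True: same person
```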
Here is an example of the data when I combine the two CSV files into one. The capitalized names come from the current database (which is how that CSV is formatted):
firstname,lastname,id,id2,id3
john,doe,123,432,645
jacob,smith,456,372,383
susy,saucy,9999,12,8r83
contractor ,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283
john,doe,999,888,999
susy,saucy,8373,08j,9023
It would be parsed into:
jacob,smith,456,372,383
contractor,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283
Parsing the other columns is irrelevant, but their data is relevant and must remain untouched. (There are dozens of other columns, not just three.)
To get an idea of how many duplicates I had, I ran this script (taken from a previous post):
with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
It's too simple for my needs, though.
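A small sketch of why it falls short (the sample lines are hypothetical): the duplicated person carries different IDs in each row, so whole-line comparison never detects the duplicate, and even for byte-identical lines the set approach would keep the first copy instead of dropping every copy.

```python
# Hypothetical sample: "john,doe" appears twice, but with different IDs
lines = ["john,doe,123,432,645\n",
         "jacob,smith,456,372,383\n",
         "john,doe,999,888,999\n"]

seen = set()
kept = []
for line in lines:
    if line in seen:
        continue  # only exact repeats of a whole line are skipped
    seen.add(line)
    kept.append(line)

# All three lines survive: the trailing IDs differ, so the
# duplicated person is never detected.
print(len(kept))  # 3
```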
Using a set is no good unless you want to keep one unique line of the recurring values, rather than keep only the lines that are unique. To find the unique values you need to look through the whole file first; a Counter dict will do:
from collections import Counter
from csv import reader, writer

with open("test.csv", encoding="utf-8") as f, open("file_out.csv", "w", newline="") as out:
    wr = writer(out)
    header = next(f)  # skip the header line
    # count each (firstname, lastname) pair, lowercasing both strings
    counts = Counter((a.lower(), b.lower()) for a, b, *_ in reader(f))
    f.seek(0)  # reset the file pointer
    out.write(next(f))  # write the header
    # iterate over the file again, keeping only the rows whose
    # first and last names appear exactly once
    wr.writerows(row for row in reader(f)
                 if counts[row[0].lower(), row[1].lower()] == 1)
Input:

firstname,lastname,id,id2,id3
john,doe,123,432,645
jacob,smith,456,372,383
susy,saucy,9999,12,8r83
contractor,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283
john,doe,999,888,999
susy,saucy,8373,08j,9023
file_out:

firstname,lastname,id,id2,id3
jacob,smith,456,372,383
contractor,#1,8dh,28j,153s
testing2,contrator,7463,99999,0283
counts counts how many times each pair of names appears after being lowercased. We then reset the file pointer and write only the lines whose first two column values were seen exactly once in the whole file.
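A quick sketch of that counting step on the names from the sample (IDs omitted), assuming the mixed capitalization described in the question:

```python
from collections import Counter

pairs = [("JOHN", "DOE"), ("jacob", "smith"), ("susy", "saucy"),
         ("john", "doe"), ("SUSY", "SAUCY")]

counts = Counter((a.lower(), b.lower()) for a, b in pairs)
print(counts["john", "doe"])     # 2 -> duplicate, dropped
print(counts["jacob", "smith"])  # 1 -> unique, kept
```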
Or without the csv module, which may be faster if you have many columns:
from collections import Counter

with open("test.csv") as f, open("file_out.csv", "w") as out:
    header = next(f)  # skip the header line
    counts = Counter(tuple(map(str.lower, line.split(",", 2)[:2]))
                     for line in f)
    f.seek(0)  # back to the start of the file
    next(f)  # skip the header again
    out.write(header)  # write the original header
    # the lookup key must be a tuple, not a bare map object
    out.writelines(line for line in f
                   if counts[tuple(map(str.lower, line.split(",", 2)[:2]))] == 1)
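Note the `tuple(...)` around the key in the lookup: a bare map object hashes by identity, so `counts[map(...)]` would silently return 0 for every row and nothing would be written. A quick sketch of how the key is built from a raw line:

```python
line = "John,Doe,123,432,645\n"
# split at most twice, keep the first two fields, lowercase them
key = tuple(map(str.lower, line.split(",", 2)[:2]))
print(key)  # ('john', 'doe')
```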