I need to remove URLs, empty lines, and lines containing Unicode characters from a big text file (500 MiB) using Python.
This is my file:
https://removethis1.com http://removethis2.com foobar1 http://removethis3.com foobar2 foobar3 http://removethis4.com www.removethis5.com foobar4 www.removethis6.com foobar5 foobar6 foobar7 foobar8 www.removethis7.com
After the regex it should look like this:
foobar1 foobar2 foobar3 foobar4 foobar5 foobar6 foobar7 foobar8
The code I have come up with is this:

    file = open(file_path, encoding="utf8")
    self.rawfile = file.read()
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n){2,}
                        """, re.X)
    self.processedfile = rep.sub('', self.rawfile)
But the output is incorrect:
foobar3 foobar4 foobar5 foobar6 foobar7 foobar8 www.removethis7.com
I also need to remove all lines containing at least one non-ASCII character, but I can't come up with a regex for this task.
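You can try encoding each line to ASCII to catch the non-ASCII lines, which I presume is what you want:

    import re

    with open("test.txt", encoding="utf-8") as f:
        rep = re.compile(r"""
                            http[s]?://.*?\s
                            |www.*?\s
                            |(\n)
                            """, re.X)
        for line in f:
            m = rep.search(line)
            try:
                if m:
                    # remove the matched URL (or newline) from the line
                    line = line.replace(m.group(), "")
                # raises UnicodeEncodeError if the line has non-ASCII characters
                line.encode("ascii")
            except UnicodeEncodeError:
                continue
            if line.strip():
                print(line.strip())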
Input:
https://removethis1.com http://removethis2.com foobar1 http://removethis3.com foobar2 foobar3 http://removethis4.com www.removethis5.com 1234 ā 5678 字 foobar4 www.removethis6.com foobar5 foobar6 foobar7 foobar8 www.removethis7.com
Output:
foobar1 foobar2 foobar3 foobar4 foobar5 foobar6 foobar7 foobar8
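The encode-based check can also be pulled out into a small helper, which may make the loop easier to read. This is only a sketch: `is_ascii` is a hypothetical name, not something from the code above.

```python
def is_ascii(text):
    """Return True if every character in text is in the ASCII range."""
    try:
        # str.encode raises UnicodeEncodeError on the first non-ASCII character
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```

On Python 3.7 and later, the built-in `str.isascii()` method performs the same test without a try/except.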
Or using a regex to match any non-ASCII characters:

    import re

    with open("test.txt", encoding="utf-8") as f:
        rep = re.compile(r"""
                            http[s]?://.*?\s
                            |www.*?\s
                            |(\n)
                            """, re.X)
        # matches any character outside the ASCII range
        non_asc = re.compile(r"[^\x00-\x7f]")
        for line in f:
            non = non_asc.search(line)
            if non:
                continue  # skip lines with non-ASCII characters
            m = rep.search(line)
            if m:
                line = line.replace(m.group(), "")
            if line.strip():
                print(line.strip())
Same output as above. You cannot combine the regexes, because with one you are removing the whole line if there is any match, and with the other you are replacing only the matched text.
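Although the two regexes cannot be merged into one pattern, they can still be applied one after the other in a single pass over the file. The sketch below is one possible way to do that, not the only one. It assumes a URL is any whitespace-delimited token starting with `http://`, `https://` or `www.`; the names `clean_lines` and `clean_file` are hypothetical. It uses `sub()` rather than `search()` plus `replace()`, so every URL on a line is removed, and it matches with `\S*`, so a URL at the very end of the file with no trailing whitespace is removed too.

```python
import re

# Assumption: a URL is a token starting with http://, https:// or www.
URL = re.compile(r"\b(?:https?://|www\.)\S*")
# Any character outside the ASCII range
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def clean_lines(lines):
    """Yield cleaned lines, dropping non-ASCII lines and empty lines."""
    for line in lines:
        if NON_ASCII.search(line):
            continue  # drop the whole line
        # sub() removes every URL on the line; split/join trims the
        # leftover whitespace and collapses runs of spaces
        line = " ".join(URL.sub("", line).split())
        if line:
            yield line


def clean_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in clean_lines(src):
            dst.write(line + "\n")
```

Reading line by line keeps memory use low for a 500 MiB file, since the whole file is never held in memory at once. Note that the `split`/`join` step also collapses repeated spaces inside a line, which may or may not be wanted.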