regex - Remove URLs, empty lines, and Unicode characters in Python


I need to remove URLs, empty lines, and lines containing Unicode characters from a big text file (500 MiB) using Python.

This is my file:

https://removethis1.com http://removethis2.com foobar1 http://removethis3.com foobar2 foobar3 http://removethis4.com www.removethis5.com   foobar4 www.removethis6.com foobar5 foobar6 foobar7 foobar8 www.removethis7.com 

After the regex, it should look like this:

foobar1 foobar2 foobar3  foobar4 foobar5 foobar6 foobar7 foobar8 

The code I have come up with is this:

    file = open(file_path, encoding="utf8")
    self.rawfile = file.read()
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n){2,}
                        """, re.X)
    self.processedfile = rep.sub('', self.rawfile)

But the output is incorrect:

foobar3 foobar4 foobar5 foobar6 foobar7 foobar8 www.removethis7.com 

I also need to remove all lines containing at least one non-ASCII character, but I can't come up with a regex for this task.

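You can try to encode each line to ASCII to catch the non-ASCII lines, which I presume is what you want:

    import re

    with open("test.txt", encoding="utf-8") as f:
        rep = re.compile(r"""
                            http[s]?://.*?\s
                            |www.*?\s
                            |(\n)
                            """, re.X)
        for line in f:
            m = rep.search(line)
            try:
                if m:
                    # Remove the first match (URL or newline) from the line
                    line = line.replace(m.group(), "")
                    # Raises UnicodeEncodeError on any non-ASCII character
                    line.encode("ascii")
            except UnicodeEncodeError:
                continue  # skip lines containing non-ASCII characters
            if line.strip():
                print(line.strip())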

Input:

https://removethis1.com http://removethis2.com foobar1 http://removethis3.com foobar2 foobar3 http://removethis4.com www.removethis5.com  1234 ā 5678 字 foobar4 www.removethis6.com foobar5 foobar6 foobar7 foobar8 www.removethis7.com 

Output:

foobar1 foobar2 foobar3 foobar4 foobar5 foobar6 foobar7 foobar8 
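To isolate the encoding trick used above, here is a minimal sketch of it as a standalone helper. The name `is_ascii` and the sample lines are only illustrative and are not part of the code above.

```python
def is_ascii(line: str) -> bool:
    """Return True if every character in `line` is ASCII."""
    try:
        # str.encode raises UnicodeEncodeError on the first non-ASCII character
        line.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample lines, modelled on the input shown above
lines = ["foobar1\n", "1234 ā 5678 字\n", "foobar4\n"]
kept = [line.strip() for line in lines if is_ascii(line)]
print(kept)  # ['foobar1', 'foobar4']
```

On Python 3.7 and later, the built-in `str.isascii()` should give the same answer without the try/except.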

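Or use a regex to match non-ASCII characters:

    import re

    with open("test.txt", encoding="utf-8") as f:
        rep = re.compile(r"""
                            http[s]?://.*?\s
                            |www.*?\s
                            |(\n)
                            """, re.X)
        # Matches any character outside the ASCII range
        non_asc = re.compile(r"[^\x00-\x7F]")
        for line in f:
            non = non_asc.search(line)
            if non:
                continue  # skip lines containing non-ASCII characters
            m = rep.search(line)
            if m:
                # Remove the first match (URL or newline) from the line
                line = line.replace(m.group(), "")
                if line.strip():
                    print(line.strip())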

Same output as above. You cannot combine the two regexes, because with one you are removing whole lines if there is any match, and with the other you are replacing only the matched text.
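As a rough sketch of how the two steps could be kept separate but still run in one pass, the generator below drops non-ASCII lines first and then strips URLs with `sub`, so several URLs on one line are all removed. The names `clean_lines`, `URL` and `NON_ASCII` are hypothetical, and the URL pattern is an assumption: it treats any run of non-whitespace starting with `http://`, `https://` or `www.` as a URL.

```python
import re

# Assumption: a URL is a run of non-whitespace starting with one of these prefixes
URL = re.compile(r"\b(?:https?://|www\.)\S+")
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_lines(lines):
    """Yield cleaned lines: non-ASCII lines dropped, URLs and empty lines removed."""
    for line in lines:
        if NON_ASCII.search(line):
            continue  # regex 1: any match discards the whole line
        line = URL.sub("", line)  # regex 2: replace every URL with nothing
        line = " ".join(line.split())  # collapse leftover whitespace
        if line:
            yield line


# Hypothetical sample; the last line has no trailing newline on purpose
sample = [
    "https://removethis1.com\n",
    "http://removethis2.com foobar1\n",
    "1234 ā 5678 字\n",
    "\n",
    "foobar4 www.removethis6.com foobar5\n",
    "foobar8 www.removethis7.com",
]
result = list(clean_lines(sample))
print(result)  # ['foobar1', 'foobar4 foobar5', 'foobar8']
```

Since the pattern does not require whitespace after the URL, a URL at the very end of the file is removed too. An open file object can be passed in place of `sample`, so a large file is processed line by line instead of being read into memory at once.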

