Python crawler: downloading HTML pages


I want to crawl (gently) a website and download each HTML page I crawl. To accomplish that I use the library requests. I already did my crawl listing and tried crawling the pages using urllib.open, but without a user-agent I got an error message. So I chose to use requests, but I don't really know how to use it.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}
page = requests.get('http://www.xf.com/ranking/get/?amount=1&from=left&to=right',
                    headers=headers)

with open('pages/test.html', 'w') as outfile:
    outfile.write(page.text)

The problem is that when the script tries to write the response to a file, I get an encoding error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 6673-6675: ordinal not in range(128)

How can I write the response to a file without running into this encoding problem?

In Python 2, text files don't accept Unicode strings. Use response.content to access the original binary, undecoded content:

with open('pages/test.html', 'w') as outfile:
    outfile.write(page.content)

This writes the downloaded HTML to the file in the original encoding in which the website served it.
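(If you are on Python 3, or want to be safe about newline translation on Windows, open the file in binary mode instead; a minimal variant of the same idea:)

# Binary mode: bytes in, bytes out, no codec involved
with open('pages/test.html', 'wb') as outfile:
    outfile.write(page.content)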

Alternatively, if you want to re-encode all responses to one specific encoding, use io.open() to produce a file object that accepts Unicode:

import io

with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
    outfile.write(page.text)

Note that many websites rely on signalling the correct codec in <meta> tags inside the HTML itself (e.g. <meta charset="utf-8">), and the content can then be served without a charset parameter in the Content-Type header altogether.

In that case requests uses the default codec for the text/* mimetype, Latin-1, to decode the HTML to Unicode text. This is often the wrong codec, and relying on this behaviour can lead to mojibake in the output later on. I recommend you stick to writing the binary content and rely on tools like BeautifulSoup to detect the correct encoding later on.
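A minimal sketch of that approach (assuming BeautifulSoup 4 is installed; bs4 runs its own encoding detection on the raw bytes):

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.xf.com/ranking/get/?amount=1&from=left&to=right')

# Feed the raw bytes to BeautifulSoup; it sniffs <meta> tags and byte
# patterns to pick the codec, instead of trusting the HTTP header.
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.original_encoding)  # the codec bs4 detected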

Alternatively, test explicitly for the charset parameter being present, and only re-encode (via response.text and io.open(), or otherwise) if requests did not fall back to the Latin-1 default. See my answer to Retrieve links from web page using python and BeautifulSoup, where I use such a method to tell BeautifulSoup which codec to use.
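A minimal sketch of such a test, assuming the same URL and output path as above (the exact condition is mine, not taken from the linked answer):

import io
import requests

page = requests.get('http://www.xf.com/ranking/get/?amount=1&from=left&to=right')
content_type = page.headers.get('content-type', '')

if 'charset=' in content_type:
    # The server declared a codec, so page.text was decoded reliably;
    # re-encode it to UTF-8 on the way out.
    with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
        outfile.write(page.text)
else:
    # No charset parameter: requests would fall back to Latin-1,
    # so keep the undecoded bytes and detect the codec later.
    with open('pages/test.html', 'wb') as outfile:
        outfile.write(page.content)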

