I want to crawl a website (gently) and download each HTML page that I crawl. To accomplish this I use the library requests. I already did my crawl-listing and tried to crawl the pages using urllib.open, but without a user-agent I got an error message. So I chose to use requests, but I don't really know how to use it.
    import requests

    headers = {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
    }
    page = requests.get('http://www.xf.com/ranking/get/?amount=1&from=left&to=right', headers=headers)

    with open('pages/test.html', 'w') as outfile:
        outfile.write(page.text)
The problem is that when the script tries to write the response to a file, I get an encoding error:
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 6673-6675: ordinal not in range(128)
How can I write the response to a file without having encoding problems?
In Python 2, text files don't accept Unicode strings. Use response.content to access the original binary, undecoded content:

    with open('pages/test.html', 'wb') as outfile:
        outfile.write(page.content)

This writes the downloaded HTML to the file in the original encoding as served by the website.
Alternatively, if you want to re-encode all responses to a specific encoding, use io.open() to produce a file object that accepts Unicode:

    import io

    with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
        outfile.write(page.text)
Note that many websites rely on signalling the correct codec in HTML meta tags, and content can be served without a characterset parameter altogether. In that case requests uses the default codec for the text/* mimetype, Latin-1, to decode the HTML to Unicode text. This is often the wrong codec, and relying on this behaviour can lead to Mojibake output later on. I recommend you stick to writing the binary content and rely on tools like BeautifulSoup to detect the correct encoding later on.
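A minimal sketch of that approach (reusing the URL from the question as a placeholder, and assuming BeautifulSoup 4 is installed):

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('http://www.xf.com/ranking/get/?amount=1&from=left&to=right')

    # Save the raw, undecoded bytes; no codec guessing is involved yet.
    with open('pages/test.html', 'wb') as outfile:
        outfile.write(page.content)

    # Later, hand the bytes to BeautifulSoup, which detects the encoding
    # itself (from meta tags, a BOM, or charset sniffing) while parsing.
    with open('pages/test.html', 'rb') as infile:
        soup = BeautifulSoup(infile.read(), 'html.parser')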
Alternatively, test explicitly for the charset parameter being present and only re-encode (via response.text and io.open(), or otherwise) if requests did not fall back to the Latin-1 default. See Retrieve links from web page using Python and BeautifulSoup for an answer that uses such a method to tell BeautifulSoup which codec to use.
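A rough sketch of that explicit test, again with the question's placeholder URL; inspecting the Content-Type header is one way to check for the parameter, not the only one:

    import io
    import requests

    page = requests.get('http://www.xf.com/ranking/get/?amount=1&from=left&to=right')

    # Only trust page.text if the server actually declared a charset;
    # otherwise requests has fallen back to the Latin-1 default.
    if 'charset=' in page.headers.get('content-type', '').lower():
        # Declared codec: re-encode the decoded text to UTF-8 on disk.
        with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
            outfile.write(page.text)
    else:
        # No declared codec: store the raw bytes and decide later.
        with open('pages/test.html', 'wb') as outfile:
            outfile.write(page.content)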