I have a large list of HTTP user agent strings (taken from a pandas DataFrame) that I am trying to parse using the Python implementation of ua-parser. I can parse the list fine when using a single thread, but based on preliminary speed testing, it would take me on the order of 10 hours to run the whole dataset.

I am trying to use pool.map() to decrease the processing time, but I can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched (this is likely a duplicate of some sort, since there are a lot of similar questions), but none of my dozens of attempts have worked, for one reason or another. I'm assuming/hoping it's an easy fix.
Here is what I have so far:
    import multiprocessing as mp
    from ua_parser import user_agent_parser

    http_str = df['user_agents'].tolist()

    def uaparse(http_str):
        for i, item in enumerate(http_str):
            return user_agent_parser.parse(http_str[i])

    pool = mp.Pool(processes=10)
    parsed = pool.map(uaparse, range(0, len(http_str)))
Right now I'm seeing the following error message:
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-25-701fbf58d263> in <module>()
          7
          8 pool = mp.Pool(processes=10)
    ----> 9 results = pool.map(uaparse, range(0,len(http_str)))

    /home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
        249         '''
        250         assert self._state == RUN
    --> 251         return self.map_async(func, iterable, chunksize).get()
        252
        253     def imap(self, func, iterable, chunksize=1):

    /home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
        565             return self._value
        566         else:
    --> 567             raise self._value
        568
        569     def _set(self, i, obj):

    TypeError: 'int' object is not iterable
Thanks in advance for any assistance/direction you can provide.
It seems like all you need is:
    http_str = df['user_agents'].tolist()
    pool = mp.Pool(processes=10)
    parsed = pool.map(user_agent_parser.parse, http_str)
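The reason the original attempt fails is that pool.map() calls the worker function once per element of the iterable. With range(0, len(http_str)) as the iterable, each call receives a single int, and enumerate() on an int raises the TypeError shown in the traceback. Below is a minimal sketch of both the failure and the fix. To keep it runnable without ua-parser or the DataFrame, `stand_in_parse` is a hypothetical placeholder for user_agent_parser.parse, and `uaparse_broken` and `parse_in_parallel` are illustrative names that do not appear in the question.

```python
import multiprocessing as mp


def stand_in_parse(user_agent):
    """Hypothetical stand-in for user_agent_parser.parse: one string in, one dict out."""
    return {"string": user_agent, "family": user_agent.split("/")[0]}


def uaparse_broken(index):
    # pool.map hands the worker ONE element of the iterable. When the
    # iterable is range(len(...)), that element is an int, so enumerate()
    # raises TypeError: 'int' object is not iterable.
    for i, item in enumerate(index):
        return stand_in_parse(item)


def parse_in_parallel(user_agents, processes=2):
    # Pass the strings themselves; each worker call receives one string.
    pool = mp.Pool(processes=processes)
    try:
        return pool.map(stand_in_parse, user_agents)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    agents = ["Mozilla/5.0 (X11; Linux x86_64)", "curl/7.68.0"]
    print(parse_in_parallel(agents))
```

With the real library, the worker would simply be user_agent_parser.parse, as in the three lines above. The `if __name__ == "__main__":` guard is there because some platforms start worker processes by re-importing the main module.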