Message390287
Hi, I wrote a web crawler that uses ThreadPoolExecutor to spawn multiple worker threads, retrieves the content of web pages via http.client, and saves it to a file.
After a couple of thousand requests have been processed, the crawler starts to consume memory rapidly, eventually using up all available memory.
tracemalloc shows that the memory which is not being collected was allocated at:
/usr/lib/python3.9/http/client.py:468: size=47.6 MiB, count=6078, average=8221 B
File "/usr/lib/python3.9/http/client.py", line 468
s = self.fp.read()
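For readers who want to collect comparable figures, the following is a minimal sketch of how a per-traceback report of this kind can be obtained with tracemalloc. The helper name `top_allocations` and the synthetic `payload` list are assumptions made for illustration; they are not part of the crawler described here.

```python
import tracemalloc


def top_allocations(limit=5):
    """Return the `limit` largest live allocation sites, grouped by traceback.

    Hypothetical helper for illustration; tracemalloc must already be started.
    """
    snapshot = tracemalloc.take_snapshot()
    return snapshot.statistics('traceback')[:limit]


if __name__ == '__main__':
    tracemalloc.start(10)  # record up to 10 frames per allocation
    # Stand-in for response bodies that stay referenced somewhere.
    payload = [bytes(1000) for _ in range(1000)]
    for stat in top_allocations():
        print(f'{stat.count} blocks, {stat.size / 1024:.1f} KiB')
        for line in stat.traceback.format():
            print(line)
    tracemalloc.stop()
```

Taking one snapshot early and one late, then comparing them with `Snapshot.compare_to`, can help separate steady growth from one-off allocations.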
I have also tested with requests and urllib3, and since they use http.client underneath, the result is always the same.
My code around that:
def get_html3(session, url, timeout=10):
    o = urlparse(url)
    if o.scheme == 'http':
        cn = http.client.HTTPConnection(o.netloc, timeout=timeout)
    else:
        cn = http.client.HTTPSConnection(o.netloc, context=ctx, timeout=timeout)
    cn.request('GET', o.path, headers=headers)
    r = cn.getresponse()
    log.debug(f'[*] [{url}] Status: {r.status} {r.reason}')
    if r.status not in [400, 403, 404]:
        ret = r.read().decode('utf-8')
    else:
        ret = ""
    r.close()
    del r
    cn.close()
    del cn
    return ret
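As a point of comparison, here is a sketch of the same fetch written with context managers, so that the response and the connection are closed even when an exception is raised part-way through. It is not claimed to change the memory behaviour reported here. The name `fetch_text`, the explicit `headers` and `context` parameters, and the `'/'` fallback for an empty path are assumptions introduced for this example.

```python
import contextlib
import http.client
import ssl
from urllib.parse import urlparse


def fetch_text(url, timeout=10, headers=None, context=None):
    """Hypothetical variant of the function above: return the decoded body,
    or "" for status 400, 403, or 404."""
    o = urlparse(url)
    if o.scheme == 'http':
        cn = http.client.HTTPConnection(o.netloc, timeout=timeout)
    else:
        # Assumption: fall back to a default TLS context if none is given.
        cn = http.client.HTTPSConnection(
            o.netloc, timeout=timeout,
            context=context or ssl.create_default_context())
    # HTTPConnection has no __enter__/__exit__, so wrap it in closing().
    with contextlib.closing(cn):
        cn.request('GET', o.path or '/', headers=headers or {})
        # HTTPResponse can be used in a with statement.
        with cn.getresponse() as r:
            if r.status in (400, 403, 404):
                return ""
            return r.read().decode('utf-8')
```

With this shape the explicit `close()` and `del` calls become unnecessary, because both objects go out of scope as soon as the function returns.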
Date | User | Action | Args
2021-04-06 07:43:38 | HynekPetrak | set | recipients: + HynekPetrak
2021-04-06 07:43:38 | HynekPetrak | set | messageid: <1617695018.24.0.921353674575.issue43741@roundup.psfhosted.org>
2021-04-06 07:43:38 | HynekPetrak | link | issue43741 messages
2021-04-06 07:43:37 | HynekPetrak | create |