I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job.
I also work on web scraping, using optimized C#, and it ends up CPU-bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I am pretty sure your CPU is not using SSE 4.2 at all, as that feature is typically only reachable from C/C++ (through intrinsics) and Python does not expose it directly.
If you do the math, you quickly end up compute-bound rather than memory-bound.
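
One way to check the CPU-bound claim on your own machine is to compare CPU time against wall-clock time while parsing. The sketch below is only an illustration, not your spider: it uses the standard library's `html.parser` instead of Scrapy's selectors, and the `LinkCounter` class, the `cpu_fraction` helper and the `SAMPLE` page are all made up for the example. A ratio close to 1 suggests the step is CPU-bound; a much lower ratio suggests it spends most of its time waiting on something else, such as the network.

```python
import time
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags; a stand-in for whatever a real spider extracts."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def cpu_fraction(html, repeats=200):
    """Parse `html` `repeats` times; return (cpu_time / wall_time, links_per_page)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    links = 0
    for _ in range(repeats):
        parser = LinkCounter()
        parser.feed(html)
        parser.close()
        links = parser.links
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall, links


# Synthetic page, invented for illustration only.
SAMPLE = "<html><body>" + '<p><a href="/x">x</a></p>' * 500 + "</body></html>"

if __name__ == "__main__":
    fraction, links = cpu_fraction(SAMPLE)
    print(f"links per page: {links}, CPU/wall ratio: {fraction:.2f}")
```

To get a meaningful number for a real crawl, you would wrap the whole fetch-and-parse step of your own code in the same two timers rather than the synthetic page used here.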