Actors and monthly crawling limit

I have to crawl about 5,000 pairs of webpages (so 10,000 in total). I have written my crawler, only to realize that I would blow through the monthly limit of the free plan. I wish I could get a paid plan, but unfortunately I really cannot afford one at the moment (I wonder what the limits are for paid plans, though? They don’t appear on https://apify.com/pricing).

Then I guessed I might get around it with an actor, which this post seems to confirm:

With an Actor you can crawl/scrape even more than 10k pages on a free plan; it depends on the approach (pure requests / Puppeteer in headless mode / non-headless mode / AJAX calls / etc.)

But I’m not sure I understand it correctly: is there a particularly efficient approach? Or does it just mean that it depends on the memory required, in which case the size of the webpages matters as well? What if I just feed my crawler to the actor?

Also, how are actor compute units computed? The pricing page indicates 1 unit = 1 hour @ 1 GB RAM, but is the relationship between the two linear? I.e., 1 unit = 1/2 hour @ 2 GB = 4 hours @ 250 MB?

Finally, reading this example, I am also guessing that if an actor calls a crawler, the crawler is still subject to the monthly crawling limit. Am I right?

Edit: I just realized apify/web-scraper tasks might be the actual successors of crawlers.

We don’t have pricing for the legacy crawler on the pricing page. The crawler product was replaced with this actor, which has the same interface as the old crawler product, so you need to migrate your old crawlers to actor tasks. The good thing is that the actor is charged by compute units (CU), so you can optimize it to decrease its usage, which you could not do with the old crawler product.
If you want to decrease your usage and you don’t need a browser for crawling, you can use cheerio-scraper. It uses plain HTTP requests for crawling, and it is about 10 times cheaper because you don’t need to open a browser.
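For illustration, here is a minimal sketch of the same idea with the CheerioCrawler class from the Apify SDK (the URLs and the title selector are placeholders, not part of your actual task):

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Placeholder sources - in practice this would be your ~10,000 pages.
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'https://example.com/page-1' },
            { url: 'https://example.com/page-2' },
        ],
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        // $ is a Cheerio handle over the downloaded HTML - no browser is started.
        handlePageFunction: async ({ request, $ }) => {
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
});
```

Because each page is fetched as a single HTTP request and parsed with Cheerio instead of being rendered in a full browser, the memory and CPU footprint per page is much smaller, which is where the cost saving comes from.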

A compute unit is calculated as 1 GB of memory running for 1 hour. So if you have an actor with 4 GB of memory running for 4 hours, it will consume 16 compute units from your account.
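In other words, usage is linear in both memory and run time, so the arithmetic in your question is right. As a quick sketch of the formula (my own restatement, with 250 MB rounded to 256 MB for simplicity):

```javascript
// Compute units = memory in GB * run time in hours (linear in both factors).
const computeUnits = (memoryGB, hours) => memoryGB * hours;

computeUnits(1, 1);    // 1 CU  - the baseline from the pricing page
computeUnits(4, 4);    // 16 CU - the example above
computeUnits(2, 0.5);  // 1 CU  - half an hour at 2 GB
computeUnits(0.25, 4); // 1 CU  - four hours at 256 MB
```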

I hope it helps.

OK, thanks. I managed to migrate my crawler to legacy-phantomjs-crawler (I had unrelated issues with the target pages).

It ran successfully at first, but after 23 minutes (3,200 pages out of 4,900) it stopped crawling while still consuming CU for about 40 more minutes, exceeding my CU quota and logging only lines like:

2019-05-29T10:49:06.278Z INFO: AutoscaledPool state {"currentConcurrency":0,"desiredConcurrency":23,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":0.019956777199972985},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":0.08483290488431877},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.3,"actualRatio":0}}}
2019-05-29T10:49:35.116Z INFO: AutoscaledPool state {"currentConcurrency":0,"desiredConcurrency":23,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":0.07894736842105263},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.3,"actualRatio":0}}}
2019-05-29T10:50:35.121Z INFO: AutoscaledPool state {"currentConcurrency":0,"desiredConcurrency":23,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":0.07374301675977654},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.3,"actualRatio":0}}}

Judging by the log (currentConcurrency stuck at 0 while isSystemIdle is true, so nothing was actually being processed), it seems it would still be running if I hadn’t aborted it.

I wonder whether the issue comes from my script or whether there were server-side issues. I noticed the user interface went blank for a while afterwards.

Here is the log, FWIW.

Hi @apify1,

sadly, this is a known issue and our team is working hard to resolve it as soon as possible.

I have added 5 extra compute units to your account for the run with issues.

Best,
Ondra

Thanks. Does this issue appear randomly, or do you know how it is triggered?

@mnmkng Please let me know when the issue is fixed.

Hi @apify1, the issue is fixed now. You may still see some incorrect log messages when the actor is about to finish, but it will no longer stall. Those log messages will stop appearing in the coming days, when we release a new version.
