Dealing with server errors / proxies


#1

Suppose I want to crawl example.com and their server responds with a 4xx or 5xx status code (e.g. 403 Forbidden). From my tests, it appears the Apify crawler does not recognize that the page load has failed and incorrectly sets willLoad to true in the request object.

This brings me to several questions:

  1. Am I understanding this correctly? Does the Apify crawler need a timeout in order to retry a page load?
  2. I expect I will want to update the intercept request function to set willLoad to false when the server response is a 4xx or 5xx error. Is that right?
  3. I would like to keep trying to load a page (e.g. n tries) until it gives me a 200 response. It would be best if I could tell Apify to slow its throttle each time there is an error so that we don’t hammer the server. Is there a setting/feature for this, or do I have to build it?
  4. I would like to retire a proxy (at least temporarily) if a server responds with an error code (especially 4xx errors). Is there a setting for this, or do I have to build it?

Apologies if this is explained elsewhere; I’ve been looking through the docs, the forum, and the library without finding this topic.


#2

Hi Jonathan,

The crawler retries the request in case of a network error (DNS problem, timeout, etc.). In case of a 4xx or 5xx response the pageFunction gets executed as usual, but you can implement the retry on your side:

  • Check if context.request.responseStatus >= 400
  • If so, then enqueue the page again using a uniqueKey that contains the number of retries (here 3, for example)
context.enqueuePage({
   url: context.request.url,
   uniqueKey: context.request.url + '?retry=3'
});
  • The number of retries is stored in the retry parameter of the uniqueKey, so you can limit the number of retries (see the sketch below)
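
For illustration, a minimal sketch of that check inside the pageFunction (the limit of 3 retries and the '?retry=' marker are just example choices, not built-in behaviour):

function pageFunction(context) {
    var MAX_RETRIES = 3; // arbitrary limit for this sketch

    if (context.request.responseStatus >= 400) {
        // Parse the retry count back out of the uniqueKey (0 on the first attempt).
        var parts = context.request.uniqueKey.split('?retry=');
        var retries = parts.length > 1 ? parseInt(parts.pop(), 10) : 0;

        if (retries < MAX_RETRIES) {
            // Re-enqueue the same URL; the changed uniqueKey forces a fresh load.
            context.enqueuePage({
                url: context.request.url,
                uniqueKey: context.request.url + '?retry=' + (retries + 1)
            });
        }
        return; // skip extraction for this failed load
    }

    // ...normal extraction code here...
}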

Does this work for your usecase?
Let me know if you have any questions.

Marek


#3

This at least put me in the right direction for dealing with 403 errors and similar. It would be nice if this were just built into the crawler, but here is a default page function that resolves this:

function pageFunction(context) {
    if (isUnacceptableResponseStatus()) {
        return;
    }
    var $ = context.jQuery;
    var results = [];
    // your code here
    return results;

    function isUnacceptableResponseStatus() {
        if (context.request.responseStatus != 200) { // may want to change to >= 400 to allow redirects
            var retries = getRetriesCount();
            if (retries >= 10) {
                // Log error so we know that the page failed.
                throw new Error('Gave up on this page after 10 retries!');
            } else {
                // A changed uniqueKey is required to re-load the same URL.
                context.enqueuePage({
                    url: context.request.url,
                    uniqueKey: context.request.url + '#retry=' + retries
                });
            }
            return true;
        }
        return false;
    }

    function getRetriesCount() {
        var retries;
        if (context.request.uniqueKey.indexOf('#retry=') > 0) {
            retries = parseInt(context.request.uniqueKey.split('#retry=').pop(), 10) + 1;
        } else {
            retries = 1;
        }
        return retries;
    }
}

#4

There were some bugs in this code. I improved it in another thread.


#5

Thanks. I made updates based on what you said. Also, can you suggest a modification to slow down the crawler? I think some sort of throttle that slows down as the number of retries increases would be helpful. Would something like this modification work?

setTimeout(function () {
    context.enqueuePage({
        url: context.request.url,
        uniqueKey: context.request.url + '#retry=' + retries
    });
}, retries * 1000); // uniqueKey required to re-load same url
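
One thing I’m not sure about: I guess the pageFunction has to be kept open until the delayed enqueue actually fires. Assuming the crawler’s context.willFinishLater()/context.finish() calls work the way I think they do (that’s an assumption on my part), a sketch would be:

// Sketch only: assumes context.willFinishLater()/context.finish() keep the page
// open until the delayed enqueue has run.
context.willFinishLater();
setTimeout(function () {
    context.enqueuePage({
        url: context.request.url,
        uniqueKey: context.request.url + '#retry=' + retries
    });
    context.finish(); // signal that the pageFunction is done
}, retries * 1000); // back off linearly with the retry count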

#6

If you want to slow down the crawler you can use the Parallel crawling processes and Delay between requests options in the advanced settings of the crawler. These can slow it down.


#7

I was hoping to have a variable delay between requests. For example, if a page is on its 10th try, I’d like to delay 10s or maybe 100s. But if a page is on the 1st try, 0s or 1s is fine.
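
Just to make the delay curve concrete, something like this is what I have in mind (the numbers are made up and retryDelayMs is just a helper name I’m inventing here):

// Sketch of a variable delay: grows with the retry count, capped at 100 s.
function retryDelayMs(retries) {
    var delay = retries * retries * 1000; // 1 s on the 1st try, 100 s on the 10th
    return Math.min(delay, 100 * 1000);
}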


#8

It makes sense, then your code should work :+1: