Inconsistent results


#1

It seems like the crawler is not providing proper error reporting. For example, I have a crawler which crawls a standard set of web pages. If I run the crawler twice, I get substantially different results even though the web pages have not changed. For the pages I’m crawling right now, there should be exactly 203 results; however, sometimes I get 240 results and sometimes 160. I updated the pageresults to retry pages with 403 errors per "Dealing with server errors / proxies …", but I am still not getting consistent results. Any ideas?


#3

Hi @jstanley,

It looks like you are trying to scrape Yelp.
We are having issues there because some of our proxy IPs have been banned.
Requests through the banned proxies end with a block screen and status code 503.

I checked your code: you retry the page there, but you never increment the number of retries, so the crawler gives up after the first retry.
I improved your code:

function pageFunction(context) {
    // Skip parsing when the response is unusable; the page is re-enqueued instead.
    if (isUnacceptableResponseStatus()) {
        return;
    }
    var $ = context.jQuery;
    var results = [];
    // your code here
    return results;

    function isUnacceptableResponseStatus() {
        // May want to change to >= 400 to allow redirects.
        if (context.request.responseStatus != 200) {
            var retries = getRetriesCount() + 1; // Increment retries!
            if (retries > 10) {
                // Don't enqueue again; at this point just give up.
                // This error appears in the results, so we know that the page failed.
                throw new Error('Can not open page after 10 retries!');
            }
            // uniqueKey is required to re-load the same URL.
            context.enqueuePage({
                url: context.request.url,
                uniqueKey: context.request.url + '#retry=' + retries
            });
            return true;
        }
        return false;
    }

    function getRetriesCount() {
        if (context.request.uniqueKey.indexOf('#retry=') > 0) {
            // Parse the counter as a base-10 number, not as a string.
            return parseInt(context.request.uniqueKey.split('#retry=').pop(), 10);
        }
        // First attempt: no retries have happened yet.
        return 0;
    }
}

It should work like that. If it does not, you need new proxies. We can set some up for you; just let us know in the chat in the Apify app. :slight_smile:
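
If you want to check the retry counting on its own, outside the crawler, here is a minimal sketch. It assumes the counter is stored after '#retry=' in the uniqueKey, as in the code above. The names RETRY_MARKER, MAX_RETRIES and nextRetryKey are made up for this example and are not part of the crawler API, and the URL is a placeholder.

```javascript
// Hypothetical standalone helpers; they only mimic the uniqueKey bookkeeping.
var RETRY_MARKER = '#retry=';
var MAX_RETRIES = 10;

// Read the retry counter from a uniqueKey; 0 means "first attempt".
function getRetriesCount(uniqueKey) {
    var index = uniqueKey.lastIndexOf(RETRY_MARKER);
    if (index === -1) {
        return 0;
    }
    var parsed = parseInt(uniqueKey.slice(index + RETRY_MARKER.length), 10);
    return isNaN(parsed) ? 0 : parsed;
}

// Build the uniqueKey for the next retry, or return null when the limit is reached.
function nextRetryKey(url, uniqueKey) {
    var retries = getRetriesCount(uniqueKey) + 1;
    if (retries > MAX_RETRIES) {
        return null;
    }
    return url + RETRY_MARKER + retries;
}

// Simulate a page that fails on every attempt.
var url = 'https://example.com/page'; // placeholder URL
var key = url;
var attempts = 1; // the original request
while ((key = nextRetryKey(url, key)) !== null) {
    attempts++;
}
console.log(attempts); // 11: the original request plus 10 retries
```

Because every retry gets a different uniqueKey, the same URL is not treated as a duplicate, and because the counter goes up each time, the loop is guaranteed to stop.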