Retry crawling a start URL if the result failed


#1

Hi,

I have a crawler that crawls several different URLs through proxies, but sometimes the website blocks the crawl and I get a temporary block message. If I relaunch the entire crawler, the same URL can succeed.

I wanted to know if it’s possible to retry a start URL with a different proxy until the result is good, and only then move on to the next start URL. I think the site blocks the crawl because the proxy is blacklisted.

I’m not sure if I’m being clear…

Let’s say I have 3 URLs to crawl and the first one is blocked on the first attempt. Before moving on to the second URL, I want to retry the first one with a new proxy.

In the end, I just want to be sure that all URLs have been crawled successfully.

Thanks


#2

Hi @bastien,

there is a simple solution in the Apify crawler.
You can set Max pages per IP address to 1 in the advanced settings. That way, each page will be crawled with a different IP.
Then you need to enqueue the page again if you find an issue in the data. Here is a simple example of how you can do it in the pageFunction:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;
    var result = {
        title: $('title').text(),
        myValue: $('TODO').text()
    };
    if (context.request.responseStatus !== 200) {
        console.log('Enqueue page again: ', context.request.url);
        // The retry count is carried in the uniqueKey, e.g. "0.123_retryCount=2".
        var retryCountSplitter = context.request.uniqueKey.split('_retryCount=');
        var retryCount = retryCountSplitter.length > 1 ? parseInt(retryCountSplitter[1], 10) + 1 : 1;
        if (retryCount > 4) throw new Error('Retry count reached!');
        context.skipOutput();
        context.enqueuePage({
            url: context.request.url,
            label: context.request.label,
            uniqueKey: Math.random() + '_retryCount=' + retryCount,
        });
    }
    return result;
}
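To make the uniqueKey trick clearer, here is the same retry-count encoding pulled out into a small standalone sketch (plain Node.js, independent of the Apify runtime; the function name nextUniqueKey is just for illustration). The count is parsed back out of the key on each failed visit and incremented, and the random prefix keeps the key unique so the page is actually re-enqueued:

```javascript
// Encode an incremented retry count into a new uniqueKey,
// mirroring the logic in the pageFunction above.
function nextUniqueKey(currentUniqueKey) {
    var parts = currentUniqueKey.split('_retryCount=');
    var retryCount = parts.length > 1 ? parseInt(parts[1], 10) + 1 : 1;
    if (retryCount > 4) throw new Error('Retry count reached!');
    // Math.random() makes the key unique so the crawler does not
    // treat the re-enqueued page as a duplicate.
    return Math.random() + '_retryCount=' + retryCount;
}

// First failure: no marker in the key yet, so the count starts at 1.
var key1 = nextUniqueKey('https://example.com');
console.log(key1.split('_retryCount=')[1]); // "1"

// Second failure: the marker is parsed and incremented.
var key2 = nextUniqueKey(key1);
console.log(key2.split('_retryCount=')[1]); // "2"
```

After four failed attempts the function throws, which in the pageFunction above surfaces the URL as a failed request instead of retrying forever.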