Links matching regex are not crawled


#1

Hi!
I would like to scrape articles from this site: https://www.crossnews.at/category/news/national_news/technik/

I am running an actor like this:

Apify.main(async () => {
    log.debug("entered Apify.main");
    const requestQueue = await Apify.openRequestQueue('crossnews.at');

    await requestQueue.addRequest({ url: 'https://www.crossnews.at/category/news/national_news/technik/' });
    log.debug("requestQueue.isEmpty: " + requestQueue.isEmpty);
   
    // Open a named dataset
    const dataset = await Apify.openDataset('crossnews.at');

    const handlePageFunction = async ({ $, request, response }) => {
        log.info("entered crawler");        
        log.info('crawling url: ' + request.url.href);
        ......
        const options = {
            $,
            requestQueue,
            selector: 'a',
            pseudoUrls : [ 'http[s?]:\/\/[-\w.]+\.crossnews\.at\/category\/news\/national_news\/technik\/.*',
                           'http[s?]:\/\/[-\w.]+\.crossnews\.at\/category\/news\/national_news\/technik\/page\/\d\/',
                           'http[s?]:\/\/[-\w.]+\.crossnews\.at\/\d\d\d\d\/\d\d\/.*' ],
            baseUrl: "https://www.crossnews.at/"
        };
        await Apify.utils.enqueueLinks(options);
        log.debug(requestQueue.isEmpty);
    }

    const autoscaledPoolOptions = {
        maxConcurrency : 10
    }

    const crawler = new Apify.CheerioCrawler( {
        maxRequestsPerCrawl: 4000,
        requestQueue,
        handlePageFunction,
        autoscaledPoolOptions,

        handleFailedRequestFunction: async ({ request }) => {
            log.debug(`Request ${request.url} failed twice.`);
        }
    })

    log.debug("starting crawler run....");
    await crawler.run();
    log.debug('Crawler finished.....');
});

So, for example, this URL should match the regex:

https://www.crossnews.at/2018/11/neu-ktm-sx-e-5/

https://[-\w.]+.crossnews.at/\d\d\d\d/\d\d/.*

So I have no idea why the crawler won't go into this subpage.
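
If I test that third pattern directly as a plain RegExp in Node (writing the optional s as https?), it does match the article URL:

// The third pattern from my pseudoUrls list, written as a RegExp literal:
const re = /https?:\/\/[-\w.]+\.crossnews\.at\/\d\d\d\d\/\d\d\/.*/;

// The article URL the crawler never visits:
console.log(re.test('https://www.crossnews.at/2018/11/neu-ktm-sx-e-5/')); // true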


#2

Hi,

the main problem is your pseudoUrls configuration. It does not directly take a regexp, but a PURL, which is basically a URL with the parts in brackets matched as regular expressions. This is how yours should look:

'http[s?]://[[-\w.]+].crossnews.at/category/news/national_news/technik/page/[\\d]/',
'http[s?]://[[-\w.]+].crossnews.at/[\\d+]/[\\d+]/[.*]'

#3

Well, either way doesn’t work. I have also tried using Apify.Pseudourl("…") without success


#4

I have a doubt about [[-\w.]+]
Test with [[-\\w.]+]

Can you give examples of URLs that don't match?
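
To check a single pattern in isolation, you can use the SDK's PseudoUrl class directly, something like this (a quick sketch that only matches the host part; note the double backslash inside the string):

const Apify = require('apify');

// '\\w' keeps the backslash inside the JS string, '\w' does not.
const purl = new Apify.PseudoUrl('http[s?]://[[-\\w.]+].crossnews.at/[.*]');

console.log(purl.matches('https://www.crossnews.at/2018/11/neu-ktm-sx-e-5/')); // should print true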


#5

Hi,

There were several issues with your code. I’ve updated it and will try to explain what went wrong:

const Apify = require('apify');

const { utils: { log } } = Apify;

Apify.main(async () => {
    log.debug("entered Apify.main");
    const requestQueue = await Apify.openRequestQueue();

    await requestQueue.addRequest({ url: 'https://www.crossnews.at/category/news/national_news/technik/' });
    log.debug("requestQueue.isEmpty: " + await requestQueue.isEmpty());

    // Open a named dataset
    const dataset = await Apify.openDataset();

    const handlePageFunction = async ({ $, request, response }) => {
        log.info("entered crawler");
        log.info('crawling url: ' + request.url);
        const options = {
            $,
            requestQueue,
            selector: 'a',
            pseudoUrls : [
                'http[s?]://[[-\\w.]+].crossnews.at/category/news/national_news/technik/[.*]',
                'http[s?]://[[-\\w.]+].crossnews.at/category/news/national_news/technik/page/[\\d+]/',
                'http[s?]://[[-\\w.]+].crossnews.at/[\\d\\d\\d\\d/\\d\\d]/[.*]'
            ],
            baseUrl: "https://www.crossnews.at/"
        };
        await Apify.utils.enqueueLinks(options);
        log.debug('requestQueue.isEmpty: ' + await requestQueue.isEmpty());
    }

    const autoscaledPoolOptions = {
        maxConcurrency : 10
    }

    const crawler = new Apify.CheerioCrawler( {
        maxRequestsPerCrawl: 4000,
        requestQueue,
        handlePageFunction,
        autoscaledPoolOptions,

        handleFailedRequestFunction: async ({ request }) => {
            log.debug(`Request ${request.url} failed twice.`);
        }
    })

    log.debug("starting crawler run....");
    await crawler.run();
    log.debug('Crawler finished.....');
});

First of all, you’re using a named request queue and a named dataset. This is unnecessary and may cause headaches: named storages are not automatically deleted when using apify run -p, so when you rerun the actor, the request queue does not empty itself and the crawl finishes immediately (because all the requests in the queue have already been handled).
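
In other words, during development just open the default storages so every apify run -p starts from a clean state (a minimal sketch of the relevant lines):

const Apify = require('apify');

Apify.main(async () => {
    // Default (unnamed) storages are purged by `apify run -p`,
    // so each local run starts with an empty queue and dataset:
    const requestQueue = await Apify.openRequestQueue();
    const dataset = await Apify.openDataset();

    // Named storages persist between local runs and have to be cleared manually:
    // const requestQueue = await Apify.openRequestQueue('crossnews.at');
    // const dataset = await Apify.openDataset('crossnews.at');
});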

Second, the Pseudo URLs were wrong. You need to distinguish between the string form and the RegExp form and escape characters accordingly. When using a string, you need to escape all backslashes, which turns them into double backslashes; in a RegExp, you only escape the regex special characters. (There is a quick console check after the two examples below.)

string

[
    'http[s?]://[[-\\w.]+].crossnews.at/category/news/national_news/technik/[.*]',
    'http[s?]://[[-\\w.]+].crossnews.at/category/news/national_news/technik/page/[\\d+]/',
    'http[s?]://[[-\\w.]+].crossnews.at/[\\d\\d\\d\\d/\\d\\d]/[.*]'
]

vs RegExp

[
    /^https?:\/\/[-\w.]+\.crossnews\.at\/category\/news\/national_news\/technik\/.*/,
    /^https?:\/\/[-\w.]+\.crossnews\.at\/category\/news\/national_news\/technik\/page\/\d+/,
    /^https?:\/\/[-\w.]+\.crossnews\.at\/\d\d\d\d\/\d\d\/.*/
]
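
The quickest way to see why the string form needs double backslashes is to print the patterns in plain Node (nothing Apify-specific here):

// In a JS string literal, an unrecognized escape such as \d silently drops the backslash:
console.log('\d');   // prints: d
console.log('\\d');  // prints: \d

// A RegExp literal keeps \d as-is, so no extra escaping is needed:
console.log(/\d+/.source); // prints: \d+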

Also, at the top of handlePageFunction, the log message should use request.url, not request.url.href (request.url is already a plain string, so .href is undefined), but that's just a minor thing.

Hope this helped.