Problem with pseudo url regex


#1

I have a problem with matching pseudo-URLs within my actor.

I want to scrape articles about motorcycles from MSN.

I start here: https://www.msn.com/de-at/autos/motorrad/

All the links on this page seem to have the format:

/de-at/autos/nachrichten/bmw-s-1000-rr-im-fahrbericht/ar-BBUqR4V

I initialize my RequestQueue here:

const requestQueue = await Apify.openRequestQueue('msn.com/de-at/autos/motorrad');
await requestQueue.addRequest({ url: 'https://www.msn.com/de-at/autos/motorrad/' });    

and I enqueue links with these options:

const options = {
    $,
    requestQueue,
    pseudoUrls: ['https:\/\/www\.msn\.com\/de-at\/autos\/nachrichten\/.*'],
    baseUrl: 'https://www.msn.com',
};
await Apify.utils.enqueueLinks(options);

I have also tried to use the PseudoUrl class:

new Apify.PseudoUrl("https://www.msn.com/de-at/autos/nachrichten/[.*]")

This doesn’t work either. No further pages are crawled…

const url = new Apify.PseudoUrl("https://www.msn.com/de-at/autos/nachrichten/[.*]");
log.debug("match? " + url.matches("https://www.msn.com/de-at/autos/nachrichten/druid-motorcycles-sorcerer-hybrid-exv/ar-BBTB0vw"));

… prints “true”, so the pseudo-URL seems to match. But why isn’t any link being enqueued?
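I also did a sanity check with a plain RegExp, independent of the SDK. This assumes the [.*] bracket in a PseudoUrl marks a regex section while the rest of the pattern is matched literally, so the whole thing should be roughly equivalent to:

```javascript
// Plain-JS sanity check, no Apify dependency.
// Assumption: the PseudoUrl above compiles to roughly this regular expression.
const pattern = /^https:\/\/www\.msn\.com\/de-at\/autos\/nachrichten\/.*$/;

const articleUrl = 'https://www.msn.com/de-at/autos/nachrichten/druid-motorcycles-sorcerer-hybrid-exv/ar-BBTB0vw';
console.log(pattern.test(articleUrl)); // true
```

So the pattern itself matches the article URLs just fine.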

… it seems as if the crawler doesn’t process the <a> tags on the page?


#2

I don’t see any immediate problem. Are you running locally or on the platform? If locally, could you share the full source code? If on the platform, could you share the actor ID?

The only thing that comes to mind is the fact that you’re using a named RequestQueue. Named storages are not automatically purged by apify run -p, so the queue might still be in a completed state, preventing the crawl from running.
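If that’s the case locally, deleting the persisted queue directory resets it. A rough sketch, assuming the default local storage layout of apify_storage/request_queues/&lt;queue-name&gt; (the queue name below is hypothetical; adjust the path to your SDK version and actual queue name):

```shell
# Assumption: named queues are persisted under ./apify_storage/request_queues/.
# Simulate a stale named queue left over from a previous run:
mkdir -p apify_storage/request_queues/my-named-queue
# Delete it so the next run starts with a fresh, non-completed queue:
rm -rf apify_storage/request_queues/my-named-queue
```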

Unless you specifically need a named queue, we suggest using the default RequestQueue by calling await Apify.openRequestQueue() with no arguments.