Wrong pages crawled by "Puppeteer Crawler"


#1

Hi,
I tested the examples on the SDK pages:

First the example on the SDK page itself, https://sdk.apify.com/ (it crawls iana.org) … then the example on the “Puppeteer Crawler” page, https://sdk.apify.com/docs/examples/puppeteercrawler (it should crawl Hacker News).

When I run the “Puppeteer Crawler” example, the iana.org pages are shown on the command line:


Also the browser shows the iana.org page:
[screenshot: PuppeteerCrawler browser window showing iana.org]

How does this happen? How can I crawl the right pages?

Regards,
Wolfgang

PS: Source of PuppeteerCrawler.js
const Apify = require('apify');

Apify.main(async () => {
    // Create and initialize an instance of the RequestList class that contains the start URL.
    const requestList = new Apify.RequestList({
        sources: [
            { url: 'https://news.ycombinator.com/' },
        ],
    });
    await requestList.initialize();

    // Apify.openRequestQueue() is a factory to get a preconfigured RequestQueue instance.
    const requestQueue = await Apify.openRequestQueue();

    // Create an instance of the PuppeteerCrawler class - a crawler
    // that automatically loads the URLs in headless Chrome / Puppeteer.
    const crawler = new Apify.PuppeteerCrawler({
        // The crawler will first fetch start URLs from the RequestList
        // and then the newly discovered URLs from the RequestQueue.
        requestList,
        requestQueue,

        // Here you can set options that are passed to the Apify.launchPuppeteer() function.
        // For example, you can set "slowMo" to slow down Puppeteer operations to simplify debugging.
        launchPuppeteerOptions: { slowMo: 500 },

        // Stop crawling after several pages.
        maxRequestsPerCrawl: 10,

        // This function will be called for each URL to crawl.
        // Here you can write the Puppeteer scripts you are familiar with,
        // with the exception that browsers and pages are automatically managed by the Apify SDK.
        // The function accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
        handlePageFunction: async ({ request, page }) => {
            console.log(`Processing ${request.url}...`);

            // A function to be evaluated by Puppeteer within the browser context.
            const pageFunction = ($posts) => {
                const data = [];

                // We're getting the title and URL of each post on Hacker News.
                $posts.forEach(($post) => {
                    data.push({
                        title: $post.querySelector('.title a').innerText,
                        href: $post.querySelector('.title a').href,
                    });
                });

                return data;
            };
            const data = await page.$$eval('.athing', pageFunction);

            // Store the results to the default dataset.
            await Apify.pushData(data);

            // Find the link to the next page using Puppeteer functions.
            let nextHref;
            try {
                nextHref = await page.$eval('.morelink', el => el.href);
            } catch (err) {
                console.log(`${request.url} is the last page!`);
                return;
            }

            // Enqueue the link to the RequestQueue.
            await requestQueue.addRequest(new Apify.Request({ url: nextHref }));
        },

        // This function is called if the page processing failed more than maxRequestRetries + 1 times.
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed too many times.`);
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});


#2

Hi there, I think what is happening is that both the RequestList and RequestQueue classes in the Apify SDK store their state in the local storage directory, by default ./apify_storage. You should delete that directory before every run so that the crawler starts from a clean state.

BTW if you’re using Apify CLI, you can do this by running apify run --purge.

Please let me know whether this helped. We’ll look into how we can make this clearer in the documentation.


#3

Thanks, this was it. I discovered it independently and fixed it by calling the delete method of the request queue, see below. This works too.

Regards,
Wolfgang

const requestQueue = await Apify.openRequestQueue();
// Delete the old request queue at initialization.
await requestQueue.delete();