Request queue not working


#1

Hi, so I’ve been struggling for days trying to make my crawlers work. I followed the installation instructions and built a few crawlers based on a list of URLs, and everything was good. But now I need to handle pagination, and nothing I try will make the damned thing work. My knowledge of Node and back-end stuff is zero, but given that I’m trying to do some really basic stuff, I would have thought I could follow a few instructions and examples and adapt them to my context.

So. I’ve given up on my code and gone back to scratch. I just want to run a script from the ‘getting started’ tutorial. Under the heading ‘Putting it all together’, there’s some code which should go to a page, add some URLs to the request queue and run them. I’ve copy/pasted this, and…

ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.apify.com","retryCount":1}
TypeError [ERR_INVALID_URL]: Invalid URL: /
at onParseError (internal/url.js:241:17)
at parse (internal/url.js:257:3)
at new URL (internal/url.js:332:5)
at links.filter (C:\scrape\3.js:17:34)
at Array.filter (<anonymous>)
at CheerioCrawler.handlePageFunction (C:\scrape\3.js:16:39)
at CheerioCrawler._handleRequestFunction (C:\scrape\node_modules\apify\build\cheerio_crawler.js:342:50)
at process._tickCallback (internal/process/next_tick.js:68:7)

Does anyone have an idea as to what may be going wrong?


#2

Hello @tobyjohnson10,

It would be great if you could share your source code, so we could take a look. Copy/pastes sometimes don’t work as expected.

Nevertheless, I have a hunch. It looks like you’re missing the baseUrl argument in your enqueueLinks() call, so your relative links are not getting resolved.

Try reading this part of the tutorial and see if it helps.

This may also be helpful.
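For illustration, here’s a minimal sketch of what I mean, assuming you’re calling Apify.utils.enqueueLinks() inside your handlePageFunction (the selector and pseudo-URL pattern are just placeholders):

const handlePageFunction = async ({ request, $ }) => {
    await Apify.utils.enqueueLinks({
        $,
        requestQueue,
        selector: 'a[href]',
        // baseUrl lets enqueueLinks() resolve relative hrefs such as "/pricing"
        // against the page that was just loaded.
        baseUrl: request.loadedUrl,
        pseudoUrls: ['https://www.apify.com[.*]'],
    });
};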


#3

Thanks for the reply mnmkng.
So the code that’s creating that error isn’t mine - that’s the tutorial code from the getting started page:

const { URL } = require('url'); // <------ This is new.
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.apify.com' });

    const handlePageFunction = async ({ request, $ }) => {
        const title = $('title').text();
        console.log(`The title of "${request.url}" is: ${title}.`);

        // Here starts the new part of handlePageFunction.
        const links = $('a[href]').map((i, el) => $(el).attr('href')).get();

        const ourDomain = 'apify.com';
        const sameDomainLinks = links.filter((link) => {
            const linkHostname = new URL(link).hostname;
            return linkHostname.endsWith(ourDomain);
        });

        for (const url of sameDomainLinks) {
            console.log(`Enqueueing ${url}`);
            await requestQueue.addRequest({ url });
        }
    };

    const crawler = new Apify.CheerioCrawler({
        maxRequestsPerCrawl: 20, // <------ This is new too.
        requestQueue,
        handlePageFunction,
    });

    await crawler.run();
});

I would have thought there was no way that tutorial code should be erroring, so I’m assuming something about my environment or installation is broken. But I have no idea what to check.


#4

Actually, it’s not your environment @tobyjohnson10. It fails for me too. There must’ve been some changes on the Apify website that went unnoticed and are currently breaking the example.

Thanks a lot for bringing this to the forum, we’ll update the example as soon as we can.
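In the meantime, if you want to get the example running locally: the error comes from relative hrefs like "/" on the page, which new URL() cannot parse on its own. Passing the loaded page’s URL as the second argument should resolve them. A rough sketch of just the changed part of handlePageFunction (untested, so treat it as a starting point):

const links = $('a[href]').map((i, el) => $(el).attr('href')).get();

const ourDomain = 'apify.com';
const sameDomainLinks = links.filter((link) => {
    // new URL(link, base) resolves relative hrefs such as "/pricing" against
    // the page that was just loaded, instead of throwing ERR_INVALID_URL.
    const linkHostname = new URL(link, request.loadedUrl).hostname;
    return linkHostname.endsWith(ourDomain);
});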

What was your original problem with the request queue? Maybe I can help.


#5

Ah great - thanks so much for looking.

So with my original problem, I expect it’s just my failure to understand Apify properly, but my understanding is that when you use the request list, it adds a JSON file containing the URL and parameters to /request_queues/pending. When you create a Cheerio crawler and pass it the request list and request queue, it adds the request list items to the queue, then runs all the queue items.

So I’ve just started everything again, reinstalling Apify into a new folder. I ran the final tutorial crawler, and it worked. I then tried to modify it to see if I could get it to do what I want. Here’s my code:

const Apify = require('apify');

Apify.main(async () => {
    const sources = [
        'https://en.wikipedia.org/wiki/Data_scraping'
    ];

    const requestList = await Apify.openRequestList('x', sources);
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.CheerioCrawler({
        maxRequestsPerCrawl: 5,
        requestList,
        requestQueue,
        handlePageFunction: async ({ $, request, response }) => {
            console.log(`Processing ${request.url}`);

            const newUrl = 'https://stackoverflow.com/';
            await requestQueue.addRequest({ url: newUrl });

            const data = {};
            data.url = request.url;
            await Apify.pushData(data);
        }
    });

    await crawler.run();
});

I’ve deleted all the files in the request_queues folder and the datasets folder. I think, on running the file, it should do this:

  • Go to Wikipedia
  • Add Stack Overflow to the queue
  • Save a JSON file with the Wikipedia URL into the datasets/default folder
  • Go to Stack Overflow
  • Save a JSON file with the Stack Overflow URL into the datasets/default folder
  • End

Yet when I run it, I get this:

Processing …apify.com/library?type=acts&category=ECOMMERCE
Processing …apify.com/library?type=acts&category=TRAVEL
Processing …apify.com/library?type=acts&category=ENTERTAINMENT
Processing …stackoverflow.com/
INFO: BasicCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.

My questions are:

  • Why are the original URLs from the tutorial code I ran still in the queue?
  • Why hasn’t it gone to the url in my source array?

I’m assuming I must have misunderstood the queue, and the folders are just a log of the actual queue held in memory somewhere. Maybe I need to re-initialize that somehow when running a crawler, so it accepts new source URLs and doesn’t go to the old queue URLs?

Thanks


#6

You got everything right. That’s a pretty good job actually! Just missed one minor thing I guess.

Since you’re not mentioning the /key-value-stores/default/ folder, I assume you have not deleted that one. The Apify.openRequestList() function, if you provide it with a name (the first parameter), actually creates two files in that folder, where it saves all the sources you give it and its internal state. This is probably the core of your troubles.

If the function finds those files, it does not use the new sources, but rather reuses the original ones. This is mainly to ensure consistency of sources across a process restart in situations where the sources are dynamically loaded from the web. It might sound like an edge case, but it’s actually the most common way of using RequestList.

You can probably fix the problem just by deleting the two x- files in your key-value-stores/default/ folder.
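If you don’t actually need the list to survive a process restart, another option (if I remember the signature correctly) is to pass null as the list name, so no state files are written and the new sources are used on every run:

// Assuming openRequestList() accepts null as the name, which skips persistence.
const requestList = await Apify.openRequestList(null, sources);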

A tip: If you run the actor with this command:

apify run -p

it will automatically delete all those lingering files, except INPUT.json and anything saved in named storages (i.e. in folders other than default).

Hope this helped!


#7

Ah, thank you so much - the key-value-stores folder was the missing bit of information. Apify seems to be a really great set of tools, but I’ve spent hours just trying to get over that first hurdle, so thanks.