How to do breadth-first search with an Apify crawler?


#1

I’m a beginner with Apify. I want to make an Apify crawler that gets all the data using a breadth-first search algorithm.

How can I do it? Are there many references available?

An example or sample code would be greatly appreciated. Thanks!


#2

Hello @cnhx27,

Apify uses breadth-first crawling by default, so any of the existing examples should be fine for you.

Be sure to check out our SDK page and the Getting Started tutorial, and let us know if you get stuck anywhere 🙂


#3

Sorry, I expressed myself badly.
I want to start the crawl of level n + 1 only once the previous level is finished, i.e. when all URLs in level n have been crawled.


#4

As I said, this is the default behavior when using a Request Queue.

Imagine a scenario where you have a Start URL and enqueue new requests on each page.

At the beginning, the queue would look like this:

[startUrl]

and after the first request is processed, it will look like this:

[requestFromStartUrl1, requestFromStartUrl2, requestFromStartUrl3, ...]

The crawler then visits requestFromStartUrl1 and enqueues more requests. The queue will look like this:

[requestFromStartUrl2, requestFromStartUrl3, ..., newRequest1, newRequest2, newRequest3, ...]

So, as you can see, the requestFromStartUrl# items, i.e. the level 1 items, will always be crawled before the items from deeper levels are crawled.
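For illustration, here is a minimal sketch of such a crawler, assuming the JavaScript SDK's CheerioCrawler and a placeholder start URL:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' }); // placeholder startUrl

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, $ }) => {
            // enqueueLinks appends new requests to the END of the queue (FIFO),
            // so all requests of the current level are processed before the
            // requests those pages enqueue, i.e. the crawl proceeds level by level.
            await Apify.utils.enqueueLinks({
                $,
                requestQueue,
                baseUrl: request.loadedUrl,
            });
        },
    });

    await crawler.run();
});
```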

There is a catch though: if some of the requests fail, they will be enqueued at the end of the queue for retries, which breaks the principle. But there’s a good reason to do it this way: it allows some time between similar requests, which helps avoid blocking.

Or am I getting this wrong?


#5

There is a catch though: if some of the requests fail, they will be enqueued at the end of the queue for retries, which breaks the principle

That’s one reason it’s not really breadth-first crawling.

Another reason is the time it takes to load pages: some level 1 pages may take a long time to load, so with concurrent requests it’s possible for level 2 URLs to be crawled before all the level 1 URLs are finished.

One solution would be for RequestQueue to be a priority queue (the priority being the level/depth, with 0 the highest priority, then 1, …, 9 lower priority, and so on)?

Note: I use the Apify SDK on my system.


#6

May I ask what the actual use case is behind needing to do it like this?

You can always limit the concurrency of the run to 1 (see the maxConcurrency option of crawlers), thus eliminating the chance of level 2 requests being crawled before level 1 finishes. You can also turn off request retries (see maxRequestRetries) and, in the handleFailedRequestFunction, do your enqueueing back into the queue manually using `requestQueue.addRequest(req, { forefront: true })`, which puts the failed request back at the front of the queue and thus does not break the breadth-first principle. See the sketch below.
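Roughly, the whole setup could look like this (again assuming a CheerioCrawler and a placeholder start URL; note that the queue deduplicates requests by their uniqueKey, so the sketch re-enqueues a copy with a fresh uniqueKey rather than the original request object):

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' }); // placeholder startUrl

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        maxConcurrency: 1,    // one request at a time: level n always finishes before level n+1 starts
        maxRequestRetries: 0, // no automatic retries; failures go straight to handleFailedRequestFunction
        handlePageFunction: async ({ request, $ }) => {
            await Apify.utils.enqueueLinks({ $, requestQueue, baseUrl: request.loadedUrl });
        },
        handleFailedRequestFunction: async ({ request }) => {
            // The queue deduplicates by uniqueKey and the failed request is already
            // marked as handled, so we re-enqueue a copy with a fresh uniqueKey.
            // forefront: true puts it at the FRONT of the queue, keeping it in its level.
            await requestQueue.addRequest(
                { url: request.url, uniqueKey: `${request.uniqueKey}-retry` },
                { forefront: true },
            );
        },
    });

    await crawler.run();
});
```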

If the request queue were a priority queue, it would work like you suggest, but unfortunately it isn’t. I guess we can’t do a “theoretically sound” breadth-first crawl, but we can get reasonably close.


#7

I don’t quite understand. In the Apify SDK docs:

If the function throws an exception, the crawler will try to re-crawl the request later, up to option.maxRequestRetries times. If all the retries fail, the crawler calls the function provided to the options.handleFailedRequestFunction parameter.

I understand that handleFailedRequestFunction is called once the page processing has failed maxRequestRetries + 1 times. Why then do `requestQueue.addRequest(req, { forefront: true })` in handleFailedRequestFunction, since the page is in error? By doing this, wouldn’t I create an infinite loop?
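To avoid that, I imagine I would need some cap of my own. A rough sketch of what I mean (assuming requestQueue is in scope as above; manualRetries and MAX_MANUAL_RETRIES are names I made up, not SDK options):

```js
const MAX_MANUAL_RETRIES = 3; // my own cap, not an SDK option

const handleFailedRequestFunction = async ({ request }) => {
    const retries = (request.userData.manualRetries || 0) + 1;
    if (retries > MAX_MANUAL_RETRIES) return; // give up instead of looping forever

    // Fresh uniqueKey so the queue accepts the copy; forefront keeps it in its level.
    await requestQueue.addRequest(
        {
            url: request.url,
            uniqueKey: `${request.uniqueKey}-retry-${retries}`,
            userData: { ...request.userData, manualRetries: retries },
        },
        { forefront: true },
    );
};
```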