External list of URLs still needs a Start URL

#1

Hi all,

In all the examples, the Start URL parameter appears to be mandatory.
When using an external list of URLs (for example, from Google Sheets), this means the Start URL has to be an arbitrary URL, just to get the logic right (“example.com” in the example pasted below).

Even though the code calls context.skipOutput for example.com, the log file still shows that page being crawled first. Isn’t that a wasted crawl, especially since the number of pages we can crawl per month is limited?

#2

Hi,

You are right: the start page is still crawled when you call context.skipOutput. That method only leaves the page's output out of the results; the page itself is still crawled.

We don’t have a better solution in the legacy crawler. If you want, you can use an Apify actor instead, where you can code a better way of handling URLs from the Google spreadsheet.
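
As a rough illustration of the "handle the URLs yourself" part, here is a minimal sketch in Python. It assumes the spreadsheet has already been downloaded as CSV text with the URLs in one column; the function name urls_from_csv and the column layout are hypothetical, and fetching the sheet and handing the URLs to an actor's request queue are left out because they depend on the SDK you use. The point is that the URL list itself becomes the starting set, so no placeholder page needs to be crawled.

```python
import csv
import io
from urllib.parse import urlparse


def urls_from_csv(csv_text: str, column: int = 0) -> list[str]:
    """Return the unique http(s) URLs found in one column of CSV text.

    Hypothetical helper: `column` is the zero-based index of the column
    that is assumed to hold the URLs.
    """
    seen = set()
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= column:
            continue  # blank line or row too short
        candidate = row[column].strip()
        parsed = urlparse(candidate)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # header cell, empty cell, or not a web URL
        if candidate not in seen:
            seen.add(candidate)  # drop duplicates so no page is crawled twice
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    sample = "url\nhttps://a.example/1\n\nhttps://a.example/2\nhttps://a.example/1\n"
    for url in urls_from_csv(sample):
        print(url)
```

Filtering and de-duplicating before anything is queued matters here because every queued page counts against the monthly crawl limit.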
