External list of URLs still needs a Start URL


Hi all,

In all the examples, the Start URL parameter appears to be mandatory.
With an external list of URLs (for example, from Google Sheets), this means the Start URL has to be an arbitrary placeholder URL just to make the logic work (“example.com” in the example pasted below).

Even though the code calls context.skipOutput for example.com, the log file still shows that page being crawled first. Isn’t that a wasted crawl, especially since the number of pages we can crawl per month is limited?



You are right: the start page is still crawled when you call context.skipOutput. That method only leaves the page’s output out of the results; the page itself is still crawled.

We don’t have a better solution in the legacy crawler. If you want, you can use an Apify actor instead, where you can code a better way of handling URLs from the Google spreadsheet.
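
To illustrate the idea of handling the spreadsheet URLs yourself, here is a rough sketch in Python. It is not Apify's API. It only shows the step that removes the need for a placeholder Start URL: turning the sheet's rows into the list of pages to crawl. It assumes the sheet has been exported as CSV with the URLs in the first column; the function name urls_from_sheet_csv and the sample data are made up for this example, and downloading the sheet and running the crawl are left out.

```python
import csv
import io
from urllib.parse import urlparse


def urls_from_sheet_csv(csv_text: str) -> list[str]:
    """Return unique absolute http(s) URLs from the first CSV column, in order."""
    seen = set()
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # blank line in the export
        candidate = row[0].strip()
        parsed = urlparse(candidate)
        # Skip header cells and anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls


# Hypothetical sheet export: a header row, two pages, one duplicate.
sample = "url\nhttps://example.org/a\nhttps://example.org/b\nhttps://example.org/a\n"
start_urls = urls_from_sheet_csv(sample)
print(start_urls)
```

Every URL in the resulting list is a page you actually want, so nothing is crawled just to bootstrap the run, and duplicates in the sheet do not count against the monthly page limit twice.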
