Can I use my own list of URLs instead of crawling?


There are several ways to use your own list of URLs:

  • set them as Start URLs on the crawler configuration page - suitable only for a small number of URLs, because they are all shown in the GUI

  • start the crawler through the API and override the list of Start URLs for that run - there’s a 9 MB limit on POST data, so this can handle about 50k URLs

  • fetch the list of URLs from an external source using a REST API - here’s a tutorial on how to fetch URLs from a Google Spreadsheet (you can use more sheets and divide the enqueuing into more page functions, so you can fetch millions of URLs this way)

With all of the options above, it’s good to leave Clickable elements empty (in advanced settings) so that the crawler does not follow any links.
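
For the API option, a long list may need to be split so that each POST body stays under the 9 MB limit. Below is a minimal Python sketch of that batching. The field name `startUrls`, the `{"key": "", "value": url}` item shape and the exact byte count used for "9 MB" are assumptions made for illustration, so check them against the API documentation for your crawler. Sending the request is left out on purpose.

```python
import json
from typing import Iterable, Iterator, List

# 9 MB limit mentioned above; treating it as 9 * 1024 * 1024 bytes is an assumption.
MAX_POST_BYTES = 9 * 1024 * 1024


def build_payload(urls: List[str]) -> bytes:
    """Serialize URLs as a JSON body (field name and item shape are assumptions)."""
    body = {"startUrls": [{"key": "", "value": u} for u in urls]}
    return json.dumps(body).encode("utf-8")


def split_into_batches(urls: Iterable[str],
                       max_bytes: int = MAX_POST_BYTES) -> Iterator[List[str]]:
    """Yield lists of URLs whose serialized payload fits into max_bytes."""
    base = len(build_payload([]))  # size of the empty wrapper object
    batch: List[str] = []
    size = base
    for url in urls:
        item = len(json.dumps({"key": "", "value": url}).encode("utf-8"))
        if base + item > max_bytes:
            raise ValueError("a single URL does not fit into the limit")
        # json.dumps separates list items with ", " (2 bytes).
        extra = item + (2 if batch else 0)
        if size + extra > max_bytes:
            yield batch
            batch, size = [], base
            extra = item
        batch.append(url)
        size += extra
    if batch:
        yield batch
```

Each yielded batch can then be serialized with `build_payload` and posted as one run, which means a list larger than the limit results in several runs rather than one.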


Just to add my two cents to this question: follow this link to see a more complete explanation of using Google Spreadsheets as a source of URLs.
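
As a rough illustration of the spreadsheet approach, here is a small Python sketch that turns CSV text (for example, a sheet exported as CSV) into a clean list of URLs. It is not the code from the linked explanation. It assumes the URLs sit in the first column, and the function name `urls_from_csv` is hypothetical. Downloading the sheet is left out.

```python
import csv
import io
from typing import List
from urllib.parse import urlparse


def urls_from_csv(csv_text: str, column: int = 0) -> List[str]:
    """Return unique http(s) URLs from one CSV column, in their original order."""
    urls: List[str] = []
    seen = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= column:
            continue  # blank or short row
        value = row[column].strip()
        parsed = urlparse(value)
        # Skip header cells and anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if value not in seen:
            seen.add(value)
            urls.append(value)
    return urls
```

Filtering and de-duplicating before enqueuing keeps header rows and repeated entries from turning into failed or wasted requests.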


Option 2 is good, and I can see an example of that here.
Is there a C# example?


A recent update of the actor platform adds this functionality. The UI input for “request list” now lets the user upload a file containing a list of URLs or link to a web-hosted file with a URL list:

Just try out one of our recently released crawlers implemented on top of the actor platform: