How to scrape a pre-defined list of pages?


#1

So far, I have been traversing start/list pages first, enqueuing a list of detail pages (e.g. news stories) to extract content from.

What if I already know the list of story pages I want to extract from, and there is no listing page for them?

How can I pre-populate something so that an Apify crawler scrapes only those pages?

Is there any example of this?

Thanks.


#2

Hi @robertandrews,

there are a few options for how you can use your list of URLs for crawling.

  1. You can use the API and pass all the URLs in the startUrls attribute when you start your crawler. The number of URLs is limited, but you can pass thousands of them this way (see the first sketch after this list).

  2. You can store the URLs in an external source, for example a Google Sheet. There is an article about how you can use Google Sheets for this.

  3. You can use an Actor; there the Apify SDK's RequestList class handles exactly this use case (see the second sketch below).
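For option 1, here is a minimal sketch of starting a run over the API with the URLs in `startUrls`. The endpoint path, the `{ url }` object shape, and the `ACTOR_ID`/`API_TOKEN` placeholders are my assumptions, so check the API reference for your crawler or actor for the exact call:

```javascript
// Minimal sketch, assuming the actor "run" endpoint and a startUrls
// input field; ACTOR_ID and API_TOKEN are hypothetical placeholders.
const urls = [
    'https://example.com/story-1',
    'https://example.com/story-2',
];

async function startRun() {
    const response = await fetch(
        'https://api.apify.com/v2/acts/ACTOR_ID/runs?token=API_TOKEN',
        {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            // Each URL in the list becomes one start request.
            body: JSON.stringify({ startUrls: urls.map((url) => ({ url })) }),
        },
    );
    console.log(await response.json());
}

startRun();
```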
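And for option 3, a minimal sketch of an Actor that crawls only a fixed list of URLs via a request list (here opened with `Apify.openRequestList` and paired with a `CheerioCrawler`); the URLs and the extracted fields are placeholders:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Placeholder detail-page URLs; in practice, load your own list here.
    const sources = [
        { url: 'https://example.com/story-1' },
        { url: 'https://example.com/story-2' },
    ];

    // The request list persists its state, so a resumed run
    // won't re-scrape pages it has already handled.
    const requestList = await Apify.openRequestList('story-urls', sources);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ request, $ }) => {
            // Extract whatever you need from each detail page.
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),
            });
        },
    });

    await crawler.run();
});
```

Since the list is fixed, there is no need for a request queue or pseudo-URLs; the crawler simply works through the list and stops.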

Best,
Jakub D.


#3

Thanks. Looks like I'll try the Google Sheets to Apify approach.

So there’s no option to just paste multiple end-page URLs into “Start URLs”? Or to add multiple “Start URLs”? It doesn’t work that simply?

Thanks.


#4

If you don’t have too many URLs, you can pass them into Start URLs with a proper label. That should also work :slight_smile:.
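A sketch of what that could look like, assuming the actor-style Start URLs format where the label travels in `userData` (the exact shape in your crawler's UI may differ):

```json
[
    { "url": "https://example.com/story-1", "userData": { "label": "DETAIL" } },
    { "url": "https://example.com/story-2", "userData": { "label": "DETAIL" } }
]
```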


#5

The Google Sheets API method was working successfully all day today, but it currently hangs after example.com, before the URLs are collected from the sheet.

Log: “Page function will asynchronously finish later (if the crawler hangs here, make sure context.finish() is really called in pageFunction!).”

Only one URL is in the queue: example.com.
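For anyone hitting the same message: it appears when the page function declares it will finish asynchronously but never calls `context.finish()`. Below is a rough sketch of how that pattern fits together in a Sheets-reading page function. The `context` methods are inferred from the log message and old crawler docs, and `SHEET_ID` plus the CSV-export URL are assumptions on my part:

```javascript
// Rough sketch of an asynchronous pageFunction for the legacy crawler.
// Assumes context.willFinishLater(), context.enqueuePage(), context.finish()
// and context.jQuery exist as in the old docs; SHEET_ID is a placeholder.
function pageFunction(context) {
    var $ = context.jQuery;

    if (context.request.label === 'START') {
        // Tell the crawler the result will arrive asynchronously.
        context.willFinishLater();

        $.get('https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv')
            .done(function (csv) {
                csv.split('\n').forEach(function (line) {
                    var url = line.trim();
                    // Enqueue every non-empty row as a detail page.
                    if (url) context.enqueuePage({ url: url, label: 'DETAIL' });
                });
                // Without this call the crawler hangs, exactly as in the log.
                context.finish();
            })
            .fail(function () {
                context.finish();
            });
    } else {
        // Detail pages: return the extracted data synchronously.
        return { url: context.request.url, title: $('title').text() };
    }
}
```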


#6

Never mind, it's just because a typo crept into my Sheets-handling code.
Your Sheets guide page is useful!


#7

A recent update of the actor platform adds this functionality. The “request list” UI input now lets the user upload a file containing a list of URLs, or link a web-hosted file with the URL list.
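If the expected file format is as I'd assume (plain text, one URL per line — worth verifying in the docs), the uploaded or linked file would simply look like:

```
https://example.com/story-1
https://example.com/story-2
https://example.com/story-3
```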


Just try out one of our recently released crawlers implemented on top of the actor platform.