Making RequestList NOT persist state

The URLs I feed to the RequestList will not change. Each time I run the scraper, I want them to be re-scraped. How do I achieve that?

I have tried giving the RequestList a new unique name on each run using a timestamp, but it still somehow recalls that it has already visited those pages.

Basically, imagine scraping Hacker News for new story URLs. To my RequestList I would provide as sources: [https://news.ycombinator.com/, https://news.ycombinator.com/news?p=2, …news?p=3, etc.]. Those index pages are always the same.

From those sources, I grab all the URLs to stories and feed them to the RequestQueue. I am persisting the RequestQueue because I don’t want to grab the same story twice. That part works.

How do I achieve this?

Hi @mrwww,

are you running the scraper locally or on the Apify platform?

If locally, make sure that your apify_storage folder is clear of any previous runs’ metadata. You can easily achieve that by running the scraper with the Apify CLI using the following command:

apify run -p

-p stands for purge and will clear your apify_storage folder, except for INPUT.json.

When running on the platform, it should not be an issue, because you’re provided with a clean environment every time.

Let me know if this helps and if not, please provide more info about your environment.

Thanks for your quick reply 🙂

I am developing/running locally, but it’s intended for the platform. Using -p does not help; it still says all requests from the RL/RQ have been processed.

(I can confirm that -p does clear the key_value_stores, but it seems state is still persisted somewhere somehow?)

That’s weird. Can you share your code, or is it private? Would you be able to send me a link to a Gist or some other means of code sharing to ondra@apify.com?

I think I’ve found the issue. The URLs in my RL are being added to my RQ, even though I am making no call to do so?

Do they somehow get merged in the new Apify.PuppeteerCrawler constructor?

Yes, this is how it works when you’re using both the RequestList and the RequestQueue at the same time. Just before processing a request from the list, the crawler enqueues it into the queue. You can read about it in the docs https://sdk.apify.com/docs/api/puppeteercrawler (4th paragraph).
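For illustration, here is a minimal sketch (not your code, just the default behavior): with both storages passed to the crawler, every request from the list ends up in the queue as well, and a named queue remembers it as handled across runs.

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://news.ycombinator.com/' }],
    });
    await requestList.initialize();

    const requestQueue = await Apify.openRequestQueue('example-queue');

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        handlePageFunction: async ({ request }) => {
            // By the time this runs, the crawler has already enqueued `request`
            // into requestQueue, so the named (persisted) queue will remember
            // it as handled on the next run.
            console.log(`Processing ${request.url}`);
        },
    });

    await crawler.run();
});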

Yet still, the -p flag clears the request_queues too, so adding the links to the queue should not prevent them from being crawled, unless they were already crawled once.

Right!

So I should use two separate crawlers for this use case?

However, -p does not purge my RequestQueue folder (and I do want to persist it).

const Apify = require('apify');

Apify.main(async () => {
    const dataset = await Apify.openDataset('blocket-sled-ads', { forceCloud: true });

    // sources === array of { method, url } objects, defined elsewhere
    const requestList = new Apify.RequestList({ sources });
    await requestList.initialize();
    console.log(`RequestList is empty: ${await requestList.isEmpty()}`);

    const requestQueue = await Apify.openRequestQueue('blocket-ad-requests');

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        maxRequestsPerCrawl,      // defined elsewhere
        launchPuppeteerOptions,   // defined elsewhere

        handlePageFunction: async ({ request, page, response }) => {
            if (pageIWantToScrape) { // placeholder check
                // eval/code to grab the data
                await dataset.pushData(data);
            }

            if (pageWithLinksToPagesIWantToScrape) { // placeholder check
                // grab all the links from the page
                const links = await page.$eval(/* ... */);
                for (const link of links) await requestQueue.addRequest({ url: link });
            }
        },
    });

    await crawler.run();
});

OK, I see now. Whenever you create a named storage, it survives -p; that’s why your request queues are not clearing. The same applies to the platform. The trouble comes from the fact that a queue only keeps track of requests in pending and handled states. Once all the requests are handled, the queue reports that it has no more requests to process, and since the crawler is wired to finish when there are no more requests to process, the crawl ends.

We have a feature in our backlog called Persistent queue, which would enable building and recrawling existing queues, but it’s not yet ready for deployment.

Now, that does not mean there’s no way to do it today; it just needs some workarounds. Locally, you can simply rename your handled folder to pending before starting the crawl and the queue will have no idea that it already crawled those pages. Or you could dump your handled queue into a file and later use that file as a RequestList source.
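Something along these lines should do the renaming locally. It assumes the default layout of apify_storage/request_queues/<queue-name>/ with handled and pending subfolders, so double-check the paths on your machine:

const fs = require('fs');
const path = require('path');

// Assumed default layout: apify_storage/request_queues/<queue-name>/{pending,handled}
const queueDir = path.join('apify_storage', 'request_queues', 'blocket-ad-requests');
const handledDir = path.join(queueDir, 'handled');
const pendingDir = path.join(queueDir, 'pending');

// Move all handled requests back to pending so the queue "forgets" it processed them.
if (fs.existsSync(handledDir)) {
    fs.mkdirSync(pendingDir, { recursive: true });
    for (const file of fs.readdirSync(handledDir)) {
        fs.renameSync(path.join(handledDir, file), path.join(pendingDir, file));
    }
}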

On the platform, I’d suggest outputting all crawled URLs into the dataset and then using those as a RequestList source for the next crawl. You can prefix the URL with a pound sign #, like this:

await Apify.pushData({
    '#url': 'https://example.com',
    myData: 'foo',
    myOtherData: 'bar',
});

Fields prefixed with # are hidden in the dataset unless you explicitly ask to see them, so the URL will not pollute your data.

Would one of these workarounds work for you? Personally, I’d go for the dataset option, because it’s the same both locally and on the platform, and you can load the URLs from the dataset into a RequestList with one line of code. Instead of { url: 'https://example.com' } you would use { requestsFromUrl: 'DATASET_API_ENDPOINT' } and the URLs will be automatically parsed from the dataset.

See https://apify.com/docs/api/v2#/reference/datasets/item-collection/get-items, especially the fields option, which allows you to retrieve only the #url field from the dataset.
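Roughly like this (DATASET_API_ENDPOINT is a placeholder for your dataset’s get-items URL from the docs above, ideally with the fields option limiting the response to #url):

const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [
            // The SDK downloads this URL and parses the request URLs out of the response.
            { requestsFromUrl: 'DATASET_API_ENDPOINT' },
        ],
    });
    await requestList.initialize();
    // ...pass requestList to the crawler as usual
});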

HUGE thanks for your awesome support!

All confusion cleared up!

I just separated the crawlers and am now getting the desired behavior: first a crawler runs over the index pages and adds links to a named RQ, then a second crawler works on that RQ.
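The setup now looks roughly like this (selectors and extraction logic are simplified placeholders, not my actual code):

const Apify = require('apify');

Apify.main(async () => {
    const sources = []; // index page URLs, as in the snippet above

    // Crawler 1: visit the index pages from the RequestList and push ad links
    // into a named (persisted) RequestQueue.
    const requestList = new Apify.RequestList({ sources });
    await requestList.initialize();
    const requestQueue = await Apify.openRequestQueue('blocket-ad-requests');

    const indexCrawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page }) => {
            // Placeholder selector for the links to detail pages.
            const links = await page.$$eval('a.ad-link', (els) => els.map((el) => el.href));
            for (const url of links) await requestQueue.addRequest({ url });
        },
    });
    await indexCrawler.run();

    // Crawler 2: work only on the queue; requests handled in previous runs
    // stay handled, so each ad page is scraped at most once.
    const dataset = await Apify.openDataset('blocket-sled-ads');
    const detailCrawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            // Placeholder extraction logic.
            const data = await page.$eval('h1', (el) => ({ title: el.textContent }));
            await dataset.pushData({ '#url': request.url, ...data });
        },
    });
    await detailCrawler.run();
});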

You’re welcome and I’m glad you got it working!