How to crawl only new items?


#1

Let’s say I want to crawl a news website on schedule, every morning, say 8am, and get just the latest stories…

That should probably mean stories published within the last 24 hours, which may span both last night’s date and this morning’s date…

Let’s assume all story pages will have date/datetime stamps, even if the index pages linking to them do not.

How would I tackle this with a Crawler… ?

Is there any way, perhaps, to…

  1. get only stories that were published within the last 24 hours?

i.e. build some JavaScript logic to make saving a story conditional on its datestamp?

  2. or get only stories that I haven’t already crawled?

i.e. only queue a page for crawling if it is not already in the results of the crawler’s last crawl?

Kind of interested in both methods, actually - I could imagine needing to do this in the absence of datetime stamps.

Can anyone share anything already-built that may do something similar?

Thanks very much.


#2

Hi Robert,

With our Crawler product:

(1) If dates are located at index pages then you can parse them and enqueue and scrape only the stories published in the last 24h. If not then you have to enqueue and visit all the pages.
(2) This is not possible with Crawler.

I would recommend using our Actor product https://www.apify.com/docs/actor . In Actor you can:

a) Either use the key-value store https://www.apify.com/docs/storage#key-value-store to persist your position between runs,
b) or use RequestQueue https://www.apify.com/docs/storage#queue , which has 30-day data retention (paid accounts). With this you would have to crawl the entire site only once a month.
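
To illustrate option a), here is a minimal sketch of the "only stories I haven't already crawled" logic. It is plain JavaScript and deliberately leaves out the storage calls: the function name `selectUnseen` and the idea of keeping a flat list of seen URLs are my own assumptions, not part of any Apify API. You would load the list from the key-value store at the start of a run and save the returned `seen` list back at the end.

```javascript
// Hypothetical helper (not an Apify API): given the URLs stored by earlier
// runs and the URLs found on today's index page, return the new ones plus
// the updated list to persist for the next run.
function selectUnseen(previouslySeen, urls) {
    // Treat a missing list (very first run) as empty.
    const seen = new Set(previouslySeen || []);
    const fresh = [];
    for (const url of urls) {
        if (!seen.has(url)) {
            seen.add(url);   // remember it so duplicates on the page are skipped too
            fresh.push(url); // only these need to be crawled
        }
    }
    return { fresh, seen: Array.from(seen) };
}

// Example: the first run sees two stories, the second run only the new one.
const firstRun = selectUnseen(null, ['/story/a', '/story/b']);
const secondRun = selectUnseen(firstRun.seen, ['/story/b', '/story/c']);
console.log(secondRun.fresh);
```

This works without datetime stamps, at the cost of the stored list growing over time, so you may want to trim old entries now and then.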

Let me know if you have any questions.
Marek


#3

I fear that Actor may be beyond my skillset.
I may have to just muddle through.
e.g. waste some crawls on excess results, or, in some cases…

“(1) If dates are located at index pages then you can parse them and enqueue and scrape only the stories published in the last 24h.”

How would I do this, please?

Another idea - can I set a max number of story pages to crawl? i.e. even if the index page links to 50 stories, say “Hey, Crawler, only crawl the first 20!” Because I can infer the publisher would not publish more than 20 per day.

Thanks.


#4

Let’s say you start your crawler once a day. You open the index page listing the stories. Then you can check the date of every story and enqueue its detail page using context.enqueuePage() only if the story is from yesterday. Don’t use PseudoURLs in this case; use context.enqueuePage() instead.
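
A rough sketch of that check, to be adapted to your site. The names `isWithinLastDay` and `enqueueRecentStories` are made up for illustration, and the `stories` array of `{ url, published }` objects is assumed to have been scraped from the index page already (how you extract it depends on the page's markup). I am also assuming that context.enqueuePage() accepts an object with a `url` field, and I use a 24-hour window rather than "yesterday's date".

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when `published` parses to a time within the 24 hours before `now`.
function isWithinLastDay(published, now) {
    const ts = new Date(published).getTime();
    if (Number.isNaN(ts)) return false; // unparseable date: skip the story
    const age = now.getTime() - ts;
    return age >= 0 && age <= DAY_MS;
}

// Hypothetical helper: enqueue only the recent stories.
// `context` is the object passed to your page function;
// `stories` is an array of { url, published } taken from the index page.
function enqueueRecentStories(context, stories, now = new Date()) {
    const recent = stories.filter((s) => isWithinLastDay(s.published, now));
    // Assumption: enqueuePage() takes an object with a `url` field.
    recent.forEach((s) => context.enqueuePage({ url: s.url }));
    return recent.length; // number of detail pages enqueued
}
```

Inside the page function you would call `enqueueRecentStories(context, stories)` when the current page is the index page. Dates without a time zone are parsed in the crawler's local time, so allow for some slack around the 24-hour boundary.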


In the same way, you can limit the number of story pages to be crawled. Simply enqueue only the number of story pages you want.
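
For example, something along these lines would cap the crawl at the first 20 links. Again a sketch only: `enqueueFirstN` is a made-up name, `urls` is assumed to be the list of story links in the order they appear on the index page (newest first), and context.enqueuePage() is assumed to accept an object with a `url` field.

```javascript
// Hypothetical helper: enqueue at most `maxStories` story pages.
// Assumes `urls` is ordered newest first, as on a typical index page.
function enqueueFirstN(context, urls, maxStories) {
    // Math.max guards against a negative limit, which slice() would
    // otherwise interpret as "all but the last N".
    const selected = urls.slice(0, Math.max(0, maxStories));
    selected.forEach((url) => context.enqueuePage({ url }));
    return selected;
}
```

Calling `enqueueFirstN(context, urls, 20)` on the index page would then queue only the first 20 stories, even if 50 are linked.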


#5

You can also use the Max result records option in the advanced settings of your crawler to limit the number of result items.