Let’s say I want to crawl a news website on a schedule, every morning at 8am, say, and get just the latest stories…
That should probably mean stories published within the last 24 hours, which may span both last night’s date and this morning’s date…
Let’s assume all story pages will have date/datetime stamps, even if the index pages linking to them do not.
How would I tackle this with a Crawler…?
Is there any way, perhaps, to…
- get only stories that were published within the last 24 hours?
- or get only stories that I haven’t already crawled?
i.e. only queue a page for crawling if it is not already in the results of the crawler’s last crawl?
Kind of interested in both methods, actually - I could imagine needing to do this in the absence of datetime stamps.
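
To make the two ideas concrete, here is a rough sketch of the checks I have in mind, in plain Python rather than anything Crawler-specific. The names (`is_recent`, `select_new`) and the example URLs are made up for illustration, and I'm assuming the story's timestamp has already been parsed into a timezone-aware datetime and that the set of previously crawled URLs is kept somewhere between runs.

```python
from datetime import datetime, timedelta, timezone


def is_recent(published, now, window=timedelta(hours=24)):
    """Return True if `published` falls inside the `window` ending at `now`.

    Both datetimes are assumed to be timezone-aware, so a story from
    last night and one from this morning are compared correctly.
    """
    return now - window <= published <= now


def select_new(urls, seen):
    """Return the URLs not yet in `seen`, in order, and record them in `seen`.

    `seen` is a hypothetical set of URLs from earlier crawls; how it is
    stored between runs (file, database, ...) is left open here.
    """
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    # Method 1: keep a story only if it was published in the last 24 hours.
    now = datetime(2024, 1, 2, 8, 0, tzinfo=timezone.utc)
    last_night = datetime(2024, 1, 1, 21, 30, tzinfo=timezone.utc)
    print(is_recent(last_night, now))

    # Method 2: queue only the links that were not crawled before.
    seen = {"https://example.com/story-a"}
    links = ["https://example.com/story-a", "https://example.com/story-b"]
    print(select_new(links, seen))
```

The second check doesn't rely on datetime stamps at all, which is why I'm interested in it as a fallback.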
Can anyone share anything already-built that may do something similar?
Thanks very much.