By default, Apify crawls a page, and any content that isn’t added to the results is lost. To try again, I have to re-crawl the pages. Continually re-crawling pages seems to make the process unnecessarily slow and puts pressure on the proxies. Instead, I would like to do the crawl in steps:
- determine when the page was last crawled.
- if the page hasn’t been crawled recently (e.g. at most once per week), then re-crawl it
- if re-crawled, store ALL the content in the results
- now parse all the stored content from this crawl and past crawls to extract the required data into the results table.
In other words, I want the crawl and the parse to be two totally separate processes. Any thoughts on whether this is possible, and if so, how to do it?
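
To make the steps above concrete, here is a rough sketch of the flow I have in mind. It is plain Python and does not use any Apify API. Everything in it is a hypothetical illustration: the JSON file used as the raw-page store, the names `crawl_step` and `parse_step`, the `fetch` and `extract` callbacks, and the one-week `MAX_AGE_SECONDS` limit.

```python
import json
import re
import time
from pathlib import Path

# Assumption: a page is "fresh" if it was fetched less than one week ago.
MAX_AGE_SECONDS = 7 * 24 * 3600


def load_store(path):
    """Load the raw-page store: {url: [{"fetched_at": float, "html": str}, ...]}."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_store(path, store):
    """Write the raw-page store back to disk as JSON."""
    Path(path).write_text(json.dumps(store), encoding="utf-8")


def crawl_step(store, urls, fetch, now=None):
    """Fetch only URLs whose newest snapshot is missing or too old.

    The full page content is kept, so parsing can be re-run later
    without hitting the site (or the proxies) again.
    Returns the list of URLs that were actually fetched.
    """
    now = time.time() if now is None else now
    fetched = []
    for url in urls:
        snapshots = store.setdefault(url, [])
        if snapshots and now - snapshots[-1]["fetched_at"] < MAX_AGE_SECONDS:
            continue  # crawled recently, skip
        snapshots.append({"fetched_at": now, "html": fetch(url)})
        fetched.append(url)
    return fetched


def parse_step(store, extract):
    """Run the extractor over every stored snapshot, current and past."""
    rows = []
    for url, snapshots in store.items():
        for snap in snapshots:
            row = {"url": url, "fetched_at": snap["fetched_at"]}
            row.update(extract(snap["html"]))
            rows.append(row)
    return rows


def extract_title(html):
    """Example extractor: pull the <title> text out of the raw HTML."""
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return {"title": match.group(1).strip() if match else None}
```

The point is that `parse_step` never touches the network: if the extraction logic changes, only `parse_step` needs to be re-run against the stored content, and `crawl_step` only hits the site for pages that are older than the limit.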