Multi-phase crawl


#1

By default, Apify crawls a page, and any content that isn’t added to the results is lost. To try again, I have to re-crawl the pages. Continually re-crawling pages seems to make the process unnecessarily slow and puts pressure on the proxies. Instead, I would like to do the crawl in steps:

  1. Determine when the page was last crawled.
  2. If the page hasn’t been crawled recently (e.g. at most once per week), re-crawl it.
  3. If re-crawled, store ALL the content in the results.
  4. Parse all the stored content from this crawl and past crawls to extract the required data into the results table.

In other words, I want the crawl and the parse to be two totally separate processes. Any thoughts on whether this is possible, and how to do it?


#2

Hi @jstanley,

Yes, it is possible. You can use Apify actors to separate the crawling process from the parsing process. You can easily call another actor or crawler from an actor using the Apify SDK.
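
To make the split concrete, here is a rough sketch of the two phases in plain Python. It deliberately does not use any Apify SDK calls: `RawPageStore`, `crawl_phase`, `parse_phase` and `extract_title` are hypothetical names made up for illustration, the store is in-memory only, and the `fetch` function is whatever you use to download a page. On Apify, the store would presumably map to one of the platform storages and each phase to its own actor, but that mapping is an assumption here.

```python
import re
import time
from typing import Callable, Dict, List, Optional

WEEK_SECONDS = 7 * 24 * 60 * 60  # "at most once per week" from the question


class RawPageStore:
    """Hypothetical in-memory stand-in for a persistent store of raw pages."""

    def __init__(self) -> None:
        self._pages: Dict[str, List[dict]] = {}

    def last_crawled(self, url: str) -> Optional[float]:
        # Step 1: when was this page last crawled (None if never)?
        snapshots = self._pages.get(url)
        return snapshots[-1]["fetched_at"] if snapshots else None

    def save(self, url: str, html: str, fetched_at: float) -> None:
        # Step 3: keep the full raw content, not just the extracted fields.
        snapshot = {"url": url, "html": html, "fetched_at": fetched_at}
        self._pages.setdefault(url, []).append(snapshot)

    def all_snapshots(self) -> List[dict]:
        return [s for snaps in self._pages.values() for s in snaps]


def crawl_phase(
    urls: List[str],
    store: RawPageStore,
    fetch: Callable[[str], str],
    now: Optional[float] = None,
    max_age: float = WEEK_SECONDS,
) -> List[str]:
    """Phase 1: fetch only pages that are missing or stale; return what was fetched."""
    now = time.time() if now is None else now
    fetched = []
    for url in urls:
        last = store.last_crawled(url)
        if last is not None and now - last < max_age:
            continue  # Step 2: crawled recently, so skip the network request.
        store.save(url, fetch(url), now)
        fetched.append(url)
    return fetched


def parse_phase(store: RawPageStore, extract: Callable[[str], dict]) -> List[dict]:
    """Phase 2 (step 4): build result rows from stored content, with no network access."""
    return [
        {"url": s["url"], "fetched_at": s["fetched_at"], **extract(s["html"])}
        for s in store.all_snapshots()
    ]


def extract_title(html: str) -> dict:
    # Example extractor; swap in a different one and re-run parse_phase only.
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip() if match else None}
```

The point of the design is that changing `extract_title` (or fixing a bug in it) only requires re-running `parse_phase` over the stored snapshots, so the proxies are hit only when a page is actually stale.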