I am quite new to Apify, so I am still confused about the differences between crawlers, actors, and storage.
I think I have grasped the main differences, but I would like to confirm them:
Defining a crawler is a convenient way to scrape web data:
- When writing a crawler, I only have to implement the page_function to extract the data from the website.
- The crawler supports me with helper functions for configuration (handing cookies to the crawler, defining pseudo-URLs, etc.).
- The crawler fills the request queue automatically based on the given start URL and the pseudo-URLs.
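To make the first point concrete, here is a minimal sketch of what such a page_function might look like. This assumes the crawler calls the function with a context object carrying the current request; the mock context at the bottom is purely illustrative, not part of any real crawler run.

```javascript
// Sketch of a page_function: the crawler invokes it once per page and
// stores whatever object it returns as the scraped result.
function pageFunction(context) {
    // context.request holds the URL currently being processed;
    // in a real crawler, DOM access (e.g. jQuery) would also be injected.
    return {
        url: context.request.url,
        scrapedAt: new Date().toISOString(),
    };
}

// Hypothetical mock context, just to show the shape of the result:
const result = pageFunction({ request: { url: 'https://example.com' } });
console.log(result.url); // 'https://example.com'
```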
An actor is a kind of script/program that runs on the Apify platform.
- I can do anything with an actor that I could do in a plain Node.js program.
- I can implement any of the crawlers (Basic, Puppeteer, etc.), but I have to do more of the necessary steps by hand (e.g. there are no pseudo-URLs; I would have to get all href attributes, match them against a regex, and manually fill the request queue).
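The manual steps described in the last bullet can be sketched in plain JavaScript. This is not the Apify SDK API, just the underlying idea: extract hrefs, keep the ones matching a pseudo-URL-style regex, and enqueue only unseen URLs (the names `seen` and `queue` are illustrative stand-ins for the real request queue).

```javascript
// Extract hrefs from HTML, match them against a regex (what pseudo-URLs
// would otherwise do), and add only previously unseen URLs to the queue.
function extractMatchingLinks(html, pattern, seen, queue) {
    const hrefRegex = /href="([^"]+)"/g;
    let match;
    while ((match = hrefRegex.exec(html)) !== null) {
        const url = match[1];
        if (pattern.test(url) && !seen.has(url)) {
            seen.add(url);   // deduplicate, like the request queue does
            queue.push(url); // stand-in for requestQueue.addRequest()
        }
    }
}

const seen = new Set();
const queue = [];
const html = '<a href="https://example.com/item/1">a</a>' +
             '<a href="https://example.com/about">b</a>' +
             '<a href="https://example.com/item/1">dup</a>';
extractMatchingLinks(html, /\/item\/\d+/, seen, queue);
console.log(queue); // [ 'https://example.com/item/1' ]
```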
Some points that are still unclear to me:
- Is it possible to call a crawler which I have defined on apify-crawlers from within an actor?
- How do I use the request queue to prevent my crawler from revisiting pages across different runs?
Thanks for helping me improve my understanding!