Differences between Actors and Crawlers


#1

I am quite new to Apify, so I am still confused about the differences between crawlers, actors and storage.

I think I have grasped the main differences, but I would like to clarify them:

  • Defining a crawler is a convenient way to scrape web data:

    • When writing a crawler, I just have to implement the page_function to get the data from the website.
    • The crawler supports me with helper functions for configuration (handing cookies to the crawler, defining pseudo-URLs, etc.).
    • The crawler fills the request_queue automatically according to the given base URL and the pseudo-URLs.
  • An actor is a kind of script/program that runs on the Apify platform.

    • I can do anything with an actor that I could do in a Node.js program.
    • I can implement any of the crawlers (basic, Puppeteer, etc.), but I have to do more of the necessary steps by hand (e.g. there are no pseudo-URLs, so I would have to collect all href attributes, match them against a regex and fill the request_queue manually; see the sketch after this list).
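
To make the "by hand" part concrete, here is a minimal sketch of that manual approach using the open-source Apify SDK (v1/v2-style API); the start URL and the regex pattern are placeholders, not anything taken from this thread:

```js
// Manual link handling inside an actor: collect hrefs, filter them with a
// regex and enqueue the matches yourself (instead of relying on pseudo-URLs).
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/' }); // placeholder start URL

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            // Scrape whatever you need from the page.
            const title = await page.title();
            await Apify.pushData({ url: request.url, title });

            // "Pseudo-URLs by hand": read all hrefs and enqueue only those
            // matching a pattern (placeholder pattern below).
            const hrefs = await page.$$eval('a[href]', (links) => links.map((a) => a.href));
            const pattern = /^https:\/\/example\.com\/category\/.+/;
            for (const url of hrefs.filter((href) => pattern.test(href))) {
                await requestQueue.addRequest({ url });
            }
        },
    });

    await crawler.run();
});
```

That said, the SDK also ships a helper, Apify.utils.enqueueLinks(), which accepts pseudo-URLs, so in practice you rarely need to do the regex matching yourself.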

Some points that do not look clear from my perspective:

  • Is it possible to call a crawler that I have defined in Apify Crawlers from within an actor?
  • How do I use the request queue to prevent my crawler from revisiting pages on different runs?

Thanks for helping me improve my understanding!


#2

Hello and thanks for your question,

I think you get the idea pretty well. One thing that is not obvious to our users is that the crawler was our first platform, and it is more hard-coded to do just the crawling jobs it does. Actor is our newer platform that can run arbitrary code (it doesn't need to be Node.js, you can run any language there). Because Actor is much more flexible, we are moving in its direction and adding to it the features the crawler offers. The best example is our open-source Node.js crawling and automation library, the Apify SDK.

In the upcoming months, we will completely merge the current crawlers into the Actor platform, and you will be able to choose different types of crawlers and have much more power there. We hope to do this so smoothly that current crawler users won't notice much.

Now to your questions:

  1. You can call a crawler and handle its results within an actor through our API or the API JS client. Here is an article that touches on this; see the first sketch after this list for an example.

  2. You can either create a named request queue that you reuse over multiple runs; the current limitation is that once a request is processed without an error, you cannot process it again. Another option is to have a named dataset where you store the URLs (or whole requests), load this dataset at the beginning of each run and filter your new requests against it; see the second sketch below.
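
For point 1, here is a rough sketch of starting a crawler from inside an actor with the apify-client package. The crawlers namespace and the method/parameter names below are assumptions recalled from the legacy client, so please double-check them against the client documentation; the crawler ID is a placeholder:

```js
// Starting a (legacy) crawler from an actor via the apify-client package and
// fetching its results. Method and parameter names are assumptions based on
// the legacy client API; verify them in the client docs.
const ApifyClient = require('apify-client');

const client = new ApifyClient({
    userId: process.env.APIFY_USER_ID,
    token: process.env.APIFY_TOKEN,
});

async function runCrawler() {
    // Start the crawler and wait for it to finish ('MY-CRAWLER-ID' is a placeholder).
    const execution = await client.crawlers.startExecution({
        crawlerId: 'MY-CRAWLER-ID',
        wait: 60, // seconds to wait for the execution to finish
    });

    // Fetch the results of the finished execution.
    const results = await client.crawlers.getExecutionResults({
        executionId: execution._id,
    });
    console.log(results);
}

runCrawler();
```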
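
For point 2, a minimal sketch combining both options with the Apify SDK: a named request queue that persists between runs, plus a named dataset of already-visited URLs used to filter new requests. All names here ('my-persistent-queue', 'visited-urls', the start URL) are placeholders:

```js
// Persisting crawl state across runs: a named request queue plus a named
// dataset of visited URLs that is loaded at the start of each run.
const Apify = require('apify');

Apify.main(async () => {
    // Option A: a named request queue survives between runs, so requests that
    // were already handled in a previous run are not processed again.
    const requestQueue = await Apify.openRequestQueue('my-persistent-queue');

    // Option B: keep a named dataset of URLs visited in previous runs and
    // filter against it before enqueueing.
    const visitedDataset = await Apify.openDataset('visited-urls');
    const { items } = await visitedDataset.getData();
    const visited = new Set(items.map((item) => item.url));

    const addIfNew = async (url) => {
        if (!visited.has(url)) await requestQueue.addRequest({ url });
    };
    await addIfNew('https://example.com/'); // placeholder start URL

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, $ }) => {
            // ... scrape the page with $ here ...
            // Record the URL so future runs skip it.
            await visitedDataset.pushData({ url: request.url });
        },
    });

    await crawler.run();
});
```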

Let me know if this was clear and feel free to ask any other questions.


#3

In brief, I believe:

crawler: gets HTML pages in huge numbers

Actor: an interface to Apify, with a "key" for the proxy (because crawling is … potentially illegal … )