Starting point help

Hello. I’m trying to get familiar with ‘actors’. To try the tool out, all I want is to point it at a website that has some items listed in a table, with pagination. I’d like the crawler to:

  1. On this “main page”, click each item in that table of 10 items, and extract structured data from the destination page of each of those 10 items. Where/how should I point to specific HTML elements in those pages and tell my Apify crawler to extract them in a structured way?

  2. Once this is done for the 10 or so items in the table, tell the crawler to click a specific ‘button’ on the “main page” to proceed to page 2, i.e. the next 10 items, perhaps with a delay. Where/how can I do this?

This must be the most common application of Apify, but the docs seem to be written for engineers and offer only early-stage detail. Or am I looking in the wrong place? I’d appreciate some pointers on this typical use case and how to codify/Apify it.

Thanks! It looks like a great tool.

Hi @Phoenix_Kiula,

It looks like you picked the developer docs for actors. Those are good, but too complicated for your use case.
If you just need recursive scraping, I recommend starting with this tutorial.


Thank you. I’ll take a closer look when I have some time this evening. At first glance, it doesn’t seem very detailed. It would help if these docs were written by people with a teaching approach: take a sample destination page, then walk readers through an example of precisely what to set up where, including how to target a specific portion of the markup on that destination site. That would be very helpful. Instead, these are presumptuous “help” pages.

Thank you for this. But as expected, that Help page doesn’t give much info.

How do I:

  1. Set up precisely which links to click on the destination page. In the “Link Selector”? The contextual icon help for this shows using div.mydiff for something that appears in the destination markup as <div class="mydiff"..., which is fine. But how do I enter something that’s a plain a href without any class, or with IDs that follow a pattern? And how do I limit this to links inside <td> table cells? Can I enter td a (with a space) in the “Link Selector”?

  2. Where do I specify what to do with the destination pages once those links are clicked?

  3. Where do I specify which precise pieces of HTML to extract from those destination pages once clicked (in Step 2 above)?

That tutorial has none of this info.

Hi @Phoenix_Kiula, have you read the second part of the tutorial as well? It builds on the theoretical knowledge from the first part and shows step-by-step examples of how to achieve exactly what you want to do.


Thanks, yes, I did see this. Early on, that page says:

We’ve already scraped number 1 and 2 in the Getting started with Apify scrapers tutorial

But no such thing happened in the first part of the tutorial. It went to the URL and got nothing. These were my settings…

This gets just one link in the “items” of the Dataset, which is the original URL itself. That means it hasn’t actually done anything.

Where can I find help on that before I move on to “step 2” of the tutorial? The Help really needs to be a lot clearer.

I’d much appreciate any pointers.

Your scraper does not visit the detail pages because your Pseudo URL is not correct. This part of the tutorial explains how to create the Pseudo URL correctly. You either need the full URL to match, or a regular expression preceding the fixed part:

http://wwwapps.tc.gc.ca/Saf-Sec-Sur/7/VRDB-BDRV/search-recherche/detail.aspx[.+]

or

[.+]detail.aspx[.+]

but beware that the second one will match any URL that includes detail.aspx, possibly even links outside the page you are scraping.

Also, it doesn’t seem that you’ve gone through the first part of the tutorial, since your page function still includes the default code that’s added when you first create a Web Scraper task.

Please go through the tutorial step by step first, all the way to the end of the second part. I’m sure that by the time you get there, you’ll have a much better understanding of what’s going on.

1 Like

Thanks a lot. Yes, I had changed that while looking at the log. So the task is in fact finding the URLs, but now I’m trying to identify the content on the target detail pages. Looking at some examples such as the IMDB scraper (jakubbalada/B4Z4i-api-imdb-com), I notice their code has “Label” if-conditions. But the latest api-webscraper doesn’t seem to give me the option to create labels.

Anyway, my input pageFunction code right now is:

async function pageFunction(context) {
    const { request, log } = context;
    const { url } = request;
    const $ = context.jQuery; // requires jQuery injection in the task settings
    log.info(`Scraping link ... ${url}`);

    // The returned object becomes one item in the Dataset.
    return {
        url,
        recalldate: $('#BodyContent_LB_RecallDate_d').text(),
    };
}

This is literally just one field from the destination pages, but it’s not being picked up. Fortunately these target pages have a very clear ID from which I’d like to pick up data. Is this simple pageFunction incorrect? (The Help has no step-by-step walkthrough for these issues… we need to learn from examples, I suppose, like that IMDB one above, which I found through the forums.)

Actually, the tutorial talks about labels in both the Start URL section and the Pseudo URL section. They are also clearly visible in the code examples throughout both the first and second parts of the tutorial.
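To illustrate, here’s a minimal sketch of how a label set on a Start URL or a Pseudo URL shows up inside a Web Scraper pageFunction. The LIST and DETAIL label names are assumptions for illustration; only the recall-date selector comes from your post above.

async function pageFunction(context) {
    const { request, log } = context;
    const $ = context.jQuery;

    // Labels set in the Start URLs / Pseudo URLs settings arrive here
    // via request.userData.label ('LIST' and 'DETAIL' are assumed names).
    if (request.userData.label === 'DETAIL') {
        return {
            url: request.url,
            recalldate: $('#BodyContent_LB_RecallDate_d').text().trim(),
        };
    }

    // On the list page there is nothing to return; the Link Selector and
    // Pseudo URLs enqueue the detail links automatically.
    log.info(`List page visited: ${request.url}`);
}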


Thank you. That’s my miss. It’s basically working now, wow. I can retrieve specific records. Yay!

Question about regexp. I’d like to target URLs that match the pattern:

(.*)detail.aspx(?!.+(lang=fra))

In simple words: anything containing detail.aspx, but only if the part after it does not include lang=fra, because we just need the English info. This regexp doesn’t work, though. It works if I remove the lang=fra part and also replace the round brackets with square brackets:

[.*]detail.aspx[.+]

This matches all URLs, though. So I tried including the ignore-phrase for French with square brackets too:

[.*]detail.aspx[?!.+[lang=fra]]

This doesn’t work either. The Help refers to regexbuddy_dot_com, which is some Windows software. Any pointers on how to get this right?

Addition: just to confirm my regex was working, I tried it on Regexr: https://regexr.com/4gdks

Thanks!

Yes, your regex is fine. To turn it into a Pseudo URL, you only need to enclose the parts that are not a plain string, i.e. the regular-expression parts, in square brackets. There’s no need to change anything within the regular expression itself to use it in a Pseudo URL.

[(.*)]detail.aspx[(?!.+(lang=fra))]

Now, this still will not work, because this Pseudo URL does not allow trailing characters. It will actually be parsed into a RegExp like this:

/^(.*)detail\.aspx(?!.+(lang=fra))$/

I suppose the docs are not very clear on this, so we should update them. Nevertheless, to make this work, you need to allow trailing characters:

[(.*)]detail.aspx[(?!.+(lang=fra)).*]

Which will produce this regex:

/^(.*)detail\.aspx(?!.+(lang=fra)).*$/
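As a quick sanity check, you can test that RegExp in plain JavaScript outside of Apify (the query strings below are made up for illustration):

// The RegExp produced by the working Pseudo URL above.
const re = /^(.*)detail\.aspx(?!.+(lang=fra)).*$/;

// English detail page: matches.
re.test('http://wwwapps.tc.gc.ca/Saf-Sec-Sur/7/VRDB-BDRV/search-recherche/detail.aspx?rn=1&lang=eng'); // true

// French detail page: rejected by the negative lookahead.
re.test('http://wwwapps.tc.gc.ca/Saf-Sec-Sur/7/VRDB-BDRV/search-recherche/detail.aspx?rn=1&lang=fra'); // false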

I hope it’s clearer now.


Thank you. This is very helpful, much appreciated.

One step at a time. My first crawler is now working as I’d like it to. I still need to figure out how to click a ‘Next’ button (there’s something in the docs, in somewhat engineering lingo) and how to loop through various elements (which I can work out from the IMDB example, which I think should be linked from the Help docs as well).

Many thanks! It’s an impressive tool.

Thank you as well. Just as a note: if you’re referring to this IMDB scraper, you should not use it as a definitive guideline, because it’s built on an old technology that will be phased out in a few months.

Feel free to use it to get ideas for how to do things, but the syntax in Web Scraper is a bit different, and it works a bit differently as well. So what works in that IMDB crawler might not work for you.

You can click the next button using jQuery like this: $('some-button-selector').click(), but beware that it will probably not work as expected. The website uses something called form navigation: instead of just displaying the next set of results, it reloads the page with different parameters. The trouble is, this page reload will cause your page function to end with an error. We will be releasing an update to Web Scraper next week that addresses this particular issue.

Or there is the option to use Puppeteer Scraper. It’s a bit more difficult to work with, but it’s much more powerful than the Web Scraper.
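As a rough idea, pagination in a Puppeteer Scraper pageFunction could look something like the sketch below. The button selector is a made-up placeholder, not taken from your site:

async function pageFunction(context) {
    const { page, request, log } = context;
    log.info(`Paginating from ${request.url}`);

    // ... extract data from the current results page first ...

    // Click "Next" and wait for the form POST-back to complete. Unlike
    // Web Scraper, Puppeteer Scraper gives you the raw page, so the full
    // reload can be awaited instead of crashing the page function.
    await Promise.all([
        page.waitForNavigation(), // resolves once the reload finishes
        page.click('#hypothetical-next-button'), // placeholder selector
    ]);
}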


Well, I’ll use the IMDB example more for looping through elements on the destination page. Let me check it out.

Might also take a look at the Puppeteer option if that’s more future-proof. Looking forward to seeing your new update. Thank you so much!