Skip unwanted subpages


#1

Good morning,

On the website 123pages, I would like to get the content of every subpage linked from a specific results page (for example www.123pages.fr/c/alfortville/batiment-1/).

I’m running into a few issues with my current version:

  • The first page is skipped
  • “address” (see below) returns the full address, while I only want the zip code
  • It skips subpages from the list
  • It grabs unwanted pages, like …/b/write-review/…, while I only want detail pages
  • The webpage indicates “326” results, while my CSV results only contain “146”

Thanks in advance for any help!

John

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;

    if (context.request.label === "details") {
        var result = {
            adresse: $(".pagesjaunes .address").text(),
            web: $(".pagesjaunes .weblink").text(),
            email: $(".pagesjaunes .email").text()
        };

        return result;
    } else {
        context.skipOutput();
    }
}

#2

Dear,

Any help would be greatly appreciated!

Thanks :)


#3

Hello @john,

Sorry for the late reply. Let me go through this quickly:

  1. The first page is skipped because you only define the result variable inside the if clause. Therefore, no results are output for any page that is not labeled details.
  2. You’re getting the whole address because the .address element includes the whole address. You need to specifically target the <span> that holds the ZIP code, if you need just that code.
  3. Could you be more specific? Are there any errors in the log?
  4. That’s because you also enqueue links within the detail page. Call context.skipLinks() to prevent that. See the docs.
  5. Are you using the simplified version of results or the full list? The full list will have errors that might help you debug the missing items.
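Points 2 and 4 could look roughly like the sketch below. It is only a sketch under a few assumptions: I don't know which <span> holds the ZIP code on the detail page, so instead of a selector it uses a hypothetical helper, extractZipCode, that takes the first 5-digit group from the address text (French postal codes have 5 digits). The zip field name and the trim() calls are also my additions, not part of your original code.

```javascript
// Hypothetical helper (not part of the crawler API): returns the first
// standalone 5-digit group in the address text, or null if there is none.
function extractZipCode(addressText) {
    var match = /\b\d{5}\b/.exec(addressText);
    return match ? match[0] : null;
}

function pageFunction(context) {
    var $ = context.jQuery;

    if (context.request.label !== "details") {
        // Listing pages: no result row, but links are still followed.
        context.skipOutput();
        return;
    }

    // Detail pages: do not enqueue links found here, so pages such as
    // .../b/write-review/... are never visited.
    context.skipLinks();

    return {
        zip: extractZipCode($(".pagesjaunes .address").text()),
        web: $(".pagesjaunes .weblink").text().trim(),
        email: $(".pagesjaunes .email").text().trim()
    };
}
```

If an address can contain another 5-digit number before the postal code, the regular expression approach will pick the wrong one, and targeting the specific element is the safer option.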

Hope this helps!


#4

@mnmkng, thank you very much, now everything works like a charm!

I only have one issue left: the first page is skipped in my result list.

I haven’t found a way to solve it yet. Would you mind being more specific?


#5

Hello @john,

I’m sorry I misunderstood your question about the first page. Now I understand that you’re not getting results from the first listing page.

The trouble is in the links to details on the first page. For reasons unknown, the detail URLs on the first listing page are different from the ones on the subsequent pages.

page 1: http://www.123pages.fr/fr/b/allo-electricien-boulogne-boulogne-billancourt
page 2: http://www.123pages.fr/b/afdf-boulogne-billancourt

Notice the extra /fr in the page 1 link. Your details Pseudo URL does not match it, and therefore those links are not enqueued.

Changing the Pseudo URL to http://www.123pages.fr/[(fr/)?b/.+] will help!
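To check that the new pattern catches both link shapes, here is a quick sketch using a plain regular expression. It assumes that the part of a Pseudo URL in square brackets is a regular expression, that the rest is matched literally, and that the whole URL has to match. The detailPattern name is only for illustration.

```javascript
// Hand-written equivalent of http://www.123pages.fr/[(fr/)?b/.+]
// (assumption: text outside the brackets is literal, whole URL must match).
var detailPattern = /^http:\/\/www\.123pages\.fr\/(fr\/)?b\/.+$/;

var urls = [
    "http://www.123pages.fr/fr/b/allo-electricien-boulogne-boulogne-billancourt", // page 1 style
    "http://www.123pages.fr/b/afdf-boulogne-billancourt",                         // page 2 style
    "http://www.123pages.fr/c/alfortville/batiment-1/"                            // listing page
];

urls.forEach(function (url) {
    console.log(detailPattern.test(url), url);
});
```

The (fr/)? group makes the fr/ segment optional, so both detail link styles match, while the listing page under /c/ does not.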


#6

@mnmkng everything solved, thank you for your help!