Use sitemap actor to crawl... sitemaps

Hi!

With the actor cheerio sitemap.xml crawler, it's possible to extract all URLs from a sitemap.xml and then crawl them. In my use case it works fine, but I have approx. 600 different sitemaps (with URLs) that I need to crawl.

I found out that the website has a sitemapindex.xml where all the sitemap.xml links reside.
Like this:

www.example.com/sitemapindex.xml
-> inside are all the sitemap.xml URLs, like this:
<sitemap>
<loc> https://www.example.com/page1/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc> https://www.example.com/page2/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc> https://www.example.com/another-page/sitemap.xml</loc>
</sitemap>
…etc

In the sitemap index there are links to sitemap URLs like “another-page” that I don’t need to crawl, but is there any way to change the source code of the actor to handle a sitemap index like this?

Have a great day

Simon

Does anybody have an idea how to accomplish this?
I believe it would have something to do with this part (it works for one XML; what I would like is to grab all the XML sitemaps that are listed under ‘https://singlesite.sitemap.xml’).

Alternatively, how could I change the source code below to accept multiple .xml URLs?

Thanks

const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {
    // Download sitemap
    const xml = await requestPromised({
        url: 'https://singlesite.sitemap.xml',
        headers: {
            'User-Agent': 'curl/7.54.0'
        }
    });
    
    // Parse sitemap and create RequestList from it
    const $ = cheerio.load(xml);
    const sources = [];
    $('loc').each(function () {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Without this header, the target doesn't allow the page to be downloaded!
                'User-Agent': 'curl/7.54.0',
            }
        });
    });

    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();
    
    // Crawl each page from sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            
            await Apify.pushData({
                url: request.url
                //some scraping
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
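
I imagine the change would be something along these lines (just an untested sketch; the sitemap URLs below are placeholders for the real ones), but I’m not sure it’s the right approach:

const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {
    // Hard-coded list of sitemap URLs (placeholders)
    const sitemapUrls = [
        'https://www.example.com/page1/sitemap.xml',
        'https://www.example.com/page2/sitemap.xml',
    ];

    const sources = [];
    for (const sitemapUrl of sitemapUrls) {
        // Download each sitemap
        const xml = await requestPromised({
            url: sitemapUrl,
            headers: { 'User-Agent': 'curl/7.54.0' },
        });

        // Parse it and collect all page URLs into one list of sources
        const $ = cheerio.load(xml, { xmlMode: true });
        $('loc').each(function () {
            sources.push({
                url: $(this).text().trim(),
                headers: { 'User-Agent': 'curl/7.54.0' },
            });
        });
    }

    const requestList = new Apify.RequestList({ sources });
    await requestList.initialize();

    // Crawl each page from all sitemaps
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({ url: request.url });
        },
    });

    await crawler.run();
    console.log('Done.');
});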

Hi @simon,

I think the best way to get all URLs from the sitemaps is to use BasicCrawler.
You can extract all the URLs and enqueue newly found sitemaps into the request queue.
You can check my code:

const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    await requestQueue.addRequest({ url: 'https://singlesite.sitemap.xml' });

    // Crawl each page from sitemap
    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            const xml = await requestPromised({
                url: request.url,
                headers: {
                    'User-Agent': 'curl/7.54.0',
                },
            });
            const $ = cheerio.load(xml, { xmlMode: true });
            const sitemapUrls = [];
            const siteUrls = [];

            // Pick all urls from sitemap
            $('url').each(function () {
                const url = $(this).find('loc').text().trim();
                siteUrls.push(url);
            });

            // Pick all sitemap urls from sitemap
            $('sitemap').each(function () {
                const url = $(this).find('loc').text().trim();
                sitemapUrls.push(url);
            });

            for (const sitemapUrl of sitemapUrls) {
                // Enqueue all found sitemaps so they get processed too
                await requestQueue.addRequest({ url: sitemapUrl });
            }

            await Apify.pushData({
                url: request.url,
                siteUrls,
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});

Hi drobnikj,
Thanks a lot!
The source code looks like it would work, but the end result is a bit strange. The dataset looks like this:

Columns: siteUrls/0, siteUrls/1, siteUrls/2, siteUrls/3 … etc. (up to siteUrls/999), url
Rows: 1, 2, 3 … etc., up to 1344

Then all the URLs are placed in that grid, around 90,000 items in total.
The crawler does not crawl the actual URLs, it only extracts them. So in essence I do get all the URLs (albeit too many, as I only need those with a certain structure), but it does not crawl them.

Btw, I noticed that you wrote that the best way is to use BasicCrawler. I’m not sure what you mean by that, since you do use cheerio as far as I can see. The code that I uploaded was from an actor built by jancurn, so I thought that would be a good starting point.

If the original code (the one from jancurn) can be edited to take multiple input URLs, that would make my day! The approx. 700 URLs that I need from the index are static, so I can manually input them once.

Any help appreciated!

The problem with CheerioCrawler is that it can only parse HTML. It cannot work with XML, so we need to use BasicCrawler and parse XML and HTML in different ways.

I updated the code; it shows how you can split the handling of XML sitemaps and HTML pages in BasicCrawler.

const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    await requestQueue.addRequest({ url: 'https://auts.cz/sitemap.xml', userData: { label: 'sitemap' } });

    // Crawl each page from sitemap
    const crawler = new Apify.BasicCrawler({
        requestQueue,
        handleRequestFunction: async ({ request }) => {
            const { label } = request.userData;
            console.log(`Processing page ${request.url}`);
            if (label === 'sitemap') {
                const xml = await requestPromised({
                    url: request.url,
                    headers: {
                        'User-Agent': 'curl/7.54.0',
                    },
                });
                const $ = cheerio.load(xml, { xmlMode: true });

                const sitemapUrls = [];
                const siteUrls = [];

                // Pick all urls from sitemap
                $('url').each(function () {
                    const url = $(this).find('loc').text().trim();
                    siteUrls.push(url);
                });

                // Pick all sitemap urls from sitemap
                $('sitemap').each(function () {
                    const url = $(this).find('loc').text().trim();
                    sitemapUrls.push(url);
                });

                for (const siteMapUrl of sitemapUrls) {
                    await requestQueue.addRequest({ url: siteMapUrl, userData: { label: 'sitemap' } });
                }

                for (const siteUrl of siteUrls) {
                    await requestQueue.addRequest({ url: siteUrl });
                }
            } else {
                // Here you can parse data from non-sitemap (HTML) pages
                const html = await requestPromised({
                    url: request.url,
                });
                const $ = cheerio.load(html);
                await Apify.pushData({
                    url: request.url,
                    title: $('title').text().trim(),
                });
            }
        },
    });

    await crawler.run();
    console.log('Done.');
});
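
Btw, if you only need the URLs with a certain structure, you could filter siteUrls before enqueueing them, for example with a regular expression. A rough sketch (the pattern below is just a placeholder for whatever structure you need):

// Inside the 'sitemap' branch, instead of enqueueing every siteUrl:
const URL_PATTERN = /\/page1\//; // placeholder: put the structure you need here

for (const siteUrl of siteUrls) {
    // Enqueue only the page URLs that match the required structure
    if (URL_PATTERN.test(siteUrl)) {
        await requestQueue.addRequest({ url: siteUrl });
    }
}

You could do the same with sitemapUrls if you want to skip whole sitemaps like the “another-page” one.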