With the Cheerio sitemap.xml crawler actor, it's possible to extract all URLs in a sitemap.xml and then crawl them. In my use case it works fine, but I have approx. 600 different sitemaps (each with URLs) that I need to crawl.
I found out that the website has a sitemapindex.xml where all the sitemap.xml links reside.
-> Inside are all the sitemap.xml URLs, like this
The sitemap index also contains links to sitemap URLs like “another-page” that I don’t need to crawl. Is there any way to change the source code of the actor so that it works like this:
- Go to sitemapindex.xml
- Get all sitemap.xml URLs that look like https://www.example.com/page[.*]
- Pass all the URLs to requestList to then crawl
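
To make the steps concrete, here is a rough TypeScript sketch of what I have in mind. It is not the actor's actual source code: the function names (`extractLocs`, `filterSitemapUrls`, `collectPageUrls`, `toRequestSources`) and the `fetchText` parameter are hypothetical, and I am assuming the pattern above means "URL starts with https://www.example.com/page". It pulls `<loc>` values out with a regular expression to stay dependency-free, so it does not decode XML entities or handle CDATA; a real XML parser would be more robust.

```typescript
// Hypothetical helper: collect every <loc> value from a sitemap index
// (or a regular sitemap) given as an XML string.
function extractLocs(xml: string): string[] {
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const url = match[1];
    if (url !== undefined) {
      urls.push(url.trim());
    }
  }
  return urls;
}

// Hypothetical helper: keep only the sitemap URLs matching the wanted pattern.
// Pass a regex WITHOUT the "g" flag, since test() is stateful on global regexes.
function filterSitemapUrls(indexXml: string, wanted: RegExp): string[] {
  return extractLocs(indexXml).filter((url) => wanted.test(url));
}

// Hypothetical helper: go to the sitemap index, pick the matching sitemaps,
// then gather the page URLs listed inside each of them.
// fetchText is injected by the caller (assumption: any function that
// downloads a URL and resolves with the response body as text).
async function collectPageUrls(
  indexUrl: string,
  wanted: RegExp,
  fetchText: (url: string) => Promise<string>,
): Promise<string[]> {
  const indexXml = await fetchText(indexUrl);
  const pageUrls: string[] = [];
  for (const sitemapUrl of filterSitemapUrls(indexXml, wanted)) {
    const sitemapXml = await fetchText(sitemapUrl);
    for (const pageUrl of extractLocs(sitemapXml)) {
      pageUrls.push(pageUrl);
    }
  }
  return pageUrls;
}

// Assumption: the request list takes its sources as objects with a url field.
function toRequestSources(urls: string[]): { url: string }[] {
  return urls.map((url) => ({ url }));
}
```

The idea would be to call `collectPageUrls` with the sitemapindex.xml URL and a pattern such as `/^https:\/\/www\.example\.com\/page/`, then hand the result of `toRequestSources` to wherever the actor builds its requestList. I don't know how the actor does that internally, so that wiring is left out.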
Have a great day