I have a large database of domains, and I want to crawl every page under each one. The catch is that some domains redirect to a different domain. In that case I want to follow the initial redirect and crawl all pages under the new domain, but I don’t know in advance which domains redirect or where they redirect to.
For example, let’s say that one of those domains is “http://x.com” and I use it as the start URL. Visiting “http://x.com” redirects either to “http://www.x.com” or to “http://y.com”. I’d like to continue crawling whichever of the two the redirect lands on, but not follow links to an unrelated domain such as “http://z.com”.
I can’t use a pseudo URL such as “http://y.com/[*]” because I don’t know the redirect target before the crawler starts. I don’t see how to use interceptRequest() for this purpose either.
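To make the rule I’m after concrete, here is a minimal Python sketch of the scoping logic only. It is independent of any crawler API. The names `host_of` and `in_scope` are hypothetical, and matching on the exact hostname of the post-redirect URL is my assumption about what “under the new domain” should mean.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(link: str, final_start_url: str) -> bool:
    """Return True if `link` is on the host the start URL redirected to.

    `final_start_url` is the URL reached after following the initial
    redirect, which is only known once the crawl has started.
    """
    host = host_of(final_start_url)
    return host != "" and host_of(link) == host


# Hypothetical run: the start URL "http://x.com" redirected to "http://y.com/".
final_url = "http://y.com/"
print(in_scope("http://y.com/about", final_url))  # link on the redirect target
print(in_scope("http://z.com/", final_url))       # link on an unrelated domain
```

So the filter has to be built from the final URL of the first page load rather than from the start URL, which is exactly the part I can’t express with a static pseudo URL.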
How can I solve this problem?