How to crawl a domain if it redirects?

We have a large database of domains that we want to crawl. We want to crawl all the pages under each domain. The catch is that some domains redirect to a different domain, so we want to follow the initial redirection and crawl all pages under the new domain, but we don’t know at this time which domains redirect and where they redirect.

For example, let’s say that one of those domains is “http://x.com” and we use it as the start URL. Visiting “http://x.com” redirects to “http://www.x.com” or to “http://y.com”. In this case, I’d like to continue crawling “http://www.x.com” or “http://y.com”, but not “http://z.com”.

I can’t use a pseudo URL such as “http://y.com/[*]” because I don’t know the redirection domain before I start the crawler. I don’t see how to use interceptRequest() for this purpose either.

How can I solve this problem?

Example:
http://www.9mmpr.com (redirects to http://9mmpr.com)

If the redirect is performed with an HTTP 301 response, the crawler follows it and there’s no way to intercept the request (you will only see that url and loadedUrl differ for your page).

If the redirect is performed in JavaScript (e.g. by setting window.location), then the behaviour depends on whether the JavaScript code is executed before the page is fully loaded (option 1) or after (option 2):

  1. If it’s before, the new page will open in the crawler (just as with an HTTP redirect). Note that to prevent this behavior, you can use the Don’t load frames and IFRAMEs setting (see docs for more details), which will cause the page to behave as in option 2.

  2. If the redirect comes after the page was loaded, then the page will not be redirected and the requested page navigation will be intercepted. You can use the interceptRequest() function to ensure the second page will be loaded by the crawler (by setting newRequest.willLoad = true). If you set the Clickable elements setting to an empty string, you can be sure that all navigation requests originate from JavaScript redirects, and not from page clicks.

For example, the web page http://www.9mmpr.com redirects to http://9mmpr.com using an HTTP 301 response.
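To make the HTTP-redirect case concrete, here is a minimal sketch of how the url / loadedUrl difference could be checked from your own code. The helper names getHostname and describeRedirect are made up for this illustration, and the regex-based hostname extraction is a simplification (it does not handle IPv6 literals, for instance). It only assumes the request object exposes url and loadedUrl as described above.

```javascript
// Hypothetical helper: extract the lowercased hostname from an absolute URL.
// Returns null for URLs without a "scheme://host" part (e.g. "about:blank").
function getHostname(url) {
    var match = /^[a-z][a-z0-9+.\-]*:\/\/(?:[^\/?#@]*@)?([^\/?#:]+)/i.exec(url);
    return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: compare the requested URL with the URL that was
// actually loaded, to see whether a redirect happened and whether it
// moved the page to a different hostname.
function describeRedirect(request) {
    var fromHost = getHostname(request.url);
    var toHost = getHostname(request.loadedUrl);
    return {
        redirected: request.url !== request.loadedUrl,
        crossHost: fromHost !== toHost,
        fromHost: fromHost,
        toHost: toHost
    };
}
```

Called with { url: 'http://www.9mmpr.com', loadedUrl: 'http://9mmpr.com/' }, this reports a redirect that crossed from www.9mmpr.com to 9mmpr.com.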

I see how you used interceptRequest() to follow the redirect, but how do I have the crawler continue to follow links within the site, such as “http://9mmpr.com/welcome-to-9mm-pr/” which is a link on the main page “http://9mmpr.com”, but not leave the site via other links, such as “http://wordpress.org” which is also linked from the main page?

If you only want to load pages on the same domain, you can use the following interceptRequest function:

function interceptRequest(context, newRequest) {
    // Get the hostname of the new and the current page,
    // using the anchor-element trick from https://gist.github.com/jlong/2428561
    var parser = document.createElement('a');
    parser.href = newRequest.url;
    var newHostname = parser.hostname.toLowerCase();
    parser.href = context.request.loadedUrl;
    var curHostname = parser.hostname.toLowerCase();

    // Load the new request only if it points to the same hostname
    // as the page that was actually loaded (i.e. after any redirect)
    if (newHostname === curHostname) {
        newRequest.willLoad = true;
    }

    return newRequest;
}

Just a note that this will work even if Pseudo URLs are empty. Basically you fully control what to load using the intercept request function.
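One thing to be aware of: the function above compares exact hostnames, so a link from 9mmpr.com to www.9mmpr.com would not be marked for loading. If you want to treat the www and non-www variants as the same site, a possible tweak is sketched below. The names stripWww and sameSite are hypothetical, and dropping a leading "www." is only a rough heuristic, not a real registered-domain check.

```javascript
// Hypothetical helper: lowercase the hostname and drop a leading "www."
function stripWww(hostname) {
    return hostname.toLowerCase().replace(/^www\./, '');
}

// Hypothetical helper: true if both hostnames refer to the same site
// once the optional "www." prefix is ignored.
function sameSite(hostnameA, hostnameB) {
    return stripWww(hostnameA) === stripWww(hostnameB);
}
```

In the intercept function you would then replace the newHostname === curHostname condition with sameSite(newHostname, curHostname).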

I see, thanks!

I think that will work with a simple modification for start pages where context.request will be “about:blank”, correct?

It will work even for start URLs, because they already have newRequest.willLoad set to true.

What change do I have to make in this code if I want to crawl the redirected page too?

If you want to let the crawler visit some other pages, add them to the Crawl pseudo-URLs field.

301, “Moved Permanently”

The HTTP response status code 301 Moved Permanently is used for permanent URL redirection: the page has permanently moved to a new location, so existing links or records that use the requested URL should be updated. A 301 response from the web server should always include the alternative URL to which the redirection should occur. If it does, a web browser will immediately retry the request at that alternative URL. This is the best way to ensure that users and search engines are directed to the correct page.
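As a rough illustration of what a client does with such a response, here is a small sketch that works out the URL to retry. The function name redirectTarget is made up, the headers object is assumed to use lowercased keys (as Node.js does), and the standard URL class is assumed to be available, which may not hold in older crawler environments. The alternative URL is read from the Location header and may be relative to the requested URL.

```javascript
// Hypothetical helper: given the requested URL, the response status code
// and the response headers (assumed to have lowercased keys), return the
// absolute URL a client should retry, or null if this is not a usable 301.
function redirectTarget(requestUrl, statusCode, headers) {
    if (statusCode !== 301) {
        return null;
    }
    var location = headers['location'];
    if (!location) {
        return null;
    }
    // Resolve a possibly relative Location value against the requested URL
    return new URL(location, requestUrl).href;
}
```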