Unable to crawl some URLs?


#1

I’ve been able to set up and crawl a handful of URLs, but some of them just won’t work. For example, I’ve been trying to crawl macys.com, and no matter what I do I always get the following in the log and the crawl fails. Any ideas? Are there some sites that you just can’t crawl with Apify?

[2018-02-26 19:37:01.744: S0000002] ON RESOURCE TIMEOUT | response: {"errorCode":408,"errorString":"Network timeout on resource.","headers":[{"name":"Accept","value":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},{"name":"User-Agent","value":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}],"id":1,"method":"GET","time":"2018-02-26T19:36:31.627Z","url":"https://www.macys.com/"}
[2018-02-26 19:37:01.744: S0000002] ON RESOURCE ERROR | resourceError: {"errorCode":5,"errorString":"Operation canceled","id":1,"status":null,"statusText":null,"url":"https://www.macys.com/"}
[2018-02-26 19:37:01.745: S0000002] ON LOAD FINISHED | status: fail, url: N/A
[2018-02-26 19:37:01.745: S0000002] ERROR: An exception occurred while processing the web page: 
                                    CrawlerError: The page couldn't be opened (status: fail, url: https://www.macys.com, lastResourceError: {"errorCode":5,"errorString":"Operation canceled","id":1,"status":null,"statusText":null,"url":"https://www.macys.com/"}, lastResourceTimeoutResponse: {"errorCode":408,"errorString":"Network timeout on resource.","headers":[{"name":"Accept","value":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"},{"name":"User-Agent","value":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}],"id":1,"method":"GET","time":"2018-02-26T19:36:31.627Z","url":"https://www.macys.com/"})
                                    onLoadFinished@phantomjs://platform/crawlercore.js:149:268
                                    phantomjs://platform/crawlercore.js:36:28
                                    [native code]

Could it be that I’m hitting some sort of account limit?


#2

Hello @yz4now,
thanks for your post.
The issue with macys.com is that the site uses anti-scraping protection.
We would recommend using an Actor for this use case (here is an example of crawling a page with an Actor - link) and deleting the cookies after each visited page; that does the trick.
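For illustration, here is a minimal sketch of how the cookie clearing might look in an Actor built on the Apify SDK’s PuppeteerCrawler. The start URL, the extracted data, and the crawler options are placeholders, not an official configuration for macys.com.

```js
const Apify = require('apify');

Apify.main(async () => {
    // Placeholder start URL taken from the question above.
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://www.macys.com/' },
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            // Do whatever extraction you need here; the title is just an example.
            const title = await page.title();
            await Apify.pushData({ url: request.url, title });

            // Delete all cookies after each visited page so the
            // anti-scraping protection cannot track the session.
            const cookies = await page.cookies();
            if (cookies.length) {
                await page.deleteCookie(...cookies);
            }
        },
    });

    await crawler.run();
});
```

Clearing the cookies between pages means the protection cannot recognize a session built up over previous visits, which is what usually triggers the blocking.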

If you need a hand with anything, just let us know.
If you are interested, we could also prepare a crawler configuration for macys.com for you; just fill in this form and we will get back to you with a price estimate.

Best,
Vaclav


#3

@yz4now, how did you get this log format?

[2018-02-26 19:37:01.744: S0000002] ON RESOURCE TIMEOUT
.....
[2018-02-26 19:37:01.745: S0000002] ERROR:

#4

This is the standard log format of our Crawler product, which we are replacing with the Actor.