Keyword-based crawler for amazon.com


#1

I am trying to use this community crawler, which takes a keyword from the custom data field, crawls amazon.com for it and returns the results.

At the moment I am only getting the product URL, not the title or price, nor the details from the product page such as SKU, ASIN, UPC, MPN, product image URL, weight, height, category and brand.

Any suggestions for what to change in the crawler settings would be appreciated.


#2

Hi @cflema,
I have just uploaded a new version of the Amazon crawler, extended it with these attributes and fixed a missing array.
The crawler now outputs:
name,
price,
description,
image URL,
brand,
category and
ASIN (SKU).

Feel free to get the new version! :slight_smile:

Best,
Vasek


#3

Hi @Vasek

I will try out the updated Amazon crawler.

What pagination options would you recommend in the crawler so that as many products as possible are captured per keyword?

The end goal is to capture 200K+ products from Amazon.com.


#4

I just ran the updated crawler using the keyword HEADPHONES in the custom data field, and this was all that was returned:

https://api.apifier.com/v1/execs/B4APv3tjBM4dH8Qt2/results?format=xml&simplified=1

Again it was just the URL that was returned in the run results…

I am not sure what I am missing to crawl the following data correctly:

name
price
description
image url
brand
category
asin
sku
weight


#5

Hi @cflema,
sorry, there was a missing bracket in the text() function call; it works correctly now.
This number of products will require more proxy IPs. Pagination is no problem; I will add it to the example crawler.
We can discuss this by email and create a custom crawler for your needs if you need more specific attributes etc.

Here is a result for headphones: https://api.apifier.com/v1/execs/iTGkdpwJ2iP8zMLsN/results?format=xml&simplified=1


#6

Hi @rut.vaclav
Seems to work fine now.
https://api.apifier.com/v1/execs/HeQX9X3CQw9P3shyq/results?format=xml&simplified=1



What settings do I need to use to crawl more than it currently does? Only 25 pages were crawled, which produced 23 results.


#7

Hi @cflema,
You will usually crawl more pages than you output.
For example, this crawler goes to the amazon.com page, then reads the keyword from the custom data and goes to the search page, such as https://www.amazon.com/s/field-keywords=headphones, where it gets the URLs of the items and outputs them.
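
One way to read the numbers from the run above (25 pages crawled, 23 results) is sketched below. It assumes, and this is only an assumption, that the run loaded one start page and one search page, and that every remaining page was a product page that produced one result. The function name `countPages` is made up for the illustration.

```javascript
// Hypothetical helper, not part of the crawler: tallies crawled vs. output pages.
// Assumption: start and search pages are crawled but skipped in the output,
// while every product page yields exactly one result.
function countPages(startPages, searchPages, productPages) {
    return {
        crawled: startPages + searchPages + productPages,
        output: productPages
    };
}

// 1 start page + 1 search page + 23 product pages
var run = countPages(1, 1, 23);
console.log(run.crawled, run.output); // prints: 25 23
```

With pagination the number of search pages grows, so the gap between pages crawled and pages output grows with it.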


#8

Hi all,
I have just uploaded a new version of the Amazon crawler which works with pagination, so you can get more results and not just the first page. But remember, if you need to get more results, you will need more proxies.

Best,
Vasek


#9

Hi @rut.vaclav

I was planning on using one of the following proxy groups:
Shader US 500 (500 US IPs from Shader)
BuyProxies US 100 (100 US IPs from BuyProxies, refreshed monthly)


#10

Hi @rut.vaclav

I also need to capture the following from the product information section in the crawl:

Product Dimensions
Item Weight
Shipping Weight
Item model number

And multiple product image URLs (gallery) if a product has more than one product image on the product page.


#11

The last crawler run gave these results:

Pages crawled: 314
Pages outputted: 300

What value should Max crawling depth be set to?
What is the ideal Max pages per IP address value for the business plan?

I am trying to run a test of at least 10,000 products.

I set the Parallel crawling processes to 10 in the crawler, which the M plan supports.


#12

Hi @cflema,
I would say the results you received are OK; I checked them and they match the results on Amazon.
If you are going to test tens of thousands of products, how fast do you need this? If the crawler can run for a longer time, the requirements for the proxies are lower. But if you need to crawl this amount regularly and fast, we would have to integrate an anti-captcha function into the crawler, because Amazon displays a captcha when you reach their limits. Unfortunately, we don’t know what those limits are.

Can you send me a sample link to a product that has the attributes you described, such as size and weight? I can make a sample crawler for you with these.

Best,
Vasek


#13

Hi @rut.vaclav
The results are okay; they are just missing the attributes from the product information section at the bottom of the product listing. I do not have a problem with the crawler running for a couple of hours, but since the plan can run up to 10 parallel requests, it would make sense to crawl as many products as possible in the shortest amount of time.

Example:

Product Dimensions 5.5 x 2.5 x 6.8 inches
Item Weight 3.7 ounces
Shipping Weight 4 ounces
Manufacturer AmazonBasics
ASIN B00NBEWB4U
Origin China
Item model number HP01_v2

I need 300k headphone products crawled from Amazon.com as soon as possible.

Best,
Chris
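
A minimal sketch of how rows like the ones in the example could be split into attribute names and values once their text has been read from the page. It assumes each row arrives as a single string that starts with one of a known set of labels. `parseProductInfo` and the `LABELS` list are hypothetical names, and how the rows are selected from the product page is left out.

```javascript
// Hypothetical label list, taken from the example rows in this post.
var LABELS = [
    "Product Dimensions", "Item Weight", "Shipping Weight",
    "Manufacturer", "ASIN", "Origin", "Item model number"
];

// Splits "Label value" rows into an object keyed by label.
// Rows that start with no known label are ignored.
function parseProductInfo(rows, labels) {
    // Try longer labels first so a short label cannot shadow a longer one.
    var sorted = labels.slice().sort(function (a, b) { return b.length - a.length; });
    var info = {};
    rows.forEach(function (row) {
        var text = row.trim();
        for (var i = 0; i < sorted.length; i++) {
            if (text.indexOf(sorted[i] + " ") === 0) {
                info[sorted[i]] = text.slice(sorted[i].length).trim();
                break;
            }
        }
    });
    return info;
}

var info = parseProductInfo([
    "Product Dimensions 5.5 x 2.5 x 6.8 inches",
    "Item Weight 3.7 ounces",
    "Shipping Weight 4 ounces",
    "Item model number HP01_v2",
    "ASIN B00NBEWB4U"
], LABELS);
console.log(info["Item Weight"]); // prints: 3.7 ounces
```

Matching on known labels keeps multi-word values such as the dimensions intact, at the cost of having to list every label of interest up front.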


#14

I did a quick check of the category counts on the main headphones landing page on Amazon.com:

Earbud 22,616

Over-Ear 8,012

On-Ear 10,346

Bluetooth 6,619

Sports & Fitness 11,520

Noise-Canceling 2,308

Pro Audio 67

Total 61,488

165,407 results found on keyword search
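
The total above can be double-checked with a few lines of JavaScript. This is just the arithmetic on the counts listed in this post; whether the categories overlap, or cover everything the keyword search matches, cannot be told from these numbers.

```javascript
// Counts copied from the category list in this post.
var categoryCounts = {
    "Earbud": 22616,
    "Over-Ear": 8012,
    "On-Ear": 10346,
    "Bluetooth": 6619,
    "Sports & Fitness": 11520,
    "Noise-Canceling": 2308,
    "Pro Audio": 67
};
var keywordSearchResults = 165407;

// Sum the per-category counts.
var total = Object.keys(categoryCounts).reduce(function (sum, name) {
    return sum + categoryCounts[name];
}, 0);

console.log(total); // prints: 61488
console.log(keywordSearchResults - total); // prints: 103919
```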


#15

Hi @cflema,
I have replied to the email you sent, so have a look at it.
Best,
Vasek


#16

Hi @rut.vaclav,

I tried running the Amazon crawler but I’m getting the error below:

Error: Crawl cannot be started. (The act owner cannot access proxy group ID SHADERPRIVATE [proxy-group-now-allowed])

Can you please help me with it?

Thanks,
Dhruv


#17

Hello @dhruvvinayak,
I have updated the crawler settings, sorry about this.
I have also updated the example crawler, so nobody else should face the same problem.
If you need a hand with anything else, feel free to let me know.

Best,
Vaclav


#18

Hi All,

@rut.vaclav Great work, and thank you!!

I’m trying to modify the crawler to only search Amazon Fresh.

The crawler works great but returns results across the whole of Amazon.com. You would be surprised how many results "banana" returns that are not in Amazon Fresh. :slight_smile:

I’ve added cookies so I am in an Amazon Fresh delivery zone, but I need to restrict the results to Amazon Fresh.

Is there also a way to stop the results from being truncated?

Any help would be greatly appreciated.


#19

Hello @Jonathan_Gillmor,
from what I have checked, I would suggest changing the start of the actor to something like this:

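    // On the start page, enqueue one Amazon Fresh search per comma-separated keyword.
    if (context.request.label === "START") {
        context.customData.split(",").forEach(function(keyword) {
            var trimmed = keyword.trim();
            // Skip empty entries, e.g. the one left by a trailing comma.
            if (trimmed.length > 0) {
                // Collapse whitespace runs into "+" for the query string.
                var searchString = trimmed.replace(/\s+/g, "+");
                context.enqueuePage({
                    // search-alias=amazonfresh restricts the search to Amazon Fresh.
                    url: "https://www.amazon.com/s/?url=search-alias%3Damazonfresh&field-keywords=" + searchString,
                    label: "SEARCH",
                    interceptRequestData: {
                        keyword: trimmed
                    }
                });
            }
        });
        // The start page itself produces no result and its links are not followed.
        context.skipOutput();
        context.skipLinks();
    }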

This way it searches only within Amazon Fresh.
Test it out and let me know if this helps.

Best,
Vaclav
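
To check what the snippet above would enqueue without starting a crawl, the keyword handling can be pulled out into a plain function. This is only a sketch: `freshSearchUrls` is a hypothetical helper, not part of the crawler, it assumes the custom data is a comma-separated string of keywords, and it skips empty keywords.

```javascript
// Hypothetical helper mirroring the keyword handling in the START branch.
var FRESH_SEARCH = "https://www.amazon.com/s/?url=search-alias%3Damazonfresh&field-keywords=";

function freshSearchUrls(customData) {
    return customData.split(",")
        .map(function (keyword) { return keyword.trim(); })
        // Drop empty entries, e.g. after a trailing comma.
        .filter(function (keyword) { return keyword.length > 0; })
        // Whitespace runs inside a keyword become "+", as in the snippet.
        .map(function (keyword) { return FRESH_SEARCH + keyword.replace(/\s+/g, "+"); });
}

// Logs one search URL per keyword: one for "banana", one for "almond milk".
console.log(freshSearchUrls("banana, almond milk"));
```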


#20

Thanks, I’ll check it out and let you know… thanks again.