Consuming a dataset as a start URL list isn't working


I’m trying to use a dataset generated by one task as the input for another task and I can’t get it to work.

For context, I’m using the apify/web-scraper Actor to scrape 5 category pages and retrieve the URLs of product pages. I can’t use apify/cheerio-scraper for the category pages, because it’s a React-based site that paginates with AJAX. However, I can scrape the product pages with Cheerio, and since that’s much more efficient, I’d like to use a separate task for them, with the output of the first task as the input to the second.

Since the API can return the dataset from the first crawl as CSV, this looks easy. As my dataset only has one field (url), downloading the CSV via a browser gives me a single-column text file. Looks perfect. So I’ve set the start list URL to point to the Apify API. However, when I run the task I get this log output (I’ve obfuscated the task ID):

2019-04-25T09:31:19.403Z WARNING: RequestList: list fetched, but it is empty. {"requestsFromUrl":"***/runs/last/dataset/items?token=*********&status=SUCCEEDED&clean=1&format=csv&skipHeaderRow=true"}
2019-04-25T09:31:19.474Z INFO: Configuration completed. Starting the scrape.

I’ve tried toggling the skipHeaderRow setting and get the same result every time. Any idea why a file with over 1,000 URLs reports as empty?


Scratch this: after trying for hours and then posting here, I immediately figured it out. Solution, in case anyone else sees this: the URLs in the start list need to be fully qualified. I was saving relative URLs!
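For anyone hitting the same thing, a minimal sketch of the fix in the first task’s pageFunction: resolve each href against the page URL before pushing it to the dataset, so the exported CSV contains absolute URLs. The names below (toAbsolute, pageUrl, links) are illustrative, not part of the Apify API:

```javascript
// Hypothetical helper: resolve a possibly-relative href against the
// current page's URL so the dataset only ever contains absolute URLs.
const toAbsolute = (href, base) => new URL(href, base).href;

// Example of what a web-scraper pageFunction might collect per category page.
const pageUrl = 'https://example.com/category/shoes';
const links = ['/product/123', 'product/456', 'https://example.com/product/789'];

// One { url } object per link, matching the single-column CSV described above.
const rows = links.map((href) => ({ url: toAbsolute(href, pageUrl) }));
console.log(rows);
```

Inside the actual pageFunction you would build `rows` from the scraped anchors (e.g. via context.jQuery) and return them, but the URL resolution is the part that matters here.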


You are right, the URLs there need to be fully qualified. You can check the regex we use to parse them from the file.
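To illustrate why relative URLs silently produce an empty list: parsers that extract URLs from arbitrary text typically match on the scheme, so anything without http(s):// is skipped. This is an illustrative pattern only, not necessarily the exact regex Apify uses:

```javascript
// Illustrative pattern for extracting absolute URLs from text.
// A relative path has no scheme, so it never matches and is dropped.
const URL_REGEX = /https?:\/\/[^\s"'<>]+/g;

const absolute = 'https://example.com/product/123';
const relative = '/product/123';

console.log(absolute.match(URL_REGEX)); // one match
console.log(relative.match(URL_REGEX)); // null: the line is ignored
```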