Cannot import apify


#1

I have my Actor hosted on GitHub and added the webhook to build the Actor on a push to the repo. The build goes well, but when I try to run the Actor I get an error when trying to import log from utils:

2019-02-09T22:54:22.626Z /usr/src/app/main.js:4
2019-02-09T22:54:22.628Z const { log } = Apify.utils;
2019-02-09T22:54:22.630Z                 ^
2019-02-09T22:54:22.632Z 
2019-02-09T22:54:22.634Z ReferenceError: Apify is not defined
2019-02-09T22:54:22.636Z     at Object.<anonymous> (/usr/src/app/main.js:4:17)
2019-02-09T22:54:22.639Z     at Module._compile (module.js:653:30)
2019-02-09T22:54:22.641Z     at Object.Module._extensions..js (module.js:664:10)
2019-02-09T22:54:22.643Z     at Module.load (module.js:566:32)
2019-02-09T22:54:22.645Z     at tryModuleLoad (module.js:506:12)
2019-02-09T22:54:22.647Z     at Function.Module._load (module.js:498:3)
2019-02-09T22:54:22.649Z     at Function.Module.runMain (module.js:694:10)
2019-02-09T22:54:22.651Z     at startup (bootstrap_node.js:204:16)
2019-02-09T22:54:22.653Z     at bootstrap_node.js:625:3

This is my source code:

const Apify = require('apify');
const { log } = Apify.utils;
const { utils: { enqueueLinks } } = Apify;
log.setLevel(log.LEVELS.DEBUG);

Apify.main(async () => {
....

By the way: is it possible to specify different branches for different build labels? I would like to have a development and a production branch.

Thanks and Regards!


#2

Hi there,

It would be great if you could attach the GitHub repo for your issue. It would be easier to debug.

I think you need to define Apify first:

const Apify = require('apify');
const { log } = Apify.utils;

About splitting builds from GitHub into dev/prod tags:
We have better GitHub integration in our backlog, but it is not possible right now. However, I created a short KB article on how to do it with a CI/CD tool like Bitbucket Pipelines. The setup on Travis CI should be almost the same.


#3

Strangely enough, it now works when I run the crawler locally.

This is the whole code of the crawler (the github repo is private):

// just a hello world demo for trying build by push on github

const Apify = require('apify');
const { log } = Apify.utils;
const { utils: { enqueueLinks } } = Apify;
log.setLevel(log.LEVELS.DEBUG);

Apify.main(async () => {
    console.log("entered Apify.main");
    const keyValueStores = Apify.client.keyValueStores;

    // Get store with name 'ooenachrichten'.
    //const store = await keyValueStores.getOrCreateStore({
    //    storeName: 'ooenachrichten',
    //});

    const requestQueue = await Apify.openRequestQueue('ooenachrichten');
    await requestQueue.addRequest({ url: 'https://www.nachrichten.at/' });

    // Open a named dataset
    const dataset = await Apify.openDataset('ooenachrichten');

    const handlePageFunction = async ({ request, html, $ }) => {
        console.log("entered Apify.CheerioCrawler");
        console.log('crawling url: ' + request.url);
        
        var crawling_url = request.url;
        var headLine = $("h2").filter(".artikeldetailhead_title").text();
        var teaser = $("h3").filter(".leadtext").text();
        var author = $(".sidebar-autor").text();
        var publication_date = $(".sidebar-datum").text();
        var text = "";

        $(".artikelcontent").each( function() {
            console.log("iterate.....");
            text = $(this).find("p").filter(".ArtikelText").text();        
        });

        // TODO: start a parallel crawler that only crawls the URL for the comments.
        var comments = [];
        $(".artikeldiskussion-text_alle").each( function() { 
            comments.push({
                comment: $(this).find("p").text()
            });
        });    

        await dataset.pushData({            
            headline: headLine,        
            teaser: teaser,
            author: author,
            publication_date: publication_date,
            full_text: text,
            comments : comments,
            url : crawling_url
        });
        
        //store.pushData(result);
        console.log('headLine: ' + headLine);

        const options = {
            $,
            requestQueue,
            pseudoUrls: ['http[s?]://[[-\w.]+]nachrichten.at/nachrichten/[.*]'],
            baseUrl: "https://www.nachrichten.at"
        };
        await Apify.utils.enqueueLinks(options);
    }
    
    const crawler = new Apify.CheerioCrawler( {  
        maxRequestsPerCrawl: 500,      
        requestQueue,        
        handlePageFunction,

        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed twice.`);
        }
    })

    await crawler.run();

    console.log('Crawler finished.');
});

But now I get this error message when running the crawler locally, which I cannot really interpret. Does this mean the crawler doesn't find any URLs to follow? This is strange, because the same code works when I run it within a Crawler on the Apify platform.

INFO: System info {"apifyVersion":"0.11.8","apifyClientVersion":"0.5.5","osType":"Darwin","nodeVersion":"v11.5.0"}
DEBUG: Apify.events: Environment variable APIFY_ACTOR_EVENTS_WS_URL is not set, no events from Apify platform will be emitted.
entered Apify.main
INFO: AutoscaledPool: Setting max memory of this run to 4096 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
DEBUG: AutoscaledPool: scaling up {"oldConcurrency":1,"newConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null}}}
INFO: AutoscaledPool state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"maxOverloadedRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"maxOverloadedRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"maxOverloadedRatio":0.2,"actualRatio":null}}}
INFO: BasicCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
Crawler finished.

I have tried deleting the request queue and reinitializing the Apify actor workspace locally, but this doesn’t change anything…


#4

Hi, I’ve tried running your code locally and it seems to be working fine.

Are you sure you deleted the local request queue correctly? It would be in

{PROJECT_FOLDER}/apify_storage/request_queues/ooenachrichten

I also suggest NOT using named request queues unless you specifically need a persistent queue. Using a named queue will cause all runs of the actor to use the same request queue, so once the queue completes, all the runs will finish immediately because there will be no requests to process.
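
A minimal sketch of the difference (reusing the 'ooenachrichten' name from the code above; treat this as an illustration, not as your exact setup):

const Apify = require('apify');

Apify.main(async () => {
    // Default (unnamed) request queue: on the platform every run gets its
    // own queue; locally it lives in apify_storage/request_queues/default
    // and is purged by `apify run -p`.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.nachrichten.at/' });

    // Named request queue: persisted and shared across runs, so once all of
    // its requests have been handled, later runs that open the same queue
    // have nothing left to process and finish almost immediately.
    // const namedQueue = await Apify.openRequestQueue('ooenachrichten');
});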

Also, since you’re already using utils.log, I suggest using log.info, log.debug, etc. instead of console.log for better control over logging.
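
A quick sketch of how that might look (the URL is just an example value):

const Apify = require('apify');
const { log } = Apify.utils;

// Only messages at or above the configured level are printed, so debug
// output can be switched off without touching the code.
log.setLevel(log.LEVELS.DEBUG);

const url = 'https://www.nachrichten.at/';   // example value
log.debug('Entered handlePageFunction', { url });
log.info('Crawling page', { url });
log.error('Request failed twice', { url });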


#5

Thanks! I have reinitialized the Apify project and run the actor locally again, and it works. Maybe there was something cached?


#6

We do not cache anything so it seems unlikely. You can use apify run -p command of the Apify CLI to run the actor with a cleaned apify_storage. It will delete everything except INPUT.json and named storages. If you’re using named request queues, you have to make sure to delete those manually.
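
Locally, a named queue is just a directory of JSON files under apify_storage, so a rough sketch of clearing it (assuming the default storage layout and the 'ooenachrichten' queue name) could look like this:

const fs = require('fs');
const path = require('path');

// Path of the named request queue in the local storage directory.
const queueDir = path.join(__dirname, 'apify_storage', 'request_queues', 'ooenachrichten');

if (fs.existsSync(queueDir)) {
    // Each request is stored as a separate JSON file, so remove the files
    // first and then the (now empty) directory.
    for (const file of fs.readdirSync(queueDir)) {
        fs.unlinkSync(path.join(queueDir, file));
    }
    fs.rmdirSync(queueDir);
}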