Save image inside pageFunction


#1

If you want to save whole images during crawling (not just their URLs), you can do that directly in pageFunction, without any external library or API, just by using base64 encoding.

First, we will create a generic function called encodeImageFromUrl to encode the image at any URL. Note that the function is asynchronous, so we will need to call context.willFinishLater() and context.finish() and pass a callback to encodeImageFromUrl.

function encodeImageFromUrl(url, callback) {
    var can = document.createElement('canvas');
    var ctx = can.getContext('2d');
    var img = new Image();
    // Set crossOrigin before src so the image is requested with CORS.
    img.crossOrigin = 'anonymous';
    img.onload = function () {
        // Match the canvas to the image, draw it, then read it back.
        can.width = img.width;
        can.height = img.height;
        ctx.drawImage(img, 0, 0, img.width, img.height);
        var data = can.toDataURL();
        callback(data);
    };
    img.src = url;
}

In this code, we create a canvas element and get its 2D context. Then we instantiate a new image object and add its onload callback. We set the image source to the URL and crossOrigin to anonymous so that an image from another domain can be drawn and read back (the other server has to allow this with CORS headers). In the onload function, we set the canvas size to the image size and then draw the image on the canvas. We use the native toDataURL() method to encode the image to base64; it returns a data URL, which by default holds PNG data and can be a huge string. Then we just call the callback with that string as an argument.
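
One detail worth knowing: toDataURL() returns a data URL, meaning the base64 payload is prefixed with a header such as data:image/png;base64, and is not bare base64. If only the raw base64 part is needed, it can be split off. A minimal sketch, assuming a helper named splitDataUrl (a hypothetical name, not part of the code above):

```javascript
// Hypothetical helper (an assumption, not part of the original code):
// splits a data URL such as "data:image/png;base64,iVBORw0..." into
// its MIME type and its raw base64 payload.
function splitDataUrl(dataUrl) {
    var match = /^data:([^;,]+);base64,(.*)$/.exec(dataUrl);
    if (!match) {
        // Not a base64 data URL, e.g. "data:," which a canvas
        // with zero width or height returns.
        return null;
    }
    return { mimeType: match[1], base64: match[2] };
}
```

Inside the onload handler you could then pass splitDataUrl(can.toDataURL()).base64 to the callback instead of the full data URL, after checking that the helper did not return null.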

The whole pageFunction can look like this:

function pageFunction(context) {
    // called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;

    // Tell the crawler that the result will be delivered asynchronously.
    context.willFinishLater();
    var result = {};
    // Replace 'my-image-element' with a selector that matches your image.
    var myUrl = $('my-image-element').attr('src');
    encodeImageFromUrl(myUrl, function (data) {
        result.base64 = data;
        context.finish(result);
    });

    function encodeImageFromUrl(url, callback) {
        var can = document.createElement('canvas');
        var ctx = can.getContext('2d');
        var img = new Image();
        // Set crossOrigin before src so the image is requested with CORS.
        img.crossOrigin = 'anonymous';
        img.onload = function () {
            can.width = img.width;
            can.height = img.height;
            ctx.drawImage(img, 0, 0, img.width, img.height);
            var data = can.toDataURL();
            callback(data);
        };
        img.src = url;
    }
}

In the results, under the key "base64", we will find a string like this, which is the base64 representation of our image (the string is huge; this is just a small snippet):
{"base64":"iVBORw0KGgoAAAANSUhEUgAAAHAAAACnCAYAAADJ29jcAAAACXBIWXMAAAsTAAALEwEAmpwYAAAgAElEQVR4AQAJgPZ/Ad7o5/8AAAAAAgICAAEBAQABAQEAAAAAAAAAAAAAAAAAAgICAAICAgACAgIAAQEBAAAAAAD///8A/v7+AP7+/gAHBwcA////AP7+/gD+/v4A/v7+AAAAAAABAQEAAAAAAAEBAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA+vn5AAYGBg"}
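
One caveat: if the image never loads (for example, the selector matches nothing, or the other server does not allow the cross-origin request), onload is never called, so context.finish() is never reached and the run keeps waiting. Below is a sketch of a more defensive variant. The names encodeImageSafely and timeoutMs are hypothetical, and the callback takes (error, data) instead of just (data); treat it as one possible approach, not a tested part of the original solution.

```javascript
// Sketch only: encodeImageSafely and timeoutMs are hypothetical names.
// Calls callback(error, data) exactly once: on load, on error, or on timeout.
function encodeImageSafely(url, timeoutMs, callback) {
    var done = false;
    function finish(error, data) {
        if (done) return;            // ignore anything after the first outcome
        done = true;
        clearTimeout(timer);
        callback(error, data);
    }
    var timer = setTimeout(function () {
        finish(new Error('Timed out loading ' + url), null);
    }, timeoutMs);

    if (!url) {                      // e.g. the selector matched nothing
        finish(new Error('No image URL given'), null);
        return;
    }
    var img = new Image();
    img.crossOrigin = 'anonymous';   // set before src
    img.onload = function () {
        try {
            var can = document.createElement('canvas');
            can.width = img.width;
            can.height = img.height;
            can.getContext('2d').drawImage(img, 0, 0, img.width, img.height);
            finish(null, can.toDataURL());
        } catch (e) {
            finish(e, null);         // e.g. reading back a tainted canvas throws
        }
    };
    img.onerror = function () {
        finish(new Error('Failed to load ' + url), null);
    };
    img.src = url;
}
```

Inside pageFunction, the callback would then call context.finish(result) in both cases, for example storing error.message under result.error (a hypothetical field) when encoding fails, so the run always ends.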

#2

Can anyone confirm this actually works?
In my tests, the run goes and goes with no results, so I have to stop it.


#3

Hi stoickov,

This definitely worked when I created it, but keep in mind that you have to adapt it to your real code, e.g. replace 'my-image-element' with a selector that matches an actual image on your page.

But this whole thing is more of a legacy solution; for now, definitely use actors for any kind of image handling.