Thursday, November 7, 2013

Saving downloaded files in SlimerJS (and Casper and Phantom)

A common request is to see not just the HTML of the main page that PhantomJS/SlimerJS downloads, but also all the other files (images, CSS, JavaScript, fonts, etc.) that are fetched along with it. You can use onResourceReceived to see them being fetched, but not their bodies.

The situation with PhantomJS is a bit confusing: I believe there is a patch to allow this, but it hasn't been applied yet. A download API has also been proposed (or possibly already implemented), but as far as I can tell it only covers the special case of files served with a Content-Disposition: attachment header.

In SlimerJS you can use response.body inside the onResourceReceived handler. However, to avoid using too much memory, it is empty by default. You first have to assign an array of regexes to page.captureContent to say which files you want; each regex is applied to the MIME type. In the example code below I use /.*/ to mean "get everything". Using [/^image\/.+$/] should get just images, etc.
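To check a pattern list before handing it to page.captureContent, you can try it against a few MIME types in plain JavaScript. This is only a sketch: wouldCapture is a hypothetical helper, not part of SlimerJS, and it assumes a response is captured when any regex in the array matches its MIME type.

```javascript
// Hypothetical helper (not a SlimerJS API): returns true when the MIME type
// matches at least one regex in the list, which is the assumed behaviour
// of page.captureContent.
function wouldCapture(patterns, mimeType) {
    return patterns.some(function (re) {
        return re.test(mimeType);
    });
}

var everything = [ /.*/ ];
var imagesOnly = [ /^image\/.+$/ ];          // note the escaped slash
var stylesAndScripts = [ /^text\/css$/, /javascript/ ];

console.log(wouldCapture(everything, 'font/woff'));                    // true
console.log(wouldCapture(imagesOnly, 'image/png'));                    // true
console.log(wouldCapture(imagesOnly, 'text/html'));                    // false
console.log(wouldCapture(stylesAndScripts, 'application/javascript')); // true
```

The slash inside a regex literal has to be escaped, otherwise the literal ends at "image" and the script fails to parse.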

The code sample below will download and save all files. It is complete; you just have to edit the url at the top.

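var url = "http://...";

var fs = require('fs');
var page = require('webpage').create();

// Directory that receives the saved files.
fs.makeTree('contents');

// Capture the body of every response, whatever its MIME type.
page.captureContent = [ /.*/ ];

page.onResourceReceived = function(response) {
    //console.log('Response (#' + response.id + ', stage "' + response.stage + '"): ' + JSON.stringify(response));
    // The body is only complete at the "end" stage; skip empty responses.
    if (response.stage != "end" || !response.bodySize) return;

    // Use the last path segment as the file name; skip URLs that end in "/".
    var matches = response.url.match(/[/]([^/]+)$/);
    if (!matches) return;
    var fname = "contents/" + matches[1];

    console.log("Saving " + response.bodySize + " bytes to " + fname);
    // Write in binary mode; without the 'b' flag images are saved corrupted.
    fs.write(fname, response.body, 'wb');
};

page.onResourceRequested = function(requestData, networkRequest) {
    //console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
};

page.open(url, function() {
    phantom.exit();
});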


It is verbose in that it says what it is saving. If you want it much more verbose, to see what other information is passing back and forth, there are two logging lines commented out.
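One rough edge in the script is the file name: it is simply whatever follows the last slash, so query strings end up in the name and two files with the same name overwrite each other. A possible way to tidy that up is sketched below. fileNameFromUrl is a hypothetical helper of my own, not a SlimerJS function, and the "index" fallback and the id prefix are arbitrary choices.

```javascript
// Hypothetical helper: derive a safe, reasonably unique file name from a
// resource URL and its numeric id (e.g. response.id).
function fileNameFromUrl(url, id) {
    // Drop the fragment and the query string.
    var path = url.split('#')[0].split('?')[0];
    // Drop "scheme://host" so a bare host is not mistaken for a file name.
    path = path.replace(/^[a-z][a-z0-9+.-]*:\/\/[^\/]*/i, '');
    // Take the last path segment; fall back to "index" for "/" or no path.
    var matches = path.match(/\/([^\/]+)$/);
    var name = matches ? matches[1] : 'index';
    // Replace anything that might be awkward in a file name.
    name = name.replace(/[^A-Za-z0-9._-]/g, '_');
    // Prefix the id so that same-named files do not overwrite each other.
    return id + '-' + name;
}

console.log(fileNameFromUrl('http://example.com/img/logo.png?v=2', 7)); // 7-logo.png
console.log(fileNameFromUrl('http://example.com/', 1));                 // 1-index
```

Inside the handler you would then build the path as "contents/" + fileNameFromUrl(response.url, response.id). Note that data: URLs are not handled specially here.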

WARNING: this works in SlimerJS 0.9 (and should work in 0.8.x), but the API may change in the future (to keep in sync with PhantomJS).


5 comments:

Dmitry said...

Apparently, saved images are corrupted =(

Anonymous said...

The images are corrupted due to the encoding conversion. Something along the lines of

iconv -f utf-8 -t latin-1 < corrupted.jpg > fixed.jpg

takes care of it.

Anonymous said...

No need for iconv, just pass 'b' to the fs.write command:

fs.write(fname,response.body,'b');

The images will be saved correctly.

Unknown said...

Hey,
have you tried with CSV files?
I receive the header with the file but I can't get the body
Best,
Daniel

Anonymous said...

I am **absolutely floored** that, despite the huge number of requests for capturing the response body spanning several years, it still hasn't been done correctly in phantomjs! This is like 90% of the point of using such a tool in the first place.

So thanks for posting some workarounds to this issue.