Catching the transient cookies: Log in, then crawl
The old way
Sometimes all you need is a quick crawl within a site that requires you to log in first. There are two main techniques I can think of: One is to POST the login form with your script, and collect the cookies that the server sets in response. The second is to log in manually with a browser, and then hand over the browser’s cookies to your script. Let’s start with the first (traditional?) method:
You could use WWW::Mechanize for that (not that I’ve tried), or use the good old LWP. Something like:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);
use HTTP::Cookies;
use LWP::UserAgent;

my $basedir = 'http://www.somesite.com/';

# Create a cookie jar and log into the server
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0"); # pretend we are a very capable browser
my $jar = HTTP::Cookies->new();
$ua->cookie_jar($jar);

my $req = POST $basedir.'login.php',
  [ username => 'dracula', password => 'bloodisgood' ];

print "Now logging in...\n";
my $res = $ua->request($req);

# We're not really interested in the result.
# This was only a cookie thing.
die "Error: " . $res->status_line . "\n" unless ($res->is_success);

# And now we continue to whatever we wanted to do
The problem is that sometimes the login form is complicated. At times it’s obfuscated intentionally, and uses several tricks to make it difficult to automate the login. Sniffing a successful login (your own, I hope) may be helpful, since the correct POST data is there. If the login is through https, the sniffer sees only encrypted data. To get around that, go through the login page, replace all “https” with “http” and make a fake login. It may not log in for real (it usually does), but at least you have the dump info.
But the bottom line is that it may be difficult. In some cases, it’s easier to do the login manually, and continue with your script from there.
Cookie stealing basics
So the plan is to log in manually, and then hand the browser’s cookies over to your script. The target server can’t tell the difference. In extreme cases, you may need to set up the HTTP headers so that your script and the browser send exactly the same ones. At the very least, I suggest making your script identify itself with exactly the same user agent header as your browser. Some sites check that, and reject your login if there’s no match. Believe that.
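As a rough illustration of matching the browser's headers with LWP, the sketch below sets the user agent string and a couple of default headers, then prints the headers a request would carry, without touching the network. The header values and the URL are placeholders (assumptions, not taken from any real session); the real ones should be copied from a request your own browser sends.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;

# Placeholder values: replace them with what your own browser really sends.
$ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7');
$ua->default_header('Accept-Language' => 'en-us,en;q=0.5');
$ua->default_header('Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7');

# prepare_request() fills in the headers without sending anything,
# which makes it easy to compare against a sniffed browser request.
my $req = $ua->prepare_request(
    HTTP::Request->new(GET => 'http://www.somesite.com/')
);
print $req->headers->as_string;
```

Comparing this printout with a dump of the browser's own request shows which headers still differ.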
There are several ways to pull this trick. One is using wget and its --load-cookies flag. It’s quick and dirty, and loads cookies from a cookie file in the good old Netscape format. Some browsers can export their cookies to such a file (Firefox uses another format internally, for example). But there is still one major problem, and that’s the transient cookies.
Who ate my (transient) cookie?
Every cookie that is sent from the server to the browser (or whatever you have there) has an expiration date. Some cookies are marked to be erased when the browser (ha!) quits. These are transient cookies, also known as session cookies.
The thing is that the browser has no reason to write these cookies to the cookie file on the disk. Why write something that will be erased anyhow? So stealing cookies from the cookie file doesn’t help very much if the crucial cookies are transient. If you can’t stay logged in to a site after shutting down your browser and starting it again, that site may be using transient cookies for its session.
The only simple way I know to get hold of those transient cookies is to dump them into a file while the browser is alive and kicking. The Export Cookies add-on for Firefox does exactly that.
I suppose that wget works properly with the add-on’s output. I haven’t tried. I wanted to do this with Perl.
Importing Netscape cookies to LWP
It was supposed to be simple. The HTTP::Cookies::Netscape module should have slurped the cookie file with joy, and taken things from there. But it didn’t. The module, if I may say so, has a problematic programming interface, which is miles away from the common Perl spirit.
The worst thing about it is that if no cookies are imported, because it didn’t like the cookie file or didn’t find it at all, there is no notification. An empty cookie jar is silently created. I think that any Perl programmer would expect a noisy die() on that event. I mean, if the cookie file wasn’t read, there’s no point going on in 99% of the cases.
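One way to get the noisy failure with the stock module is to count the cookies that actually ended up in the jar and die if there are none. The following is a minimal sketch of that idea; the helper name load_cookies_or_die and the file name cookies.txt are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Cookies::Netscape;

# Hypothetical helper: load a Netscape-format cookie file and die
# if nothing was imported from it.
sub load_cookies_or_die {
    my ($file) = @_;
    die "Cookie file '$file' not found or not readable\n" unless -r $file;

    # No 'file' argument to new(), so nothing is written back later
    my $jar = HTTP::Cookies::Netscape->new();
    $jar->load($file);

    # scan() calls the callback once for each cookie in the jar
    my $count = 0;
    $jar->scan(sub { $count++ });
    die "No cookies imported from '$file'\n" unless $count;

    return ($jar, $count);
}

# Usage (cookies.txt is a placeholder name):
# my ($jar, $count) = load_cookies_or_die('cookies.txt');
# print "Imported $count cookies\n";
```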
A second problem is with transient cookies. Their expiration time in the cookie file is set to 0 (surprise, surprise, they’re not supposed to survive at all), and the module simply discards them. I don’t blame the module for that, since transient cookies aren’t supposed to be found in a cookie file.
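To make the handling of the zero expiration time concrete, here is a minimal sketch of one possible workaround: parse the tab-separated cookie file by hand and put each cookie into a plain HTTP::Cookies jar, storing the zero-expiration ones as session cookies instead of dropping them. This is only an illustration and not necessarily how any existing module does it. The helper name load_with_transient is made up, and treating an expiration of 0 as "transient" is an assumption about how the cookie file was exported.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Cookies;

# Hypothetical helper: read a Netscape-format cookie file into a plain
# HTTP::Cookies jar, keeping cookies whose expiration field is 0.
sub load_with_transient {
    my ($file) = @_;
    open(my $fh, '<', $file)
        or die "Can't open cookie file '$file': $!\n";

    my $jar   = HTTP::Cookies->new();   # no file given, so no writeback
    my $now   = time();
    my $count = 0;

    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;
        next if $line =~ /^\s*(?:#|\z)/;   # skip comments and blank lines

        # Fields: domain, subdomain flag, path, secure, expiration, name, value
        my ($domain, undef, $path, $secure, $expires, $name, $value)
            = split /\t/, $line;
        next unless defined $value && $expires =~ /^\d+\z/;

        my $transient = ($expires == 0);
        next if !$transient && $expires <= $now;   # really expired

        $jar->set_cookie(
            0, $name, $value, $path, $domain, undef, 0,
            $secure eq 'TRUE',
            $transient ? undef : $expires - $now,  # no max-age if transient
            $transient ? 1 : 0,                    # discard flag instead
        );
        $count++;
    }
    close($fh);

    die "No cookies found in '$file'\n" unless $count;
    return $jar;
}
```

The returned jar can be handed to LWP with $ua->cookie_jar(load_with_transient('cookies.txt')), where cookies.txt is again a placeholder name.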
I’ve made the changes necessary to make it work with the Export Cookies add-on, and got a new module, Exported.pm (click to download). I suggest copying it next to where you find Netscape.pm in your Perl distribution.
Bottom line, the script looks like this:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Cookies::Exported;

my $baseurl = 'http://www.somesite.com/juicydata.php';

my $ua = LWP::UserAgent->new;

# A user agent string matching your browser is a good idea.
$ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7');

# Not loading file in new(), because we don't want writeback
my $cookie_jar = HTTP::Cookies::Exported->new();
$cookie_jar->load('cookies.txt');
$ua->cookie_jar($cookie_jar);

my $req = HTTP::Request->new(GET => $baseurl);
my $res = $ua->request($req);

if ($res->is_success) {
  my $data = $res->content;
  # Here we do something with the data
} else {
  # A method call isn't interpolated inside a string, so concatenate
  die "Fatal error: " . $res->status_line . "\n";
}