Description
A currently running ArchiveBot (ij8nff4nxdw6m6fvocf2d23l, wpull 2.0.3 on Debian Jessie x86_64) revealed an interesting behaviour. The server in question (doperoms.com) returns status code 302 without a Location header for certain URLs, causing "Invalid redirect location" errors in wpull. Among other things, this means that the response isn't written to WARC.
Example request and response through curl:
$ curl -v -H 'Referer: https://doperoms.com/files/roms/msx_1/Thunder+Ball+%281985%29+%28Ascii%29+%28J%29.zip/618301/Thunder+Ball+.zip' 'https://doperoms.com/files/roms/msx_1/Thunder+Ball+%281985%29+%28Ascii%29+%28J%29.zip/618301/Thunder+Ball+.zip'
* About to connect() to doperoms.com port 443 (#0)
* Trying 198.255.114.90...
* connected
* Connected to doperoms.com (198.255.114.90) port 443 (#0)
[... SSL stuff ...]
> GET /files/roms/msx_1/Thunder+Ball+%281985%29+%28Ascii%29+%28J%29.zip/618301/Thunder+Ball+.zip HTTP/1.1
> User-Agent: curl/7.26.0
> Host: doperoms.com
> Accept: */*
> Referer: https://doperoms.com/files/roms/msx_1/Thunder+Ball+%281985%29+%28Ascii%29+%28J%29.zip/618301/Thunder+Ball+.zip
>
* additional stuff not fine transfer.c:1042: 0 0
* HTTP 1.1 or later with persistent connection, pipelining supported
< HTTP/1.1 302 Found
< Date: Mon, 20 Aug 2018 18:07:49 GMT
< Server: Apache/2
< X-Powered-By: PHP/5.4.16
< Set-Cookie: PHPSESSID=05n6lblpbq8031cvjeph7h9351; expires=Mon, 20-Aug-2018 21:07:49 GMT; path=/
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
[... content ...]
According to section 6.4.3 of RFC 7231:
"The server SHOULD generate a Location header field in the response containing a URI reference for the different URI."
In other words, it is not mandatory to include a Location header in a 302 response. wpull should therefore accept such responses and consider the item completed successfully (i.e. invoke document scraping, write to WARC, mark as "done" in the URL table).
Although I only mentioned 302 above, I assume that the same holds for the other 3xx codes as well. I have not tested this, though.
Moreover, I think that complete responses should always be written to WARC, regardless of whether wpull considers them valid, for reasons similar to this particular case.
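To make the proposed behaviour concrete, here is a minimal sketch of the decision logic I have in mind. This is not wpull's actual code or API; `Decision`, `decide` and `REDIRECT_STATUS_CODES` are hypothetical names made up for illustration, and the set of status codes treated as redirects is my own assumption.

```python
from typing import Mapping, NamedTuple, Optional

# Assumption: the status codes a crawler would normally treat as redirects.
REDIRECT_STATUS_CODES = frozenset({301, 302, 303, 307, 308})


class Decision(NamedTuple):
    action: str               # "follow_redirect" or "done"
    location: Optional[str]   # redirect target, if any
    write_to_warc: bool       # whether the response should be recorded


def decide(status_code: int, headers: Mapping[str, str]) -> Decision:
    """Hypothetical helper: decide how to handle a fetched response."""
    # Header field names are case-insensitive, so compare in lower case.
    location = next(
        (value.strip() for name, value in headers.items()
         if name.lower() == "location"),
        None,
    )
    if status_code in REDIRECT_STATUS_CODES and location:
        # Ordinary redirect: follow it, and record the response as well.
        return Decision("follow_redirect", location, True)
    # A 3xx without a usable Location header is treated like any other
    # completed response: scrape it, record it, mark the URL as done.
    return Decision("done", None, True)
```

With this sketch, the doperoms.com response above (302, no Location header) yields `action == "done"` instead of an error, and `write_to_warc` is true on every path.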