Copyright © 2004 Anders Bruun Olsen
Revision History

| Revision | Date | Author | Changes |
|---|---|---|---|
| 1 | 26-07-2004 | ABO | Added part about predefined substitution keywords. |
| 2 | 28-07-2004 | ABO | Regexps are now case-insensitive, added info about this. |
| 3 | 07-08-2004 | ABO | Added section about configuration options, added info about logging and updated section about verbose-level. |
| 4 | 29-08-2004 | ABO | Added configuration option non_tty_error_reporting. |
| 5 | 05-09-2004 | ABO | Removed non_tty_error_reporting in favor of just using verbose setting. |
| 6 | 06-11-2004 | ABO | Updated documentation to cover version 0.7.2 |
This is the documentation for Webcomics Collector (Collector for short).
To use Webcomics Collector you just need Python 2.3+. Please note that versions before 0.7.2 have slightly different requirements.
Please note that Webcomics Collector was developed on a Linux platform. While work has been done to try to ensure that it works on both POSIX platforms and Windows, the developers have neither the time nor the stomach to test it on Windows. You are encouraged to report your experiences with running Collector on Windows, both successes and problems, to the mailing list, so that problems can be addressed and successes enjoyed.
Gentoo users can just "emerge webcomics-collector" to install collector. There are no packages available for any other Linux distribution at the moment, but packagers are very welcome to get in touch.
Non-Gentoo users will have to install collector manually, which luckily is very easy, thanks to distutils. First unpack it and then run setup.py with the "install" argument. This will install the files needed.
The first time Collector is run, it will create its config directory ~/.collector.
Now that you have gotten collector installed, it's time to learn how to use it.
When using Webcomics Collector there are certain phrases and words that have very defined meanings when used in context with Webcomics Collector.
The words "webcomic" and "comic" mean an online comic as a whole, not a specific strip.
The word strip means the individual (often daily) editions of a comic. Sometimes a strip consists of more than one image file, but it is still the same strip.
A definition (or webcomic definition) is a collection of information about a specific webcomic telling collector how and where to fetch strips. This is often shortened to "def".
The "comicslist" is the list of webcomics that you, the user, want to download.
Collector has several modes of operation: it can do full archive downloads of most webcomics, download the backlog if it has been a while since its last run, or simply download the newest strips.
In ~/.collector several files might exist. These are the possible files and their purpose:
collector.cfg: Your configuration file. See the Configuration section for more information.
comics.classes: A local version of comics.classes. It contains comics classes that defs can inherit. For more information see the "Making Webcomic Definitions" chapter.
comics.classes.cache: This is a cached version of comics.classes (both the systemwide version and a possible local one). It makes Collector load faster because it only has to reparse comics.classes when its mtime has changed. It is safe to delete this file, since it will just be recreated if it does not exist.
comics.def: A local version of comics.def. For more information see the "Making Webcomic Definitions" chapter.
comics.def.cache: Like comics.classes.cache, only this is for comics.def instead.
log: This is the default logfile that Collector will write to. It is safe to delete if you do not care about the content.
Collector is configured through the config file ~/.collector/collector.cfg. In this file the following options can be set:
comics
This is a comma-separated list of the comics that Collector is set up to use.
comicsdir
This option sets the directory that will hold the downloaded strips. Remember to set this to a dir that is accessible to your frontend (for Collectorweb, use a dir that is accessible through http!).
useragent
This optional setting allows you to set a custom User-Agent header to be used whenever Collector makes an HTTP connection.
logfile
This option sets the path to a logfile to use. Use the word SYSLOG to log to syslog instead of a file. Please note that your syslogger must be listening on localhost port 514 for this to work. If logfile is not set then ~/.collector/log will be used.
loglevel
This option sets the level to log at. The valid values are DEBUG, INFO, WARNING, ERROR and CRITICAL. DEBUG will log a lot of information, while CRITICAL will only log the very worst problems, nothing short of catastrophic. The default value is WARNING.
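As a rough illustration of how the logfile and loglevel options could map onto Python's standard logging module, here is a minimal sketch. The function names and the plain dict of options are assumptions made for this example and are not taken from Collector's source; accepting lowercase level names is likewise an assumption.

```python
import logging
import logging.handlers
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def resolve_loglevel(options):
    """Return the numeric logging level, defaulting to WARNING."""
    name = options.get("loglevel", "WARNING").strip().upper()
    if name not in VALID_LEVELS:
        raise ValueError("invalid loglevel: %r" % name)
    return getattr(logging, name)


def make_log_handler(options):
    """Pick a handler based on the logfile option (hypothetical helper)."""
    target = options.get("logfile", "").strip()
    if target == "SYSLOG":
        # The documentation states that syslog must listen on localhost:514.
        return logging.handlers.SysLogHandler(address=("localhost", 514))
    if not target:
        # Documented default when logfile is not set.
        target = os.path.join(os.path.expanduser("~"), ".collector", "log")
    # delay=True postpones opening the file until the first record is written.
    return logging.FileHandler(target, delay=True)
```

With these helpers, an empty options dict yields the WARNING level and a file handler pointing at ~/.collector/log.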
Collector contains defs for more webcomics than you are probably interested in following, so you need to tell Collector which webcomics you want to download by adding them to your comicslist.
Running Collector with the -l parameter will give you a list of all the webcomics that Collector knows. This list is formatted in a way that is excellent for grep'ing, so it should be easy to find out if the webcomic you want is there.
The -L parameter lists the webcomics that you have on your comicslist.
You can add webcomics to and remove them from your comicslist with the -A and -R parameters. Any webcomics specified after these (as many as you want) will be added to or removed from your comicslist. Remember to put quotes around any name that contains spaces.
Any additional comics specified should be separated by spaces. If no comics are specified, Collector takes this to mean all comics. That means that -R without any arguments will clear your comicslist! Not to worry, however: the stripfiles you have downloaded won't be deleted, and adding your chosen webcomics again is quite painless. Likewise, using -A with no arguments will add all the known webcomics to your comicslist, which is probably not what you want.
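The "no arguments means all comics" rule can be summarized in a few lines of Python. This is only a sketch of the behaviour described above: the function names are hypothetical, and silently ignoring unknown names is an assumption, since the text does not say how Collector handles them.

```python
def add_comics(comicslist, known, names):
    """Sketch of -A: add the given names, or every known comic if none given."""
    to_add = names if names else known
    result = list(comicslist)
    for name in to_add:
        # Assumption: names without a def are ignored rather than rejected.
        if name in known and name not in result:
            result.append(name)
    return result


def remove_comics(comicslist, names):
    """Sketch of -R: remove the given names, or clear the list if none given."""
    if not names:
        return []
    return [name for name in comicslist if name not in names]


known = ["Angst Technology", "User Friendly", "Sherman's Lagoon"]
mine = add_comics([], known, ["Angst Technology"])
everything = add_comics(mine, known, [])   # -A with no arguments adds all
nothing = remove_comics(everything, [])    # -R with no arguments clears
```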
Collector has three different download modes: latest, archive and backlog.
The "latest" mode is invoked by default and will fetch the newest strips for the webcomics in your comicslist.
By running Collector with the -a parameter you invoke the "archive" mode, which will download the archived strips for all webcomics in your comicslist of the type "search". The type "lateststatic" only allows downloading the latest strip, as the name suggests.
"Archive" mode will start with the newest strip and work its way backwards using the "previous" links of the webcomic, in the manner of a webcrawler or bot. When no more "previous" links are found, the download ends.
If for one reason or another you haven't run Collector in a little while, the "backlog" mode makes Collector download the newest strip and then move backwards in the archive until it reaches the first strip that it has already downloaded. Once that is reached, you are up to date again and the download ends. The "backlog" mode is invoked with -b.
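The difference between the three modes is mostly a matter of when the backwards crawl stops. The following sketch models that with a stand-in for the website; `crawl`, `fetch_page` and the `site` dict are hypothetical names invented for the example, not Collector internals.

```python
def crawl(start_url, fetch_page, already_have, mode="latest"):
    """Return the strips that would be downloaded in the given mode.

    fetch_page(url) must return (strip_name, previous_url_or_None).
    """
    downloaded = []
    url = start_url
    while url is not None:
        strip, previous = fetch_page(url)
        if strip in already_have:
            if mode == "backlog":
                break  # caught up with what we already have
        else:
            downloaded.append(strip)
        if mode == "latest":
            break  # only the newest strip is considered
        url = previous  # follow the "previous" link
    return downloaded


# A four-page fake archive, newest page first.
site = {
    "p4": ("s4", "p3"),
    "p3": ("s3", "p2"),
    "p2": ("s2", "p1"),
    "p1": ("s1", None),
}
fetch = site.__getitem__
latest = crawl("p4", fetch, {"s2"}, "latest")
backlog = crawl("p4", fetch, {"s2"}, "backlog")
archive = crawl("p4", fetch, {"s2"}, "archive")
```

Here `latest` contains only s4, `backlog` stops at the already downloaded s2, and `archive` walks all the way back to s1 while skipping s2.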
If any webcomics are specified after either mode (including "latest" mode, where no parameters are required), only those will be processed. If none are specified, all webcomics in your comicslist will be used. Remember that the specified webcomics must be in your comicslist, that they must be separated by spaces, and that any webcomics with spaces in their names must be quoted.
Any stripfiles already downloaded won't be downloaded again. If you want to redownload strips you can add the -r parameter.
It is possible to ask Collector to be verbose and give more information about what it is doing. This is done by adding the -v parameter.
The -q parameter tells Collector to keep its output to a minimum and -D tells it to give as much output as possible. This only affects terminal output. Logging to a logfile or syslog depends upon the loglevel configuration option.
When running Collector from a script or in a background job like cron, it does not output anything unless -v or -D is used. Everything is logged to a logfile or syslog independently of the verbose-level.
To have Collector automatically check the webcomics in your comicslist at fixed intervals, you can use a standard Unix service like cron. Just set up a cron job running every two hours or so (don't set it to run too often; that wouldn't be fair to the webcomic authors). Since cron implementations differ between platforms, it isn't possible to explain this in detail here; consult the documentation for your OS. Here is, however, an example that works with dcron:
That will run Collector on all webcomics in your comicslist every two hours, fetching the newest strips unless these have already been downloaded.
Definitions are the heart of Webcomics Collector. If the webcomics you want are not in Collector, you can make defs for them yourself. It's not hard, although it does require some knowledge about regular expressions. If you are not good at regular expressions you should go to the Python Regular Expression Syntax and read up on them. If you are completely unfamiliar with regular expressions you can go to this tutorial to learn about them. The best way to learn how to make defs is by looking at existing ones. Start by looking at comics.def (in /etc/collector or somewhere like that) and get used to the syntax. Then read on here to learn how to read defs and make new ones.
Definitions are written in a format often known as Windows INI format. This format was chosen because it is very simple and because Python had a readily available interface for it. Some enhancements had to be made to it, such as substitution keywords, because the builtin way of doing it didn't do all that was needed.
Example 3.1. Example of webcomic definition
    [Angst Technology]
    type = search
    authors = Barry T. Smith
    homepage = http://www.inktank.com/AT
    imgexp = <div align=\"center\"><IMG SRC=\"(/images/AT/cartoons/.+?)\"></div>
    previousexp = <A HREF=\"(/AT/index\.cfm\?nav=\d+)\"><IMG SRC=\"/images/nav_last\.gif\".+?></A>
The name of the comic is written between [ and ] marking everything from that point until the next [ ] as being options for that comic def.
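To get a feel for the format, the def from Example 3.1 can be read with the INI parser in Python's standard library. This is only a sketch using the modern `configparser` module; Collector has its own enhanced parser, and `load_defs` is a hypothetical helper name. The raw parser is used so that percent signs in regular expressions are not treated as interpolation.

```python
import configparser

DEF_TEXT = r'''
[Angst Technology]
type = search
authors = Barry T. Smith
homepage = http://www.inktank.com/AT
imgexp = <div align=\"center\"><IMG SRC=\"(/images/AT/cartoons/.+?)\"></div>
previousexp = <A HREF=\"(/AT/index\.cfm\?nav=\d+)\"><IMG SRC=\"/images/nav_last\.gif\".+?></A>
'''


def load_defs(text):
    """Parse INI-style defs into {comic name: {option: value}}."""
    # Only "=" separates option and value, so colons in URLs are left alone.
    parser = configparser.RawConfigParser(delimiters=("=",))
    parser.read_string(text)
    return {name: dict(parser.items(name)) for name in parser.sections()}


defs = load_defs(DEF_TEXT)
angst = defs["Angst Technology"]
```

After parsing, `angst["type"]` holds "search" and `angst["homepage"]` holds the homepage URL, with the regular expressions kept exactly as written.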
So far there are two types of definitions in collector: "lateststatic" and "search".
The "lateststatic" type is for webcomics with no archive accessible through next/previous links, but with a URL for the newest strip that never changes. Hagar the Horrible is an example of this.
"lateststatic" type can use the following options:
The "search" type is for webcomics with an archive accessible through next/previous links. The webpages will be searched using regular expressions to find the path to the stripimages and the link to the previous page in the archive.
"search" type can use the following options:
Type is specified using the "type" option in the def.
Here follows an explanation of all available options. Remember to escape quotes and question marks in regular expressions, and remember that parentheses mark the data to be extracted. Also note that all regular expressions used in Collector are case-insensitive. It is recommended to write expressions in either all lowercase or all uppercase, which makes them easier to read.
The altimgexp option is similar to imgexp. If it exists, it will be tried if imgexp yields no results. This is for those webcomics that, for one reason or another, change the format of their tags halfway through their archive.
The altpreviousexp option is similar to previousexp. If it exists, it will be tried if previousexp yields no results. This is for those webcomics that, for one reason or another, change the format of their tags halfway through their archive.
Some webcomics do not put their newest strip on a page available through a static URL. One example is "User Friendly", where you need to follow a link if you want the newest strip in large size.
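The fallback behaviour shared by altimgexp and altpreviousexp can be sketched as "try the primary expression, then the alternate one". The helper name and the two sample expressions below are made up for illustration; only the case-insensitive matching and the fallback order come from the text.

```python
import re


def search_with_fallback(html, primary, alternate=None):
    """Return matches for primary, or for alternate if primary finds nothing."""
    for expression in (primary, alternate):
        if not expression:
            continue
        # All expressions are matched case-insensitively.
        matches = re.findall(expression, html, re.IGNORECASE)
        if matches:
            return matches
    return []


# Hypothetical expressions for a comic that changed its markup over time.
NEW_STYLE = r'<img src=\"(/new/.+?)\">'
OLD_STYLE = r'<img src=\"(/old/.+?)\">'

old_page = '<IMG SRC="/old/001.gif">'
found = search_with_fallback(old_page, NEW_STYLE, OLD_STYLE)
```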
The archiveexp option is a regular expression which is applied to the homepage URL. Whatever is extracted is put together with archiveurl to form a new archiveurl which Collector then uses when fetching the newest strip or uses as a startbase when doing archive/backlog fetching. If archiveurl does not exist, homepage will be used instead when adding the result.
This option used to be called frontpageexp.
The archiveurl option sets the URL of the page containing the newest strip. If archiveurl is not set, the value of the homepage option will be used instead. This option can usually be left out, since most webcomics have their newest strip on the front of their homepage. If the URL to the page containing the newest strip is not static, archiveexp can be used to extract the URL from the homepage.
This option was formerly called frontpageurl and used to be required.
The authors option sets the names of the author (or authors) of the given webcomic. All characters can be used except for less-than and greater-than.
The class option specifies a class that this comic inherits options from. See the Classes section for more information.
The imgexp option is a regular expression for extracting the URL to the stripfiles. The extracted result will be combined with the URL of the page imgexp was used on by means of the urljoin function, so make sure you extract the entire content of the src="" attribute of the <img> tag! The URL of the page imgexp was used on will be sent as the referer, eliminating the need for the old referer option, which used to set a static referer.
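The interplay of imgexp, urljoin and the referer can be illustrated with a short sketch. It uses `urllib.parse.urljoin` and `urllib.request.Request` from the modern standard library; the helper names, the sample page URL and the sample image filename are assumptions made for the example, and no network access takes place.

```python
import re
import urllib.request
from urllib.parse import urljoin

# imgexp taken from the Angst Technology example def.
IMGEXP = r'<div align=\"center\"><IMG SRC=\"(/images/AT/cartoons/.+?)\"></div>'


def extract_strip_urls(page_url, html, imgexp):
    """Apply imgexp to the page and resolve each result against page_url."""
    pattern = re.compile(imgexp, re.IGNORECASE)
    return [urljoin(page_url, match) for match in pattern.findall(html)]


def build_request(image_url, page_url):
    """Prepare (but do not send) a request that uses the page as referer."""
    return urllib.request.Request(image_url, headers={"Referer": page_url})


page = "http://www.inktank.com/AT/index.cfm?nav=123"  # hypothetical page URL
html = '<div align="center"><img src="/images/AT/cartoons/example.gif"></div>'
urls = extract_strip_urls(page, html, IMGEXP)
request = build_request(urls[0], page)
```

Because the extracted path starts with a slash, urljoin resolves it against the host of the page rather than against the page's own path.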
The imgname option is a regular expression that extracts the name to be used for the current strip(s). This is for those comics that have the stripfiles handled by a php-script or something like that, which makes it impossible to just use the extracted data from imgexp to figure out a name.
The imgurl option is only used very rarely in comics of the type "search". Normally, if a webpage references images outside of its own domain, it has to specify the full URL, not just a relative one. But because of the unholy <base href=""> tag, this behaviour can be changed. Instead of doing a lot of irritating parsing of the HTML of every page to find out whether this tag appears, the imgurl option allows setting a different base that the imgexp results will be applied to, instead of the URL of the page being processed. One of the only strips using this is "Sherman's Lagoon".
With comics of the type "lateststatic" the imgurl option contains the static URL to the newest strip.
The pathlevels option is for comics where just downloading the files under their own names would result in name clashes. A comic might choose a naming scheme like YYYY/MM/DD.png, where YYYY is the year, MM is the month and DD is the day. That would result in a lot of files named 01.png, 02.png and so forth. Pathlevels tells Collector how many levels of the path to include in the name. For our example, a pathlevels of 3 would result in the name YYYY_MM_DD.png, which would then be unique. If pathlevels is not set, the default value 1 will be used.
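The effect of pathlevels can be shown in a few lines. This is a sketch of the naming rule as described above, not Collector's own code; `strip_filename` and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def strip_filename(url, pathlevels=1):
    """Join the last `pathlevels` path components of url with underscores."""
    if pathlevels < 1:
        raise ValueError("pathlevels must be at least 1")
    parts = [part for part in urlparse(url).path.split("/") if part]
    return "_".join(parts[-pathlevels:])


url = "http://example.invalid/comics/2004/11/06.png"
default_name = strip_filename(url)       # only the last component
unique_name = strip_filename(url, 3)     # year, month and day
```

With the default of 1 the name is just 06.png, while a pathlevels of 3 gives 2004_11_06.png.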
The previousexp option is a regular expression being used to extract the link to the previous page in the archive. The extracted data will be urljoin'ed with the URL for the page it was extracted from.
The type option specifies what type of comic this is. The valid values are "lateststatic" and "search". See the Types section for more information.
Some webcomics (usually those hosted by the same provider, such as SF Gate and Keenspot) have HTML pages whose markup looks remarkably alike. All comics hosted by Keenspot, for instance, have their archive as date-numbered HTML pages in /d/ and their stripfiles in /comics/. Even the HTML tags look exactly the same on most of them. This makes writing defs for them quite easy. By making a "keenspot" class, most of the defs for webcomics hosted by Keenspot can be shortened to just the class, authors and homepage options.
When using the class option, the type option should go in the class, not in the comic def.
Classes are defined in comics.classes and follow the exact same rules as comic defs in comics.def, except that classes do not accept the class option.
All options set in a class can be locally overridden in defs that use that class just by setting the same option with a new value in the def. So if the previousexp in the "keenspot" class doesn't work for the comic you are trying to write a def for, you can still use the "keenspot" class and just override the previousexp option.
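Class inheritance with local overrides boils down to merging two sets of options, with the def winning over the class. The sketch below assumes defs and classes are available as plain dicts; `resolve_def` and all option values shown are placeholders, not the contents of the real "keenspot" class.

```python
def resolve_def(comic_def, classes):
    """Merge class options into a def; options set in the def take priority."""
    merged = {}
    class_name = comic_def.get("class")
    if class_name is not None:
        merged.update(classes[class_name])
    # The class option itself is consumed here and not passed on.
    merged.update({k: v for k, v in comic_def.items() if k != "class"})
    return merged


# Placeholder values, for illustration only.
classes = {
    "keenspot": {
        "type": "search",
        "imgexp": "CLASS-IMGEXP",
        "previousexp": "CLASS-PREVIOUSEXP",
    }
}
comic = {
    "class": "keenspot",
    "authors": "Jane Doe",
    "homepage": "http://example.invalid/",
    "previousexp": "LOCAL-PREVIOUSEXP",  # local override
}
resolved = resolve_def(comic, classes)
```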
Any words with {{ and }} around them are substitution keywords. The word within the {{ }} should be the name of an option within the same def or a predefined keyword. If you set archiveurl = {{homepage}}/archive/latest.html then {{homepage}} will be replaced by the value of the homepage option.
This also works across class and def, so that any option set in the def can be used by the class it inherits from. This is used in, among others, the sfgate class, where the homepage is rather predictable. The homepage option in the class uses {{urlname}}, and the defs set the urlname option, so the correct homepage value is created. The name urlname was chosen simply because it sounded good. Any option names may be used, as long as Collector doesn't already use them for something. They must also be single words containing only alphanumeric characters. Other than that, just go ahead and make whatever options are needed.
There are currently 3 predefined keywords, which translate to these commonly used regular expressions:
ext: (?:jpg|JPG|jpeg|JPEG|jpe|JPE|gif|GIF|png|PNG)
filename: [a-zA-Z0-9_\-%]+?\.(?:jpg|JPG|jpeg|JPEG|jpe|JPE|gif|GIF|png|PNG)
signs: [a-zA-Z0-9_\-%]+?
As some may notice, signs is almost equivalent to \w. However, \w does not match the dash (-), which is why signs exists, since a lot of webcomics use dashes in their filenames.
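Keyword substitution can be sketched as a regular expression replace over {{name}} markers. The three predefined patterns are copied from the list above; the function name, the recursion limit, and the choice to let options in the def take priority over predefined keywords are assumptions made for this example.

```python
import re

PREDEFINED = {
    "ext": r"(?:jpg|JPG|jpeg|JPEG|jpe|JPE|gif|GIF|png|PNG)",
    "filename": r"[a-zA-Z0-9_\-%]+?\.(?:jpg|JPG|jpeg|JPEG|jpe|JPE|gif|GIF|png|PNG)",
    "signs": r"[a-zA-Z0-9_\-%]+?",
}

# Keyword names are single alphanumeric words.
KEYWORD = re.compile(r"\{\{([A-Za-z0-9]+)\}\}")


def substitute(value, options, depth=0):
    """Replace {{name}} with an option value or a predefined pattern."""
    if depth > 10:
        raise ValueError("substitution nested too deeply")

    def replace(match):
        name = match.group(1)
        if name in options:
            # Option values may themselves contain keywords.
            return substitute(options[name], options, depth + 1)
        if name in PREDEFINED:
            return PREDEFINED[name]
        raise KeyError(name)

    return KEYWORD.sub(replace, value)


options = {
    "urlname": "example",                              # hypothetical option
    "homepage": "http://host.invalid/{{urlname}}/",
    "archiveurl": "{{homepage}}archive.html",
}
archiveurl = substitute(options["archiveurl"], options)
imgexp = substitute(r"/comics/({{filename}})", options)
```

Using a replacement function rather than a replacement string keeps the backslashes in the predefined patterns from being interpreted by `re.sub`.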
When Collector downloads strip-images they are saved in a dir with the same name as the webcomic, located in the "comicsdir", which is defined in collector.cfg (see the "Configuration" section).
Some webcomics archives will use the original filenames, others will use names based on extractions from the URLs and still others will just use a running numbered scheme controlled by Collector. This all depends upon the def.
Because of these different naming schemes (which make it easy to check if files have already been downloaded) there is a need for a way to record the order in which these strips are supposed to go, and also there are some webcomics that have multiple images in each strip, so sometimes images need to be grouped together. This is handled in a file called strips.dat, which is located in each archive directory.
The format of strips.dat is really straightforward:
Each line represents one strip with the oldest at the top and newest at the bottom. The order of the strips is thus dictated by what line they are listed on.
Each line is divided into fields by the pipe-operator (|).
The first field in each line is a direct URL to this strip in the archive on the webcomic's own website, if this is available.
Each field after the first represents one imagefile that is a part of this strip.
Each field containing an imagefile is divided into two subfields by a slash (/), with the filename on the lefthand side of the slash and the file's MD5 checksum on the other side.
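A frontend could read a line of strips.dat along these lines. This is a sketch based only on the format rules listed above; the function name and the sample line are invented, and treating an empty first field as "no URL available" is an assumption.

```python
def parse_strips_line(line):
    """Split one strips.dat line into (url_or_None, [(filename, md5), ...])."""
    fields = line.rstrip("\n").split("|")
    url = fields[0] or None  # assumption: empty field means no URL
    images = []
    for field in fields[1:]:
        if not field:
            continue
        # Filename is left of the slash, MD5 checksum is right of it.
        filename, _, md5 = field.partition("/")
        images.append((filename, md5))
    return url, images


# Invented sample: one strip made up of two image files.
sample = (
    "http://example.invalid/archive/1.html"
    "|001a.gif/d41d8cd98f00b204e9800998ecf8427e"
    "|001b.gif/d41d8cd98f00b204e9800998ecf8427e\n"
)
url, images = parse_strips_line(sample)
```

Reading the whole file line by line then yields the strips in order, oldest first.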
And that is it. Frontends read the def-files to obtain information about a webcomic, such as the authors and the URL of the website, and parse the strips.dat file to get information about the strips.



