Making Webcomic Definitions (Valid for version 0.5.1+)
Definitions are the heart of Webcomics Collector. They tell the script how to download from a particular webcomic. Writing definitions isn't that difficult but does require a little knowledge about regular expressions.
Definitions reside in comics.def and comics.classes, which live in /etc/collector. Users can add to them through ~/.collector/comics.[def|classes], which are loaded together with the systemwide definitions. If you wish to make your own definitions, go and take a look at these files right now to get comfortable with the syntax.
Here is a definition which we shall go through:
[Calvin and Hobbes]
type = search
authors = Bill Watterson
homepage = http://www.ucomics.com/calvinandhobbes
frontpageurl = {homepage}
imgurl = http://images.ucomics.com/comics/ch/
imgexp = <img src=\"http://images.ucomics.com/comics/ch/(\d+/ch\d+\.gif)\" width=\"\d+\" height=\"\d+\" border=\"0\">
previousurl = {homepage}
previousexp = <a href = \"http://www\.ucomics\.com/calvinandhobbes(/\d+/\d+/\d+/)\" onClick=\"this\.href=FCx\(this\.href\)\;\">previous date</a>
The name is enclosed in [] and is used for identifying the webcomic.
type can be either "lateststatic" or "search". The "lateststatic" type is for comics that only provide the latest strip, with no easily accessible archive, and that use a static filename (see comics like Hagar the Horrible). The "search" type is for most webcomics and makes Collector search the pages for the information needed to download the newest strip and the archive.
authors and homepage should speak for themselves. Make sure that homepage does not end with a slash!
frontpageurl is the direct URL for the newest strip. If the URL isn't static (see webcomics like User Friendly) a regexp can be used to find it. This is specified with frontpageexp. The text collected with frontpageexp is appended to frontpageurl.
imgurl is the partial URL to the image files. The regexp in imgexp is used to grab the filename, which is then appended to imgurl to get the full URL to an image file. If imgexp isn't found, altimgexp will be tried. This is for those comics that have a different syntax for some strips.
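The imgurl/imgexp lookup described above can be sketched in Python as follows. This is only an illustration, not Collector's actual code: the function name find_image_url and the page fragment are made up for the example, while imgurl and imgexp are copied from the Calvin and Hobbes definition.

```python
import re

# Values copied from the Calvin and Hobbes definition above.
imgurl = "http://images.ucomics.com/comics/ch/"
imgexp = r'<img src=\"http://images.ucomics.com/comics/ch/(\d+/ch\d+\.gif)\" width=\"\d+\" height=\"\d+\" border=\"0\">'

# Hypothetical page fragment, invented for this example.
page = ('<td><img src="http://images.ucomics.com/comics/ch/2005/ch050101.gif" '
        'width="600" height="400" border="0"></td>')


def find_image_url(page, imgurl, imgexp, altimgexp=None):
    """Return imgurl plus the captured filename, trying altimgexp as a fallback."""
    for exp in (imgexp, altimgexp):
        if not exp:
            continue
        match = re.search(exp, page)
        if match:
            # The parenthesised part of the regexp is the data to extract.
            return imgurl + match.group(1)
    return None  # neither expression matched


image_url = find_image_url(page, imgurl, imgexp)
print(image_url)
```

Only the part inside the parentheses is kept, which is why imgurl has to contain everything that comes before it in the full address.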
For the "lateststatic" type comics, the filetype may not be present in the filename. imgtype sets this information.
previousurl is the partial URL to the archive of the webcomic. The data grabbed with previousexp is appended to previousurl to get the URL to the strip right before the current one. If previousexp isn't found and altpreviousexp exists, then altpreviousexp will be tried and if a result is found, this will be used. If neither is found the archive download will be ended.
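The archive walk described above might look roughly like the sketch below. It assumes a fetch function that returns the HTML of a URL; here that is replaced by a small in-memory dictionary of invented pages so the example runs without network access. The names walk_archive, fetch and limit are hypothetical and not taken from Collector.

```python
import re


def walk_archive(fetch, start_url, previousurl, previousexp,
                 altpreviousexp=None, limit=100):
    """Follow 'previous' links from start_url; return the visited URLs in order."""
    visited = []
    url = start_url
    while url is not None and len(visited) < limit:
        visited.append(url)
        page = fetch(url)
        url = None  # stays None if no expression matches, which ends the walk
        for exp in (previousexp, altpreviousexp):
            match = re.search(exp, page) if exp else None
            if match:
                url = previousurl + match.group(1)
                break
    return visited


# Hypothetical three-strip site; the second page uses a different link syntax.
base = "http://example.com/comic"
pages = {
    base: '<a href="http://example.com/comic/2005/01/02/">previous date</a>',
    base + "/2005/01/02/": '<a class="old" href="/2005/01/01/">back</a>',
    base + "/2005/01/01/": "first strip, no previous link",
}

urls = walk_archive(
    pages.__getitem__,
    base,
    previousurl=base,
    previousexp=r'<a href=\"http://example\.com/comic(/\d+/\d+/\d+/)\">previous date</a>',
    altpreviousexp=r'<a class=\"old\" href=\"(/\d+/\d+/\d+/)\">back</a>',
)
print(urls)
```

In this sketch the captured text begins with a slash, so a base URL ending in a slash would produce a double slash in the result. That is one plausible reason for the advice that homepage should not end with one.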
For those comics that require an HTTP Referer header, referer sets a static URL to be used. Later on, support for a dynamic Referer header may be added if any comics needing it are ever found.
Remember to escape quotes and question marks in regexps, and remember that parentheses mark the data to be extracted. \d represents a digit and . represents any character. It is usually quite important to put an unescaped question mark after a dot expression to make it "non-greedy", otherwise it might match far more than expected. An example of a dot expression is .+? which matches one or more of any character and is non-greedy (.*? is the variant that also matches zero characters).
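The difference between greedy and non-greedy matching is easiest to see on a line with two links. The sketch below uses Python's re module and an invented HTML fragment; it is assumed here that Collector's regexps behave the same way.

```python
import re

# Invented fragment with two links on the same line.
html = '<a href="one.html">first</a> <a href="two.html">second</a>'

# Greedy: .+ grabs as much as it can, so the match runs to the LAST quote.
greedy = re.search(r'<a href=\"(.+)\">', html).group(1)

# Non-greedy: .+? stops at the FIRST place where the rest of the regexp fits.
lazy = re.search(r'<a href=\"(.+?)\">', html).group(1)

# An escaped question mark matches a literal "?" in a URL.
query = re.search(r'article\.cgi\?file=(\w+)', 'article.cgi?file=comics').group(1)

print(greedy)  # one.html">first</a> <a href="two.html
print(lazy)    # one.html
print(query)   # comics
```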
The {homepage} construction makes it possible to specify the value of another option like in our example where frontpageurl is the same as homepage.
It is also possible to use classes if some definitions look a lot like each other, like those hosted by SF Gate. Let's look at how SF Gate comics are handled:
[Beetle Bailey]
class = sfgate
authors = Mort Walker
urlname = Beetle_Bailey
And the sfgate class:
[sfgate]
type = lateststatic
homepage = http://www.sfgate.com/cgi-bin/article.cgi?file=/comics/{urlname}.dtl
imgurl = http://pst.rbma.com/content/{urlname}
imgtype = gif
referer = {homepage}
As you can see, the homepage URL is nearly identical for all SF Gate hosted comics, so a common homepage value is made, with {urlname} being replaced by the urlname value from the definition of the comic using the class. The values from the class are simply added to the definition, and the rest should be quite obvious.
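To make the class mechanism concrete, here is a minimal Python sketch of how the Beetle Bailey definition and the sfgate class could be combined and their {name} placeholders filled in. The functions apply_class and expand are hypothetical, and the sketch assumes that a value set in the definition wins over the same value in the class; the text above does not state which one takes precedence.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")

# The sfgate class and the Beetle Bailey definition from above, as dicts.
classes = {
    "sfgate": {
        "type": "lateststatic",
        "homepage": "http://www.sfgate.com/cgi-bin/article.cgi?file=/comics/{urlname}.dtl",
        "imgurl": "http://pst.rbma.com/content/{urlname}",
        "imgtype": "gif",
        "referer": "{homepage}",
    }
}
definition = {"class": "sfgate", "authors": "Mort Walker", "urlname": "Beetle_Bailey"}


def apply_class(definition, classes):
    """Add the class values to the definition (assumption: the definition wins)."""
    merged = dict(classes.get(definition.get("class"), {}))
    merged.update(definition)
    return merged


def expand(options):
    """Replace {name} with the value of the option called name."""
    expanded = dict(options)
    for _ in range(10):  # bounded, so a self-reference cannot loop forever
        changed = False
        for key, value in list(expanded.items()):
            # Unknown names are left untouched instead of raising an error.
            new = PLACEHOLDER.sub(lambda m: expanded.get(m.group(1), m.group(0)), value)
            if new != value:
                expanded[key] = new
                changed = True
        if not changed:
            break
    return expanded


comic = expand(apply_class(definition, classes))
print(comic["homepage"])
print(comic["referer"])
```

Here referer ends up with the fully expanded homepage, because {homepage} is resolved after {urlname} has been filled in.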
You can add your own definitions to ~/.collector/comics.[def|classes]. If you make any definitions/classes with the same names as definitions/classes in the systemwide files, they will be merged with the systemwide definitions/classes. This means that you can override single values in a definition/class if you want to. You can't, however, remove a value, other than by setting it to a blank value.
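The merge behaviour can be illustrated with Python's standard configparser module, which reads the same kind of [section] / key = value layout. Whether Collector itself uses configparser is an assumption made only for this sketch, and the user file contents below are invented.

```python
import configparser

# Excerpt of the systemwide definition shown earlier.
SYSTEM_DEFS = """
[Calvin and Hobbes]
type = search
authors = Bill Watterson
homepage = http://www.ucomics.com/calvinandhobbes
"""

# Hypothetical user file: overrides one value and blanks another.
USER_DEFS = """
[Calvin and Hobbes]
homepage = http://example.com/mirror
authors =
"""

# interpolation=None keeps characters such as % in regexps from being interpreted.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SYSTEM_DEFS)  # systemwide file first
parser.read_string(USER_DEFS)    # user file second, so its values win

merged = dict(parser["Calvin and Hobbes"])
print(merged)
```

The untouched type value survives, homepage is overridden, and authors is still present but blank, which matches the rule that a value can be emptied but not removed.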
If you make any new definitions/classes, please email them to me or create a bugtracker entry, so that I can add them to the definitions in the official comics.[def|classes].



