How to Write Your Own Comic Strip Resource Locator Module

The following applies to netcomics version 0.10 & higher. The 'func' field is new for version 0.6. The 'name' field has been done away with, but netcomics will still use it and makes it possible for you to not have to supply a 'title' and 'type', but that's not recommended for maintaince purposes. Although not recommended, you may still use the old way (versions 0.1 to 0.4) because I've kept netcomics as backwards compatible as possible.

RLI: Resource Locator Information

The following table describes the RLI (Resource Locator Information) hash structure.

Required? Field Format Example Description

yes title Full Name "My Comic" The name of the comic strip. You may have spaces, but don't include apostrophes or other symbols that don't work well for file names.

no author Full Name(s) "Jane Doe & John Doe" The name(s) of the author(s) of the comic strip. You may include practically any characters. You can embed HTML in it if you want it formatted special or include a special link.

yes type file type 'gif' The file type of the comic strip to be downloaded. This field is used to create the name of the file.

no main URL "http://www.comp.com/" This URL is used only in creating the webpage as the hyperlink of the comic's name.

no archives URL "http://www.comp.com/archive/" This URL is used only in creating the webpage as the hyperlink for the comic's archives.

yes base URL "http://www.comp.com/" This should be the part of the URL that is common between the page's URL and the comic's URL. Typically, this is just the main URL for the website.

no pagebase URL "http://pages.comp.com/" Used if the base URL for the HTML page to be parsed needs to be different than the bases for the expr or func attributes.

no exprbase URL "http://images.comp.com/" Used if the base URL for the actual image needs to be different than the bases for the page or func attributes.

no funcbase URL "http://security.comp.com/" Used if the base URL to be used in a user-defined function needs to be different than the bases for the page or expr attributes.

yes page latter half of
a URL strftime("archives/mycomic-
%y%m%d.html",@ltime) This field is completes the URL when appended to base for the comic, or an HTML page that contains a link to another URL used to get the comic. When it points to the comic itself, the field, exprs, is set to undef, or is not specified at all. When it points to a document containing some other reference that can be used to locate the comic, the field, exprs is used.

no exprs regular
expressions "(comics/mycomic-\\d+.gif)"
This field is an array of regular expressions used to find the last part of the URL for the comic. Once a starting web page has been downloaded using the concatenation of base and page, the elements of the array are used, one by one, to match on some text in the page downloaded. The last regular expression in the array is used as the one to create a URL to download the comic itself.

Each element of the array is a regular expression. It is required that a pair of parenthesis is left around the part of the text being matched on to be used as the completing part of the next URL to obtain the comic strip or another web page containing another reference to be matched on. There only needs to be one set of parenthesis in each regular expression. If more than one pair exists, only the one that causes $1 to be defined in the perl code will be used (IOW, nothing fancy here. netcomics doesn't have any special logic, and is coded to only deal with $1).

no func subroutine
reference sub {
return ("${date}a.gif",
"${date}b.gif");
}
This field provides the ability for multiple files to be downloaded from a website for a single comic, and for the RLI to be dynamically regenerated. This function is run by netcomics' engine after the initial page & expressions are run on it if either of those fields are provided. Thet text from the last page downloaded is passed into the function.

The function can either return a list of strings where each string is a relative URL to the base, or it can return a hash containing new and updated fields for the RLI. If an array is returned, each item in the array is used as a relative URL used to download the final images.

If the function returns a hash instead, the RLI is fully reprocessed. This gives you the power to have a site initially looked at to find out even if a comic for the provided date exists, and then restructure the RLI accordingly. You'll need to use this technique if a comic has multiple images and a caption you want to grab. To do this, the function would scan the text for the caption and then return a hash where one field is the caption and the other field is another function which will grab the relative urls.

Most often, the function will be created by a module as a closure. A closure is an anonymous subroutine that is created with some specific data. For example, if you're writing a closure that is to return two relative URLs, each with some text in it that is created from the date, the data you need to embed in the subroutine when you create it needs to be the date given to the RLI function in some form. Note that Perl doesn't seem to maintain references to variables in a closure created from within a closure, so in order to have a function return an RLI with a new function, you'll have to write the second function such that it doesn't depend on any variables or so that it uses global variables.

See the module files madam-n-eve, sluggy_freelance, goats and bad_boys for some examples.

no size array of
2 ints 600
This is the default width and height of the downloaded image that is used when the webpage is created. If the size of the file cannot be determined (Image::Size not installed, the system's file command doesn't report the image file's dimensions, or it was specified to not download the comics), then this array (width,height), is used. If not provided, and the comics's image size cannot be determined, then the created webpage will not include the dimensions of the file, and therefore the webpage will not load as quickly depending on the web browser.

no back subroutine
reference sub {}
This is a backup RLI function that will be executed if the comic is failed to be downloaded. The reference to an RLI hash returned by this function will then be used to try and download the comic again. Be careful to not create the possibility of an infinite loop! See the peanuts and dibert functions for examples.

no referer string "http://site.com/"
If the site requires a referer that's different than the base part of the final url, provide this field. Note that it is used for every HTTP request.

no volatile array [qw/title author/]
If there are any fields which are normally static that are changed by a func, list them. For example, if the comic occasionally has a guest author, you can have a func return a hash with the new author field & value. If a comic has both a subtitle & a caption, and you want to include the subtitle in the title field, you need to specify that the title field is volatile so that the value in the perisistent rli file is ensure to be used.

Required?	Field	Format	Example	Description
yes	title	Full Name	"My Comic"	The name of the comic strip. You may have spaces, but don't include apostrophes or other symbols that don't work well for file names.
no	author	Full Name(s)	"Jane Doe & John Doe"	The name(s) of the author(s) of the comic strip. You may include practically any characters. You can embed HTML in it if you want it formatted special or include a special link.
yes	type	file type	'gif'	The file type of the comic strip to be downloaded. This field is used to create the name of the file.
no	main	URL	"http://www.comp.com/"	This URL is used only in creating the webpage as the hyperlink of the comic's name.
no	archives	URL	"http://www.comp.com/archive/"	This URL is used only in creating the webpage as the hyperlink for the comic's archives.
yes	base	URL	"http://www.comp.com/"	This should be the part of the URL that is common between the page's URL and the comic's URL. Typically, this is just the main URL for the website.
no	pagebase	URL	"http://pages.comp.com/"	Used if the base URL for the HTML page to be parsed needs to be different than the bases for the expr or func attributes.
no	exprbase	URL	"http://images.comp.com/"	Used if the base URL for the actual image needs to be different than the bases for the page or func attributes.
no	funcbase	URL	"http://security.comp.com/"	Used if the base URL to be used in a user-defined function needs to be different than the bases for the page or expr attributes.
yes	page	latter half of a URL	strftime("archives/mycomic- %y%m%d.html",@ltime)	This field is completes the URL when appended to base for the comic, or an HTML page that contains a link to another URL used to get the comic. When it points to the comic itself, the field, exprs, is set to undef, or is not specified at all. When it points to a document containing some other reference that can be used to locate the comic, the field, exprs is used.
no	exprs	regular expressions	"(comics/mycomic-\\d+.gif)"	This field is an array of regular expressions used to find the last part of the URL for the comic. Once a starting web page has been downloaded using the concatenation of base and page, the elements of the array are used, one by one, to match on some text in the page downloaded. The last regular expression in the array is used as the one to create a URL to download the comic itself. Each element of the array is a regular expression. It is required that a pair of parenthesis is left around the part of the text being matched on to be used as the completing part of the next URL to obtain the comic strip or another web page containing another reference to be matched on. There only needs to be one set of parenthesis in each regular expression. If more than one pair exists, only the one that causes $1 to be defined in the perl code will be used (IOW, nothing fancy here. netcomics doesn't have any special logic, and is coded to only deal with $1).
no	func	subroutine reference	sub { return ("${date}a.gif", "${date}b.gif"); }	This field provides the ability for multiple files to be downloaded from a website for a single comic, and for the RLI to be dynamically regenerated. This function is run by netcomics' engine after the initial page & expressions are run on it if either of those fields are provided. Thet text from the last page downloaded is passed into the function. The function can either return a list of strings where each string is a relative URL to the base, or it can return a hash containing new and updated fields for the RLI. If an array is returned, each item in the array is used as a relative URL used to download the final images. If the function returns a hash instead, the RLI is fully reprocessed. This gives you the power to have a site initially looked at to find out even if a comic for the provided date exists, and then restructure the RLI accordingly. You'll need to use this technique if a comic has multiple images and a caption you want to grab. To do this, the function would scan the text for the caption and then return a hash where one field is the caption and the other field is another function which will grab the relative urls. Most often, the function will be created by a module as a closure. A closure is an anonymous subroutine that is created with some specific data. For example, if you're writing a closure that is to return two relative URLs, each with some text in it that is created from the date, the data you need to embed in the subroutine when you create it needs to be the date given to the RLI function in some form. Note that Perl doesn't seem to maintain references to variables in a closure created from within a closure, so in order to have a function return an RLI with a new function, you'll have to write the second function such that it doesn't depend on any variables or so that it uses global variables. See the module files madam-n-eve, sluggy_freelance, goats and bad_boys for some examples.
no	size	array of 2 ints	600	This is the default width and height of the downloaded image that is used when the webpage is created. If the size of the file cannot be determined (Image::Size not installed, the system's file command doesn't report the image file's dimensions, or it was specified to not download the comics), then this array (width,height), is used. If not provided, and the comics's image size cannot be determined, then the created webpage will not include the dimensions of the file, and therefore the webpage will not load as quickly depending on the web browser.
no	back	subroutine reference	sub {}	This is a backup RLI function that will be executed if the comic is failed to be downloaded. The reference to an RLI hash returned by this function will then be used to try and download the comic again. Be careful to not create the possibility of an infinite loop! See the peanuts and dibert functions for examples.
no	referer	string	"http://site.com/"	If the site requires a referer that's different than the base part of the final url, provide this field. Note that it is used for every HTTP request.
no	volatile	array	[qw/title author/]	If there are any fields which are normally static that are changed by a func, list them. For example, if the comic occasionally has a guest author, you can have a func return a hash with the new author field & value. If a comic has both a subtitle & a caption, and you want to include the subtitle in the title field, you need to specify that the title field is volatile so that the value in the perisistent rli file is ensure to be used.

%hof: Hash of Functions

The global %hof hash (associative array) contains the names of all of the functions that return RLI hashes, and the number of days in the past the comic is available using the RLI returned. The keys are the names of the functions, and the values are the numbers of days.

Add your function's name to this hash with code like this:

$hof{"myfunction"} = 0;

That example is for a comic whose latest strip is available on the same day. If you have many functions in one module file to add to %hof, use code like this:

foreach (qw(f1 f2 f3 f4)) {
    $hof{$_} = 7;
}

In that example, every function returns RLI's whose comics they retrieve are available a week after they're published.

Creating a Module File

An RLI function takes one argument: the time as returned by time(). Your function should expect this to be the exact time that it is supposed to use to determine the date of the comic to retrieve. It is easiest to use the POSIX time functions to create strings using the time given. Here's an example:

#-*-Perl-*-

#Add the name of the subroutines to the hash of functions
#with the number of days from today the comic is available
$hof{"uf"} = 0;

#UserFriendly http://www.userfriendly.org/cartoons/
sub uf {
    my ($time,$color) = @_;
    return undef if $time < 879724800; #1st comic: Nov 17, 1997
    my @ltime = gmtime($time);
    my @monthconv = ("jan", "feb", "mar", "apr", "may", "jun",
                     "jul", "aug", "sep", "oct", "nov", "dec");
    my $b = $monthconv[$ltime[4]];

    my $rec = {
	'title' => "User Friendly",
	'author' => "Illiad",
	'type' => 'gif',
	'main' => "http://www.userfriendly.org/static/",
	'archives' => "http://www.userfriendly.org/cartoons/archives/",
	'base' => "http://www.userfriendly.org",
	'page' => strftime("/cartoons/archives/%y$b/%Y%m%d.html",
			   @ltime),
	'exprs' => [strftime("(\\/cartoons\\/archives\\/%y$b\\/(uf)?\\d+\\.gif)",
			     @ltime)],
	'size' => ($ltime[6] == 0 ? [720,529] : [720,274])
	};
    return $rec;
}

1;

The values that are passed to the function are the UTC time value (number of seconds since 00:00:00, January 1, 1970) as created by mkgmtime() in netcomics, and whether or not color is preferred. If your comic strip is available in both b/w and color, then use the second argument to the function to determine what type to get (1 for color, 0 for b/w) (see dilbert for an example).

Note that in the above example, the time is checked against a large integer that is described at being Nov 17, 1997. Use the mktime script included in the contrib directory to create this number so you can easily & quickly test to make sure the requested date is not past the start of the archives.

Do not use localtime()! Use only gmtime() in a module because a future enhancement will be to had logic to handle knowing what timezone the comic is in and what time during the day it is typically updated. (As far as translating the UTC time to a date is concerned, both gmtime() and localtime() produce the same results as long as everything that deals with dates in netcomics works with the same function.) Do not put any special handling of checking the time in modules for when the comic is updated during the day. That special handling will be added to netcomics later. If you want to save that information with as best chance as possible to not have to go back and edit the module in the future, you can go ahead and add two fields to your module's RLI structure: 'tz' and 'hour' (see the doodie module). Those fields won't be utilized right now, but those are what I think the fields will be when timezone support is added.

Your function should return the RLI (as a reference, not as the hash itself), and it will be used by netcomics. Put the function in a file that is readable to all users in /usr/lib/netcomics, and netcomics will find it and use it without you needing to make any modification to its code. Or you can put it in some other directory like ~/lib/netcomics and use the -m option on the commandline with that directory name.

And finally, you must have as the last line in the file:

1;

If you don't, the module will not load properly.

Netcomics Maintainers <netcomics-maintainers@lists.sf.net>

Last modified: Thu May 30 08:33:14 CDT 2002