Turns out neither of my entries made it past the first round. Bugs in both. Too bad. Nonetheless, here's my writeup, just because I enjoyed the problem so much. The code I submitted had all of these features but with a whole lotta bugs.
Results of my program (after substantial post-contest improvement); does your program find the optimal results below? These results are from a 450MHz Celeron.
Input file | Best result | Savings | Runtime |
000-example | 13,058 bytes (optimal) | 884 bytes | 5.73s |
003-third-step | 4,068 bytes (optimal) | 1,480 bytes | 24.4s |
004-random | 9,999 bytes (optimal) | 3,021 bytes | 6h 41m 11.21s |
005-icfp2000 | 57,200 bytes (optimal) | 7,736 bytes | 2h 42m 28.4s |
006-exhaustive | 242,363 bytes (best so far) | 1,027,013 bytes | |
100-hand | 3,383 bytes (optimal) | 1,009 bytes | 15.03s |
101-validate-big | 78,819 bytes (best so far) | 56,062 bytes | |
102-validate-small | 78,819 bytes (best so far) | 18,664 bytes | |
103-the-random-returns | 5,138 bytes (optimal) | 6,247 bytes | 16m 19.84s |
200-hevea | 1,460,856 bytes (best so far) | 1,178,486 bytes | |
201-suite | 21,540 bytes (optimal) | 7,274 bytes | 13m 46.15s |
202-hand-again | 3,383 bytes (optimal) | 1,009 bytes | 15.03s |
203-fractal-big | 140,975 bytes (best so far) | 784,894 bytes |
Each token has a state. That state includes the decorations for that token and a bit indicating whether the token has a printable character in it or is composed solely of spaces.
We extended the set of sizes (0..9) with an additional value (F) indicating the root size. We extended the set of colors with two additional values, (F) indicating the root color, and (E) indicating the ability to use any color (for spaces that are not underlined).
Two adjacent characters can be included in the same token if their decorations are compatible. Since color does not matter for spaces that are not underlined, and the B, EM, TT, I, S attributes do not matter for spaces in general, we wrote a function congruent() between two states to indicate whether they are compatible or not.
Building the sequence of tokens then consisted of parsing the input, deciding for each character if it needs to be output (whitespace suppression), and then adding that character to the last token or starting a new token.
We added sentinel tokens at the front and back consisting of the default decorations and marked as having characters to simply a lot of code.
It turned out that writing the code to maintain this nesting structure was one of the most difficult parts of the project, although it shouldn't have been. The problem was that the characters in a token can come from differently nested subdocuments, with different decorations (due to spaces).
Also, I really messed up in building the operations for composing subdocuments into documents. In my submission, I had the notions of left- and right-rotations. Let us define a subsequence to be one with no outer tags; that is, it does not begin with a tag and end with a matching close tag. (This is true of subsequences in my program). A right rotation of subsequences s1 and s2 would generate s1 alpha s2 alpha', where alpha is a sequence of open tags, and alpha' is the sequence of matching close tags. A left rotation would generate beta s1 beta' s2. I thought all documents could be composed of rotations from their subsequences. If you need to do alpha s1 alpha' beta s2 beta', you just rotate s1 under the preceding subsequence s0, and ditto s2.
Turns out this was wrong, and I did not realize that until a few hours before the end of the contest. But by that time, rotations were all over my code and I could not easily fix things.
The current code properly uses just the adjacency operator. That is, given two subsequences s1 and s2, the adjacency operator generates the combined subsequence alpha s1 alpha' beta s2 beta' for various values of alpha and beta. We'll give more details on this in the next section.
If we see a sequence of adjacent (but not nested) tokens a b c d e, for instance, we reduce it to a binary tree of adjacent tokens. We could reduce it to (for instance) ((((a b) c) d) e), but in order to obtain more opportunities for optimization of short subsequences, we instead reduce it into a balanced binary tree (((a b) (c d)) e).
Remember, we defined a subsequence to be one without any outer matched tags. We did this to give us maximum freedom on the state that a subsequence can be in. It turns out that there are typically not too many states that can be generated in this fashion by optimal subsequences, and we only track those states.
So essentially we build a big table of opt[i][len][state], where i is the index of the leftmost token of the subsequence, and len is the count of tokens in the subsequence, and state is the required state that the subsequence must be interpreted in.
Rather than enumerating all states, we only list those states that are relevant for a particular subsequence. A state is relevant for a particular subsequence if there is no optimal encoding for that subsequence with matched tags at the front and back. [We only approximate this relevance metric, but that's an implementation detail.]
For instance, the original input <r><B>x</B></r> is a red bold x in the root size. The subsequence corresponding to this single token is (red, default size, bold, "x"). Note there are no tags in this subsequence, just a state.
Another token, generated perhaps by <1><B><I>y</I></B></1>, generates the subsequence (default color, size 1, bold, italic, "y").
In order to generate a subsequence comprising these two, we figure out the differences in decorations, and enumerate the possible ways that the two subsequences can be combined. There are three orthogonal parts to this: color, size, and other.
For instance, for color, the state of the enclosing subsequence must be the root color. There is no way to go from a state of a defined color (red) to the root color using tags. So we know that, with respect to color, the subsequence must look like "<r>x</r>y", and that the color of the subsequence must be root.
If both tokens had a defined color, we would have two different possible enclosing states with respect to color, with the two different relevant tags.
Size is done the same, and again, the combined subsequence context must be root size, and the tags with respect to size must be "x<1>y</1>".
The other tags are slightly more complicated. We consider all possibilities using PL and not using PL for the other attributes. If we don't use PL, we only consider the differences in the attributes. In this case, the possibilties are:
Spaces complicate things a fair bit here, but not horribly, and we won't bore you with the details.
<2>a<TT>p<5>p<PL><2>9</2><8>p<5>S</5></8></PL>z</5>r</TT></2>
One subtle but very powerful optimization is the immediate collapse of subexpressions that are mincost-optimal. We define the mincost as the cost of the shortest tags necessary to transition from the state of one token to the next. Thus, between the token <B>x</B> and y, we need at least a <B> and </B> in any subsequence where they are adjacent. A <PL> could be used too (if the outer state were bold) but that would not be mincost. For <B> and </B>, we use a mincost of 7; we charge the length of both the beginning and end tags. We compute the mincost between each pair of tokens.
If we find a subsequence a alpha b, where the states of a and b are the same, and they contain a character (and thus don't have any don't cares in any portion of the state), and the cost of this subsequence is half the sum of the mincost values for all adjacent tokens in the subsequence, then we know that we can collapse that subsequence into just a immediately; we do not need to consider any subsequences that break up this subsequence. For instance, in x<B>y</B>z, the mincost between x and y is 7, and between y and z is seven. Also, x and z are in the same state. We know we can consider this to be just the single token x; there are no solutions better than those solutions containing x<B>y</B>z. But this optimization allows us to collapse common things very effectively.
I have downcoded most of it into C. This version optimizes example.txt in 6 seconds on my slow 450MHz machine.
Interesting things you will find in the C version include fractal iteration (to minimize swapping when the heap exceeds physical memory), a clever binary search using bitwise logic (simpler and faster than the conventional one), the use of Shell sort with good increments (taken almost literally from Harbison and Steele), as well as the more usual dynamic arrays, queues, hash tables, and the like. All of this is not yet documented, but it will be soon.
I also plan to write an OCaml version just so I can get in touch with my inner functional programmer. (I'm an old Scheme hacker but I've not been gloriously delighted with the state of Scheme these days.)
My buggy lightning submission and my buggy regular submission are given here for completeness. You'll notice quite a difference between these submissions and the code given above; I've spent a lot of time past the contest fixing up these programs. One reason is that my submissions were so lame, and another is just because it's such an interesting problem.
In all the code, I use a 16-bit state variable. The high nybble are the color, with F meaning root color and E meaning any color. The next nybble is the size, with F meaning root size. Bit 0x80 indicates this state includes a printable character that is not shadowed by a PL, so the low bits are important. Bits 0x40 and 0x20 are U, bit 0x10 is TT, bit 0x8 is S, bit 0x4 is I, bit 0x2 is EM, and bit 0x1 is B. Literal constants in hex litter the program; if I were writing this for real, I'd use S|TT|B etc. all over instead of all the literal constants.
My email address is rokicki@cs.stanford.edu.