Implementation notes Language with only one symbol. ---------------------------------------------------------------------- 0. Concept I wanted to make a language with just one command. Actually I wanted to make a thing that contained lots of identical lines, where you would delete a certain subset of lines to select different functions. But I realized that programming with long lists of to-be-deleted line numbers is equivalent to programming with mixed blank/command lines, just with extra steps. So I settled on an encoder that translates input to single-command programs. ---------------------------------------------------------------------- 1. Language design It's possible to design a single command language around unary encoding, but that would be horribly inefficient. But if we make use of the whitespaces around that one command, we can get at least binary encoding. C is mostly not a whitespace sensitive language, but it does provide __LINE__, so I have designed the whole language around using __LINE__ to encode blank lines. The "design notes" section of the packaged remarks.md described most of the tradeoffs in detail, here I will just comment on the things that I did not do: - Huffman encoding: instead of the deflate-based compression, initially I was thinking more of a per-byte compression scheme, and Huffman encoding was an obvious choice. This was quickly abandoned because it's not possible to get better than an 8x compression ratio (need at least 1 bit per byte), and considering all the overhead needed to implement the single-command language decoder, output size will likely be very large, so this encoding scheme was not attractive. It's also unattractive for the fact that Huffman encoding requires preprocessing the data to generate an optimal dictionary, or come with a fixed dictionary baked in. So the result is either not optimal or not streamable. - Variable length encoding: the reason why Huffman encoding seemed attractive was that it's possible to spend fewer bits for certain bytes. I thought maybe I could still shave some bits off of each byte if I embraced variable length encoding, considering that most ASCII text can fit in 6 bits. I ended up not using variable length instructions because I thought having branches in the language was more interesting, and it's really troublesome to figure out the branch target if the instructions are of variable length. By sticking to fixed-length instructions that always end with a 1 bit, I can guarantee that there would be a non-empty line just before the start of the instruction I wanted to branch to, and tweak some state machines to make sure the instructions flow correctly after the branch. ---------------------------------------------------------------------- 2. Language implementation The decoder comes in two parts: the trivial part is outputting a single byte, which basically adds bits by inferring current bit position from __LINE__ and flush all the bits on the 9th line. There is a minor optimization in XORing all bytes with 0x20 first, otherwise outputting bytes is completely straightforward. The more convoluted part involves implementing loops. True loops like for/while/do are not possible because all these constructs require a pair of matching things, and a single-command language can only guarantee one of the pair. Instead, I used setjmp wrapped inside a switch statement, and generated unique case labels using __LINE__. This allows the code to jump to a wide range of targets via longjmp. Both the trivial and convoluted parts are implemented in a single long statement, with a generous use of comma and ternary operators. The one command is defined to call a function with that one statement. Initially I didn't have that separate function at all, and the one command expands to the long expression directly. Compile time was terrible but that was kind of OK, what was not OK was that the output executable size grew significantly. So now the one command expands to a case label and function call. ---------------------------------------------------------------------- 3. Language encoder Converting a byte stream into this single-command language involves maintaining a ring buffer of output bytes, and generate output-literal or repeat-previous-substring instructions as appropriate. A few conditions must be met for outputting repeat instructions: - The repeated substring is of at least length 3. - The repeated substring is within addressable distance. - Repeat instruction must not cause a stack overflow at run time. A production LZW implementation will likely maintain some dictionary to speed up matching, but since sliding window is relatively small, I just implemented it with a couple of nested loops. The implementation can process input at a rate of about 640K/s, which is not very fast at all, but I don't anticipate people using this tool to encode more than a few megabytes of data. ---------------------------------------------------------------------- 4. Polish and detail Encoder has been subject to a comprehensive set of tests inputs large and small. Also, both the encoder and generated output has been verified to be acceptable to GCC and Clang as both C and C++. Once I felt that I have sufficient test coverage, I did the usual polishing work of converting the code to ASCII art. This time in two flavors: - Fern, as originally planned. When I think about a single-command scenario, the first thing that came to mind was Fern casting zoltraak repeatedly, so I decided on "zoltraak" as the output token and Fern as the layout before I wrote any code. - Can of spam. Because repeating the same string over and over was essentially spam for anyone who has seen that Monty Python sketch. So I added an alternative version that outputs "spam" instead of "zoltraak". This time around, I also improved on the tool to insert arbitrary CRC32 into the code. It used to be something that takes a few minutes of brute force, but now I can do it in just a few seconds. ---------------------------------------------------------------------- 5. Finally... I was disappointed that Marcille didn't make it to the list of winners. I don't know if the judges didn't have Sixel-enabled terminals or if they find organically shaped dungeons boring. Anyways, that disappointment drove me to complete this entry, which I submitted on the first day when the contest was open. It was about a month and a half worth of work. Fern ends up winning "Most magical word" award :)