Implementation notes There is no design with this type of programs, you just write it, and tweak the code along the way. That said, there are some nontrivial bits in these programs, some are slightly different from things I wrote before, most are different from code that people usually write, hence this file here. Think of this as a tutorial, perhaps. It's more like a diary. ---------------------------------------------------------------------- 0. Concept A program that: - Outputs Youmu when compiled as C. - Outputs Yuyuko when compiled as C++. Furthermore, the output of those shall be valid C/C++ programs that performs the same task. In other words, a pair of polyglot quines. This concept was decided very early on. It was not random... the main reason why I haven't updated my homepage for more than half a year is because of all the weekends spent playing Touhou Eiyashou, and having finally completed it on Lunatic last week, it was enough motivation to write such programs. Personal bias here, I thought Youmu and Yuyuko form the best couple from Eiyashou, and they were also the first team to make it past Lunatic for me. Also related is my personal bias for C++, which I totally believe to be the language of the youkai, and C is only half as dangerous. It had to be Youmu and Yuyuko, and it had to be C and C++. ---------------------------------------------------------------------- 1. Polyglot This program makes use of a certain 3 character comment sequence to differentiate C and C++ modes. It's no magic if you already know what it is, so this part is left as an exercise for the reader. Unfortunately, a lot of C compilers implement certain C++ features as extensions, and don't tell the users about it. In this case, you will always get yuyuko.c on stdout. "gcc -ansi" is found to do The Right Thing. ---------------------------------------------------------------------- 2. Quine This part explains this string: "ebfcfdfefffgb" A typical quine needs to do something like this: - output some header - output encoded source - output decoded source For Youmu and Yuyuko, the sequence is: - header - encoded source - encoded layout data - decoded source Since strings are quoted in C/C++, the sequence becomes: - header - encoded source - quote/separator - encoded layout data - quote/separator - decoded source Then we made the observation that we can simply create a table of strings, and include the quotes themselves as part of the string table, if we encode them the same way as the other source text. Thus we can create a string table that contains the following: a. encoded source b. encoded layout data c. encoded header d. encoded quote/separator And the output sequence would be (lowercase for encoded data as is, uppercase for decoded data): C a D b D c D d D B Of course, this sequence can be stored in the string table also. And if we output it right before the decoded source, the encoded/decoded outputs interleave perfectly, and we can control the output printer to alternate between encoded and decoded output with simple XOR logic. Using the knowledge above, and that the string table in the final program start with index 'b', you should be able to decode this sequence: E b F c F d F e F f F g B ---------------------------------------------------------------------- 3. Source encoding This part explains the string: "'x)/d ... Up to: ... v-q,>x" The source needs to be encoded so that backslashes and quotes don't show up in strings. This is trivially solved by XORing everything with 5. 5 is the only one that works for the characters used in these programs, see test_encode_text.pl. ---------------------------------------------------------------------- 4. Layout encoding This part explains the two strings: "B)G!B ... Up to: ... 9C!.K" Encoding the layout is more difficult than encoding the source, we need some compression, otherwise there would be no room for code. Then we need a compression scheme that takes small amount of space in terms of data and decoder size. This is actually where the majority of the code is. Four different encodings were tried, see encode_template.pl and decode_template.pl in the subdirectories. Some gory details of the encodings are described below. v1: simple run length encoding, encode long runs of whitespaces and characters as a single character, with the ASCII value indicating the length. This works generally well when the template is mostly large blobs of characters/space, but doesn't work with the line-based templates used in this project. v2: encode long runs of spaces as a single character, then encode 6 character+space mix as a 1 character bitmask. Because we will only use the bitmask approach when there are characters, so the first bit will be implicitly set. This doesn't work at the end of line because we will either need to pad the end of lines with spaces to fill the 6 bits (which doesn't go well when we have to format quoted strings), or we will need to define extra code for less than 6 character long patterns, which increases both the code and data size. v3: a mix of v1 and v2, encode long runs of characters and spaces as single characters, and keep the 6 bit encodings for alternative characters. This works fairly well, except one small implementation problem... v4: an encoding that's identical to v3, except the character ranges used to encode the long runs are different. This is because the v3 encoding used the #..7 range for characters, and the many 3 character sequences in the template resulted in lots of C++ digraphs like "%:". By swapping the ranges for character and whitespace, the digraph problem is solved, and this is the final encoding used in the program. ---------------------------------------------------------------------- 5. Output This part explains the non-data bits, from #include to end of file. There are two streams of data, the code and the layout, and therefore two ways to generate output: - Follow characters in the code stream, consume and decode characters from the layout stream, then print the code characters. - Follow and decode characters in the layout stream, consume and print characters from the code stream. The latter is much simpler to implement than the former, because we can always consume one character at the time. Whereas if the output were driven by the code stream, we would need extra state to know when we need to advance the layout stream pointer. The decoder algorithm is fairly trivial, and follows from the encoding schemes described previously. Easy to read Perl implementations can be found in the subdirectories. The C/C++ version is a bit more clever in that it avoids using the "else" keyword (or really, all keywords in general), and also had to handle line continuations inside quotes, but otherwise it's not too complicated. Unformatted source can be found in generate_source.pl. A more readable source and comment source may be found in v3/youmu0.c, but things have changed quite a bit since then. ---------------------------------------------------------------------- 6. Format source Working out how to tweak the code to get a working prototype is one thing, getting the code to look like the final formatted version is still a few more days worth of work. This is because I had to account for syntactic correctness in two programs as I move bytes around. Two programs, simultaneously. There is no easy way to do it. At least, I haven't found one for C/C++ yet. Following the tried and true approach from previous years: - Generate templates that are large enough to hold all data (see template/*). For Youmu and Yuyuko, I have reserved about 100 bytes worth of headroom. - Rearrange code and template iteratively until things worked out. This is partially automated by Perl scripts, which merges the data streams to ensure that the latest changes are linked in. ---------------------------------------------------------------------- 7. Documentation I don't usually write things like this for my homepage. Not because I wanted to keep the inner workings of my programs abstruse and esoteric, it's more because I was lazy. That, and I thought my code is totally obvious, just like how everyone else thinks about their own code ^_^; These days I make an effort to reserve time to write documentation. I wonder if anyone actually read them? But with this final step, I declare youmu.c and yuyuko.c to be complete. Next Night... -- omoikane@uguu.org - http://uguu.org/