Implementation notes

There is no design with this type of programs, you just write it, and
tweak the code along the way.

That said, there are some nontrivial bits in these programs, some are
slightly different from things I wrote before, most are different from
code that people usually write, hence this file here.  Think of this
as a tutorial, perhaps.  It's more like a diary.

----------------------------------------------------------------------
0. Concept

A program that:
 - Outputs Youmu when compiled as C.
 - Outputs Yuyuko when compiled as C++.

Furthermore, the output of those shall be valid C/C++ programs that
performs the same task.  In other words, a pair of polyglot quines.

This concept was decided very early on.  It was not random... the main
reason why I haven't updated my homepage for more than half a year is
because of all the weekends spent playing Touhou Eiyashou, and having
finally completed it on Lunatic last week, it was enough motivation to
write such programs.

Personal bias here, I thought Youmu and Yuyuko form the best couple
from Eiyashou, and they were also the first team to make it past
Lunatic for me.  Also related is my personal bias for C++, which I
totally believe to be the language of the youkai, and C is only half
as dangerous.  It had to be Youmu and Yuyuko, and it had to be C and
C++.

----------------------------------------------------------------------
1. Polyglot

This program makes use of a certain 3 character comment sequence to
differentiate C and C++ modes.  It's no magic if you already know what
it is, so this part is left as an exercise for the reader.

Unfortunately, a lot of C compilers implement certain C++ features as
extensions, and don't tell the users about it.  In this case, you will
always get yuyuko.c on stdout.  "gcc -ansi" is found to do The Right
Thing.

----------------------------------------------------------------------
2. Quine

This part explains this string:
   "ebfcfdfefffgb"

A typical quine needs to do something like this:
 - output some header
 - output encoded source
 - output decoded source

For Youmu and Yuyuko, the sequence is:
 - header
 - encoded source
 - encoded layout data
 - decoded source

Since strings are quoted in C/C++, the sequence becomes:
 - header
 - encoded source
 - quote/separator
 - encoded layout data
 - quote/separator
 - decoded source

Then we made the observation that we can simply create a table of
strings, and include the quotes themselves as part of the string
table, if we encode them the same way as the other source text.  Thus
we can create a string table that contains the following:
 a. encoded source
 b. encoded layout data
 c. encoded header
 d. encoded quote/separator

And the output sequence would be (lowercase for encoded data as is,
uppercase for decoded data):

   C a D b D c D d D B

Of course, this sequence can be stored in the string table also.  And
if we output it right before the decoded source, the encoded/decoded
outputs interleave perfectly, and we can control the output printer to
alternate between encoded and decoded output with simple XOR logic.

Using the knowledge above, and that the string table in the final
program start with index 'b', you should be able to decode this
sequence:

   E b F c F d F e F f F g B

----------------------------------------------------------------------
3. Source encoding

This part explains the string:
   "'x)/d ...

Up to:
   ... v-q,>x"

The source needs to be encoded so that backslashes and quotes don't
show up in strings.  This is trivially solved by XORing everything
with 5.  5 is the only one that works for the characters used in these
programs, see test_encode_text.pl.

----------------------------------------------------------------------
4. Layout encoding

This part explains the two strings:
   "B)G!B ...

Up to:
   ... 9C!.K"

Encoding the layout is more difficult than encoding the source, we
need some compression, otherwise there would be no room for code.  Then
we need a compression scheme that takes small amount of space in terms
of data and decoder size.  This is actually where the majority of the
code is.

Four different encodings were tried, see encode_template.pl and
decode_template.pl in the subdirectories.  Some gory details of the
encodings are described below.

v1: simple run length encoding, encode long runs of whitespaces and
characters as a single character, with the ASCII value indicating the
length.  This works generally well when the template is mostly large
blobs of characters/space, but doesn't work with the line-based
templates used in this project.

v2: encode long runs of spaces as a single character, then encode 6
character+space mix as a 1 character bitmask.  Because we will only
use the bitmask approach when there are characters, so the first bit
will be implicitly set.  This doesn't work at the end of line because
we will either need to pad the end of lines with spaces to fill the 6
bits (which doesn't go well when we have to format quoted strings), or
we will need to define extra code for less than 6 character long
patterns, which increases both the code and data size.

v3: a mix of v1 and v2, encode long runs of characters and spaces as
single characters, and keep the 6 bit encodings for alternative
characters.  This works fairly well, except one small implementation
problem...

v4: an encoding that's identical to v3, except the character ranges
used to encode the long runs are different.  This is because the v3
encoding used the #..7 range for characters, and the many 3 character
sequences in the template resulted in lots of C++ digraphs like "%:".
By swapping the ranges for character and whitespace, the digraph
problem is solved, and this is the final encoding used in the program.

----------------------------------------------------------------------
5. Output

This part explains the non-data bits, from #include to end of file.

There are two streams of data, the code and the layout, and therefore
two ways to generate output:

 - Follow characters in the code stream, consume and decode characters
   from the layout stream, then print the code characters.
 - Follow and decode characters in the layout stream, consume and
   print characters from the code stream.

The latter is much simpler to implement than the former, because we
can always consume one character at the time.  Whereas if the output
were driven by the code stream, we would need extra state to know when
we need to advance the layout stream pointer.

The decoder algorithm is fairly trivial, and follows from the encoding
schemes described previously.  Easy to read Perl implementations can
be found in the subdirectories.  The C/C++ version is a bit more
clever in that it avoids using the "else" keyword (or really, all
keywords in general), and also had to handle line continuations inside
quotes, but otherwise it's not too complicated.

Unformatted source can be found in generate_source.pl.  A more
readable source and comment source may be found in v3/youmu0.c, but
things have changed quite a bit since then.

----------------------------------------------------------------------
6. Format source

Working out how to tweak the code to get a working prototype is one
thing, getting the code to look like the final formatted version is
still a few more days worth of work.  This is because I had to account
for syntactic correctness in two programs as I move bytes around.  Two
programs, simultaneously.

There is no easy way to do it.  At least, I haven't found one for
C/C++ yet.  Following the tried and true approach from previous years:

 - Generate templates that are large enough to hold all data (see
   template/*).  For Youmu and Yuyuko, I have reserved about 100 bytes
   worth of headroom.
 - Rearrange code and template iteratively until things worked out.
   This is partially automated by Perl scripts, which merges the data
   streams to ensure that the latest changes are linked in.

----------------------------------------------------------------------
7. Documentation

I don't usually write things like this for my homepage.  Not because I
wanted to keep the inner workings of my programs abstruse and
esoteric, it's more because I was lazy.  That, and I thought my code
is totally obvious, just like how everyone else thinks about their own
code ^_^;

These days I make an effort to reserve time to write documentation.  I
wonder if anyone actually read them?  But with this final step, I
declare youmu.c and yuyuko.c to be complete.

Next Night...

--
omoikane@uguu.org - http://uguu.org/