L (2005-04-17)



   To encode:

      ./l <training-data.txt> <width> <message>

      Where <training-data.txt> is the sample data used to generate
      noise, <width> is the text width in characters, and the
      remaining arguments are the message to be encoded.

      Training data should contain enough words of varying length.
      Poor training data will produce noisier output, or infinite
      loops in the worst case.


   To decode:

      ./l <encoded-data.txt> [width]

      Where <encoded-data.txt> is the text containing the hidden
      message, and [width] is the text width (80 if not specified).

      [width] should be the same number used for encoding.  To
      prevent line wrapping on some terminals, the actual width used
      is one character less.

Known issues

   - Some inputs cause an infinite loop.
   - Assumes sizeof(char*) <= sizeof(int).
   - ASCII dependent.


   L is a tool for message hiding, or steganography.  For example,
   examine the text below:

      apples not portable and absolute is a eat other actions
      reading consists of only you show some areas of source thru
      gods enough time for questions only death with a optional
      we are included

   ... which doesn't seem to make much sense, but if the above text
   were formatted to wrap around in 40 columns:

      apples not portable and absolute is a
      eat other actions reading consists of
      only you show some areas of source thru
      gods enough time for questions only
      death with a optional we are included

   Here the hidden message is visible as the first word of each line,
   read from bottom to top:

      "death gods only eat apples"

   To encode the above, you would do something like:

      ./l data.txt 40 death gods only eat apples > enc.txt

   Then maybe set the line width to something else:

      ./l enc.txt 60 > enc2.txt

   To decode, run the output file through L again with the width used
   to encode the text:

      ./l enc2.txt 40
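
   Going by the example above, decoding amounts to re-wrapping the
   text at the encoding width and reading the first word of each line
   from bottom to top.  Here is a minimal Python sketch of that idea,
   not the actual implementation of L.  The names decode and SAMPLE
   are made up for illustration, and the sketch wraps at exactly the
   given width rather than one character less.

```python
import textwrap

# The example text from this file, at its original (non-40) wrapping.
SAMPLE = """
apples not portable and absolute is a eat other actions
reading consists of only you show some areas of source thru
gods enough time for questions only death with a optional
we are included
"""


def decode(text, width=80):
    """Re-wrap text at width, then read the first words bottom to top."""
    # Collapse all whitespace so the original line breaks do not matter.
    flat = " ".join(text.split())
    lines = textwrap.wrap(flat, width=width)
    # The signal is the first word of each line, last line first.
    return " ".join(line.split()[0] for line in reversed(lines))


if __name__ == "__main__":
    print(decode(SAMPLE, 40))  # prints "death gods only eat apples"
```

   For this particular text, wrapping at 39 or 40 columns gives the
   same line breaks, so the result is the same either way.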

   So the first word on each line is the signal, and the rest is all
   noise.  To make a better forest to hide the tree, the noise is
   generated so that it resembles natural language.  How it works:

      1. Start a line with a word from the message to be encoded.
      2. Select the next word that is likely to follow the previous
         word (from the training data).
      3. Repeat step 2 until the right margin is reached, defined to
         be the column for which adding the next word in the message
         would cause line wrapping to occur.
      4. Repeat steps 1-3 until all words in the message have been
         encoded.
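
   The steps above can be sketched in Python roughly as follows.
   This is an illustration under assumptions, not how L itself is
   written: the names build_followers and encode are hypothetical,
   "likely to follow" is approximated by a plain table of word pairs
   from the training text, the message is laid out bottom to top as
   in the earlier example, and the last line gets no padding.  Where
   L might loop forever, the sketch raises an error instead.

```python
import random
from collections import defaultdict


def build_followers(training_text):
    """Map each word to the list of words seen right after it."""
    words = training_text.split()
    followers = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)  # duplicates keep frequent pairs likely
    return words, followers


def encode(training_text, width, message, seed=0):
    """Return lines whose first words, read bottom to top, spell message."""
    rng = random.Random(seed)
    vocab, followers = build_followers(training_text)
    signal = message.split()[::-1]  # first message word ends up on the last line
    lines = []
    for i, word in enumerate(signal):
        # Step 1: start the line with a word from the message.
        line, length = [word], len(word)
        next_len = len(signal[i + 1]) if i + 1 < len(signal) else None
        # Step 3: keep padding while the next signal word would still fit.
        while next_len is not None and length + 1 + next_len <= width:
            room = width - length - 1
            # Step 2: prefer words seen after the previous word.
            fits = [w for w in followers.get(line[-1], []) if len(w) <= room]
            if not fits:
                # Fall back to any training word that fits (an assumption).
                fits = [w for w in vocab if len(w) <= room]
            if not fits:
                raise ValueError("no training word fits the remaining space")
            pick = rng.choice(fits)
            line.append(pick)
            length += 1 + len(pick)
        lines.append(" ".join(line))
    return lines


if __name__ == "__main__":
    training = ("a bee can do every fine good thing in just a kind "
                "long moment if no one is a bit too shy")
    for out in encode(training, 40, "death gods only eat apples"):
        print(out)
```

   Every padded line ends close enough to the margin that the next
   signal word cannot fit on it, which is what makes the signal words
   land at the start of a line when the text is wrapped again.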

   Selecting words in step 2 is a tricky business... one metric to
   consider is the "naturalness" of the sentences: the more sense
   they make, the less conspicuous they appear, and the less obvious
   the hidden message becomes.  The other approach is to have them
   make no sense at all, and let L select words randomly.  The final
   implementation is a decent balance between time/space and output
   quality.

   Also, depending on the training text, L could reach a state where
   there are no words of proper length to fill the line up to the
   margin and not go over the limit.  If there is not enough variety
   in word lengths, L could either take a long time or loop forever.
   This shouldn't be an issue most of the time...
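
   One rough way to judge the variety of word lengths in a training
   file before using it is to count them.  The Python sketch below is
   only a hypothetical pre-check and not something L does; the
   function names and the default cutoff of 8 are arbitrary choices
   made for illustration.

```python
from collections import Counter


def word_length_counts(text):
    """Count how many whitespace-separated words have each length."""
    return Counter(len(word) for word in text.split())


def missing_lengths(text, up_to=8):
    """List the word lengths from 1 to up_to that never occur in text."""
    counts = word_length_counts(text)
    return [n for n in range(1, up_to + 1) if counts[n] == 0]


if __name__ == "__main__":
    sample = "aa bbbb"
    print(missing_lengths(sample, 4))  # prints [1, 3]
```

   A file that is missing the short lengths is the kind that is more
   likely to leave a gap at the margin that no word can fill.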

   For training data, guidelines.txt from this contest is a usable
   file.  The packaged info file is in fact generated using "cat
   guidelines.txt rules.txt", since I am sure that those files can be
   legally distributed, though L really would have preferred a larger
   data set.  Text files from digital libraries like Project Gutenberg
   would be very good, especially since you can tune the noise to a
   particular writing style.

   Most files would do, really, including the source file or the
   binary executable.  In fact, there is probably less chance of an
   infinite loop with these files, though your message usually becomes
   very obvious in those cases.  You can also try mixing languages,
   for example:

      sum sous on ne veut n raison se sous le coeur
      ergo un peu dieu est une roseau ne se raisons
      cogito point en nous pensons veiller n mais se le

   ... Latin hidden inside French-looking noise, how about that?  ^_^;

   Inspired by the scheme used in Page.8 of Death Note.  The source is
   meant to resemble the character L from the series.  The scheme was
   originally used with Japanese, but this program only supports
   English-like languages where words are separated by whitespace and
   composed of ASCII characters, mostly in the A to Z range.