L (2005-04-17)
Synopsis
Encode:
./l <training-data.txt> <width> <message>
Where <training-data.txt> is the sample data used to generate
noise, <width> is the text width in characters, and the
remaining arguments are the message to be encoded.
Training data should contain enough words of varying length.
Poor training data will produce noisier output, or infinite
loops in the worst case.
Decode:
./l <encoded-data.txt> [width]
Where <encoded-data.txt> is the text containing the hidden
message, and [width] is the text width (80 if not specified).
[width] should be the same number used for encoding. To prevent
line wrapping on some terminals, the actual width used is one
character less.
Known issues
- Some inputs cause an infinite loop.
- Assumes sizeof(char*) <= sizeof(int).
- ASCII dependent.
Details
L is a tool for message hiding, or steganography. For example,
examine the text below:
apples not portable and absolute is a eat other actions
reading consists of only you show some areas of source thru
gods enough time for questions only death with a optional
we are included
... which doesn't seem to make much sense, but if the above text
were formatted to wrap around in 40 columns:
apples not portable and absolute is a
eat other actions reading consists of
only you show some areas of source thru
gods enough time for questions only
death with a optional we are included
Here the hidden message is visible in the first column, from bottom
to top:
"death gods only eat apples"
To encode the above, you would do something like:
./l data.txt 40 death gods only eat apples > enc.txt
Then maybe set the line width to something else:
./l enc.txt 60 > enc2.txt
To decode, run the output file through L again with the width used
to encode the text:
./l enc2.txt 40
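The reflow-and-read step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual C source: reflow() and hidden_message() are made-up names, and the only rules taken from this file are "wrap on whitespace", "use one column less than the requested width", and "read the first words from bottom to top".

```python
import textwrap


def reflow(text, width=80):
    """Re-wrap text on whitespace, one column short of the requested width."""
    return textwrap.wrap(
        " ".join(text.split()),
        width=width - 1,          # the README says one column is held back
        break_long_words=False,
        break_on_hyphens=False,
    )


def hidden_message(text, width=80):
    """Read the first word of each wrapped line, bottom to top."""
    return " ".join(line.split()[0] for line in reversed(reflow(text, width)))


# The sample text from this file, as a single run of words.
NOISE = (
    "apples not portable and absolute is a eat other actions "
    "reading consists of only you show some areas of source thru "
    "gods enough time for questions only death with a optional "
    "we are included"
)

for line in reflow(NOISE, 40):
    print(line)
print(hidden_message(NOISE, 40))
```

With the sample text, reflowing at 40 prints the five lines shown earlier and then "death gods only eat apples". Reflowing at 60 gives the four-line version, and reflowing that again at 40 brings the signal back to the first column.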
So the first word on each line is the signal, the rest is all
noise. To make a better forest to hide the tree, the noise is
generated so that it resembles natural language. How it works:
1. Start a line with a word from the message to be encoded.
2. Select the next word that is likely to follow the previous
word (from the training data).
3. Repeat step 2 until right margin is reached, defined to be
the column for which adding the next word in message would
cause line wrapping to occur.
4. Repeat steps 1-3 until all words in message have been
consumed.
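The steps above can be sketched roughly as follows, in Python rather than the C of the real program. Everything here is an assumption made for illustration: the names build_followers() and encode(), the seed parameter, the fallback to any short-enough training word, and the way the last line is padded. The message words are placed in reverse order because the example in this file reads from bottom to top.

```python
import random
from collections import defaultdict


def build_followers(training_text):
    """Map each training word to the list of words seen right after it."""
    words = training_text.split()
    followers = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)
    return words, followers


def encode(training_text, width, message, seed=0):
    rng = random.Random(seed)
    vocab, followers = build_followers(training_text)
    limit = width - 1                # one column is held back
    signal = message.split()[::-1]   # message is read bottom to top
    lines = []
    for i, word in enumerate(signal):
        # Last line has no successor; pad as if the same word followed
        # (an arbitrary choice made for this sketch).
        nxt = signal[i + 1] if i + 1 < len(signal) else word
        line, prev = word, word      # step 1: start with a signal word
        # Step 3: keep padding while the next signal word would still fit.
        while len(line) + 1 + len(nxt) <= limit:
            room = limit - len(line) - 1
            # Step 2: prefer words seen after the previous word.
            fits = [w for w in followers.get(prev, []) if len(w) <= room]
            if not fits:             # fallback: any word that is short enough
                fits = [w for w in vocab if len(w) <= room]
            if not fits:
                raise ValueError("training data has no word short enough")
            prev = rng.choice(fits)
            line += " " + prev
        lines.append(line)           # step 4: move on to the next signal word
    return "\n".join(lines)
```

Unlike L itself, this sketch gives up with an error instead of looping when no training word fits, which is a deliberate simplification.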
Selecting words in step 2 is a tricky business... one metric to
consider is the "naturalness" of the sentences: the more sense
they make, the less conspicuous they appear, and the less obvious
the hidden message becomes. The other approach is to have them
make no sense at all, and let L select words randomly. The final
implementation is a decent balance between time/space and output
quality.
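To make the two extremes concrete, here is a small Python sketch of both selection styles. It is not how L actually chooses words (this file does not say), and follower_counts() and pick_next() are hypothetical names. The "natural" mode weights each candidate by how often it followed the previous word in the training data; the other mode ignores context entirely.

```python
import random
from collections import Counter, defaultdict


def follower_counts(training_text):
    """Count how often each word follows each other word."""
    words = training_text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return words, counts


def pick_next(prev, words, counts, rng, natural=True):
    """Pick a follower of prev, weighted by frequency, or any word at random."""
    seen = counts.get(prev)
    if natural and seen:
        candidates = list(seen)
        weights = [seen[w] for w in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]
    # Pure noise, also used when prev never appeared with a successor.
    return rng.choice(words)
```

The weighted mode needs a table of word pairs, which is where the time/space cost comes from; the random mode needs only the word list.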
Also, depending on the training text, L could reach a state where
there are no words of proper length to fill the line up to the
margin and not go over the limit. If there is not enough variety
in word lengths, L could either take a long time or loop forever.
This shouldn't be an issue most of the time...
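A quick way to judge a training file before using it is to look at its word lengths. The Python sketch below uses hypothetical helper names and one simplifying assumption that is not taken from the L source: a line only needs padding while the next message word would still fit, so padding can never get stuck if the shortest training word is no longer than the shortest message word.

```python
from collections import Counter


def length_histogram(training_text):
    """Count how many training words there are of each length."""
    return Counter(len(w) for w in training_text.split())


def padding_cannot_stall(training_text, message):
    """True if some training word always fits wherever a message word fits."""
    hist = length_histogram(training_text)
    msg_words = message.split()
    if not hist or not msg_words:
        return False
    # min(hist) is the shortest word length present in the training data.
    return min(hist) <= min(len(w) for w in msg_words)
```

A histogram with only a few distinct lengths is a hint that the file is poor training data, whatever the exact fill rule turns out to be.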
For training data, guidelines.txt from this contest is a usable
file. The packaged info file is in fact generated using "cat
guidelines.txt rules.txt", since I am sure that those files can be
legally distributed. Though L really would have preferred a larger
data set. Text files from digital libraries like Project Gutenberg
would be very good, especially since you can tune the noise to a
particular writing style.
Most files would do, really, including the source file or the
binary executable. In fact, there is probably less chance of an
infinite loop with these files, though your message usually becomes
very obvious in those cases. You can also try mixing languages,
for example:
sum sous on ne veut n raison se sous le coeur
ergo un peu dieu est une roseau ne se raisons
cogito point en nous pensons veiller n mais se le
... Latin hidden inside French-looking noise, how about that? ^_^;
Inspired by the scheme used in Page.8 of Death Note. The source is
meant to resemble the character L from the series. Originally used
with Japanese, but this program only supports English-like
languages where words are separated by whitespace and composed of
ASCII characters mostly in the A to Z range.