L (2005-04-17)

Synopsis

  Encode:

    ./l <training-data.txt> <width> <message>

  Where <training-data.txt> is the sample data used to generate noise,
  <width> is the text width in characters, and the remaining arguments
  are the message to be encoded.

  Training data should contain enough words of varying length.  Poor
  training data will produce noisier output, or an infinite loop in
  the worst case.

  Decode:

    ./l <encoded-data.txt> [width]

  Where <encoded-data.txt> is the text containing the hidden message,
  and [width] is the text width (80 if not specified).  [width] should
  be the same number used for encoding.

  To prevent line wrapping on some terminals, the actual width used is
  one character less.

Known issues

  - Some input causes an infinite loop.
  - Assumes sizeof(char*) <= sizeof(int).
  - ASCII dependent.

Details

  L is a tool for message hiding, or steganography.  For example,
  examine the text below:

    apples not portable and absolute is a eat other actions reading consists of only you show some areas of source thru gods enough time for questions only death with a optional we are included

  ... which doesn't seem to make much sense, but if the above text
  were formatted to wrap around in 40 columns:

    apples not portable and absolute is a
    eat other actions reading consists of
    only you show some areas of source thru
    gods enough time for questions only
    death with a optional we are included

  Here the hidden message is visible in the first column, from bottom
  to top: "death gods only eat apples"

  To encode the above, you would do something like:

    ./l data.txt 40 death gods only eat apples > enc.txt

  Then maybe set the line width to something else:

    ./l enc.txt 60 > enc2.txt

  To decode, run the output file through L again with the width used
  to encode the text:

    ./l enc2.txt 40

  So the first word on each line is the signal, and the rest is all
  noise.  To make a better forest to hide the tree, the noise is
  generated so that it resembles natural language.

  How it works:

    1. Start a line with a word from the message to be encoded.
    2. Select the next word that is likely to follow the previous word
       (from the training data).
    3. Repeat step 2 until the right margin is reached, defined to be
       the column at which adding the next word in the message would
       cause line wrapping to occur.
    4. Repeat steps 1-3 until all words in the message have been
       consumed.

  Selecting words in step 2 is a tricky business... one metric to
  consider is the "naturalness" of the sentences: the more sense the
  sentences make, the less conspicuous they appear, and the less
  obvious the hidden message becomes.  The other approach is to have
  them make no sense at all, and let L select words randomly.  The
  final implementation is a decent balance between time/space and
  output quality.

  Also, depending on the training text, L could reach a state where
  there are no words of proper length to fill the line up to the
  margin and not go over the limit.  If there is not enough variety in
  word lengths, L could either take a long time or loop forever.  This
  shouldn't be an issue most of the time...

  For training data, guidelines.txt from this contest is a usable
  file.  The packaged info file was in fact generated using
  "cat guidelines.txt rules.txt", since I am sure that those files can
  be legally distributed.  L really would have preferred a larger data
  set, though.  Text files from digital libraries like Project
  Gutenberg would be very good, especially since you can tune the
  noise to a particular writing style.

  Most files would do, really, including the source file or the binary
  executable.  In fact, there is probably less chance of an infinite
  loop with these files, though your message usually becomes very
  obvious in those cases.

  You can also try mixing languages, for example:

    sum sous on ne veut n raison se sous le coeur
    ergo un peu dieu est une roseau ne se raisons
    cogito point en nous pensons veiller n mais se le
    ...

  Latin hidden inside French-looking noise, how about that? ^_^;

  Inspired by the scheme used in Page.8 of Death Note.  The source is
  meant to resemble the character L from the series.  The scheme was
  originally used with Japanese, but this program only supports
  English-like languages where words are separated by whitespace and
  composed of ASCII characters, mostly in the A to Z range.
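
As an illustration of the "How it works" steps above, here is a minimal sketch in Python. It is not the actual L source: the names wrap, decode and encode are made up for this sketch, word selection is a plain table of "words seen right after this word" with a random pick, and the fallback to any training word that fits is an assumption added so that the sketch raises an error instead of looping forever.

```python
import random


def wrap(words, limit):
    """Greedy word wrap: start a new line when the next word would not fit."""
    lines, line = [], ""
    for word in words:
        if line and len(line) + 1 + len(word) > limit:
            lines.append(line)
            line = word
        else:
            line = word if not line else line + " " + word
    if line:
        lines.append(line)
    return lines


def decode(text, width=80):
    """Re-wrap at width - 1 and read the first words from bottom to top."""
    lines = wrap(text.split(), width - 1)
    return " ".join(line.split()[0] for line in reversed(lines))


def encode(training, width, message, seed=None):
    """Hide message as the first word of each line, last word on top."""
    rng = random.Random(seed)
    limit = width - 1                      # actual width is one less
    vocab = training.split()
    if not vocab:
        raise ValueError("training data is empty")
    shortest = min(len(w) for w in vocab)

    followers = {}                         # word -> words seen right after it
    for prev, nxt in zip(vocab, vocab[1:]):
        followers.setdefault(prev, []).append(nxt)

    signal = list(reversed(message.split()))
    lines = []
    for i, word in enumerate(signal):
        if len(word) > limit:
            raise ValueError("message word is wider than the line")
        line, prev = word, word
        # Pad until the next message word would no longer fit on this line.
        # The last line has no successor, so it is padded until even the
        # shortest training word would not fit (a choice made for this sketch).
        need = len(signal[i + 1]) if i + 1 < len(signal) else shortest
        while len(line) + 1 + need <= limit:
            room = limit - len(line) - 1
            fits = [w for w in followers.get(prev, []) if len(w) <= room]
            fits = fits or [w for w in vocab if len(w) <= room]
            if not fits:
                raise ValueError("no training word fits the remaining gap")
            prev = rng.choice(fits)
            line += " " + prev
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ("apples not portable and absolute is a eat other actions "
              "reading consists of only you show some areas of source thru "
              "gods enough time for questions only death with a optional "
              "we are included")
    print(decode(sample, 40))              # prints: death gods only eat apples
```

Since decode only looks at word order and the width, re-flowing the encoded text to a different line width does not disturb the message, as long as the original width is given when decoding. The ValueError in encode corresponds to the situation described above where no word of proper length is available to fill the line.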