Yume (2004-03-25)
uniq utility.
Synopsis
yume [OPTION] [INPUT [OUTPUT]]
Description
Discard all duplicate lines from INPUT (or stdin), writing the
remaining lines to OUTPUT (or stdout). Ordering of the original
input is preserved. If "-" is used for INPUT, stdin is used. "-"
cannot be used for OUTPUT.
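As a rough illustration of the behavior described above (a sketch, not
Yume's actual code), the Python below keeps the first occurrence of each
line and preserves input order. The name dedup_lines and its ignore_case
parameter are made up for this example; ignore_case only mimics "-i".

```python
def dedup_lines(lines, ignore_case=False):
    """Yield each line the first time it is seen, preserving input order.

    Hypothetical helper: models only the default (infinite history)
    behavior and the "-i" option.
    """
    seen = set()
    for line in lines:
        key = line.lower() if ignore_case else line
        if key not in seen:
            seen.add(key)
            yield line


print(list(dedup_lines(["foo", "bar", "bar", "baz", "foo"])))
```

Unlike 'sort|uniq', nothing is reordered: a line is either passed through
in place or discarded.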
-i Ignore case (default: case sensitive).
-c Include number of occurrences with output (default: no).
-h N Set history size to N lines (default: infinite).
To emulate original uniq behavior, use "-h 1".
If N is negative, only the first -N lines are kept in
history. For example, to filter out the contents of
ignore.txt from input.txt:
cat ignore.txt input.txt | \
./yume -h -`wc -l < ignore.txt` | \
tail -n +$((`wc -l < ignore.txt` + 1))
-l expr Set expression for left marker (default: start of line).
-r expr Set expression for right marker (default: end of line).
Expressions are character-based strings, executed after
the initial (default) expressions. See the "Marker
expressions" section below.
-f N Skip first N fields.
Equivalent to "-l ^S(sS)M -r $", where M = N - 1.
"-f 0" is the same as "-f 1".
-s N Skip first N characters.
Equivalent to "-l ^M -r $", where M = N - 1.
"-s 0" is the same as "-s 1".
-w N Compare only the first N characters.
Equivalent to "-l ^ -r ^M", where M = N - 1.
-k Use the entire line for comparison when a marker
expression fails ("-l", "-r", "-f", "-s", or "-w").
This is more consistent with uniq's behavior.
Default is to drop the line completely.
-u Print only unique lines.
-d Print only duplicate lines.
Default behavior is to print all unique lines, along
with the first occurrence of each duplicated line:
Input lines = foo bar bar baz
Output lines = foo bar baz
When "-u" is specified, none of the duplicate lines
will be printed:
Input lines = foo bar bar baz
Output lines = foo baz
When "-d" is specified, only the first occurrence of
each duplicated line will be printed:
Input lines = foo bar bar baz
Output lines = bar
When both "-u" and "-d" are used, "-u" takes effect.
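The three output modes above can be modeled with a short Python sketch.
It is only an illustration under stated assumptions: infinite history,
no other options, and the whole input read before anything is printed
(this manual does not say how Yume itself schedules output). The name
select_lines and its parameters are hypothetical.

```python
from collections import Counter


def select_lines(lines, unique_only=False, duplicates_only=False):
    """Model the default, "-u" and "-d" output modes for a whole input.

    Hypothetical helper assuming infinite history and no other options.
    """
    lines = list(lines)
    counts = Counter(lines)       # occurrences of every line in the input
    seen = set()
    out = []
    for line in lines:
        if line in seen:
            continue              # later occurrences are never printed
        seen.add(line)
        if unique_only:           # "-u" wins when both flags are given
            keep = counts[line] == 1
        elif duplicates_only:
            keep = counts[line] > 1
        else:
            keep = True
        if keep:
            out.append(line)
    return out


sample = ["foo", "bar", "bar", "baz"]
print(select_lines(sample))
print(select_lines(sample, unique_only=True))
print(select_lines(sample, duplicates_only=True))
```

The three calls correspond to the three input/output pairs listed above.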
For options which take a parameter, the space between the flag and
the argument is optional; e.g. "-h1" and "-h 1" are the same.
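To make the "-h N" history rules easier to picture, here is a hedged
Python sketch. It assumes that every input line enters the history
whether or not it is printed, and that a negative N freezes the history
once it holds the first -N lines; both are readings of the description
above, not confirmed details of Yume. The name filter_with_history is
invented for this example, and "-h 0" is not modeled.

```python
from collections import deque


def filter_with_history(lines, history=None):
    """Drop lines that match a line still held in the history.

    history=None  -> unlimited history (Yume's default)
    history=N > 0 -> compare against the previous N input lines only
    history=N < 0 -> compare against the first -N input lines only
    Assumption: every input line enters the history, printed or not.
    """
    if history is not None and history > 0:
        window = deque(maxlen=history)    # oldest lines fall out
    else:
        window = []                       # unbounded, or frozen at -N lines
    out = []
    for line in lines:
        if line not in window:
            out.append(line)
        if history is None or history > 0 or len(window) < -history:
            window.append(line)
    return out


print(filter_with_history(["dup", "keep", "dup"]))
print(filter_with_history(["dup", "keep", "dup"], history=1))
```

With a negative history, the first -N lines act as a fixed ignore list:
later lines are compared only against them, not against each other.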
Marker expressions
+ Set current direction to left to right.
- Set current direction to right to left.
^ Set current position to start of line, and set direction
to left to right.
$ Set current position to end of line, and set direction to
right to left.
@ Use current position for the other marker.
This allows selecting a region more easily, when the
markers are nearby. For example: "-r ^,@," will select
the region between the first and second commas.
Note that left marker is always set before the right one,
so setting something like '-l $-2@-2' probably won't do
what you expect. Always use '@' in the right marker
expression for more predictable behavior.
( ) N Repeat enclosed expression N times.
")" not followed by a positive integer results in a silent
error.
Note that "(expr)0" has the same effect as "(expr)1".
Repeat count is limited by the size of a signed integer (2^31 - 1).
Nesting is limited to 8 levels.
(digit) Skip forward (digit) number of characters.
s Search forward for next whitespace (isspace).
d Search forward for next decimal digit (isdigit).
a Search forward for next alpha character (isalpha).
i Search forward for next alphanumeric character (isalnum).
S Search forward for next non-whitespace.
D Search forward for next non-digit.
A Search forward for next non-alpha character.
I Search forward for next non-alphanumeric character.
/(char) Search forward for next (char).
(char) Search forward for next (char), unless it matches any
other opcodes.
If the same character search is executed
twice, the cursor is moved one character
forward first. Thus -1-1- and --- have the
same effect.
It might not always be what you expect. For example,
"aiaiai" might not move the cursor at all, because Yume
sees each opcode as a different character.
Initial expression for left marker: ^
Initial expression for right marker: $
Marker expressions are used to generalize functions that were in
the original uniq. There are no plans to expand this expression
set; users should preprocess text files in a higher-level language
for that (such as perl/sed/awk).
Errors in executing the expression (e.g. failed character searches,
unmatched parentheses) result in the line being completely ignored,
unless "-k" is specified. Dropped lines still count as lines, so
history parameters are still in effect.
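The drop-versus-"-k" rule can be sketched in Python using field skipping
("-f N", N >= 1) as the only marker operation. This is an illustration,
not a marker expression interpreter: field_key and filter_by_key are
hypothetical names, and treating "no text left after the skipped fields"
as a failure is an assumption based on the description of "-f".

```python
def field_key(line, skip_fields):
    """Return the text compared for "-f N", or None if the line is too short.

    Assumption: fields are runs of non-whitespace, and comparison starts
    at the first non-whitespace character after the skipped fields.
    """
    pos = 0
    for _ in range(skip_fields):
        while pos < len(line) and line[pos].isspace():
            pos += 1                      # skip blanks before the field
        if pos == len(line):
            return None                   # not enough fields
        while pos < len(line) and not line[pos].isspace():
            pos += 1                      # skip the field itself
    while pos < len(line) and line[pos].isspace():
        pos += 1                          # land on the next field
    if pos == len(line):
        return None                       # nothing left to compare
    return line[pos:]


def filter_by_key(lines, skip_fields, keep_failed=False):
    """Deduplicate on field_key(); failed lines are dropped unless keep_failed."""
    seen = set()
    out = []
    for line in lines:
        key = field_key(line, skip_fields)
        if key is None:
            if not keep_failed:
                continue                  # default: drop the line entirely
            key = line                    # "-k": compare the whole line
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out


log = ["123 dup", "456  dup", "solo", "solo"]
print(filter_by_key(log, 1))
print(filter_by_key(log, 1, keep_failed=True))
```

Without keep_failed the one-field lines vanish from the output; with it
they are compared as whole lines, so only their first occurrence is kept.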
Examples
Make Yume run like uniq:
yume -h1 -k
Count number of hits for unique visitors in Apache log:
yume -c -r '^ ' /var/log/httpd/access_log
Ignore local visitors in Apache log:
echo 127.0.0. > ignore.txt
echo 192.168.0. >> ignore.txt
cat ignore.txt /var/log/httpd/access_log | \
yume -h -2 -r '^...'
Filter duplicate messages in kernel log, ignore timestamps:
yume -l ':: ' /var/log/messages
See the different instances of SSHD that started:
yume -r '^/s/s/h/d[@]' /var/log/secure
Find all unique words 12 characters or longer:
yume -l '(i)12^' /usr/share/dict/words
Error messages
read(%s) Cannot open file for reading.
write(%s) Cannot open file for writing.
%s? Unrecognized option.
%s __? Not enough arguments for option.
out of memory Out of memory.
I/O errors are silently ignored.
Incompatibilities
1. Feature Extension: By default, history is of infinite length
instead of just one line. This causes Yume to eliminate
duplicate lines globally instead of just nearby duplicate lines.
Specify "-h" to override this behavior.
Input:    uniq:     yume:     yume -h1:
dup       dup       dup       dup
keep      keep      keep      keep
dup       dup                 dup
Comparing every line against every other line is not a major
speed penalty -- all existing lines are indexed by their CRCs,
and the CRCs are used as a quick way to reject different lines.
The main reason to use Yume over uniq is this infinite history
feature, so that you don't have to run 'sort|uniq' and lose
the ordering of lines.
2. Feature Extension: Yume adds generalized marker expressions
("-l" and "-r") that are not in uniq, and uses them to generalize
the existing options ("-s", "-w" and "-f"). This allows for
more complex filtering without invoking an external program.
3. uniq allows the "-s", "-w" and "-f" options to be used together;
Yume treats them as mutually exclusive (and thus the option
specified later overrides the earlier ones). To achieve the
same effect, use marker expressions instead.
Input:      uniq -f1 -s2:   yume -f1 -s2:   yume -lSs2:
1 x dup     1 x dup         1 x dup         1 x dup
23 y dup                    23 y dup
4. For field ("-f") comparisons, Yume starts on the first
non-whitespace in the proper field, while uniq starts on the
first whitespace. I am keeping Yume's incompatible behavior
because I think that's usually more useful. The original uniq
behavior can be emulated with marker expressions.
Input:      uniq -f1:   yume -f1:   yume -lSs:
123 dup     123 dup     123 dup     123 dup
456 dup     456 dup                 456 dup
789 dup
Also, if there aren't enough fields, uniq treats the line like
an empty line (the first is printed, the rest are marked as
duplicates). Yume drops the line completely (not even the first
line is printed).
5. Yume ignores end-of-line sequences before filtering; uniq treats
them as significant. Thus two lines that differ only in their
end-of-line sequence are treated as duplicates by Yume, but not
by uniq. There is no way to override this behavior.
Input:    uniq:     yume:
dup\r     dup\r     dup\r
dup\n     dup\n
dup\r\n   dup\r\n
This might not be obvious:
Input: uniq: yume:
\n\r\r \n\r\r \n
uniq seems to treat the file as one line, but Yume sees three
lines, and removes the two duplicate lines.
6. If the last line in the original file does not end in a newline,
uniq outputs an extra newline; Yume doesn't. There is no way to
override this behavior.
Input: uniq: yume:
line<EOF> line\n<EOF> line<EOF>
7. Long-style options (e.g. --unique instead of -u) are not
supported.
8. GNU's extension ("-D") is not supported.
9. There is no command line help or version display, but options
are mostly compatible with the uniq ones (except for
incompatibilities mentioned above). Keep this manual around or
use "man uniq" instead.
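Item 1 above mentions that lines are indexed by their CRCs so that
different lines can be rejected quickly. The Python sketch below shows
one way such an index could work. It is an assumption-laden stand-in:
the manual does not say which CRC variant or data structure Yume uses,
so CRC-32 from zlib and a dictionary of buckets are used here, and the
class name CrcIndex is hypothetical.

```python
import zlib
from collections import defaultdict


class CrcIndex:
    """Hypothetical line index keyed by CRC-32, as a stand-in for Yume's."""

    def __init__(self):
        self.buckets = defaultdict(list)   # crc -> lines with that crc

    def add_if_new(self, line):
        """Return True and store the line if it was not seen before."""
        crc = zlib.crc32(line.encode("utf-8"))
        bucket = self.buckets[crc]
        # Different CRCs prove the lines differ; equal CRCs still need a
        # full comparison because unrelated lines can collide.
        if line in bucket:
            return False
        bucket.append(line)
        return True


index = CrcIndex()
print([w for w in ["dup", "keep", "dup"] if index.add_if_new(w)])
```

Only lines that share a CRC are ever compared character by character,
which is why a long history need not mean a slow comparison.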
Miscellaneous
Yume uniq Yume uniq Yume uniq... try say that 3 times fast ^_^;
More features -> more pixels -> better looking ASCII art.
Template is based on Kikuchi Yume from "Mahoutsukai ni taisetsu na
koto", but actually all of the code was written while listening to
Kokoro Toshokan OST...
--
omoikane@uguu.org - http://uguu.org/