Hex dump

From Rosetta Code
Revision as of 15:05, 29 October 2023 by Jgrprior (talk | contribs) (New draft task with Python example.)
Hex dump is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

A hex dump is a textual representation of bytes in a file.

hexdump is a command-line tool that can dump bytes from a file in a variety of formats, including hexadecimal, octal and ASCII.

hexdump's canonical format displays, on each line:

  • a byte offset in hexadecimal,
  • up to 16 bytes in hexadecimal separated by spaces, with an extra space between the 8th and 9th byte,
  • the same bytes interpreted as ASCII characters and surrounded by pipes (|), with non-printing and non-ASCII bytes replaced by a dot (.).

The last line shows the total byte count, also in hexadecimal.
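As a minimal sketch of the three columns described above, one line could be formatted as follows. The function name canonical_line is hypothetical, and the two-argument form of bytes.hex assumes Python 3.8 or later.

```python
def canonical_line(offset: int, chunk: bytes) -> str:
    """Format up to 16 bytes as one line of canonical output (a sketch)."""
    assert len(chunk) <= 16
    # Two groups of up to eight bytes, with an extra space between them.
    left = chunk[:8].hex(" ")
    right = chunk[8:].hex(" ")
    # Pad to 48 columns so the ASCII column lines up on a short final line.
    hex_field = f"{left}  {right}".ljust(48)
    # Printable ASCII is 0x20..0x7e; every other byte is shown as a dot.
    text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
    return f"{offset:08x}  {hex_field}  |{text}|"


print(canonical_line(0, bytes.fromhex("fffe52006f0073006500740074006100")))
```

The padding width of 48 comes from two groups of 23 characters plus the two spaces between them.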

For example, the string "Rosetta Code is a programming chrestomathy site 😀." encoded as little-endian UTF-16 with a leading byte order mark (the first two bytes, ff fe) is displayed in the canonical format as:

00000000  ff fe 52 00 6f 00 73 00  65 00 74 00 74 00 61 00  |..R.o.s.e.t.t.a.|
00000010  20 00 43 00 6f 00 64 00  65 00 20 00 69 00 73 00  | .C.o.d.e. .i.s.|
00000020  20 00 61 00 20 00 70 00  72 00 6f 00 67 00 72 00  | .a. .p.r.o.g.r.|
00000030  61 00 6d 00 6d 00 69 00  6e 00 67 00 20 00 63 00  |a.m.m.i.n.g. .c.|
00000040  68 00 72 00 65 00 73 00  74 00 6f 00 6d 00 61 00  |h.r.e.s.t.o.m.a.|
00000050  74 00 68 00 79 00 20 00  73 00 69 00 74 00 65 00  |t.h.y. .s.i.t.e.|
00000060  20 00 3d d8 00 de 2e 00                           | .=.....|
00000068
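The example bytes can be reproduced with a few lines of Python. This is only a sketch: the byte order mark is prepended explicitly because the "utf-16-le" codec does not write one, while the plain "utf-16" codec writes one but uses the machine's native byte order.

```python
import codecs

TEXT = "Rosetta Code is a programming chrestomathy site 😀."

# BOM_UTF16_LE is the two bytes ff fe; "utf-16-le" itself adds no BOM.
data = codecs.BOM_UTF16_LE + TEXT.encode("utf-16-le")

print(len(data))             # 104, which is 0x68 in hexadecimal
print(data[:4].hex(" "))     # ff fe 52 00
print(data[-6:].hex(" "))    # 00 de 2e 00 preceded by the high surrogate's d8
```

The emoji lies outside the Basic Multilingual Plane, so it is encoded as a surrogate pair and accounts for the four bytes 3d d8 00 de near the end of the dump.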
Task

Implement a hexdump-like program that:

  • outputs in the canonical format,
  • takes an optional offset in bytes from which to start,
  • takes an optional length in bytes after which it will stop.
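One way to read the offset and length requirements, sketched here with a hypothetical helper named read_window and assuming a seekable binary stream: seek to the offset first, then read at most the requested number of bytes.

```python
import io
from typing import BinaryIO, Optional


def read_window(f: BinaryIO, skip: int = 0, length: Optional[int] = None) -> bytes:
    """Return at most `length` bytes of `f`, starting `skip` bytes in."""
    f.seek(skip)  # assumes the stream supports seeking
    # read(-1) reads everything up to the end of the stream.
    return f.read(-1 if length is None else length)


stream = io.BytesIO(bytes(range(32)))
print(read_window(stream, skip=16, length=4).hex(" "))  # 10 11 12 13
```

Reading the whole window at once is fine for small inputs; a real tool would more likely read in fixed-size blocks so that large files are never held in memory.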

Demonstrate your implementation by showing the canonical hex dump of the example above, plus any other examples you find useful.

Stretch

xxd is another command-line tool similar to hexdump. It offers a binary mode in which each byte is displayed as eight bits instead of two hexadecimal digits.

Implement a binary mode. For this task, binary mode shows six bytes per line, so the example above should be displayed like this:

00000000  11111111 11111110 01010010 00000000 01101111 00000000  |..R.o.|
00000006  01110011 00000000 01100101 00000000 01110100 00000000  |s.e.t.|
0000000c  01110100 00000000 01100001 00000000 00100000 00000000  |t.a. .|
00000012  01000011 00000000 01101111 00000000 01100100 00000000  |C.o.d.|
00000018  01100101 00000000 00100000 00000000 01101001 00000000  |e. .i.|
0000001e  01110011 00000000 00100000 00000000 01100001 00000000  |s. .a.|
00000024  00100000 00000000 01110000 00000000 01110010 00000000  | .p.r.|
0000002a  01101111 00000000 01100111 00000000 01110010 00000000  |o.g.r.|
00000030  01100001 00000000 01101101 00000000 01101101 00000000  |a.m.m.|
00000036  01101001 00000000 01101110 00000000 01100111 00000000  |i.n.g.|
0000003c  00100000 00000000 01100011 00000000 01101000 00000000  | .c.h.|
00000042  01110010 00000000 01100101 00000000 01110011 00000000  |r.e.s.|
00000048  01110100 00000000 01101111 00000000 01101101 00000000  |t.o.m.|
0000004e  01100001 00000000 01110100 00000000 01101000 00000000  |a.t.h.|
00000054  01111001 00000000 00100000 00000000 01110011 00000000  |y. .s.|
0000005a  01101001 00000000 01110100 00000000 01100101 00000000  |i.t.e.|
00000060  00100000 00000000 00111101 11011000 00000000 11011110  | .=...|
00000066  00101110 00000000                                      |..|
00000068
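A line of the binary layout shown above could be produced along these lines. This is a sketch with a hypothetical function name, binary_line; the field width of 53 is six groups of eight bits plus five separating spaces.

```python
def binary_line(offset: int, chunk: bytes) -> str:
    """Format up to six bytes as one line of binary-mode output (a sketch)."""
    assert len(chunk) <= 6
    # Each byte becomes eight bits, zero-padded on the left.
    bits = " ".join(f"{b:08b}" for b in chunk).ljust(53)
    # Same ASCII column rule as the canonical format.
    text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
    return f"{offset:08x}  {bits}  |{text}|"


print(binary_line(0, bytes.fromhex("fffe52006f00")))
```

Note that the offset column still counts bytes in hexadecimal, which is why it advances by 6 on each line.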

Other hexdump/xxd features and a command line interface to your program are optional.


Python

"""Display bytes in a file like hexdump or xxd."""
import abc
import math
from io import BufferedIOBase
from itertools import islice
from typing import Iterable
from typing import Iterator
from typing import Sequence
from typing import Tuple
from typing import TypeVar


READ_SIZE = 2048


class Formatter(abc.ABC):
    """Base class for hex dump formatters."""

    @abc.abstractmethod
    def __call__(self, data: Sequence[int]) -> str:
        """Format one line's worth of bytes, excluding the offset column."""

    @property
    @abc.abstractmethod
    def bytes_per_line(self) -> int:
        """The maximum number of bytes displayed on each line."""


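class CanonicalFormatter(Formatter):
    """Format 16 bytes per line in hexadecimal, like ``hexdump -C``."""

    bytes_per_line = 16

    def __call__(self, data: Sequence[int]) -> str:
        assert len(data) <= 16
        # Two groups of eight bytes, padded so the ASCII column lines up.
        hex_ = f"{bytes(data[:8]).hex(' ')}  {bytes(data[8:]).hex(' ')}".ljust(48)
        # Printable ASCII is 32..126; anything else is shown as a dot.
        ascii_ = "".join(chr(b) if 31 < b < 127 else "." for b in data)
        return f"{hex_}  |{ascii_}|"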


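class BinaryFormatter(Formatter):
    """Format 6 bytes per line as bits, for the stretch task's binary mode."""

    bytes_per_line = 6

    def __call__(self, data: Sequence[int]) -> str:
        assert len(data) <= 6
        # Six groups of eight bits plus five separating spaces is 53 columns.
        bits = " ".join(f"{b:08b}" for b in data).ljust(53)
        # Printable ASCII is 32..126; anything else is shown as a dot.
        ascii_ = "".join(chr(b) if 31 < b < 127 else "." for b in data)
        return f"{bits}  |{ascii_}|"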


canonicalFormatter = CanonicalFormatter()
binaryFormatter = BinaryFormatter()

T = TypeVar("T")


def group(it: Iterable[T], n: int) -> Iterator[Tuple[T, ...]]:
    """Split iterable _it_ into groups of size _n_.

    The last group might contain fewer than _n_ items.
    """
    _it = iter(it)
    while True:
        g = tuple(islice(_it, n))
        if not g:
            break
        yield g


def hex_dump(
    f: BufferedIOBase,
    *,
    skip: int = 0,
    length: int = math.inf,  # type: ignore
    format: Formatter = canonicalFormatter,
) -> Iterator[str]:
    """Generate a textual representation of bytes in _f_, one line at a time."""
    f.seek(skip)
    offset = 0
    byte_count = 0
    previous_line = ""
    identical_chunk = False

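    # Read a whole number of lines at a time. READ_SIZE is not a multiple of
    # every line width (2048 % 6 != 0), and a read that ended mid-line would
    # produce a short line and wrong offsets in the middle of the dump.
    read_size = READ_SIZE - READ_SIZE % format.bytes_per_line

    while byte_count < length:
        # Read at most read_size bytes at a time.
        data = f.read(read_size)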

        # Stop if we've run out of data.
        if not data:
            break

        # Discard excess bytes if we've overshot length.
        if byte_count + len(data) > length:
            data = data[: length - byte_count]

        # One line per chunk
        for chunk in group(data, format.bytes_per_line):
            line = format(chunk)
            if previous_line == line:
                if identical_chunk is False:
                    identical_chunk = True
                    yield "*"
            else:
                previous_line = line
                identical_chunk = False
                yield f"{offset:0>8x}  {line}"

            offset += len(chunk)
            byte_count += len(chunk)

    # Final byte count
    yield f"{byte_count:0>8x}"


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        prog="hex_dump.py",
        description="Display bytes in a file.",
    )

    parser.add_argument(
        "file",
        type=argparse.FileType(mode="rb"),
        metavar="FILE",
        help="target file to dump",
    )

    parser.add_argument(
        "-b",
        "--binary",
        action="store_true",
        help="display bytes in binary instead of hex",
    )

    parser.add_argument(
        "-s",
        "--skip",
        type=int,
        default=0,
        help="skip SKIP bytes from the beginning",
    )

    parser.add_argument(
        "-n",
        "--length",
        type=int,
        default=math.inf,
        help="read up to LENGTH bytes",
    )

    args = parser.parse_args()
    formatter = binaryFormatter if args.binary else canonicalFormatter

    for line in hex_dump(
        args.file,
        format=formatter,
        skip=args.skip,
        length=args.length,
    ):
        print(line)
Output:
$ python hex_dump.py example_utf16.txt
00000000  ff fe 52 00 6f 00 73 00  65 00 74 00 74 00 61 00  |..R.o.s.e.t.t.a.|
00000010  20 00 43 00 6f 00 64 00  65 00 20 00 69 00 73 00  | .C.o.d.e. .i.s.|
00000020  20 00 61 00 20 00 70 00  72 00 6f 00 67 00 72 00  | .a. .p.r.o.g.r.|
00000030  61 00 6d 00 6d 00 69 00  6e 00 67 00 20 00 63 00  |a.m.m.i.n.g. .c.|
00000040  68 00 72 00 65 00 73 00  74 00 6f 00 6d 00 61 00  |h.r.e.s.t.o.m.a.|
00000050  74 00 68 00 79 00 20 00  73 00 69 00 74 00 65 00  |t.h.y. .s.i.t.e.|
00000060  20 00 3d d8 00 de 2e 00                           | .=.....|
00000068
$ python hex_dump.py example_utf16.txt -b
00000000  11111111 11111110 01010010 00000000 01101111 00000000  |..R.o.|
00000006  01110011 00000000 01100101 00000000 01110100 00000000  |s.e.t.|
0000000c  01110100 00000000 01100001 00000000 00100000 00000000  |t.a. .|
00000012  01000011 00000000 01101111 00000000 01100100 00000000  |C.o.d.|
00000018  01100101 00000000 00100000 00000000 01101001 00000000  |e. .i.|
0000001e  01110011 00000000 00100000 00000000 01100001 00000000  |s. .a.|
00000024  00100000 00000000 01110000 00000000 01110010 00000000  | .p.r.|
0000002a  01101111 00000000 01100111 00000000 01110010 00000000  |o.g.r.|
00000030  01100001 00000000 01101101 00000000 01101101 00000000  |a.m.m.|
00000036  01101001 00000000 01101110 00000000 01100111 00000000  |i.n.g.|
0000003c  00100000 00000000 01100011 00000000 01101000 00000000  | .c.h.|
00000042  01110010 00000000 01100101 00000000 01110011 00000000  |r.e.s.|
00000048  01110100 00000000 01101111 00000000 01101101 00000000  |t.o.m.|
0000004e  01100001 00000000 01110100 00000000 01101000 00000000  |a.t.h.|
00000054  01111001 00000000 00100000 00000000 01110011 00000000  |y. .s.|
0000005a  01101001 00000000 01110100 00000000 01100101 00000000  |i.t.e.|
00000060  00100000 00000000 00111101 11011000 00000000 11011110  | .=...|
00000066  00101110 00000000                                      |..|
00000068
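The solution above also collapses runs of identical lines into a single *, as hexdump does by default, although the task does not require it. Isolated from the rest of the program, that step might look like the following sketch (the name squeeze is hypothetical, and unlike hex_dump it compares whole lines rather than lines without their offset column):

```python
from typing import Iterable, Iterator, Optional


def squeeze(lines: Iterable[str]) -> Iterator[str]:
    """Replace each run of repeated lines with one "*" after the first."""
    previous: Optional[str] = None
    starred = False
    for line in lines:
        if line == previous:
            # Emit a single marker for the whole run of repeats.
            if not starred:
                starred = True
                yield "*"
        else:
            previous = line
            starred = False
            yield line


print(list(squeeze(["a", "a", "a", "b", "b", "c"])))  # ['a', '*', 'b', '*', 'c']
```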