libcharset Programmer’s Manual

charset_encode(3)

Name

charset_encode, charset_decode — Character set encoding

Synopsis

#include <charset.h>

char *
charset_encode(const char *charset, size_t charset_size,
               const void *in, size_t in_size, size_t *out_size);

void *
charset_decode(const char *charset, size_t charset_size,
               const char *in, size_t in_size, size_t *out_size);

Description

This facility maps an input stream of arbitrary bytes onto a limited character set. It is intended for encoding streams when tunneling over protocols with a restricted set of legal characters (for example, tunneling arbitrary data over syslog).

The current algorithm assumes the range of the unencoded data is greater than or equal to the range of the encoded data’s character set.

This is best suited to encoding arbitrary data spread evenly across all bits in a byte (such as compressed streams). An escape-based system might be better suited to data which is concentrated around a few byte values (such as text). No compression is performed.

In the current implementation, a larger character set reduces the encoded size only logarithmically: the size decreases only each time the character set size passes a power of two.
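
The effect of character set size on output length can be estimated with a small calculation. The sketch below is not part of libcharset: estimate_encoded_chars is a hypothetical helper, and it assumes that each encoded character carries floor(log2(charset_size)) bits and that the encoding adds no header or padding overhead, neither of which this manual specifies.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of libcharset.
 * Estimates how many encoded characters in_size bytes would need,
 * assuming floor(log2(charset_size)) bits per character and no
 * framing overhead. Returns 0 if the estimate cannot be computed.
 */
size_t
estimate_encoded_chars(size_t in_size, size_t charset_size)
{
    size_t bits = 0;
    size_t total_bits;

    /* floor(log2(charset_size)) */
    while (charset_size > 1) {
        charset_size >>= 1;
        bits++;
    }

    /* a set of fewer than two characters carries no information */
    if (bits == 0 || in_size > SIZE_MAX / 8) {
        return 0;
    }

    total_bits = in_size * 8;

    /* round up to a whole number of characters */
    return total_bits / bits + (total_bits % bits != 0);
}
```

Under these assumptions, 100 input bytes need 200 characters with either a 16-character or a 31-character set, and 160 characters once the set reaches 32 characters.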

The charset_encode function encodes the given arbitrary input data in a form decodable by charset_decode when given the same character set. The encoded form is restricted to a subset of the characters in the given character set.

The charset_decode function decodes a sequence of bytes as encoded by charset_encode.

For both encoding and decoding, the character set is expected to be lexicographically sorted per charset_sort.
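
A caller that builds character sets at run time may want to check this precondition before encoding. The following is a minimal sketch only: example_is_sorted is a hypothetical name, and it assumes that "sorted" means strictly ascending by unsigned byte value with no duplicates, which may differ from the ordering charset_sort actually produces.

```c
#include <stddef.h>

/*
 * Hypothetical precondition check, not part of libcharset.
 * Assumes "sorted" means strictly ascending by unsigned byte value.
 * Returns 1 if the set satisfies that ordering, 0 otherwise.
 */
int
example_is_sorted(const char *charset, size_t charset_size)
{
    size_t i;

    for (i = 1; i < charset_size; i++) {
        /* compare as unsigned so bytes above 0x7f order consistently */
        if ((unsigned char) charset[i - 1] >= (unsigned char) charset[i]) {
            return 0;
        }
    }

    return 1;
}
```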

Return Value

For charset_encode, a pointer to the allocated memory is returned, and its size is written to *out_size. On error the value of *out_size is indeterminate.

For charset_decode, a pointer is returned to a newly-allocated buffer containing the decoded data (with arbitrary byte values). This buffer must be freed with free.

On failure these functions return NULL and errno is set accordingly.

Caveats

The number of bits carried by each encoded character is computed from charset_size rounded down to the nearest power of two (e.g. two bits for a character set of “ABCDE”, because five rounds down to four unique values).
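
The rounding can be worked through in a few lines of C. This is an illustrative sketch, not the library's source: example_bits_per_char and example_usable_chars are hypothetical names, and the sketch assumes the bit count is exactly floor(log2(charset_size)).

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of libcharset.
 * Returns floor(log2(charset_size)), or 0 for sizes below two.
 */
unsigned
example_bits_per_char(size_t charset_size)
{
    unsigned bits = 0;

    while (charset_size > 1) {
        charset_size >>= 1;
        bits++;
    }

    return bits;
}

/*
 * Hypothetical helper, not part of libcharset.
 * Returns the largest power of two not exceeding charset_size,
 * or 0 for an empty set.
 */
size_t
example_usable_chars(size_t charset_size)
{
    if (charset_size == 0) {
        return 0;
    }

    return (size_t) 1 << example_bits_per_char(charset_size);
}
```

For the five-character set “ABCDE” this gives two bits per character and four usable values, so under this assumption the fifth character contributes nothing.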

Future directions

Currently this code uses only a power-of-two number of characters from the character set (so in practice this is base 128, 64, 32, 16, and so on). It is possible to make use of other character set sizes; however, that is non-trivial and is left as future work.

See Also

charset, charset_sort.