RFC 4648: Base64 Encoding, Tutorial in C

27 Sep 2018

↪ Background

Base64 encoding can be found in many modern applications like browsers or e-mail Clients. Stating RFC 4648, base64 “is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII”. However, it is also very handy in case you want to edit binary data with your text editor, i.e. adding your Public SSH Key to the authorized_key file on a remote server or if you want to store a JPEG file within a text database.

About a year ago I thought this might be a good example to practice my C skills a bit more, now I had a bit of time to rework the code and write a little article about it.

It requires some basic knowledge about bit shifting and bit masks but there are good articles and a couple of reference implementations (GNU coreutils, mutt) out there, hence base64 has got everything for a little C exercise.

↪ Encoding

We will use the following 64+1 character table, the RFC states some other variants, like URL encoding, but we won’t care about that for the moment.

"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

The first 64 charactes are used for encoding, the 65th “=” is used for padding the output whose input is not a multiple of 3.

In the example we use the string “gfuzz” to describe the encoding function, please note that you can use any bytes as input, not only US-ASCII but also non printable characters.

g => 103 -> 0x67 -> 01100111
f => 102 -> 0x66 -> 01100110
u => 117 -> 0x75 -> 01110101
z => 122 -> 0x7A -> 01111010
z => 122 -> 0x7A -> 01111010

On a binary level we join the first 3 characters together to form:

        g        f        u
    01100111 01100110 01110101

These 24bits will be divided into 4*6 bits, forming:

    011001 110110 011001 110101
      25     54     25     53

Here is a C example, input is a fixed char array of length 3.

↪ Code

    uint32_t element = 0;
    for(int i = 0; i < len; i++)
    {
        element <<= 8;
        element |= (input[i] & 0xFF);
    }

↪ Explanation

Shift a 32bit integer by 8 bits to the left
XOR the current input char into the 32bit integer
Limit the bitmask to 0xFF (1111 1111), otherwise it breaks input of non-printable characters. These resulting numbers will be the character positions of our base64 input table from above:
```
  011001 110110 011001 110101
    25     54     25     53
    Z       2      Z      1
```

↪ Code snippet

    for(int i = 24; i > (24-(8*len)); i-=6)
    {
        uint8_t bi = ((element >> (i-6)) & 0x3F);
        output[index++] = BASE64TABLE[bi];
    }

↪ Explanation

Right shift the concatenated input ‘element’ by the corresponding sextet to extract the needed 6 bits (0011 1111 / 0x3F) for the base64 index
Apply the resulting index to the table, remember character in output and number of base64 characters (in variable ‘index’) We continue with the remaining 2 chars ‘zz’ from our input to finish the encoding while demonstrating padding:
```
      z        z      
  01111010 01111010 00000000

  011110 100111 101000 000000
     30    39     40     
      e     n      o     =
```
Join the characters together and add a 0-byte at the end of the input to fill up to 24bits. Divided by 4 we retrieve 3 base64 characters from our table, the padded last sixted will be replaced by the 65th character from our base64 table which is ‘=’.

↪ Example in C

    while(index < 4)
        output[index++] = BASE64TABLE[64];

↪ Explanation

Since we remember the number of encoded characters we can easily add the needed padding characters from the base64 table. Note that having a 1 byte input would result in two padding characters, thus the single character G becomes Rw== in base64.

The final encoded output string is ‘Z2Z1eno=’ (8 characters).

Please also note that my final implementation does implicit line feeding after 76 characters, contrary to the RFC:

“Implementations MUST NOT add line feeds to base-encoded data unless the specification referring to this document explicitly directs base encoders to add line feeds after a specific number of characters.”

↪ Decoding

Once we have done the encoding part, the decoding part is quite easy, just revert the functions above. Lets take a base64 input with a multiple of 4, i.e. the output from the previous example “Z2Z1eno=”.

The only difference is that this time we take each of the 4 character input block and do a numerical lookup in the base64 table, then rejoin the resulting numbers to a 3 byte block to retrieve the initial plain text:

       Z      2      Z      1
      25     54     25     53
    011001 110110 011001 110101
    
    01100111 01100110 01110101
      103      102      117
       g        f        u

The table lookup

    static const char *BASE64TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

    for(int i = 0; i < 66; i++)
        if(BASE64TABLE[i] == c)
            return i;
    return -1;

↪ Explanation

Iterate over the base64 character table, in case of a match we return the postion i
In case of a non-base64 character (aka ‘invalid input’) we return -1 to indicate the error

↪ The decoding part

    for(int i = 0; i < 4; i++)
    {
        if(input[i] != '=')
            len++;
        base64_table = getIndex(input[i]);
        if(base64_table == -1)
            return -1;    /* indicates invalid input */
        element <<= 6;
        element |= (base64_table  & 0x3F);
    }
    for(int i = 24; i > (24-(6*len)); i-=8)
        output[index++] = (element >> (i-8)) & 0xFF;

↪ Explanation

For each 6bit input block we do a lookup, left-shift element by 6 bits and XOR the input sextet
We shift the input byte block by the appropriate multiple of 8 to take out the single output characters via bitmask 0xFF We need to take account of padding characters ‘=’ and skip them appropriately.

↪ Who said mathematics?

In most cases the base64 produces an overhead of ~33%. Including padding, the total size of an input of n bytes can described by: \(4 \lceil \frac{1}{3}n \rceil\)

The brackets mean that we round up, so return the next upper integer of the resulting real number.

↪ Code

A basic library with a reference program that encodes and decodes (-d) data from stdin can be found here https://gogs.gfuzz.de/oliver.peter/libbase64

↪ Lessons learned

Bit shifting is pretty fast, computers are made for these kind of things :-)
An early version of my encoding function casted a char array of size 3 (24 bits) to a unsigned 32bit integer pointer. The missing 8bits get filled up randomly and lead to unexpected behaviour. Although, casting from larger to smaller memory space is OK.
Always expecting valid input is a bad idea and breaks things sooner or later. Handling invalid input properly makes about one third of the whole program.
“goto” instructions have a bad reputation, I use them very seldomly, but if the jump point is just a few lines away from the instruction I think it’s okay to use them

hoschi@gfuzz:~$

Archive

About

RSS

RFC 4648: Base64 Encoding, Tutorial in C

27 Sep 2018

↪ Background

↪ Encoding

↪ Code

↪ Explanation

↪ Code snippet

↪ Explanation

↪ Example in C

↪ Explanation

↪ Decoding

↪ Explanation

↪ The decoding part

↪ Explanation

↪ Who said mathematics?

↪ Code

↪ Lessons learned