hoschi@gfuzz:~$

RFC 4648: Base64 Encoding, Tutorial in C

Background

Base64 encoding can be found in many modern applications like browsers or e-mail Clients. Stating RFC 4648, base64 “is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII”. However, it is also very handy in case you want to edit binary data with your text editor, i.e. adding your Public SSH Key to the authorized_key file on a remote server or if you want to store a JPEG file within a text database.

About a year ago I thought this might be a good example to practice my C skills a bit more, now I had a bit of time to rework the code and write a little article about it.

It requires some basic knowledge about bit shifting and bit masks but there are good articles and a couple of reference implementations (GNU coreutils, mutt) out there, hence base64 has got everything for a little C exercise.

Encoding

We will use the following 64+1 character table, the RFC states some other variants, like URL encoding, but we won’t care about that for the moment.

"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

The first 64 charactes are used for encoding, the 65th “=” is used for padding the output whose input is not a multiple of 3.

In the example we use the string “gfuzz” to describe the encoding function, please note that you can use any bytes as input, not only US-ASCII but also non printable characters.

    1 g => 103 -> 0x67 -> 01100111
    2 f => 102 -> 0x66 -> 01100110
    3 u => 117 -> 0x75 -> 01110101
    4 z => 122 -> 0x7A -> 01111010
    5 z => 122 -> 0x7A -> 01111010

On a binary level we join the first 3 characters together to form:

        g        f        u
    01100111 01100110 01110101

These 24bits will be divided into 4*6 bits, forming:

    011001 110110 011001 110101
      25     54     25     53

Here is a C example, input is a fixed char array of length 3.

Code

    uint32_t element = 0;
    for(int i = 0; i < len; i++)
    {
        element <<= 8;
        element |= (input[i] & 0xFF);
    }

Explanation

  • Shift a 32bit integer by 8 bits to the left
  • XOR the current input char into the 32bit integer
  • Limit the bitmask to 0xFF (1111 1111), otherwise it breaks input of non-printable characters. These resulting numbers will be the character positions of our base64 input table from above:
      011001 110110 011001 110101
        25     54     25     53
        Z       2      Z      1
    

Code snippet

    for(int i = 24; i > (24-(8*len)); i-=6)
    {
        uint8_t bi = ((element >> (i-6)) & 0x3F);
        output[index++] = BASE64TABLE[bi];
    }

Explanation

  • Right shift the concatenated input ‘element’ by the corresponding sextet to extract the needed 6 bits (0011 1111 / 0x3F) for the base64 index
  • Apply the resulting index to the table, remember character in output and number of base64 characters (in variable ‘index’) We continue with the remaining 2 chars ‘zz’ from our input to finish the encoding while demonstrating padding:
          z        z      
      01111010 01111010 00000000
    
      011110 100111 101000 000000
         30    39     40     
          e     n      o     =
    

    Join the characters together and add a 0-byte at the end of the input to fill up to 24bits. Divided by 4 we retrieve 3 base64 characters from our table, the padded last sixted will be replaced by the 65th character from our base64 table which is ‘=’.

Example in C

    while(index < 4)
        output[index++] = BASE64TABLE[64];

Explanation

  • Since we remember the number of encoded characters we can easily add the needed padding characters from the base64 table. Note that having a 1 byte input would result in two padding characters, thus the single character G becomes Rw== in base64.

The final encoded output string is ‘Z2Z1eno=’ (8 characters).

Please also note that my final implementation does implicit line feeding after 76 characters, contrary to the RFC:

“Implementations MUST NOT add line feeds to base-encoded data unless the specification referring to this document explicitly directs base encoders to add line feeds after a specific number of characters.”

Decoding

Once we have done the encoding part, the decoding part is quite easy, just revert the functions above. Lets take a base64 input with a multiple of 4, i.e. the output from the previous example “Z2Z1eno=”.

The only difference is that this time we take each of the 4 character input block and do a numerical lookup in the base64 table, then rejoin the resulting numbers to a 3 byte block to retrieve the initial plain text:

       Z      2      Z      1
      25     54     25     53
    011001 110110 011001 110101
    
    01100111 01100110 01110101
      103      102      117
       g        f        u

The table lookup

    static const char *BASE64TABLE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

    for(int i = 0; i < 66; i++)
        if(BASE64TABLE[i] == c)
            return i;
    return -1;

Explanation

  • Iterate over the base64 character table, in case of a match we return the postion i
  • In case of a non-base64 character (aka ‘invalid input’) we return -1 to indicate the error

The decoding part

    for(int i = 0; i < 4; i++)
    {
        if(input[i] != '=')
            len++;
        base64_table = getIndex(input[i]);
        if(base64_table == -1)
            return -1;    /* indicates invalid input */
        element <<= 6;
        element |= (base64_table  & 0x3F);
    }
    for(int i = 24; i > (24-(6*len)); i-=8)
        output[index++] = (element >> (i-8)) & 0xFF;

Explanation

  • For each 6bit input block we do a lookup, left-shift element by 6 bits and XOR the input sextet
  • We shift the input byte block by the appropriate multiple of 8 to take out the single output characters via bitmask 0xFF We need to take account of padding characters ‘=’ and skip them appropriately.

Who said mathematics?

In most cases the base64 produces an overhead of ~33%. Including padding, the total size of an input of n bytes can described by: \(4 \lceil \frac{1}{3}n \rceil\)

The brackets mean that we round up, so return the next upper integer of the resulting real number.

Code

A basic library with a reference program that encodes and decodes (-d) data from stdin can be found here https://gogs.gfuzz.de/oliver.peter/libbase64

Lessons learned

  • Bit shifting is pretty fast, computers are made for these kind of things :-)
  • An early version of my encoding function casted a char array of size 3 (24 bits) to a unsigned 32bit integer pointer. The missing 8bits get filled up randomly and lead to unexpected behaviour. Although, casting from larger to smaller memory space is OK.
  • Always expecting valid input is a bad idea and breaks things sooner or later. Handling invalid input properly makes about one third of the whole program.
  • “goto” instructions have a bad reputation, I use them very seldomly, but if the jump point is just a few lines away from the instruction I think it’s okay to use them