RFC 4648: Base64 Encoding, Tutorial in C
27 Sep 2018
↪ Background
Base64 encoding can be found in many modern applications like browsers or e-mail Clients. Stating RFC 4648, base64 “is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII”. However, it is also very handy in case you want to edit binary data with your text editor, i.e. adding your Public SSH Key to the authorized_key file on a remote server or if you want to store a JPEG file within a text database.
About a year ago I thought this might be a good example to practice my C skills a bit more, now I had a bit of time to rework the code and write a little article about it.
It requires some basic knowledge about bit shifting and bit masks but there are good articles and a couple of reference implementations (GNU coreutils, mutt) out there, hence base64 has got everything for a little C exercise.
↪ Encoding
We will use the following 64+1 character table, the RFC states some other variants, like URL encoding, but we won’t care about that for the moment.
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
The first 64 charactes are used for encoding, the 65th “=” is used for padding the output whose input is not a multiple of 3.
In the example we use the string “gfuzz” to describe the encoding function, please note that you can use any bytes as input, not only US-ASCII but also non printable characters.
1 g => 103 -> 0x67 -> 01100111
2 f => 102 -> 0x66 -> 01100110
3 u => 117 -> 0x75 -> 01110101
4 z => 122 -> 0x7A -> 01111010
5 z => 122 -> 0x7A -> 01111010
On a binary level we join the first 3 characters together to form:
g f u
01100111 01100110 01110101
These 24bits will be divided into 4*6 bits, forming:
011001 110110 011001 110101
25 54 25 53
Here is a C example, input is a fixed char array of length 3.
↪ Code
uint32_t element = 0;
for(int i = 0; i < len; i++)
{
element <<= 8;
element |= (input[i] & 0xFF);
}
↪ Explanation
- Shift a 32bit integer by 8 bits to the left
- XOR the current input char into the 32bit integer
- Limit the bitmask to 0xFF (1111 1111), otherwise it breaks input of non-printable characters.
These resulting numbers will be the character positions of our base64 input table from above:
011001 110110 011001 110101 25 54 25 53 Z 2 Z 1
↪ Code snippet
for(int i = 24; i > (24-(8*len)); i-=6)
{
uint8_t bi = ((element >> (i-6)) & 0x3F);
output[index++] = BASE64TABLE[bi];
}
↪ Explanation
- Right shift the concatenated input ‘element’ by the corresponding sextet to extract the needed 6 bits (0011 1111 / 0x3F) for the base64 index
- Apply the resulting index to the table, remember character in output and number of base64 characters (in variable ‘index’)
We continue with the remaining 2 chars ‘zz’ from our input to finish
the encoding while demonstrating padding:
z z 01111010 01111010 00000000 011110 100111 101000 000000 30 39 40 e n o =
Join the characters together and add a 0-byte at the end of the input to fill up to 24bits. Divided by 4 we retrieve 3 base64 characters from our table, the padded last sixted will be replaced by the 65th character from our base64 table which is ‘=’.
↪ Example in C
while(index < 4)
output[index++] = BASE64TABLE[64];
↪ Explanation
- Since we remember the number of encoded characters we can easily add the needed padding characters from the base64 table. Note that having a 1 byte input would result in two padding characters, thus the single character G becomes Rw== in base64.
The final encoded output string is ‘Z2Z1eno=’ (8 characters).
Please also note that my final implementation does implicit line feeding after 76 characters, contrary to the RFC:
“Implementations MUST NOT add line feeds to base-encoded data unless the specification referring to this document explicitly directs base encoders to add line feeds after a specific number of characters.”
↪ Decoding
Once we have done the encoding part, the decoding part is quite easy, just revert the functions above. Lets take a base64 input with a multiple of 4, i.e. the output from the previous example “Z2Z1eno=”.
The only difference is that this time we take each of the 4 character input block and do a numerical lookup in the base64 table, then rejoin the resulting numbers to a 3 byte block to retrieve the initial plain text:
Z 2 Z 1
25 54 25 53
011001 110110 011001 110101
01100111 01100110 01110101
103 102 117
g f u
The table lookup
static const char *BASE64TABLE =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
for(int i = 0; i < 66; i++)
if(BASE64TABLE[i] == c)
return i;
return -1;
↪ Explanation
- Iterate over the base64 character table, in case of a match we return the postion i
- In case of a non-base64 character (aka ‘invalid input’) we return -1 to indicate the error
↪ The decoding part
for(int i = 0; i < 4; i++)
{
if(input[i] != '=')
len++;
base64_table = getIndex(input[i]);
if(base64_table == -1)
return -1; /* indicates invalid input */
element <<= 6;
element |= (base64_table & 0x3F);
}
for(int i = 24; i > (24-(6*len)); i-=8)
output[index++] = (element >> (i-8)) & 0xFF;
↪ Explanation
- For each 6bit input block we do a lookup, left-shift element by 6 bits and XOR the input sextet
- We shift the input byte block by the appropriate multiple of 8 to take out the single output characters via bitmask 0xFF We need to take account of padding characters ‘=’ and skip them appropriately.
↪ Who said mathematics?
In most cases the base64 produces an overhead of ~33%. Including padding, the total size of an input of n bytes can described by: \(4 \lceil \frac{1}{3}n \rceil\)
The brackets mean that we round up, so return the next upper integer of the resulting real number.
↪ Code
A basic library with a reference program that encodes and decodes (-d) data from stdin can be found here https://gogs.gfuzz.de/oliver.peter/libbase64
↪ Lessons learned
- Bit shifting is pretty fast, computers are made for these kind of things :-)
- An early version of my encoding function casted a char array of size 3 (24 bits) to a unsigned 32bit integer pointer. The missing 8bits get filled up randomly and lead to unexpected behaviour. Although, casting from larger to smaller memory space is OK.
- Always expecting valid input is a bad idea and breaks things sooner or later. Handling invalid input properly makes about one third of the whole program.
- “goto” instructions have a bad reputation, I use them very seldomly, but if the jump point is just a few lines away from the instruction I think it’s okay to use them