I am interested in the following:
Is there a list of characters that would never occur as part of a base 64 encoded string?
For example *
Here is what I could turn up: RFC 4648
It includes this convenient table:
Table 1: The Base 64 Alphabet

 Value Encoding  Value Encoding  Value Encoding  Value Encoding
     0 A            17 R            34 i            51 z
     1 B            18 S            35 j            52 0
     2 C            19 T            36 k            53 1
     3 D            20 U            37 l            54 2
     4 E            21 V            38 m            55 3
     5 F            22 W            39 n            56 4
     6 G            23 X            40 o            57 5
     7 H            24 Y            41 p            58 6
     8 I            25 Z            42 q            59 7
     9 J            26 a            43 r            60 8
    10 K            27 b            44 s            61 9
    11 L            28 c            45 t            62 +
    12 M            29 d            46 u            63 /
    13 N            30 e            47 v
    14 O            31 f            48 w         (pad) =
    15 P            32 g            49 x
    16 Q            33 h            50 y
So a regular expression that matches any character that should never appear in Base 64 encodings would be:
[^A-Za-z0-9+/=]
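As a rough illustration, the character class above can be used to list the offending characters in a string. This is only a sketch assuming the standard RFC 4648 alphabet; the helper name `find_invalid_chars` is made up for this example.

```python
import re

# Matches any single character outside the standard alphabet and the "=" pad.
NON_BASE64 = re.compile(r"[^A-Za-z0-9+/=]")

def find_invalid_chars(text):
    """Return every character in text that standard Base64 never produces."""
    return NON_BASE64.findall(text)

print(find_invalid_chars("Kg=="))    # []
print(find_invalid_chars("Kg*=="))   # ['*']
print(find_invalid_chars("a-b_c"))   # ['-', '_']
```

Note that the last call flags `-` and `_`, so a string produced with the URL-safe alphabet would be reported as invalid by this particular pattern.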
However, as kapep's answer points out, this is only the recommendation. Specific implementations might choose a different set of 64 characters. (In fact, even the linked RFC contains an alternative table for URL- and filename-safe encoding, which replaces characters 62 and 63 with - and _ respectively.) So it really depends on the implementation that created the encoding.
You are probably safe with the other answers in most situations, but according to the Wikipedia article on Base64 there is no definitive list you can rely on:
The particular choice of character set selected for the 64 characters required for the base varies between implementations.
RFC 4648 mentions other alphabets, such as the "URL and Filename safe" Base 64 Alphabet, where + and / are replaced with - and _.
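To see the two alphabets side by side, here is a small sketch using Python's standard `base64` module. The input bytes are chosen purely for illustration so that values 62 and 63 both show up in the output.

```python
import base64

# 0xFB 0xFF splits into the 6-bit groups 62, 63, 60 (the last one zero-padded).
data = bytes([0xFB, 0xFF])

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")

print(standard)  # +/8=
print(url_safe)  # -_8=
```

The same input therefore yields different "forbidden" character sets depending on which alphabet the encoder used.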
There's a table of Base64 variants which use different characters. Keep in mind that there are implementation-specific rules about line separators, which you can find in the same table. Some implementations, like MIME, even allow (and ignore) characters that are not in the alphabet.
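Python's own decoder is one example of this lenient behaviour, so it can serve as a sketch of the idea (it is not a MIME implementation as such). By default `base64.b64decode` silently discards characters outside the alphabet; passing `validate=True` makes it reject them instead.

```python
import base64
import binascii

# A line break and a stray "!" are dropped before decoding in the default mode.
print(base64.b64decode("K\r\ng=="))  # b'*'
print(base64.b64decode("K!g=="))     # b'*'

# Strict mode refuses input containing non-alphabet characters.
try:
    base64.b64decode("K!g==", validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True

print(strict_rejected)  # True
```

So whether a character "cannot occur" depends not only on the encoder's alphabet but also on how forgiving the decoder is.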
Base64 only contains A–Z, a–z, 0–9, +, / and =.
So the list of characters not to be used is: all possible characters minus the ones mentioned above.
For special purposes . and _ are possible, too.
https://en.wikipedia.org/wiki/Base64#Design
MIME's Base64 implementation uses A–Z, a–z, and 0–9 for the first 62 values
So for the most part you should expect only alphanumeric characters. The example table in that article also shows '+' and '/' (plus '=' for padding); it's unlikely you would see '*'.
For example, you can use http://www.motobit.com/util/base64-decoder-encoder.asp to convert to Base64; for the input '*' it returns "Kg==".
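The same result can be reproduced locally without an online tool. A minimal sketch with Python's standard library, assuming the input is the single ASCII byte `*`:

```python
import base64

# "*" is 0x2A = 00101010. The first six bits (001010 = 10) map to "K";
# the remaining bits, zero-padded (100000 = 32), map to "g"; "==" pads the quad.
encoded = base64.b64encode(b"*").decode("ascii")

print(encoded)          # Kg==
print("*" in encoded)   # False
```

The asterisk is valid as input, but it never appears in the encoded output of the standard alphabet.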