There is absolutely no correlation between the binary values that encode the letters of an English word and the PCM values that would make up a sampled version of the sound of someone saying that word.
If you want to play back the sound of someone saying "red", you will first have to sample it and store the resulting bits somewhere, then feed them to your output at an appropriate bitrate. The sampled sound is likely to be a lot larger than just the ASCII representation of "red" (which is 24 bits).
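To put rough numbers on that size difference, here is a small sketch. The recording parameters (half a second of speech, 8 kHz sample rate, 8-bit mono, uncompressed) are assumptions chosen only for illustration, and the function names are made up for this example; a real recording could be longer or use a higher resolution.

```python
def ascii_bits(word: str) -> int:
    """Bits needed to store the word as 8-bit ASCII characters."""
    return len(word.encode("ascii")) * 8


def pcm_bits(duration_s: float, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bits needed for uncompressed mono PCM of the given duration."""
    return round(duration_s * sample_rate_hz) * bits_per_sample


# Assumed figures, not measurements: 0.5 s of speech at 8 kHz, 8 bits per sample.
text_size = ascii_bits("red")
audio_size = pcm_bits(0.5, 8000, 8)

print(f"ASCII: {text_size} bits, PCM: {audio_size} bits")
```

Under these assumptions the text takes 24 bits while the sampled audio takes 32000 bits, so the sample is over a thousand times larger than the word it represents.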
There are integrated chips that contain such samples and can actually generate sound when given an ASCII-encoded word; one example is this one. Unless you have such a chip connected to your MCU, your question does not make a lot of sense.