Digital audio is stored as a sequence of numbers, called samples. Example:
5, 18, 6, -4, -12, -3, 7, 14, 4
If you plot these numbers as points on a Cartesian graph, the sample value determines the position along the Y axis, and the sample's sequence number (0, 1, 2, 3, etc.) determines the position along the X axis. The X axis is just a monotonically increasing number line.
Now trace a line through the points you've just plotted.
Congratulations, you have just rendered the waveform of your digital audio. :-)
The Y axis is amplitude and the X axis is time.
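To make that concrete, here is a minimal sketch in Python that turns the example samples into (X, Y) points ready to plot. The variable names are illustrative, not from any particular library; a plotting package (matplotlib, for instance) could then draw the connecting line.

```python
# The nine samples from the example above.
samples = [5, 18, 6, -4, -12, -3, 7, 14, 4]

# Pair each sample with its sequence number:
# X is the index (time), Y is the sample value (amplitude).
points = list(enumerate(samples))

print(points[:3])  # the first three plotted points: (0, 5), (1, 18), (2, 6)
```

Tracing a line through `points` in order gives exactly the waveform described above.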
"Sample rate" determines how quickly the playback device (e.g. a soundcard) advances through the samples. This is the "time value" of a sample. For example, CD-quality digital audio traverses 44,100 samples every second, reading the amplitude (Y axis value) at every sample point.
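The relationship between sample index and time is just division by the sample rate. A small sketch (the function name `sample_time` is mine, chosen for illustration):

```python
SAMPLE_RATE = 44_100  # CD quality: samples per second

def sample_time(n, rate=SAMPLE_RATE):
    """Time in seconds at which sample number n is played."""
    return n / rate

# Sample 44,100 plays exactly one second in; sample 22,050 at half a second.
print(sample_time(44_100))  # 1.0
print(sample_time(22_050))  # 0.5
```

The same arithmetic runs in reverse: a three-minute CD-quality track contains 180 × 44,100 = 7,938,000 samples per channel.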
† The discussion above ignores compression. Compression changes little about the essential nature of digital audio, much like zipping up a bitmap image doesn't change the core nature of the image. (The topic of audio compression is a rich one - I don't mean to oversimplify it, it's just that all compressed audio is eventually uncompressed before it is rendered -- that is, played as audible sound or drawn as a waveform -- at which point its compressed origins are of little consequence.)