Maybe a long shot, this...
I have a draw.io network diagram where each node refers to an image and accompanying audio track (here). Currently I simply have the audio em