I want to capture the spoken word and transcribe it to text "internally" (device audio streaming while listening to a podcast or watching a YouTube video) and "e