Increase/decrease audio play speed of AudioInputStream with Java

核能气质少年 提交于 2019-12-01 02:20:31

Short answer:

For speeding up a single person speaking, use my Sonic.java native Java implementation of my Sonic algorithm. An example of how to use it is in Main.Java. A C-language version of the same algorithm is used by Android's AudioTrack. For speeding up music or movies, find a WSOLA based library.

Bloated answer:

Speeding up speech is more complex than it sounds. Simply increasing the sample rate without adjusting the samples will cause speakers to sound like chipmunks. There are basically two good schemes for linearly speeding up speech that I have listened to: fixed-frame based schemes like WSOLA, and pitch-synchronous schemes like PICOLA, which is used by Sonic for speeds up to 2X. One other scheme I've listened to is FFT-based, and IMO those implementations should be avoided. I hear rumor that FFT-based can be done well, but no open-source version I am aware of was usable the last time I checked, probably in 2014.

I had to invent a new algorithm for speeds greater than 2X, since PICOLA simply drops entire pitch periods, which works well so long as you don't drop two pitch periods in a row. For faster than 2X, Sonic mixes in a portion of samples from each input pitch period, retaining some frequency information from each. This works well for most speech, though some languages such as Hungarian appear to have parts of speech so short that even PICOLA mangles some phonemes. However, the general rule that you can drop one pitch period without mangling phonemes seems to work well most of the time.

Pitch-synchronous schemes focus on one speaker, and will generally make that speaker clearer than fixed-frame schemes, at the expense of butchering non-speech sounds. However, the improvement of pitch synchronous schemes over fixed-frame schemes is hard to hear at speeds less than about 1.5X for most speakers. This is because fixed-frame algorithms like WSOLA basically emulate pitch synchronous schems like PICOLA when there is only one speaker and no more than one pitch period needs to be dropped per frame. The math works out basically the same in this case if WSOLA is tuned well to the speaker. For example, if it is able to select a sound segment of +/- one frame in time, then a 50ms fixed frame will allow WSOLA to emulate PICOLA for most speakers who have a fundamental pitch > 100 Hz. However, a male with a deep voice of say 95 Hz would be butchered with WSOLA using those settings. Also, parts of speech, such as at the end of a sentence, where our fundamental pitch drops significantly can also be butchered by WSOLA when parameters are not optimally tuned. Also, WSOLA generally falls apart for speeds greater than 2X, where like PICOLA, it starts dropping multiple pitch periods in a row.

On the positive side, WSOLA will make most sounds including music understandable, if not high fidelity. Taking non-harmonic multi-voice sounds and changing the speed without introducing substantial distortion is impossible with overlap-and-add (OLA) schemes like WSOLA and PICOLA. Doing this well would require separating the different voices, changing their speeds independently, and mixing the results together. However, most music is harmonic enough to sound OK with WSOLA.

It turns out that the poor quality of WSOLA at > 2X is one reason folks rarely listen at higher speeds than 2X. Folks simply don't like it. Once Audible.com switched from WSOLA to a Sonic-like algorithm on Android, they were able to increase the supported speed range from 2X to 3X. I haven't listened on iOS in the last few years, but as of 2014, Audible.com on iOS was miserable to listen to at 3X speed, since they used the built-in iOS WSOLA library. They've likely fixed it since then.

Looking at the library you linked, it doesn't seem like a good place to start specifically for this playback speed issue; is there any reason you aren't using AudioTrack? It seems to support everything you need.

EDIT 1: AudioTrack is Android specific, but the OP's question is Desktop javaSE based; I will only leave it here for future reference.

1. Using AudioTrack and adjusting playback speed (Android)

Thanks to an answer on another SO post (here), there is a class posted which uses the built in AudioTrack to handle speed adjustment during playback.

public class AudioActivity extends Activity {
AudioTrack audio = new AudioTrack(AudioManager.STREAM_MUSIC,
        44100,
        AudioFormat.CHANNEL_OUT_STEREO,
        AudioFormat.ENCODING_PCM_16BIT,
        intSize, //size of pcm file to read in bytes
        AudioTrack.MODE_STATIC);

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        //read track from file
        File file = new File(getFilesDir(), fileName);

        int size = (int) file.length();
        byte[] data = new byte[size];

        try {
            FileInputStream fileInputStream = new FileInputStream(file);
            fileInputStream.read(data, 0, size);
            fileInputStream.close();

            audio.write(data, 0, data.length);
        } catch (IOException e) {}
    }

    //change playback speed by factor
    void changeSpeed(double factor) {
        audio.setPlaybackRate((int) (audio.getPlaybackRate() * factor));
    }
}

This just uses a file to stream the whole file in one write command, but you could adjust it otherwise (the setPlayBackRate method is the main part you need).

2. Applying your own playback speed adjustment

The theory of adjusting playback speed is with two methods:

  • Adjust the sample rate
  • Change the number of samples per unit time

Since you are using the initial sample rate (because I'm assuming you have to reset the library and stop the audio when you adjust the sample rate?), you will have to adjust the number of samples per unit time.

For example, to speed up an audio buffer's playback you can use this pseudo code (Python-style), found thanks to Coobird (here).

original_samples = [0, 0.1, 0.2, 0.3, 0.4, 0.5]

def faster(samples):
    new_samples = []
    for i = 0 to samples.length:
        if i is even:
            new_samples.add(0.5 * (samples[i] + samples[i+1]))
    return new_samples

faster_samples = faster(original_samples)

This is just one example of speeding up the playback and is not the only algorithm on how to do so, but one to get started on. Once you have calculated your sped up buffer you can then write this to your audio output and the data will playback at whatever scaling you choose to apply.

To slow down the audio, apply the opposite by adding data points between the current buffer values with interpolation as desired.

Please note that when adjusting playback speed it is often worth low pass filtering at the maximum frequency desired to avoid unnecessary artifacts.


As you can see the second attempt is a much more challenging task as it requires you implementing such functionality yourself, so I would probably use the first but thought it was worth mentioning the second.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!