I want to transcode a large file using ffmpeg and store the result directly on AWS S3. This will be done inside an AWS Lambda that has limited tmp space.
Since the goal is to take a stream of bytes from S3 and output it to S3 as well, it is not necessary to use the HTTP capabilities of ffmpeg. Because ffmpeg is a command line tool that can take its input from stdin and write its output to stdout, it is simpler to use these capabilities than to have ffmpeg handle the HTTP reading/writing. You just have to connect an HTTP stream (that reads from S3) to ffmpeg's stdin and connect its stdout to another stream (that writes to S3). See the ffmpeg documentation on the pipe protocol for more information on ffmpeg piping.
The simplest implementation would look like this:
using System.Diagnostics;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;

var s3Client = new AmazonS3Client(RegionEndpoint.USEast1);
var startInfo = new ProcessStartInfo
{
    FileName = "ffmpeg",
    // Read the input from stdin (pipe:0) and write MP3 to stdout (pipe:1).
    Arguments = "-i pipe:0 -y -vn -ar 44100 -ab 192k -f mp3 pipe:1",
    CreateNoWindow = true,
    UseShellExecute = false,
    RedirectStandardInput = true,
    RedirectStandardOutput = true,
};
using (var process = new Process { StartInfo = startInfo })
{
    // Get a stream to an object stored on S3.
    var s3InputObject = await s3Client.GetObjectAsync(new GetObjectRequest
    {
        BucketName = "my-bucket",
        Key = "input.wav",
    });
    process.Start();
    // Start storing the output of ffmpeg directly on S3. The upload runs
    // concurrently because the task is not awaited yet.
    var uploadTask = s3Client.PutObjectAsync(new PutObjectRequest
    {
        BucketName = "my-bucket",
        Key = "output.mp3",
        InputStream = process.StandardOutput.BaseStream,
    });
    // Feed the S3 input stream into ffmpeg, then close stdin to signal end of input.
    await s3InputObject.ResponseStream.CopyToAsync(process.StandardInput.BaseStream);
    process.StandardInput.Close();
    // Wait for the upload and for ffmpeg to be done.
    await uploadTask;
    process.WaitForExit();
}
This snippet gives an idea of how to pipe the input/output of ffmpeg.
Unfortunately, this code does not work. The call to PutObjectAsync throws an exception that says "Could not determine content length". Indeed, S3 only accepts an upload of known size through PutObjectAsync, and we don't know how big the output of ffmpeg will be.
The way to work around this is to use S3 multipart upload. Instead of feeding the output of ffmpeg directly to S3, you write it to a memory buffer (let's say 25 MB) that is small enough not to consume all the memory of the AWS Lambda that runs this code. When the buffer is full, you upload it to S3 as one part of a multipart upload. Once ffmpeg is done transcoding the input file, you upload whatever is left in the buffer as the last part and then simply call CompleteMultipartUpload. This takes all the 25 MB parts and merges them into a single object.
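The buffering steps above can be sketched with the AWS CLI's low-level s3api commands. This is an illustration of the idea rather than the implementation mentioned in this answer: the function name upload_stdin_multipart is hypothetical, each part is buffered in a temporary file instead of memory (because --body takes a file path), and it assumes GNU head, which reads exactly the requested number of bytes from a pipe. S3 requires every part except the last to be at least 5 MiB.

```shell
# Sketch: read stdin in fixed-size parts and upload each one through the
# S3 multipart API. upload_stdin_multipart is a hypothetical helper name.
# $1 = bucket, $2 = key, $3 = part size in bytes (default 25 MiB).
upload_stdin_multipart() {
  local bucket="$1" key="$2"
  local part_size="${3:-26214400}"
  local part_file upload_id etag parts="" n=0 ok=1

  part_file="$(mktemp)" || return 1
  upload_id="$(aws s3api create-multipart-upload --bucket "$bucket" --key "$key" \
    --query UploadId --output text < /dev/null)" || { rm -f "$part_file"; return 1; }

  # Fill the buffer file with the next part; stop once stdin is exhausted.
  while head -c "$part_size" > "$part_file" && [ -s "$part_file" ]; do
    n=$((n + 1))
    etag="$(aws s3api upload-part --bucket "$bucket" --key "$key" \
      --upload-id "$upload_id" --part-number "$n" --body "$part_file" \
      --query ETag --output text < /dev/null)" || { ok=0; break; }
    # Drop the quotes S3 puts around the ETag so they can be re-added as JSON.
    etag="$(printf '%s' "$etag" | tr -d '"')"
    parts="${parts:+$parts,}{\"PartNumber\":$n,\"ETag\":\"$etag\"}"
  done
  rm -f "$part_file"

  if [ "$ok" -eq 1 ] && [ "$n" -gt 0 ]; then
    # Ask S3 to merge the uploaded parts into a single object.
    aws s3api complete-multipart-upload --bucket "$bucket" --key "$key" \
      --upload-id "$upload_id" --multipart-upload "{\"Parts\":[$parts]}" < /dev/null
  else
    # A part failed or stdin was empty: abort so no orphaned parts remain.
    aws s3api abort-multipart-upload --bucket "$bucket" --key "$key" \
      --upload-id "$upload_id" < /dev/null
    return 1
  fi
}
```

With placeholder names, it would be used as the last stage of the pipe: ffmpeg -i pipe:0 -vn -f mp3 pipe:1 | upload_stdin_multipart my-bucket output.mp3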
That's it. With this strategy it is possible to read a file from S3, transcode it and store the result in S3 on the fly without storing anything locally. It is therefore possible to transcode large files in an AWS Lambda that uses very little memory and virtually no disk space.
This was implemented successfully. I will try to see if this code can be shared.
Warning: as mentioned in a comment, the result is not 100% identical when we stream the output of ffmpeg compared to when we let ffmpeg write to a local file itself. When writing to a local file, ffmpeg has the ability to seek back to the beginning of the file when it is done transcoding. It can then update the file metadata with some results of the transcoding. I don't know what the impact of not having this updated metadata is.
The AWS CLI actually has a feature for doing exactly what @mabead described above. Since the CLI is not installed in Lambda by default, you will need to include it, probably as a layer, but if you already have ffmpeg installed, you obviously know how to do this.
Basically, it looks like this (without ffmpeg options):
aws s3 cp s3://source-bucket/source.mp4 - | ffmpeg -i - -f matroska - | aws s3 cp - s3://dest-bucket/output.mkv
You can use a dash ('-') in place of either the source or the destination filename in both the CLI and ffmpeg commands. So in this case, we're saying: read from S3 into STDOUT, pipe that to ffmpeg's STDIN, write the ffmpeg output to STDOUT, and pipe that to the S3 destination.
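Combining this pipeline with the audio options from the question might look like the sketch below. The function name transcode_to_mp3 and the bucket names in the usage line are placeholders. pipefail is enabled where the shell supports it, so that a failure in the download or in ffmpeg is reported instead of only the exit status of the final upload.

```shell
# Enable pipefail where the shell supports it (e.g. bash), so a failure in
# any stage of the pipeline makes the whole pipeline fail.
(set -o pipefail) 2>/dev/null && set -o pipefail

# Hypothetical helper: stream an S3 object through ffmpeg and back to S3
# without touching local disk. $1 = source S3 URI, $2 = destination S3 URI.
transcode_to_mp3() {
  local src="$1" dst="$2"
  aws s3 cp "$src" - \
    | ffmpeg -hide_banner -loglevel error -i pipe:0 -vn -ar 44100 -b:a 192k -f mp3 pipe:1 \
    | aws s3 cp - "$dst"
}

# Usage (placeholder bucket names):
#   transcode_to_mp3 s3://source-bucket/input.wav s3://dest-bucket/output.mp3
```

According to the CLI documentation, --expected-size should be added to the upload side when the stream is larger than 50 GB.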
I generally only work with video files, so I don't have much experience with straight audio; you'll have to give it a shot. One thing I have noticed is that certain container formats don't work for the output side of this. For example, if I try to write to an mp4 file in S3, I see the following error:
muxer does not support non seekable output
Could not write header for output file #0 (incorrect codec parameters ?): Invalid argument
Error initializing output stream 0:0 --
I think this is probably the same issue as the comment about not being able to update the header with results of the final encode. You'll have to see what happens with mp3.
I use the ffmpeg pipe access protocol like @mabead mentioned in his answer and everything works fine. I actually target the input file via URL and it seems to work. .mp4 will cause some issues because ffmpeg needs to be able to seek back to the beginning of the output to write headers after encoding is finished. Adding -movflags frag_keyframe+empty_moov fixed that for my use case. Hope this code helps:
ffmpeg -i https://notreal-bucket.s3-us-west-1.amazonaws.com/video/video.mp4 -f mp4 -movflags frag_keyframe+empty_moov pipe:1 | aws s3 cp - s3://notreal-bucket/video/output.mp4
ffmpeg docs - pipe