Video transcoding instance performance test

KakaoCloud offers the vt1a instance type, designed to provide exceptional performance for real-time video transcoding tasks. In this tutorial, we will run an example script on vt1a instances to transcode files from H.264 to HEVC in parallel and compare the results across instance sizes.

info
  • Estimated time: 30 ~ 60 minutes
  • Recommended operating system: Ubuntu
  • Region: kr-central-2

Before you begin

In this tutorial, we will use KakaoCloud's video transcoding instance type, vt1a, to transcode a single input video file multiple times and produce one output file per run. The number of output files equals the number of inputs you specify.
The parallel processing capacity varies depending on the number of U30 accelerator cards allocated per vt1a instance size and the corresponding number of devices, resulting in different completion times.

Prerequisites

Create VPC and subnet

Before creating an instance, you must create the VPC and subnet where the instance will be created. If you don’t have a VPC and subnet, refer to the Create VPC and Create Subnet documents to create them.

Create key pair

A key pair must be created to access the instance via SSH connection. Refer to the Create key pair document, create a key pair, and store the private key issued at the time of creation.

Create security group

To secure the instance at the network level, create a security group and add appropriate inbound rules. Refer to the Create security group document, and add the following rules to the inbound rules.

Protocol   Source                 Port number   Rule description
TCP        {Your public IP}/32    22            Allow SSH access to instances connected to this security group from your local PC

Procedures

Step 1. Create instance

  1. In the KakaoCloud console, select Beyond Compute Service > Virtual Machine.

  2. Go to the Instance menu, then click [Create instance].

  3. Create a vt1a instance as shown below. In this example, we will create vt1a.4xlarge, vt1a.8xlarge, and vt1a.32xlarge instances for performance comparison.

    Item                Setting
    Basic information   - Name: Custom
                        - Count: 1
    Image               Select Ubuntu 22.04 under the Basic tab
                        - Refer to the SDK-supported OS information below
    Instance type       Create vt1a.4xlarge, vt1a.8xlarge, and vt1a.32xlarge
    Volume              - Root volume: Default settings
    Key pair            Select the pre-created key pair from the prerequisites
    Network             - VPC: Select the VPC created in the prerequisites
                        - Subnet: Select the subnet created in the prerequisites
                        - Security group: Select the security group created in the prerequisites

SDK-supported OS information

As of 2024, Xilinx Video SDK 3.0, the latest version, has been validated with the following OS and kernel versions:

  • Ubuntu 22.04 (Kernel 5.15)
  • Ubuntu 20.04.4 (Kernel 5.13)
  • Ubuntu 20.04.3 (Kernel 5.11)
  • Ubuntu 20.04.1 (Kernel 5.4)
  • Ubuntu 20.04.0 (Kernel 5.4)
  • Ubuntu 18.04.5 (Kernel 5.4)

Step 2. Associate public IP

Assign and attach a public IP to each of the created instances to enable SSH access.

  1. In the KakaoCloud console, select Beyond Compute Service > Virtual Machine.

  2. Go to the Instance menu, then select [More] > Associate public IP for the created instance.

  3. Click [Confirm] to complete the public IP attachment. Repeat this for each instance.

Step 3. Connect via SSH

To install the Xilinx Video SDK on the instance, you need to connect via SSH. This section explains how to do so.

  1. In the KakaoCloud console, select Beyond Compute Service > Virtual Machine.

  2. Go to the Instance menu, then select [More] > [Use SSH to connect] for the created instance to view the SSH command and settings.

  3. On your local PC, enter the following SSH connection command to connect to each instance. For detailed instructions on SSH connection, refer to Connect instance.

    ssh -i ${key-pair-name}.pem ubuntu@${public-ip-addr}

    Parameter        Description
    key-pair-name    The private key file name for the key pair specified when creating the instance
    public-ip-addr   The public IP address assigned and attached to the instance

Step 4. Install Xilinx Video SDK

Xilinx Video SDK is a software stack that enables users to efficiently leverage the hardware acceleration capabilities of Xilinx video codec units to support high-density, real-time video transcoding, such as live streaming video. This SDK includes precompiled versions of FFmpeg and GStreamer, integrating video transcoding plugins for Xilinx devices, providing hardware acceleration for video encoding, decoding, and video upscaling.

Before running the video transcoding example, you need to install the Xilinx Video SDK. Follow the Xilinx Video SDK installation guide to complete the installation.

Step 5. Set up the runtime environment

Refer to the Set up runtime environment guide to complete the runtime environment setup.

Now, the environment is ready to run the example.

Performance comparison by instance size

The Alveo U30 accelerator card supported by the vt1a instance features hardware-accelerated H.264/AVC and H.265/HEVC codecs, which support SDR (8-bit) and HDR (10-bit) profiles. These profiles are efficient for decoding on a wide range of end-user devices, from previous-generation mobile handsets to the latest generation with ultra-high-resolution displays.
Each U30 accelerator card contains two Xilinx devices. The vt1a.4xlarge, vt1a.8xlarge, and vt1a.32xlarge instances are equipped with 1, 2, and 8 U30 accelerator cards, respectively, which gives them 2, 4, and 16 devices.
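
The card-to-device relationship above is simple arithmetic, and a minimal bash sketch makes it explicit. The helper name `devices_for_cards` is hypothetical; only the figure of two devices per U30 card and the card counts per instance size come from this tutorial.

```bash
#!/bin/bash
# Hypothetical helper: number of Xilinx devices for a given number of U30 cards.
# Assumes two devices per card, as stated in this tutorial.
DEVICES_PER_CARD=2

devices_for_cards() {
    local cards=$1
    echo $((cards * DEVICES_PER_CARD))
}

# Card counts per instance size, taken from the text above.
for entry in "vt1a.4xlarge:1" "vt1a.8xlarge:2" "vt1a.32xlarge:8"; do
    size=${entry%%:*}    # text before the colon
    cards=${entry##*:}   # text after the colon
    echo "${size}: ${cards} card(s) -> $(devices_for_cards "${cards}") device(s)"
done
```

On a real instance, the detected device count should match these values; the example script in this tutorial reads it from `xbutil examine`.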

The script used in this tutorial takes a sample H.264 file and processes it based on the number of input files specified by the user, allocating them to the available devices for parallel processing. The output files are transcoded to HEVC and saved in the specified directory, with the time taken to complete the process displayed.

Since each vt1a instance has a different maximum number of devices, the number of files processed in parallel will vary for the same input, allowing you to compare processing times and observe the performance differences across instance sizes.
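
To see how inputs are spread over devices before running anything on hardware, the allocation can be sketched as a dry run. This is an illustrative sketch, not part of the tutorial's script: the function name `plan_batches` is hypothetical, and it only prints the assignment without starting ffmpeg. It assumes the same rule the example script uses, which is to fill devices 0 to N-1 and then start over at device 0 for the next batch.

```bash
#!/bin/bash
# Dry run: print which device index each input would be assigned to.
# No transcoding is started; this only mirrors the round-robin reset logic.
plan_batches() {
    local input_cnt=$1 dev_cnt=$2
    local dev_idx=0 batch=1 i
    for ((i = 0; i < input_cnt; i++)); do
        if [ "${dev_idx}" -eq "${dev_cnt}" ]; then
            # All devices are busy; the real script waits here for the batch to finish.
            dev_idx=0
            batch=$((batch + 1))
        fi
        echo "batch ${batch}: input ${i} -> device ${dev_idx}"
        dev_idx=$((dev_idx + 1))
    done
}

# Example: 6 inputs on 4 devices need two batches (4 inputs, then 2).
plan_batches 6 4
```

Calling `plan_batches 16 2` shows the 8 batches used on a 2-device instance, with the last input landing on device 1 in batch 8.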

Example script code

Modify the parameter values in the code below according to your environment and save it as a script file. The sample input file used in the example script can be downloaded from the link below. In this tutorial, the script file is named ffmpg_basic.sh.

Download sample input file ↗️

#!/bin/bash

# Require exactly two arguments: the number of inputs and the number of devices.
if [[ $# -ne 2 ]]; then
    echo "[ERROR] Incorrect arguments supplied."
    echo "Usage: $(basename "$0") <the number of inputs(int)> <the number of devices(int)>"
    exit 1
fi

INPUT_FILE=$HOME/videos/sample_4kp60fps.h264 # Path and filename of the sample H.264 file for the example
INPUT_CNT=$1 # Number of times the input file will be transcoded (number of output files)
INPUT_ARR=$(seq 0 $((INPUT_CNT-1))) # Sequential index numbers to be used in the output file names, based on the INPUT_CNT value
DEV_CNT=$2 # Number of devices to use

# Count the U30 devices detected on this instance and reject a larger request.
DETECTED_DEV_CNT=$(xbutil examine | grep -c xilinx_u30)
if [[ ${DEV_CNT} -gt ${DETECTED_DEV_CNT} ]]; then
    echo "[ERROR] Incorrect arguments supplied."
    echo "<the number of devices(int)> cannot be larger than the actual number of devices(${DETECTED_DEV_CNT})"
    exit 1
fi

DEV_IDX=0
for i in ${INPUT_ARR}
do
    # Once every device has a job, wait for the current batch to finish, then start again at device 0.
    if [ "${DEV_IDX}" -eq "${DEV_CNT}" ]; then
        echo
        echo "All devices are in use. Waiting until devices are available..."
        wait
        sleep 1
        echo
        DEV_IDX=0
    fi

    OUTPUT_FILE=$HOME/videos/output_hevc_${i}.mp4 # Path and filename where the processed output file will be saved
    LOG_FILE=$HOME/logs/log${i}.out # Path and filename where the log file for the processed output will be saved
    rm -f "${OUTPUT_FILE}" "${LOG_FILE}" # Remove results left over from a previous run
    # Decode H.264 and encode HEVC on device DEV_IDX, running in the background.
    cmd="nohup ffmpeg -xlnx_hwdev ${DEV_IDX} -c:v mpsoc_vcu_h264 -i ${INPUT_FILE} -f mp4 -c:v mpsoc_vcu_hevc -y ${OUTPUT_FILE} > ${LOG_FILE} 2>&1 &"
    echo "$cmd"
    echo
    eval "$cmd"
    DEV_IDX=$((DEV_IDX + 1))
done

# Wait for the final batch to finish.
wait
echo
echo "All Done."

Parameter     Description
INPUT_FILE    Path and filename of the sample H.264 file for the example
              - Example: INPUT_FILE=$HOME/videos/sample_4kp60fps.h264
INPUT_CNT     Number of times the input file will be transcoded (number of output files)
              - Example: INPUT_CNT=8
DEV_CNT       Number of devices to use
              - Example: DEV_CNT=2
              ⚠️ The value cannot exceed the maximum number of devices for the vt1a instance size. The script exits with an error if it does.
OUTPUT_FILE   Path and filename where the processed output file will be saved
              - Example: OUTPUT_FILE=$HOME/videos/output_hevc_${i}.mp4
              ⚠️ The directory specified in this value (e.g., $HOME/videos) must be created in advance for the script to run successfully.
LOG_FILE      Path and filename where the log file for the processed output will be saved
              - Example: LOG_FILE=$HOME/logs/log${i}.out
              ⚠️ The directory specified in this value (e.g., $HOME/logs) must be created in advance for the script to run successfully.
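
Because the output and log directories must exist before the script runs, a small preparation step can create them. The following is a minimal sketch; the `BASE_DIR` variable and the `prepare_dirs` function are hypothetical names, and the default of `$HOME` is chosen only to match the example paths above.

```bash
#!/bin/bash
# Create the directories the example script writes to.
# BASE_DIR is a hypothetical variable; it defaults to $HOME to match the
# example values of OUTPUT_FILE and LOG_FILE.
BASE_DIR=${BASE_DIR:-$HOME}

prepare_dirs() {
    local base=$1
    # -p creates missing parents and succeeds if the directory already exists.
    mkdir -p "${base}/videos" "${base}/logs"
}

prepare_dirs "${BASE_DIR}"
ls -d "${BASE_DIR}/videos" "${BASE_DIR}/logs"
```

The sample input file also needs to be placed in the `videos` directory so that the `INPUT_FILE` path resolves.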

Step 1. Run example script on vt1a.4xlarge

  1. Before running the example script on the vt1a.4xlarge instance, set the values for each parameter as shown below:

    Parameter    Value
    INPUT_FILE   $HOME/videos/sample_4kp60fps.h264
    INPUT_CNT    16
    DEV_CNT      2
  2. Since the maximum number of devices available on vt1a.4xlarge is 2, the 16 input files will be processed in 8 batches of 2 files each. Enter the following commands to set the parameter values and run the script:

    chmod +x ffmpg_basic.sh # Make the script executable
    INPUT_CNT=16
    DEV_CNT=2
    time bash ffmpg_basic.sh ${INPUT_CNT} ${DEV_CNT} # Run the script with 16 input files and 2 devices, and measure the elapsed time
  3. After the script file is executed, check the output results.

    ...

    All Done.

    real 1m17.518s
    user 0m3.671s
    sys 0m6.926s

    Information   Description
    real          Actual elapsed (wall-clock) time, measured from when the script starts to when it completes
    user          CPU time spent executing user-mode code, that is, code outside the kernel
    sys           CPU time spent executing code within the kernel, such as system calls
  4. To verify the actual processing time for all input files, check the real value.
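
To compare runs numerically, the `real` value can be converted to seconds. This is a small illustrative sketch rather than part of the tutorial's script: the function name `to_seconds` is hypothetical, and it assumes the default bash `time` format shown above (`<minutes>m<seconds>s`).

```bash
#!/bin/bash
# Convert a bash `time` value such as "1m17.518s" into seconds.
# Assumes the input always has the form <minutes>m<seconds>s.
to_seconds() {
    local value=$1
    local minutes=${value%%m*}   # text before "m"
    local seconds=${value#*m}    # text after "m"
    seconds=${seconds%s}         # drop the trailing "s"
    LC_ALL=C awk -v m="${minutes}" -v s="${seconds}" 'BEGIN { printf "%.3f\n", m * 60 + s }'
}

to_seconds "1m17.518s"   # prints 77.518
```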

Step 2. Run example script on vt1a.8xlarge

  1. After completing the setup as in the previous step, run the script file. This time, modify the DEV_CNT parameter as shown below:

    Parameter    Value
    INPUT_FILE   $HOME/videos/sample_4kp60fps.h264
    INPUT_CNT    16
    DEV_CNT      4
  2. Set the parameter values and run the script as follows:

    INPUT_CNT=16
    DEV_CNT=4
    time bash ffmpg_basic.sh ${INPUT_CNT} ${DEV_CNT}
  3. Check the output results. When comparing the actual elapsed time (real), you will notice that the process completed approximately twice as fast compared to the vt1a.4xlarge instance.

    ...

    All Done.

    real 0m38.295s
    user 0m3.726s
    sys 0m7.019s

Step 3. Run example script on vt1a.32xlarge

  1. After completing the setup as in the previous steps, run the script file. This time, modify the DEV_CNT parameter as shown below:

    Parameter    Value
    INPUT_FILE   $HOME/videos/sample_4kp60fps.h264
    INPUT_CNT    16
    DEV_CNT      16
  2. Set the parameter values and run the script as follows:

    INPUT_CNT=16
    DEV_CNT=16
    time bash ffmpg_basic.sh ${INPUT_CNT} ${DEV_CNT}
  3. Check the output results. When comparing the actual elapsed time (real), you will notice that the process completed approximately 7.6 times faster than on the vt1a.4xlarge instance and about 3.7 times faster than on the vt1a.8xlarge instance.

    ...

    All Done.

    real 0m10.246s
    user 0m3.907s
    sys 0m8.851s
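
The speedup between instance sizes is the ratio of their elapsed times. The sketch below computes it from the `real` values measured in this tutorial, converted to seconds; the function and variable names are hypothetical and used only for illustration.

```bash
#!/bin/bash
# Speedup = baseline elapsed time / comparison elapsed time (both in seconds).
speedup() {
    LC_ALL=C awk -v base="$1" -v other="$2" 'BEGIN { printf "%.2f\n", base / other }'
}

# `real` values from the three runs in this tutorial, in seconds.
T_4XLARGE=77.518
T_8XLARGE=38.295
T_32XLARGE=10.246

echo "8xlarge vs 4xlarge:  $(speedup "${T_4XLARGE}" "${T_8XLARGE}")x"    # 2.02x
echo "32xlarge vs 4xlarge: $(speedup "${T_4XLARGE}" "${T_32XLARGE}")x"   # 7.57x
echo "32xlarge vs 8xlarge: $(speedup "${T_8XLARGE}" "${T_32XLARGE}")x"   # 3.74x
```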

Results analysis

When comparing the actual elapsed time for transcoding 16 input files into output files using the maximum number of devices on the vt1a.4xlarge, vt1a.8xlarge, and vt1a.32xlarge instances, the results are as follows:

Instance size    Processing time (real)
vt1a.4xlarge     Approximately 77.5 seconds
vt1a.8xlarge     Approximately 38.3 seconds
vt1a.32xlarge    Approximately 10.2 seconds

Since the vt1a.32xlarge has the same number of devices as input files, it processed all 16 input files simultaneously in parallel.
On the other hand, the vt1a.4xlarge and vt1a.8xlarge instances, with 2 and 4 devices respectively, were limited in the number of input files they could process simultaneously. As a result, these instances had to split the files into multiple groups for sequential processing: the 4xlarge processed in 8 batches of 2 files, and the 8xlarge processed in 4 batches of 4 files. This caused an increase in processing time.
For workloads that depend on parallel processing, such as high-density real-time video transcoding, the number of files that can be processed simultaneously largely determines the total processing time. For such workloads, it is recommended to use larger instances like the vt1a.32xlarge, which offer a sufficient number of devices.
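
The batch counts described above can be turned into a rough time estimate. The sketch below is a simplified model, not a measurement: it assumes every batch takes about as long as the single-batch run on the vt1a.32xlarge (10.246 seconds) and ignores any other overhead. The function names are hypothetical.

```bash
#!/bin/bash
# Number of sequential batches = ceil(input_cnt / dev_cnt), using integer math.
batch_count() {
    local input_cnt=$1 dev_cnt=$2
    echo $(( (input_cnt + dev_cnt - 1) / dev_cnt ))
}

# Assumption: each batch takes as long as the single-batch run measured on
# the vt1a.32xlarge in this tutorial.
PER_BATCH_SECONDS=10.246

estimate_seconds() {
    local batches
    batches=$(batch_count "$1" "$2")
    LC_ALL=C awk -v b="${batches}" -v t="${PER_BATCH_SECONDS}" 'BEGIN { printf "%.1f\n", b * t }'
}

# 16 inputs on 2, 4, and 16 devices.
for dev_cnt in 2 4 16; do
    echo "devices=${dev_cnt} batches=$(batch_count 16 "${dev_cnt}") estimate=$(estimate_seconds 16 "${dev_cnt}")s"
done
```

Under this assumption the model gives about 82.0 seconds for 2 devices and 41.0 seconds for 4 devices, which is within roughly 10 percent of the measured 77.5 and 38.3 seconds. This is consistent with the number of batches being the dominant factor in total processing time.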