
Test video transcoding instance performance

KakaoCloud offers the vt1a instance type, which is optimized for high-performance real-time video transcoding. In this tutorial, you will run an example that transcodes videos efficiently using vt1a instance resources and compare the results with those from other instance types.

info
  • Estimated time: 30-60 minutes
  • User environment:
    • Recommended OS: Ubuntu
    • Region: kr-central-2

About this scenario

In this tutorial, you will use KakaoCloud's vt1a video transcoding instance to transcode videos efficiently and compare the results with those from other instance types, illustrating the advantages of the video transcoding instance type.

Prework

Create VPC and subnet

Before creating an instance, you must create a VPC and subnet in which the instance will be created. If you do not have a VPC and subnet, refer to the Create VPC and Create subnet documentation to create them.

Create key pair

A key pair must be generated to connect to the instance through SSH. Refer to the Create key pair documentation to create a key pair and store the private key issued during generation.
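After downloading the private key file, it is common to restrict its permissions so that the SSH client will accept it in Step 3. A minimal example for a Linux or macOS local PC, using the same ${key-pair-name} placeholder as the SSH command later in this tutorial:

# Restrict the private key so that only the owner can read it;
# OpenSSH rejects private keys that are readable by other users.
chmod 400 ${key-pair-name}.pem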

Create security group

To control access to the instance, create a security group and add the appropriate inbound rules. Refer to the Create security group documentation to create a security group and add the following rule to its inbound rules.

Protocol | Source | Port number | Rule description
TCP | {User Public IP}/32 | 22 | Allow SSH access from the local PC to instances connected to this security group
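The {User Public IP} value in the rule above is the public IP of your local PC. You can check it from a terminal by querying a public IP lookup service; the service below is only one example of such a service:

# Print the public IP address of the local PC as seen from the internet
curl -s https://checkip.amazonaws.com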

Procedures

Step 1. Create instance

  1. From the Kakao Cloud Console, select the Virtual Machine menu under Beyond Compute Service.

  2. Go to the Instance menu, and click the [Create instance] button.

  3. Create a vt1a instance as shown below. This example uses vt1a.4xlarge.

    Item          | Configuration
    Basic info    | - Name: User-defined
                  | - Count: 1
    Image         | Select Ubuntu 22.04 in the Basic tab
                  | - Refer to SDK-supported OS information below
    Instance type | vt1a.4xlarge
    Volume        | - Root volume: Use default value
    Key pair      | Select the key pair created in the prework
    Network       | - VPC: Select the VPC created in the prework
                  | - Subnet: Select the subnet created in the prework
                  | - Security group: Select the security group created in the prework
SDK-supported OS information

As of 2024, the latest Xilinx Video SDK release (version 3.0) operates properly on the following OS and kernel versions.

  • Ubuntu 22.04 (Kernel 5.15)
  • Ubuntu 20.04.4 (Kernel 5.13)
  • Ubuntu 20.04.3 (Kernel 5.11)
  • Ubuntu 20.04.1 (Kernel 5.4)
  • Ubuntu 20.04.0 (Kernel 5.4)
  • Ubuntu 18.04.5 (Kernel 5.4)
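
If needed, you can confirm the OS release and kernel version of the instance with standard commands after connecting via SSH:

# Print the OS release name and the running kernel version
grep PRETTY_NAME /etc/os-release
uname -r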

Step 2. Associate public IP

Assign and associate a public IP to each created instance for SSH access.

  1. From the Kakao Cloud Console, select the Virtual Machine menu under Beyond Compute Service.

  2. Go to the Instance menu, and select [More] > Associate public IP for the created instance.

  3. Select the [Confirm] button to complete the public IP association process for each instance.

Step 3. Connect via SSH

To install the Xilinx Video SDK on the instance, connect and access it through SSH as described below.

  1. From the Kakao Cloud Console, select the Virtual Machine menu under Beyond Compute Service.

  2. Go to the Instance menu, and select [More] > Connect via SSH for the created instance to review the command and settings for SSH connection.

  3. On your local PC, enter the following command to connect via SSH to each instance. For more details on SSH connections, refer to Connect instance.

    ssh -i ${key-pair-name}.pem ubuntu@${public-ip-addr}
    Parameter      | Description
    key-pair-name  | The name of the private key file for the key pair specified during instance creation
    public-ip-addr | The public IP address associated with the instance

Step 4. Install Xilinx Video SDK

The Xilinx Video SDK is a software stack that enables users to take advantage of hardware-accelerated features in Xilinx video codec units, supporting high-density real-time video transcoding for tasks like live streaming. This SDK includes precompiled versions of FFmpeg and GStreamer, integrating video transcoding plugins for Xilinx devices to support hardware acceleration for video encoding, decoding, and upscaling.

Before running video transcoding examples, install the Xilinx Video SDK by following the steps in the Xilinx Video SDK installation guide.
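
For reference only, an installation typically starts by cloning the public Xilinx video-sdk repository and then running the installer for your OS from its release directory; the branch name and directory layout below are assumptions, so follow the linked installation guide for the authoritative steps.

# Clone the Xilinx Video SDK sources (the v3.0 branch name is an assumption)
git clone https://github.com/Xilinx/video-sdk -b v3.0 --depth 1
cd video-sdk

# The release/ directory contains per-OS packages and install scripts;
# pick the entry that matches your Ubuntu version per the installation guide.
ls release/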

Step 5. Set up runtime environment

Refer to the Runtime environment setup guide to complete the runtime environment configuration.
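
As a quick check, you can load the runtime environment and confirm that the U30 devices are detected, using the same setup script and xbutil command that appear in the example scripts below:

# Load the Xilinx Video SDK runtime environment into the current shell
source /opt/xilinx/xcdr/setup.sh

# List the detected Alveo U30 devices; a vt1a.4xlarge reports 2 devices
xbutil examine | grep xilinx_u30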

You are now ready to run the examples.

Step 6. Run examples

Example 1. ABR (Adaptive Bit Rate) transcoding example on multiple devices

The Alveo U30 accelerator card used in the vt1a instance supports accelerated H.264/AVC and H.265/HEVC codecs, compatible with a wide range of end-user devices from older mobile handsets to the latest ones with ultra-high resolution displays. The U30 accelerator card is equipped with two Xilinx devices per card. vt1a.4xlarge, vt1a.8xlarge, and vt1a.32xlarge instances have 1, 2, and 8 U30 accelerator cards, with 2, 4, and 16 devices, respectively. The number of transcodable streams per device is as follows:

Resolution | Frame rate | Streams
4K | 60fps | 1
1080p | 60fps | 4
1080p | 30fps | 8
720p | 30fps | 16

For example, a vt1a.4xlarge instance with 2 devices can process up to 2 x 8 = 16 concurrent 1080p 30fps streams.

This example performs an ABR ladder transcoding task on an H.264 sample file with a 4K resolution and a frame rate of 60 fps. Since the number of output streams exceeds what a single device can handle (refer to the table above), two devices are used.

info

Note: This example is part of a tutorial provided by Xilinx. For more details and additional examples, refer to Xilinx Video SDK Tutorials and Examples.

Example ffmpeg pipeline code

Run the ffmpeg pipeline to transcode a 4K resolution H.264 input file into seven H.265 (HEVC) files with different resolutions and bitrates. The task is performed on two devices (device#0, device#1), with device#0 accelerating the decoding of the input file, downscaling it to 1080p, and copying it to both the host and device#1. Next, device#0 encodes the original input file to 4K with a bitrate of 16M as an H.265 (HEVC) file, while device#1 encodes the 1080p file copied from device#0 to six different resolutions and bitrates as H.265 (HEVC) files. You can download the sample input file from the link below.

Download sample input file ↗️

source /opt/xilinx/xcdr/setup.sh
INPUT_FILE=sample_60sec_3840x2160_60fps_h264.mp4
ffmpeg -hide_banner -c:v mpsoc_vcu_h264 -lxlnx_hwdev 0 \
-i ${INPUT_FILE} \
-max_muxing_queue_size 1024 \
-filter_complex "[0]split=2[dec1][dec2]; \
[dec2]multiscale_xma=outputs=1:lxlnx_hwdev=0:out_1_width=1920:out_1_height=1080:out_1_rate=full[scal]; \
[scal]xvbm_convert[host]; [host]split=2[scl1][scl2]; \
[scl2]multiscale_xma=outputs=4:lxlnx_hwdev=1:out_1_width=1280:out_1_height=720:out_1_rate=full: \
out_2_width=848:out_2_height=480:out_2_rate=half: \
out_3_width=640:out_3_height=360:out_3_rate=half: \
out_4_width=280:out_4_height=160:out_4_rate=half \
[a][b30][c30][d30]; [a]split[a60][aa];[aa]fps=30[a30]" \
-map '[dec1]' -c:v mpsoc_vcu_hevc -b:v 16M -max-bitrate 16M -lxlnx_hwdev 0 -slices 4 -cores 4 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_4k.mp4 \
-map '[scl1]' -c:v mpsoc_vcu_hevc -b:v 6M -max-bitrate 6M -lxlnx_hwdev 1 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_1080p60.mp4 \
-map '[a60]' -c:v mpsoc_vcu_hevc -b:v 4M -max-bitrate 4M -lxlnx_hwdev 1 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_720p60.mp4 \
-map '[a30]' -c:v mpsoc_vcu_hevc -b:v 3M -max-bitrate 3M -lxlnx_hwdev 1 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_720p30.mp4 \
-map '[b30]' -c:v mpsoc_vcu_hevc -b:v 2500K -max-bitrate 2500K -lxlnx_hwdev 1 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_480p30.mp4 \
-map '[c30]' -c:v mpsoc_vcu_hevc -b:v 1250K -max-bitrate 1250K -lxlnx_hwdev 1 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_360p30.mp4 \
-map '[d30]' -c:v mpsoc_vcu_hevc -b:v 625K -max-bitrate 625K -lxlnx_hwdev 1 -max_interleave_delta 0 -f mp4 -y /tmp/xil_multidevice_ladder_160p30.mp4
Parameter | Value | Description
-i | INPUT_FILE | Input file path. This value is the path and file name of the sample H.264 file for the example (e.g., -i $HOME/videos/sample_4kp60fps.h264).
-c:v | mpsoc_vcu_h264, mpsoc_vcu_hevc | Hardware-accelerated decoder or encoder to use: mpsoc_vcu_h264 for H.264 and mpsoc_vcu_hevc for H.265 (HEVC).
-lxlnx_hwdev | 0, 1, ..., n | Device number to use. Valid values range from 0 to (number of cards x 2) - 1. ⚠️ Entering a number that exceeds the maximum number of devices for the vt1a instance size causes an error.
-b:v | BITRATE_VALUE | Target bitrate for the encoded stream (e.g., -b:v 16M).

After running this pipeline, you can check the utilization of each device as follows.

source /opt/xilinx/xcdr/setup.sh
check_rsrc_cmd="xrmadm scripts/xrmadm/list_cmd.json | grep '\(device_[0-1]\)\|\(cu_[0-9]\)\|\(cuName\)\|\(usedLoad\)'"
watch -n 1 "$check_rsrc_cmd"

Execution result screen

Example 2. Verify and compare multi-stream transcoding performance

The vt1a instance can process multiple streams in parallel at the same speed, within the device's processing capacity. In this example, you will measure the time and speed of parallel transcoding across multiple streams rather than a single stream. You will then run the same task on a GPU instance and a general-purpose (CPU) instance, measure the completion time for each, and compare the results.

Example script code

Download the sample ↗️

#!/bin/bash

if [[ $# -lt 3 ]]; then
    echo "[ERROR] Incorrect arguments supplied."
    echo "Usage: $(basename $0) <vt1|cpu|gpu> <input file path(string)> <the number of input(int)> <the number of vt1's device(int)>"
    exit 1
fi

INSTANCE_TYPE=$1
INPUT_FILE=$2
INPUT_CNT=$3
INPUT_ARR=$(seq 0 $((INPUT_CNT-1)))
DEV_CNT=$4

# Read the codec, width, and height of the input file
INPUT_FILE_INFO=`ffprobe -v error -select_streams v:0 -show_entries stream=codec_name,width,height -of default=noprint_wrappers=1:nokey=1 ${INPUT_FILE}`
INPUT_FILE_INFO=($INPUT_FILE_INFO)
INPUT_FILE_W=${INPUT_FILE_INFO[1]}
INPUT_FILE_H=${INPUT_FILE_INFO[2]}
if [ ${INPUT_FILE_H} == '1080' ]; then
    BITRATE="10M"
    AVAIL_FILE_CNT=4
else
    echo "[ERROR] Incorrect arguments supplied."
    echo "given Resolution of Input Video File is NOT 1080p BUT ${INPUT_FILE_W} x ${INPUT_FILE_H}"
    exit 1
fi

if [ ${INSTANCE_TYPE} == 'vt1' ]; then
    source /opt/xilinx/xcdr/setup.sh
    DETECTED_DEV_CNT=`xbutil examine | grep -c xilinx_u30`
    if [[ ${DEV_CNT} -gt ${DETECTED_DEV_CNT} ]]; then
        echo "[ERROR] Incorrect arguments supplied."
        echo "<the number of device(int)> can not be larger than the actual number of devices(${DETECTED_DEV_CNT})"
        exit 1
    fi
else
    DEV_CNT=1
    DETECTED_DEV_CNT=1
fi

# Initialize a per-device stream counter array (one slot per device, all set to 0)
function init_dev_arr()
{
    local device_cnt=$1
    local arr_=()
    for a in $(seq 0 $((device_cnt-1)))
    do
        arr_[a]=0
    done
    echo ${arr_[@]}
}

# Make sure the log directory exists before writing per-stream logs
mkdir -p logs

DEV_ARR=($(init_dev_arr $DEV_CNT))
for i in ${INPUT_ARR}
do
    OUTPUT_FILE=/tmp/output_hevc_${i}.mp4
    LOG_FILE=logs/log${i}.out
    rm -rf ${OUTPUT_FILE} ${LOG_FILE}

    if [ ${INSTANCE_TYPE} == 'vt1' ]; then
        # Distribute streams across the devices in round-robin order
        DEV_IDX=$((i % DEV_CNT))
        if [[ ${DEV_ARR[$DEV_IDX]} -eq ${AVAIL_FILE_CNT} ]]; then
            echo ">> all devices are in use. waiting until a device becomes available..."
            wait
            echo
            DEV_ARR=($(init_dev_arr $DEV_CNT))
            echo
        fi
        cmd="nohup ffmpeg -xlnx_hwdev ${DEV_IDX} -c:v mpsoc_vcu_h264 -i ${INPUT_FILE} -hide_banner -c:v mpsoc_vcu_hevc -b:v ${BITRATE} -max-bitrate ${BITRATE} -max_interleave_delta 0 -profile:v main -y -f mp4 ${OUTPUT_FILE} > ${LOG_FILE} 2>&1 &"
        echo $cmd
        echo
        eval "$cmd"
        DEV_ARR[$DEV_IDX]=$((DEV_ARR[$DEV_IDX] + 1))
    elif [ ${INSTANCE_TYPE} == 'gpu' ]; then
        cmd="nohup ffmpeg -hwaccel cuda -hwaccel_output_format cuda -c:v h264_cuvid -i ${INPUT_FILE} -hide_banner -c:v hevc_nvenc -b:v ${BITRATE} -maxrate ${BITRATE} -max_interleave_delta 0 -profile:v main -preset p4 -y -f mp4 ${OUTPUT_FILE} > ${LOG_FILE} 2>&1 &"
        echo $cmd
        echo
        eval "$cmd"
    else
        N_THREADS=5
        cmd="nohup ffmpeg -c:v h264 -i ${INPUT_FILE} -f mp4 -c:v libx265 -b:v ${BITRATE} -maxrate ${BITRATE} -threads ${N_THREADS} -max_interleave_delta 0 -profile:v main -preset faster -y -f mp4 ${OUTPUT_FILE} > ${LOG_FILE} 2>&1 &"
        echo $cmd
        echo
        eval "$cmd"
    fi
done

wait
echo
echo "All Done."

Save the code above as multistream.sh and use the command format below to increase the number of parallel streams up to 32 for each instance type while measuring the total task time.

N_STREAM_ARR=$(seq 1 32)
for N_STREAM in ${N_STREAM_ARR}
do
    # vt1
    N_DEVICE=1
    time bash multistream.sh vt1 sample_60sec_1920x1080_60fps_h264.mp4 ${N_STREAM} ${N_DEVICE}

    # gpu
    time bash multistream.sh gpu sample_60sec_1920x1080_60fps_h264.mp4 ${N_STREAM}

    # cpu
    time bash multistream.sh cpu sample_60sec_1920x1080_60fps_h264.mp4 ${N_STREAM}

    wait
done
Parameter | Value | Description
INSTANCE_TYPE | vt1, cpu, gpu | Instance type for the task
INPUT_FILE | sample_60sec_1920x1080_60fps_h264.mp4 | Input video file for the task
INPUT_CNT | [1, 32] | Total number of streams = number of copies of INPUT_FILE to be processed
DEV_CNT | [1, 2] | Number of devices to use when working with the vt1a instance
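
To make the comparison easier to plot, you can record the elapsed time of every run to a CSV file instead of reading the time output from the terminal. The sketch below uses GNU time (/usr/bin/time, installable with the time package), and the results.csv file name is an arbitrary choice:

# Append one "instance_type,stream_count,elapsed_seconds" line per run to results.csv
for N_STREAM in $(seq 1 32)
do
    for TYPE in vt1 gpu cpu
    do
        # The trailing 1 is the device count; it is only used for the vt1 case
        /usr/bin/time -f "${TYPE},${N_STREAM},%e" -a -o results.csv \
            bash multistream.sh ${TYPE} sample_60sec_1920x1080_60fps_h264.mp4 ${N_STREAM} 1
    done
done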

In this example, gpu corresponds to the gn1i instance type, and cpu corresponds to the c2a instance type. The results are displayed in the graph below.

Comparison of results

As the number of multi-stream transcoding tasks increases, the task time increases proportionally for gn1i and c2a instance types. However, for the vt1a instance type, parallel processing occurs up to the maximum number of multi-streams each device can handle, resulting in a relatively smaller increase in task time.