Subtitle Tools in C Language

I decided to release some of the C tools I've written over the years for working with subtitle files. These include SubRip (.srt) tools, a SubStationAlpha (.ssa)-to-SubRip (.srt) converter, and a PGS (.sup) tool that extracts subtitle images as bitmaps and synchronizes timestamps to new anchor points. Some tools for detecting character encoding and converting to UTF-8 are also included.

In addition, I have included routines for YCbCr-to-RGB and RGB-to-YCbCr conversion, and for deriving the BT.709 RGB colorspace constants from the BT.709 color primaries.

Before using a SubRip file as input to any of the tools listed below, it's generally a good idea to run check.c on it to ensure it's in the proper format.

Table 1: SubRip (.srt) Tools
check.c

Check for errors in a SubRip (.srt) file and report results.

Compile: gcc -Wall check.c -o check

Usage: ./check inputfilename.srt

If it finds an error, fix it and re-run check until no errors are found.

Output: reports to stdout

offset.c

Read an existing SubRip (.srt) file, apply a positive, negative, or zero offset to the timestamps, and save the result in an output file. Ignores existing subtitle numbers and renumbers them from 1 to N. I often use it just to renumber by entering 0 for the offsets. If present, the Byte Order Mark (BOM) of the input file will be included in the output file.

Compile: gcc -Wall offset.c -o offset

Usage: ./offset inputfilename.srt

Output: out.srt
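
The timestamp arithmetic behind an offset can be sketched as follows. This is a minimal illustration, not the code in offset.c: the function names are hypothetical, and clamping negative results to zero is an assumption about how a large negative offset might be handled.

```c
#include <stdio.h>

/* Parse "hh:mm:ss,ms" into milliseconds. Returns -1 if the text is malformed. */
long srt_to_ms(const char *s)
{
    int h, m, sec, ms;

    if (sscanf(s, "%d:%d:%d,%d", &h, &m, &sec, &ms) != 4)
        return -1;
    return ((h * 60L + m) * 60L + sec) * 1000L + ms;
}

/* Write t (milliseconds) as "hh:mm:ss,ms".
   Assumption: negative times are clamped to 0. */
void ms_to_srt(long t, char *buf, size_t n)
{
    if (t < 0)
        t = 0;
    snprintf(buf, n, "%02ld:%02ld:%02ld,%03ld",
             t / 3600000, (t / 60000) % 60, (t / 1000) % 60, t % 1000);
}

/* Shift one timestamp by offset_ms. Returns 0 on success, -1 on bad input. */
int shift_timestamp(const char *in, long offset_ms, char *out, size_t n)
{
    long t = srt_to_ms(in);

    if (t < 0)
        return -1;
    ms_to_srt(t + offset_ms, out, n);
    return 0;
}
```

For example, shifting "00:00:29,280" by -7000 ms yields "00:00:22,280". Working in integer milliseconds avoids floating-point rounding entirely.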

sync.c

Read an existing SubRip (.srt) file and synchronize all timestamps to user-input anchor-points. Subtitle durations are preserved. If present, the Byte Order Mark (BOM) of the input file will be included in the output file.

Compile: gcc -Wall sync.c -o sync

Usage: ./sync inputfilename.srt

Output: out.srt

Synchronization is accomplished by using "first" and "last" timestamps as anchor-points. Choose "first" and "last" subtitles that are near or at the beginning and end of the feature in order to maximize scaling accuracy.

For example, if the existing timestamps for subtitles appearing early and late in the feature are:

"first":

00:00:29,280 --> 00:00:31,880

"last":

01:31:29,280 --> 01:31:30,920

and new start times for these subtitles are to be:

"first":

00:00:22,280

"last":

01:31:25,000

then you would use sync.c like this:

  • Current start timestamp for first anchor point subtitle (hh:mm:ss,ms)? 00:00:29,280
  • Current start timestamp for last anchor point subtitle (hh:mm:ss,ms)? 01:31:29,280
  • New start timestamp for first anchor point subtitle (hh:mm:ss,ms)? 00:00:22,280
  • New start timestamp for last anchor point subtitle (hh:mm:ss,ms)? 01:31:25,000
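
One way to realize this kind of two-anchor synchronization is linear interpolation of the start times, with each end time moved by the same amount as its start so that durations are preserved. The sketch below assumes that approach; the struct and function names are hypothetical and the code in sync.c may differ.

```c
/* Anchor points in milliseconds. old_last must be greater than old_first. */
typedef struct {
    long long old_first, old_last;  /* current start times of the two anchors */
    long long new_first, new_last;  /* desired start times of the two anchors */
} Anchors;

/* Integer division rounded to nearest; den must be positive. */
long long div_round(long long num, long long den)
{
    return num >= 0 ? (num + den / 2) / den : -((-num + den / 2) / den);
}

/* Map an old start time onto the new timeline by linear interpolation. */
long long sync_start(const Anchors *a, long long old_start)
{
    long long old_span = a->old_last - a->old_first;
    long long new_span = a->new_last - a->new_first;

    return a->new_first +
           div_round((old_start - a->old_first) * new_span, old_span);
}

/* Keep the original duration: the end moves by the same amount as the start. */
long long sync_end(const Anchors *a, long long old_start, long long old_end)
{
    return sync_start(a, old_start) + (old_end - old_start);
}
```

With the anchors from the example above (29280, 5489280, 22280, and 5485000 ms), the "first" subtitle becomes 00:00:22,280 --> 00:00:24,880 and the "last" becomes 01:31:25,000 --> 01:31:26,640. Anchors that are far apart make the scale factor less sensitive to small errors in either timestamp.
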
ssa2srt.c

Read an existing SubStationAlpha (SSA) file and convert to SubRip (.srt) output file. Transfers styles and markups for font color, bold, italic, underline, strikeout, and alignment. Only recognizes V4 and V4+ Styles. Ignores those SSA style attributes and override tags not implemented in SubRip format. Warning: Unlike SubRip files, SSA files don't require subtitles to be in chronological order. The SubRip output file is not corrected for this rare situation; use reorder.c below.

Compile: gcc -Wall ssa2srt.c -o ssa2srt

Usage: ./ssa2srt inputfilename

Output: out.srt

ssa2srt-nostyles.c

Read an existing SubStationAlpha (SSA) file and convert to SubRip (.srt) output file. Doesn't transfer styles, only markups for font color, bold, italic, underline, strikeout, and alignment. Warning: Unlike SubRip files, SSA files don't require subtitles to be in chronological order. The SubRip output file is not corrected for this rare situation; use reorder.c below.

Compile: gcc -Wall ssa2srt-nostyles.c -o ssa2srt-nostyles

Usage: ./ssa2srt-nostyles inputfilename

Output: out.srt

reorder.c

Re-order non-chronological subtitles in a SubRip (.srt) file by sorting on start times. If present, the Byte Order Mark (BOM) of the input file will be included in the output file.

Compile: gcc -Wall reorder.c -o reorder

Usage: ./reorder inputfilename.srt

Output: out.srt
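
Sorting on start times can be done with the standard library's qsort. The sketch below assumes the subtitles have already been parsed into an array; the Subtitle struct and function names are hypothetical, not taken from reorder.c.

```c
#include <stdlib.h>

/* Hypothetical in-memory form of one parsed subtitle. */
typedef struct {
    long start_ms;
    long end_ms;
    const char *text;
} Subtitle;

/* qsort comparator: ascending start time. */
int by_start(const void *pa, const void *pb)
{
    const Subtitle *a = pa;
    const Subtitle *b = pb;

    /* Avoids the overflow risk of returning a->start_ms - b->start_ms. */
    return (a->start_ms > b->start_ms) - (a->start_ms < b->start_ms);
}

/* Sort subtitles into chronological order. */
void reorder(Subtitle *subs, size_t count)
{
    qsort(subs, count, sizeof subs[0], by_start);
}
```

Note that qsort is not guaranteed to be stable, so two subtitles with identical start times may swap places. If that matters, the original index can be added to the struct and used as a tie-breaker.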

srt2txt.c

Read an existing SubRip (.srt) file and save only the text lines to an output file. This is useful if you want to submit the text to a translation tool/service without the subtitle numbers and timestamps. Once translated, you can use txt2srt.c to convert back to a SubRip file.

Compile: gcc -Wall srt2txt.c -o srt2txt

Usage: ./srt2txt inputfilename.srt [nospace]

Output: out.txt. Subtitle texts are separated by blank lines unless the nospace option is specified.

txt2srt.c

Take the timestamps from a SubRip (.srt) file and the text from a text file and create a new .srt file. The text file must have the same number of subtitles as the SubRip file, and they must be separated by single blank lines.

Compile: gcc -Wall txt2srt.c -o txt2srt

Usage: ./txt2srt inputfilename.srt inputfilename.txt

Output: out.srt

fixtag.c

Read an existing SubRip (.srt) file and look for and fix some common markup tag errors.

Tags included: italics, bold, underline, strikethrough, font color, and font size

Using the optional close argument will cause fixtag to append missing closing tags to the last line of text of the subtitle. You should always compare out.srt with the original subtitle file to determine the author's intent. If present, the Byte Order Mark (BOM) of the input file will be included in the output file.

Compile: gcc -Wall fixtag.c -o fixtag

Usage: ./fixtag inputfilename.srt [close]

Output: out.srt

time-text.c

Take the timestamps from one SubRip (.srt) file and the subtitle texts from another SubRip file and create a new SubRip file with those timestamps and subtitle texts. The two input SubRip files must have the same number of subtitles. If present, the Byte Order Mark (BOM) of the text SubRip file will be included in the output file.

Compile: gcc -Wall time-text.c -o time-text

Usage: ./time-text timeinputfile.srt textinputfile.srt

Output: out.srt

combine.c

Read an existing SubRip (.srt) file and combine subtitles with identical textual content and consecutive timestamps. Within each group of matching subtitles, it takes the starting timestamp from the first subtitle and the ending timestamp from the last. Writes a new SubRip file. If present, the Byte Order Mark (BOM) of the input file will be included in the output file.

Compile: gcc -Wall combine.c -o combine

Usage: ./combine inputfilename.srt

Output: out.srt
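
The merging rule can be sketched as a single in-place pass. This illustration treats "consecutive" as "adjacent in the file"; whether combine.c also requires the timestamps to touch is not stated here, so that is an assumption. The Subtitle struct and function name are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one parsed subtitle. */
typedef struct {
    long start_ms;
    long end_ms;
    const char *text;
} Subtitle;

/* Merge runs of adjacent subtitles with identical text, in place.
   Each merged entry keeps the start of the first and the end of the last.
   Returns the new number of subtitles. */
size_t combine(Subtitle *subs, size_t count)
{
    size_t out = 0;

    for (size_t i = 0; i < count; i++) {
        if (out > 0 && strcmp(subs[out - 1].text, subs[i].text) == 0)
            subs[out - 1].end_ms = subs[i].end_ms;  /* extend current group */
        else
            subs[out++] = subs[i];                  /* start a new group */
    }
    return out;
}
```

Only neighbors are compared, so a repeated line that is separated by a different subtitle stays as two entries.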

split.c

I only created this to produce test files for combine.c. Read an existing SubRip (.srt) file and split each subtitle into two identical subs. If present, the Byte Order Mark (BOM) of the input file will be included in the output file.

Compile: gcc -Wall split.c -o split

Usage: ./split inputfilename.srt

Output: out.srt

readbom.c

Read an existing SubRip (.srt) or text file and determine if there is an existing Byte Order Mark (BOM) and report results to stdout. This does not detect character-encoding by analyzing the text; see ced and enc below instead.

Compile: gcc -Wall readbom.c -o readbom

Usage: ./readbom inputfilename

Output: reports to stdout
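
Detecting a BOM comes down to comparing the first few bytes of the file against the known signatures. The sketch below covers the standard Unicode BOMs; which ones readbom.c actually recognizes is not stated here, and the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return the name of the BOM at the start of buf, or NULL if there is none.
   UTF-32LE is tested before UTF-16LE because FF FE 00 00 begins with FF FE. */
const char *bom_name(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return NULL;
}
```

A caller would read the first four bytes of the file with fread and pass the number of bytes actually read as len. A UTF-16LE file whose first character is U+0000 would be misreported as UTF-32LE, but that does not occur in ordinary subtitle text.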

writebom.c

Read an existing SubRip (.srt) or text file and, if it does not already have a Byte Order Mark (BOM), prepend the BOM selected by the user. This does not change the character-encoding of the text; see enc below instead.

Compile: gcc -Wall writebom.c -o writebom

Usage: ./writebom inputfilename

Output: out.txt

stripbom.c

Strip the Byte Order Mark (BOM), if it exists, from a SubRip (.srt) or text file.

Compile: gcc -Wall stripbom.c -o stripbom

Usage: ./stripbom inputfilename

Output: out.txt

ced.tar.gz

This is Google's 2016 character-encoding detector ced adapted by me to be used as a command line tool to analyze SubRip (.srt) or text files (rather than web pages within a web browser).

Compile: gunzip, untar, and then use make. Result is ced. Their code produces numerous compilation warnings which you can safely ignore.

Usage: ./ced inputfilename

Output: reports to stdout

enc.tar.gz

Detect the likely character-encoding of a SubRip (.srt) or text file using ced and also Linux's chardet, ask the user for their best guess at the character-encoding based on the results, convert to UTF-8 using Linux's iconv, and prepend a UTF-8 Byte Order Mark (BOM) if requested. The ced code is built into enc here, and chardet and iconv are executed via system calls. I use Ubuntu, which includes chardet and iconv, but I don't know about other Linux flavors.

Compile: gunzip, untar, and then use make. Result is enc. Google's ced code produces numerous compilation warnings which you can safely ignore.

Usage: ./enc inputfilename

When prompted for the choice of encoding, enter the name of the encoding, for example: KOI8-R or ISO_8859-2. In general, I believe the names are not case-sensitive.

Output: out.txt

samples.tar.gz

This is a collection of character-encoding sample files I created. The source of the original English UTF-8 text was https://en.wikipedia.org/wiki/Mary_Read. I took that UTF-8.en.sample, replaced non-standard quotes with ", used DeepL and Google Translate to convert it to other languages, then used iconv to convert to various other encodings. These were only semi-useful, as sometimes chardet detects correctly, sometimes ced, and sometimes neither. For some reason I haven't figured out, I found chardetng to be the least reliable, so I don't use it. Some of these encodings are identical but have different names, since some encodings are subsets of others.

Table 2: PGS (.sup) Tool
pgs.c

A tool to analyze a PGS (.sup) file and produce a report file. Optional functions include producing a bitmap file for each subtitle, applying an offset to timestamps, and synchronizing all timestamps to new anchor-points. Subtitle durations are preserved when synchronizing to new anchor points.

Compile: gcc -Wall pgs.c -lm -o pgs

Usage: ./pgs filename.sup [option]

Options (use one at a time):

  • bmp - Produce a bitmap file for each subtitle.
  • offset - Apply user-input offsets to all subtitle timestamps.
  • sync - Synchronize all subtitle timestamps to user-input anchor points. Usage is identical to sync.c above.

Output: pgs.out, and optionally: bitmap file for each subtitle, or offset/resynchronized PGS file out.sup
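
As background for the offset option, here is a sketch of shifting the presentation timestamp of one PGS segment. It relies on the commonly documented segment header layout, which is an assumption here and is not taken from pgs.c: 13 bytes consisting of the magic "PG", a 4-byte big-endian PTS in 90 kHz ticks, a 4-byte DTS, a 1-byte segment type, and a 2-byte segment size. The function names are hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit big-endian integer. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write a 32-bit big-endian integer. */
void write_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Shift the PTS in one 13-byte segment header by offset_ms.
   Returns 0 on success, -1 if the "PG" magic is missing. */
int pgs_shift_pts(unsigned char *hdr, long offset_ms)
{
    int64_t pts;

    if (hdr[0] != 'P' || hdr[1] != 'G')
        return -1;
    pts = (int64_t)read_be32(hdr + 2) + (int64_t)offset_ms * 90; /* 90 ticks per ms */
    if (pts < 0)
        pts = 0;                /* assumption: clamp rather than wrap */
    if (pts > UINT32_MAX)
        pts = UINT32_MAX;
    write_be32(hdr + 2, (uint32_t)pts);
    return 0;
}
```

For example, a PTS of 90000 ticks (1 second) shifted by +500 ms becomes 135000 ticks. A full tool would apply the same shift to every segment in the file.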

Table 3: Chapter Tool
chapters.c

Create an XML chapters file given the feature duration and desired number of chapters. The resulting .xml file can be added to a video container using tools like MKVToolNix GUI. There are often 12 to 16 chapters for a typical 1.5 hour feature. I usually use ffmpeg to find the duration, e.g., ffmpeg -i filename.mkv

[Figure: results of typing ffmpeg -i filename.mkv]

Note that chapters.c expects the same time notation as ffmpeg, i.e., using fractions of a second. (This differs from the SubRip (.srt) file format, which uses ",milliseconds".) You can copy and paste the result from ffmpeg.

Compile: gcc -Wall chapters.c -o chapters

Usage: ./chapters

Output: chapters.xml
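
The timing arithmetic for chapters can be sketched as follows, assuming the chapters are spaced evenly across the feature (an assumption; the spacing used by chapters.c is not stated here). The function names are hypothetical, and the sketch stops short of writing any XML.

```c
#include <stdio.h>

/* Parse an ffmpeg-style duration "hh:mm:ss.ff" into milliseconds.
   Returns -1 if the text is malformed. */
long duration_to_ms(const char *s)
{
    int h, m;
    double sec;

    if (sscanf(s, "%d:%d:%lf", &h, &m, &sec) != 3)
        return -1;
    return (h * 60L + m) * 60000L + (long)(sec * 1000.0 + 0.5);
}

/* Start time of chapter k (0-based) when the feature is cut into n equal parts. */
long chapter_start_ms(long duration_ms, int n, int k)
{
    return (long)((long long)duration_ms * k / n);
}

/* Format milliseconds as "hh:mm:ss.mmm" for display. */
void format_ms(long t, char *buf, size_t size)
{
    snprintf(buf, size, "%02ld:%02ld:%02ld.%03ld",
             t / 3600000, (t / 60000) % 60, (t / 1000) % 60, t % 1000);
}
```

For a duration of 01:30:00.00 and 12 chapters, each chapter is 7.5 minutes long, so chapter 0 starts at 00:00:00.000 and chapter 1 at 00:07:30.000.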

Table 4: Colorspace Tools
References: ITU-R BT.709-6, ITU-T H.273 (V4), and SMPTE RP 177-1993
bt709.c

Derive the RGB color constants for BT.709 YCbCr colorspace. The Normalized Primary Matrix (NPM), which includes KR, KG, and KB, is derived from the color primaries as defined in the BT.709 standard.

Compile: gcc -Wall bt709.c -lm -o bt709

Usage: ./bt709

Output: reports to stdout
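
The derivation of the NPM from chromaticity coordinates, in the style of SMPTE RP 177, can be sketched as below. This is an illustration rather than the code in bt709.c: the names are hypothetical, and solving the 3x3 system with Cramer's rule is a choice made here for brevity.

```c
typedef struct { double x, y; } Chromaticity;

/* Determinant of a 3x3 matrix. */
double det3(double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* Build the RGB-to-XYZ Normalized Primary Matrix from the primaries and
   white point. Row 1 of the result holds KR, KG, KB. */
void derive_npm(Chromaticity r, Chromaticity g, Chromaticity b,
                Chromaticity w, double npm[3][3])
{
    /* Columns are the (x, y, z) chromaticities of R, G, B; z = 1 - x - y. */
    double p[3][3] = {
        { r.x,             g.x,             b.x             },
        { r.y,             g.y,             b.y             },
        { 1.0 - r.x - r.y, 1.0 - g.x - g.y, 1.0 - b.x - b.y }
    };
    /* White point as XYZ, scaled so that Y = 1. */
    double wv[3] = { w.x / w.y, 1.0, (1.0 - w.x - w.y) / w.y };
    double d = det3(p);

    for (int col = 0; col < 3; col++) {
        double t[3][3];

        /* Cramer's rule: replace one column of P with W to solve P * c = W. */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                t[i][j] = (j == col) ? wv[i] : p[i][j];

        double c = det3(t) / d;

        for (int i = 0; i < 3; i++)
            npm[i][col] = p[i][col] * c;    /* NPM = P * diag(c) */
    }
}
```

Called with the BT.709 primaries R (0.640, 0.330), G (0.300, 0.600), B (0.150, 0.060) and the D65 white point (0.3127, 0.3290), the middle row comes out at approximately 0.2126, 0.7152, and 0.0722, the familiar BT.709 luma coefficients.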

ycbcr2rgb.c

Convert YCbCr (BT.709) to 8-bit sRGB. Assumes BT.709 color primaries and gamma-correction were used to produce YCbCr. Uses sRGB color primaries (same as BT.709) and applies sRGB gamma-correction to produce sRGB coordinates.

Compile: gcc -Wall ycbcr2rgb.c -lm -o ycbcr2rgb

Usage: ./ycbcr2rgb

Output: reports to stdout

rgb2ycbcr.c

Convert 8-bit sRGB to YCbCr (BT.709). Assumes sRGB gamma-correction was applied to sRGB. Uses BT.709 color primaries and applies BT.709 gamma-correction to produce YCbCr.

Compile: gcc -Wall rgb2ycbcr.c -lm -o rgb2ycbcr

Usage: ./rgb2ycbcr

Output: reports to stdout
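
The matrix step common to ycbcr2rgb.c and rgb2ycbcr.c can be sketched with the BT.709 luma coefficients. This sketch covers only that step on normalized, gamma-encoded values: the conversion between sRGB and BT.709 gamma described above and the mapping to 8-bit code values are deliberately left out. The function names are hypothetical.

```c
/* BT.709 luma coefficients (rounded values from the standard). */
#define KR 0.2126
#define KG 0.7152
#define KB 0.0722

/* Gamma-encoded R'G'B' in [0, 1] to Y' in [0, 1] and Cb, Cr in [-0.5, 0.5]. */
void rgb_to_ycbcr(double r, double g, double b,
                  double *y, double *cb, double *cr)
{
    *y  = KR * r + KG * g + KB * b;
    *cb = (b - *y) / (2.0 * (1.0 - KB));    /* divisor is 1.8556 */
    *cr = (r - *y) / (2.0 * (1.0 - KR));    /* divisor is 1.5748 */
}

/* Inverse of rgb_to_ycbcr(). */
void ycbcr_to_rgb(double y, double cb, double cr,
                  double *r, double *g, double *b)
{
    *r = y + 2.0 * (1.0 - KR) * cr;
    *b = y + 2.0 * (1.0 - KB) * cb;
    *g = (y - KR * *r - KB * *b) / KG;
}
```

Pure red (1, 0, 0) maps to Y' = 0.2126 and Cr = 0.5, the maximum of the Cr range. White maps to Y' = 1 with both chroma components at zero, to within floating-point rounding.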

P. David Buchan pdbuchan@gmail.com