Perl script for mangling SRT subtitle files

This post was written by eli on May 21, 2024
Posted Under: perl

I had a set of SRT files with pretty good subtitles, but with one annoying problem: When there was a song in the background, the translation of the song would pop up and interrupt of the dialogue’s subtitles, so it became impossible to understand what’s going on.

Luckily, those song-translating subtitles had all have a “{\a6}” string, which is an ASS tag meaning that the text should be shown at the top of the picture. mplayer ignores these tags, which explains why these subtitles make sense, but mess up things for me. So the simple solution is to remove these entries.

Why don’t I use VLC instead? Mainly because I’m used to mplayer, and I’m under the impression that mplayer gives much better and easier control of low-level issues such as adjusting the subtitles’ timing. But also the ability to run it with a lot of parameters from the command line and jumping back and forth in the displayed video, in particular through a keyboard remote control. But maybe it’s just a matter of habit.

Here’s a Perl script that reads an SRT file and removes all entries with such string. It fixes the numbering of the entries to make up for those that have been removed. Fun fact: The entries don’t need to appear in chronological order. In fact, most of the annoying subtitles appeared at the end of the file, even though they messed up things everywhere.

This can be a boilerplate for other needs as well, of course.

#!/usr/bin/perl
use warnings;
use strict;

my $fname = shift;

my $data = readfile($fname);

my ($name, $ext) = ($fname =~ /^(.*)\.(.*)$/);

die("No extension in file name \"$fname\"\n")
  unless (defined $name);

# Regex for a newline, swallowing surrounding CR if such exist
my $nl = qr/\r*\n\r*/;

# Regex for a subtitle entry
my $tregex = qr/(?:\d+$nl.*?(?:$nl$nl|$))/s;

my ($pre, $chunk, $post) = ($data =~ /^(.*?)($tregex*)(.*)$/);

die("Input file doesn't look like an SRT file\n")
  unless (defined $chunk);

my $lpre = length($pre);
my $lpost = length($post);

print "Warning: Passing through $lpre bytes at beginning of file untouched\n"
 if ($lpre);

print "Warning: Passing through $lpost bytes at beginning of file untouched\n"
 if ($lpost);

my @items = ($chunk =~ /($tregex)/g);

#### This is the mangling part

my @outitems;
my $removed = 0;
my $counter = 1;

foreach my $i (@items) {
  if ($i =~ /\\a6/) {
    $removed++;
  } else {
    $i =~ s/\d+/$counter/;
    $counter++;
    push @outitems, $i;
  }
}

print "Removed $removed subtitle entries from $fname\n";

#### Mangling part ends here

writefile("$name-clean.$ext", join("", $pre, @outitems, $post));

exit(0); # Just to have this explicit

############ Simple file I/O subroutines ############

sub writefile {
  my ($fname, $data) = @_;

  open(my $out, ">:utf8", $fname)
    or die "Can't open \"$fname\" for write: $!\n";
  print $out $data;
  close $out;
}

sub readfile {
  my ($fname) = @_;

  local $/; # Slurp mode

  open(my $in, "<:utf8", $fname)
    or die "Can't open $fname for read: $!\n";

  my $input = <$in>;
  close $in;

  return $input;
}

Add a Comment

required, use real name
required, will not be published
optional, your blog address