[PR #962] [MERGED] [IMPROVEMENT] [FIX] Improve the start and end timestamps of extracted burned in captions #1784

Closed
opened 2026-01-29 17:18:32 +00:00 by claunia · 0 comments
📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/962
Author: @saurabhshah0410
Created: 3/12/2018
Status: Merged
Merged: 3/12/2018
Merged by: @cfsmp3

Base: master ← Head: improvement


📝 Commits (1)

  • 52b5a38 Improve the start and end timestamps of extracted burned in captions

📊 Changes

1 file changed (+50 additions, -23 deletions)

📝 src/lib_ccx/hardsubx_decoder.c (+50 -23)

📄 Description

In raising this pull request, I confirm the following (please check boxes):

  • [x] I have read and understood the contributors guide.
  • [x] I have checked that another pull request for this purpose does not exist.
  • [x] I have considered, and confirmed that this submission will be valuable to others.
  • [x] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • [x] I give this submission freely, and claim no ownership to its content.

My familiarity with the project is as follows (check one):

  • [ ] I have never used CCExtractor.
  • [ ] I have used CCExtractor just a couple of times.
  • [ ] I absolutely love CCExtractor, but have not contributed previously.
  • [x] I am an active contributor to CCExtractor.

The start and end timestamps of extracted burned-in captions are
inaccurate, in some cases by several seconds. The first extracted caption
always starts at zero, even when it actually appears later in the video,
and consecutive captions are always emitted back to back with no real gap.

To see that, you can download this file from the UK TV Samples in the samples repository:
https://drive.google.com/open?id=0B_61ywKPmI0TdlRWcVdnajVJUWs

Since the duration of that file is 15 minutes, we can trim it down to 30 seconds for our purposes:
ffmpeg -i BBC1.mp4 -acodec copy -vcodec copy -scodec copy -ss 00:00:00 -t 00:00:30 bbc.mp4

This writes the first 30 seconds of BBC1.mp4 to bbc.mp4.
Before this commit, if I compile ccextractor with hard subs enabled and run the following command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
the generated bbc.srt is as follows (ignore the odd-looking characters; they appear to be OCR noise):

1
00:00:00,000 --> 00:00:08,500
Oh, no. No, lim tired.

2
00:00:08,502 --> 00:00:09,380
.‘7 -
Oh, no. No, I'm tired.

3
00:00:09,382 --> 00:00:14,380
Baby shower was rubbish.
I'm just going to go to bed.

4
00:00:14,382 --> 00:00:15,340
Are you OK? '_‘ -' _: 1:. . .. '... .

5
00:00:15,342 --> 00:00:16,500
Are you OK?

6
00:00:16,502 --> 00:00:18,340
Are you OK? ‘

7
00:00:18,342 --> 00:00:23,460
All right. Well, don't stay up
'too late. You've got a lot to do.

8
00:00:23,462 --> 00:00:28,380
I ..' Night-night.

9
00:00:28,382 --> 00:00:28,380
“Sir-ht :l 1 .

The timings are clearly wrong, in places by several seconds. Because of the way the
code is written, the first extracted caption always gets a starting timestamp of
00:00:00. In addition, the extracted subtitles always follow one another directly:
there is only a 2 ms gap between any two consecutive captions, which makes it look
as if the video had hard subs on screen throughout.
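
The observations above can be checked mechanically. The following is a minimal sketch, not part of this PR: it parses only the timing lines copied from the "before" bbc.srt and reports the first start time, the gap between consecutive cues, and any cue whose end precedes its start. The helper names (`to_ms`, `parse_timings`, `gaps`) are made up for this illustration.

```python
import re

# Timing lines copied from the "before" bbc.srt above (cue text omitted).
BEFORE = """\
00:00:00,000 --> 00:00:08,500
00:00:08,502 --> 00:00:09,380
00:00:09,382 --> 00:00:14,380
00:00:14,382 --> 00:00:15,340
00:00:15,342 --> 00:00:16,500
00:00:16,502 --> 00:00:18,340
00:00:18,342 --> 00:00:23,460
00:00:23,462 --> 00:00:28,380
00:00:28,382 --> 00:00:28,380
"""

# An SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")


def to_ms(h, m, s, ms):
    """Convert the four timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_timings(text):
    """Return a list of (start_ms, end_ms), one per timing line in text."""
    cues = []
    for match in TIMING.finditer(text):
        fields = match.groups()
        cues.append((to_ms(*fields[:4]), to_ms(*fields[4:])))
    return cues


def gaps(cues):
    """Gap in ms between the end of each cue and the start of the next."""
    return [nxt[0] - cur[1] for cur, nxt in zip(cues, cues[1:])]


cues = parse_timings(BEFORE)
inverted = [i + 1 for i, (start, end) in enumerate(cues) if end < start]

print("first start (ms):", cues[0][0])
print("gaps (ms):", gaps(cues))
print("cues ending before they start:", inverted)
```

For the listing above, this reports a first start of 0 ms and a 2 ms gap between every pair of consecutive cues. It also flags cue 9, whose end time (00:00:28,380) is earlier than its start time (00:00:28,382).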

This commit improves the start and end timestamps of the extracted
burned in captions and reduces the error significantly, bringing the
timestamps fairly close to the actual timings as they appear in the
media file.
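
To illustrate the general idea of tying cue boundaries to the frames where text is actually detected, here is a hedged sketch. It is not the code in hardsubx_decoder.c, and the function name `frames_to_cues`, the input format, and the sample frames are all assumptions made for this example. A cue starts at the first frame where a given text is seen and ends at the last frame where it is still seen, so frames without text produce a gap instead of stretching the previous cue.

```python
def frames_to_cues(frames):
    """Group consecutive frames with identical text into cues.

    frames: iterable of (timestamp_ms, text) in presentation order, where
    text is "" when no subtitle was detected in that frame (hypothetical
    input format for this sketch).
    Returns a list of (start_ms, end_ms, text).
    """
    cues = []
    current_text = ""
    start_ms = None
    last_ms = None
    for ts, text in frames:
        if text != current_text:
            # The text changed: close the open cue at the last frame
            # where it was still visible, then open a new one if needed.
            if current_text:
                cues.append((start_ms, last_ms, current_text))
            current_text = text
            start_ms = ts
        last_ms = ts
    if current_text:
        # Close the cue that is still open at the end of the input.
        cues.append((start_ms, last_ms, current_text))
    return cues


# Hypothetical frames sampled every 500 ms.
sample_frames = [
    (0, ""), (500, ""),
    (1000, "Hello"), (1500, "Hello"),
    (2000, ""), (2500, ""),
    (3000, "Bye"), (3500, "Bye"),
]
sample_cues = frames_to_cues(sample_frames)
print(sample_cues)
```

With these sample frames, the first cue starts at 1000 ms, not 0, and there is a gap between 1500 ms and 3000 ms where no text was detected. Exact string comparison is a simplification: with noisy OCR, small differences in the recognized text would split one caption into several cues, so a real implementation would likely need a more tolerant comparison.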

With these changes applied, running the same command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
generates this bbc.srt:

1
00:00:07,342 --> 00:00:08,500
.‘7 -
Oh, no. No, I'm tired.

2
00:00:09,382 --> 00:00:10,300
Baby shower was rubbish.
I'm just going to go to bed.

3
00:00:14,382 --> 00:00:15,340
Are you OK?

4
00:00:15,342 --> 00:00:16,500
Are you OK? ‘

5
00:00:17,382 --> 00:00:22,380
All right. Well, don't stay up*
too late. You've got a lot to do.

6
00:00:22,382 --> 00:00:23,460
I ..' Night-night.

The OCR output is essentially unchanged, but the timings are now closer to the
actual timings in the media file: the first caption no longer starts at 00:00:00,
and gaps now appear between some captions.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:18:32 +00:00
Reference: starred/ccextractor#1784