[BUG] ocr.c writing outside allocated memory #573

Closed
opened 2026-01-29 16:48:03 +00:00 by claunia · 10 comments
Owner

Originally created by @dcjm on GitHub (Apr 6, 2020).

I am raising this as an issue but I do have a fix for the problem so it could be a pull request. However, this is a full explanation of the problem in case you want to try a different approach.

When running ccextractor on some ts files I found two cases where malloc was reporting memory corruption. This was tested first with the 0.88 release and then with git master. I ran valgrind and got the following:

==32409== Invalid write of size 1
==32409== at 0x483CA14: memmove (vg_replace_strmem.c:1270)
==32409== by 0x13F423: ocr_bitmap (ocr.c:671)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
==32409== Address 0x9b75967 is 0 bytes after a block of size 135 alloc'd
==32409== at 0x483577F: malloc (vg_replace_malloc.c:299)
==32409== by 0x13F07E: ocr_bitmap (ocr.c:595)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
==32409==
==32409== Invalid write of size 1
==32409== at 0x13F466: ocr_bitmap (ocr.c:679)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
==32409== Address 0x9b7596a is 3 bytes after a block of size 135 alloc'd
==32409== at 0x483577F: malloc (vg_replace_malloc.c:299)
==32409== by 0x13F07E: ocr_bitmap (ocr.c:595)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)

Adding assertions and then looking at the values with gdb showed that the assumption in the comment on line 635 is incorrect:

(gdb) print text_out
$1 = 0x5555570b91f0 "Of course. I'll do\nwhatever | can to h<font color="#ffff00">elp.\n"

(gdb) print text_out
$1 = 0x555556caa9a0 "Ford resented him.\nSo he cooked up <font color="#ffff00">a pack of lies\n"

The overflow comes about because the loop at line 650 leaves last_font_tag pointing to the start of the line but the call to strstr on line 658 sets last_font_tag_end to the closing > on the subsequent line. The fix is to add
if (last_font_tag_end > line_end) last_font_tag_end = NULL;
after line 658.

If you want to investigate this yourselves I can put the subtitle streams on a public server. There may be a better way to fix this.

Originally created by @dcjm on GitHub (Apr 6, 2020). I am raising this as an issue but I do have a fix for the problem so it could be a pull request. However, this is a full explanation of the problem in case you want to try a different approach. When running ccextractor on some ts files I found two cases where malloc was reporting memory corruption. This was tested first with the 0.88 release and then with git master. I ran valgrind and got the following: ==32409== Invalid write of size 1 ==32409== at 0x483CA14: memmove (vg_replace_strmem.c:1270) ==32409== by 0x13F423: ocr_bitmap (ocr.c:671) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) ==32409== Address 0x9b75967 is 0 bytes after a block of size 135 alloc'd ==32409== at 0x483577F: malloc (vg_replace_malloc.c:299) ==32409== by 0x13F07E: ocr_bitmap (ocr.c:595) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) ==32409== ==32409== Invalid write of size 1 ==32409== at 0x13F466: ocr_bitmap (ocr.c:679) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) ==32409== Address 0x9b7596a is 3 bytes after a block of size 135 alloc'd ==32409== at 0x483577F: malloc (vg_replace_malloc.c:299) ==32409== by 0x13F07E: ocr_bitmap (ocr.c:595) ==32409== by 0x13FEEC: ocr_rect (ocr.c:915) ==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663) ==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726) ==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832) ==32409== by 0x149BEA: process_data (general_loop.c:644) ==32409== by 0x14AB6E: general_loop (general_loop.c:1019) ==32409== by 0x137058: api_start (ccextractor.c:213) ==32409== by 0x137D7F: main (ccextractor.c:534) Adding assertions and then looking at the values with gdb showed that the assumption in the comment on line 635 is incorrect: (gdb) print text_out $1 = 0x5555570b91f0 "Of course. I'll do\nwhatever | can to h<font color=\"#ffff00\">elp.\n" (gdb) print text_out $1 = 0x555556caa9a0 "Ford resented him.\nSo he cooked up <font color=\"#ffff00\">a pack of lies\n" The overflow comes about because the loop at line 650 leaves `last_font_tag` pointing to the start of the line but the call to `strstr `on line 658 sets `last_font_tag_end `to the closing `>` on the subsequent line. The fix is to add ` if (last_font_tag_end > line_end) last_font_tag_end = NULL;` after line 658. If you want to investigate this yourselves I can put the subtitle streams on a public server. There may be a better way to fix this.
claunia added the difficulty: easyOCRHacktoberfest labels 2026-01-29 16:48:03 +00:00
Author
Owner

@cfsmp3 commented on GitHub (Apr 6, 2020):

@NilsIrl want to look into this?

@cfsmp3 commented on GitHub (Apr 6, 2020): @NilsIrl want to look into this?
Author
Owner

@NilsIrl commented on GitHub (Apr 7, 2020):

If you want to investigate this yourselves I can put the subtitle streams on a public server.

That would be very helpful.

@NilsIrl commented on GitHub (Apr 7, 2020): > If you want to investigate this yourselves I can put the subtitle streams on a public server. That would be very helpful.
Author
Owner

@dcjm commented on GitHub (Apr 7, 2020):

The streams are here. These streams contain only the subtitles. The video and audio have been stripped with ffmpeg.

Ccextractor was built on a Debian Buster (amd64) machine.
uname -a
Linux athena 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

The current version, including my fix and the added assertions is:
CCExtractor detailed version info
Version: 0.88
Git commit: 6b6f548531b7b681d5bd99a563077e6fd80a8ba9
Compilation date: 2020-04-06
File SHA256: 51eef32d5888a585bc83ffcf8f9a7de914896feb87776be9d1cea5e18c74545d
Libraries used by CCExtractor
Tesseract Version: 4.0.0
Leptonica Version: leptonica-1.76.0
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.4.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.35
FreeType
libhash
nuklear
libzvbi

@dcjm commented on GitHub (Apr 7, 2020): The streams are [here](https://www.polyml.org/download/). These streams contain only the subtitles. The video and audio have been stripped with ffmpeg. Ccextractor was built on a Debian Buster (amd64) machine. uname -a Linux athena 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux The current version, including my fix and the added assertions is: CCExtractor detailed version info Version: 0.88 Git commit: 6b6f548531b7b681d5bd99a563077e6fd80a8ba9 Compilation date: 2020-04-06 File SHA256: 51eef32d5888a585bc83ffcf8f9a7de914896feb87776be9d1cea5e18c74545d Libraries used by CCExtractor Tesseract Version: 4.0.0 Leptonica Version: leptonica-1.76.0 libGPAC Version: 0.7.2-DEV zlib: 1.2.11 utf8proc Version: 2.4.0 protobuf-c Version: 1.3.1 libpng Version: 1.6.35 FreeType libhash nuklear libzvbi
Author
Owner

@NilsIrl commented on GitHub (Apr 8, 2020):

I do have a fix for the problem

If you have already written the code needed, there's no reason to not send a pull request even if we find a better solution.

@NilsIrl commented on GitHub (Apr 8, 2020): > I do have a fix for the problem If you have already written the code needed, there's no reason to not send a pull request even if we find a better solution.
Author
Owner

@NilsIrl commented on GitHub (Apr 8, 2020):

I was not able to reproduce this on master or v0.88 with stream1.ts.

Output
$ TESSDATA_PREFIX=/nix/store/ig0wnzsbs9j7jaa0bh36ybzl99x6838p-tesseract-3.05.00/share/ valgrind ./ccextractor files/stream1.ts
==14241== Memcheck, a memory error detector
==14241== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==14241== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==14241== Command: ./ccextractor files/stream1.ts
==14241==
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: files/stream1.ts
[Extract: 1] [Stream mode: Autodetect]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: No] [Buffer input: No]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .srt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No][Filter profanity: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: No] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]

-----------------------------------------------------------------
Opening file: files/stream1.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
100%  |  117:35
Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0

Min PTS:				00:00:01:400
Max PTS:				01:57:36:679
Length:				 01:57:35:279
Done, processing time = 1084 seconds
Issues? Open a ticket here
https://github.com/CCExtractor/ccextractor/issues
==14241==
==14241== HEAP SUMMARY:
==14241==     in use at exit: 866,857 bytes in 848 blocks
==14241==   total heap usage: 74,683,343 allocs, 74,682,495 frees, 316,115,451,165 bytes allocated
==14241==
==14241== LEAK SUMMARY:
==14241==    definitely lost: 866,305 bytes in 847 blocks
==14241==    indirectly lost: 0 bytes in 0 blocks
==14241==      possibly lost: 0 bytes in 0 blocks
==14241==    still reachable: 552 bytes in 1 blocks
==14241==         suppressed: 0 bytes in 0 blocks
==14241== Rerun with --leak-check=full to see details of leaked memory
==14241==
==14241== For lists of detected and suppressed errors, rerun with: -s
==14241== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
@NilsIrl commented on GitHub (Apr 8, 2020): I was not able to reproduce this on master or v0.88 with stream1.ts. <details> <summary>Output</summary> ``` $ TESSDATA_PREFIX=/nix/store/ig0wnzsbs9j7jaa0bh36ybzl99x6838p-tesseract-3.05.00/share/ valgrind ./ccextractor files/stream1.ts ==14241== Memcheck, a memory error detector ==14241== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==14241== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==14241== Command: ./ccextractor files/stream1.ts ==14241== CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke. Teletext portions taken from Petr Kutalek's telxcc -------------------------------------------------------------------------- Input: files/stream1.ts [Extract: 1] [Stream mode: Autodetect] [Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto] [Timing mode: Auto] [Debug: No] [Buffer input: No] [Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No] [Target format: .srt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No] [Add font color data: Yes] [Add font typesetting: Yes] [Convert case: No][Filter profanity: No] [Video-edit join: No] [Extraction start time: not set (from start)] [Extraction end time: not set (to end)] [Live stream: No] [Clock frequency: 90000] [Teletext page: Autodetect] [Start credits text: None] [Quantisation-mode: CCExtractor's internal function] ----------------------------------------------------------------- Opening file: files/stream1.ts File seems to be a transport stream, enabling TS mode Analyzing data in general mode 100% | 117:35 Number of NAL_type_7: 0 Number of VCL_HRD: 0 Number of NAL HRD: 0 Number of jump-in-frames: 0 Number of num_unexpected_sei_length: 0 Min PTS: 00:00:01:400 Max PTS: 01:57:36:679 Length: 01:57:35:279 Done, processing time = 1084 seconds Issues? Open a ticket here https://github.com/CCExtractor/ccextractor/issues ==14241== ==14241== HEAP SUMMARY: ==14241== in use at exit: 866,857 bytes in 848 blocks ==14241== total heap usage: 74,683,343 allocs, 74,682,495 frees, 316,115,451,165 bytes allocated ==14241== ==14241== LEAK SUMMARY: ==14241== definitely lost: 866,305 bytes in 847 blocks ==14241== indirectly lost: 0 bytes in 0 blocks ==14241== possibly lost: 0 bytes in 0 blocks ==14241== still reachable: 552 bytes in 1 blocks ==14241== suppressed: 0 bytes in 0 blocks ==14241== Rerun with --leak-check=full to see details of leaked memory ==14241== ==14241== For lists of detected and suppressed errors, rerun with: -s ==14241== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ``` </details>
Author
Owner

@dcjm commented on GitHub (Apr 9, 2020):

Did you try stream2.ts? This seems to show the problem quite quickly. As a further check I installed Debian Buster in a new virtual machine with the minimum necessary to build ccextractor. It processed stream1.ts without a problem. I have a feeling that only failed when I added explicit bounds checking assertions. The output for stream2.ts was:

Opening file: stream2.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
malloc_consolidate(): invalid chunk size
Aborted

Stream3.ts produced

Opening file: stream3.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
double free or corruption (!prev)
Aborted

@dcjm commented on GitHub (Apr 9, 2020): Did you try stream2.ts? This seems to show the problem quite quickly. As a further check I installed Debian Buster in a new virtual machine with the minimum necessary to build ccextractor. It processed stream1.ts without a problem. I have a feeling that only failed when I added explicit bounds checking assertions. The output for stream2.ts was: ``` Opening file: stream2.ts File seems to be a transport stream, enabling TS mode Analyzing data in general mode malloc_consolidate(): invalid chunk size Aborted ``` Stream3.ts produced ``` Opening file: stream3.ts File seems to be a transport stream, enabling TS mode Analyzing data in general mode double free or corruption (!prev) Aborted ```
Author
Owner

@NilsIrl commented on GitHub (Apr 12, 2020):

Well, I've tried with stream2.ts and stream3.ts but to no avail, nothing wrong reported by valgrind on my side.

when I added explicit bounds checking assertions

Could you send a diff/patch/branch of the checks you added?

@NilsIrl commented on GitHub (Apr 12, 2020): Well, I've tried with `stream2.ts` and `stream3.ts` but to no avail, nothing wrong reported by `valgrind` on my side. > when I added explicit bounds checking assertions Could you send a diff/patch/branch of the checks you added?
Author
Owner

@cfsmp3 commented on GitHub (Apr 12, 2020):

@dcjm what exact command line did you use?

@cfsmp3 commented on GitHub (Apr 12, 2020): @dcjm what exact command line did you use?
Author
Owner

@dcjm commented on GitHub (Apr 13, 2020):

The commit with both the assertions and the fix is here. The fix is the addition of lines 660 and 661 so that needs to be removed or commented out to trigger the assertions.

The command line was simply:
ccextractor/linux/ccextractor stream2.ts

Which Linux platform do you use for testing? I may be able to run some comparisons. The issue could come down to the version of tesseract. This is what is installed on my testing virtual machine:

dpkg -a -l | grep tesser
ii  libtesseract-dev:amd64        4.0.0-2                     amd64        Development files for the tesseract command line OCR tool
ii  libtesseract4:amd64           4.0.0-2                     amd64        Tesseract OCR library
ii  tesseract-ocr                 4.0.0-2                     amd64        Tesseract command line OCR tool
ii  tesseract-ocr-eng             1:4.00~git30-7274cfa-1      all          tesseract-ocr language files for English
ii  tesseract-ocr-osd             1:4.00~git30-7274cfa-1      all          tesseract-ocr language files for script and orientation
@dcjm commented on GitHub (Apr 13, 2020): The commit with both the assertions and the fix is [here](https://github.com/dcjm/ccextractor/commit/6b6f548531b7b681d5bd99a563077e6fd80a8ba9). The fix is the addition of lines 660 and 661 so that needs to be removed or commented out to trigger the assertions. The command line was simply: `ccextractor/linux/ccextractor stream2.ts` Which Linux platform do you use for testing? I may be able to run some comparisons. The issue could come down to the version of tesseract. This is what is installed on my testing virtual machine: ``` dpkg -a -l | grep tesser ii libtesseract-dev:amd64 4.0.0-2 amd64 Development files for the tesseract command line OCR tool ii libtesseract4:amd64 4.0.0-2 amd64 Tesseract OCR library ii tesseract-ocr 4.0.0-2 amd64 Tesseract command line OCR tool ii tesseract-ocr-eng 1:4.00~git30-7274cfa-1 all tesseract-ocr language files for English ii tesseract-ocr-osd 1:4.00~git30-7274cfa-1 all tesseract-ocr language files for script and orientation ```
Author
Owner

@cfsmp3 commented on GitHub (Mar 22, 2023):

I couldn't reproduce (and I spent quite a bit of time today with those samples because they showed two memory leaks that took a bit to fix). Anyway, I've added those two lines.

Closing.

@cfsmp3 commented on GitHub (Mar 22, 2023): I couldn't reproduce (and I spent quite a bit of time today with those samples because they showed two memory leaks that took a bit to fix). Anyway, I've added those two lines. Closing.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/ccextractor#573