[PR #1671] [FIX] Issue#1665 Enhanced Matroska Language Tag Handling

**Original Pull Request:** https://github.com/CCExtractor/ccextractor/pull/1671

**State:** closed
**Merged:** Yes

---

**In raising this pull request, I confirm the following:**

- [x] I have read and understood the [contributors guide](https://github.com/CCExtractor/ccextractor/blob/master/.github/CONTRIBUTING.md).
- [x] I have checked that another pull request for this purpose does not exist.
- [x] I have considered, and confirmed that this submission will be valuable to others.
- [x] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
- [x] I give this submission freely, and claim no ownership to its content.
- [x] **I have mentioned this change in the [changelog](https://github.com/CCExtractor/ccextractor/blob/master/docs/CHANGES.TXT).**

**My familiarity with the project is as follows:**

- [ ] I have never used CCExtractor.
- [ ] I have used CCExtractor just a couple of times.
- [ ] I absolutely love CCExtractor, but have not contributed previously.
- [x] I am an active contributor to CCExtractor.

---

## Description

This PR improves the handling of language tags in the Matroska parser. It addresses the issue reported in #1665, where IETF BCP47 language tags (e.g., "en-US") were not processed correctly, which could lead to segmentation faults and inaccurate subtitle extraction.

### The Initial Problem: Modern MKV Files and IETF Language Tags

Modern Matroska (MKV) files increasingly use IETF BCP47 language tags to identify subtitle tracks. These tags offer more precision than the traditional 3-letter ISO 639-2 codes, allowing for specification of regional variations, scripts, and other linguistic details (e.g., `en-GB` for British English, `es-MX` for Mexican Spanish).

The existing parser was primarily designed for the older 3-letter codes and did not fully account for the presence of these IETF tags. As a result, it failed to identify and use them correctly, leading to issues such as:

* **Incorrect Language Identification:** Subtitle tracks with IETF tags might not be recognized or might be misidentified.
* **Filename Generation Errors:** Output filenames might not accurately reflect the language of the subtitle track.
* **Matching Failures:** Users might not be able to select specific language tracks using command-line options if those tracks were identified using IETF tags.
* **Segmentation Faults:** In certain scenarios, the lack of proper handling could lead to segmentation faults due to accessing uninitialized memory.

## Summary of Changes

- **Corrected IETF Language Tag Storage:** Added `sub_track->lang_ietf = lang_ietf;` during subtitle track creation to ensure IETF language tags are properly stored in the `matroska_sub_track` structure.
- **Intelligent Filename Generation:** Modified `generate_filename_from_track()` to prioritize IETF language tags when available, creating more descriptive and accurate filenames.
- **Improved Language Matching:** Enhanced `matroska_save_all()` to first attempt matching against IETF language tags before falling back to 3-letter ISO 639 codes, improving language selection accuracy.
- **Robust Memory Management:** Ensured proper allocation, assignment, and freeing of the `lang_ietf` field to prevent memory leaks and segmentation faults.

## This enhancement is crucial for:

- **Modern Standards Compliance:** Supporting IETF BCP47 language tags, the modern standard for language identification.
- **Improved Accuracy:** Enabling more precise language identification, including regional variants and dialects.
- **Increased Compatibility:** Ensuring correct processing of Matroska files that utilize extended language tags.

## How Has This Been Tested?

- [x] Tested with various Matroska files containing both 3-letter language codes and IETF language tags.
- [x] Verified correct subtitle extraction and filename generation for different language variants.
- [x] Confirmed no memory leaks or segmentation faults occur during parsing.

Thank you,
Tank0nf.
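
For reference, the storage, IETF-first fallback, and cleanup listed under Summary of Changes can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `sub_track_sketch`, `track_new`, `track_free`, `track_preferred_lang`, `track_matches_lang`, and `track_filename` are hypothetical names, and the `_<lang>.srt` filename pattern, the exact-match `strcmp` comparison, and the `und` fallback are assumptions made only for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for matroska_sub_track: only the two
   language fields discussed in this PR are modelled. */
struct sub_track_sketch
{
	char *lang;      /* 3-letter ISO 639-2 code, e.g. "eng" */
	char *lang_ietf; /* IETF BCP47 tag, e.g. "en-US"; NULL when absent */
};

/* Heap-copy a string; returns NULL for NULL input or allocation failure. */
char *dup_or_null(const char *s)
{
	if (s == NULL)
		return NULL;
	size_t n = strlen(s) + 1;
	char *copy = malloc(n);
	if (copy != NULL)
		memcpy(copy, s, n);
	return copy;
}

/* Create a track. calloc zeroes both fields, so lang_ietf is a well-defined
   NULL when the file carries no IETF tag (never uninitialised memory). */
struct sub_track_sketch *track_new(const char *lang, const char *lang_ietf)
{
	struct sub_track_sketch *t = calloc(1, sizeof(*t));
	if (t == NULL)
		return NULL;
	t->lang = dup_or_null(lang);
	t->lang_ietf = dup_or_null(lang_ietf);
	return t;
}

/* Release both language strings together with the track itself. */
void track_free(struct sub_track_sketch *t)
{
	if (t == NULL)
		return;
	free(t->lang);
	free(t->lang_ietf);
	free(t);
}

/* Prefer the IETF tag, fall back to the 3-letter code, then to "und"
   (the ISO 639-2 code for an undetermined language; assumed fallback). */
const char *track_preferred_lang(const struct sub_track_sketch *t)
{
	assert(t != NULL);
	if (t->lang_ietf != NULL && t->lang_ietf[0] != '\0')
		return t->lang_ietf;
	if (t->lang != NULL && t->lang[0] != '\0')
		return t->lang;
	return "und";
}

/* Try the IETF tag first, then the 3-letter code. Returns 1 on a match. */
int track_matches_lang(const struct sub_track_sketch *t, const char *requested)
{
	assert(t != NULL && requested != NULL);
	if (t->lang_ietf != NULL && strcmp(t->lang_ietf, requested) == 0)
		return 1;
	return t->lang != NULL && strcmp(t->lang, requested) == 0;
}

/* Write "<basename>_<lang>.srt" into out. Returns 1 if it fit, 0 otherwise. */
int track_filename(const struct sub_track_sketch *t, const char *basename,
		   char *out, size_t out_size)
{
	assert(basename != NULL && out != NULL && out_size > 0);
	int n = snprintf(out, out_size, "%s_%s.srt", basename, track_preferred_lang(t));
	return n >= 0 && (size_t)n < out_size;
}
```

With these helpers, a track created as `track_new("eng", "en-US")` is matched by a request for either `en-US` or `eng`, while a track created with a `NULL` IETF tag simply falls through to its 3-letter code. Each track owns its own copies of both strings, so `track_free` can release them unconditionally.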

claunia commented

2026-01-29 17:21:50 +00:00

Owner

claunia added the pull-request label 2026-01-29 17:21:50 +00:00
