[PR #651] [MERGED] Improve SBS #1498

opened 2026-01-29 17:16:47 +00:00 by claunia · 0 comments

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/651
Author: @maxkoryukov
Created: 1/15/2017
Status: Merged
Merged: 1/17/2017
Merged by: @cfsmp3

Base: master ← Head: fix/sbs-rebased-1


📝 Commits (10+)

  • 5c2d695 Fixed format specifiers for debug output
  • 5404108 Some improvements for test-environment
  • 7c9ffbb Levenshtein for char * in utility.c
  • 1b1a572 SBS: use Levenshtein distance to detect duplicates in subs
  • c582175 Wrap debug instructions in #ifdef
  • f23beab Fix error with uninitialed sbs_handled_len. Free sbs_buffer on dinit_encoder_context
  • ad7b141 Tiny fixes
  • 93e407f Improve SBS: fix for #639 and non-gready similarity detection
  • b5b2a7d Probably fix the maxkoryukov/ccextractor#1 : split to sentences
  • 566d128 Remove SBS stuff from decoder_init

📊 Changes

13 files changed (+814 additions, -254 deletions)

📝 .gitignore (+5 -0)
📝 src/lib_ccx/ccx_decoders_common.c (+0 -6)
📝 src/lib_ccx/ccx_encoders_common.c (+3 -6)
📝 src/lib_ccx/ccx_encoders_common.h (+1 -9)
📝 src/lib_ccx/ccx_encoders_splitbysentence.c (+461 -136)
📝 src/lib_ccx/utility.c (+21 -0)
📝 src/lib_ccx/utility.h (+2 -0)
📝 tests/Makefile (+2 -3)
📝 tests/README.md (+7 -7)
📝 tests/ccx_encoders_splitbysentence_suite.c (+210 -87)
📝 tests/runtest.c (+5 -0)
➕ tests/samples/sbs_append_string_00 (+66 -0)
➕ tests/samples/sbs_append_string_01 (+31 -0)

📄 Description

In this PR:

  • use Levenshtein distance when joining subs into sentences (this makes it possible to fix errors)
  • SBS tests are moved to separate files
  • SBS buffer removed from encoder_context; SBS now uses its own buffers and structures

The code is still dirty, but it already works well enough.
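To illustrate the first point, here is a minimal sketch of how an edit-distance check could flag two subs as near-duplicates. It is not the code added to utility.c in this PR: the names `sketch_levenshtein` and `sketch_is_near_duplicate`, and the `max_errors` threshold, are assumptions made for this example only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT the function from utility.c.
 * Returns the Levenshtein (edit) distance between two NUL-terminated
 * strings, or -1 if memory allocation fails. */
int sketch_levenshtein(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    size_t la = strlen(a);
    size_t lb = strlen(b);

    /* One rolling row of the DP table: row[j] holds the distance
     * between the first i chars of a and the first j chars of b. */
    int *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return -1;

    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        int diag = row[0];          /* previous row, column j-1 */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int up = row[j];        /* previous row, column j */
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int best = diag + cost;             /* substitution or match */
            if (up + 1 < best)
                best = up + 1;                  /* deletion */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;          /* insertion */
            row[j] = best;
            diag = up;
        }
    }

    int distance = row[lb];
    free(row);
    return distance;
}

/* Hypothetical duplicate check: treat `next` as a repeat of `prev` when
 * at most max_errors single-character edits separate them.
 * The threshold is an assumption, not a value taken from the PR. */
int sketch_is_near_duplicate(const char *prev, const char *next, int max_errors)
{
    int d = sketch_levenshtein(prev, next);
    return d >= 0 && d <= max_errors;
}
```

With a check like this, a sub that repeats the previous one with a single misrecognized character (for example "Hel1o world" after "Hello world") can be treated as a duplicate instead of being appended to the sentence a second time.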

Important notice

SBS is color-blind, so <font> tags cause errors in the resulting output (the closing </font> will be placed in the next sub). Please run ccextractor -sbs only together with -nodvbcolor.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:16:47 +00:00
Reference: starred/ccextractor#1498