[PR #5946] [MERGED] tools: add a powershell script to generate CPWD from the UCD #26553

Open
opened 2026-01-31 09:16:47 +00:00 by claunia · 0 comments

📋 Pull Request Information

Original PR: https://github.com/microsoft/terminal/pull/5946
Author: @DHowett
Created: 5/17/2020
Status: Merged
Merged: 6/3/2020
Merged by: @undefined

Base: master ← Head: dev/duhowett/cpwd_from_powershell


📝 Commits (10+)

  • cc4a68a tools: add a powershell script to generate CPWD from the UCD
  • 448d698 fix tabs
  • 3e5bf5f lint the script
  • e97fc6a spell, fix
  • c958298 Add overrides support
  • 593439f rework to use UnicodeRange class
  • 7a482b6 cleanup: rangelist is a class, remove $last
  • d8470b3 Require PS 7
  • a481fd3 slightly clarify
  • 8a804b0 Merge remote-tracking branch 'origin/master' into ttt

📊 Changes

3 files changed (+280 additions, -1 deletions)

📝 .github/actions/spell-check/dictionary/apis.txt (+2 -1)
📝 .github/actions/spell-check/expect/expect.txt (+4 -0)
➕ tools/Generate-CodepointWidthsFromUCD.ps1 (+274 -0)

📄 Description

This commit introduces Generate-CodepointWidthsFromUCD, a PowerShell
(7+) script that parses a UCD XML database in the UAX #42 format from
https://www.unicode.org/Public/UCD/latest/ucdxml/ and generates
CodepointWidthDetector's giant width array.

By default, it emits one UnicodeRange for every range of non-narrow
glyphs with a different Width + Emoji + Emoji Presentation class;
however, it can also be run in "packing" and "full" modes.

  • Packing mode: ignore the width/emoji/presentation class and combine
    adjacent runs that CPWD will treat the same.
    • This reduces the number of individual ranges emitted into code.
  • Full mode: include narrow codepoints (helpful for visualization).
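
The packing step described above can be sketched roughly as follows. This is a minimal illustration in C++ rather than the script's actual PowerShell, and it assumes that "treated the same" means "has the same final width". The `Run`, `PackedRun`, and `pack` names are hypothetical and do not come from the PR.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the script's internal data (not the real types).
enum class Width { Narrow, Ambiguous, Wide };

// One run as default mode might see it: width plus extra class information.
struct Run {
    std::uint32_t first; // inclusive
    std::uint32_t last;  // inclusive
    Width width;
    bool emoji;          // assumed extra class that packing ignores
};

// A packed run keeps only what the consumer is assumed to care about.
struct PackedRun {
    std::uint32_t first;
    std::uint32_t last;
    Width width;
};

// Merge neighbouring runs that are contiguous and share a width.
// The input must be sorted by `first` and must not overlap.
std::vector<PackedRun> pack(const std::vector<Run>& runs) {
    std::vector<PackedRun> out;
    for (const Run& r : runs) {
        assert(r.first <= r.last);
        const bool canExtend = !out.empty() &&
                               out.back().width == r.width &&
                               out.back().last + 1 == r.first;
        if (canExtend) {
            out.back().last = r.last; // grow the previous run
        } else {
            out.push_back({ r.first, r.last, r.width });
        }
    }
    return out;
}
```

Runs separated by a gap are deliberately not merged: outside full mode the gap stands for narrow codepoints, so joining across it would change their width.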

It also supports overrides, provided in an XML document of the same format
as the UCD itself. Entries in the overrides file are applied after the
entire UCD is read and replace any ranges they affect.
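
One way to picture "replace any impacted ranges" is to split every base range that an override touches and then insert the override itself. The sketch below does this in C++ for a single override; the `Range` type and `applyOverride` function are hypothetical, and the real script may implement the step quite differently.

```c++
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only.
enum class Width { Narrow, Ambiguous, Wide };

struct Range {
    std::uint32_t first; // inclusive
    std::uint32_t last;  // inclusive
    Width width;
};

// Return a copy of `base` in which every codepoint covered by `ov`
// takes its width from `ov`. Base ranges that straddle the override
// are trimmed or split so that nothing overlaps.
std::vector<Range> applyOverride(const std::vector<Range>& base, const Range& ov) {
    assert(ov.first <= ov.last);
    std::vector<Range> out;
    for (const Range& r : base) {
        if (r.last < ov.first || r.first > ov.last) {
            out.push_back(r); // untouched by the override
            continue;
        }
        if (r.first < ov.first) {
            out.push_back({ r.first, ov.first - 1, r.width }); // left remainder
        }
        if (r.last > ov.last) {
            out.push_back({ ov.last + 1, r.last, r.width });   // right remainder
        }
    }
    out.push_back(ov);
    std::sort(out.begin(), out.end(),
              [](const Range& a, const Range& b) { return a.first < b.first; });
    return out;
}
```

For example, applying a Narrow override for 0x14..0x17 to a single Wide range 0x10..0x1f yields three ranges: Wide 0x10..0x13, Narrow 0x14..0x17, and Wide 0x18..0x1f.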

The output (when packing) looks like this:

// Generated by Generate-CodepointWidthsFromUCD -Pack:True -Full:False
// on 05/17/2020 02:47:55 (UTC) from Unicode 13.0.0.
// 66182 (0x10286) codepoints covered.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
    .
    .
    .
    UnicodeRange{ 0x1f210, 0x1f23b, CodepointWidth::Wide },
    UnicodeRange{ 0x1f37e, 0x1f393, CodepointWidth::Wide },
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};
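
As a rough illustration of how a sorted table of this shape could be consumed, the sketch below does a binary search over a three-entry excerpt of the sample above. The `UnicodeRange` field names, the `s_sampleTable` excerpt, and `lookupWidth` are assumptions made for this example, not CodepointWidthDetector's actual definitions. Treating unlisted codepoints as narrow is inferred from the table only containing non-narrow ranges when full mode is off.

```c++
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Assumed shapes; the real declarations may differ.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct UnicodeRange {
    std::uint32_t lowerBound; // inclusive
    std::uint32_t upperBound; // inclusive
    CodepointWidth width;
};

// The first three entries of the sample output, sorted and non-overlapping.
static constexpr std::array<UnicodeRange, 3> s_sampleTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa4, 0xa4, CodepointWidth::Ambiguous },
    UnicodeRange{ 0xa7, 0xa8, CodepointWidth::Ambiguous },
};

// Find the first range whose upper bound is >= cp, then check that
// cp is not below that range's lower bound.
CodepointWidth lookupWidth(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    const auto it = std::lower_bound(
        s_sampleTable.begin(), s_sampleTable.end(), cp,
        [](const UnicodeRange& r, std::uint32_t value) { return r.upperBound < value; });
    if (it != s_sampleTable.end() && it->lowerBound <= cp) {
        return it->width;
    }
    return CodepointWidth::Narrow; // assumed default for unlisted codepoints
}
```

Fewer ranges mean fewer search steps and a smaller table, which is the practical benefit of packing mode.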

The output (when overriding) looks like this:

// Generated by Generate-CodepointWidthsFromUCD.ps1 -Pack:True -Full:False -NoOverrides:False
// on 5/22/2020 11:17:39 PM (UTC) from Unicode 13.0.0.
// 321205 (0x4E6B5) codepoints covered.
// 240 (0xF0) codepoints overridden.
static constexpr std::array<UnicodeRange, 23> s_wideAndAmbiguousTable{
    UnicodeRange{ 0xa1, 0xa1, CodepointWidth::Ambiguous },
    ...
    UnicodeRange{ 0xfe20, 0xfe2f, CodepointWidth::Narrow }, // narrow combining ligatures (split into left/right halves, which take 2 columns together)
    ...
    UnicodeRange{ 0x100000, 0x10fffd, CodepointWidth::Ambiguous },
};

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-31 09:16:47 +00:00
Reference: starred/terminal#26553