Skip to content

Calculate the display width of Unicode strings

Notifications You must be signed in to change notification settings

CyberZHG/UnicodeWidthApproximation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Unicode Width Approximation

Unicode 17.0.0 C++ Unit Tests Python Build & Test WASM Build & Test Coverage Status

A library for calculating the display width of Unicode strings in terminal/monospace environments.

Available for C++, Python, and JavaScript/WebAssembly.

Features

  • Full compliance with Unicode 17.0 character properties
  • Width calculation based on East Asian Width, General Category, and Emoji properties
  • Grapheme cluster aware: emoji sequences, combining marks, etc. are handled correctly

Width Rules

Character Type Width Examples
Control characters (Cc) 0 \x00, \n, \t
Format characters (Cf) 0 ZWJ (U+200D), ZWNJ (U+200C)
Combining marks (Mn, Me) 0 Combining accents
East Asian Wide (W) 2 CJK ideographs
East Asian Fullwidth (F) 2 Fullwidth ASCII
Emoji_Presentation 2 🐱, πŸ‡¨πŸ‡³, πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦
Other characters 1 ASCII, Latin, etc.

Examples

Input Width Explanation
"hello" 5 5 ASCII characters
"δΈ­ζ–‡" 4 2 CJK characters Γ— 2
"cafΓ©" 4 4 characters (Γ© is precomposed)
"cafe\u0301" 4 "cafe" + combining acute = 4 (combining mark is 0 width)
"πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦" 2 Family emoji (1 grapheme cluster)
"πŸ‡¨πŸ‡³" 2 Flag emoji (1 grapheme cluster)
"οΌ‘" (fullwidth) 2 Fullwidth Latin A

Installation

Python

pip install unicode-width-approximation

JavaScript (NPM)

npm install unicode-width-approximation

C++ (CMake)

Add as a subdirectory or use FetchContent:

include(FetchContent)
FetchContent_Declare(
    UnicodeWidthApproximation
    GIT_REPOSITORY https://github.com/CyberZHG/UnicodeWidthApproximation.git
    GIT_TAG main
)
FetchContent_MakeAvailable(UnicodeWidthApproximation)

target_link_libraries(your_target PRIVATE UnicodeWidthApproximation)

Usage

Python

from unicode_width_approximation import get_string_width, get_codepoint_width

# Get width of strings
print(get_string_width("hello"))      # 5
print(get_string_width("δΈ­ζ–‡"))        # 4
print(get_string_width("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"))  # 2

# Get width of single code points
print(get_codepoint_width(ord('A')))  # 1
print(get_codepoint_width(0x4E00))    # 2 (CJK)
print(get_codepoint_width(0x1F600))   # 2 (emoji)

JavaScript / TypeScript

import {
    getStringWidth,
    getCodepointWidth
} from "unicode-width-approximation";

// Get width of strings
console.log(getStringWidth("hello"));      // 5
console.log(getStringWidth("δΈ­ζ–‡"));        // 4
console.log(getStringWidth("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"));  // 2

// Get width of single code points
console.log(getCodepointWidth(0x41));    // 1 (ASCII 'A')
console.log(getCodepointWidth(0x4E00));  // 2 (CJK)
console.log(getCodepointWidth(0x1F600)); // 2 (emoji)

C++

#include "unicode_width.h"
#include <iostream>

int main() {
    // Get width of a string
    std::cout << unicode_width::getStringWidth("hello") << std::endl;      // 5
    std::cout << unicode_width::getStringWidth("δΈ­ζ–‡") << std::endl;        // 4
    std::cout << unicode_width::getStringWidth("πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦") << std::endl;  // 2
    std::cout << unicode_width::getStringWidth("πŸ‡¨πŸ‡³") << std::endl;        // 2

    // Get width of a single code point
    std::cout << unicode_width::getCodepointWidth('A') << std::endl;       // 1
    std::cout << unicode_width::getCodepointWidth(0x4E00) << std::endl;    // 2 (CJK)
    std::cout << unicode_width::getCodepointWidth(0x1F600) << std::endl;   // 2 (emoji)
    std::cout << unicode_width::getCodepointWidth(0x0300) << std::endl;    // 0 (combining)

    // Check character properties
    std::cout << unicode_width::isWideChar(0x4E00) << std::endl;           // true
    std::cout << unicode_width::isZeroWidth(0x200D) << std::endl;          // true

    return 0;
}

API Reference

getStringWidth(s) / get_string_width(s)

Calculates the total display width of a UTF-8 encoded string.

Parameters:

  • s - The input UTF-8 encoded string

Returns:

  • The total display width in columns

Note: This function uses grapheme cluster segmentation to correctly handle combining marks and emoji sequences. The width is determined by the first code point of each grapheme cluster.

getCodepointWidth(code) / get_codepoint_width(code)

Gets the display width of a single Unicode code point.

Parameters:

  • code - The Unicode code point

Returns:

  • 0 for zero-width characters (control, format, combining marks)
  • 2 for wide characters (CJK, fullwidth, emoji)
  • 1 for all other characters

isWideChar(code) / is_wide_char(code)

Checks if a code point is a wide character (East Asian Wide or Fullwidth).

Parameters:

  • code - The Unicode code point

Returns:

  • true if the character has width 2

isZeroWidth(code) / is_zero_width(code)

Checks if a code point is a zero-width character.

Parameters:

  • code - The Unicode code point

Returns:

  • true if the character has width 0

Building from Source

Prerequisites

  • CMake 4.0+
  • C++20 compatible compiler
  • Python 3.x (for generating Unicode data tables)
  • (Optional) Python 3.8+ with pybind11 for Python bindings
  • (Optional) Emscripten for WebAssembly bindings

Build Commands

# C++ library only
cmake -B build
cmake --build build

# With tests
cmake -B build -DUNICODE_WIDTH_APPROXIMATION_ENABLE_TESTS=ON
cmake --build build
ctest --test-dir build

# Python bindings (via pip)
pip install .

# WebAssembly bindings
cd wasm
npm run build

CMake Options

Option Default Description
UNICODE_WIDTH_APPROXIMATION_ENABLE_TESTS OFF Build unit tests
UNICODE_WIDTH_APPROXIMATION_ENABLE_COVERAGE OFF Enable code coverage
UNICODE_WIDTH_APPROXIMATION_ENABLE_STRICT OFF Enable strict compiler warnings
UNICODE_WIDTH_APPROXIMATION_BIND_PYTHON OFF Build Python bindings
UNICODE_WIDTH_APPROXIMATION_BIND_ES OFF Build WebAssembly bindings

Data Sources

This library uses the following Unicode Character Database files:

  • EastAsianWidth.txt - East Asian Width property
  • DerivedGeneralCategory.txt - General Category property
  • emoji-data.txt - Emoji properties

Notes on Text-Default Emoji

Some emoji characters (like © U+00A9 and ❀ U+2764) are "text-default", meaning they display as text by default and only appear as emoji when followed by VS16 (U+FE0F). This library returns width 1 for these characters when they appear alone. When combined with VS16 in a grapheme cluster, the cluster is treated as a single emoji with width 2.

License

MIT License

Links

About

Calculate the display width of Unicode strings

Topics

Resources

Stars

Watchers

Forks