Accumulation of URLs about Unicode


Contents: Unicode standard Unicode general informations U+ notation, Unicode escape sequence Security title Segmentation, Grapheme Normalization, equivalence Character set String matching - Lower vs Casefold String matching - Collation Locale CLDR Common Locale Data Repository Case mappings Collation, sorting BIDI title Emoji Countries, flags Evidence of partial or wrong support of Unicode Optimization, SIMD Variation sequence Whitespaces, separators Hyphenation DNS title, Domain Name title, Domain Name System title All languages Classical languages Arabic language Indic languages CJK Korean Japanese Polish IME - Input Method Editor Text editing Text rendering, Text shaping library String Matching Fuzzy String Matching Levenshtein distance and string similarity String comparison JSON TOML serialization format CBOR Concise Binary Representation Binary encoding in Unicode Invalid format Mojibake Filenames WTF8 Codepoint/grapheme indexation Rope Encoding title ICU title ICU demos ICU bindings ICU4X title utf8proc title Twitter text parsing terminal / console / cmd QT Title IBM OS IBM RPG Lang IBM z/OS macOS OS Windows OS Language comparison Regular expressions Test cases, test-cases, tests files font bold, italic, strikethrough, underline, backwards, upside down youtube xxx lang Ada lang Awk lang C++ lang, cpp lang, Boost cRexx lang DotNet, CoreFx Dafny lang Dart lang Elixir lang Factor lang Fortran lang GO lang jRuby lang Java lang JavaScript lang Julia lang Kotlin lang Lisp lang Mathematica lang netrexx lang Oracle Perl lang (Perl 6 has been renamed to Raku) PHP lang Python lang R lang RAKU lang Rakudo lang (Perl6, Perl 6, MOAR-VM) Rexx lang Ruby lang Rust lang Saxon lang SQL lang Swift lang Typst lang XPath lang Zig lang, Ziglyph Knock, knock.

Unicode standard


Remember Don't know why, but the Unicode consortium has 2 different URLs: https://unicode.org/ https://www.unicode.org/ To avoid doubling URLs, I use the 2nd form. https://home.unicode.org/ https://www.unicode.org/ (same as home.unicode.org) https://www.unicode.org/versions/ https://www.unicode.org/versions/latest/ (latest version) https://www.unicode.org/versions/enumeratedversions.html (current and previous versions) https://www.unicode.org/Public/ (datas for current and previous versions) https://www.unicode.org/ucd/ UCD = Unicode Character Database https://www.unicode.org/Public/MAPPINGS (ISO8859) These tables are considered to be authoritative mappings between the Unicode Standard and different parts of the ISO/IEC 8859 standard. https://www.unicode.org/faq/specifications.html https://www.unicode.org/reports/ Unicode® Technical Reports A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR. Unicode Standard Annex (UAX) UAX #9, The Unicode Bidirectional Algorithm https://www.unicode.org/reports/tr9/ UAX #11, East Asian Width https://www.unicode.org/reports/tr11/ UAX #14, Unicode Line Breaking Algorithm https://www.unicode.org/reports/tr14/ UAX #15, Unicode Normalization Forms https://www.unicode.org/reports/tr15/ UAX #24, Unicode Script Property https://www.unicode.org/reports/tr24/ UAX #29, Unicode Text Segmentation https://www.unicode.org/reports/tr29/ UAX #31, Unicode Identifier and Pattern Syntax https://www.unicode.org/reports/tr31/ UAX #34, Unicode Named Character Sequences https://www.unicode.org/reports/tr34/ UAX #38, Unicode Han Database (Unihan) https://www.unicode.org/reports/tr38/ UAX #41, Common References for Unicode Standard Annexes https://www.unicode.org/reports/tr41/ UAX #42, Unicode Character Database in XML https://www.unicode.org/reports/tr42/ UAX #44, Unicode Character Database https://www.unicode.org/reports/tr44/ UAX #45, U-Source Ideographs https://www.unicode.org/reports/tr45/ UAX #50, Unicode Vertical Text Layout https://www.unicode.org/reports/tr50/ Unicode Technical Standard (UTS) UTS #22, UNICODE CHARACTER MAPPING MARKUP LANGUAGE (CharMapML) https://www.unicode.org/reports/tr22/ This document specifies an XML format for the interchange of mapping data for character encodings, and describes some of the issues connected with the use of character conversion. https://www.unicode.org/glossary Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF. Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set. Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. Unicode Scalar Value. 
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF inclusive.
UNICODE COLLATION ALGORITHM
Unicode has an official string collation algorithm called UCA
https://www.unicode.org/reports/tr10/
https://www.unicode.org/reports/tr10/#S2.1.1 The Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.
08/06/2021 Default Unicode Collation Element Table (DUCET) For the latest version, see: https://www.unicode.org/Public/UCA/latest/allkeys.txt
---
UTS10-D1. Collation Weight: A non-negative integer used in the UCA to establish a means for systematic comparison of constructed sort keys.
UTS10-D2. Collation Element: An ordered list of collation weights.
UTS10-D3. Collation Level: The position of a collation weight in a collation element.
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms UNICODE NORMALIZATION FORMS
https://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax
https://www.unicode.org/reports/tr31/ UNICODE IDENTIFIER AND PATTERN SYNTAX
jlf: there is ONE (just ONE) occurrence of NFKC_CF: Comparison and matching should be done after converting to NFKC_CF format. Thus #MötleyCrüe should match #MÖTLEYCRÜE and other variants.
---
In the Unicode Standard PDF:
- The mapping NFKC_Casefold (short alias NFKC_CF) is specified in the data file DerivedNormalizationProps.txt in the Unicode Character Database.
- The derived binary property Changes_When_NFKC_Casefolded is also listed in the data file DerivedNormalizationProps.txt in the Unicode Character Database.
Conformance, §3.13 Default Case Algorithms (p. 156): For more information on the use of NFKC_Casefold and caseless matching for identifiers, see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax”.
https://www.unicode.org/reports/tr51/ Unicode emoji 23/05/2021
https://www.unicode.org/notes/tn28/ UNICODEMATH, A NEARLY PLAIN-TEXT ENCODING OF MATHEMATICS
Example expressions: abc/d, a + c/d, and the binomial theorem (a + b)^n = ∑_k (n choose k) a^k b^(n−k).
https://www.unicode.org/notes/tn5/ Unicode Technical Note #5 CANONICAL EQUIVALENCE IN APPLICATIONS
https://icu.unicode.org/design/normalizing-to-shortest-form Canonically Equivalent Shortest Form (CESF) This is usually, but not always, the NFC form.
Conformance
https://github.com/unicode-org/conformance This repository provides tools and procedures for verifying that an implementation is working correctly according to the data-based specifications. The tests are implemented on several platforms including NodeJS (JavaScript), ICU4X (RUST), ICU4C, etc. Data Driven Test was initiated in 2022 at Google. The first release of the package was delivered in October, 2022.
https://www.unicode.org/main.html Unicode® Technical Site
https://www.unicode.org/faq/
https://www.unicode.org/faq/char_combmark.html Characters and Combining Marks
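A minimal Python 3 sketch of the code point / code unit distinction defined in the glossary entries above (the sample string is my own, not taken from the glossary):

s = "a\u00E9\u20AC\U0001F600"          # 'a', 'é', '€', '😀'
print(len(s))                           # 4 code points (Python strings are sequences of code points)
print(len(s.encode("utf-8")))           # 10 8-bit code units: 1 + 2 + 3 + 4 bytes
print(len(s.encode("utf-16-le")) // 2)  # 5 16-bit code units: the emoji needs a surrogate pair
print(len(s.encode("utf-32-le")) // 4)  # 4 32-bit code units: one per code point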

Unicode general information


https://codepoints.net/ Very detailed description of each character. Source of the WEB site: https://github.com/Codepoints/codepoints.net
https://util.unicode.org/UnicodeJsps/ A lot of information about a character.
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/UTF-32
http://xahlee.info/comp/unicode_index.html
http://xahlee.info/comp/unicode_invert_text.html Inverted text: :ʇxǝʇ pǝʇɹǝʌuI
http://xahlee.info/comp/unicode_animals.html T-REXX: 🦖
https://www.fontspace.com/unicode/analyzer
https://www.compart.com/en/unicode/
22/05/2021 https://onlineunicodetools.com/ Online Unicode tools is a collection of useful browser-based utilities for manipulating Unicode text.
28/05/2021 https://unicode.scarfboy.com/ Search tool. Provides plenty of information about Unicode characters but not the UTF-16 encoding.
https://unicode-table.com/en/ Search by name. Provides the UTF-16 encoding.
https://www.minaret.info/test/menu.msp Minaret Unicode Tests: Case Folding, Character Type, Collation, Normalization, Sorting, Transliteration
https://www.gosecure.net/blog/2020/08/04/unicode-for-security-professionals/ Unicode for Security Professionals by Philippe Arteau | Aug 4, 2020 jlf: this article covers many of the Unicode characteristics
https://github.com/bits/UTF-8-Unicode-Test-Documents Every Unicode character / codepoint in files and a file generator
http://www.ltg.ed.ac.uk/~richard/utf-8.html Lets you convert UTF-8 to codepoint + symbolic name
https://blog.lunatech.com/posts/2009-02-03-what-every-web-developer-must-know-about-url-encoding
https://mothereff.in/utf-8 UTF-8 encoder/decoder
https://corp.unicode.org/pipermail/unicode/ The Unicode Archives January 2, 2014 - current
https://www.unicode.org/mail-arch/unicode-ml/ March 21, 2001 - April 2, 2020
https://www.unicode.org/mail-arch/unicode-ml/Archives-Old/ October 11, 1994 - March 19, 2001
https://www.unicode.org/search/ Search Unicode.org
https://www.w3.org/TR/charmod/ Character Model for the World Wide Web 1.0: Fundamentals
https://www.johndcook.com/blog/2021/11/01/number-sets-html/ Number sets in HTML and Unicode: ℕ U+2115, ℤ U+2124, ℚ U+211A, ℝ U+211D, ℂ U+2102, ℍ U+210D
https://gregtatum.com/writing/2021/encoding-text-utf-32-utf-16-unicode/
https://gregtatum.com/writing/2021/encoding-text-utf-8-unicode/
https://lwn.net/Articles/667669/ Is the current Unicode design impractical? jlf: this link is also in the section Raku Lang because it's about Perl6. jlf: worth reading.
https://www.sciencedirect.com/science/article/pii/S1742287613000595 Unicode search of dirty data. This paper discusses problems arising in digital forensics with regard to Unicode, character encodings, and search. It describes how multipattern search can handle the different text encodings encountered in digital forensics and a number of issues pertaining to proper handling of Unicode in search patterns. Finally, we demonstrate the feasibility of the approach and discuss the integration of our developed search engine, lightgrep, with the popular bulk_extractor tool.
--- There are UTF-16LE strings which contain completely different UTF-8 strings as prefixes. For example the byte sequence which is “nonsense” in UTF-8 is 潮獮湥敳 in UTF-16LE (!)
"nonsense"~c2x= -- '6E6F6E73656E7365' "nonsense"~text("utf16be")~c2x= -- '6E6F 6E73 656E 7365' "nonsense"~text("utf16be")~c2u= -- 'U+6E6F U+6E73 U+656E U+7365' "nonsense"~text("utf16be")~utf8= -- T'湯湳敮獥' Le potage "nonsense"~text("utf16le")~c2x= -- '6E6F 6E73 656E 7365' "nonsense"~text("utf16le")~c2u= -- 'U+6F6E U+736E U+6E65 U+6573' "nonsense"~text("utf16le")~utf8= -- T'潮獮湥敳' marée https://github.com/simsong/bulk_extractor http://t-a-w.blogspot.com/2008/12/funny-characters-in-unicode.html SKULL AND CROSSBONES SNOWMAN POSTAL MARK FACE APL FUNCTIONAL SYMBOL TILDE DIAERESIS ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM THAI CHARACTER KHOMUT GLAGOLITIC CAPITAL LETTER SPIDERY HA VERY MUCH GREATER-THAN NEITHER LESS-THAN NOR GREATER-THAN HEAVY BLACK HEART FLORAL HEART BULLET, REVERSED ROTATED INTERROBANG 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO 𠂊 (U+2008A) Han Character https://www.unicode.org/udhr/ UDHR in Unicode The goal of the UDHR in Unicode project is to demonstrate the use of Unicode for a wide variety of languages, using the Universal Declaration of Human Rights (UDHR) as a representative text. https://github.com/jagracey/Awesome-Unicode Awesome Unicode https://cldr.unicode.org/index/charts CLDR Charts By-Type Chart: Numbers:Symbols Question I am using the following code excerpt to format numbers: LocalizedNumberFormatter lnFmt = NumberFormatter.withLocale(Locale.US).unit(MeasureUnit.CELSIUS).unitWidth(NumberFormatter.UnitWidth.SHORT); System.out.println(lnFmt.format(-10).toString()); In the resulting string, minus sign is represented as 0x2d (ASCII HYPHEN-MINUS). Shouldn't it be U+2212 (Unicode MINUS SIGN)? Answer You can see the minus sign symbol being used for each locale here: https://unicode-org.github.io/cldr-staging/charts/latest/by_type/numbers.symbols.html#2f08b5ebf85e1e8b U+2212 is used in: ·fa· ·ps· ·uz_Arab· ·eo· ·et· ·eu· ·fi· ·fo· ·gsw· ·hr· ·kl· ·ksh· ·lt· ·nn· ·no· ·rm· ·se· ·sl· ·sv· Question Where this list of locales was taken from? I am particulary interested in ‘ru’: why U+2212 is not used for it? https://stackoverflow.com/questions/10143836/why-is-there-no-utf-24 Why is there no UTF-24? [duplicate] Well, the truth is : UTF-24 was suggested in 2007 : https://www.unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html Possible Duplicate: Why UTF-32 exists whereas only 21 bits are necessary to encode every character? https://stackoverflow.com/questions/6339756/why-utf-32-exists-whereas-only-21-bits-are-necessary-to-encode-every-character https://unicodebook.readthedocs.io/ Book "Programming with Unicode" 2010-2011, Victor Stinner jlf: only one occurrence of the word "grapheme". Maybe at that time, it was not obvious that it would become an important concept. https://mcilloni.ovh/2023/07/23/unicode-is-hard/ Unicode is harder than you think 23 Jul 2023 --- jlf: good overview, with some ICU samples. https://www.kermitproject.org/utf8.html UTF-8 SAMPLER Last update: Sun Mar 12 14:21:05 2023 http://www.inter-locale.com/whitepaper/learn/learn-to-test.html International Testing Basics Testing non-English and non-ASCII (and/or Unicode) support in a product requires tests and test plans that exercise the edge cases in the software. 
https://www.youtube.com/watch?v=gd5uJ7Nlvvo Plain Text - Dylan Beattie - NDC Copenhagen 2022 --- jlf: many comments say it's a good talk; did not watch yet. todo: watch
https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/ Unicode, UTF8 & Character Sets: The Ultimate Guide jlf: maybe to read
https://tonsky.me/blog/unicode/ The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)
https://news.ycombinator.com/item?id=37735801 What every software developer must know about Unicode in 2023 jlf: nothing new in this article, just reusing info from other sites. jlf: did not read all the comments.

U+ notation, Unicode escape sequence


29/05/2021 https://stackoverflow.com/questions/1273693/why-is-u-used-to-designate-a-unicode-code-point/8891355
The Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a Unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a Unicode character denoted by eight hex digits
\N{name} Character named name in the Unicode database
\uxxxx Character with 16-bit hex value xxxx. Exactly four hex digits are required.
\Uxxxxxxxx Character with 32-bit hex value xxxxxxxx. Exactly eight hex digits are required.
https://www.perl.com/article/json-unicode-and-perl-oh-my-/ JSON's \uXXXX escapes support only characters within Unicode's BMP; to store emoji or other non-BMP characters you either have to encode to UTF-8 directly, or indicate a UTF-16 surrogate pair in \uXXXX escapes.
https://corp.unicode.org/pipermail/unicode/2021-April/009410.html Need reference to good ABNF for \uXXXX syntax
https://bit.ly/UnicodeEscapeSequences Unicode Escape Sequences Across Various Languages and Platforms
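A small Python 3 sketch of the escape notations listed above:

s1 = "\u00E9"                                # 4 hex digits: U+00E9
s2 = "\U0001F600"                            # 8 hex digits, needed beyond the BMP: U+1F600
s3 = "\N{LATIN SMALL LETTER E WITH ACUTE}"   # by Unicode character name
print(s1, s2, s3)                            # é 😀 é
print(s1 == s3)                              # True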

Security title


https://www.unicode.org/reports/tr39 UNICODE SECURITY MECHANISMS https://www.unicode.org/Public/security/latest/confusables.txt https://en.wikipedia.org/wiki/Homoglyph https://www.trojansource.codes/ https://api.mtr.pub/vhf/confusable_homoglyphs https://util.unicode.org/UnicodeJsps/confusables.jsp https://www.w3.org/TR/charmod-norm/#normalizationLimitations Confusable characters: "ΡРP"~text~characters== an Array (shape [3], 3 items) 1 : ( "Ρ" U+03A1 Lu 1 "GREEK CAPITAL LETTER RHO" ) 2 : ( "Р" U+0420 Lu 1 "CYRILLIC CAPITAL LETTER ER" ) 3 : ( "P" U+0050 Lu 1 "LATIN CAPITAL LETTER P" ) These confusable characters are not impacted by the lump option: "ΡРP"~text~nfc(lump:)~characters -- same result https://www.unicode.org/reports/tr36/#visual_spoofing UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr55/ Draft Unicode® Technical Standard #55 UNICODE SOURCE CODE HANDLING --- While the normative material for computer language specifications is part of the Unicode Standard, in Unicode Standard Annex #31, Unicode Identifiers and Syntax [UAX31], the algorithms specific to the display of source code or to higher-level diagnostics are specified in this document. Note: While, for the sake of brevity, many of the examples in this document make use of non-ASCII identifiers, most of the issues described here apply even if non-ASCII characters are confined to strings and comments. --- 3.1.1 Normalization and Case Case-insensitive languages should meet requirement UAX31-R4 with normalization form KC, and requirement UAX31-R5 with full case folding. They should ignore default ignorable code points in comparison. Conformance with these requirements and ignoring of default ignorable code points may be achieved by comparing identifiers after applying the transformation toNFKC_Casefold. Note: Full case folding is preferable to simple case folding, as it better matches expectations of case-insensitive equivalence. The choice between Normalization Form C and Normalization Form KC should match expectations of identifier equivalence for the language. In a case-sensitive language, identifiers are the same if and only if they look the same, so Normalization Form C (canonical equivalence) is appropriate, as canonical equivalent sequences should display the same way. In a case-insensitive language, the equivalence relation between identifiers is based on a more abstract sense of character identity; for instance, e and E are treated as the same letter. Normalization Form KC (compatibility equivalence) is an equivalence between characters that share such an abstract identity. Example: In a case-insensitive language, SO and so are the same identifier; if that language uses Normalization Form KC, the identifiers so and 𝖘𝖔 are likewise identical. Unicode 15.1 [icu-design] ICU 74 API proposal: bidiSkeleton and LTR- and RTL-confusabilities The Source Code Working Group, a limited-duration working group under the Properties & Algorithms Group of the Unicode Technical Committee, has added a new bidi-aware concept of confusability to UTS #39 in Unicode Version 15.1; until publication see the proposed update, https://www.unicode.org/reports/tr39/tr39-27.html#Confusable_Detection. The new UTS #55, Unicode Source Code Handling, to be published simultaneously with Unicode Version 15.1, recommends the use of this new kind of confusability: https://www.unicode.org/reports/tr55/tr55-2.html#Confusable-Detection. 
https://semanticdiff.com/blog/pull-request-unicode-tricks/ Unicode tricks in pull requests: Do review tools warn us?
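A Python 3 sketch (standard unicodedata module only) of the confusable characters shown above with Executor:

import unicodedata
for ch in "ΡРP":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+03A1 GREEK CAPITAL LETTER RHO
# U+0420 CYRILLIC CAPITAL LETTER ER
# U+0050 LATIN CAPITAL LETTER P
# NFC does not merge them: they stay three distinct, confusable code points.
print(unicodedata.normalize("NFC", "ΡРP") == "ΡРP")   # True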

Segmentation, Grapheme


29/05/2021 https://github.com/alvinlindstam/grapheme https://pypi.org/project/grapheme/ Here too, he says that CR+LF is a grapheme... Same here: https://www.reddit.com/r/programming/comments/m274cg/til_rn_crlf_is_a_single_grapheme_cluster/ https://www.unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters 01/06/2021 https://halt.software/optimizing-unicodes-grapheme-cluster-break-algorithm/ They claim this improvement: For the simple data set, this was 0.38 of utf8proc time. For the complex data set, this was 0.56 of utf8proc time. 01/06/2021 https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ GraphemeCursor Cursor-based segmenter for grapheme clusters. GraphemeIndices External iterator for grapheme clusters and byte offsets. Graphemes External iterator for a string's grapheme clusters. USentenceBoundIndices External iterator for sentence boundaries and byte offsets. USentenceBounds External iterator for a string's sentence boundaries. UWordBoundIndices External iterator for word boundaries and byte offsets. UWordBounds External iterator for a string's word boundaries. UnicodeSentences An iterator over the substrings of a string which, after splitting the string on sentence boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. UnicodeWords An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. https://github.com/knighton/unicode Minimalist Unicode normalization/segmentation library. Python and C++. Abandonned, last commit 21/05/2015 https://hsivonen.fi/string-length/ First published: 2019-09-08 It’s Not Wrong that "🤦🏼‍♂️".length == 7 But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5 But I Want the Length to Be 1! jlf: "🤦🏼‍♂️"~text~length= -- 1 "🤦🏼‍♂️"~text~characters== an Array (shape [5], 5 items) 1 : ( "🤦" U+1F926 So 2 "FACE PALM" ) 2 : ( "🏼" U+1F3FC Sk 2 "EMOJI MODIFIER FITZPATRICK TYPE-3" ) 3 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 4 : ( "♂" U+2642 So 1 "MALE SIGN" ) 5 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) 07/06/2021 https://news.ycombinator.com/item?id=20914184 String lengths in Unicode Claude Roux We went through a lot of pain to get this right in Tamgu ( https://github.com/naver/tamgu ). In particular, emojis can be encoded across 5 or 6 Unicode characters. A "black thumb up" is encoded with 2 Unicode characters: the thumb glyph and its color. This comes at a cost. Every time you extract a sub-string from a string, you have to scan it first for its codepoints, then convert character positions into byte positions. One way to speed up stuff a bit, is to check if the string is in ASCII (see https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u ) and apply regular operator then. We implemented many techniques based on "intrinsics" instructions to speed up conversions and search in order to avoid scanning for codepoints. See https://github.com/naver/tamgu/blob/master/src/conversion.cxx for more information. https://github.com/naver/tamgu/wiki/4.-Speed-up-UTF8-string-processing-with-Intel's-%22intrinsics%22-instructions-(en) jlf: they have specific support for Korean... Probably because the NAVER company is from Republic of Korea ? 
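A sketch of the string-length comparison above, using the grapheme package linked at the start of this section (grapheme.length as documented in its README) plus the standard encodings:

import grapheme                         # pip install grapheme
s = "🤦🏼‍♂️"                               # FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VS16
print(len(s))                           # 5  code points
print(len(s.encode("utf-8")))           # 17 UTF-8 code units
print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units
print(grapheme.length(s))               # 1  extended grapheme cluster (with current segmentation rules)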
08/06/2021 https://twitter.com/hashtag/tamgu?src=hashtag_click https://twitter.com/hashtag/TAL?src=hashtag_click #tamgu le #langage_de_programmation pour le Traitement Automatique des Langues (#TAL). jlf 30/09/2021 I have a doubt about that: Is 👩‍👨‍👩‍👧' really a grapheme? When moving the cursor in BBEdit, I see a boundary between each character. [later] Ok, when moving the cursor in Visual Studio Code, it's really a unique grapheme, no way to put the cursor "inside". And the display is aligned with what I see in Google Chrome : one WOMAN followed by a family, and no way to put the cursor between the WOMAN and the family. --- https://www.unicode.org/review/pr-27.html (old, talk about Unicode 4) https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries (todo: review occurences of ZWJ) 29/10/2021 https://h3manth.com/posts/unicode-segmentation-in-javascript/ https://github.com/tc39/proposal-intl-segmenter https://news.ycombinator.com/item?id=21690326 Tailored grapheme clusters Grapheme clusters are locale-dependent, much like string collation is locale-dependent. What Unicode gives you by default, the (extended) grapheme cluster, is as useful as the DUCET (Default Unicode Collation Element Table); while you can live with them, you would be unsatisfied. In fact there are tons of Unicode bugs that can't be corrected due to the compatibility reason, and can only be fixed via tailored locale-dependent schemes. --- Hangul normalization and collation is broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explictly devotes two sections related to Hangul; the first section, for "trailing weights" [1], is recommended for the detailed explanation. The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require the tailoring to grapheme clusters. Depending on the view, you can also consider orthographic digraphs as examples (Dutch "ij" is sometimes considered a single character for example). https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme What's the difference between a character, a code point, a glyph and a grapheme? jlf: not very good... https://github.com/clipperhouse/words words is a command which splits strings into individual words, as defined by Unicode. It accepts text from stdin, and writes one word (token) per line to stdout. https://www.unicode.org/reports/tr29/#Random_Access jlf: Executor uses indexers for random access (ako breadcrumbs). Random access introduces a further complication. When iterating through a string from beginning to end, a regular expression or state machine works well. From each boundary to find the next boundary is very fast. By constructing a state table for the reverse direction from the same specification of the rules, reverse iteration is possible. However, suppose that the user wants to iterate starting at a random point in the text, or detect whether a random point in the text is a boundary. If the starting point does not provide enough context to allow the correct set of rules to be applied, then one could fail to find a valid boundary point. For example, suppose a user clicked after the first space after the question mark in “Are␣you␣there?␣ ␣No,␣I’m␣not”. On a forward iteration searching for a sentence boundary, one would fail to find the boundary before the “N”, because the “?” had not been seen yet. A second set of rules to determine a “safe” starting point provides a solution. 
Iterate backward with this second set of rules until a safe starting point is located, then iterate forward from there. Iterate forward to find boundaries that were located between the safe point and the starting point; discard these. The desired boundary is the first one that is not less than the starting point. The safe rules must be designed so that they function correctly no matter what the starting point is, so they have to be conservative in terms of finding boundaries, and only find those boundaries that can be determined by a small context (a few neighboring characters). This process would represent a significant performance cost if it had to be performed on every search. However, this functionality can be wrapped up in an iterator object, which preserves the information regarding whether it currently is at a valid boundary point. Only if it is reset to an arbitrary location in the text is this extra backup processing performed. The iterator may even cache local values that it has already traversed. Unicode 15.1 New rule GB9c for grapheme segmentation. https://www.unicode.org/reports/tr29/ --- No longer available: https://www.unicode.org/reports/tr29/proposed.html --- jlf: saw this review note "the new rule GB9c has been implemented in CLDR and ICU as a profile for some years" What is a profile? --- This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments and, for the purpose of claiming conformance, document the tailoring in the form of a profile. ... Note that a profile can both add and remove boundary positions, compared to the results specified by UAX29-C1-1, UAX29-C2-1, or UAX29-C3-1. https://github.com/unicode-org/lstm_word_segmentation Python code for training an LSTM model for word segmentation in Thai, Burmese, and similar languages.

Normalization, equivalence


https://www.unicode.org/faq/normalization.html Normalization FAQ https://www.macchiato.com/unicode-intl-sw/nfc-faq NFC FAQ jlf: MUST READ! https://www.unicode.org/reports/tr15 UNICODE NORMALIZATION FORMS 26/11/2013 Text normalization in Go https://blog.golang.org/normalization 27/11/2013 The string type is broken https://mortoray.com/2013/11/27/the-string-type-is-broken/ https://news.ycombinator.com/item?id=6807524 https://www.reddit.com/r/programming/comments/1rkdip/the_string_type_is_broken/ In the comments Objective-C’s NSString type does correctly upper-case baffle into BAFFLE. (where the rectangle is a grapheme showing 2 small 'f') Q: What about getting the first three characters of “baffle”? Is “baf” the correct answer? A: That’s a good question. I suspect “baf” is the correct answer, and I wonder if there is any library that does it. I suspect if you normalize it first (since the ffl would disappear I think). A: The ligarture disappears in NFK[CD] but not in NF[CD]. Whether normalization to NFK[CD] is a good idea depends (as always) on the situation. For visual grapheme cluster counting, one would convert the entire text to NFKC. For getting teaser text from an article i would not a normalization step and let a ligature count as just one grapheme cluster even if it may resemble three of them logically. I assume, that articles are stored in NFC (the nondestructive normalization form with smallest memory footprint). The Unicode standard does not treat ligatures as containing more than one grapheme cluster for that normalization forms that permits them. So “efflab” (jlf: efflab) is the correct result of reversing “baffle” (jlf: baffle) and “baffle”[2] has to return “ffl” even when working on the grapheme cluster level! There may or may not be a need for another grapheme cluster definition that permits splitting of ligatures in NF[CD]. A straight forward way to implement a reverse function adhering to that special definition would NFKC each Unicode grapheme cluster on the fly. When that results in multiple Unicode grapheme clusters, that are used – else the original is preserved (so that “ℕ” does not become “N”). The real problem is to find a good name for that special interpretation of a grapheme cluster… Note : see also the comment of Tom Christiansen about casing. I don't copy-paste here, too long. https://github.com/blackwinter/unicode Unicode normalization library. (Mirror of Yoshida-san's code base to maintain the RubyGem.) Abandonned, last commit 07/07/2016 https://github.com/sjorek/unicode-normalization An enhanced facade to existing unicode-normalization implementations Last commit 25/03/2018 https://docs.microsoft.com/en-us/windows/win32/intl/using-unicode-normalization-to-represent-strings Using Unicode Normalization to Represent Strings https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize String.prototype.normalize() The normalize() method returns the Unicode Normalization Form of the string. https://forums.swift.org/t/string-case-folding-and-normalization-apis/14663/3 For the comments https://en.wikipedia.org/wiki/Unicode_equivalence Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters. 
On Wed, Oct 28, 2020 at 9:54 AM Mark Davis ☕️ <mark@macchiato.com> wrote: Re: [icu-support] Options for Immutable Collation? I think your search for 'middle ground' is fruitless. An NFKD ordering is not correct for any human language, and changes with each new Unicode version. And even the default Unicode collation ordering is wrong for many languages, because there is no order that simultaneously satisfies all (eg German ordering and Swedish ordering are incompatible). Your 'middle ground' would be correct for nobody, and yet be unstable across Unicode versions; or worse yet, fail for new characters. IMO, the best practice for a file system (or like systems) is to store in codepoint order. When called upon to present a sorted list of files to a user, the displaying program should sort that list according to the user's language preferences.
You are right: for a deterministic/reproducible list sorting for a cross-platform filesystem API, anything more complex would be an implementation hazard. However, after reviewing both developer discussions and implementation of Unicode handling in 6+ filesystems, IDNA200X, PRECIS and getting roped into work on an IETF i18n filesystem best-practices RFC ... I've got some thoughts. Thoughts that I will put into a new thread after I do some experimenting : ). Thank you all so much!!! -Zach Lym
08/06/2021 https://fr.wikipedia.org/wiki/Normalisation_Unicode
NFD: characters are decomposed by canonical equivalence and reordered (canonical decomposition)
NFC: characters are decomposed by canonical equivalence, reordered, then composed by canonical equivalence (canonical decomposition followed by canonical composition)
NFKD: characters are decomposed by canonical and compatibility equivalence, and reordered (compatibility decomposition)
NFKC: characters are decomposed by canonical and compatibility equivalence, reordered, then composed by canonical equivalence (compatibility decomposition followed by canonical composition)
FCD: "Fast C or D" form; cf. UTN #5
FCC: "Fast C Contiguous"; cf. UTN #5
09/06/2021 Rust https://docs.rs/unicode-normalization
Decompositions: External iterator for a string decomposition's characters.
Recompositions: External iterator for a string recomposition's characters.
Replacements: External iterator for replacements for a string's characters.
StreamSafe: UAX15-D4: This iterator keeps track of how many non-starters there have been since the last starter in NFKD and will emit a Combining Grapheme Joiner (U+034F) if the count exceeds 30.
is_nfc: Authoritatively check if a string is in NFC.
is_nfc_quick: Quickly check if a string is in NFC, potentially returning IsNormalized::Maybe if further checks are necessary. In this case a check like s.chars().nfc().eq(s.chars()) should suffice.
is_nfc_stream_safe: Authoritatively check if a string is Stream-Safe NFC.
is_nfc_stream_safe_quick: Quickly check if a string is Stream-Safe NFC.
is_nfd: Authoritatively check if a string is in NFD.
is_nfd_quick: Quickly check if a string is in NFD.
is_nfd_stream_safe: Authoritatively check if a string is Stream-Safe NFD.
is_nfd_stream_safe_quick: Quickly check if a string is Stream-Safe NFD.
is_nfkc: Authoritatively check if a string is in NFKC.
is_nfkc_quick: Quickly check if a string is in NFKC.
is_nfkd: Authoritatively check if a string is in NFKD.
is_nfkd_quick: Quickly check if a string is in NFKD.
Enums IsNormalized The QuickCheck algorithm can quickly determine if a text is or isn’t normalized without any allocations in many cases, but it has to be able to return Maybe when a full decomposition and recomposition is necessary. 08/06/2021 Pharo https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43 https://github.com/duerst/eprun Efficient Pure Ruby Unicode Normalization (eprun) According to julia/utf8proc, the interesting part is the tests. https://corp.unicode.org/pipermail/unicode/2020-December/009150.html Normalization Generics (NFx, NFKx, NFxy) https://6guts.wordpress.com/2015/04/12/this-week-unicode-normalization-many-rts/ https://gregtatum.com/writing/2021/diacritical-marks/ DIACRITICAL MARKS IN UNICODE https://news.ycombinator.com/item?id=29751641 Unicode Normalization Forms: When ö ≠ ö https://blog.opencore.ch/posts/unicode-normalization-forms/ https://unicode-org.github.io/icu/userguide/transforms/normalization/ ICU Documentation Normalization Has a few comments about NFKC_Casefold - NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and removing ignorable characters which was introduced with Unicode 5.2. - Data Generation Tool https://stackoverflow.com/questions/56995429/will-normalizing-a-string-give-the-same-result-as-normalizing-the-individual-gra Will normalizing a string give the same result as normalizing the individual grapheme clusters? --- No, that generally is not true. The Unicode Standard warns against the assumption that concatenating normalised strings produces another normalised string. From UAX #15: In using normalization functions, it is important to realize that none of the Normalization Forms are closed under string concatenation. That is, even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized. https://stackoverflow.com/questions/7171377/separating-unicode-ligature-characters NFKD is no panacea: there are plenty of ligatures and other notionally combined forms it just does not work on at all. For example, it will not manage to decompose ß or ẞ to SS (even those there is a casefold thither!), nor Æ to AE or æ to ae, nor Œ to OE or œ to oe. It is also useless for turning ð or đ into d or ø into o. For all those things, you need the UCA (Unicode Collation Algorithm), not NFKD. NFD/NFKD also both have the annoying property of destroying singletons, if this matters to you. --- my understanding is that those decompositions you mention should not be done. They are not simply ligatures in the typographical sense, but real separate characters that are used differently! ß can be decomposed to ss if necessary (for example if you can only store ASCII), but they are not equivalent. The ff Ligature, on the other hand is only a typographical ligature.
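A minimal Python 3 sketch (unicodedata; is_normalized needs Python 3.8+) of the normalization behaviour discussed in this section, including the UAX #15 warning quoted above that NFC is not closed under concatenation:

import unicodedata as ud
s = "ba\uFB04e"                       # "baffle" written with the U+FB04 ffl ligature
print(ud.normalize("NFC", s) == s)    # True  - canonical forms keep the ligature
print(ud.normalize("NFKC", s))        # 'baffle' - compatibility forms expand it to f+f+l
a, b = "e", "\u0301"                  # each string is in NFC on its own
print(ud.is_normalized("NFC", a), ud.is_normalized("NFC", b))   # True True
print(ud.is_normalized("NFC", a + b))                           # False - 'e' + combining acute composes to U+00E9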

Character set


https://www.gnu.org/software/libc/manual/html_mono/libc.html#Character-Set-Handling

String matching - Lower vs Casefold


https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase https://www.w3.org/TR/charmod-norm/ Character Model for the World Wide Web: String Matching MUST READ, PLENTY OF EXAMPLES FOR CORNER CASES https://www.w3.org/TR/charmod-norm/#definitionCaseFolding Very good explanation! A few characters have a case folding that map one Unicode code point to two or more code points. This set of case foldings are called the full case foldings. character ß U+00DF LATIN SMALL LETTER SHARP S - The full case folding and the lower case mapping of this character is to two ASCII letters 's'. - The upper case mapping is to "SS". Because some applications cannot allocate additional storage when performing a case fold operation, Unicode provides a simple case folding that maps a code point that would normally fold to more or fewer code points to use a single code point for comparison purposes instead. Unlike the full folding, this folding invariably alters the content (and potentially the meaning) of the text. Unicode simple is not appropriate for use on the Web. character ᾛ [U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI] ᾛ ⇒ ἣι full case fold: U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA + U+03B9 GREEK SMALL LETTER IOTA ᾛ ⇒ ᾓ simple case fold: U+1F93 GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI Language Sensitivity Another aspect of case mapping and case folding is that it can be language sensitive. Unicode defines default case mappings and case foldings for each encoded character, but these are only defaults and are not appropriate in all cases. Some languages need case mapping to be tailored to meet specific linguistic needs. One example of this are Turkic languages written in the Latin script: Default Folding I ⇒ i Default folding of letter I Turkic Language Folding I ⇒ ı Turkic language folding of dotless (ASCII) letter I İ ⇒ i Turkic language folding of dotted letter I https://www.w3.org/TR/charmod-norm/#matchingAlgorithm There are four choices for text normalization: - Default. This normalization step has no effect on the text and, as a result, is sensitive to form differences involving both case and Unicode normalization. - ASCII Case Fold. Comparison of text with the characters case folded in the ASCII (Basic Latin, U+0000 to U+007F) range. - Unicode Canonical Case Fold. Comparison of text that is both case folded and has Unicode canonical normalization applied. - Unicode Compatibility Case Fold. Comparison of text that is both case folded and has Unicode compatibility normalization applied. This normalization step is presented for completeness, but it is not generally appropriate for use on the Web. https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html Elasticsearch Dealing with Human Language https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison Related to Python, but the comments are very general and worth reading. --- Unicode Standard section 3.13 has two other definitions for caseless comparisons: (D146, canonical) NFD(toCasefold(NFD(str))) on both sides and (D147, compatibility) NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) on both sides. It states the inner NFD is solely to handle a certain Greek accent character. 
https://boyter.org/posts/unicode-support-what-does-that-actually-mean/ https://news.ycombinator.com/item?id=23524400 ſecret == secret == Secret ſatisfaction == satisfaction == ſatiſfaction == Satiſfaction == SatiSfaction === ſatiSfaction Another good example to consider is the character Æ. Under simple case folding rules the lower of Æ is ǣ. However with full case folding rules this also matches ae. Which one is correct? Well that depends on who you ask. See also https://github.com/unicode-org/icu4x/issues/3151 in the section "ICU4X title". https://lwn.net/Articles/784316/ Working with UTF-8 in the kernel jlf: interesting read about NTFS caseless, and about a drama because of lack of support for the turkish case.
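A Python 3 sketch of the lower vs casefold distinction described above; note that the str methods use the default, locale-independent mappings, so the Turkish tailoring mentioned above needs a locale-aware library such as ICU:

print("ß".lower())       # 'ß'  - lowercase mapping leaves sharp s unchanged
print("ß".casefold())    # 'ss' - full case folding maps it to 'ss'
print("ſecret".casefold() == "secret".casefold())   # True - LATIN SMALL LETTER LONG S folds to 's'
print("İ".lower())       # 'i̇'  - i + U+0307 COMBINING DOT ABOVE (2 code points), default mapping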

String matching - Collation


https://unicode-org.github.io/icu/userguide/collation/string-search.html (ICU) String Search Service jlf: they give 3 issues applicable to text searching. Accented letters and conjoined letters are covered by Executor. But ignorable punctuation is not.

Locale


02/06/2021 https://www.php.net/manual/fr/function.setlocale.php Warning The locale information is maintained per process, not per thread. If you are running PHP on a multithreaded server API , you may experience sudden changes in locale settings while a script is running, though the script itself never called setlocale(). This happens due to other scripts running in different threads of the same process at the same time, changing the process-wide locale using setlocale(). On Windows, locale information is maintained per thread as of PHP 7.0.5. On Windows, setlocale(LC_ALL, '') sets the locale names from the system's regional/language settings (accessible via Control Panel). https://www.gnu.org/software/libc/manual/html_mono/libc.html#Locales Locales and Internationalization https://pubs.opengroup.org/onlinepubs/9699919799/ IEEE Std 1003.1-2017 Locale https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do/87763#87763 What does "LC_ALL=C" do? https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe stream_libarchive: workaround various types of locale braindeath (legendary C locales rant) https://stackoverflow.com/questions/30479607/explain-the-effects-of-export-lang-lc-ctype-and-lc-all The LANG, LC_CTYPE and LC_ALL are special environment variables which after they got exported to the shell environment, are available and ready to be rea by certain programs which supports a locale (natural language formatting for C). Each variable sets the C library's notion of natural language formatting style for particular sets of routines, for example: - LC_ALL - Set the entire locale generically - LC_CTYPE - Set a locale for the ctype and multibyte functions. This controls recognition of upper and lower case, alphabetic or non- alphabetic characters, and so on. and other such as LC_COLLATE (for string collation routines), LC_MESSAGES (for message catalogs), LC_MONETARY (for formatting monetary values), LC_NUMERIC (for formatting numbers), LC_TIME (for formatting dates and times). Regarding LANG, it is used as a substitute for any unset LC_* variable. See: man setlocale (BSD), man locale So when certain C functions are called (such as setlocale, ctype, multibyte, catopen, printf, etc.), they read the locale settings from the configuration files and local environment in order to control and format natural language formatting style as per C programming language standards. 
see: setlocale http://www.unix.com/man-page/freebsd/3/setlocale/ see: ctype http://www.unix.com/man-page/freebsd/3/ctype/ see: multibyte http://www.unix.com/man-page/freebsd/3/multibyte/ see: catopen http://www.unix.com/man-page/freebsd/3/catopen/ see:printf http://www.unix.com/man-page/freebsd/3/printf/ see: ISO C99 https://en.wikipedia.org/wiki/C99 see: C Library - <locale.h> https://www.tutorialspoint.com/c_standard_library/locale_h.htm AIX documentation https://www.ibm.com/docs/en/aix/7.1?topic=globalization-locales - Understanding locale - Understanding locale categories - Understanding locale environment variables - Understanding the locale definition source file - Multibyte subroutines - Wide character subroutines - Bidirectionality and character shaping - Code set independence - File name matching - Radix character handling - Programming model https://bugzilla.mozilla.org/show_bug.cgi?id=1612379 Narrow down the list of ICU locales we ship https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/data_management.md Data management in ICU4X https://pubs.opengroup.org/onlinepubs/9699919799/ localedef - define locale environment If the locale value begins with a slash, it shall be interpreted as the pathname of a file that was created in the output format used by the localedef utility; see OUTPUT FILES under localedef. Referencing such a pathname shall result in that locale being used for the indicated category.
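A Python 3 sketch of the process-wide locale machinery described above (the printed values depend on the system's LANG / LC_* settings):

import locale
locale.setlocale(locale.LC_ALL, "")          # adopt the environment's locale; affects the whole process, not one thread
print(locale.getlocale(locale.LC_COLLATE))   # e.g. ('en_US', 'UTF-8')
words = ["cote", "côte", "coté", "côté"]
print(sorted(words, key=locale.strxfrm))     # locale-aware order; plain sorted(words) uses code point order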

CLDR Common Locale Data Repository


19/06/2021 https://github.com/twitter/twitter-cldr-rb Ruby implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more. https://github.com/twitter/twitter-cldr-js JavaScript implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more. Based on twitter-cldr-rb. https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?filter=allissues CLDR tickets

Case mappings


Rule Final_Sigma in default case algorithms. https://github.com/php/php-src/pull/10268 jlf: difficult to implement, involves to scan arbitrarily far to the left and right of capital sigma. https://www.unicode.org/faq/casemap_charprop.html https://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java?noredirect=1&lq=1 Unicode-correct title case in Java https://docs.rs/unicode-case-mapping/latest/unicode_case_mapping/ Example assert_eq!(unicode_case_mapping::to_lowercase('İ'), ['i' as u32, 0x0307]); assert_eq!(unicode_case_mapping::to_lowercase('ß'), ['ß' as u32, 0]); assert_eq!(unicode_case_mapping::to_uppercase('ß'), ['S' as u32, 'S' as u32, 0]); assert_eq!(unicode_case_mapping::to_titlecase('ß'), ['S' as u32, 's' as u32, 0]); assert_eq!(unicode_case_mapping::to_titlecase('-'), [0; 3]); assert_eq!(unicode_case_mapping::case_folded('I'), NonZeroU32::new('i' as u32)); assert_eq!(unicode_case_mapping::case_folded('ß'), None); assert_eq!(unicode_case_mapping::case_folded('ẞ'), NonZeroU32::new('ß' as u32)); https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/titlecase.html fun Char.titlecase(): String val chars = listOf('a', 'Dž', 'ʼn', '+', 'ß') val titlecaseChar = chars.map { it.titlecaseChar() } val titlecase = chars.map { it.titlecase() } println(titlecaseChar) // [A, Dž, ʼn, +, ß] println(titlecase) // [A, Dž, ʼN, +, Ss] fun Char.titlecase(locale: Locale): String val chars = listOf('a', 'Dž', 'ʼn', '+', 'ß', 'i') val titlecase = chars.map { it.titlecase() } val turkishLocale = Locale.forLanguageTag("tr") val titlecaseTurkish = chars.map { it.titlecase(turkishLocale) } println(titlecase) // [A, Dž, ʼN, +, Ss, I] println(titlecaseTurkish) // [A, Dž, ʼN, +, Ss, İ] https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/177 jlf: good summary, was not so obvious before I understand there are simple and full case mappings... Also note that the Unicode standard only provides defaults for, but then goes on to say that locale/language specific mappings should really be used. The Unicode standard is very explicit that things like uppercase transformations should be able to handle language specific issues such as the Turkish dotted and dotless i, and that “ß” should be uppercased to “SS” in German. See: Q: Is all of the Unicode case mapping information in UnicodeData.txt? A: No. The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide the one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard. and A: The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text. https://www.b-list.org/weblog/2018/nov/26/case/ Truths programmers should know about case
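A Python 3 sketch of the one-to-many and titlecase mappings discussed above:

print("ß".upper())       # 'SS'  - full uppercase mapping, from SpecialCasing.txt
print("ß".casefold())    # 'ss'
print("ﬁn".upper())      # 'FIN' - the U+FB01 fi ligature uppercases to two letters
print("ǆ".title())       # 'ǅ'   - titlecase is a third case, distinct from upper and lower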

Collation, sorting


https://www.unicode.org/reports/tr35/tr35-collation.html UNICODE LOCALE DATA MARKUP LANGUAGE (LDML) PART 5: COLLATION 01/06/2021 https://github.com/jgm/unicode-collation https://hackage.haskell.org/package/unicode-collation Haskell implementation of the Unicode Collation Algorithm https://icu4c-demos.unicode.org/icu-bin/collation.html ICU Collation Demo https://www.enterprisedb.com/docs/epas/latest/epas_guide/03_database_administration/06_unicode_collation_algorithm/ Unicode Collation Algorithm https://www.minaret.info/test/collate.msp This page provides a means to convert a string of Unicode characters into a binary collation key using the Java language version ("icu4j") of the IBM International Components for Unicode (ICU) library. A collation key is the basis for sorting and comparing strings in a language-sensitive Unicode environment. A collation key is built using a "locale" (a designation for a particular laguage or a variant) and a comparison level. The levels supported here (Primary, Secondary, Tertiary, Quaternary and Identical) correspond to levels "L1" through "Ln" as described in Unicode Technical Standard #10 - Unicode Collation Algorithm. When comparing collation keys for two different strings, both keys must have been created using the same locale and comparison level in order to be meaningful. The two keys are compared from left to right, byte for byte until one of the bytes is not equal to the other. Whichever byte is numerically less than the other causes the source string for that collation key to sort before the other string. https://lemire.me/blog/2018/12/17/sorting-strings-properly-is-stupidly-hard/ It's the comments section which is interesting. https://discourse.julialang.org/t/sorting-strings-by-unicode-collation-order/11195 Not supported 03/08/2022 https://discourse.julialang.org/t/unicode-15-0-beta-and-sorting-collation/83090 https://www.unicode.org/emoji/charts-15.0/emoji-ordering.html https://en.wikipedia.org/wiki/Natural_sort_order Natural sort order is an ordering of strings in alphabetical order, except that multi-digit numbers are ordered as a single character. Natural sort order has been promoted as being more human-friendly ("natural") than the machine-oriented pure alphabetical order. For example, in alphabetical sorting "z11" would be sorted before "z2" because "1" is sorted as smaller than "2", while in natural sorting "z2" is sorted before "z11" because "2" is sorted as smaller than "11". Alphabetical sorting: z11 z2 Natural sorting: z2 z11 Functionality to sort by natural sort order is built into many programming languages and libraries. 02/06/2021 https://www.postgresql.org/message-id/flat/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E%40tripadvisor.com The dangers of streaming across versions of glibc: A cautionary tale SELECT 'M' > 'ஐ'; 'FULLWIDTH LATIN CAPITAL LETTER M' (U+FF2D) 'TAMIL LETTER AI' (U+0B90) Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. master:src/backend/utils/adt/varlena.c:1494,1497 These are the lines where postgres calls strcoll_l and strcoll, in order to sort strings in a locale aware manner. The reality is that there are different versions of glibc out there in the wild, and they do not sort consistently across versions/environments. 
https://collations.info/concepts/ a site devoted to working with Collations, Unicode, Encodings, Code Pages, etc in Microsoft SQL Server.
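The natural sort order described above, as a small illustrative key function (a hypothetical helper, not taken from any of the linked libraries):

import re
def natural_key(s: str):
    # split into digit and non-digit runs; compare the digit runs numerically
    return [int(part) if part.isdigit() else part.casefold()
            for part in re.split(r"(\d+)", s)]

names = ["z11", "z2"]
print(sorted(names))                    # ['z11', 'z2'] - plain code point order
print(sorted(names, key=natural_key))   # ['z2', 'z11'] - natural order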

BIDI title


https://www.iamcal.com/understanding-bidirectional-text/ Understanding Bidirectional (BIDI) Text in Unicode
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics Unicode Bidirectional Algorithm basics (W3C) jlf: the examples are GIF images :-(( no way to copy-paste the characters.
https://www.unicode.org/notes/tn39/ BIDI BRACKETS FOR DUMMIES
https://stackoverflow.com/questions/5801820/how-to-solve-bidi-bracket-issues How to solve BiDi bracket issues?
https://gist.github.com/mvidner/e96ac917d9a54e09d9730220a34b0d24 Problems with Bidirectional (BiDi) Text
https://www.w3.org/International/questions/qa-bidi-unicode-controls How to use Unicode controls for bidi text
https://github.com/mvidner/bidi-test Testing bidirectional text
https://terminal-wg.pages.freedesktop.org/bidi/ BiDi in Terminal Emulators
http://fribidi.org/ GNU FriBidi is an implementation of the Unicode Bidirectional Algorithm (bidi). jlf: dead... The latest release is fribidi-0.19.7.tar.bz2 from August 4, 2015. This release is based on Unicode 6.2.0 character database. --- jlf: maybe not dead, but low activity... v1.0.13 https://github.com/fribidi/fribidi
https://news.ycombinator.com/item?id=37990523 Ask HN: Bidirectional Text Navigation
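A Python 3 sketch showing the Bidi_Class property (unicodedata.bidirectional) that UAX #9, linked above, takes as input:

import unicodedata
for ch in ["a", "א", "ع", "1", "(", " "]:
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>3} {unicodedata.name(ch)}")
# L = left-to-right, R / AL = right-to-left, EN = European number, ON = other neutral, WS = whitespace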

Emoji


https://www.unicode.org/Public/emoji/15.0/emoji-test.txt https://emojipedia.org/ http://xahlee.info/comp/unicode_emoji.html 29/05/2021 https://tonsky.me/blog/emoji/ 27/02/2023 https://news.ycombinator.com/item?id=34925446 Discussion about emoji and graphemes (again...). Nothing very interesting in this discussion. Remember: The "length" of a string in extended grapheme clusters is not stable across Unicode versions, which seems like a recipe for confusion. The length in code units is unambiguous and constant across versions. --- Executor: NinjaCat = "🐱‍👤" NinjaCat~description= 'UTF-8 not-ASCII (11 bytes)' NinjaCat~text~characters== an Array (shape [3], 3 items) 1 : ( "🐱" U+1F431 So 2 "CAT FACE" ) 2 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 3 : ( "👤" U+1F464 So 2 "BUST IN SILHOUETTE" )
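A Python 3 equivalent (sketch) of the Executor decomposition above: the "ninja cat" is a ZWJ sequence of two emoji, not a single code point.

import unicodedata
NinjaCat = "\U0001F431\u200D\U0001F464"
for ch in NinjaCat:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1F431 CAT FACE
# U+200D  ZERO WIDTH JOINER
# U+1F464 BUST IN SILHOUETTE
print(len(NinjaCat))                   # 3 code points
print(len(NinjaCat.encode("utf-8")))   # 11 bytes, matching the ~description output above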

Countries, flags


22/05/2021 https://en.wikipedia.org/wiki/Regional_indicator_symbol Regional indicator symbol https://en.wikipedia.org/wiki/ISO_3166-1 ISO 3166-1 (Codes for the representation of names of countries and their subdivisions) https://observablehq.com/@jobleonard/which-unicode-flags-are-reversible
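A tiny Python sketch of how flag emoji are built from ISO 3166-1 alpha-2 codes with regional indicator symbols (pure arithmetic on code points).

def flag(alpha2: str) -> str:
    # map 'A'..'Z' to REGIONAL INDICATOR SYMBOL LETTER A..Z (U+1F1E6..U+1F1FF)
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

print(flag("FR"), flag("JP"), flag("UA"))   # whether a flag glyph is actually shown is up to the font/platform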

Evidence of partial or wrong support of Unicode


13/08/2013 We don’t need a string type https://mortoray.com/2013/08/13/we-dont-need-a-string-type/ 01/12/2013 Strings in Ruby are UTF-8 now… right? http://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/ 14/07/2017 Testing Ruby's Unicode Support http://blog.honeybadger.io/ruby-s-unicode-support/ 22/05/2021 Emoji.length == 2 https://news.ycombinator.com/item?id=13830177 Lot of comments, did not read all, to continue 22/05/2021 https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ Let's Stop Ascribing Meaning to Code Points 18/07/2021 https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/ Breaking Our Latin-1 Assumptions

Optimization, SIMD


08/06/2021 https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ [obsolete] https://github.com/lemire/fastvalidate-utf-8 header-only library to validate utf-8 strings at high speeds (using SIMD instructions) jlf 2023/06/16 (now obsolete) NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: please adopt the simdutf library. It is much more powerful, faster and better tested. https://github.com/simdutf/simdutf simdutf: Unicode at gigabytes per second 08/06/2021 https://github.com/simdjson/simdjson simdjson : Parsing gigabytes of JSON per second The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++. Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s https://arxiv.org/abs/2010.03090 Validating UTF-8 In Less Than One Instruction Per Byte John Keiser, Daniel Lemire The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software. https://r-libre.teluq.ca/2178/ Recherche et analyse de solutions performantes pour le traitement de fichiers JSON dans un langage de haut niveau [r-libre/2178] Referenced from https://lemire.me/blog/ Daniel Lemire's blog – Daniel Lemire is a computer science professor at the University of Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist. https://github.com/simdutf/simdutf https://news.ycombinator.com/item?id=32700315 Unicode routines (UTF8, UTF16, UTF32): billions of characters per second using SSE2, AVX2, NEON, AVX-512. https://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/ (jlf: also referenced in the section "String comparison") How the JVM compares your strings using the craziest x86 instruction you've never heard of --- Comment from a Swift thread: https://forums.swift.org/t/string-s-abi-and-utf-8/17676/25 PCMPxSTRx is no longer faster than equivalent "simple" vector instruction sequences for straightforward comparisons (this had already been the case for a few years when that article was written, which is curious). It can be used productively (with some care) for some other operations like substring matching, but that's not as much of a heavy-hitter. There's a bunch of string stuff that will benefit from general vectorization, and which is absolutely on our roadmap to tackle, but using the PCMPxSTRx instructions specifically isn't a source of wins on the most important operations. https://news.ycombinator.com/item?id=34267936 Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake (lemire.me) https://www.reddit.com/r/java/comments/qafjtg/faster_charset_encoding/ Java 17 uses avx in both encoding and decoding https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)

Variation sequence


https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt https://www.unicode.org/Public/15.1.0/ucd/emoji/emoji-variation-sequences.txt # emoji-variation-sequences.txt 22/05/2021 List of all code points that can display differently via a variation sequence http://randomguy32.de/unicode/charts/standardized-variants/#emoji Safari is better to display the characters. Google Chrome and Opera have the same limitations: some characters are not supported (ex: section Phags-Pa). https://sethmlarson.dev/unicode-variation-selectors Mahjong tiles and Unicode variation selectors
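A quick Python illustration of the standardized variation sequences listed in emoji-variation-sequences.txt above: VS15 (U+FE0E) requests text presentation, VS16 (U+FE0F) requests emoji presentation; the actual rendering still depends on the font/platform.

UMBRELLA = "\u2602"              # U+2602 UMBRELLA has both a text and an emoji presentation
print(UMBRELLA + "\ufe0e")       # text presentation requested (VS15)
print(UMBRELLA + "\ufe0f")       # emoji presentation requested (VS16)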

Whitespaces, separators


22/05/2021 https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ A section about wcwidth. A section about spaces: There are actually two definitions of whitespace in Unicode. Unicode assigns every codepoint a category, and has three categories for what sounds like whitespace: “Separator, space”; “Separator, line”; “Separator, paragraph”. CR, LF, tab, and even vertical tab are all categorized as “Other, control” and not as separators. The only character in the “Separator, line” category is U+2028 LINE SEPARATOR, and the only character in “Separator, paragraph” is U+2029 PARAGRAPH SEPARATOR. Thankfully, all of these have the WSpace property. As an added wrinkle, the lone oddball character “⠀” renders like a space in most fonts. jlf: 2 cols x 3 lines of debossed dots. But it’s not whitespace, it’s not categorized as a separator, and it doesn’t have WSpace. It’s actually U+2800 BRAILLE PATTERN BLANK, the Braille character with none of the dots raised. (I say “most fonts” because I’ve occasionally seen it rendered as a 2×4 grid of open circles.)
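The category and whitespace properties mentioned above, checked with Python's standard unicodedata module.

import unicodedata

for ch, name in [("\u2028", "LINE SEPARATOR"),
                 ("\u2029", "PARAGRAPH SEPARATOR"),
                 ("\t", "CHARACTER TABULATION"),
                 ("\u2800", "BRAILLE PATTERN BLANK")]:
    print(f"U+{ord(ch):04X} {name}: category={unicodedata.category(ch)}, isspace={ch.isspace()}")
# U+2028 Zl True, U+2029 Zp True, tab Cc True (a control, but with the whitespace property),
# U+2800 So False (looks blank in many fonts, but is not whitespace)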

Hyphenation


break words into syllables I need to break words into syllables: astronomical --> as - tro - nom - ic - al Is it possible to do this (in different languages) using the ICU library? (if not, maybe you can suggest other tools for it?) Andreas Heigl: While it looks like this is not something for ICU[1], there are libraries out there handling that - most of the time based on the thesis of Franklin Mark Liang. I've built an implementation for PHP[2] but there are a lot of others out there[3]. [1] https://github.com/unicode-org/icu4x/issues/164#issuecomment-651410272 [2] https://github.com/heiglandreas/Org_Heigl_Hyphenator [3] https://github.com/search?q=hyphenate&type=repositories https://tug.org/docs/liang/liang-thesis.pdf
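A minimal sketch using the third-party Pyphen library (an assumption: it implements Liang-style hyphenation patterns, in the spirit of the libraries listed in [3] above); the exact break points depend on the pattern dictionary shipped for the chosen language.

import pyphen

dic = pyphen.Pyphen(lang="en_US")
print(dic.inserted("astronomical"))   # inserts '-' at the allowed hyphenation points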

DNS title, Domain Name title, Domain Name System title


http://lambda-the-ultimate.org/node/5674#comment-97016 jlf: I created this section because of this comment Have you ever looked at how international encoding of DNS names are done in URLs? It uses Punycode, and it's a disaster. Here's a good starting point to read up on this: https://en.wikipedia.org/wiki/Internationalized_domain_name https://en.wikipedia.org/wiki/Internationalized_domain_name Internationalized domain name ToASCII leaves ASCII labels unchanged. It fails if the label is unsuitable for the Domain Name System. For labels containing at least one non-ASCII character, ToASCII applies the Nameprep algorithm (https://en.wikipedia.org/wiki/Nameprep) This converts the label to lowercase and performs other normalization. ToASCII then translates the result to ASCII, using Punycode (https://en.wikipedia.org/wiki/Punycode) Finally, it prepends the four-character string "xn--". This four-character string is called the ASCII Compatible Encoding (ACE) prefix. It is used to distinguish labels encoded in Punycode from ordinary ASCII labels. The ToASCII algorithm can fail in several ways. For example, the final string could exceed the 63-character limit of a DNS label. A label for which ToASCII fails cannot be used in an internationalized domain name. The function ToUnicode reverses the action of ToASCII, stripping off the ACE prefix and applying the Punycode decode algorithm. It does not reverse the Nameprep processing, since that is merely a normalization and is by nature irreversible. Unlike ToASCII, ToUnicode always succeeds, because it simply returns the original string if decoding fails. In particular, this means that ToUnicode has no effect on a string that does not begin with the ACE prefix. https://en.wikipedia.org/wiki/Punycode Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München (German name for Munich) is encoded as Mnchen-3ya.
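The München example above, reproduced with Python's built-in codecs (note: the built-in idna codec implements IDNA 2003/Nameprep; the third-party idna package implements IDNA 2008).

print("München".encode("punycode"))          # b'Mnchen-3ya'  (raw Punycode, RFC 3492, no Nameprep)
print("münchen.de".encode("idna"))           # b'xn--mnchen-3ya.de'  (ToASCII: Nameprep + Punycode + ACE prefix)
print(b"xn--mnchen-3ya.de".decode("idna"))   # 'münchen.de'  (ToUnicode)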

All languages


https://www.omniglot.com/index.htm The online encyclopedia of writing systems & languages jlf: nothing about Unicode, but good for general knowledge.

Classical languages


https://docs.cltk.org/en/latest/ https://github.com/cltk/cltk The Classical Language Toolkit Python library The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for pre-modern languages. Pre-configured pipelines are available for 19 languages. Akkadian Arabic Aramaic Classical Chinese Coptic Gothic Greek Hindi Latin Middle High German English French Old Church Slavonic Old Norse Pali Panjabi Sanskrit (Some parts of the Sanskrit library are forked from the Indic NLP Library)

Arabic language


https://en.wikipedia.org/wiki/Arabic_script_in_Unicode Arabic script in Unicode

Indic languages


https://www.unicode.org/faq/indic.html Indic scripts in the narrow sense are the nine major Brahmi-derived scripts of India. In a wider sense, the term can cover all Brahmic scripts and Kharoshthi. What is ISCII? Indian Standard Code for Information Interchange (ISCII) is the character code for Indian scripts that originate from the Brahmi script. Keywords: nukta Vedic Sanskrit vowel signs (matras) vowel modifiers (candrabindu, anusvara) the consonant modifier (nukta) Tamil Bengali (Bangla) / Assamese Script Sindhi implosive consonants FAQ: How do I collate Indic language data? Collation order is not the same as code point order. A good treatment of some issues specific to collation in Indic languages can be found in the paper Issues in Indic Language Collation by Cathy Wissink (https://www.unicode.org/notes/tn1/) Collation in general must proceed at the level of language or language variant, not at the script or codepoint levels. See also UTS #10: Unicode Collation Algorithm. Some Indic-specific issues are also discussed in that report. This section illustrates that Unicode’s concepts like “extended grapheme cluster” are meant to provide some low-level, general segmentation, and are not going to be enough for ideal experience for end users. https://en.wikipedia.org/wiki/Devanagari#Conjunct_consonants https://en.wikipedia.org/wiki/Devanagari_conjuncts Conjunct consonants are a form of orthographic ligature characteristic of the Brahmic scripts. They are constructed of two or more consonant letters. Biconsonantal conjuncts are common, but longer conjuncts are increasingly constrained by the languages' phonologies and the actual number of conjuncts observed drops sharply. Ulrich Stiehl includes a five-letter Devanagari conjunct र्त्स्न्य (rtsny)[1] among the top 360 most frequent conjuncts found in Classical Sanskrit;[2] the complete list appears below. Conjuncts often span a syllable boundary, and many of the conjuncts below occur only in the middle of words, where the coda consonants of one syllable are conjoined with the onset consonants of the following syllable. [1] As in Sanskrit word कार्त्स्न्य (In Bengali Script কার্ৎস্ন্য), meaning "The Whole, Entirety" [2] Stiehl, Ulrich. "Devanagari-Schreibübungen" (PDF). www.sanskritweb.net. http://www.sanskritweb.net/deutsch/devanagari.pdf https://stackoverflow.com/questions/6805311/combining-devanagari-characters Combining Devanagari characters "बिक्रम मेरो नाम हो"~text~graphemes== a GraphemeSupplier 1 : T'बि' 2 : T'क्' <-- According to the comments, these 2 graphemes should be only one: क्र 3 : T'र' <-- even ICU doesn't support that... 
it's a tailored grapheme cluster 4 : T'म' 5 : T' ' 6 : T'मे' 7 : T'रो' 8 : T' ' 9 : T'ना' 10 : T'म' 11 : T' ' 12 : T'हो' "बिक्रम मेरो नाम हो"~text~characters== an Array (shape [18], 18 items) 1 : ( "ब" U+092C Lo 1 "DEVANAGARI LETTER BA" ) 2 : ( "ि" U+093F Mc 0 "DEVANAGARI VOWEL SIGN I" ) 3 : ( "क" U+0915 Lo 1 "DEVANAGARI LETTER KA" ) 4 : ( "्" U+094D Mn 0 "DEVANAGARI SIGN VIRAMA" ) <-- influence segmentation 5 : ( "र" U+0930 Lo 1 "DEVANAGARI LETTER RA" ) 6 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" ) 7 : ( " " U+0020 Zs 1 "SPACE", "SP" ) 8 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" ) 9 : ( "े" U+0947 Mn 0 "DEVANAGARI VOWEL SIGN E" ) 10 : ( "र" U+0930 Lo 1 "DEVANAGARI LETTER RA" ) 11 : ( "ो" U+094B Mc 0 "DEVANAGARI VOWEL SIGN O" ) 12 : ( " " U+0020 Zs 1 "SPACE", "SP" ) 13 : ( "न" U+0928 Lo 1 "DEVANAGARI LETTER NA" ) 14 : ( "ा" U+093E Mc 0 "DEVANAGARI VOWEL SIGN AA" ) 15 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" ) 16 : ( " " U+0020 Zs 1 "SPACE", "SP" ) 17 : ( "ह" U+0939 Lo 1 "DEVANAGARI LETTER HA" ) 18 : ( "ो" U+094B Mc 0 "DEVANAGARI VOWEL SIGN O" ) In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. virama = u'\N{DEVANAGARI SIGN VIRAMA}' cluster = u'' last = None for c in s: cat = unicodedata.category(c)[0] if cat == 'M' or cat == 'L' and last == virama: cluster += c else: if cluster: yield cluster cluster = c last = c if cluster: yield cluster --- Let's cover the grammar very quickly: The Devanagari Block. As a developer, there are two character classes you'll want to concern yourself with: Sign: This is a character that affects a previously-occurring character. Example, this character: ्. The light-colored circle indicates the location of the center of the character it is to be placed upon. Letter / Vowel / Other: This is a character that may be affected by signs. Example, this character: क. Combination result of ् and क: क्. But combinations can extend, so क् and षति will actually become क्षति (in this case, we right-rotate the first character by 90 degrees, modify some of the stylish elements, and attach it at the left side of the second character). https://news.ycombinator.com/item?id=20058454 If I type anything like किमपि (“kimapi”) and hit backspace, it turns into किमप (“kimapa”). That is, the following sequence of codepoints: ‎0915 DEVANAGARI LETTER KA ‎093F DEVANAGARI VOWEL SIGN I ‎092E DEVANAGARI LETTER MA ‎092A DEVANAGARI LETTER PA ‎093F DEVANAGARI VOWEL SIGN I made of three grapheme clusters (containing 2, 1, and 2 codepoints respectively), turns after a single backspace into the following sequence: ‎0915 DEVANAGARI LETTER KA ‎093F DEVANAGARI VOWEL SIGN I ‎092E DEVANAGARI LETTER MA ‎092A DEVANAGARI LETTER PA This is what I expect/find intuitive, too, as a user. Similarly अन्यच्च is made of 3 grapheme clusters but you hit backspace 7 times to delete it (though there I'd slightly have preferred अन्यच्च→अन्यच्→अन्य→अन्→अ instead of अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that's seen, but one can live with this). https://github.com/anoopkunchukuttan/indic_nlp_library The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. 
The library provides the following functionalities: Text Normalization Script Information Word Tokenization and Detokenization Sentence Splitting Word Segmentation Syllabification Script Conversion Romanization Indicization Transliteration Translation https://github.com/AI4Bharat/indicnlp_catalog The Indic NLP Catalog jlf: way beyond Unicode, tons of URLs... https://news.ycombinator.com/item?id=20056966 jlf: Devanagari seems to be an example where grapheme is not the right segmentation What does "index" mean? (Hindi) "इंडेक्स" का क्या अर्थ है? Including the quote marks, spaces, and question mark, that's 18 characters. As a native speaker, shouldn't they be considered 15 characters? क्स, क्या and र्थ each form individual conjunct consonants. Counting them as two would then beget the question as to why डे is not considered two characters too, seeing as it is formed by combining ड and ए, much like क्स is formed by combining क् and स. ... Devnagari allows simple characters to form compound characters. Regarding क्स and डे, the difference between them is that the former is a combination of two consonants (pronounced "ks") while the latter is formed by a consonant and a vowel ("de"). However, looking at the visual representation is wrong, since डा (consonant+vowel) would also look like two characters. https://slidetodoc.com/indic-text-segmentation-presented-by-swaran-lata-senior/ INDIC TEXT SEGMENTATION https://github.com/w3c/iip/issues/34 the final rendered state of the text is what influences the segmentation, rather than the sequence of code points used. https://docs.microsoft.com/en-us/typography/ https://docs.microsoft.com/en-us/typography/script-development/tamil Developing OpenType Fonts for Tamil Script The first step is to analyze the input text and break it into syllable clusters. Then apply font features, compute ligatures, and combine marks. https://docs.microsoft.com/en-us/typography/script-development/devanagari Developing OpenType Fonts for Devanagari Script https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/ Picking Apart the Crashing iOS String Posted by Manish Goregaokar on February 15, 2018 Indic scripts and consonant clusters jlf: he's a black belt! or is it his native tongue? https://stackoverflow.com/questions/75210512/how-to-split-devanagari-bi-tri-and-tetra-conjunct-consonants-as-a-whole-from-a-s How to split Devanagari bi-tri and tetra conjunct consonants as a whole from a string? "हिन्दी मुख्यमंत्री हिमंत" Current output: हि न् दी मु ख् य मं त् री हि मं त Desired output: हि न्दी मु ख्य मं त्री हि मं त https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf Proper Complex Script Support in Text Terminals page 8 Characters in one line will further be grouped into terminal clusters. A terminal cluster contains the characters that are combined together in the terminal environment. It is an instance of the tailored grapheme cluster defined in UAX #29. In Indic scripts, for example, syllables with virama conjoiners in the middle will be considered one single terminal cluster, while they are treated as multiple extended grapheme clusters in UAX #29. --- page 9 In some writing systems, the form of a character may depend on the characters that follow it. One example of this is Devanagari’s repha forms. This requires the establishment of a work zone that contains the most recent characters, and the property of the characters in the work zone is considered volatile and may change depending on the incoming text from the guest. 
When the terminal receives text, it will first append the text into the work zone and measure the entire work zone to process potential property changes. If the measurement result says that the text in the work zone could be broken into multiple clusters, then the work zone will be shrunk to only contain the last (maybe incomplete) cluster. The text before that will be committed, and its properties will no longer change. As a result, at any time the work zone will contain at most one cluster. When the cursor moves (via the terminal receiving a cursor move command or a newline), all the text in the work zone will be committed—even if it is incomplete—and the work zone will be cleared. https://slideplayer.com/slide/11341056/ INDIC TEXT SEGMENTATION todo: read https://news.ycombinator.com/item?id=9219162 I Can Text You A Pile of Poo, But I Can’t Write My Name March 17th, 2015 jlf: the article is about Bengali, but HN comments are also for other languages. todo: read https://www.unicode.org/L2/L2023/23140-graphemes-expectations.pdf Unicode 15.1: Unicode grapheme clusters tend to be closer to the larger user-perceived units. Hangul text is clearly segmented into syllable blocks. For Brahmic scripts, things are less clear. Grapheme clusters may contain several base-level units, but up to Unicode 15 always broke after virama characters. This broke not only within orthographic syllables, but for a number of scripts also within the encoding of conjunct forms that users perceive as base-level units, such as Khmer coengs (see subsection Subscript Consonant Signs of section 16.4 Khmer of the Unicode Standard). In Unicode 15.1, this is being corrected for six scripts, while leaving the others broken.

CJK


https://resources.oreilly.com/examples/9781565922242/blob/master/doc/cjk.inf Version 2.1 (July 12, 1996) Online Companion to "Understanding Japanese Information Processing" This online document provides information on CJK (that is, Chinese, Japanese, and Korean) character set standards and encoding systems. --- jlf: 1996... but maybe some things to learn. https://en.wikipedia.org/wiki/Cangjie_input_method Cangjie input method jlf: nothing about Unicode... but maybe some things to learn.

Korean


22/05/2021 http://gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html The Korean Writing System

Japanese


https://heistak.github.io/your-code-displays-japanese-wrong/ https://news.ycombinator.com/item?id=29022906 https://www.johndcook.com/blog/2022/09/25/katakana-hiragana-unicode/ https://news.ycombinator.com/item?id=32987710

Polish


https://www.twardoch.com/download/polishhowto/index.html Polish diacritics how to?

IME - Input Method Editor


https://hsivonen.fi/ime/ An IME is a piece of software that transforms user-generated input events (mostly keyboard events, but some IMEs allow some auxiliary pointing device interaction) into text in a manner more complex than a mere keyboard layout. Basically, if the relationship between the keys that a user presses on a hardware keyboard and the text that ends up in an application's text buffer is more complex than when writing French, an IME is in use.

Text editing


https://lord.io/text-editing-hates-you-too/ TEXT EDITING HATES YOU TOO

Text rendering, Text shaping library


https://faultlore.com/blah/text-hates-you/ Text Rendering Hates You Aria Beingessner September 28th, 2019 jlf: general knowledge; todo: read https://harfbuzz.github.io/ https://github.com/harfbuzz/harfbuzz jlf: referenced by ICU Users of ICU Layout are strongly encouraged to consider the HarfBuzz project as a replacement for the ICU Layout Engine. Alternatives: Uniscribe if you are writing Windows software, CoreText on macOS.

String Matching


https://www.w3.org/TR/charmod-norm/ String matching Case folding is the process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching. This is distinct from case mapping, which is primarily meant for display purposes. As with the default case mappings, Unicode defines default case fold mappings ("case folding") for each Unicode code point.
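The lower-vs-casefold distinction above, in Python terms: str.lower() is a case mapping (for display), str.casefold() is the case folding meant for matching.

print("Straße".lower())                              # 'straße'   : case mapping, for display
print("Straße".casefold())                           # 'strasse'  : case folding, for matching
print("Straße".casefold() == "STRASSE".casefold())   # True
print("µ".casefold() == "μ".casefold())              # True: MICRO SIGN folds to GREEK SMALL LETTER MU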

Fuzzy String Matching


29/05/2021 https://github.com/logannc/fuzzywuzzy-rs Rust port of the Python fuzzywuzzy https://github.com/seatgeek/fuzzywuzzy --> moved to https://github.com/seatgeek/thefuzz

Levenshtein distance and string similarity


https://github.com/ztane/python-Levenshtein/ The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
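For reference, the textbook dynamic-programming formulation that the library above optimizes (a minimal, unoptimized Python sketch). Note that it counts code points, so normalizing both strings first (e.g. to NFC) avoids spurious differences between precomposed and decomposed input.

def levenshtein(a: str, b: str) -> int:
    # O(len(a) * len(b)) edit distance over code points
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3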

String comparison


31/05/2021 https://stackoverflow.com/questions/49662585/how-do-i-compare-a-unicode-string-that-has-different-bytes-but-the-same-value A pair NFC considers different but a user might consider the same is 'µ' (MICRO SIGN) and 'μ' (GREEK SMALL LETTER MU). NFKC will collapse these two. https://www.unicode.org/reports/tr10/ Unicode® Technical Standard #10 UNICODE COLLATION ALGORITHM Collation is the general term for the process and function of determining the sorting order of strings of characters. Collation varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation can also be customized according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on. https://en.wikipedia.org/wiki/Unicode_equivalence Short definition of NFD, NFC, NFKD, NFKC In this article, a short paragraph which confirms that it's important to keep the original string unchanged! Errors due to normalization differences When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Samba file- and printer-sharing software. Samba did not recognise the altered filenames as equivalent to the original, leading to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible. http://sourceforge.net/p/netatalk/bugs/348/ #348 volcharset:UTF8 doesn't work from Mac https://www.unicode.org/faq/normalization.html More detailed description of normalization PHP http://php.net/manual/en/collator.compare.php Collator::compare -- collator_compare — Compare two Unicode strings Object oriented style public int Collator::compare ( string $str1 , string $str2 ) Procedural style int collator_compare ( Collator $coll , string $str1 , string $str2 ) http://php.net/manual/en/class.collator.php Provides string comparison capability with support for appropriate locale-sensitive sort orderings. Swift https://developer.apple.com/library/prerelease/watchos/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode scalars behind the scenes. .characters.count for character in dogString.characters for codeUnit in dogString.utf8 for codeUnit in dogString.utf16 for scalar in dogString.unicodeScalars Nothing about ordered comparison in the Swift doc? http://oleb.net/blog/2014/07/swift-strings/ Ordering strings with the < and > operators uses the default Unicode collation algorithm. In the example below, "é" is smaller than i because the collation algorithm specifies that characters with combining marks follow right after their base character. "résumé" < "risotto" // -> true The String type does not (yet?) come with a method to specify the language to use for collation. 
You should continue to use -[NSString compare:options:range:locale:] or -[NSString localizedCompare:] if you need to sort strings that are shown to the user. In this example, specifying a locale that uses the German phonebook collation yields a different result than the default string ordering: let muffe = "Muffe" let müller = "Müller" muffe < müller // -> true // Comparison using an US English locale yields the same result let muffeRange = muffe.startIndex..<muffe.endIndex let en_US = NSLocale(localeIdentifier: "en_US") muffe.compare(müller, options: nil, range: muffeRange, locale: en_US) // -> .OrderedAscending // Germany phonebook ordering treats "ü" as "ue". // Thus, "Müller" < "Muffe" let de_DE_phonebook = NSLocale(localeIdentifier: "de_DE@collation=phonebook") muffe.compare(müller, options: nil, range: muffeRange, locale: de_DE_phonebook) // -> .OrderedDescending Java https://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/ How the JVM compares your strings using the craziest x86 instruction you've never heard of. --- A comment about this article: PCMPxSTRx is no longer faster than equivalent "simple" vector instruction sequences for straightforward comparisons (this had already been the case for a few years when that article was written, which is curious). It can be used productively (with some care) for some other operations like substring matching, but that's not as much of a heavy-hitter. There's a bunch of string stuff that will benefit from general vectorization, and which is absolutely on our roadmap to tackle, but using the PCMPxSTRx instructions specifically isn't a source of wins on the most important operations C# https://docs.microsoft.com/en-us/dotnet/standard/base-types/comparing https://docs.microsoft.com/en-us/dotnet/core/extensions/performing-culture-insensitive-string-comparisons
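The µ/μ pair and canonical equivalence from the start of this section, checked with Python's standard unicodedata module.

import unicodedata

micro, mu = "\u00B5", "\u03BC"                        # MICRO SIGN, GREEK SMALL LETTER MU
print(unicodedata.normalize("NFC", micro) == mu)      # False: not canonically equivalent
print(unicodedata.normalize("NFKC", micro) == mu)     # True: compatibility-equivalent

print("e\u0301" == "\u00E9")                               # False: different code point sequences
print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9") # True: canonically equivalent after composition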

JSON


https://www.reddit.com/r/programming/comments/q5vmxc/parsing_json_is_a_minefield_2018/ https://seriot.ch/projects/parsing_json.html Parsing JSON is a Minefield Search for "unicode" 30/05/2021 https://datatracker.ietf.org/doc/html/rfc8259 The JavaScript Object Notation (JSON) Data Interchange Format See this section about strings and encoding: https://datatracker.ietf.org/doc/html/rfc8259#section-7
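A small Python illustration of RFC 8259 section 7: non-ASCII characters may be emitted raw or escaped, and characters outside the BMP are escaped as UTF-16 surrogate pairs.

import json

print(json.dumps("é 😀"))                      # "\u00e9 \ud83d\ude00"  (escaped; U+1F600 becomes a surrogate pair)
print(json.dumps("é 😀", ensure_ascii=False))  # "é 😀"  (raw UTF-8 output)
print(json.loads('"\\ud83d\\ude00"'))          # 😀  (the surrogate pair decodes back to one code point)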

TOML serialization format


https://github.com/toml-lang/toml Tom's Obvious, Minimal Language TOML is a nice serialization format for human-maintained data structures. It’s line-delimited and—of course!—allows comments, and any Unicode code point can be expressed in simple hexadecimal. TOML is fairly new, and its specification is still in flux (TOML 1.0.0, released in January 2021, has since stabilized the specification).
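The hexadecimal escape mentioned above, parsed with tomllib (in the Python standard library since 3.11).

import tomllib

doc = tomllib.loads(r'city = "M\u00fcnchen"')   # basic strings accept \uXXXX / \UXXXXXXXX escapes
print(doc["city"])                              # München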

CBOR Concise Binary Representation


https://cbor.io/ RFC 8949 Concise Binary Object Representation CBOR improves upon JSON’s efficiency and also allows for storage of binary strings. Whereas JSON encoders must stringify numbers and escape all strings, CBOR stores numbers “literally” and prefixes strings with their length, which obviates the need to escape those strings. https://www.rfc-editor.org/rfc/rfc8949.html RFC 8949 Concise Binary Object Representation (CBOR) In contrast to formats such as JSON, the Unicode characters in this type are never escaped. Thus, a newline character (U+000A) is always represented in a string as the byte 0x0a, and never as the bytes 0x5c6e (the characters "\" and "n") nor as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and "a").
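A minimal by-hand illustration of the point above (RFC 8949 major type 3): the string's UTF-8 bytes are stored as-is behind a length prefix, nothing is escaped; short strings (length <= 23) encode the length directly in the initial byte.

s = "a\nb"                                         # the newline stays a literal 0x0A byte
payload = s.encode("utf-8")
assert len(payload) <= 23
encoded = bytes([0x60 | len(payload)]) + payload   # 0x60 = major type 3 (text string)
print(encoded)                                     # b'\x63a\nb'
# longer strings carry an extra length field (initial byte 0x78, 0x79, ...), the payload is still raw UTF-8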

Binary encoding in Unicode


10/07/2021 https://qntm.org/unicodings Efficiently encoding binary data in Unicode in UTF-8, use Base64 or Base85 in UTF-16, use Base32768 in UTF-32, use Base65536 https://qntm.org/safe What makes a Unicode code point safe? https://github.com/qntm/safe-code-point Ascertains whether a Unicode code point is 'safe' for the purposes of encoding binary data https://github.com/qntm/base2048 Binary encoding optimised for Twitter Originally, Twitter allowed Tweets to be at most 140 characters. On 26 September 2017, Twitter allowed 280 characters. Maximum Tweet length is indeed 280 Unicode code points. Twitter divides Unicode into 4,352 "light" code points (U+0000 to U+10FF inclusive) and 1,109,760 "heavy" code points (U+1100 to U+10FFFF inclusive). Base2048 solely uses light characters, which means a new "long" Tweet can contain at most 280 characters of Base2048. Base2048 is an 11-bit encoding, so those 280 characters encode 3080 bits i.e. 385 octets of data, significantly better than Base65536. https://github.com/qntm/base65536 Unicode's answer to Base64 Base2048 renders Base65536 obsolete for its original intended purpose of sending binary data through Twitter. However, Base65536 remains the state of the art for sending binary data through text-based systems which naively count Unicode code points, particularly those using the fixed-width UTF-32 encoding.

Invalid format


22/07/2021 https://stackoverflow.com/questions/52131881/does-the-winapi-ever-validate-utf-16 Does the WinApi ever validate UTF-16? Windows wide characters are arbitrary 16-bit numbers (formerly called "UCS-2", before the Unicode Standard Consortium purged that notation). So you cannot assume that it will be a valid UTF-16 sequence. (MultiByteToWideChar is a notable exception that does return only UTF-16) 28/07/2021 https://invisible-island.net/xterm/bad-utf8/ Unicode replacement character in the Linux console. This test text examines, how UTF-8 decoders handle various types of corrupted or otherwise interesting UTF-8 sequences. jlf : difficult to understand what is the conclusion... What I notice in this review is : Unicode 10.0.0's chapter 3 (June 2017): each of the ill-formed code units is separately replaced by U+FFFD. That recommendation first appeared in Unicode 6's chapter 3 on conformance (February 2011). However the comments about “best practice” were removed in Unicode 11.0.0 (June 2018). The W3C WHATWG page entitled Encoding Standard started in January 2013. The constraints in the utf-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are obviously fine, even encouraged). Although Unicode withdrew the recommendation more than two years ago, to date (August 2020) that is not yet corrected in the WHATWG page. 30/07/2021 https://hsivonen.fi/broken-utf-8/ --- The Unicode Technical Committee retracted the change in its meeting on August 3 2017, so the concern expressed below is now moot. --- Not all byte sequences are valid UTF-8. When decoding potentially invalid UTF-8 input into a valid Unicode representation, something has to be done about invalid input. The naïve answer is to ignore invalid input until finding valid input again (i.e. finding the next byte that has a lead-byte value), but this is dangerous and should never be done. The danger is that silently dropping bogus bytes might make a string that didn’t look dangerous with the bogus bytes present become valid active content. Most simply, <scr�ipt> (� standing in for a bogus byte) could become <script> if the error is ignored. So it’s non-controversial that every sequence of bogus bytes should result in at least one REPLACEMENT CHARACTER and that the next lead-valued byte is the first byte that’s no longer part of the invalid sequence. But how many REPLACEMENT CHARACTERs should be generated for a sequence of multiple bogus bytes? jlf: the answer is not clear to me... https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt UTF-8 decoder capability and stress test
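The <scr�ipt> example above, in Python: ignoring bogus bytes is the dangerous option, replacing them with U+FFFD is the safe one.

data = b"<scr\x80ipt>"                           # 0x80 is a lone continuation byte, invalid UTF-8
print(data.decode("utf-8", errors="replace"))    # <scr�ipt>   (U+FFFD REPLACEMENT CHARACTER)
print(data.decode("utf-8", errors="ignore"))     # <script>    (silently becomes active content)
# errors="strict" (the default) raises UnicodeDecodeError instead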

Mojibake


https://github.com/LuminosoInsight/python-ftfy ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else 03/07/2021 Notebook in python-ftfy: Services such as Slack and Discord don't use Unicode for their emoji. They use ASCII strings like :green-heart: and turn them into images. These won't help you test anything. I recommend getting emoji for your test cases by copy-pasting them from emojipedia.org. https://emojipedia.org/ https://en.wikipedia.org/wiki/Mojibake
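A classic mojibake round-trip in plain Python (UTF-8 bytes mis-decoded as Latin-1, then repaired); ftfy's fix_text() automates detecting and undoing this kind of mix-up.

good = "café"
mangled = good.encode("utf-8").decode("latin-1")     # 'cafÃ©' : UTF-8 bytes read as Latin-1
repaired = mangled.encode("latin-1").decode("utf-8")
assert repaired == good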

Filenames


https://opensource.apple.com/source/subversion/subversion-52/subversion/notes/unicode-composition-for-filenames.auto.html 2 problems follow: 1) We can't generally depend on the OS to give us back the exact filename we gave it 2) The same filename may be encoded in different codepoints https://linux.die.net/man/1/convmv convmv - converts filenames from one encoding to another https://news.ycombinator.com/item?id=33986655 jlf: discussion about text vs byte for filenames https://news.ycombinator.com/item?id=33991506 Python already has the "surrogateescape" error handler [0] that performs something similar to what you described: undecodable bytes are translated into unpaired U+DC80 to U+DCFF surrogates. Of course, this isn't standardized in any way, but I've found it useful myself for smuggling raw pathnames through Java. [0] https://peps.python.org/pep-0383/ https://news.ycombinator.com/item?id=33988943 I’m a little confused, how can a file name be non-decodable? A file with that name exists, so someone somewhere knows how to decode it. Why wouldn’t Python just always use the same encoding as the OS it’s running on? Is this some locale-related thing? --- > A file with that name exists, so someone somewhere knows how to decode it. No. A unix filename is just a bunch of bytes (two of them being off-limits). There is no requirement that it be in any encoding. You can always use a fallback encoding (an iso-8859) to get something out of the garbage, but it's just that, garbage. Windows has a similar issue, NTFS paths are sequences of UCS2 code units, but there's no guarantee that they form any sort of valid UTF-16 string, you can find random lone surrogates for instance. And I'm sure network filesystems have invented their own even worse issues, because being awful is what they do. > Why wouldn’t Python just always use the same encoding as the OS it’s running on? 1. because OS don't really have encodings, Python has a function to try and retrieve FS encoding[0] but per the above there's no requirement that it is correct for any file, let alone the one you actually want to open (hell technically speaking it's not even a property of the FS) 2. because OS lie and user configurations are garbage, you can't even trust the user's locale to be configured properly for reading files (an other mistake Python 3 made, incidentally) 3. because the user may not even have created the file, it might come from a broken archive, or some random download from someone having fun with filenames, or from fetching crap from an FTP or network share There are a few FS / FS configurations which are reliable, in that case they either error or pre-mangle the files on intake. IIRC ZFS can be configured to only accept valid UTF-8 filenames, HFS(+) requires valid unicode (stored as UTF-16) and APFS does as well (stored as UTF-8). [0] https://docs.python.org/3/library/sys.html#sys.getfilesystem... https://news.ycombinator.com/item?id=33986421 Stefan Karpinski: On UNIX, paths are UTF-8 by convention, but not forced to be valid. Treating paths as UTF-8 works very well as long as you hadn't also make the mistake of requiring your UTF-8 strings to be valid (which Python did, unfortunately). On Windows, paths are UTF-16 by convention, but also not forced to be valid. However, invalid UTF-16 can be faithfully converted to WTF-8 and converted back losslessly, so you can translate Windows paths to WTF-16 and everything Just Works™ [1]. 
There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes. [1] Ok, here's why the WTF-8 thing works so well. If we write WTF-16 for potentially invalid UTF-16 (just arbitrary sequences of 16-bit code units), then the mapping between WTF-16 and WTF-8 space is a bijection because it's losslessly round-trippable. But more importantly, this WTF-8/16 bijection is also a homomorphism with respect to pretty much any string operation you can think of. For example `utf16_concat(a, b) == utf8_concat(wtf8(a), wtf8(b))` for arbitrary UTF-16 strings a and b. Similar identities hold for other string operations like searching for substrings or splitting on specific strings. --- > There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes. Nonsense. Unix paths use the system locale by convention, and it's entirely normal for that to be Shift-JIS. https://news.ycombinator.com/item?id=33985510 Stefan Karpinski: Absolutely right. Deprecating direct string indexing would have been the right move. Require writing `str.chars()` to get something that lets you slice by Unicode characters (i.e. code points); provide `str.graphemes()` and `str.grapheme_clusters()` to get something that lets you slice by graphemes and grapheme clusters, respectively. Cache an index structure that lets you do that kind of indexing efficiently once you've asked for it the first time. Provide an API to clear the caches. Not allowing strings to represent invalid Unicode is also a huge mistake (and essentially forced by the representation strategy that they adopted). It forces any programmer who wants to robustly handle potentially invalid string data to use byte vectors instead. Which is exactly what they did with OS paths, but that's far from the only place you can get invalid strings. You can get invalid strings almost anywhere! Worse, since it's incredibly inconvenient to work with byte vectors when you want to do stringlike stuff, no one does it unless forced to, so this design choice effectively guarantees that all Python code that works with strings will blow up if it encounters anything invalid—which is a very common occurrence. If only there was a type that behaves like a string and supports all the handy string operations but which handles invalid data gracefully. Then you could write robust string code conveniently. But at that point, you should just make that the standard string type! This isn't hypothetical, it's exactly how Burnt Sushi's bstr type [1] works in Rust and how the standard String type works in Julia. [1] https://github.com/BurntSushi/bstr --- Jasper_ It's worth noting that Python str's are sequences of code points, not scalar values. This was a truly horrendous mistake made mostly out of ignorance, but now they rely upon it in surrogateescape to hide "invalid" data, so... I have ranted for long hours go friends about the insanity of Python 3's text model before. It's mostly the blind leading the blind. --- Animats: Unicode string indexing should have been made lazy, rather than deprecated. Random access to strings is rare. Mostly, operations are moving forward linearly or using saved positions. So, only build the index for random access if needed. 
Optimize "advance one glyph" and "back up one glyph" expressed as indexing, and you'll get most of the frequently used cases. Have the "index" functions that return a string index return an opaque type that's a byte index. Attempting to convert that to an integer forces creation of the string index. This preserves the user visible semantics but keeps performance. PyPy does something like this.

WTF8


https://news.ycombinator.com/item?id=9611710 The WTF-8 encoding (simonsapin.github.io) https://news.ycombinator.com/item?id=9613971 https://simonsapin.github.io/wtf-8/#acknowledgments Thanks to Coralie Mercier for coining the name WTF-8. --- The name is unserious but the project is very serious, its writer has responded to a few comments and linked to a presentation of his on the subject[0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates: while UTF8 is the modern encoding you have to interact with legacy systems, for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed) but a number of other legacy systems used UCS2 and added visible surrogates (rather than proper UTF-16) afterwards. Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates. Having to interact with those systems from a UTF8-encoded world is an issue because they don't guarantee well-formed UTF-16, they might contain unpaired surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32 (neither allows unpaired surrogates, for obvious reasons). WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems. WTF8 exists solely as an internal encoding (in-memory representation), but it's very useful there. [0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf https://twitter.com/koalie/status/506821684687413248 Coralie Mercier @koalie I have a hunch we use "wtf-8" encoding. Appreciate the irony of: " the future of publishing at W3C" 16/07/2021 Windows allows unpaired surrogates in filenames https://github.com/golang/go/issues/32334 syscall: Windows filenames with unpaired surrogates are not handled correctly #32334 https://github.com/rust-lang/rust/issues/12056 path: Windows paths may contain non-utf8-representable sequences #12056 I don't know the precise details, but there exist portions of Windows in which paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue but at some point someone (and I wish I could remember who) showed me some output that showed that they were actually getting a UCS2 path from some Windows call and Path was unable to parse it. --- JLF: this is the birth of WTF-8 in 2014. The result is: https://simonsapin.github.io/wtf-8
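What WTF-8 permits, demonstrated with Python's surrogatepass error handler: a lone surrogate gets the 3-byte encoding that well-formed UTF-8 forbids, while a proper surrogate pair must be encoded as the single supplementary code point.

lone = "\ud83d"                                   # unpaired high surrogate (legal in WTF-16 data)
# lone.encode("utf-8")                            # would raise UnicodeEncodeError
print(lone.encode("utf-8", "surrogatepass"))      # b'\xed\xa0\xbd' : the generalized (WTF-8-style) encoding
print("\U0001F4A9".encode("utf-8"))               # b'\xf0\x9f\x92\xa9' : paired surrogates become one 4-byte code point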

Codepoint/grapheme indexation


https://nullprogram.com/blog/2019/05/29/ ObjectIcon http://objecticon.sourceforge.net/Unicode.html ucs (standing for Unicode character string) is a new builtin type, whose behaviour closely mirrors that of the conventional Icon string. It operates by providing a wrapper around a conventional Icon string, which must be in utf-8 format. This has several advantages, and only one serious disadvantage, namely that a utf-8 string is not randomly accessible, in the sense that one cannot say where the representation for unicode character i begins. To alleviate this disadvantage, the ucs type maintains an index of offsets into the utf-8 string to make random access faster. The size of the index is only a few percent of the total allocation for the ucs object. jlf: I made a code review, but could not understand how they do that :-( Not clear whether it's codepoint indexation or grapheme indexation. https://lwn.net/Articles/864994/ jlf: discussion about Raku NFG and its technical limitations. It's also the traditional discussion about "why do you need a direct access to the graphemes".
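A minimal Python sketch (not ObjectIcon's actual code) of the technique described above: record the byte offset of every k-th code point, so random access into UTF-8 only has to scan forward from the nearest checkpoint, and the index costs a few percent of the string.

class IndexedUTF8:
    def __init__(self, data: bytes, step: int = 64):
        self.data, self.step = data, step
        self.offsets = []                        # byte offset of code points 0, step, 2*step, ...
        n = 0
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:                 # lead byte => start of a code point
                if n % step == 0:
                    self.offsets.append(i)
                n += 1
        self.length = n

    def codepoint(self, index: int) -> str:
        i = self.offsets[index // self.step]     # jump to the checkpoint...
        for _ in range(index % self.step):       # ...then scan forward code point by code point
            i += 1
            while self.data[i] & 0xC0 == 0x80:
                i += 1
        j = i + 1
        while j < len(self.data) and self.data[j] & 0xC0 == 0x80:
            j += 1
        return self.data[i:j].decode("utf-8")

s = "aé漢😀" * 100
assert IndexedUTF8(s.encode("utf-8")).codepoint(5) == s[5]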

Rope


See also ZenoString (from Alan Kay - Saxonica) https://github.com/josephg/librope Little C library for heavyweight utf-8 strings (rope). https://news.ycombinator.com/item?id=8065608 Discussion about ropes, ideal of strings... https://github.com/xi-editor/xi-editor/blob/e8065a3993b80af0aadbca0e50602125d60e4e38/doc/rope_science/rope_science_03.md https://news.ycombinator.com/item?id=34948308 Several references to older papers https://news.ycombinator.com/item?id=37820532 Text showdown: Gap Buffers vs. Ropes https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation Text Buffer Reimplementation https://en.wikipedia.org/wiki/Piece_table In computing, a piece table is a data structure typically used to represent a text document while it is edited in a text editor.

Encoding title


https://www.iana.org/assignments/character-sets/character-sets.xhtml Character Sets (IANA Character Sets registry) These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation. These names are expressed in ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The character set most commonly use in the Internet and used especially in protocol standards is US-ASCII, this is strongly encouraged. The use of the name US-ASCII is also encouraged. --- jlf: see encoding.spec.whatwg.org elsewhere in this document. They say: "User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry." https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape OCTOBER 12, 2022 JeanHeyd Meneide Project Editor for ISO/IEC JTC1 SC22 WG14 - Programming Languages, C. The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust) --- jlf: Is he criticizing the work of Zach Laine? ( https://github.com/tzlaine/text ) "someone was doing something wrong on the internet and I couldn’t let that pass:" Same person: https://github.com/ThePhD https://github.com/soasis Any Encoding, Ever - ztd.text and Unicode for C++ - JUNE 30, 2021 : https://thephd.dev/any-encoding-ever-ztd-text-unicode-cpp Starting a Basis - Shepherd's Oasis and Text - MAY 01, 2020: https://thephd.dev/basis-shepherds-oasis-text-encoding https://ztdtext.readthedocs.io/en/latest/index.html ztd.text The premiere library for handling text in different encoding forms and reducing transcoding bugs in your C++ software. List of encodings: https://ztdtext.readthedocs.io/en/latest/encodings.html List of Unicode encodings: https://ztdtext.readthedocs.io/en/latest/known%20unicode%20encodings.html Design Goals and Philosophy: https://ztdtext.readthedocs.io/en/latest/design.html --- jlf: don't know what to think about that... related to https://github.com/soasis https://github.com/soasis/text JeanHeyd Meneide This repository is an implementation of an up and coming proposal percolating through SG16, P1629 - Standard Text Encoding ( https://thephd.dev/_vendor/future_cxx/papers/d1629.html ) --- https://github.com/soasis Shepherd's Oasis Software Services and Consulting. https://encoding.spec.whatwg.org/ Encoding The Encoding Standard defines encodings and their JavaScript API. --- The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels. <table> --- Most legacy encodings make use of an index. An index is an ordered list of entries, each entry consisting of a pointer and a corresponding code point. Within an index pointers are unique and code points can be duplicated. Note: An efficient implementation likely has two indexes per encoding. One optimized for its decoder and one for its encoder. https://www.git-tower.com/help/guides/faq-and-tips/faq/encoding/windows Character encoding for commit messages --- When Git creates and stores a commit, the commit message entered by the user is stored as binary data and there is no conversion between encodings. The encoding of your commit message is determined by the client you are using to compose the commit message. Git stores the name of the commit encoding if the config key "i18n.commitEncoding" is set (and if it's not the default value "utf-8"). 
If you commit changes from the command line, this value must match the encoding set in your shell environment. Otherwise, a wrong encoding is stored with the commit and can result in garbled output when viewing the commit history. If you view the commit log on the command line, the config value "i18n.logOutputEncoding" (which defaults to "i18n.commitEncoding") needs to match your shell encoding as well. The command converts messages from the commit encoding to the output encoding. If your shell encoding does not match the output encoding, you will again receive garbled output! https://www.git-scm.com/docs/gitattributes/2.18.0#_working_tree_encoding gitattributes - Defining attributes per path working-tree-encoding Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …​) as text files. Files encoded in certain other encodings (e.g. UTF-16) are interpreted as binary and consequently built-in Git text processing tools (e.g. git diff) as well as most Git web front ends do not visualize the contents of these files by default. In these cases you can tell Git the encoding of a file in the working directory with the working-tree-encoding attribute. If a file with this attribute is added to Git, then Git reencodes the content from the specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded content in its internal data structure (called "the index"). On checkout the content is reencoded back to the specified encoding. --- jlf: there is a number of pitfalls, read the article. https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text How to determine the encoding of text jlf: for Python, not reviewed, may bring interesting infos. https://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file How can I detect the encoding/codepage of a text file? jlf: for C#, not reviewed, may bring interesting infos. https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1 What is the difference between UTF-8 and ISO-8859-1? jlf: the interesting part are the comments about ISO-8859-1. --- ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. --- cp1252 is a superset of the ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol € and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up. --- jlf: so the previous comment says that ISO-8859-1 is not defined in the 0x80-0x9F range... IS IT or IS IT NOT??? --- One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead. For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085, ``), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …). The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way. https://www.mobilefish.com/tutorials/character_encoding/character_encoding_quickguide_iso8859_1.html jlf: not sure this page is a good reference. The fact they wrote "Unicode, a 16-bit character set." 
brings a doubt about the rest of their page... I reference it for their definition of ISO-8859-1. --- HTML and HTTP protocols make frequent reference to ISO Latin-1 and the character code ISO-8859-1. The HTTP specification mandates the use of the code ISO-8859-1 as the default character code that is passed over the network. ISO-8859-1 explicitly does not define displayable characters for positions 0-31 and 127-159, and the HTML standard does not allow those to be used for displayable characters. The only characters in this range that are used are 9, 10 and 13, which are tab, newline and carriage return respectively. Note: ISO-8859-1 is also known as Latin-1. --- jlf: so they say - 00..1F is not defined except 09, 0A, 0D (so they are different from https://en.wikipedia.org/wiki/ISO/IEC_8859-1) where all 00..1F is undefined. - 7F..9F is not defined Confirmed by their text file: https://www.mobilefish.com/download/character_set/iso8859_1.txt
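The 0x80-0x9F ambiguity discussed above, made concrete in Python: the latin-1 codec maps those bytes to C1 controls, while windows-1252 (cp1252) maps most of them to printable characters, which is why the WHATWG Encoding spec treats the iso-8859-1 label as windows-1252.

for b in (b"\x80", b"\x85", b"\x93"):
    print(b.hex(),
          repr(b.decode("latin-1")),      # C1 control characters (U+0080, U+0085, U+0093)
          repr(b.decode("cp1252")))       # '€', '…', '“'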

ICU title


https://icu.unicode.org https://unicode-org.github.io/icu/ ICU documentation https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ Entry point of API Reference https://icu-project.org/docs/ ICU Documents and Papers jlf: old? https://unicode-org.atlassian.net/jira/software/c/projects/ICU/issues/?filter=allissues ICU tickets https://github.com/microsoft/icu jlf: fork by Microsoft http://stackoverflow.com/questions/8253033/what-open-source-c-or-c-libraries-can-convert-arbitrary-utf-32-to-nfc What open source C or C++ libraries can convert arbitrary UTF-32 to NFC? std::string normalize(const std::string &unnormalized_utf8) { // FIXME: until ICU supports doing normalization over a UText // interface directly on our UTF-8, we'll use the insanely less // efficient approach of converting to UTF-16, normalizing, and // converting back to UTF-8. // Convert to UTF-16 string auto unnormalized_utf16 = icu::UnicodeString::fromUTF8(unnormalized_utf8); // Get a pointer to the global NFC normalizer UErrorCode icu_error = U_ZERO_ERROR; const auto *normalizer = icu::Normalizer2::getInstance(nullptr, "nfc", UNORM2_COMPOSE, icu_error); assert(U_SUCCESS(icu_error)); // Normalize our string icu::UnicodeString normalized_utf16; normalizer->normalize(unnormalized_utf16, normalized_utf16, icu_error); assert(U_SUCCESS(icu_error)); // Convert back to UTF-8 std::string normalized_utf8; normalized_utf16.toUTF8String(normalized_utf8); return normalized_utf8; } https://begriffs.com/posts/2019-05-23-unicode-icu.html Unicode programming, with examples https://en.wikipedia.org/wiki/Trie Tries are a form of string-indexed look-up data structure, which is used to store a dictionary list of words that can be searched on in a manner that allows for efficient generation of completion lists. Tries can be efficacious on string-searching algorithms such as predictive text, approximate string matching, and spell checking in comparison to a binary search trees. A trie can be seen as a tree-shaped deterministic finite automaton. https://icu.unicode.org/design/struct/utrie ICU Code Point Tries We use a form of "trie" adapted to single code points. The bits in the code point integer are divided into two or more parts. The first part is used as an array offset, the value there is used as a start offset into another array. The next code point bit field is used as an additional offset into that array, to fetch another value. The final part yields the data for the code point. Non-final arrays are called index arrays or tables. --- For a general-purpose structure, we want to be able to be able to store a unique value for every character. This determines the number of bits needed in the last index table. With 136,690 characters assigned in Unicode 10, we need at least 18 bits. We allocate data values in blocks aligned at multiples of 4, and we use 16-bit index words shifted left by 2 bits. This leads to a small loss in how densely the data table can be used, and how well it can be compacted, but not nearly as much as if we were using 32-bit index words. https://icu.unicode.org/design/struct/tries/bytestrie It maps from arbitrary byte sequences to 32-bit integers. (Small non-negative integers are stored more efficiently. Negative integers are the least efficient.) The BytesTrie and UCharsTrie structures are nearly the same, except that the UCharsTrie uses fewer, larger units. 
https://icu.unicode.org/design/struct/tries/ucharstrie Same design as a BytesTrie, but mapping any UnicodeString (any sequence of 16-bit units) to 32-bit integer values. https://icu.unicode.org/charts/charset https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt ICU alias table jlf: the ultimate reference? --- # Here is the file format using BNF-like syntax: # # converterTable ::= tags { converterLine* } # converterLine ::= converterName [ tags ] { taggedAlias* }'\n' # taggedAlias ::= alias [ tags ] # tags ::= '{' { tag+ } '}' # tag ::= standard['*'] # converterName ::= [0-9a-zA-Z:_'-']+ # alias ::= converterName --- standard # The * after the standard tag denotes that the previous alias is the # preferred (default) charset name for that standard. There can only # be one of these default charset names per converter. --- Affinity tags If an alias is given to more than one converter, it is considered to be an ambiguous alias, and the affinity list will choose the converter to use when a standard isn't specified with the alias. The general ordering is from specific and frequently used to more general or rarely used at the bottom. { UTR22 # Name format specified by https://www.unicode.org/reports/tr22/ IBM # The IBM CCSID number is specified by ibm-* WINDOWS # The Microsoft code page identifier number is specified by windows-*. The rest are recognized IE names. JAVA # Source: Sun JDK. Alias name case is ignored, but dashes are not ignored. IANA # Source: http://www.iana.org/assignments/character-sets MIME # Source: http://www.iana.org/assignments/character-sets } https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings Encodings https://unicode-org.atlassian.net/browse/ICU-22422 Collation folding jlf: see Markus Scherer feedback https://sourceforge.net/p/icu/mailman/icu-design/thread/SN6PR00MB04468327B475F4D6A19CF26FAFFFA%40SN6PR00MB0446.namprd00.prod.outlook.com/#msg38268251 [icu-design] Collation Folding Tables jlf: this is a discussion related to ICU-22422 https://www.unicode.org/reports/tr10/#Collation_Folding Collation Folding Matching can be done by using the collation elements, directly, as discussed above. However, because matching does not use any of the ordering information, the same result can be achieved by a folding. That is, two strings would fold to the same string if and only if they would match according to the (tailored) collation. For example, a folding for a Danish collation would map both "Gård" and "gaard" to the same value. A folding for a primary-strength folding would map "Resume" and "résumé" to the same value. That folded value is typically a lowercase string, such as "resume". jlf: Chrome matches "Gård" with "gard", but not with "gaard". A comparison between folded strings cannot be used for an ordering of strings, but it can be applied to searching and matching quite effectively. The data for the folding can be smaller, because the ordering information does not need to be included. The folded strings are typically much shorter than a sort key, and are human-readable, unlike the sort key. The processing necessary to produce the folding string can also be faster than that used to create the sort key. Transliterate "micro sign" to "u" using Transliterator from icu4j jlf: next is an answer on icu-support@lists.sourceforge.net https://sourceforge.net/p/icu/mailman/message/58712806/ On Wed, Dec 13, 2023 at 7:52 PM <go.al.ni@gmail.com> wrote: > Micro sign transliterated to "m" in one case, but not in another. 
While I don't know enough about the Any-Latin transliteration rules to be able to tell you why this happens, the thing that happens is that when you have any preceding Greek letter the transliterator will afterwards treat also the micro sign (U+00B5) as a Greek letter, while it otherwise will leave it as-is, as any other symbol. If you want to transliterate only Greek letters you could explicitly create a Greek transliterator, which then will always treat also the micro sign (U+00B5) as a Greek letter: var tr = Transliterator.getInstance("Greek-Latin"); Or, if you want to first treat any symbols that are also Greek letters explicitly as Greek letters and then perform the Any-Latin transliteration: var tr = Transliterator.getInstance("Greek-Latin; Any-Latin;"); Or, if you want just Any-Latin but with a special case for the micro sign (U+00B5): var tr = Transliterator.createFromRules("MyAnyLatin", "µ > m; ::Any-Latin;", Transliterator.FORWARD); [icu-support] CollationKey for efficient collation-aware in-place substring comparison Question https://sourceforge.net/p/icu/mailman/message/58741675/ I have a question regarding the use of CollationKey <https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/CollationKey.html> to check whether one string "contains" the other (i.e. right string is found anywhere in the left string, accounting for any specified rule-based collation using ICU4J). With this, my use case in Java would be something like: *contains(String left, String right, String collation)*. Suppose that *collation* here is a parameter indicating the collation at hand (for example: "Latin1_General_CS_AI"), and is used to get the appropriate instance of *com.ibm.icu.text.Collator* (exact routing for this collation is handled elsewhere in the codebase). Problem description Due to the nature of this operation, using *Collator.compare(String, String)* proves inefficient for this problem, because it would require allocating O(N) substrings of *left *before calling *compare(left.substring(), right)*. Suppose N here is the length of the *left* string. Example: *contains*("Abć", "a", "Latin1_General_CS_AI"); // returns false - calls: *collator.compare("A", "a")* // returns false ("A" here is "Abć".substring(0,1)) - calls: *collator.compare("b", "a")* // returns false ("b" here is "Abć".substring(1,2)) - calls: *collator.compare("ć", "a")* // returns false ("ć" here is "Abć".substring(2,3)) Here, this approach allocates *3 new strings* in order to do the comparisons. Using CollationKey As I understood, *com.ibm.icu.text.CollationKey* is the way to go for repeated comparison of strings. Here, I would like to compare strings in a way that only requires generating one key for *left* (let's call it *leftKey*) and one key for *right* (let's call it *rightKey*), and then comparing these arrays in-place, byte per byte. However, it doesn't seem that this operation is supported out-of-the-box with *CollationKey*. While one can easily use two collation keys for equality comparison and collation-aware ordering, I'm not sure if this holds for substring operations as well? Given a collation key for "Abć", is there a constant-time way to obtain collation keys for "A", "b", and "ć"? Ideally, I would want to only traverse the "Abć" collation key (*leftKey*) as a plain byte array, and do in-place comparison with the "ć" collation key (*rightKey*) as a plain byte array. However, it doesn't seem straightforward given the structure of the collation key (suffixes, etc.) 
public boolean contains(String left, String right, String collation) { > Collator collator = ...(collation); > // get collation keys > CollationKey leftKey = collator.getCollationKey(left); > CollationKey rightKey = collator.getCollationKey(right); > // get byte arrays > byte[] lBytes = leftKey.toByteArray(); > byte[] rBytes = rightKey.toByteArray(); > // in-place comparison > for (int i = 0; i <= lBytes.length - rBytes.length; i++) { > if (compareKeys(lBytes, rBytes, i)) { > return true; > } > } > return false; > } Suppose there's a simple helper function such as: > private boolean compareKeys(byte[] lBytes, byte[] rBytes, int offset) { > int len = rBytes.length; > // compare lBytes[i, i+len] to rBytes[0, len] in-place, byte by byte... > } Could you please provide any support regarding how to implement this solution so that it fully takes into account the collation key byte array structure? As of now, this simple comparison doesn't work because there are some suffixes in both *leftKey* and *rightKey*, so exact comparison is not possible, but I'm wondering if there is a way to go around this. Alternative It turns out that making use of *Collator.compare(Object, **Object**)* instead of *Collator.compare(String, **String**)* doesn't prove to be any better either, because it does *toString()* anyway, regressing the performance in a similar fashion. Ideally, an implementation such as *Collator.compare(Character, **Character**)* could do the trick, however only under the condition that it would *not allocate* a new *String* for the two arguments. This would allow traversing *left* and *right* strings and comparing individual characters just by using *String.charAt* (with no extra *String* allocation whatsoever). However, I don't believe there is currently anything like *Collator.compare(**Character**, **Character**)* that works exactly like this. So for now, I'm trying to implement this functionality using *CollationKey*. Answer from Markus Sherer https://sourceforge.net/p/icu/mailman/message/58741856/ Yes, but CollationKey is too low-level, and you would have to compute and store the CollationKey for the entire left string at once, which could be large. “Don't do this at home” :-) Please use class <https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html> StringSearch <https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html> https://unicode-org.github.io/icu/userguide/collation/string-search.html I don't remember if StringSearch automatically loads "search" tailorings; it's possible that you may have to request that explicitly. https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback https://www.unicode.org/reports/tr10/#Searching
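The thread above is ICU4J, but the same two ideas are easy to demo with ICU4C: primary strength makes case/accent differences irrelevant for matching, and icu::StringSearch (the class Markus Scherer points to) does collation-aware substring search without building per-substring keys. Hedged C++ sketch, assumes a standard ICU build and a UTF-8 source file:

    #include <unicode/coll.h>
    #include <unicode/stsearch.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main() {
        UErrorCode status = U_ZERO_ERROR;

        // 1) Primary strength: "Resume" and "résumé" compare equal (case and accents ignored).
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale::getRoot(), status));
        coll->setStrength(icu::Collator::PRIMARY);
        UCollationResult r = coll->compare(icu::UnicodeString::fromUTF8("Resume"),
                                           icu::UnicodeString::fromUTF8("résumé"), status);
        std::cout << "primary-equal: " << (r == UCOL_EQUAL) << "\n";   // 1

        // 2) Collation-aware "contains": find "a" inside "Abć" (the example from the question).
        icu::UnicodeString haystack = icu::UnicodeString::fromUTF8("Abć");
        icu::UnicodeString needle = icu::UnicodeString::fromUTF8("a");
        icu::StringSearch search(needle, haystack, icu::Locale::getRoot(), nullptr, status);
        search.getCollator()->setStrength(icu::Collator::PRIMARY);
        search.reset();                                     // re-apply after changing the collator
        int32_t pos = search.first(status);                 // USEARCH_DONE when not found
        std::cout << "found at UTF-16 index: " << pos << "\n";   // expected: 0
        return U_SUCCESS(status) ? 0 : 1;
    }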

ICU demos


https://icu4c-demos.unicode.org/icu-bin/icudemos todo: review https://icu4c-demos.unicode.org/icu-bin/collation.html ICU Collation Demo https://icu4c-demos.unicode.org/icu-bin/convexp Demo Converter Explorer https://icu4c-demos.unicode.org/icu-bin/scompare ICU Unicode String Comparison Interactive demo application

ICU bindings


02/06/2021 https://gitlab.pyicu.org/main/pyicu Python extension wrapping the ICU C++ libraries. 02/06/2021 https://docs.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu- In Windows 10 Creators Update, ICU was integrated into Windows, making the C APIs and data publicly accessible. The version of ICU in Windows only exposes the C APIs. It is impossible to ever expose the C++ APIs due to the lack of a stable ABI in C++. Getting started 1) Your application needs to target Windows 10 Version 1703 (Creators Update) or higher. 2) Add in the header: #include <icu.h> 3) Link to: icu.lib Example: void FormatDateTimeICU() { UErrorCode status = U_ZERO_ERROR; // Create a ICU date formatter, using only the 'short date' style format. UDateFormat* dateFormatter = udat_open(UDAT_NONE, UDAT_SHORT, nullptr, nullptr, -1, nullptr, 0, &status); if (U_FAILURE(status)) { ErrorMessage(L"Failed to create date formatter."); return; } // Get the current date and time. UDate currentDateTime = ucal_getNow(); int32_t stringSize = 0; // Determine how large the formatted string from ICU would be. stringSize = udat_format(dateFormatter, currentDateTime, nullptr, 0, nullptr, &status); if (status == U_BUFFER_OVERFLOW_ERROR) { status = U_ZERO_ERROR; // Allocate space for the formatted string. auto dateString = std::make_unique<UChar[]>(stringSize + 1); // Format the date time into the string. udat_format(dateFormatter, currentDateTime, dateString.get(), stringSize + 1, nullptr, &status); if (U_FAILURE(status)) { ErrorMessage(L"Failed to format the date time."); return; } // Output the formatted date time. OutputMessage(dateString.get()); } else { ErrorMessage(L"An error occured while trying to determine the size of the formatted date time."); return; } // We need to close the ICU date formatter. udat_close(dateFormatter); } http://www.boost.org/doc/libs/1_58_0/libs/locale/doc/html/index.html Boost.Locale creates the natural glue between the C++ locales framework, iostreams, and the powerful ICU library http://blog.lukhnos.org/post/6441462604/using-os-xs-built-in-icu-library-in-your-own Using OS X’s Built-in ICU Library in Your Own Project

ICU4X title


https://icu4x.unicode.org/ lead by Shane Carr (https://www.sffc.xyz) https://github.com/unicode-org/icu4x https://docs.rs jlf: if there is a version number in the path, you can replace it with "latest" https://www.unicode.org/faq/unicode_license.html jlf: ICU4X uses UNICODE LICENSE V3 The Unicode License is a permissive MIT type of license. However, there are several additional considerations identified separately in the associated Unicode Terms of Use (https://www.unicode.org/copyright.html). --- Comparison with other licenses: https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses jlf: hum... the "unicode license" is not in this table... https://www.reddit.com/r/rust/comments/q4xaig/icu_vs_rust_icu/ icu vs rust_icu Oct 10, 2021 --- jlf : here "icu" is ICU4X and rust_icu is another crate. Well... it's a mess, plenty of separated crates more or less finalized. There is a comment from an ICU4X committer saying "ICU4X does not have normalization". Of course, it's now supported but it's to say that ICU4X is far to be as complete as ICU. https://news.ycombinator.com/item?id=35608997 ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices MONDAY, APRIL 17, 2023 http://blog.unicode.org/2022/09/announcing-icu4x-10.html SEPTEMBER 29, 2022 Announcing ICU4X 1.0 This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Lightweight: ICU4X is Unicode's first library to support static data slicing and dynamic data loading. Portable: ICU4X supports multiple programming languages out of the box. ICU4X can be used in the Rust programming language natively, with official wrappers in C++ via the foreign function interface (FFI) and JavaScript via WebAssembly. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an explicit data provider argument on most constructor functions. ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers. https://manishearth.github.io/blog/2022/08/03/zero-copy-1-not-a-yoking-matter/ https://manishearth.github.io/blog/2022/08/03/zero-copy-2-zero-copy-all-the-things/ https://manishearth.github.io/blog/2022/08/03/zero-copy-3-so-zero-its-dot-dot-dot-negative/ (jlf: Related to ICU4X, but should I read that ? It's internal Rust stuff) https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md Using ICU4X from C++ https://www.reddit.com/r/programming/comments/xrmine/the_unicode_consortium_announces_icu4x_10_its_new/ The C and C++ APIs are header-only, you use them by linking to the icu_capi crate (more on this here). https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md The C API is just not that idiomatic, so we don't advertise it as much. It exists more as a crutch for other languages to be able to call in, and it's optimized for cross language interop. That said, it has been pointed out to me that it's not that unidiomatic when you compare it with other large C libraries, so perhaps that's okay. We do have some tests that use it directly and it's .... fine to work with. Not an amazing experience, not terrible either. 
--- jlf: to investigate The C wrapper is probably better to use from Executor, because there is no hidden magic for memory management. The C++ wrapper is difficult to understand (at least to me, for the moment) because it's modern C++. https://www.reddit.com/r/rust/comments/xrh7h6/announcing_icu4x_10_new_internationalization/ icu_segmenter implements rule based segmentation, so you can actually customize the segmentation rules based on your needs by writing some toml and feeding it to datagen. The concept of a "character" or "word" has no single cross-linguistic meaning; it is not uncommon to need to tailor these algorithms by use case or even just the language being used. E.g. handling viramas in Indic scripts as a part of grapheme segmentation is a thing people might need, but may also not need, and UAX29 doesn't support that at the moment¹. CLDR contains a bunch of common tailorings for specific locales here, but as I mentioned folks may tailor further based on use case. Furthermore, icu_segmenter supports dictionary-based segmentation: for languages like Japanese and Thai where spaces are not typically used, you need a large dictionary to be able to segment them accurately (and again, it's language-specific). ICU4X's flexible data model means that you don't need to ship your application with this data and instead fetch it when it's actually necessary. We both support using dictionaries and an LSTM model depending on your code size/data size needs. https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit#heading=h.xy9pq2mk1ypz ICU4X Segmenter Investigation https://github.com/unicode-org/icu4x/issues/1397 Character names jlf: Not yet supported by ICU4X, too bad... I need that for Executor. https://github.com/unicode-org/icu4x/issues/545 Reconsider UTF-32 support jlf: see also the comments about PyICU https://github.com/unicode-org/icu4x/issues/131 Port BytesTrie to ICU4X #131 with feedback from Markus Scherer (ICU) https://github.com/unicode-org/icu4x/issues/2721 Specialized zerovec collections for stringy types Sketch of a potential AsciiTrie. https://github.com/unicode-org/icu4x/pull/2722 Experimental AsciiTrie implementation https://github.com/unicode-org/icu4x/issues/2755 Get word break type When you iterate through text using the WordBreakIterator, you get the boundaries of words, spaces, punctuation, etc. It does not appear to tell you what kind of token or break that is has found. The C-language version of ICU has a function on the iterator called getRuleStatus() that returns an enum that describes the last break it found. The documentation is here: https://unicode-org.github.io/icu/userguide/boundaryanalysis/ https://github.com/unicode-org/icu4x/pull/2777/files added initial benchmarks for normalizer. https://github.com/unicode-org/icu4x/discussions/2877 How to use segmenter https://github.com/unicode-org/icu4x/issues/2886 Consider providing mapping to Standardized Variants for CJK Compatibility Ideographs Across GitHub, I found 3 users of this feature in unicode-normalization: https://github.com/sunfishcode/basic-text (by the implementor of the unicode-normalization feature) https://github.com/logannc/fuzzywuzzy-rs (unclear to me why you'd want this for a fuzzy match; I'd expect a fuzzy match not to want to distinguish the variations) https://github.com/crlf0710/runestr-rs https://github.com/unicode-org/icu4x/issues/2975 How supported do we consider non-keyextract users? 
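About "Get word break type" (icu4x#2755) above: this is what ICU4C's getRuleStatus() looks like on a word BreakIterator (hedged C++ sketch):

    #include <unicode/brkiter.h>
    #include <unicode/ubrk.h>          // UBRK_WORD_* rule-status ranges
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> bi(
            icu::BreakIterator::createWordInstance(icu::Locale::getEnglish(), status));
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("Hello, world 42");
        bi->setText(text);

        int32_t start = bi->first();
        for (int32_t end = bi->next(); end != icu::BreakIterator::DONE; start = end, end = bi->next()) {
            int32_t tag = bi->getRuleStatus();             // classifies the segment [start, end)
            const char *kind = "none (space/punctuation)";
            if (tag >= UBRK_WORD_NUMBER && tag < UBRK_WORD_NUMBER_LIMIT) kind = "number";
            else if (tag >= UBRK_WORD_LETTER && tag < UBRK_WORD_LETTER_LIMIT) kind = "letter";
            else if (tag >= UBRK_WORD_KANA && tag < UBRK_WORD_KANA_LIMIT) kind = "kana";
            else if (tag >= UBRK_WORD_IDEO && tag < UBRK_WORD_IDEO_LIMIT) kind = "ideographic";
            std::string piece;
            text.tempSubStringBetween(start, end).toUTF8String(piece);
            std::cout << "[" << piece << "] -> " << kind << "\n";
        }
        return U_SUCCESS(status) ? 0 : 1;
    }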
https://github.com/unicode-org/icu4x/issues/2908 Time zone needs for calendar application Use case by team member of Mozilla Thunderbird Not related to Unicode, but related to the fact I put the ICU4X cdylib in Executor github... https://github.com/ankane/polars-ruby/blob/master/ext/polars/Cargo.toml Is it a way to avoid bundling the original rust lib? https://news.ycombinator.com/item?id=34425233 --- Not clear to me: for Python, are the lib binaries installed by https://pypi.org/project/polars/ ? apparently yes, see https://pypi.org/project/polars/#files --- For ruby, is it built by a github workflow? https://github.com/ankane/polars-ruby/blob/master/.github/workflows/release.yml https://github.com/unicode-org/icu4x/pull/2779/files add collator initial bench https://github.com/unicode-org/icu4x/issues/3151 icu_casemapping feature request: methods fold and full_fold should apply Turkic mappings depending on locale --- Markus Scherer: Applying Turkic case foldings automatically is dangerous. While case mappings are intended for human consumption and take a locale parameter, case foldings are used for processing (case-insensitive matching) not for display, and in most cases it is very surprising when "IBM" and "ibm" don't match when the locale is Turkish or Azerbaijani. It is much safer to let the developer control this explicitly. (By comparison, ICU4C/ICU4J have folding functions that take a boolean parameter for default vs. Turkic foldings. This also models the boolean condition in the relevant Unicode data file.) --- lucatrv If I understand correctly, icu_collator should be used when strings need to be sorted, while a case-folding method of icu_casemapping should be used when strings need just to be matched. However icu_collator can also be used to match strings, see for instance examples using Ordering::Equal here, so it is not clear to me which one to use in this case. Finally, another source of confusion (at least for me) is that icu_casemapping can be used for both case mapping and case folding, but its documentation mentions only "Case mapping for Unicode characters and strings". --- sffc The collator does a fuzzier match. The example you cited shows that it considers "às" and "as" to be equal, for example. @markusicu is it safe to say that most users who are looking for a fuzzy string comparison utility should favor the collator over casefold+nfd? --- sffc See also https://github.com/tc39/ecma402/issues/256 --- hsivonen Casefold+NFD and ignoring combining diacritics after the NFD operation gives a general case-insensitive, diacritic-insensitive match. To further match the root search collation (apart from the Hangul aspect for which I don't understand the use case), you'd have to also ignore certain Arabic marks and the Thai phinthu (virama). (The Hebrew aspect of the search root is gone from CLDR trunk already.) Apart from Turkic case-insensitivity, the key thing that the search collation tailorings provide on top of the above is being able to have a diacritic-insensitive mode where certain things that technically are diacritics but that are on a per-language basis considered to form a distinct base letter are not ignored on a locale-sensitive basis. For example, o and ö are distinct for Finnish, Swedish, Icelandic, and Turkish (not sure if them being equal for Estonian search is intentional or a CLDR bug) in collator-based search even when ignoring diacritics. 
Based on observing the performance of Firefox's ctrl/cmd-f (not collator based) relative to Chrome's and Safari's (collator-based), I believe that casefold+NFD and ignoring certain things post-NFD will be faster than collator-based search. However, if you also want not to ignore certain diacritics on a per-locale basis, it's up to you to implement those rules. That is, ICU4X doesn't do it for you. You can find out what the rules are by reading the CLDR search collation sources. (FWIW, Firefox's ctrl/cmd-f does not have locale-dependent rules for diacritics. The checkbox either ignores all of them or none.) ECMA-402 and ICU4X don't have API surface for collator-based substring match. You can only do full-string comparison, so you can search in the sense of filtering a set/list of items by a search key. --- Markus Scherer > If I understand correctly, CaseMapping::to_full_fold applies full case folding > + NFD and ignores combining diacritics. I think not. I believe it just applies the “full” Case_Folding mappings to each character, as opposed to the Simple_Case_Folding. Normalization and removing diacritics etc. would be separate steps / function calls. https://www.unicode.org/reports/tr44/#Case_Folding > Therefore it actually provides the fuzziest match (general case-insensitive > and diacritic-insensitive match). To my understanding this should be equivalent > to the icu_collator primary strength level, > https://icu4x.unicode.org/doc/icu_collator/enum.Strength.html#variant.Primary No. Similar in effect, but as Henri said, collation mappings do a lot more, such as ignoring control codes and variation selectors. > which I guess is independent from locale Not really. There are language-specific collation mappings, such as German "ä"="ae" (on primary level), but of course for the majority of Unicode characters each tailoring behaves like the Unicode default. Collation also provides for a number of parametric settings, although most of those are relevant for sorting, not for matching and searching. They do let you select things like “ignore punctuation” and “ignore diacritics but not case”. https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options --- lucatrv Referring to Section 3.13, Default Case Algorithms in the Unicode Standard, now I understand that CaseMapping::full_fold applies the toCasefold(X) operation (R4 page 155), which is the Case_Folding property. To allow proper caseless matching of strings interpreted as identifiers, in my opinion another method CaseMapping::NFKC_full_fold should be added, to apply the toNFKC_Casefold(X) operation (R5 page 155), which is the NFKC_Casefold property. Then another method should be added to allow identifier caseless matching, which could be either the combined function toNFKC_Casefold(NFD(X)) (D147 page 158) or the lower level NFD(X) normalization function. Otherwise to keep things simpler, maybe just a method named CaseMapping::caseless could be added which applies toNFKC_Casefold(NFD(X)) (D147 page 158). Do you agree, or otherwise how can I perform proper caseless categorization and matching? --- eggrobin For case-insensitive identifier comparison (identifiers include programming language identifiers, but also things like usernames: @EGGROBIN and @eggrobin are the same person), Unicode provides the operation toNFKC_Casefold, used in the definition of identifier caseless match (D147 in Default Caseless Matching). 
Earlier versions of Unicode (prior to 5.2) recommended the use of NFKC and casefolding directly, without the removal of default ignorable code points performed by toNFKC_Casefold. The foldings thus have stability guarantees that make them suitable for usage in identifier comparison in conjunction with NFKC (see https://www.unicode.org/policies/stability_policy.html#Case_Folding). As @markusicu wrote above, since identifier systems typically need to use a locale-independent comparison, the Turkic foldings need to be used with great care: whether @eggrobin is the same as @EGGROBIN should not depend on someone’s language. @markusicu is it safe to say that most users who are looking for a fuzzy string comparison utility should favor the collator over casefold+nfd? ^ @macchiati for advice on the most recommended way to perform fuzzy string matching. I am neither Markus nor Mark, but I would say that for general-purpose matching that does not have stability requirements, something collation-based is more appropriate. In particular, Chrome’s Ctrl+F search uses that. This is, as has been mentioned, fuzzier (beyond the accents already mentioned, note that ŒUF and œuf are primary-equal to oeuf, whereas they are not identifier caseless matches). An important consideration is that, being unstable (there is a somewhat squishy stability policy, see https://www.unicode.org/policies/collation_stability.html and https://www.unicode.org/collation/ducet-changes.html), fuzzy matching based on collation can be improved. Most recently the UTC approved (in consensus 174-C4) a change to the collation of punctuation marks that look like the ASCII ' and ", which has the effect that O'Connor will now be primary-equal to O’Connor. https://github.com/unicode-org/icu4x/issues/3178 Consider supporting three layers of collation data for search collations Markus Scherer Outside of ICU4X we usually try to make code & data work according to the algorithms, not according to what the known data looks like right now. ICU4C/J allow users to build custom tailorings at build time and at runtime. It should be possible to tailor relative to something that is tailored in the intermediate root search. https://github.com/unicode-org/icu4x/issues/3174#issuecomment-1624080765 Should search collation be a different data key + constructor? #3174 --- jlf Don't know if this long comment brings something useful for Rexx. They are searching for use-cases. whole-string matching, collation, substring or prefix matching. https://www.unicode.org/reports/tr10/#Searching: It's typically used for a substring match, like Ctrl-F in a browser. Why is collation the way it is? There's a use case for diacritic-insensitive string matching. And there is also the observation that you need special handling for certain diacritics like German umlauts. It seems weird that Thai for example has certain tailorings that are not in other Brahmic languages. https://github.com/unicode-org/icu4x/discussions/3981#discussioncomment-6882618 String search with collators references this ICU link: https://unicode-org.github.io/icu/userguide/collation/string-search.html https://github.com/unicode-org/icu4x/issues/3174#issuecomment-1624080765 Should search collation be a different data key + constructor? 
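Hedged ICU4C sketch of the two approaches discussed in the #3151 thread above: the "casefold + NFD + ignore combining marks" fuzzy match that hsivonen describes (my own minimal version, not the search collation itself), and toNFKC_Casefold for identifier-style caseless matching:

    #include <unicode/normalizer2.h>
    #include <unicode/uchar.h>
    #include <unicode/unistr.h>
    #include <unicode/utf16.h>
    #include <iostream>
    #include <string>

    // Case-insensitive, diacritic-insensitive key: casefold, NFD, drop Mn marks.
    static icu::UnicodeString fuzzyKey(const icu::UnicodeString &s, UErrorCode &status) {
        const icu::Normalizer2 *nfd = icu::Normalizer2::getNFDInstance(status);
        icu::UnicodeString folded(s);
        folded.foldCase(U_FOLD_CASE_DEFAULT);              // not the Turkic variant (see above)
        icu::UnicodeString decomposed = nfd->normalize(folded, status);
        icu::UnicodeString out;
        for (int32_t i = 0; i < decomposed.length(); ) {
            UChar32 c = decomposed.char32At(i);
            if (u_charType(c) != U_NON_SPACING_MARK) out.append(c);
            i += U16_LENGTH(c);
        }
        return out;
    }

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString a = fuzzyKey(icu::UnicodeString::fromUTF8("Gård"), status);
        icu::UnicodeString b = fuzzyKey(icu::UnicodeString::fromUTF8("gard"), status);
        std::cout << "fuzzy match: " << (a == b) << "\n";  // 1, like Ctrl+F in a browser

        // Identifier caseless match: toNFKC_Casefold(X) (stable, locale-independent).
        const icu::Normalizer2 *nfkcCf = icu::Normalizer2::getNFKCCasefoldInstance(status);
        std::string id;
        nfkcCf->normalize(icu::UnicodeString::fromUTF8("ＥｇｇRobin"), status).toUTF8String(id);
        std::cout << "NFKC_Casefold: " << id << "\n";      // expected: "eggrobin"
        return U_SUCCESS(status) ? 0 : 1;
    }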
jlf: referenced from #3981 with this comment: We've had discussions about search collations in the past, such as #3174 Basically, we need a client with a clear and compelling use case who ideally can make some contributions, and then the team can provide mentorship to help land this type of feature. icu_collator version 1.3.3 is released. https://github.com/unicode-org/icu4x/releases/tag/ind%2Ficu_collator%401.3.3 https://docs.rs/icu_collator/latest/icu_collator/ Comparing strings according to language-dependent conventions. jlf: with examples jlf: implementation notes. https://docs.rs/icu_collator/latest/icu_collator/docs/index.html They use NFD? "The key design difference between ICU4C and ICU4X is that ICU4C puts the canonical closure in the data (larger data) to enable lookup directly by precomposed characters while ICU4X always omits the canonical closure and always normalizes to NFD on the fly." jlf: ok, on the fly, so part of their algorithm. https://github.com/unicode-org/icu4x/discussions/3231#discussioncomment-5599221 @sffc , Will ICU4X Test Data provider give correct results for Lao language? I was running segment_utf16 on Lao string but its results are not inline with ICU4C results. The ICU4X Test Data provider supports Japanese and Thai. For the other languages, you should follow the steps in the tutorial to generate your own data; in general the testdata provider is intended for testing. You can also track #2945 which will make it possible to get full data without needing to build it using the tool. https://www.youtube.com/watch?v=ZzsbN7HBd7E Rust Zürisee, Dec 2022: Next Generation i18n with Rust Using ICU4X Talk by Shane Carr (starts at 11:20, with some intros from the organizers first) https://github.com/unicode-org/icu4x/discussions/3522 Some word segmentation results are different than we get in ICU4C - Khmer string មនុស្សទាំងអស់ is giving 13 index as a breakpoint in ICU4X while ICU4C gives 6 - ຮ່ສົ່ສີ 5 in ICU4C while 7 in ICU4X - กระเพรา 3 in ICU4C while 7 in ICU4X I'm using the full data blob with all keys and locales. jlf: see the discussion, there is some code. https://github.com/unicode-org/icu4x/issues/2945 Default constructors with full data jlf: remember "close #2743 in favour of #2945. the solution we're working on there trivially extends to FFI." sffc We have built data providers as a first-class feature in ICU4X. We currently tutor clients on how to build their data file and detail all the knobs at their disposal, which is essential to ICU4X's mission. 
https://github.com/unicode-org/icu4x/issues/3552#issuecomment-1600050638 /// ICU4C's TestGreekUpper #[test] fn test_greek_upper() { let cm = CaseMapping::new_with_locale(&locale!("el")); // https://unicode-org.atlassian.net/browse/ICU-5456 assert_eq!(cm.to_full_uppercase_string("άδικος, κείμενο, ίριδα"), "ΑΔΙΚΟΣ, ΚΕΙΜΕΝΟ, ΙΡΙΔΑ"); // https://bugzilla.mozilla.org/show_bug.cgi?id=307039 // https://bug307039.bmoattachments.org/attachment.cgi?id=194893 assert_eq!(cm.to_full_uppercase_string("Πατάτα"), "ΠΑΤΑΤΑ"); assert_eq!(cm.to_full_uppercase_string("Αέρας, Μυστήριο, Ωραίο"), "ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ, ΩΡΑΙΟ"); assert_eq!(cm.to_full_uppercase_string("Μαΐου, Πόρος, Ρύθμιση"), "ΜΑΪΟΥ, ΠΟΡΟΣ, ΡΥΘΜΙΣΗ"); assert_eq!(cm.to_full_uppercase_string("ΰ, Τηρώ, Μάιος"), "Ϋ, ΤΗΡΩ, ΜΑΪΟΣ"); assert_eq!(cm.to_full_uppercase_string("άυλος"), "ΑΫΛΟΣ"); assert_eq!(cm.to_full_uppercase_string("ΑΫΛΟΣ"), "ΑΫΛΟΣ"); assert_eq!(cm.to_full_uppercase_string("Άκλιτα ρήματα ή άκλιτες μετοχές"), "ΑΚΛΙΤΑ ΡΗΜΑΤΑ Ή ΑΚΛΙΤΕΣ ΜΕΤΟΧΕΣ"); // http://www.unicode.org/udhr/d/udhr_ell_monotonic.html assert_eq!(cm.to_full_uppercase_string("Επειδή η αναγνώριση της αξιοπρέπειας"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ ΤΗΣ ΑΞΙΟΠΡΕΠΕΙΑΣ"); assert_eq!(cm.to_full_uppercase_string("νομικού ή διεθνούς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ"); // http://unicode.org/udhr/d/udhr_ell_polytonic.html assert_eq!(cm.to_full_uppercase_string("Ἐπειδὴ ἡ ἀναγνώριση"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ"); assert_eq!(cm.to_full_uppercase_string("νομικοῦ ἢ διεθνοῦς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ"); // From Google bug report assert_eq!(cm.to_full_uppercase_string("Νέο, Δημιουργία"), "ΝΕΟ, ΔΗΜΙΟΥΡΓΙΑ"); // http://crbug.com/234797 assert_eq!(cm.to_full_uppercase_string("Ελάτε να φάτε τα καλύτερα παϊδάκια!"), "ΕΛΑΤΕ ΝΑ ΦΑΤΕ ΤΑ ΚΑΛΥΤΕΡΑ ΠΑΪΔΑΚΙΑ!"); assert_eq!(cm.to_full_uppercase_string("Μαΐου, τρόλεϊ"), "ΜΑΪΟΥ, ΤΡΟΛΕΪ"); assert_eq!(cm.to_full_uppercase_string("Το ένα ή το άλλο."), "ΤΟ ΕΝΑ Ή ΤΟ ΑΛΛΟ."); // http://multilingualtypesetting.co.uk/blog/greek-typesetting-tips/ assert_eq!(cm.to_full_uppercase_string("ρωμέικα"), "ΡΩΜΕΪΚΑ"); assert_eq!(cm.to_full_uppercase_string("ή."), "Ή."); } https://github.com/unicode-org/icu4x/discussions/3688#discussioncomment-6456010 Recommended data provider type for libraries depending on ICU4X --- I finished creating a library that uses ICU4X as its backend, while learning Rust. For my library I used the DataProvider for as the interface to CLDR data (currently just using icu_testdata, though seen the page to generate customised datasets). So now I am wondering what would be the recommended data provider to use for a library using ICU4X as its backend? --- If you know the data you want at build time, I suggest using a baked data provider, otherwise use a Blob one with postcard. You can generate data using these steps https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/data_management.md In the 1.3 release there will be a compiled_data feature that lets you include data by default, kinda like testdata but intended for production. --- compiled_data feature may just be what my library could use without the need for users to supply data provider for my library, if I understand the intended purpose of this up coming feature. Where is this feature located in the master, so I may start looking at it for design purposes, while waiting for 1.3 release? --- jlf: this answer is seriously incomprehensible! The feature is present on all of the component crates and it exposes functions like DateTimeFormatter::try_new() that don't have a provider argument. 
https://unicode-org.github.io/icu4x/docs/icu/datetime/struct.DateTimeFormatter.html#method.try_new The crate also does contain an unstable baked provider that users can pass in themselves, but note that it only implements data stuff from that particular crate and they'll need to combine it with providers from other crates if the type they are using uses data from everywhere (like DateTimeFormat: it uses plurals and decimal data too) https://unicode-org.github.io/icu4x/docs/icu/datetime/provider/struct.Baked.html --- This is a good question; what should intermediate libraries expose to their users? I'll schedule this for a discussion at an upcoming developers call. https://github.com/unicode-org/icu4x/issues/3709 Chinese and Dangi inconsistent with ICU implementations for extreme dates The current implementation of the Chinese calendar, as well as the Dangi calendar in #3694, are not consistent with ICU for all dates; based on writing a number of manual test cases (see the aforementioned PR), this seems to only be an issue for dates very far in the past or far in the future (ex. year -3000 ISO). Furthermore, the ICU4X Chinese/Dangi and astronomy functions are newly-written and have several algorithms based on the most recent edition of Calendrical Calculations, while the existing ICU code seems to be from 2000, incorporating algorithms from the 1997 edition of Calendrical Calculations. --- jlf: I take note of this because it's interesting to see the differences with ICU. Calendars in https://github.com/unicode-org/icu4x/pull/3744#discussion_r1277062568 they reference this common lisp code https://github.com/EdReingold/calendar-code2/blob/main/calendar.l#L2352 --- jlf: I take note of this to remember ;;;; The Functions (code, comments, and definitions) contained in this ;;;; file (the "Program") were written by Edward M. Reingold and Nachum ;;;; Dershowitz (the "Authors") ;;;; These Functions are explained in the Authors' ;;;; book, "Calendrical Calculations", 4th ed. (Cambridge University ;;;; Press, 2016) --- https://en.wikipedia.org/wiki/Calendrical_Calculations https://reingold.co/calendars.shtml The resource page for the book makes all the source code for the book available for download. https://www.cambridge.org/ch/universitypress/subjects/computer-science/computing-general-interest/calendrical-calculations-ultimate-edition-4th-edition?format=PB&isbn=9781107683167#resources The code has been ported to Python https://github.com/espinielli/pycalcal https://github.com/uni-algo/uni-algo/issues/31 L with stroke letter (U+0141, U+0142) doesn't normalize. auto const polish = std::string{"ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"}; auto norm = una::norm::to_unaccent_utf8(polish); Everything is normalized except 'ł' and 'Ł'. Everything is normalized except 'ł' and 'Ł'. --- Strokes are not accents. As far as I know there is no data table in Unicode that maps L with stroke to L so no plans to implemented it, you need to do it manually if needed. 
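Quick ICU4C check of the point above (hedged sketch): U+0142 has no canonical decomposition, so NFD + strip-marks cannot touch it; ICU's Latin-ASCII transform is one "manual" way to get a plain l:

    #include <unicode/normalizer2.h>
    #include <unicode/translit.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;

        // U+0142 (ł) has no canonical decomposition, so there is no mark to strip.
        const icu::Normalizer2 *nfd = icu::Normalizer2::getNFDInstance(status);
        icu::UnicodeString decomp;
        std::cout << "U+0142 decomposes: "
                  << (nfd->getDecomposition(0x0142, decomp) ? "yes" : "no") << "\n";  // no

        // The usual "manual" route: the Latin-ASCII transform, which also handles strokes.
        std::unique_ptr<icu::Transliterator> latinAscii(
            icu::Transliterator::createInstance("Latin-ASCII", UTRANS_FORWARD, status));
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ");
        latinAscii->transliterate(s);
        std::string out;
        s.toUTF8String(out);
        std::cout << out << "\n";   // expected: "acelnoszz ACELNOSZZ"
        return U_SUCCESS(status) ? 0 : 1;
    }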
-- jlf: idem with utf8proc "ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"~text~nfc(stripmark:)= -- T'acełnoszz ACEŁNOSZZ' "ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"~text~nfd(stripmark:)= -- T'acełnoszz ACEŁNOSZZ' --- https://en.wikipedia.org/wiki/%C5%81 Character Ł ł Unicode 321 0141 322 0142 CP 852 157 9D 136 88 CP 775 173 AD 136 88 Mazovia 156 9C 146 92 Windows-1250, ISO-8859-2 163 A3 179 B3 Windows-1257, ISO-8859-13 217 D9 249 F9 Mac Central European 252 FC 184 B8 https://github.com/unicode-org/icu4x/issues/2715 Minor and patch release policy https://docs.google.com/document/d/1yg_2l5FFo0aAuNi4jpgcIhIYjHqJyUoJWtMduyQ0vR8/edit ICU4X Data Versioning Design This document has been migrated to Markdown in https://github.com/unicode-org/icu4x/pull/2919 jlf: I don't see any markdown... https://github.com/unicode-org/icu4x/issues/1471 Decide on data file versioning policy jlf: For the comment of Marcus Scherer https://github.com/unicode-org/icu4x/issues/165 Data Version jlf: maybe to read As far as semantic versioning, I no longer give deference to it as the preferred way to do versioning or see the topic so singularly after seeing this talk. https://www.youtube.com/watch?v=oyLBGkS5ICk jlf: Spec-ulation Keynote - Rich Hickey The comments say it's good, did not watch. DateTime https://github.com/unicode-org/icu4x/issues/3347 DateTimeFormatter still lacks power user APIs jlf: this ticket contains potentially interesting links: Class hiearchy: https://github.com/unicode-org/icu4x/issues/380 Design doc: https://docs.google.com/document/d/1vJKR1s--RBmXLNIJSCtiTNPp08mab7ZwcTGxIZ9-ytI/edit# https://github.com/unicode-org/icu4x/pull/4334#discussion_r1403198515 Add is_normalized_up_to to Normalizer #4334 jlf remember: the Web-exposed ICU4C-backed behavior of current String.prototype.normalize in both SpiderMonkey and V8 retains unpaired surrogates in the normalization process (even after the first point in the string that needs to change under normalization). We've previously decided that ICU4X operates on the Unicode Scalar Value / Rust char value space and, therefore, will perform replacement of unpaired surrogates with the REPLACEMENT CHARACTER. https://github.com/unicode-org/icu4x/issues/4365 Segmenter does not work correctly in some languages "as `নমস্কাৰ, আপোনাৰ কি খবৰ?`"'0D'x"hi `हैलो, क्या हाल हैं?`"'0D'x"mai `नमस्ते अहाँ केना छथि?`"'0D'x"mr `नमस्कार, कसे आहात?`"'0D'x"ne `नमस्ते, कस्तो हुनुहुन्छ?`"'0D'x"or `ନମସ୍କାର ତୁମେ କେମିତି ଅଛ?`"'0D'x"sa `हे त्वं किदं असि?`"'0D'x"te `హాయ్, ఎలా ఉన్నారు?`" icu4c: 151 rust: 161 executor: 151 --- ICU4X and ICU4C are just using different definitions of EGCs; ICU4C has had a tailoring for years which has just been incorporated into Unicode 15.1, whereas ICU4X implements the 15.0 version without that tailoring. The difference is the handling of aksaras in some indic scripts: in Unicode 15.1 (and in any recent ICU4C) क्या is one EGC, but it is two EGCs (क्, या) in untailored Unicode 15.0 (and in ICU4X). --- eggrobin (For what it’s worth, क्या would be three legacy grapheme clusters, namely क्, य, and ा, see Table 1a of UAX29, whereas it is two 15.0 extended grapheme clusters and a single 15.1 extended grapheme cluster.) --- Fixed by #4536 https://github.com/unicode-org/icu4x/pull/4334 is_normalized_up_to and unpaired surrogates --- jlf: interesting discussion about the support of ill-formed strings https://github.com/unicode-org/icu4x/pull/4389 Line breaking --- jlf: they don't want to support a tailored line breaking, because this requires more than one code point of lookahead. 
https://github.com/unicode-org/icu4x/issues/4342 Add functions to get ICU4X, CLDR, and Unicode versions --- jlf: strange that they did not consider that earlier... https://github.com/unicode-org/icu4x/issues/2689 Consider exposing sort keys --- jlf : interesting for the description of the use cases (encryption, xpath) I created a section Xpath with their comments. https://github.com/unicode-org/icu4x/issues/3336 Add support for Unicode BCP 47 locale identifiers --- jlf: what is that? it's defined in https://www.unicode.org/reports/tr35/ UNICODE LOCALE DATA MARKUP LANGUAGE (LDML) Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. https://www.rfc-editor.org/rfc/bcp/bcp47.txt https://github.com/unicode-org/icu4x/issues/3247#issuecomment-1856577508 This month @anba landed Intl.Segmenter in Firefox based on the ICU4X Segmenter impl, reviewed by @dminor https://phabricator.services.mozilla.com/D195803 I had been under the impression that Intl.Segmenter was not implementable without support for random access in order to implement the containing() function. It looks like @anba's implementation loops from the start of the string and repeatedly calls next() until we reach the index. While this strategy gets the job done, I'm concerned about the performance of this with large strings where we need to reach an index deep into the string. I therefore hope that we can continue to prioritize this issue on the basis of 402 compatibility. --- jlf: to watch https://github.com/unicode-org/icu4x/issues/4523 Linebreak generated before CL (Close Punctuation) --- https://www.unicode.org/reports/tr14/#CL UNICODE LINE BREAKING ALGORITHM https://github.com/typst/typst/issues/3082 Chinese punctuation is placed at the beginning of the line in some cases --- jlf: Linebreak referenced from icu4x/issues/4523 The example is wrong, a better example is provided in icu4x/issues/4523. https://github.com/unicode-org/icu4x/pull/4389 Fix Unicode 15.0 line breaking jlf: Linebreak https://github.com/unicode-org/icu4x/issues/4146 icu_segmenter::LineSegmenter incorrectly applies rule LB8a --- jlf: Linebreak, for the examples of line breaks. https://github.com/unicode-org/icu4x/discussions/4525#discussioncomment-8155602 Mapping between browser Intl and ICU4X jlf: I don't understand what they are talking about, but there are maybe good to know informations in this thread. In particular this URL: "Sensitivity" in browsers maps to a combination of strength and case level. https://searchfox.org/mozilla-central/rev/1aa61dcd48e128a8cbfbe59b7ba43d31bd3c248a/intl/components/src/Collator.cpp#171-185 https://github.com/unicode-org/icu4x/issues/3284#issuecomment-1911226051 Should the Segmenter types accept a locale? --- Steven Loomis: Please put it into the API. I was doing planning on a work item to move this forward. This is for example languages that want to keep "ch" together etc. --- jlf: so it appears from the discussion that ICU4C implements specific rules that are not part of UAX #29. 
--- sffc The conclusions from the discussion of this issue with the CLDR design group: - Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific - Content locale/text language parameter (not UI locale): Potential for accuracy; make it optional, name it well - Ok to leave the locale on the constructor; benefit: more specific data loading even for existing dictionaries & models My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional content_locale field of type &LanguageIdentifier. --- Steven Loomis This makes no sense and contradicts the long standing requests. I would have joined, did not realize this was coming up today. https://github.com/unicode-org/icu4x/issues/58 Design a cohesive solution for supported locales https://github.com/tc39/proposal-intl-segmenter/issues/133 Custom Dictionaries and a political point of view from a Hong Kong immigrant. https://github.com/unicode-org/icu4x/issues/3284 Should the Segmenter types accept a locale? Markus Scherer: No language parameter for grapheme cluster segmenter +1 Language parameter for the other three segmenters +1 https://github.com/unicode-org/icu4x/issues/3990 Consider supporting retrieval of the language preference list from the system --- jlf: some infos and pointers, for general culture. https://github.com/unicode-org/icu4x/issues/4705 Bridge the gap between icu::properties::Script and icu::locid::subtags::Script --- jlf: this is about script names --- Markus Scherer Conversion is probably fine, but in the end they are just script codes, so it also makes sense to define the full set once and have Unicode APIs use a subset of the values. The ones in the UCD are a subset of the full set. And only the ones in the UCD have Unicode-defined long value names (identifiers). Eggrobin https://unicode.org/iso15924/codelists.html https://unicode.org/iso15924/iso15924.txt The PVA column is from https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt. Markus Scherer Also https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry look for Type: script which becomes this in CLDR: https://github.com/unicode-org/cldr/blob/main/common/validity/script.xml Note that the CLDR list includes one or more private use script subtags: https://www.unicode.org/reports/tr35/#unicode_script_subtag_validity https://www.unicode.org/reports/tr35/#Private_Use_Codes Qaag is current but yucky... Don't include Qaai which has become an alias for Zinh https://github.com/unicode-org/icu4x/issues/3014 Provide the Numeric_Value character property ICU4X is missing an API for querying the Numeric_Value property of a character. Markus Scherer Note that Numeric_Value is easy when Numeric_Type=Decimal or Numeric_Type=Digit. And maybe you need/want it only if Numeric_Type=Decimal. When Numeric_Type=Numeric, then the Numeric_Value can be negative, huge, or a fraction. These are rarely useful. https://www.unicode.org/reports/tr44/#Numeric_Value I would start with an API that returns the value of a decimal digit. Markus Scherer Most of the nt=digit characters are not part of a contiguous 0..9 range of characters. In particular, there is often no zero. Some of them are simply nt=digit because their nv is 0..9 although they are part of a larger set of "numbered list bullets" where the nv>9 numbers have nt=numeric. 
In UTS46, they are variously disallowed/mapped/valid. See https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ant%3Ddigit%3A%5D&g=uts46&i= It makes sense to me to have an API that returns the nv of nt=decimal but the nv of other characters is rarely useful to programmers. https://github.com/unicode-org/icu4x/issues/4771 LineBreakStrictness::Anywhere gives the wrong breakpoints for Arabic in icu_segmenter I am aware this is probably a unicode spec issue, rather than a rust library issue, but I thought I would point it out regardless. This is the minimal application I was using to test this behavior: use icu_segmenter::{LineBreakOptions, LineBreakStrictness, LineSegmenter}; fn main() { let test = "الخيل والليل"; let mut options = LineBreakOptions::default(); options.strictness = LineBreakStrictness::Anywhere; let segmenter = LineSegmenter::new_auto_with_options(options); let breakpoints = segmenter.segment_str(test); for bp in breakpoints { println!("{bp}: {}", &test[bp..]); } } This gives the following output: (jlf: bbedit doesn't support well this text, can't indent the whole block, can't indent a single line) 0: الخيل والليل 2: لخيل والليل 4: خيل والليل 6: يل والليل 8: ل والليل 10: والليل 11: والليل 13: الليل 15: لليل 17: ليل 19: يل 21: ل 23: as you can tell, it is breaking after every single letter, without respect to the letters' connections. However, as I am sure you are aware, the letters' connections are not optional. The output I expected is the following: 0: الخيل والليل 2: لخيل والليل 10: والليل 11: والليل 13: الليل 15: لليل 23: Putting the break points across the visual boundaries of the letters. This is not the current orthodoxy, but any looser breaks than that and you'd be rendering the text illegible and unnatural. Note: This is how old written manuscripts break their words. --- Closed as not planned https://github.com/unicode-org/icu4x/issues/4780 Unexpected grapheme boundary with regional indicators (GB12) use icu::segmenter::GraphemeClusterSegmenter; fn main() { let segmenter = GraphemeClusterSegmenter::new(); let text = "🇺🇸🏴󠁧󠁢󠁥󠁮󠁧󠁿"; segmenter .segment_str(text) .for_each(|i| println!("{}", i)); } Reports the following break points: 0 4 8 36 which means "🇺🇸" is split into two graphemes, which should be disallowed per GB12 --- This is fixed by #4536. --- jlf: utf8proc is ok "🇺🇸"~graphemes== a CharacterSupplier 1 : T'🇺🇸' "🇺🇸"~unicodecharacters== an Array (shape [2], 2 items) 1 : ( "🇺" U+1F1FA So 1 "REGIONAL INDICATOR SYMBOL LETTER U" ) 2 : ( "🇸" U+1F1F8 So 1 "REGIONAL INDICATOR SYMBOL LETTER S" ) #4536 https://github.com/unicode-org/icu4x/pull/4536 Update grapheme cluster break rules to Unicode 15.1 jlf: lot of discussions about stability that I did not try to understand.
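For comparison with the GB12 report above: ICU4C's character BreakIterator keeps the regional-indicator pair together (hedged C++ sketch):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> bi(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("🇺🇸");   // U+1F1FA U+1F1F8
        bi->setText(text);
        bi->first();
        int clusters = 0;
        while (bi->next() != icu::BreakIterator::DONE) clusters++;
        std::cout << "grapheme clusters: " << clusters << "\n";          // expected: 1 (GB12)
        return U_SUCCESS(status) ? 0 : 1;
    }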

utf8proc title


https://codeberg.org/dnkl/foot/pulls/100 Grapheme shaping using libutf8proc #100 jlf tag: character width jlf: to read?
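Hedged sketch of calling utf8proc directly for the two things that pull request is about, grapheme clustering and width estimation (assumes utf8proc is installed, link with -lutf8proc; the width part naively sums per-code-point widths):

    #include <utf8proc.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        // "é" written as e + U+0301, followed by the flag 🇺🇸 (two regional indicators).
        const char *s = "e\xCC\x81" "\xF0\x9F\x87\xBA\xF0\x9F\x87\xB8";
        const utf8proc_uint8_t *p = (const utf8proc_uint8_t *)s;
        utf8proc_ssize_t remaining = (utf8proc_ssize_t)strlen(s);
        utf8proc_int32_t prev = -1, state = 0;
        int graphemes = 0, width = 0;
        while (remaining > 0) {
            utf8proc_int32_t cp;
            utf8proc_ssize_t n = utf8proc_iterate(p, remaining, &cp);
            if (n < 0) break;                             // invalid UTF-8
            if (prev < 0 || utf8proc_grapheme_break_stateful(prev, cp, &state))
                graphemes++;                              // a new cluster starts here
            width += utf8proc_charwidth(cp);              // naive: per-code-point widths
            prev = cp;
            p += n;
            remaining -= n;
        }
        printf("graphemes=%d width=%d\n", graphemes, width);  // graphemes expected: 2
        return 0;
    }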

Twitter text parsing


https://github.com/twitter/twitter-text Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform. https://swiftpack.co/package/nysander/twitter-text This is the Swift implementation of the twitter-text parsing library. The library has methods to parse Tweets and calculate length, validity, parse @mentions, #hashtags, URLs, and more.

terminal / console / cmd


https://www.reddit.com/r/bash/comments/wfbf3w/determine_if_the_termconsole_supports_utf8/ Determine if the term/console supports UTF8? https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line jlf: with my current version of Windows (21H2 - 10.0.19044), I have the input bug described below: In general using codepage 65001 will only work without bugs in Windows 10 with the Creators update. In Windows 7 it will have both output and input bugs. In Windows 8 and older versions of Windows 10 it only has the input bug, which limits input to 7-bit ASCII. Eryk Sun Sep 9, 2017 at 13:43 jlf: the sentence above is not true, I have the input bug with my version of Windows, which is AFTER the Creators update. http://archives.miloush.net/michkap/archive/2006/03/13/550191.html Who broke the UTF-8 support? by Michael S. Kaplan, published on 2006/03/13 03:21 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/13/550191.aspx --- jlf: we are in 2022 and the UTF-8 support in cmd is still broken... https://stackoverflow.com/questions/39736901/chcp-65001-codepage-results-in-program-termination-without-any-error jlf: Thanks to this post, I suddenly understood why ooRexxShell no longer supports UTF-8 input. It's because I deactivated readline on Dec 20, 2020. When readline is on, ooRexxShell delegates to cmd to read a line: set /p inputrx="My prompt> " This input mode is not impacted by the UTF-8 input bug! https://stackoverflow.com/questions/10651975/unicode-utf-8-with-git-bash git-bash (Windows) https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10) Describes how to set the system locale (language for non-Unicode programs) to UTF-8. Optional reading: Why the Windows PowerShell ISE is a poor choice --- jlf: this is a clear description of the UTF-8 input bug. For ReadFile from the console, even in Windows 10, you'll be limited to 7-bit ASCII if the input codepage is set to UTF-8, due to buggy assumptions in the console host, conhost.exe. In Windows 10, it returns non-ASCII characters as null ("\0") in the buffer. In older versions, the read succeeds with 0 bytes read, which looks like EOF. Eryk Sun Jul 21, 2019 at 13:31 https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797 Displaying Unicode in Powershell https://akr.am/blog/posts/using-utf-8-in-the-windows-terminal Using UTF-8 in the Windows Terminal https://github.com/microsoft/terminal https://github.com/Microsoft/Cascadia-Code https://github.com/PowerShell/PowerShell/issues/7233 Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms mklement0 opened this issue on Jul 5, 2018 --- jlf: still open as of 2023.08.08 https://github.com/contour-terminal/terminal-unicode-core Unicode Core specification for Terminal (grapheme clusters, character widths, ...) jlf: only a bare tex file... dead? no commits in 2 years. https://news.ycombinator.com/item?id=37804829 ZERO comments on HN
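Hedged Win32/C++ sketch of the usual workaround for the input bug described above: keep the output code page at UTF-8 but read input through the UTF-16 console API (ReadConsoleW) and convert it yourself (assumes Windows 10, error handling omitted):

    #include <windows.h>
    #include <cstring>
    #include <string>

    int main() {
        SetConsoleOutputCP(CP_UTF8);                     // output side works on Windows 10
        // SetConsoleCP(CP_UTF8);                        // input side is the buggy part, avoid

        const char utf8[] = "h\xC3\xA9llo\r\n";          // "héllo" as raw UTF-8 bytes
        DWORD written = 0;
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8, (DWORD)strlen(utf8), &written, nullptr);

        // Read a line as UTF-16 with ReadConsoleW, then convert to UTF-8 ourselves.
        wchar_t wbuf[512];
        DWORD read = 0;
        ReadConsoleW(GetStdHandle(STD_INPUT_HANDLE), wbuf, 512, &read, nullptr);
        int len = WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)read, nullptr, 0, nullptr, nullptr);
        std::string line(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)read, &line[0], len, nullptr, nullptr);
        // 'line' now holds the typed text as UTF-8, including non-ASCII characters.
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), line.data(), (DWORD)line.size(), &written, nullptr);
        return 0;
    }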

QT Title


https://bugreports.qt.io/browse/QTBUG-48726 Combining diacritics misplaced when using monospace fonts jlf tag: character width

IBM OS


https://www.ibm.com/docs/en/personal-communications/15.0?topic=pages-contents#ToC Host Code Page Reference Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII" >>There are a few layers to getting the codepages right for using a terminal >>emulator and ISPF Edit and Browse on the host. >>For example, in Personal Communications I first define my host codepage. I >>have a lot of choices. From 420 (Arabic) to 1130 (Vietnamese). I tend to >>use 1047 (U.S.) to get my square brackets right. jlf: tables of character codes https://www.ibm.com/docs/en/zos/3.1.0?topic=317-zos-unix-directory-list-utility-line-commands z/OS UNIX directory list utility line commands Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII" >>Then on the host side. If you are using the ISPF UDLIST interface to Unix >>(OMVS) you can use either EBCDIC, ASCII, or UTF8 for EDIT or VIEW. Actions: E—edit regular file EA—edit ASCII file EU—edit UTF-8 file V—view regular file VA—view ASCII file VU—view UTF8 file https://www.ibm.com/docs/en/zos/3.1.0?topic=information-pdf-browse-primary-commands PDF Browse primary commands Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII" >>In ISPF Browse, you can use the DISPLAY command to view data as UTF8, >>UTF32, UCS2, UNICODE, ASCII, USASCII, and EBCDIC, or specify the numeric >>CCSID. Syntax diagram DISPLAY CCSID ccsid_number ASCII USASCII EBCDIC UCS2 UTF8 UTF16 UTF32 Syntax diagram FIND UTF8 ASCII USASCII

IBM RPG Lang


https://www.ibm.com/docs/en/i/7.4?topic=cdt-processing-string-data-by-natural-size-each-character Processing string data by the natural size of each character String data can have characters of different sizes. - UTF-8 data can have characters with 1, 2, 3, or 4 bytes. For example, the character 'a' has one byte, and the character 'á' has two bytes. UTF-8 data is defined as alphanumeric with CCSID(*UTF8) or CCSID(1208). - UTF-16 data can have characters with 2 or 4 bytes. UTF-16 data is defined as UCS-2 with CCSID(*UTF16) or CCSID(1200). - EBCDIC mixed SBCS/DBCS data can have characters with 1 or 2 bytes. Additionally, double-byte data is surrouned by shift bytes. The shift-out byte x'0E' begins a section of DBCS data and the shift-in byte x'0F' ends the section of DBCS data. - ASCII mixed SBCS/DBCS data can have characters with 1 or 2 bytes. ASCII mixed SBCS/DBCS data is defined as alphanumeric with a CCSID that represents mixed SBCS/DBCS data such as 950. Default behaviour, CHARCOUNT STDCHARSIZE By default, data is processed using the standard-character-size mode. The compiler processes string data by bytes or double bytes without regard for size of each character. When CHARCOUNT NATURAL is in effect: The compiler processes string operations by the natural size of each character. The compiler sets the CHARCOUNT NATURAL mode for a file if the CHARCOUNT is not specified for the file. The CHARCOUNT mode for the file affects the movement of data from RPG fields to the output buffer and key buffer used for the file operations. https://www.ibm.com/docs/en/i/7.4?topic=fdk-charcountnatural-stdcharsize CHARCOUNT(*NATURAL | *STDCHARSIZE) The CHARCOUNT keyword controls how RPG handles string truncation when moving data from RPG program variables to the output buffer and key buffer for the file. *NATURAL If the data type of the field in the output buffer or key buffer is relevant according to the CHARCOUNTTYPES Control keyword, any necessary truncation when data is moved is done according to the CHARCOUNT NATURAL mode for assignment. *STDCHARSIZE Any necessary truncation when data is moved is done by bytes or double bytes, without regard for the size of each character. When the CHARCOUNT keyword is not specified, the current CHARCOUNT setting is used for the file, as determined by the CHARCOUNT Control keyword or the most recent /CHARCOUNT directive preceding the definition for the file. https://www.ibm.com/docs/en/i/7.4?topic=keywords-charcounttypesutf8-utf16-jobrun-mixedebcdic-mixedascii CHARCOUNTTYPES(*UTF8 *UTF16 *JOBRUN *MIXEDEBCDIC *MIXEDASCII) The Control keyword CHARCOUNTTYPES specifies the types of data that are processed by characters rather than by bytes or double bytes when CHARCOUNT NATURAL mode is in effect. *UTF8 Specify *UTF8 if your module might work with UTF-8 data which has characters of different lengths. For example, the UTF-8 character 'a' has one byte, and the UTF-8 character 'á' has two bytes. *UTF16 Specify *UTF16 if your module might work with UTF-16 data which has some 4-byte characters. *JOBRUN Specify *JOBRUN if your job CCSID might support mixed SBCS and DBCS data, and the RPG variables in your module defined to have the job CCSID might contain some DBCS data. *MIXEDEBCDIC Specify *MIXEDEBCDIC if your module might work with EBCDIC data which supports both SBCS and DBCS characters. This includes data defined with CCSID(*JOBRUNMIX) and data defined with a mixed SBCS/DBCS CCSID such as 937. 
*MIXEDASCII Specify *MIXEDASCII if your module might work with ASCII data which supports both SBCS and DBCS characters.
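Not RPG, but a minimal C++ sketch (assuming well-formed UTF-8 input) of the distinction described above: STDCHARSIZE-style processing counts bytes, NATURAL-style processing counts whole characters, so the 2-byte character 'á' counts as 1 instead of 2. The helper name is made up.
#include <cstdio>
#include <string>

// Count UTF-8 characters by counting lead bytes only
// (continuation bytes have the bit pattern 10xxxxxx).
static std::size_t naturalLength(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++count;   // skip continuation bytes
    return count;
}

int main()
{
    std::string s = "a\xC3\xA1";                         // "aá" in UTF-8
    std::printf("bytes   : %zu\n", s.size());            // 3 (STDCHARSIZE-style count)
    std::printf("natural : %zu\n", naturalLength(s));    // 2 (NATURAL-style count)
}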

IBM z/OS


https://www.ibm.com/docs/en/zos/2.5.0?topic=mvs-zos-unicode-services-users-guide-reference Unicode services https://www.ibm.com/docs/en/zos/2.5.0?topic=reference-application-programmer-information Character conversion Case conversion Normalization Collation Bidi transformation Stringprep conversion --- jlf: There is this note at the beginning of the page "Bidi transformation": "IBM does not intend to enhance the bidi transformation service. Instead, it is recommended that you use the character conversion 'extended bidi support' for all new development and for the highest level of bidi support." Can't find where this 'extended bidi support' is described. https://www-40.ibm.com/servers/resourcelink/svc00100.nsf/pages/zOSV2R5IndexFile/$file/index.html search Ctrl+F "unicode": only one result: cunu100_v2r5.pdf SA38-0680-50 z/OS Unicode Services User's Guide and Reference https://www.ibm.com/docs/en/zos/2.5.0 Search "Unicode" in z/OS 2.5 documentation: https://www.ibm.com/docs/en/search/unicode?scope=SSLTBW_2.5.0 jlf: not sure it's very interesting... All the links are just one page with little information. https://listserv.ua.edu/cgi-bin/wa?A2=IBM-MAIN;5304fbc3.2304&S= Re: TSO Rexx C2X Incorrect Output Events such as this affirm my belief in minimal munging of user data by default. jlf: this sentence is worth remembering when designing how Unicode should be supported by Rexx... https://stackoverflow.com/questions/76569347/what-are-the-supported-code-points-for-special-characters-for-valid-z-os-datas What are the supported code points for 'special characters' for valid z/OS datasets? jlf: the link above was given in this IBM-MAIN thread https://listserv.ua.edu/cgi-bin/wa?A2=ind2307&L=IBM-MAIN&D=0&P=121856 --- Matt Hogstrom: I did some testing by creating a file in USS in CP047 with the characters “@#$” and then used iconv to convert them to a variety of code pages and compare the results. Some conversions failed but when looking at the code pages that failed they didn’t appear to me to be what I would consider mainstream. For the ones I’m familiar with they all converted correctly. The command was 'iconv -f 1047 -t 37 special > converted;chtag -t -c 37 converted;cmp special converted’ I changed to the encoding of 37 to other code pages and most worked fine. You can find the list of cps supported by issuing 'iconv -l’ and there are a lot of them. https://listserv.ua.edu/cgi-bin/wa?A2=ind2307&L=IBM-MAIN&D=0&P=183611 Python 3.11 on z/OS - UTF-8 errors --- I am trying to get a python package (psutil) to run on z/OS. I downloaded the package from github and then tar'ed it and uploaded it binary to my home-dir in OMVS. In my homedir I untar'ed to files and ran the command "chtag -tc IBM-1047 *' to set the files to UTF-8. I got make to work by converting the tab char to x'05' - no problem - and I got the C compiler to work also. Now my problem is that I can not make Python compile the setup.py file. It dies with a UTF-error on a char x'97' in statement 48 pos 2: from _common import AIX # NOQA --- It's this package https://github.com/giampaolo/psutil/blob/master/INSTALL.rst --- I believe UTF-8 is IBM-1208. --- Have you tried the z/OS Open Tools phytonport - https://github.com/ZOSOpenTools --- Have you considered cloning the repository and utilizing Git's file tagging feature? It can handle the tagging process for you. If you don't have internet access, a suggestion would be to tag all the files as ISO8859-1. It's advisable to avoid using UTF-8, as it may cause issues with some ported tools that will not work.
That includes the majority of Rocket ported tools. If you list the IBM Python runtime library you will notice that all source files are tagged "iso8859-1" even though Python mandates UTF-8. --- I'm doing this on the company sandbox so I can not make a git clone. And trying 8859-1 (cp 819) does not change anything: /home/bc6608/psutil:chtag -p setup.py t ISO8859-1 T=on setup.py PYTHONWARNINGS=all python3 setup.py build_ext -i `python3 -c "import sys, os; py36 = sys.version_info[:2] >= (3, 6); cpus = os.cpu_count() or 1 if py36 else 1; print('--parallel %s' % cpus if cpus > 1 else '')"` Traceback (most recent call last): File "/home/bc6608/psutil/setup.py", line 47, in <module> from _common import AIX # NOQA ^^^^^^^^^^^^^^^^^^^^^^^ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 2: invalid start byte --- Found the error. The error was not the codepage of the setup.py, but the codepage of the imported file _common . Once it got chtag -tc 1047 _common.py I got further. --- I can’t recreate your problem but I used a different method. I downloaded a zip file from Github, uploaded it to z/OS and followed these steps: jar xf psutill-master.zip cd psutil-master chtag -R -tc iso8859-1 . python3 setup.py --- A quick question - Will the same chtag command work for, say, Java packages/projects? Answer: yes Or, would I have to use chtag -R -tc UTF-8 if a project expects to things to be in UTF8? Answer: I'd like to understand your reasons for wanting to encode your Java source files in UTF-8. It's important to note that the default encoding on z/OS is IBM-1047 (EBCDIC). We typically use ISO8859-1 and have to specify the "-encoding iso8859-1" option when using the javac compiler. As mentioned earlier, tagging files as UTF-8 can lead to unexpected issues, which is why it's not commonly done. If you examine the file attributes of modern languages like Python, Node.js, Go, etc., you'll notice that their source files are tagged as ISO8859-1. A while ago, one of our ported tools developers provided me with a detailed explanation regarding the challenges associated with UTF-8 for ported tools. Although I don't recall all the specifics, it had something to do with double conversions. Therefore, the general rule of thumb is to avoid using UTF-8 unless it is necessary, such as when embedding a YAML document into a Java JAR file. --- We specify <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in our Maven builds as most of the time we are building off host on machines with UTF8 locales. However, we tag our files ISO8859-1 on z/OS other then some YAML docs that must be tagged UTF-8 or else SnakeYaml barfs when reading it from the class path which doesn’t support tags :). The server runs with file.encoding=ISO8859-1 as well. If we cared about the euro sign we could change it to ISO8859-15 which is still an 8-bit character set. It’s those pesky codes above 0x7F in UTF-8 that cause the issues. https://www.ibm.com/support/pages/system/files/inline-files/Managing%20the%20code%20page%20conversion%20when%20migrating%20zOS%20source%20files%20to%20Git%20-%201.0.pdf (PDF) Managing the code page conversion when migrating z/OS source files to Git --- Git has proven to be the de-facto standard in the Open Source world, and the z/OS platform can interact with Git through the z/OS Git client, which is maintained by Rocket Software in its “Open Source Languages and Tools for z/OS” package. 
https://www.ibm.com/docs/en/zos/2.2.0?topic=server-different-end-line-characters-in-text-files Different end of line characters in text files --- In general, z/OS UNIX text files contain a newline character at the end of each line. In ASCII, newline is X'0A'. In EBCDIC, newline is X'15'. (For example, ASCII code page ISO8859-1 and EBCDIC code page IBM-1047 translate back and forth between these characters.) Windows programs normally use a carriage return followed by a line feed character at the end of each line of a text file. In ASCII, carriage return/line feed is X'0D'/X'0A'. In EBCDIC, carriage return/line feed is X'0D'/X'15'. The tr command shown in the preceding example deletes all of the carriage return characters. (Line feed and newline characters have the same hexadecimal value.) The SMB server can translate end of line characters from ASCII to EBCDIC and back but it does not change the type of delimiter (PC versus z/OS UNIX) nor the number of characters in the file. https://www.ibm.com/docs/en/zos/2.5.0?topic=options-record-format-recfm Record Format (RECFM) RECFM specifies the characteristics of the records in the data set as fixed-length (F), variable-length (V), ASCII variable-length (D), or undefined-length (U). Blocked records are specified as FB, VB, or DB. Spanned records are specified as VS, VBS, DS, or DBS. You can also specify the records as fixed-length standard by using FS or FBS. You can request track overflow for records other than standard format by adding a T to the RECFM parameter (for example, by coding FBT). Track overflow is ignored for PDSEs. The type of print control can be specified to be in ANSI format-A, or in machine code format-M. See Using Optional Control Characters (https://www.ibm.com/docs/en/SSLTBW_2.5.0/com.ibm.zos.v2r5.idad400/occ.htm#occ) and z/OS DFSMS Macro Instructions for Data Sets (https://www.ibm.com/docs/en/SSLTBW_2.5.0/com.ibm.zos.v2r5.idad500/abstract.htm) for information about control characters. https://docs.tibco.com/pub/mftps-zos/8.0.0/doc/html/GUID-A0CF702B-C126-43BE-86B2-8DF589FAD6BF.html TIBCO® Managed File Transfer Platform Server for z/OS RECFM={ F | FB | V | VB | U | VS | VBS} Default=V This parameter defines the significance of the character logical record length (semantics of LRECL boundaries). You can specify fixed, variable, or system default The valid values are as follows: - F: each string contains exactly the number of characters defined by the string length parameter. - FB: all blocks and all logical record are fixed in size. One or more logical records reside in each block. - V: the length of each string is less than or equal to the string length parameter. - VB: blocks as well as logical record length can be of any size. One or more logical records reside in each block. - U: blocks are of variable size. No logical records are used. The logical record length is displayed as zero. This record format is usually only used in load libraries. Block size must be used if you are specifying U. - VS: records are variable and can span logical blocks. RECFM=VS is not supported when checkpoint restart is used. - VBS: blocks as well as logical record length can be of any size. One or more logical records reside in each block. Records are variable and can span logical blocks. RECFM=VBS is not supported when checkpoint restart is used.
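A minimal C++ sketch (the helper name is mine, not a z/OS service) of the effect of the tr command mentioned above: delete every carriage return byte (X'0D') and keep the line feed / newline bytes.
#include <algorithm>
#include <string>

// Delete every carriage return byte from a text buffer, leaving the
// line feed / newline bytes alone -- the same effect as `tr -d '\r'`.
static std::string stripCarriageReturns(std::string text)
{
    text.erase(std::remove(text.begin(), text.end(), '\r'), text.end());
    return text;
}
// Example: stripCarriageReturns("line1\r\nline2\r\n") == "line1\nline2\n"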

macOS OS


You can enter emoji (and other Unicode characters) using standard operating system tools, like Ctrl+Cmd+Space. https://eclecticlight.co/2021/05/08/explainer-unicode-normalization-and-apfs/ Explainer: Unicode, normalization and APFS hoakley May 8, 2021 --- One of the oldest problems with Apple’s APFS file system is how it encodes file and directory names using Unicode.

Windows OS


https://learn.microsoft.com/en-us/windows/win32/intl/international-support jlf: I'm looking for which functionalities are available only to Unicode apps...
- can be multilingual without managing code pages
- IME? not sure if it's only for Unicode apps
- other?
https://stackoverflow.com/questions/59404120/what-is-the-difference-in-using-cstringw-cstringa-and-ct2w-ct2a-to-convert-strin What is the difference in using CStringW/CStringA and CT2W/CT2A to convert strings? CString offers a number of conversion constructors to convert between ANSI and Unicode encoding. They are as convenient as they are dangerous, often masking bugs. By contrast, the Cs2d macros (where s = source, d = destination) work on raw C-style strings; no CString instances are created in the process of converting between character encodings. Both of the above perform a conversion with an implied ANSI codepage (either CP_THREAD_ACP or CP_ACP in case the _CONVERSION_DONT_USE_THREAD_LOCALE preprocessor symbol is defined). CP_ACP is particularly troublesome, as it's a process-global setting, that any thread can change at any time. Which one should you choose for your conversions? Neither of the above. Use the EX versions instead (see string and text classes for a full list). https://learn.microsoft.com/en-us/cpp/atl/string-and-text-classes?view=msvc-170 String and Text Classes https://stackoverflow.com/questions/15362859/getclipboarddata-cf-unicodetext GetClipboardData (CF_UNICODETEXT) https://jerrington.me/posts/2015-12-31-windows-debugging-for-fun-and-profit.html jlf: I reference this page for the code related to clipboard. Search for "locale". https://learn.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats Standard Clipboard Formats CF_LOCALE Locale identifier (LCID) associated with text in the clipboard. The system uses the code page associated with CF_LOCALE to implicitly convert from CF_TEXT to CF_UNICODETEXT. CF_TEXT Text format. Each line ends with a carriage return/linefeed (CR-LF) combination. A null character signals the end of the data. Use this format for ANSI text. CF_UNICODETEXT Unicode text format. Each line ends with a carriage return/linefeed (CR-LF) combination. A null character signals the end of the data. Locale https://learn.microsoft.com/en-us/windows/win32/intl/language-identifiers A language identifier is a standard international numeric abbreviation for the language in a country or geographical region. Each language has a unique language identifier (data type LANGID), a 16-bit value that consists of a primary language identifier and a sublanguage identifier.
+-------------------------+-------------------------+
|     SubLanguage ID      |   Primary Language ID   |
+-------------------------+-------------------------+
 15                     10 9                       0   bit
https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers A sort order identifier is defined in the form "_sortorder", at the end of the locale name used in the identifier, for example, "de-DE_phoneb", where "phoneb" is the sort order. The corresponding locale identifier is created as follows: MAKELCID(MAKELANGID(LANG_GERMAN, SUBLANG_GERMAN), SORT_GERMAN_PHONE_BOOK). https://learn.microsoft.com/en-us/windows/win32/intl/locale-identifiers Each locale has a unique identifier, a 32-bit value that consists of a language identifier and a sort order identifier.
+-------------+---------+-------------------------+
|  Reserved   | Sort ID |       Language ID       |
+-------------+---------+-------------------------+
 31         20 19     16 15                      0   bit
https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuilanguageinfo https://learn.microsoft.com/en-us/previous-versions/windows/embedded/ms930130(v=msdn.10)?redirectedfrom=MSDN Locale Code Table jlf: obsolete, but for the moment I don't have anything better. Correspondence Locale identifier (LCID) <--> Default code page ---
LCID    Code page  Language: sublanguage
0x0436  1252       Afrikaans: South Africa
0x041c  1250       Albanian: Albania
0x1401  1256       Arabic: Algeria
0x3c01  1256       Arabic: Bahrain
etc...
https://devblogs.microsoft.com/oldnewthing/20161007-00/?p=94475 How can I get the default code page for a locale?
UINT GetAnsiCodePageForLocale(LCID lcid)
{
    UINT acp;
    int sizeInChars = sizeof(acp) / sizeof(TCHAR);
    if (GetLocaleInfo(lcid,
                      LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                      reinterpret_cast<LPTSTR>(&acp),
                      sizeInChars) != sizeInChars) {
        // Oops - something went wrong
    }
    return acp;
}
https://www.w3.org/TR/ltli/#dfn-locale-neutral Locale neutral jlf: I don't get this at all. Locale-neutral. A non-linguistic field is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale aware way. Many specifications use a serialization scheme, such as those provided by [XMLSCHEMA11-2] or [JSON-LD], to provide a locale neutral encoding of non-linguistic fields in document formats or protocols. A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. http://archives.miloush.net/michkap/archive/2005/04/18/409095.html A few of the gotchas of WideCharToMultiByte by Michael S. Kaplan, published on 2005/04/18 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/18/409095.aspx http://archives.miloush.net/michkap/archive/2005/04/19/409566.html A few of the gotchas of MultiByteToWideChar by Michael S. Kaplan, published on 2005/04/19 04:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/19/409566.aspx --- jlf: I reached this page because the flag MB_COMPOSITE is not working! This page provides the answer: the Microsoft doc has this note Note For UTF-8 or code page 54936 (GB18030, starting with Windows Vista), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS. Uh? http://archives.miloush.net/michkap/archive/2005/02/26/381020.html What the &%#$ does MB_USEGLYPHCHARS do? by Michael S.
Kaplan, published on 2005/02/26 15:26 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/26/381020.aspx https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page Use UTF-8 code pages in Windows apps https://mastodon.gamedev.place/@AshleyGullen/111109299141510319 what it takes to pass a file path to a Windows API in C++ https://github.com/neacsum/utf8 This library simplifies usage of UTF-8 encoded strings under Win32 Related articles: https://www.codeproject.com//Articles/5252037/Doing-UTF-8-in-Windows https://www.codeproject.com/Articles/5259868/Doing-UTF-8-in-Windows-Part-2-Tolower-or-Not-to-Lo https://www.codeproject.com/Tips/5263944/UTF-8-in-Windows-INI-Files --- Reddit review: https://www.reddit.com/r/cpp/comments/174ee8q/doing_utf8_in_windows/ --- This article about UTF-8 in Windows that does not discuss how to use a manifest to get UTF-8 process ANSI codepage, directs people back to the 1990's. Or pre-2019, at any rate. --- Something else to note, if you're in the habit of keeping UTF-8 strings in `std::string`, is that the Visual C++ version of `std::filesystem::path` initialized from a `std::string` will use the default codepage for the process to convert the path to UTF-16. That will result in interesting failures on systems whose default codepage is MBCS. All without a single Windows API to be seen in your source. The solution to this is to upgrade to C++20 and use `std::u8string`, or to keep filenames in `std::wstring` if you don't want to deal with the odd and occasionally surprising limitations of `std::u8string`. https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activeCodePage Application manifests - activeCodePage --- On Windows 10, this element forces a process to use UTF-8 as the process code page. On Windows 10, the only valid value for activeCodePage is UTF-8. Starting in Windows 11, this element also allows selection of either the legacy non-UTF-8 code page, or code pages for a specific locale for legacy application compatibility. Modern applications are strongly encouraged to use Unicode. On Windows 11, activeCodePage may also be set to the value Legacy or a locale name such as en-US or ja-JP. https://devblogs.microsoft.com/oldnewthing/20210527-00/?p=105255 How can I convert between IANA time zones and Windows registry-based time zones? A copy of ICU has been included with Windows since Windows 10 Version 1703 (build 15063). All you have to do is include icu.h, and you’re off to the races. An advantage of using the version that comes with Windows is that it is actively maintained and updated by the Windows team. If you need to run on older systems, you can build your own copy from their fork of the ICU repo, https://github.com/microsoft/icu but the job of servicing the project is now on you.
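A minimal C++ sketch (Windows only, error handling reduced to returning an empty string) of the UTF-8 <-> UTF-16 conversions discussed above; note the restriction quoted above: for CP_UTF8 the flags must be either 0 or MB_ERR_INVALID_CHARS.
#include <windows.h>
#include <string>

// UTF-8 -> UTF-16. For CP_UTF8 the only valid flags are 0 and
// MB_ERR_INVALID_CHARS, otherwise the call fails with ERROR_INVALID_FLAGS.
static std::wstring utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    if (len <= 0) return std::wstring();             // invalid UTF-8
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}

// UTF-16 -> UTF-8. The last two arguments must be NULL for CP_UTF8.
static std::string utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}
An application that sets the UTF-8 activeCodePage in its manifest (see the link above) can avoid many of these conversions, since the ANSI entry points then accept UTF-8 directly.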

Language comparison


https://blog.kdheepak.com/my-unicode-cheat-sheet Vim, Python, Julia and Rust.

Regular expressions


https://regex101.com/ Testing a regular expression. There is even a debugger! https://www.regular-expressions.info/unicode.html \X matches a grapheme https://www.regular-expressions.info/posixbrackets.html POSIX Bracket Expressions jlf: see the table in the section Character Classes https://pypi.org/project/regex/
>>> a = "बिक्रम मेरो नाम हो"
>>> regex.findall(r'\X', a)
['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
--- https://regex101.com/r/eD0eZ9/1 --- jlf: the results above are correct extended grapheme clusters, but tailored grapheme clusters will group 'क्' 'र' in one cluster क्र https://blog.burntsushi.net/ripgrep/ ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} search for "unicode" and read... https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions Character classes in regular expressions https://github.com/micromatch/posix-character-classes POSIX character classes for creating regular expressions. jlf: careful, not official. Looks similar to the table at https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
POSIX class   Equivalent to                       Matches
[:alnum:]     [A-Za-z0-9]                         digits, uppercase and lowercase letters
[:alpha:]     [A-Za-z]                            upper- and lowercase letters
[:ascii:]     [\x00-\x7F]                         ASCII characters
[:blank:]     [ \t]                               space and TAB characters only
[:cntrl:]     [\x00-\x1F\x7F]                     Control characters
[:digit:]     [0-9]                               digits
[:graph:]     [^ [:cntrl:]]                       graphic characters (all characters which have graphic representation)
[:lower:]     [a-z]                               lowercase letters
[:print:]     [[:graph:] ]                        graphic characters and space
[:punct:]     [-!"#$%&'()*+,./:;<=>?@[]^_`{|}~]   all punctuation characters (all graphic characters except letters and digits)
[:space:]     [ \t\n\r\f\v]                       all blank (whitespace) characters, including spaces, tabs, new lines, carriage returns, form feeds, and vertical tabs
[:upper:]     [A-Z]                               uppercase letters
[:word:]      [A-Za-z0-9_]                        word characters
[:xdigit:]    [0-9A-Fa-f]                         hexadecimal digits
https://unicode-org.github.io/icu/userguide/icu/posix.html C/POSIX Migration Character classes, point 7: For more about the problems with POSIX character classes in a Unicode context see Annex C: Compatibility Properties in Unicode Technical Standard #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/#Compatibility_Properties and see the mailing list archives for the unicode list (on unicode.org). See also the ICU design document about C/POSIX character classes https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/main/design/posix_classes.html https://stackoverflow.com/questions/50570322/regex-pattern-matching-in-right-to-left-languages Regex pattern matching in right-to-left languages --- jlf: only one answer. Why control characters? What I understand is that the bytes are in the spelling order of the characters. The "/" ooRexx returns the same sequence of bytes under macOS.
---
/Store/عرمنتجات/عرع
    2F53746F72652F                    "/Store/"
    D8B9D8B1D985D986D8AAD8ACD8A7D8AA  عرمنتجات
    2F                                /
    D8B9D8B1D8B9                      عرع
/Store/عرع/عرمنتجات
    2F53746F72652F                    "/Store/"
    D8B9D8B1D8B9                      عرع
    2F                                /
    D8B9D8B1D985D986D8AAD8ACD8A7D8AA  عرمنتجات
/Store/عرمنتجات/whatever
    2F53746F72652F                    "/Store/"
    D8B9D8B1D985D986D8AAD8ACD8A7D8AA  عرمنتجات
    2F                                /
    7768617465766572                  whatever
https://stackoverflow.com/questions/20641297/unicode-characters-in-regex Unicode characters in Regex
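For what it's worth, the POSIX bracket expressions from the table above are also accepted inside std::regex character classes, but they stay byte/locale oriented and know nothing about graphemes (there is no \X in std::regex); a minimal sketch:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    // [[:digit:]] is the POSIX class for [0-9] (see the table above).
    // std::regex works on char units with the current locale; it has no notion
    // of grapheme clusters, so \X-style matching needs ICU, Boost.Regex or the
    // Python regex module shown above.
    std::string s = "Item-42 costs 7 euros";
    std::regex digits("[[:digit:]]+");
    for (std::sregex_iterator it(s.begin(), s.end(), digits), end; it != end; ++it)
        std::cout << it->str() << "\n";              // prints 42 then 7
}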

Test cases, test-cases, tests files


https://github.com/lemire/unicode_lipsum

font bold, italic, strikethrough, underline, backwards, upside down


I remember seeing an open-source implementation, but I forgot to note it. The URLs below do not provide a link to an open-source implementation; to be removed sooner or later. https://convertcase.net/unicode-text-converter/ https://yaytext.com/ https://capitalizemytitle.com/ https://capitalizemytitle.com/fancy-text-generator/ http://slothsoft.net/UnicodeMapper/ https://www.fontgenerator.org/ https://peterwunder.de/projects/prettify/ https://texteditor.com/ https://gwern.net/utext https://news.ycombinator.com/item?id=38016735 Utext: Rich Unicode Documents (gwern.net) An esoteric document proposal: abuse Unicode to create the fanciest possible ‘plain text’ documents. https://fonts.google.com/noto https://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0121.html Encoding italic (was: A last missing link)

youtube


https://www.youtube.com/playlist?list=PLMc927ywQmTNQrscw7yvaJbAbMJDIjeBh Videos from Unicode's Overview of Internationalization and Unicode Projects

xxx lang


https://rosettacode.org/wiki/Unicode_strings https://langdev.stackexchange.com/questions/1493/how-have-modern-language-designs-dealt-with-unicode-strings How have modern language designs dealt with Unicode strings? Asked 2023-06-13 Answers for:
- Swift
- Rust
- Python 3
- Treat it as a (mostly) library issue
jlf: the Swift part is interesting, the rest is meh. In order to speed up repeated accesses to utf16, UTF-8 strings may put a breadcrumbs pointer after the null terminator: https://github.com/apple/swift/blob/1532fb188c55f29de7bf8aaee94896557b3a3db1/stdlib/public/core/StringStorage.swift#L157 The breadcrumbs are a list of the UTF-8 offsets of every 64th UTF-16 code unit (see the sketch at the end of this section): https://github.com/apple/swift/blob/483087a47dfb56e78fcc20ef2b43085ebfb48ea0/stdlib/public/core/StringBreadcrumbs.swift A string stores whether it has breadcrumbs in an unused bit in its capacity field: https://github.com/apple/swift/blob/1532fb188c55f29de7bf8aaee94896557b3a3db1/stdlib/public/core/StringStorage.swift#L45 http://xahlee.info/comp/unicode_essays_index.html Unicode for Programers jlf: this page contains several URLs for programming languages. Short articles, but maybe there is something to learn. [later] After review, not so many things to learn; the articles are very, very short...
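A rough C++ sketch of the breadcrumbs idea described above (not Swift's actual code; the names are made up): record the UTF-8 byte offset of every 64th UTF-16 code unit, so mapping a UTF-16 index back to a UTF-8 offset only has to scan at most one stride. Assumes well-formed UTF-8 and an in-range index.
#include <cstddef>
#include <string>
#include <vector>

struct Breadcrumbs {
    static constexpr std::size_t STRIDE = 64;
    struct Crumb { std::size_t utf16Unit; std::size_t byteOffset; };
    std::vector<Crumb> crumbs;   // crumbs[k]: first scalar boundary at or after UTF-16 unit k*STRIDE

    // Length in bytes of a UTF-8 sequence, from its lead byte.
    static std::size_t seqLen(unsigned char lead) {
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;   // 110xxxxx
        if (lead < 0xF0) return 3;   // 1110xxxx
        return 4;                    // 11110xxx -> one surrogate pair in UTF-16
    }

    explicit Breadcrumbs(const std::string& utf8) {
        crumbs.push_back({0, 0});
        std::size_t byte = 0, unit = 0;
        while (byte < utf8.size()) {
            std::size_t n = seqLen((unsigned char)utf8[byte]);
            byte += n;
            unit += (n == 4) ? 2 : 1;
            while (crumbs.size() * STRIDE <= unit)   // we just passed a multiple of STRIDE
                crumbs.push_back({unit, byte});
        }
    }

    // UTF-8 byte offset of the scalar containing UTF-16 code unit utf16Index.
    std::size_t utf8Offset(const std::string& utf8, std::size_t utf16Index) const {
        std::size_t k = utf16Index / STRIDE;
        if (crumbs[k].utf16Unit > utf16Index) --k;   // stride boundary fell inside a surrogate pair
        std::size_t byte = crumbs[k].byteOffset;
        std::size_t unit = crumbs[k].utf16Unit;
        for (;;) {
            std::size_t n = seqLen((unsigned char)utf8[byte]);
            std::size_t units = (n == 4) ? 2 : 1;
            if (unit + units > utf16Index) return byte;
            byte += n;
            unit += units;
        }
    }
};
Swift builds this table lazily and remembers whether it exists in a spare bit of the capacity field (see the StringStorage.swift link above); this sketch builds it eagerly for simplicity.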

Ada lang


https://docs.adacore.com/live/wave/xmlada/html/xmlada_ug/unicode.html http://www.dmitry-kazakov.de/ada/strings_edit.htm UXStrings Ada Unicode Extended Strings https://www.reddit.com/r/ada/comments/t4hpip/ann_uxstrings_package_available_uxs_20220226/ https://github.com/Blady-Com/UXStrings --- 2023.10.14 https://groups.google.com/g/comp.lang.ada/c/rWqDxiOwa1g [ANN] Release of UXStrings 0.6.0 - Add string convenient subprograms [2]: Contains, Ends_With,Starts_With, [2] https://github.com/Blady-Com/UXStrings/blob/master/src/uxstrings3.ads#L346 jlf: see https://github.com/Blady-Com/UXStrings/blob/master/src/uxstrings3.adb After a quick look, I still don't know which kind of position is managed. There is a parameter Case_Sensitivity, but I never see it used with a position (that's the tricky part) https://github.com/AdaForge/Thematics/wiki/Unicode-and-String-manipulations Unicode and String manipulations in UTF-8, UTF-16, ... https://stackoverflow.com/questions/48829940/utf-8-on-windows-with-ada UTF-8 on Windows with Ada https://github.com/AdaCore/VSS/ High level string and text processing library https://blog.adacore.com/vss-cursors-iterators-and-markers VSS (Virtual String Subsystem): Cursors, Iterators and Markers jlf: meh...

Awk lang


Brian Kernighan adds Unicode support to Awk https://github.com/onetrueawk/awk/commit/9ebe940cf3c652b0e373634d2aa4a00b8395b636 https://github.com/onetrueawk/awk/tree/unicode-support https://news.ycombinator.com/item?id=32534173

C++ lang, cpp lang, Boost


https://en.cppreference.com/w/cpp/language/string_literal String literal (referenced by Adrian) Some examples: https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp https://www.youtube.com/watch?v=iQWtiYNK3kQ A Crash Course in Unicode for C++ Developers - Steve Downey - [CppNow 2021] jlf: good video for pronunciation 57:16 Algorithms 1:12:27 The future for C++ (you can stop here, not very interesting) 02/06/2021 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html SG16 initial Unicode direction and guidance for C++20 and beyond. https://github.com/sg16-unicode/sg16 SG16 is an ISO/IEC JTC1/SC22/WG21 C++ study group tasked with improving Unicode and text processing support within the C++ standard. https://github.com/sg16-unicode/sg16-meetings Summaries of SG16 virtual meetings https://lists.isocpp.org/mailman/listinfo.cgi/sg16 SG16 mailing list https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1629r1.html P1629R1 Transcoding the 🌐 - Standard Text Encoding Published Proposal, 2020-03-02 --- jlf: referenced by Zach Laine in P2728R0. [P1629R1] from JeanHeyd Meneide is a much more ambitious proposal that aims to standardize a general-purpose text encoding conversion mechanism. This proposal is not at odds with P1629; the two proposals have largely orthogonal aims. This proposal only concerns itself with UTF interconversions, which is all that is required for Unicode support. P1629 is concerned with those conversions, plus a lot more. Accepting both proposals would not cause problems; in fact, the APIs proposed here could be used to implement parts of the P1629 design. 01/06/2021 Zach Laine https://www.youtube.com/watch?v=944GjKxwMBo https://tzlaine.github.io/text/doc/html/boost_text__proposed_/the_text_layer.html https://tzlaine.github.io/text/doc/html/index.html The Text Layer https://tzlaine.github.io/text/doc/html/ Chapter 1. Boost.Text (Proposed) - 2018 https://github.com/tzlaine/text last commit : master 26/09/2020 boost_serialization 24/10/2019 coroutines 25/08/2020 experimental 13/11/2019 gh-pages 04/09/2020 optimization 27/10/2019 rope_free_fn_reimplementation 26/07/2020 No longer working on this project ? --- Restart working on 22/03/2022 Zach's library was last discussed at the 2023-05-10 SG16 meeting; see https://github.com/sg16-unicode/sg16-meetings#may-10th-2023. --- https://www.youtube.com/watch?v=AoLl\_ZZqyOk Applying the Lessons of std::ranges to Unicode in the C++ Standard Library - Zach Laine CppNow 2023 https://isocpp.org/files/papers/P2728R0.html (see more recent version below) Unicode in the Library, Part 1: UTF Transcoding Document #: P2728R0 Date: 2022-11-20 Reply-to: Zach Laine <whatwasthataddress@gmail.com> --- New version: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r5.html Document #: P2728R5 Date: 2023-07-05 --- latest published version: https://wg21.link/p2728 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2729r0.html Unicode in the Library, Part 2: Normalization Document #: P2729R0 Date: 2022-11-20 Reply-to: Zach Laine <whatwasthataddress@gmail.com> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2773r0.pdf paper D2773R0 by Corentin Jabot https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html C++ Identifier Syntax using Unicode Standard Annex 31 Document #: P1949R7 Date: 2021-04-12 --- Adopt Unicode Annex 31 as part of C++ 23. - That C++ identifiers match the pattern (XID_Start + _ ) + XID_Continue*. - That portable source is required to be normalized as NFC. 
- That using unassigned code points be ill-formed. This proposal also recommends adoption of Unicode normalization form C (NFC) for identifiers to ensure that when compared, identifiers intended to be the same will compare as equal. Legacy encodings are generally naturally in NFC when converted to Unicode. Most tools will, by default, produce NFC text. Some scripts require the use of characters as joiners that are not allowed by base UAX #31, these will no longer be available as identifiers in C++. As a side-effect of adopting the identifier characters from UAX #31, using emoji in or as identifiers becomes ill-formed. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2528r0.html C++ Identifier Security using Unicode Standard Annex 39 Document #: P2538R0 Date: 2022-01-22 14/06/2021 https://hsivonen.fi/non-unicode-in-cpp/ Same contents in sg16 mailing list + feedbacks https://lists.isocpp.org/sg16/2019/04/0309.php 03/07/2021 https://news.ycombinator.com/item?id=27695412 Any Encoding, Ever – ztd.text and Unicode for C++ 14/07/2021 https://hsivonen.fi/non-unicode-in-cpp/ It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++ The Microsoft Code Page 932 Issue https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428. What is the printf() formatting character for char8_t *? jlf: todo read it? not sure yet if it's useful to read. Referenced from https://corp.unicode.org/pipermail/unicode/2020-April/008579.html Basic Unicode character/string support absent even in modern C++ https://github.com/nemtrif/utfcpp/ referenced from https://corp.unicode.org/pipermail/unicode/2020-April/008582.html Basic Unicode character/string support absent even in modern C++ https://www.boost.org/doc/libs/1_80_0/libs/locale/doc/html/index.html Boost.Locale Boost.Locale uses the-state-of-the-art Unicode and Localization library: ICU - International Components for Unicode. https://github.com/uni-algo/uni-algo Unicode Algorithms Implementation for C/C++ https://www.reddit.com/r/cpp/comments/xspvn4/unialgo_v050_modern_unicode_library/ uni-algo v0.5.0: Modern Unicode Library https://www.reddit.com/r/cpp/comments/vtgckq/new_unicode_library/ Older post with more infos https://github.com/uni-algo/uni-algo-single-include Single include version for Unicode Algorithms Implementation This repository contains single include version of uni-algo library. https://www.reddit.com/r/cpp/comments/14t2lzm/unialgo_v100_modern_unicode_library/ uni-algo v1.0.0: Modern Unicode Library --- jlf: see the critics of Zach Laine's library... mg152 has good arguments. --- jlf: this library is referenced in the comments https://github.com/hikogui/hikogui/tree/main/src/hikogui/unicode https://github.com/hikogui/hikogui/tree/main/tools/ucd https://github.com/hikogui/hikogui/tree/main/src/hikogui/unicode https://github.com/hikogui/hikogui Modern accelerated GUI jlf: the point is not the GUI, but the tools to parse Unicode UCD. See https://github.com/hikogui/hikogui/tree/main/tools --- Comment of the author in https://www.reddit.com/r/cpp/comments/vtgckq/new_unicode_library/ I recently discovered a way to compress the unicode-data-set, while still being able to do quick lookups, with a single associative indirection. Basically you chunk the data in groups of 32 entries. Then you de-duplicate these chunks and make a index table (about 64kbyte) that points to the chunks. This works because a code-point is only 21 bits, which you can split in 16 bit msb and 5 bit lsb. 
This means that the index table has less than 64k uint16_t entries. My data is including the index around 700 KByte. With the following data:
general category: 5 bit
grapheme cluster break: 4
line break class: 6
word break property: 5
sentence break property: 4
east asian width: 3
bidi class: 5
bidi bracket type: 2
bidi mirroring glyph: 16
ccc: 8
script: 8
decomposition type: 5
decomposition index: 21 (decomposition table not included in the 700kbyte)
composition index: 14 (composition table not included in the 700kbyte)
Of the 128 bits per entry, 22 bits are currently unused. It is also possible to compress a single entry. For example ccc is always zero for non-composing code-points, so it could share those bits with properties that are only allowed for non-composing code-points. (See the lookup sketch at the end of this section.) https://news.ycombinator.com/item?id=38424689 Bjarne Stroustrup Quotes (stroustrup.com) --- Interesting discussion about strings (not limited to C++): search for "string". https://www.sandordargo.com/blog/2023/11/29/cpp23-unicode-support C++23: Growing unicode support --- The standardization committee has accepted (at least) four papers which clearly show a growing Unicode support in C++23.
- C++ Identifier Syntax using Unicode Standard Annex 31
- Remove non-encodable wide character literals and multicharacter wide character literals
- Delimited escape sequences
- Named universal character escapes
U'\N{LATIN CAPITAL LETTER A WITH MACRON}' // Equivalent to U'\u0100'
u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}" // Equivalent to u8"\u0100\u0300"
One of the concerns was the sheer size of the Unicode name database that contains the codes (e.g. U+0100) and the names (e.g. {LATIN CAPITAL LETTER A WITH MACRON}). It's around 1.5 MiB which can significantly impact the size of compiler distributions. The authors proved that a non-naive implementation can be around 300 KiB or even less. jlf: next point sounds debatable, no? Another open question was how to accept the Unicode-assigned names. Is {latin capital letter a with macron} just as good as {LATIN CAPITAL LETTER A WITH MACRON}? Or what about {LATIN_CAPITAL_LETTER_A_WITH_MACRON}? While the Unicode consortium standardized an algorithm called UAX44-LM2 for that purpose and it's quite permissive, language implementors barely follow it. C++ is going to require an exact match with the database therefore the answer to the previous question is no, {latin capital letter a with macron} is not the same as {LATIN CAPITAL LETTER A WITH MACRON}. On the other hand, if there will be a strong need, the requirements can be relaxed in a later version. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2071r2.html Named universal character escapes --- jlf: they don't want to support UAX44-LM2 jlf: todo, read the section "Design considerations"
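A minimal C++ sketch of the chunked two-stage lookup the hikogui author describes in the Reddit comment quoted above (Props and PropertyTable are made-up names; a real generator would pack the fields listed above into each entry, and the scheme assumes a dense input table of 0x110000 entries and fewer than 64k unique chunks): split the 21-bit code point into a chunk number and a 5-bit offset, deduplicate the 32-entry chunks, and keep one uint16_t per chunk in the index table.
#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Props = std::uint32_t;                       // placeholder for the packed per-code-point record
using Chunk = std::array<Props, 32>;

struct PropertyTable {
    std::vector<std::uint16_t> index;              // 0x110000 / 32 = 0x8800 entries
    std::vector<Chunk> chunks;                     // deduplicated 32-entry chunks

    // Build from a dense table with one entry per code point (0 .. 0x10FFFF).
    explicit PropertyTable(const std::vector<Props>& dense) {
        std::map<Chunk, std::uint16_t> seen;
        for (std::size_t cp = 0; cp < 0x110000; cp += 32) {
            Chunk c{};
            for (std::size_t i = 0; i < 32; ++i) c[i] = dense[cp + i];
            auto it = seen.find(c);
            if (it == seen.end()) {                // first time we see this chunk
                it = seen.emplace(c, (std::uint16_t)chunks.size()).first;
                chunks.push_back(c);
            }
            index.push_back(it->second);
        }
    }

    Props lookup(char32_t cp) const {
        return chunks[index[cp >> 5]][cp & 0x1F];  // one indirection, then the 5 low bits
    }
};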

cRexx lang


cRexx uses this library: https://github.com/sheredom/utf8.h --- Codepoint Case Various functions provided will do case insensitive compares, or transform utf8 strings from one case to another. Given the vastness of unicode, and the authors lack of understanding beyond latin codepoints on whether case means anything, the following categories are the only ones that will be checked in case insensitive code: ASCII Latin-1 Supplement Latin Extended-A Latin Extended-B Greek and Coptic Cyrillic

DotNet, CoreFx


28/07/2021 https://github.com/dotnet/corefxlab/issues/2368 Scenarios and Design Philosophy - UTF-8 string support https://gist.github.com/GrabYourPitchforks/901684d0aa1d2440eb378d847cfc8607 (jlf: replaced by the following URL) https://github.com/dotnet/corefx/issues/34094 (go directly to next URL) https://github.com/dotnet/runtime/issues/28204 Motivations and driving principles behind the Utf8Char proposal https://github.com/dotnet/runtime/issues/933 The NuGet package generally follows the proposal in dotnet/corefxlab#2350, which is where most of the discussion has taken place. It's a bit aggravating that the discussion is split across so many different forums, I know. :( ceztko I noticed dotnet/corefxlab#2350 just got closed. Did the discussion moved somewhere else about more UTF8 first citizen support efforts? @ceztko The corefxlab repo was archived, so open issues were closed to support that effort. That thread also got so large that it was difficult to follow. @krwq is working on restructuring the conversation so that we can continue the discussion in a better forum. jlf: Not clear where the discussion continues... This URL just shows some tags, one of them is "Future". https://github.com/orgs/dotnet/projects/7#card-33368432 https://github.com/dotnet/corefxlab/issues/2350 Utf8String design discussion - last edited 14-Sep-19 Tons of comments, with this conclusion: The discussion in this issue is too long and github has troubles rendering it. I think we should close this issue and start a new one in dotnet/runtime. https://github.com/dotnet/runtime/tree/main .Net runtime jlf: could be useful https://github.com/dotnet/runtime/blob/main/src/libraries/System.Console/src/System/Console.cs

Dafny lang


https://corp.unicode.org/pipermail/unicode/2021-May/009434.html Dafny natively supports expressing statements about sets and contract programming and a toy implementation turned out to be a fairly rote translation of the Unicode spec. Dafny is also transpilation focused, so the primary interface must be highly functional and encoding neutral.

Dart lang


Dart SDK uses ICU4X? jlf: to investigate... --- On Fuchsia, the Dart SDK uses createTimeZone() with metazone names obtained from the OS (usage site). ICU4X currently only supports this stuff with BCP-47 ids. We should have a way to go from metazone names to BCP-47 ids. I suspect this is already part of the plan but I'm not sure if there's a specific issue filed (@nordzilla?) --- In the link you posted, it shows "America/New_York", which is an IANA time zone name, not a metazone name. Did you mean to ask about IANA-to-BCP47 mapping? That would be #2909 https://github.com/dart-lang/sdk/blob/main/sdk/lib/core/string.dart https://github.com/dart-lang/sdk/blob/e995cb5f7cd67d39c1ee4bdbe95c8241db36725f/pkg/analyzer/lib/source/source_range.dart https://github.com/dart-lang/ https://github.com/dart-lang/language https://github.com/dart-lang/sdk https://dart.dev/guides/language/language-tour#strings A Dart string (String object) holds a sequence of UTF-16 code units. https://dart.dev/guides/language/language-tour#runes-and-grapheme-clusters In Dart, runes expose the Unicode code points of a string. You can use the characters package to view or manipulate user-perceived characters, also known as Unicode (extended) grapheme clusters. https://dart.dev/guides/libraries/library-tour#strings-and-regular-expressions https://pub.dev/packages/characters Characters are strings viewed as sequences of user-perceived characters, also known as Unicode (extended) grapheme clusters. The Characters class allows access to the individual characters of a string, and a way to navigate back and forth between them using a CharacterRange. https://medium.com/dartlang/dart-string-manipulation-done-right-5abd0668ba3e Like many other programming languages designed before emojis started to dominate our daily communications and the rise of multilingual support in commercial apps, Dart represents a string as a sequence of UTF-16 code units. --- jlf: they say that the Dart users are not aware of the Characters package. They try to improve the situation in the Flutter framework, but they are not very happy of the situation: Those mitigations can help, but they are limited to string manipulations performed in the context of a Flutter project. We need to carefully measure their effectiveness after they become available. A more complete solution at the Dart language level will likely require migration of at least some existing code, although a few options (for example, static extension types) might make breaking changes manageable. More technical investigation is needed to fully understand the trade-offs. https://github.com/robertbastian/icu4x/tree/dart/ffi/capi/dart/package jlf: A fork with DART FFI

Elixir lang


https://elixir-lang.org/ "Elixir" |> String.graphemes() |> Enum.frequencies() %{"E" => 1, "i" => 2, "l" => 1, "r" => 1, "x" => 1} --- "Elixir"~text~reduce(by: "characters", initial: .stem~new~~put(0)){accu[item] += 1}= a Stem (5 items) 'E' : 1 'i' : 2 'l' : 1 'r' : 1 'x' : 1 https://hexdocs.pm/elixir/String.html Strings in Elixir are UTF-8 encoded binaries. Works at grapheme level. The functions in this module rely on the Unicode Standard, but do not contain any of the locale specific behaviour. To act according to the Unicode Standard, many functions in this module run in linear time, as they need to traverse the whole string considering the proper Unicode code points. For example, String.length/1 will take longer as the input grows. On the other hand, Kernel.byte_size/1 always runs in constant time (i.e. regardless of the input size). --- Interesting: they manage correctly the upper/lower without using a locale. upcase(string, mode \\ :default) Converts all characters in the given string to uppercase according to mode. mode may be :default, :ascii, :greek or :turkic. The :default mode considers all non-conditional transformations outlined in the Unicode standard. :ascii uppercases only the letters a to z. :greek includes the context sensitive mappings found in Greek. :turkic properly handles the letter i with the dotless variant. https://hexdocs.pm/elixir/unicode-syntax.html Strings are UTF-8 encoded. Charlists are lists of Unicode code points. In such cases, the contents are kept as written by developers, without any transformation. Elixir allows Unicode characters in its variables, atoms, and calls. From now on, we will refer to those terms as identifiers. The characters allowed in identifiers are the ones specified by Unicode. Elixir normalizes all characters to be the in the NFC form. Mixed-script identifiers are not supported for security reasons. аdmin "аdmin"~text~unicodecharacters== an Array (shape [5], 5 items) 1 : ( "а" U+0430 Ll 1 "CYRILLIC SMALL LETTER A" ) 2 : ( "d" U+0064 Ll 1 "LATIN SMALL LETTER D" ) 3 : ( "m" U+006D Ll 1 "LATIN SMALL LETTER M" ) 4 : ( "i" U+0069 Ll 1 "LATIN SMALL LETTER I" ) 5 : ( "n" U+006E Ll 1 "LATIN SMALL LETTER N" ) The character must either be all in Cyrillic or all in Latin. The only mixed-scripts that Elixir allows, according to the Highly Restrictive Unicode recommendations, are: Latin and Han with Bopomofo Latin and Japanese Latin and Korean Elixir will also warn on confusable identifiers in the same file. For example, Elixir will emit a warning if you use both variables а (Cyrillic) and а (Latin) in your code. Elixir implements the requirements outlined in the Unicode Annex #31 (https://www.unicode.org/reports/tr31/) Elixir does not allow the use of ZWJ or ZWNJ in identifiers and therefore does not implement R1a. Bidirectional control characters are also not supported. R1b is guaranteed for backwards compatibility purposes. Elixir supports only code points \t (0009), \n (000A), \r (000D) and \s (0020) as whitespace and therefore does not follow requirement R3. R3 requires a wider variety of whitespace and syntax characters to be supported.

Factor lang


http://docs.factorcode.org/content/article-unicode.html http://useless-factor.blogspot.fr/2007/02/doing-unicode-right-part-1.html JLF : bof... http://useless-factor.blogspot.fr/2007/02/doing-unicode-right-part-2.html http://useless-factor.blogspot.fr/2007/08/unicode-implementers-guide-part-3.html http://useless-factor.blogspot.fr/2007/08/unicode-implementers-guide-part-4.html grapheme breaking http://useless-factor.blogspot.fr/2007/08/r-597-rs-unicode-library-is-broken.html http://useless-factor.blogspot.fr/2007/02/more-string-parsing.html UTF-8/16 encoder/decoder I used a design pattern known as a sentinel, which helps me cross-cut pointcutting concerns by instantiating objects which encapsulate the state of the parser. I never mutate these, and the program is purely functional except for the use of make (which could trivially be changed into a less efficient map [ ] subset, sacrificing efficiency and some terseness but making it functional). TUPLE: new ; TUPLE: double val ; TUPLE: quad2 val ; TUPLE: quad3 val ; : bad-char CHAR: ? ; GENERIC: (utf16le) ( char state -- state ) M: new (utf16le) drop <double> ; M: double (utf16le) over -3 shift BIN: 11011 = [ over BIN: 100 bitand 0 = [ double-val swap BIN: 11 bitand 8 shift bitor <quad2> ] [ 2drop bad-char , <new> ] if ] [ double-val swap 8 shift bitor , <new> ] if ; M: quad2 (utf16le) quad2-val 10 shift bitor <quad3> ; M: quad3 (utf16le) over -2 shift BIN: 110111 = [ swap BIN: 11 bitand 8 shift swap quad3-val bitor HEX: 10000 + , <new> ] [ 2drop bad-char , <new> ] if ; : utf16le ( state string -- state string ) [ [ swap (utf16le) ] each ] { } make ; https://re.factorcode.org/2023/05/unicode.html jlf: very basic, but may be useful to write little tests https://re.factorcode.org/2023/05/case-conversion.html snake_case camelCase kebab-case PascalCase Ada_Case Train-Case COBOL-CASE MACRO_CASE UPPER CASE lower case Title Case Sentence case dot.case

Fortran lang


https://fortran-lang.discourse.group/t/using-unicode-characters-in-fortran/2764 jlf: hmm... it's blind support of UTF-8, as we do with current Rexx. There is no support for Unicode. In the unicode_len.f90 example:
chars = 'Fortran is 💪, 😎, 🔥!'
if (len(chars) /= 28) error stop
28 is the length in bytes... In the unicode_index.f90 example:
chars = '📐: 4.0·tan⁻¹(1.0) = π'
i = index(chars, 'n')
if (i /= 14) error stop
i = index(chars, '¹')
if (i /= 18) error stop
14 and 18 are byte positions...

GO lang


https://go.dev/ https://go.dev/ref/spec#Conversions_to_and_from_a_string_type jlf: worth reading, they cover all the possible conversions between bytes, rune and string. https://go.dev/play/ The Go Playground https://github.com/traefik/yaegi Another Elegant Go Interpreter --- rlwrap yaegi https://yourbasic.org/golang/ Tutorial, a selection related to strings --- []byte("Noël") // [78 111 195 171 108] // 1. Using the string() constructor string([]byte{78, 111, 195, 171, 108}) // Noël // 2. Go provides a package called bytes with a function called NewBuffer(), which // creates a new Buffer and then uses the String() method to get the string output. bytes.NewBuffer([]byte{78, 111, 195, 171, 108}).String() // Noël // 3. Using fmt.Sprintf() function fmt.Sprintf("%s", []byte{78, 111, 195, 171, 108}) // Noël // String building fmt.Sprintf("Size: %d MB.", 85) // Size: 85 MB. // High-performance string concatenation var b strings.Builder b.Grow(32) // preallocate memory when the maximum size of the string is known for i, p := range []int{2, 3, 5, 7, 11, 13} { fmt.Fprintf(&b, "%d:%d, ", i+1, p) } s := b.String() // no copying s = s[:b.Len()-2] // no copying (removes trailing ", ") fmt.Println(s) // 1:2, 2:3, 3:5, 4:7, 5:11, 6:13 // Convert string to runes // For an invalid UTF-8 sequence, the rune value will be 0xFFFD for each invalid byte. []rune("Noël") // [78 111 235 108] // Convert runes to string // When you convert a slice of runes to a string, you get a new string that // is the concatenation of the runes converted to UTF-8 encoded strings. // Values outside the range of valid Unicode code points are converted to // \uFFFD, the Unicode replacement character �. string([]rune{'\u004E', '\u006F', '\u00EB', '\u006C'}) // Noël // String iteration by runes // the range loop iterates over Unicode code points. // The index is the first byte of a UTF-8-encoded code point; // the second value, of type rune, is the value of the code point. // For an invalid UTF-8 sequence, the second value will be 0xFFFD, // and the iteration will advance a single byte. for i, ch := range "日本語" { fmt.Printf("%#U starts at byte position %d\n", ch, i) } // Output: U+004E 'N' starts at byte position 0 U+006F 'o' starts at byte position 1 U+00EB 'ë' starts at byte position 2 U+006C 'l' starts at byte position 4 // String iteration by bytes const s = "Noël" for i := 0; i < len(s); i++ { fmt.Printf("%x ", s[i]) } // Output: 4e 6f c3 ab 6c https://pkg.go.dev/strings Package strings implements simple functions to manipulate UTF-8 encoded strings. jlf: BIFs https://go.dev/blog/slices Arrays, slices (and strings): The mechanics of 'append' Rob Pike 26 September 2013 --- jlf: prerequisite to understand how strings are managed Next blog also helps (no relation with Unicode, but...) https://teivah.medium.com/slice-length-vs-capacity-in-go-af71a754b7d8 https://go.dev/blog/strings Strings, bytes, runes and characters in Go Rob Pike 23 October 2013 --- In Go, a string is in effect a read-only slice of bytes. A string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98" Indexing a string accesses individual bytes, not characters. for i := 0; i < len(sample); i++ { fmt.Printf("%x ", sample[i]) # bd b2 3d bc 20 e2 8c 98 } A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte. 
fmt.Printf("%x\n", sample) # bdb23dbc20e28c98 fmt.Printf("% x\n", sample) # bd b2 3d bc 20 e2 8c 98 The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous. fmt.Printf("%q\n", sample) # "\xbd\xb2=\xbc ⌘" fmt.Printf("%+q\n", sample) # "\xbd\xb2=\xbc \u2318" The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. A for range loop decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. const nihongo = "日本語" for index, runeValue := range nihongo { fmt.Printf("%#U starts at byte position %d\n", runeValue, index) } The output shows how each code point occupies multiple bytes: U+65E5 '日' starts at byte position 0 U+672C '本' starts at byte position 3 U+8A9E '語' starts at byte position 6 https://go.dev/pkg/unicode/utf8/ Unicode/utf8 package https://go.dev/blog/normalization Text normalization in Go Marcel van Lohuizen 26 November 2013 --- To write your text as NFC, use the https://pkg.go.dev/golang.org/x/text/unicode/norm package to wrap your io.Writer of choice: wc := norm.NFC.Writer(w) defer wc.Close() // write as before... If you have a small string and want to do a quick conversion, you can use this simpler form: norm.NFC.Bytes(b) https://cs.opensource.google/go/x/text This repository holds supplementary Go libraries for text processing, many involving Unicode. https://pkg.go.dev/golang.org/x/text/collate The collate package, which can sort strings in a language-specific way, works correctly even with unnormalized strings https://pkg.go.dev/golang.org/x/text/encoding Package encoding defines an interface for character encodings, such as Shift JIS and Windows 1252, that can convert to and from UTF-8. Encoding implementations are provided in other packages, such as golang.org/x/text/encoding/charmap golang.org/x/text/encoding/japanese. A Decoder converts bytes to UTF-8. It implements transform.Transformer. Transforming source bytes that are not of that encoding will not result in an error per se. Each byte that cannot be transcoded will be represented in the output by the UTF-8 encoding of '\uFFFD', the replacement rune. --- jlf: strange... I was expecting a more conservative conversion, since the core language supports any bytes in a string. An Encoder converts bytes from UTF-8. It implements transform.Transformer. Each rune that cannot be transcoded will result in an error. In this case, the transform will consume all source byte up to, not including the offending rune. Transforming source bytes that are not valid UTF-8 will be replaced by `\uFFFD`. --- jlf: the previous description seems contradictory. "up to, not including the offending rune" "not valid UTF-8 will be replaced by `\uFFFD`" https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/charmap Package charmap provides simple character encodings such as IBM Code Page 437 and Windows 1252. CodePage037 is the IBM Code Page 037 encoding. CodePage1047 is the IBM Code Page 1047 encoding. CodePage1140 is the IBM Code Page 1140 encoding. CodePage437 is the IBM Code Page 437 encoding. CodePage850 is the IBM Code Page 850 encoding. CodePage852 is the IBM Code Page 852 encoding. CodePage855 is the IBM Code Page 855 encoding. CodePage858 is the Windows Code Page 858 encoding. CodePage860 is the IBM Code Page 860 encoding. CodePage862 is the IBM Code Page 862 encoding. 
CodePage863 is the IBM Code Page 863 encoding. CodePage865 is the IBM Code Page 865 encoding. CodePage866 is the IBM Code Page 866 encoding. ISO8859_1 is the ISO 8859-1 encoding. ISO8859_10 is the ISO 8859-10 encoding. ISO8859_13 is the ISO 8859-13 encoding. ISO8859_14 is the ISO 8859-14 encoding. ISO8859_15 is the ISO 8859-15 encoding. ISO8859_16 is the ISO 8859-16 encoding. ISO8859_2 is the ISO 8859-2 encoding. ISO8859_3 is the ISO 8859-3 encoding. ISO8859_4 is the ISO 8859-4 encoding. ISO8859_5 is the ISO 8859-5 encoding. ISO8859_6 is the ISO 8859-6 encoding. ISO8859_7 is the ISO 8859-7 encoding. ISO8859_8 is the ISO 8859-8 encoding. ISO8859_9 is the ISO 8859-9 encoding. KOI8R is the KOI8-R encoding. KOI8U is the KOI8-U encoding. Macintosh is the Macintosh encoding. MacintoshCyrillic is the Macintosh Cyrillic encoding. Windows1250 is the Windows 1250 encoding. Windows1251 is the Windows 1251 encoding. Windows1252 is the Windows 1252 encoding. Windows1253 is the Windows 1253 encoding. Windows1254 is the Windows 1254 encoding. Windows1255 is the Windows 1255 encoding. Windows1256 is the Windows 1256 encoding. Windows1257 is the Windows 1257 encoding. Windows1258 is the Windows 1258 encoding. Windows874 is the Windows 874 encoding. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/japanese Package japanese provides Japanese encodings such as EUC-JP and Shift JIS. EUCJP is the EUC-JP encoding. ISO2022JP is the ISO-2022-JP encoding. ShiftJIS is the Shift JIS encoding, also known as Code Page 932 and Windows-31J. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/korean Package korean provides Korean encodings such as EUC-KR. EUCKR is the EUC-KR encoding, also known as Code Page 949. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/simplifiedchinese Package simplifiedchinese provides Simplified Chinese encodings such as GBK. HZGB2312 is the HZ-GB2312 encoding. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/traditionalchinese Package traditionalchinese provides Traditional Chinese encodings such as Big5. Big5 is the Big5 encoding, also known as Code Page 950. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/unicode Package unicode provides Unicode encodings such as UTF-16. UTF8 is the UTF-8 encoding. It neither removes nor adds byte order marks. UTF8BOM is an UTF-8 encoding where the decoder strips a leading byte order mark while the encoder adds one. UTF16 returns a UTF-16 Encoding for the given default endianness and byte order mark (BOM) policy. func UTF16(e Endianness, b BOMPolicy) encoding.Encoding https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/unicode/utf32 Package utf32 provides the UTF-32 Unicode encoding. UTF32 returns a UTF-32 Encoding for the given default endianness and byte order mark (BOM) policy. func UTF32(e Endianness, b BOMPolicy) encoding.Encoding https://go.dev/blog/matchlang Language and Locale Matching in Go The Go package https://golang.org/x/text/language implements the BCP 47 standard for language tags and adds support for deciding which language to use based on data published in the Unicode Common Locale Data Repository (CLDR). https://github.com/unicode-org/icu4x/issues/2882 https://cs.opensource.google/go/x/text The golang x-text library has re-implemented most of ICU from scratch, and some of their algorithms and data structures might be interesting for the icu4x project (afaik x-text was not just a port of the ICU codebase to another language, but an actual re-implementation). 
You might want to have a look at their code, or talk to @mpvl who wrote most of it. https://github.com/golang/go/blob/master/src/cmd/compile/internal/syntax/scanner.go Implementation of Golang’s lexer. An identifier is made up of letters and digits (the first character is always a letter), where a letter is any Unicode letter code point (the underscore also counts as a letter). package main import "fmt" func 隨機名稱() { fmt.Println("It works!") } func main() { 隨機名稱() źdźbło := 1 fmt.Println(źdźbło) } https://henvic.dev/posts/go-utf8/ UTF-8 strings with Go: len(s) isn't enough jlf: in his initial post, the guy was not aware of graphemes and it's only after feedback on Reddit that he added material about graphemes. https://github.com/rivo/uniseg Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go https://pkg.go.dev/github.com/rivo/uniseg#hdr-Monospace_Width Monospace Width Monospace width, as referred to in this package, is the width of a string in a monospace font. This package differs from wcswidth() in a number of ways, presumably to generate more visually pleasing results. Note that whether these widths appear correct depends on your application's render engine, to which extent it conforms to the Unicode Standard, and its choice of font. --- Rules implemented by uniseg: we assume that every code point has a width of 1, with the following exceptions: - Code points with grapheme cluster break properties Control, CR, LF, Extend, and ZWJ have a width of 0. - U+2E3A, Two-Em Dash, has a width of 3. - U+2E3B, Three-Em Dash, has a width of 4. - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide" (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both have a width of 1.) - Code points with grapheme cluster break property Regional Indicator have a width of 2. - Code points with grapheme cluster break property Extended Pictographic have a width of 2, unless their Emoji Presentation flag is "No", in which case the width is 1. - For Hangul grapheme clusters composed of conjoining Jamo and for Regional Indicators (flags), all code points except the first one have a width of 0. - For grapheme clusters starting with an Extended Pictographic, any additional code point will force a total width of 2, except if the Variation Selector-15 (U+FE0E) is included, in which case the total width is always 1. - Grapheme clusters ending with Variation Selector-16 (U+FE0F) have a width of 2. --- jlf: meh; in conclusion, there is no guarantee that the result will be good. --- uniseg.StringWidth("🇩🇪🏳️‍🌈!") -- uniseg returns 5 utf8proc: "🇩🇪🏳️‍🌈!"~text~unicodeCharacters~each("charWidth")= -- [ 1, 1, 1, 0, 0, 2, 1] "🇩🇪🏳️‍🌈!"~text~unicodeCharacters== an Array (shape [7], 7 items) 1 : ( "🇩" U+1F1E9 So 1 "REGIONAL INDICATOR SYMBOL LETTER D" ) 2 : ( "🇪" U+1F1EA So 1 "REGIONAL INDICATOR SYMBOL LETTER E" ) 3 : ( "🏳" U+1F3F3 So 1 "WAVING WHITE FLAG" ) 4 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) 5 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 6 : ( "🌈" U+1F308 So 2 "RAINBOW" ) 7 : ( "!" U+0021 Po 1 "EXCLAMATION MARK" )

jRuby lang


https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyString.java jlf: big file, more than 7000 lines. https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyEncoding.java https://github.com/jruby/jruby/blob/master/lib/ruby/stdlib/unicode_normalize/normalize.rb https://github.com/jruby/jruby/blob/master/spec/ruby/core/string/unicode_normalize_spec.rb

Java lang


https://docs.oracle.com/en/java/javase/ https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/text/BreakIterator.html java.text.BreakIterator The default implementation of the character boundary analysis conforms to the Unicode Consortium's Extended Grapheme Cluster breaks. For more detail, refer to the Grapheme Cluster Boundaries section in the Unicode Standard Annex #29. https://docs.oracle.com/en/java/javase/20/intl/internationalization-overview.html Internationalization Overview https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/ Java has supported Unicode since its first release and strings are internally represented using UTF-16 encoding. UTF-16 is a variable length encoding scheme. For characters that fit into 16 bits, it uses 2 bytes to represent them. For all other characters, it uses 4 bytes. For a character that requires more than 16 bits, like these emojis 👦👩, the char methods like someString.charAt(0) or someString.substring(0,1) will break and give you only half the code point. https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html When the specification for the Java language was created, the Unicode standard was accepted and the char primitive was defined as a 16-bit data type, with characters in the hexadecimal range from 0x0000 to 0xFFFF. Because 16-bit encoding supports 2^16 (65,536) characters, which is insufficient to define all characters in use throughout the world, the Unicode standard was extended to 0x10FFFF, which supports over one million characters. The definition of a character in the Java programming language could not be changed from 16 bits to 32 bits without causing millions of Java applications to no longer run properly. To correct the definition, a scheme was developed to handle characters that could not be encoded in 16 bits. The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values. https://openjdk.org/jeps/400 JEP 400: UTF-8 by Default A quick way to see the default charset of the current JDK is with the following command: java -XshowSettings:properties -version 2>&1 | grep file.encoding As envisaged by the specification of Charset.defaultCharset(), the JDK will allow the default charset to be configured to something other than UTF-8. java -Dfile.encoding=COMPAT the default charset will be the charset chosen by the algorithm in JDK 17 and earlier, based on the user's operating system, locale, and other factors. The value of file.encoding will be set to the name of that charset. java -Dfile.encoding=UTF-8 the default charset will be UTF-8. This no-op value is defined in order to preserve the behavior of existing command lines. The treatment of values other than "COMPAT" and "UTF-8" is not specified. They are not supported, but if such a value worked in JDK 17 then it will likely continue to work in JDK 18. https://www.baeldung.com/java-remove-accents-from-text Remove Accents and Diacritics From a String in Java - We will perform the compatibility decomposition represented as the Java enum NFKD, because it decomposes more ligatures than the canonical method (for example, the ligature “fi”). - We will remove all characters matching the Unicode Mark category using the \p{M} regex expression.
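A minimal sketch of the two steps just described (NFKD decomposition, then removal of the Mark category); the helper name stripDiacritics is hypothetical and not part of the Baeldung StringNormalizer class exercised by the tests below:

import java.text.Normalizer;

public class StripDiacritics {
    // Step 1: compatibility decomposition (NFKD) separates base letters from combining marks.
    // Step 2: remove every code point in the Unicode Mark category (\p{M}).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("āïń"));  // prints: ain
    }
}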
Test: assertEquals("\\u0066 \\u0069", StringNormalizer.unicodeValueOfNormalizedString("fi")); assertEquals("\\u0061 \\u0304", StringNormalizer.unicodeValueOfNormalizedString("ā")); assertEquals("\\u0069 \\u0308", StringNormalizer.unicodeValueOfNormalizedString("ï")); assertEquals("\\u006e \\u0301", StringNormalizer.unicodeValueOfNormalizedString("ń")); Compare Strings Including Accents Using Collator. Java provides four strength values for a Collator: PRIMARY: comparison omitting case and accents SECONDARY: comparison omitting case but including accents and diacritics TERTIARY: comparison including case and accents IDENTICAL: all differences are significant https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/io/DataInput.html#modified-utf-8 Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. - Characters in the range '\u0001' to '\u007F' are represented by a single byte. - The null character '\u0000' and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes. - Characters in the range '\u0800' to '\uFFFF' are represented by three bytes. The differences between this format and the standard UTF-8 format are the following: - The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. - Only the 1-byte, 2-byte, and 3-byte formats are used. - Supplementary characters are represented in the form of surrogate pairs. Decomposition of ligature In Java, you'll need to use the Normalizer class and the NFKC form: --- String ff ="\uFB00"; String normalized = Normalizer.normalize(ff, Form.NFKC); System.out.println(ff + " = " + normalized); --- This will print ff = ff https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16 You can confirm the following by looking at the source code of the relevant version of the java.lang.String class in OpenJDK. Prior to Java 9, the standard in-memory representation for a Java String is UTF-16 code-units held in a char[]. private final char value[]; With Java 6 update 21 and later, there was a non-standard option (-XX:UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7. For Java 9 and later, the implementation of String has been changed to use a compact representation by default. private final byte[] value; private final byte coder; // LATIN1 (0) or UTF16 (1) https://docs.oracle.com/en/java/javase/20/docs/specs/man/java.html#advanced-runtime-options-for-java -XX:-CompactStrings Disables the Compact Strings feature. By default, this option is enabled. When this option is enabled, Java Strings containing only single-byte characters are internally represented and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding. This reduces, by 50%, the amount of space required for Strings containing only single-byte characters. For Java Strings containing at least one multibyte character: these are represented and stored as 2 bytes per character using UTF-16 encoding. Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings. 
As of 2023, see JEP 254: Compact Strings https://openjdk.org/jeps/254 https://howtodoinjava.com/java9/compact-strings/ https://stackoverflow.com/questions/44178432/difference-between-compact-strings-and-compressed-strings-in-java-9 In Java 9 on the other hand, compact strings are fully integrated into the JDK source. String is always backed by byte[], where characters use one byte if they are Latin-1 and otherwise two. Most operations do a check to see which is the case, e.g. charAt: public char charAt(int index) { if (isLatin1()) { return StringLatin1.charAt(value, index); } else { return StringUTF16.charAt(value, index); } } Compact strings are enabled by default and can be partially disabled - "partially" because they are still backed by a byte[] and operations returning chars must still put them together from two separate bytes public int length() { return value.length >> coder(); } If our String is Latin1 only, coder is going to be zero, so length of value (the byte array) is the size of chars. For non-Latin1 divide by two. https://www.baeldung.com/java-string-encode-utf-8 Encoding With Core Java // First, we get the String bytes, and then we create a new one using the retrieved bytes and the desired charset: String rawString = "Entwickeln Sie mit Vergnügen"; byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8); String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8); assertEquals(rawString, utf8EncodedString); Encoding With Java 7 StandardCharsets // First, we'll encode the String into bytes, and second, we'll decode it into a UTF-8 String: String rawString = "Entwickeln Sie mit Vergnügen"; ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString); String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString(); assertEquals(rawString, utf8EncodedString); https://www.baeldung.com/java-string-to-byte-array Convert String to Byte Array and Reverse in Java Converting a String to Byte Array A String is stored as an array of Unicode characters in Java. To convert it to a byte array, we translate the sequence of characters into a sequence of bytes. For this translation, we use an instance of Charset. This class specifies a mapping between a sequence of chars and a sequence of bytes. We refer to the above process as encoding. Using String.getBytes() The String class provides three overloaded getBytes methods to encode a String into a byte array: - getBytes() – encodes using platform's default charset --- String inputString = "Hello World!"; byte[] byteArrray = inputString.getBytes(); --- The above method is platform-dependent, as it uses the platform's default charset. We can get this charset by calling Charset.defaultCharset(). - getBytes (String charsetName) – encodes using the named charset - getBytes (Charset charset) – encodes using the provided charset Using Charset.encode() The Charset class provides encode(), a convenient method that encodes Unicode characters into bytes. This method always replaces invalid input and unmappable-characters using the charset's default replacement byte array. --- String inputString = "Hello ਸੰਸਾਰ!"; Charset charset = StandardCharsets.US_ASCII; byte[] byteArrray = charset.encode(inputString).array(); --- CharsetEncoder CharsetEncoder transforms Unicode characters into a sequence of bytes for a given charset. Moreover, it provides fine-grained control over the encoding process. 
--- String inputString = "Hello ਸੰਸਾਰ!"; CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder(); encoder.onMalformedInput(CodingErrorAction.IGNORE) .onUnmappableCharacter(CodingErrorAction.REPLACE) .replaceWith(new byte[] { 0 }); byte[] byteArrray = encoder.encode(CharBuffer.wrap(inputString)).array(); --- Converting a Byte Array to String We refer to the process of converting a byte array to a String as decoding. Similar to encoding, this process requires a Charset. However, we can't just use any charset for decoding a byte array. In particular, we should use the charset that encoded the String into the byte array. https://retrocomputing.stackexchange.com/questions/26535/why-do-java-classfiles-and-jni-use-a-frankensteins-monster-encoding-crossin Why do Java classfiles (and JNI) use a "Frankenstein's monster" encoding crossing UTF-8 and UTF-16? jlf: interesting for the history. If I understand correctly, Java uses the CESU-8 encoding to store strings in classfiles and JNI payloads. https://en.wikipedia.org/wiki/CESU-8 CESU-8 = Compatibility Encoding Scheme for UTF-16: 8-Bit - A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8 - A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four. CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange. Supporting CESU-8 in HTML documents is prohibited by the W3C and WHATWG HTML standards, as it would present a cross-site scripting vulnerability. What is the level of support of surrogates? Java.lang.Character.isSurrogatePair() Java.lang.Character.toCodePoint(char high, char low) : int String.codePointAt() Character.codePointAt() http://hauchee.blogspot.com/2015/05/surrogate-characters-mechanism.html Neither String or StringBuilder working properly. To avoid the issue above, use java.text.BreakIterator to determine the correct position. jlf: the code below show how to pass from logical position to real position. public static void main(String[] args) { String text = "a\uD834\uDD60s\uD834\uDD60\uD834\uDD60©₂"; // text: a텠s텠텠©₂ int startIndex = 2; int endIndex = 5; BreakIterator charIterator = BreakIterator.getCharacterInstance(); System.out.println( subString(charIterator, text, startIndex, endIndex)); // output: s텠텠 } private static String subString(BreakIterator charIterator, String target, int start, int end) { int realStart = 0; int realEnd = 0; charIterator.setText(target); int boundary = charIterator.first(); int i = 0; while (boundary != BreakIterator.DONE) { if (i == start) { realStart = boundary; } if (i == end) { realEnd = boundary; break; } boundary = charIterator.next(); i++; } return target.substring(realStart, realEnd); } https://github.com/s-u/rJava/issues/51 R to Java interface Error on UTF-16 surrogate pairs Java uses UTF-16 internally and encodes Unicode characters above U+FFFF with surrogate pairs. When strings containing such characters are converted to UTF-8 by rJava they are encoded as a pair of 3 byte sequences rather than as the correct 4 byte sequence. 
This is not valid UTF-8 and will result in "invalid multibyte string" errors. https://www.unicode.org/faq/utf_bom.html#utf8-4 https://bugs.openjdk.org/browse/JDK-8291660 https://youtrack.jetbrains.com/issue/IDEA-197555 \b{g} not supported in regex In the docs for java.util.regex.Pattern (https://docs.oracle.com/javase/10/docs/api/java/util/regex/Pattern.html): \b{g} is listed under the “Boundary matchers” section: “\b{g} A Unicode extended grapheme cluster boundary” https://www.reddit.com/r/LanguageTechnology/comments/af0ice/seeking_lightweight_java_graphemetophoneme_g2p/ Seeking lightweight Java grapheme-to-phoneme (G2P) model Succeeded at getting jg2p working. It's doing pretty well in terms of pronunciation quality but the model is very large for an Android app and takes forever to load. https://github.com/steveash/jg2p/ jg2p Java implementation of a general grapheme to phoneme toolkit using a pipeline of CRFs, a log-loss re-ranker, and a joint "graphone" language model. https://horstmann.com/unblog/2023-10-03/index.html Stop Using char in Java. And Code Points jlf: moderately interesting... jlf: idem for the related HN comments https://news.ycombinator.com/item?id=37822967 Since Java 20, there is a way of iterating over the grapheme clusters of a string, using the BreakIterator class from Java 1.1. String s = "Ciao 🇮🇹!"; BreakIterator iter = BreakIterator.getCharacterInstance(); iter.setText(s); int start = iter.first(); int end = iter.next(); while (end != BreakIterator.DONE) { String gc = s.substring(start, end); start = end; end = iter.next(); process(gc); } Here is a much simpler way, clearly not as efficient. I was stunned to find out that this worked since Java 9! s.split("\\b{g}"); // An array with elements "C", "i", "a", "o", " ", "🇮🇹", "!" Or, to get a stream: Pattern.compile("\\X").matcher(s).results().map(MatchResult::group)
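To round off the surrogate-related APIs listed earlier (Character.isSurrogatePair(), Character.toCodePoint(), String.codePointAt()), here is a small self-contained sketch; it also illustrates the codeahoy claim above that charAt() only yields half of a supplementary character:

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "👦";  // U+1F466, a supplementary character, stored as a surrogate pair

        System.out.println(s.length());                       // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (code point)

        char high = s.charAt(0);  // 0xD83D, only half of the character
        char low  = s.charAt(1);  // 0xDC66

        System.out.println(Character.isSurrogatePair(high, low));       // true
        System.out.printf("U+%X%n", Character.toCodePoint(high, low));  // U+1F466
        System.out.printf("U+%X%n", s.codePointAt(0));                  // U+1F466
    }
}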

JavaScript lang


https://certitude.consulting/blog/en/invisible-backdoor/ THE INVISIBLE JAVASCRIPT BACKDOOR https://www.npmjs.com/package/tty-strings A one stop shop for working with text displayed in the terminal. The goal of this project is to alleviate the headache of working with Javascript's internal representation of unicode characters, particularly within the context of displaying text in the terminal for command line applications. --- jlf tag: character width https://github.com/foliojs/linebreak A JS implementation of the Unicode Line Breaking Algorithm (UAX #14) It is used by PDFKit (https://github.com/foliojs/pdfkit) for line wrapping text in PDF documents. https://github.com/codebox/homoglyph A big list of homoglyphs and some code to detect them

Julia lang


Remember: search in issues with "utf8proc in:title,body" https://bkamins.github.io/julialang/2020/08/13/strings.html The String, or There and Back Again https://docs.julialang.org/en/v1/manual/strings/ You can input any Unicode character in single quotes using \u followed by up to four hexadecimal digits or \U followed by up to eight hexadecimal digits (the longest valid value only requires six): julia> '\u0' '\0': ASCII/Unicode U+0000 (category Cc: Other, control) julia> '\u78' 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase) julia> '\u2200' '∀': Unicode U+2200 (category Sm: Symbol, math) julia> '\U10ffff' '\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned) julia> s = "\u2200 x \u2203 y" "∀ x ∃ y" https://docs.julialang.org/en/v1/base/strings/ jlf: search for "ß" in this page with Chrome, you will see it matches with "ss" It doesn't match the β here: isless("β", "α") "β"~text~characters= -- ( "β" U+03B2 Ll 1 "GREEK SMALL LETTER BETA" ) https://juliapackages.com/p/strs jlf: the string implemention of Scott P Jones Seems quiet since last year... This uses Swift-style \ escape sequences, such as \u{xxxx} for Unicode constants, instead of \uXXXX and \UXXXXXXXX, which have the advantage of not having to worry about some digit or letter A-F or a-f occurring after the last hex digit of the Unicode constant. It also means that $, a very common character for LaTeX strings or output of currencies, does not need to be in a string quoted as '$' It uses \(expr) for interpolation like Swift, instead of $name or $(expr), which also has the advantage of not having to worry about the next character in the string someday being allowed in a name. It allows for embedding Unicode characters using a variety of easy to remember names, instead of hex codes: \:emojiname: \<latexname> \N{unicodename} \&htmlname; Examples of this are: f"\<dagger> \&yen; \N{ACCOUNT OF} \:snake:", which returns the string: "† ¥ ℀ 🐍 " https://discourse.julialang.org/t/stupid-question-on-unicode/27674/7 Discussion about escape sequence https://docs.julialang.org/en/v1/stdlib/Unicode/ Unicode.julia_chartransform(c::Union{Char,Integer}) Unicode.isassigned(c) -> Bool isequal_normalized(s1::AbstractString, s2::AbstractString; casefold=false, stripmark=false, chartransform=identity) Unicode.normalize(s::AbstractString; keywords...) boolean keywords options (which all default to false except for compose) - compose=false: do not perform canonical composition - decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present) - compat=true: compatibility equivalents are canonicalized - casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison - newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively - stripmark=true: strip diacritical marks (e.g. accents) - stripignore=true: strip Unicode's "default ignorable" characters (e.g. 
the soft hyphen or the left-to-right marker) - stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified - rejectna=true: throw an error if unassigned code points are found - stable=true: enforce Unicode versioning stability (never introduce characters missing from earlier Unicode versions) Unicode.normalize(s::AbstractString, normalform::Symbol) normalform can be :NFC, :NFD, :NFKC, or :NFKD. utf8proc doesn't support language-sensitive case-folding. Julia, which uses utf8proc, has decided to remain locale-independent. See https://github.com/JuliaLang/julia/issues/7848 https://github.com/JuliaLang/julia/pull/42493 This PR adds a function isequal_normalized to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks). https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/13 julia> '\ub5' 'µ': Unicode U+00b5 (category Ll: Letter, lowercase) julia> '\uff' 'ÿ': Unicode U+00ff (category Ll: Letter, lowercase) julia> Base.Unicode.uppercase("ÿ")[1] 'Ÿ': Unicode U+0178 (category Lu: Letter, uppercase) julia> Base.Unicode.uppercase("µ")[1] 'Μ': Unicode U+039c (category Lu: Letter, uppercase) jlf: I find the next thread interesting from a social point of view... https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/40 Yet another Stefan Karpinski against Scott P Jones... https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/42 jlf: helping Scott P Jones https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/46 Referencing https://github.com/JuliaLang/julia/pull/25021 https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/72 jlf: Stefan Karpinski not happy https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/79 jlf: Stefan Karpinski not happy https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/88 jlf: Scott P Jones not happy https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/130 Using a hybrid encoding like Python 3’s strings or @ScottPJones’s UniStr means that not only do you need to look at every byte of incoming data, but you also have to transcode it in general. This is a total performance nightmare for dealing with large text files. https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/133 jlf: Interesting points of Stefan Karpinski regarding the validation of strings. https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/138 jlf: not sure if Scott P Jones says that graphemes are not needed... https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/144 jlf: revolt! https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/148 jlf: "This is a plea for the thread to stop."
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/154 jlf: very upset guy https://github.com/JuliaLang/julia/pull/25021 Move Unicode-related functions to new Unicode stdlib package jlf: nothing interesting in the comments, but this is this PR that Scott P Jones describes as a bomb. https://github.com/JuliaLang/julia/pull/19469#issuecomment-264810748 AFAICT does the currently implemented lowercase also not follow the spec. I do not know anything about Turkish but the following behaviour in Greek julia> lowercase("OΔΥΣΣΕΥΣ") "oδυσσευσ" # wrong "oδυσσευς" # would be correct is wrong, i.e. the lowercase sigma at the end is the non-final form σ but should be the final form ς instead. https://github.com/JuliaStrings/utf8proc/issues/54 Feature request: Full Case Folding #54 opened in 2015, still opened in 2022 jlf: related to utf8proc --- https://github.com/JuliaStrings/utf8proc/issues/54#issuecomment-141545196 our case is to make a perfect search in MAPS.ME :) In general, we need to preprocess a lot of raw strings added by community of OpenStreetMap, and match these strings effectively on mobile device, for any language and any input. This includes stripping out all diacritics, full case folding, and even some special conversions which are not covered in Unicode standard but are important for users trying to find something. I've already mentioned ß=>ss conversion, there are also non-standard Ł=>L, й=>и, famous turkish İ and ı conversions, all very important if you don't have a Ł key on your keyboard, for example, and trying to enter it as L (and find some Polish street for example). Now we have our own highly-optimized implementation for NFKD and Case Folding. --- jlf: made a search in https://github.com/mapsme/omim, but could not find where they handle Ł=>L Found NormalizeAndSimplifyString, but it doesn't simplify Ł=>L. https://github.com/JuliaStrings/utf8proc/pull/102 Fixes allowing for “Full” folding and NFKC_CaseFold compliance. #102 --- jlf: this is the creation of the function NFKC_Casefold in utf8proc --- https://github.com/JuliaStrings/utf8proc/pull/133 Case folding fixes #133 Updated version of #102: Restores the original behavior of IGNORE so that this PR is non-breaking, adds new STRIPNA flag. Renames the new function to utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF Adds a minimal test. Updates the utf8proc_data.c file. jlf: this explains why the the options in utf8proc are like that. jlf: "NFKC_CF" seems the name to search to get useful infos about utf8proc_NFKC_Casefold. https://unicode-org.github.io/icu/userguide/transforms/normalization/ NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and removing ignorable characters which was introduced with Unicode 5.2. https://docs.tibco.com/pub/enterprise-runtime-for-R/5.0.0/doc/html/Language_Reference/terrUtils/normalizeUnicode.html normalizeUnicode(x, form = "NCF") form: a character string specifying the type of Unicode normalization to be used. Should be one of the strings "NFC", "NFD", "NFKC", "NFKD", "NFKC_CF" or "NFKC_Casefold". The forms "NFKC_CF" or "NFKC_Casefold" (which are equivalent) are described in https://www.unicode.org/reports/tr31/. https://www.lanqiao.cn/library/elasticsearch-definitive-guide-cn/220_Token_normalization/40_Case_folding Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons. 
jlf: they say "The default normalization form that the icu_normalizer token filter uses is nfkc_cf" https://github.com/JuliaLang/julia/issues/52408 isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false --- jlf: see the comments and new code --- This strings is really not well supported by bbedit! "בְּ"~text~unicodeCharacters== an Array (shape [3], 3 items) 1 : ( "ב" U+05D1 Lo 1 "HEBREW LETTER BET" ) 2 : ( "ּ" U+05BC Mn 0 "HEBREW POINT DAGESH OR MAPIQ" ) 3 : ( "ְ" U+05B0 Mn 0 "HEBREW POINT SHEVA" ) "בְּ"~text~c2x= -- D791 D6BC D6B0 "בְּ"~text~nfc~c2x= -- D791 D6B0 D6BC https://github.com/JuliaStrings/utf8proc/issues/257 normalization does not commute with case-folding? julia> using Unicode: normalize julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c" "J" julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true) false julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true) false Not sure if this is a bug or just a weird behavior of Unicode. --- I get something similar in Python 3: >>> import unicodedata >>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c" >>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold() False >>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold()) False So I guess this is a weird quirk of Unicode? --- Executor idem: s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"~text~unescape s~nfc(casefold:.true) == s~nfc~nfc(casefold:.true)= -- 0 s~nfc(casefold:.true)~c2x= -- 6A E0BB88 E0BDB2 CEB9 CCB4 D6B0 D6BB D6BF F09D85AD CC95 CD9C s~nfc~nfc(casefold:.true)~c2x= -- 6A CCB4 D6B0 D6BB D6BF E0BB88 E0BDB2 F09D85AD CC95 CD9C CEB9 s~nfc(casefold:.true)~nfc == s~nfc~nfc(casefold:.true)= -- 0 s~nfc(casefold:.true)~nfc~c2x= -- 6A E0BB88 E0BDB2 CEB9 CCB4 D6B0 D6BB D6BF F09D85AD CC95 CD9C s~nfc~nfc(casefold:.true)~c2x= -- 6A CCB4 D6B0 D6BB D6BF E0BB88 E0BDB2 F09D85AD CC95 CD9C CEB9 https://github.com/JuliaStrings/utf8proc/issues/101#issuecomment-1876151702 jlf: maybe this example of Julia code could be useful for Executor? function _isequal_normalized! > I agree that a fast case-folded/normalized comparison function that requires > no buffers seems possible to write and could be useful, even for Julia; Note that such a function was implemented in Julia, and could be ported to C: https://github.com/JuliaLang/julia/blob/0f6c72c71bc947282ae18715c09f93a22828aab7/stdlib/Unicode/src/Unicode.jl#L268-L340

Kotlin lang


https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/ https://github.com/JetBrains/kotlin/tree/master/libraries/stdlib/jvm/src/kotlin/text

Lisp lang


14/09/2021 https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html name Corresponds to the Name Unicode property. The value is a string consisting of upper-case Latin letters A to Z, digits, spaces, and hyphen ‘-’ characters. For unassigned codepoints, the value is nil. general-category Corresponds to the General_Category Unicode property. The value is a symbol whose name is a 2-letter abbreviation of the character’s classification. For unassigned codepoints, the value is Cn. canonical-combining-class Corresponds to the Canonical_Combining_Class Unicode property. The value is an integer. For unassigned codepoints, the value is zero. bidi-class Corresponds to the Unicode Bidi_Class property. The value is a symbol whose name is the Unicode directional type of the character. Emacs uses this property when it reorders bidirectional text for display (see Bidirectional Display). For unassigned codepoints, the value depends on the code blocks to which the codepoint belongs: most unassigned codepoints get the value of L (strong L), but some get values of AL (Arabic letter) or R (strong R). decomposition Corresponds to the Unicode properties Decomposition_Type and Decomposition_Value. The value is a list, whose first element may be a symbol representing a compatibility formatting tag, such as small18; the other elements are characters that give the compatibility decomposition sequence of this character. For characters that don’t have decomposition sequences, and for unassigned codepoints, the value is a list with a single member, the character itself. decimal-digit-value Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Decimal’. The value is an integer, or nil if the character has no decimal digit value. For unassigned codepoints, the value is nil, which means NaN, or “not a number”. digit-value Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Digit’. The value is an integer. Examples of such characters include compatibility subscript and superscript digits, for which the value is the corresponding number. For characters that don’t have any numeric value, and for unassigned codepoints, the value is nil, which means NaN. numeric-value Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Numeric’. The value of this property is a number. Examples of characters that have this property include fractions, subscripts, superscripts, Roman numerals, currency numerators, and encircled numbers. For example, the value of this property for the character U+2155 VULGAR FRACTION ONE FIFTH is 0.2. For characters that don’t have any numeric value, and for unassigned codepoints, the value is nil, which means NaN. mirrored Corresponds to the Unicode Bidi_Mirrored property. The value of this property is a symbol, either Y or N. For unassigned codepoints, the value is N. mirroring Corresponds to the Unicode Bidi_Mirroring_Glyph property. The value of this property is a character whose glyph represents the mirror image of the character’s glyph, or nil if there’s no defined mirroring glyph. All the characters whose mirrored property is N have nil as their mirroring property; however, some characters whose mirrored property is Y also have nil for mirroring, because no appropriate characters exist with mirrored glyphs. Emacs uses this property to display mirror images of characters when appropriate (see Bidirectional Display). For unassigned codepoints, the value is nil. 
paired-bracket Corresponds to the Unicode Bidi_Paired_Bracket property. The value of this property is the codepoint of a character’s paired bracket, or nil if the character is not a bracket character. This establishes a mapping between characters that are treated as bracket pairs by the Unicode Bidirectional Algorithm; Emacs uses this property when it decides how to reorder for display parentheses, braces, and other similar characters (see Bidirectional Display). bracket-type Corresponds to the Unicode Bidi_Paired_Bracket_Type property. For characters whose paired-bracket property is non-nil, the value of this property is a symbol, either o (for opening bracket characters) or c (for closing bracket characters). For characters whose paired-bracket property is nil, the value is the symbol n (None). Like paired-bracket, this property is used for bidirectional display. old-name Corresponds to the Unicode Unicode_1_Name property. The value is a string. For unassigned codepoints, and characters that have no value for this property, the value is nil. iso-10646-comment Corresponds to the Unicode ISO_Comment property. The value is either a string or nil. For unassigned codepoints, the value is nil. uppercase Corresponds to the Unicode Simple_Uppercase_Mapping property. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself. lowercase Corresponds to the Unicode Simple_Lowercase_Mapping property. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself. titlecase Corresponds to the Unicode Simple_Titlecase_Mapping property. Title case is a special form of a character used when the first character of a word needs to be capitalized. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself. special-uppercase Corresponds to Unicode language- and context-independent special upper-casing rules. The value of this property is a string (which may be empty). For example mapping for U+00DF LATIN SMALL LETTER SHARP S is "SS". For characters with no special mapping, the value is nil which means uppercase property needs to be consulted instead. special-lowercase Corresponds to Unicode language- and context-independent special lower-casing rules. The value of this property is a string (which may be empty). For example mapping for U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE the value is "i\u0307" (i.e. 2-character string consisting of LATIN SMALL LETTER I followed by U+0307 COMBINING DOT ABOVE). For characters with no special mapping, the value is nil which means lowercase property needs to be consulted instead. special-titlecase Corresponds to Unicode unconditional special title-casing rules. The value of this property is a string (which may be empty). For example mapping for U+FB01 LATIN SMALL LIGATURE FI the value is "Fi". For characters with no special mapping, the value is nil which means titlecase property needs to be consulted instead.

Mathematica lang


https://www.youtube.com/watch?v=yiwLBvirm7A Live CEOing Ep 426: Language Design in Wolfram Language [Unicode Characters & WFR Suggestions] At the beginning, there are a few minutes about character properties. https://writings.stephenwolfram.com/2022/06/launching-version-13-1-of-wolfram-language-mathematica/#emojis-and-more-multilingual-support Launching Version 13.1 of Wolfram Language & Mathematica 🙀🤠🥳 Emojis! And More Multilingual Support Original 16-bit Unicode is “plane 0”. Now there are up to 16 additional planes. Not quite 32-bit characters, but given the way computers work, the approach now is to allow characters to be represented by 32-bit objects. It’s far from trivial to do that uniformly and efficiently. And for us it’s been a long process to upgrade everything in our system—from string manipulation to notebook rendering— to handle full 32-bit characters. And that’s finally been achieved in Version 13.1. --- You can have wolf and ram variables: In:= Expand[(🐺 + 🐏)^8] In:= Expand[(\|01f43a + \|01f40f)^8] Out= 🐏^8 + 8 🐏^7 🐺 + 28 🐏^6 🐺^2 + 56 🐏^5 🐺^3 + 70 🐏^4 🐺^4 + 56 🐏^3 🐺^5 + 28 🐏^2 🐺^6 + 8 🐏 🐺^7 + 🐺^8 The 🐏 sorts before the 🐺 because it happens to have a numerically smaller character code: In:= ToCharacterCode["🐺🐏"] In:= ToCharacterCode["\|01f43a\|01f40f"] Out= {128058, 128015} --- In:= Grid[Outer[StringJoin[#1, "\:200d", #2] & , {"👩", "👨"}, {"🔬", "🏫", "🎓", "🍳", "🚀", "🔧"}]] In:= Grid[Outer[StringJoin[#1, "\:200d", #2] & , {"\|01f469", "\|01f468"}, {"\|01f52c", "\|01f3eb", "\|01f393", "\|01f373", "\|01f680", "\|01f527"}]] Out= 👩‍🔬 👩‍🏫 👩‍🎓 👩‍🍳 👩‍🚀 👩‍🔧 👨‍🔬 👨‍🏫 👨‍🎓 👨‍🍳 👨‍🚀 👨‍🔧 --- No outer product in Executor, only element-wise operators ("👩", "👨")~each{(item || .Unicode["zero width joiner"]~text) || ("🔬", "🏫", "🎓", "🍳", "🚀", "🔧")}== an Array (shape [2], 2 items) 1 : [T'👩‍🔬',T'👩‍🏫',T'👩‍🎓',T'👩‍🍳',T'👩‍🚀',T'👩‍🔧'] 2 : [T'👨‍🔬',T'👨‍🏫',T'👨‍🎓',T'👨‍🍳',T'👨‍🚀',T'👨‍🔧'] --- In:= CharacterRange[74000, 74050] Out= {𒄐, 𒄑, 𒄒, 𒄓, 𒄔, 𒄕, 𒄖, 𒄗, 𒄘, 𒄙, 𒄚, 𒄛, 𒄜, 𒄝, 𒄞, 𒄟, 𒄠, 𒄡, 𒄢, 𒄣, > 𒄤, 𒄥, 𒄦, 𒄧, 𒄨, 𒄩, 𒄪, 𒄫, 𒄬, 𒄭, 𒄮, 𒄯, 𒄰, 𒄱, 𒄲, 𒄳, 𒄴, 𒄵, 𒄶, 𒄷, > 𒄸, 𒄹, 𒄺, 𒄻, 𒄼, 𒄽, 𒄾, 𒄿, 𒅀, 𒅁, 𒅂} In:= FromCharacterCode[{2361, 2367}] Out= हि In:= Characters["हि"] In:= Characters["\:0939\:093f"] Out= {ह, ि} In:= Characters["o\:0308"] Out= {o, ̈} In:= CharacterNormalize["o\:0308", "NFC"] Out= ö In:= ToCharacterCode[%] Out= {246}

netrexx lang


https://groups.io/g/netrexx/topic/93734685 Unicode Examples (this is not NetRexx, but this answer is useful for NetRexx) https://stackoverflow.com/questions/63410278/code-point-and-utf-16-code-units-are-the-same-thing code point and UTF-16 code units are the same thing? No, they are different. I know, MDN uses the rarely used "code units" term, which confuses people a lot. Code points are the number given to a Unicode element (character). This is independent of the encoding, and it can be as high as 0x10FFFF. UTF-32 code units are equivalent to Unicode code points (if you are using the correct endianness). Code units in UTF-16 are units of 16bit data. UTF-16 uses 1 or 2 code units to describe a code point, depending on its value. Code points below (or equal) to 0xFFFF (the old limit/expectation of Unicode that such numbers were enough to encode all characters) use just 1 code unit, and its value is the same as the code point. Unicode expanded the code point space, so now code points between 0x010000..0x10FFFF require 2 code units (and we use "surrogates" to encode such characters), 4 bytes total. So, code points are not the same as code units. For UTF-16, code units are 16bit long, and code points could be 1 or 2 code units. (this is JavaScript, but this answer is useful for NetRexx) https://exploringjs.com/impatient-js/ch_unicode.html#:~:text=Code%20units%20are%20numbers%20that,has%208%2Dbit%20code%20units. each UTF-16 code unit is always either a leading surrogate, a trailing surrogate, or encodes a BMP code point BMP = Basic Multilingual Plane (0x0000–0xFFFF) (this is JavaScript, but this answer is useful for NetRexx) https://www.w3schools.com/jsref/jsref_codepointat.asp#:~:text=Difference%20Between%20charCodeAt()%20and%20codePointAt()&text=charCodeAt()%20returns%20a%20number,value%20greather%200xFFFF%20(65535). Difference Between charCodeAt() and codePointAt() charCodeAt() is UTF-16, codePointAt() is Unicode. charCodeAt() returns a number between 0 and 65535. Both methods return an integer representing the UTF-16 code of a character, but only codePointAt() can return the full value of a Unicode value greater than 0xFFFF (65535). (this is Unicode, but this answer is useful for NetRexx) https://www.unicode.org/faq/utf_bom.html#:~:text=Surrogates%20are%20code%20points%20from,DC0016%20to%20DFFF16. What are surrogates? Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading surrogates, also called high surrogates, are encoded from D800 to DBFF, and trailing surrogates, or low surrogates, from DC00 to DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair. What is the difference between UCS-2 and UTF-16? UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters. Sometimes in the past an implementation has been labeled “UCS-2” to indicate that it does not support supplementary characters and doesn’t interpret pairs of surrogate code points as characters.
Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters, nor would it be able to support most emoji, for example. [AF] (this is Unicode, but this answer is useful for NetRexx) Unicode standard How the word "surrogate" is used surrogate pair surrogate code unit surrogate code point leading surrogate trailing surrogate high-surrogate code point high-surrogate code unit low-surrogate code point low-surrogate code unit Surrogates D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF. D72 High-surrogate code unit: A 16-bit code unit in the range D800 to DBFF, used in UTF-16 as the leading code unit of a surrogate pair. D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF. D74 Low-surrogate code unit: A 16-bit code unit in the range DC00 to DFFF, used in UTF-16 as the trailing code unit of a surrogate pair. UTF-16 In the UTF-16 encoding form, non-surrogate code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs. The values of the code units used for surrogate pairs are completely disjunct from the code units used for the single code unit representations, thus maintaining non-overlap for all code point representations in UTF-16. Code Points Unassigned to Abstract Characters C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. • The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character. (this is Java, but this answer is useful for NetRexx) https://stackoverflow.com/questions/39955169/which-encoding-does-java-uses-utf-8-or-utf-16/39957184#39957184 Which encoding does Java uses UTF-8 or UTF-16? Note that new String(bytes, StandardCharsets.UTF_16); does not "convert it to UTF-16 explicitly". This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding. You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion. Edit: in fact, Java 9 introduced just such a change in internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.
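(this is Java, but it may be useful for NetRexx) A minimal sketch of the point above: getBytes(Charset) and new String(byte[], Charset) convert to and from an external encoding, and say nothing about how the String is stored internally:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        String s = "€";  // U+20AC EURO SIGN

        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);     // E2 82 AC (3 bytes)
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);  // 20 AC (2 bytes, big-endian, no BOM)
        System.out.println(Arrays.toString(utf8));   // [-30, -126, -84]
        System.out.println(Arrays.toString(utf16));  // [32, -84]

        // The constructor decodes the given bytes *from* the given charset;
        // how the resulting String is stored internally is the JVM's business
        // (UTF-16, or Latin-1 with compact strings since Java 9).
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.equals(back));  // true
    }
}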
(this thread contains answers useful for NetRexx) https://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/ UCS vs UTF-8 as Internal String Encoding jlf: Very good introduction 100% applicable to NetRexx https://news.ycombinator.com/item?id=9618306 jlf: comments about the blog above. Interesting comments about the need or non-need to have direct access to code units or "characters" in constant time. --- Unicode provides 3 classes of grapheme clusters (legacy, extended and tailored) at least one of which (tailored) is locale-dependent (`ch` is a single tailored grapheme cluster under the Slovak locale, because it's the ch digraph). --- A text-editing control is thinking in terms of "grapheme clusters", not in terms of codepoints. jlf: not true. BBEdit works at codepoint level. 👩‍👨‍👩‍👧🎅 2 graphemes, 8 codepoints, 29 bytes c2x = 'F09F91A9 E2808D F09F91A8 E2808D F09F91A9 E2808D F09F91A7 F09F8E85' c2u = 'U+1F469 U+200D U+1F468 U+200D U+1F469 U+200D U+1F467 U+1F385' c2g = 'F09F91A9E2808DF09F91A8E2808DF09F91A9E2808DF09F91A7 F09F8E85' In BBEdit, I see 8 "characters" and can move the cursor between each "character". The ZERO WIDTH JOINER codepoints are visible. In VSCode, I see 2 "characters". --- What is the practicality of an unbounded number of possible graphemes? The standard itself doesn't mention any bounds but there is Unicode Standard Annex #15 - Unicode Normalization Forms which defines the Stream-Safe Text Format. UAX15-D3. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD. --- This sub-part of the thread is exactly what we are discussing for NetRexx https://news.ycombinator.com/item?id=9620112 --- jlf: next description is exactly what I do in the Executor prototype. What you really want is constant-time dereferencing of designators for semantically meaningful substrings. But no language AFAIK actually has that. The fundamental problem is that most languages have painted themselves into a corner by carving into stone the fact that strings can be dereferenced by integers. Once you've done that, you're pretty much screwed. It's not that you can't make it work, it's just that it requires an awful lot of machinery. You basically need to build an index for every string you construct, and that can get very expensive. Fixed-width representations are a less-than-perfect-but-still-not-entirely-unreasonable engineering solution to this problem. (this is Python, but this link could be useful for NetRexx c2x and x2c) https://docs.python.org/3/library/struct.html Interpret bytes as packed binary data
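(this is Java, but it may be useful for NetRexx) Going back to the 👩‍👨‍👩‍👧🎅 example above (2 graphemes, 8 codepoints, 29 bytes), a hedged sketch of how the three counts can be obtained; the grapheme count assumes a JDK whose BreakIterator follows extended grapheme clusters, which according to the Java section above is the case since JDK 20:

import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class GraphemeCounts {
    public static void main(String[] args) {
        String s = "👩‍👨‍👩‍👧🎅";

        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 29 bytes in UTF-8
        System.out.println(s.codePointCount(0, s.length()));            // 8 code points
        System.out.println(s.length());                                 // 13 UTF-16 code units

        // Grapheme clusters: should print 2 on a JDK with extended grapheme cluster support.
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) graphemes++;
        System.out.println(graphemes);
    }
}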

Oracle


https://docs.oracle.com/database/121/NLSPG/ch5lingsort.htm#NLSPG288 Database Globalization Support Guide 5 Linguistic Sorting and Matching Complex! Did not read it in detail; maybe I should... https://docs.oracle.com/database/121/NLSPG/ch6unicode.htm#NLSPG323 Database Globalization Support Guide 6 Supporting Multilingual Databases with Unicode https://docs.oracle.com/database/121/NLSPG/ch7progrunicode.htm#NLSPG346 Database Globalization Support Guide 7 Programming with Unicode

Perl lang (Perl 6 has been renamed to Raku)


https://swigunicode.wordpress.com/2021/10/18/example-post-3/ SWIG and Perl: Unicode C Library Part 1. Small Intro to SWIG https://swigunicode.wordpress.com/2021/10/22/part-2-c-header-file/ Part 2. C Header File https://swigunicode.wordpress.com/2021/10/24/part-3-c-source-file/ Part 3. C Source File https://swigunicode.wordpress.com/2021/10/25/part-4-perl-source-file/ Part 4. Perl Source File https://swigunicode.wordpress.com/2021/10/26/part-5-build-and-run-scripts/ Part 5. Build and Run Scripts https://swigunicode.wordpress.com/2021/10/27/part-6-swig-interface-file/ Part 6. SWIG Interface File https://lwn.net/Articles/667684/ An article about NFG. Unless one specifies otherwise, Perl 6 normalizes a text string to NFC; its default NFG representation (which gives each grapheme its own synthetic code point) is built on top of that.

PHP lang


https://github.com/nicolas-grekas/Patchwork-UTF8 Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP https://kunststube.net/encoding/ jlf: First, a general introduction to encoding. Then a focus on PHP. https://www.php.net/manual/en/function.iconv.php iconv — Convert a string from one character encoding to another iconv(string $from_encoding, string $to_encoding, string $string): string|false https://www.php.net/manual/en/book.mbstring.php Multibyte String replicates all important string functions in a multi-byte aware fashion. Because the mb_ functions now have to actually think about what they're doing, they need to know what encoding they're working on. Therefore every mb_ function accepts an $encoding parameter as well. Alternatively, this can be set globally for all mb_ functions using mb_internal_encoding. https://www.php.net/manual/en/mbstring.overload.php Warning This feature has been DEPRECATED as of PHP 7.2.0, and REMOVED as of PHP 8.0.0. Relying on this feature is highly discouraged. --- mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled. This feature makes it easy to port applications that only support single-byte encodings to a multibyte environment in many cases. --- jlf: the few user's comments are all negative. hum... this is one of the choices we foresee for Rexx. Bad idea? Example: "In short, only use mbstring.func_overload if you are 100% certain that nothing on your site relies on manipulating binary data in PHP." Search PHP souces: grapheme https://heap.space/search?project=PHP-8.2&full=grapheme&defs=&refs=&path=&hist=&type= https://news-web.php.net/group.php?group=php.i18n php.i18n Most recent: 08 Feb 2018 (?) https://news-web.php.net/php.i18n/1439 Unicode support with UString abstraction layer 21/03/2012 by Umberto Salsi jlf: no URL https://wiki.php.net/rfc/ustring UString is much quicker than mbstring thanks to the use of ICU. https://www.reddit.com/r/PHP/comments/2jvvol/rfc_ustring/ Low enthusiasm on reddit... https://github.com/krakjoe/ustring UnicodeString for PHP7 dead, last commit on Mar 17, 2016 https://github.com/nicolas-grekas/Patchwork-UTF8 Patchwork-UTF8 Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP Dead? last commit was on May 18, 2016 https://blog.internet-formation.fr/2022/08/nettoyer-et-remplacer-les-homographes-et-homoglyphes-dun-texte-en-php/ Nettoyer et remplacer les homographes (et homoglyphes) d’un texte en PHP

Python lang


https://github.com/dabeaz-course/practical-python/blob/master/Notes/01_Introduction/04_Strings.md Practical Python Programming. A course by David Beazley jlf:good introduction to Python strings. https://www.youtube.com/watch?v=Nfqh6lr3frQ The Guts of Unicode in Python Benjamin Peterson This talk will examine how Python's internal Unicode representation has changed from its introduction through the latest major changes in Python 3.3. jlf: not too long (28 min), good overview. 10/08/2021 List of Python PEPS related to string. https://www.python.org/dev/peps/ Other Informational PEPs I 257 Docstring Conventions Goodger, GvR I 287 reStructuredText Docstring Format Goodger Accepted PEPs (accepted; may not be implemented yet) SA 675 Arbitrary Literal String Type SA 686 Make UTF-8 mode default SA 701 Syntactic formalization of f-strings Open PEPs (under consideration) Finished PEPs (done, with a stable interface) SF 100 Python Unicode Integration Lemburg SF 260 Simplify xrange() GvR SF 261 Support for "wide" Unicode characters Prescod SF 263 Defining Python Source Code Encodings Lemburg, von Löwis SF 277 Unicode file name support for Windows NT Hodgson SF 278 Universal Newline Support Jansen SF 292 Simpler String Substitutions Warsaw SF 331 Locale-Independent Float/String Conversions Reis SF 383 Non-decodable Bytes in System Character Interfaces von Löwis SF 393 Flexible String Representation v. Löwis SF 414 Explicit Unicode Literal for Python 3.3 Ronacher, Coghlan SF 498 Literal String Interpolation Smith SF 515 Underscores in Numeric Literals Brandl, Storchaka SF 528 Change Windows console encoding to UTF-8 Dower SF 529 Change Windows filesystem encoding to UTF-8 Dower SF 538 Coercing the legacy C locale to a UTF-8 based locale Coghlan SF 540 Add a new UTF-8 Mode Stinner SF 597 Add optional EncodingWarning Naoki SF 616 String methods to remove prefixes and suffixes Sweeney SF 623 Remove wstr from Unicode Naoki SF 624 Remove Py_UNICODE encoder APIs Naoki SF 3101 Advanced String Formatting Talin SF 3112 Bytes literals in Python 3000 Orendorff SF 3120 Using UTF-8 as the default source encoding von Löwis SF 3127 Integer Literal Support and Syntax Maupin SF 3131 Supporting Non-ASCII Identifiers von Löwis SF 3137 Immutable Bytes and Mutable Buffer GvR SF 3138 String representation in Python 3000 Ishimoto Deferred PEPs (postponed pending further research or updates) SD 501 General purpose string interpolation Coghlan SD 536 Final Grammar for Literal String Interpolation Angerer SD 558 Defined semantics for locals() Coghlan Abandoned, Withdrawn, and Rejected PEPs SS 215 String Interpolation Yee IR 216 Docstring Format Zadka SR 224 Attribute Docstrings Lemburg SR 256 Docstring Processing System Framework Goodger SR 295 Interpretation of multiline string constants Koltsov SR 332 Byte vectors and String/Unicode Unification Montanaro SR 349 Allow str() to return unicode strings Schemenauer IR 502 String Interpolation - Extended Discussion Miller SR 3126 Remove Implicit String Concatenation Jewett, Hettinger 15/07/2021 review https://docs.python.org/3/howto/unicode.html Escape sequences in string literals "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name '\u0394' "\u0394" # Using a 16-bit hex value '\u0394' "\U00000394" # Using a 32-bit hex value '\u0394' One can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument. 
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), 'backslashreplace' (inserts a \xNN escape sequence). Examples: b'\x80abc'.decode("utf-8", "strict") # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0 b'\x80abc'.decode("utf-8", "replace") # '\ufffdabc' b'\x80abc'.decode("utf-8", "backslashreplace") # '\\x80abc' b'\x80abc'.decode("utf-8", "ignore") # 'abc' Encodings are specified as strings containing the encoding’s name. Python comes with roughly 100 different encodings: https://docs.python.org/3/library/codecs.html#standard-encodings One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point: chr(57344) # '\ue000' The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value: ord('\ue000') # 57344 The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding. The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. 'strict' (raise a UnicodeEncodeError exception), 'replace' inserts a question mark instead of the unencodable character, 'ignore' (just leave the character out of the encoded result), 'backslashreplace' (inserts a \uNNNN escape sequence), 'xmlcharrefreplace' (inserts an XML character reference), 'namereplace' (inserts a \N{...} escape sequence). Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: >>> s = "a\xac\u1234\u20ac\U00008000" ... # ^^^^ two-digit hex escape ... # ^^^^^^ four-digit Unicode escape ... # ^^^^^^^^^^ eight-digit Unicode escape >>> [ord(c) for c in s] [97, 172, 4660, 8364, 32768] Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: #!/usr/bin/env python # -*- coding: latin-1 -*- u = 'abcdé' https://www.python.org/dev/peps/pep-0263/ PEP 263 -- Defining Python Source Code Encodings Comparing Strings The casefold() string method converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter ‘ß’ (code point U+00DF), which becomes the pair of lowercase letters ‘ss’. >>> street = 'Gürzenichstraße' >>> street.casefold() 'gürzenichstrasse' The unicodedata module’s normalize() function converts strings to one of several normal forms: ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. def compare_strs(s1, s2): def NFD(s): return unicodedata.normalize('NFD', s) return NFD(s1) == NFD(s2) The Unicode Standard also specifies how to do caseless comparisons: def compare_caseless(s1, s2): def NFD(s): return unicodedata.normalize('NFD', s) return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold()) Why is NFD() invoked twice?
Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard https://docs.python.org/3/library/unicodedata.html unicodedata.lookup(name) Look up character by name. If a character with the given name is found, return the corresponding character. If not found, KeyError is raised. Changed in version 3.3: Support for name aliases 1 and named sequences 2 has been added. unicodedata.name(chr[, default]) Returns the name assigned to the character chr as a string. unicodedata.decimal(chr[, default]) Returns the decimal value assigned to the character chr as integer. unicodedata.digit(chr[, default]) Returns the digit value assigned to the character chr as integer. unicodedata.numeric(chr[, default]) Returns the numeric value assigned to the character chr as float. unicodedata.category(chr) Returns the general category assigned to the character chr as string. unicodedata.bidirectional(chr) Returns the bidirectional class assigned to the character chr as string. unicodedata.combining(chr) Returns the canonical combining class assigned to the character chr as integer. Returns 0 if no combining class is defined. unicodedata.east_asian_width(chr) Returns the east asian width assigned to the character chr as string. unicodedata.mirrored(chr) Returns the mirrored property assigned to the character chr as integer. Returns 1 if the character has been identified as a “mirrored” character in bidirectional text, 0 otherwise. unicodedata.decomposition(chr) Returns the character decomposition mapping assigned to the character chr as string. An empty string is returned in case no such mapping is defined. unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. unicodedata.is_normalized(form, unistr) Return whether the Unicode string unistr is in the normal form form. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. unicodedata.unidata_version The version of the Unicode database used in this module. unicodedata.ucd_3_2_0 This is an object that has the same methods as the entire module, but uses the Unicode database version 3.2 instead https://www.python.org/dev/peps/pep-0393/ PEP 393 -- Flexible String Representation When creating new strings, it was common in Python to start of with a heuristical buffer size, and then grow or shrink if the heuristics failed. With this PEP, this is now less practical, as you need not only a heuristics for the length of the string, but also for the maximum character. In order to avoid heuristics, you need to make two passes over the input: once to determine the output length, and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may sound expensive, it could actually be cheaper than having to copy the result again as in the following approach. If you take the heuristical route, avoid allocating a string meant to be resized, as resizing strings won't work for their canonical representation. Instead, allocate a separate buffer to collect the characters, and then construct a unicode object from that using PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer element, assuming for the worst case in character ordinals. This will allow for pointer arithmetics, but may require a lot of memory. 
Alternatively, start with a 1-byte buffer, and increase the element size as you encounter larger characters. In any case, PyUnicode_FromKindAndData will scan over the buffer to verify the maximum character. 15/07/2021 https://docs.python.org/3/library/codecs.html Codec registry and base classes Most standard codecs are text encodings, which encode text to bytes, but there are also codecs provided that encode text to text, and bytes to bytes. errors string argument: strict ignore replace xmlcharrefreplace backslashreplace namereplace surrogateescape surrogatepass 15/07/2021 https://discourse.julialang.org/t/a-python-rant-about-types/43294/22 A Python rant about types jlf: the main discussion is about invalid string data. Stefan Karpinski describes the Julia strings: 1. You can read and write any data, valid or not. 2. It is interpreted as UTF-8 where possible and as invalid characters otherwise. 3. You can simply check if strings or chars are valid UTF-8 or not. 4. You can work with individual characters easily, even invalid ones. 5. You can losslessly read and write any string data, valid or not, as strings or chars. 6. You only get an error when you try to ask for the code point of an invalid char. Most Julia code that works with strings is automatically robust with respect to invalid UTF-8 data. Only code that needs to look at the code points of individual characters will fail on invalid data; in order to do that robustly, you simply need to check if the character is valid before taking its code point and handle that appropriately. jlf: I think that all the Julia methods working at character level will raise an error, not just when looking at the code point. jlf: Stefan Karpinski explains why Python design is problematic. Python 3 has to be able to represent any input string in terms of code points. Needing to turn every string into a fixed-width sequence of code points puts them in a tough position with respect to invalid strings where there is simply no corresponding sequence of code points. 17/07/2021 https://groups.google.com/g/python-ideas/c/wStIS1_NVJQ Fix default encodings on Windows jlf: did not read in details, too long, too many feedbacks. Maybe some comments are interesting, so I save this URL. https://djangocas.dev/blog/python-unicode-string-lowercase-casefold-caseless-match/ Interesting infos about caseless matching https://gist.github.com/dpk/8325992 PyICU cheat sheet 10/05/2023 https://github.com/python/cpython/issues/56938 original URL before migration to github: https://bugs.python.org/issue12729 Python lib re cannot handle Unicode properly due to narrow/wide bug jlf: TODO not yet read, but seems interesting. I found this link thanks to https://news.ycombinator.com/item?id=9618306 (referenced in the NetRexx section) https://peps.python.org/pep-0414/ PEP 414 – Explicit Unicode Literal for Python 3.3 Specifically, the Python 3 definition for string literal prefixes will be expanded to allow: "u" | "U" in addition to the currently supported: "r" | "R" The following will all denote ordinary Python 3 strings: 'text' "text" '''text''' """text""" u'text' u"text" u'''text''' u"""text""" U'text' U"text" U'''text''' U"""text""" Types of string and their methods: string "H" "H"[0] # "H" unicode string u"H" u"H"[0] # "H" byte string b"H" b"H"[0] # 72 string of 8-bit bytes raw string r"H" r"H"[0] # "H" string literals with an uninterpreted backslash. f-string f"H" f"H"[0] # "H" string with formatted expression substitution. 
dir(""), dir(f""), dir(r"") dir(b"") ------------------------------------------------- __add__ __add__ __bytes__ __class__ __class__ __contains__ __contains__ __delattr__ __delattr__ __dir__ __dir__ __doc__ __doc__ __eq__ __eq__ __format__ __format__ __ge__ __ge__ __getattribute__ __getattribute__ __getitem__ __getitem__ __getnewargs__ __getnewargs__ __getstate__ __getstate__ __gt__ __gt__ __hash__ __hash__ __init__ __init__ __init_subclass__ __init_subclass__ __iter__ __iter__ __le__ __le__ __len__ __len__ __lt__ __lt__ __mod__ __mod__ __mul__ __mul__ __ne__ __ne__ __new__ __new__ __reduce__ __reduce__ __reduce_ex__ __reduce_ex__ __repr__ __repr__ __rmod__ __rmod__ __rmul__ __rmul__ __setattr__ __setattr__ __sizeof__ __sizeof__ __str__ __str__ __subclasshook__ __subclasshook__ capitalize capitalize casefold center center count count decode encode endswith endswith expandtabs expandtabs find find format format_map fromhex hex index index isalnum isalnum isalpha isalpha isascii isascii isdecimal isdigit isdigit isidentifier islower islower isnumeric isprintable isspace isspace istitle istitle isupper isupper join join ljust ljust lower lower lstrip lstrip maketrans maketrans partition partition removeprefix removeprefix removesuffix removesuffix replace replace rfind rfind rindex rindex rjust rjust rpartition rpartition rsplit rsplit rstrip rstrip split split splitlines splitlines startswith startswith strip strip swapcase swapcase title title translate translate upper upper zfill zfill https://stackoverflow.com/questions/72371202/unicodedecodeerror-utf-8-codec-cant-decode-byte-0x97-in-position-3118-inval UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte Simple text file [duplicate] (jlf: just keeping a note for the example) It seems like the file is not encoded in utf-8. Could you try open the file using io.open with latin-1 encoding instead? https://docs.python.org/3/library/functions.html#open --- (example) from textblob import TextBlob import io with io.open("positive.txt", encoding='latin-1') as f: for line in f.read().split('\n'): # do what you want with line --- https://github.com/life4/textdistance Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage. Reimplemented in Rust by the same author: https://github.com/life4/textdistance.rs Testing the JMB's example "ς".upper() # 'Σ' "σ".upper() # 'Σ' "ὈΔΥΣΣΕΎΣ".lower() # 'ὀδυσσεύς' last Σ becomes ς "ὈΔΥΣΣΕΎΣA".lower() # 'ὀδυσσεύσa' last Σ becomes σ # Humm... the concatenation doesn't change ς to σ "ὈΔΥΣΣΕΎΣ".lower() + "A" # 'ὀδυσσεύςA' ("ὈΔΥΣΣΕΎΣ".lower() + "A").upper() # 'ὈΔΥΣΣΕΎΣA' ("ὈΔΥΣΣΕΎΣ".lower() + "A").upper().lower() # 'ὀδυσσεύσa' https://news.ycombinator.com/item?id=33984308 The History and rationale of the Python 3 Unicode model for the operating system (vstinner.github.io) jlf: HN comments about this old blog https://vstinner.github.io/python30-listdir-undecodable-filenames.html https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h (search "Unicode Type") CPython source code of Unicode string This URL comes from https://blog.vito.nyc/posts/gil-balm/ Fast string construction for CPython extensions https://python.developpez.com/tutoriels/plonger-au-coeur-de-python/?page=chapitre-4-moins-strings jlf: todo read (french) Translation from english, could not find the original article.
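Runnable recap of the HOWTO's caseless comparison quoted above (compare_caseless with the double NFD); only the standard library is needed, the test strings are added here as examples.
import unicodedata
def compare_caseless(s1, s2):
    nfd = lambda s: unicodedata.normalize("NFD", s)
    # casefold() may return a non-normalized string (Unicode section 3.13), hence the outer NFD
    return nfd(nfd(s1).casefold()) == nfd(nfd(s2).casefold())
compare_caseless("Gürzenichstraße", "GÜRZENICHSTRASSE")   # True: ß casefolds to ss
compare_caseless("noël", "noe\u0308l")                    # True: composed vs decomposed ë
"ὈΔΥΣΣΕΎΣ".lower(), "ὈΔΥΣΣΕΎΣ".casefold()                 # ('ὀδυσσεύς', 'ὀδυσσεύσ') lower keeps final ς, casefold maps it to σ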

R lang


https://stringi.gagolewski.com/index.html stringi: Fast and Portable Character String Processing in R stringi (pronounced “stringy”, IPA [strinɡi]) is THE R package for very fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding. Thanks to ICU, stringi fully supports a wide range of Unicode standards. Paper (PDF): https://www.jstatsoft.org/index.php/jss/article/view/v103i02/4324 https://github.com/gagolews/stringi Fast and Portable Character String Processing in R (with the Unicode ICU)
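stringi gets its collation from ICU; as a cross-language reference point, the same ICU collator is reachable from Python through PyICU (see the PyICU cheat sheet in the Python section). A minimal sketch, assuming PyICU is installed; it reproduces the sort/collate contrast shown in the Raku section below.
from icu import Collator, Locale
words = ["a", "b", "c", "D", "E", "F"]
sorted(words)                                    # ['D', 'E', 'F', 'a', 'b', 'c'] code point order
coll = Collator.createInstance(Locale("en_US"))
sorted(words, key=coll.getSortKey)               # ['a', 'b', 'c', 'D', 'E', 'F'] UCA order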

RAKU lang Rakudo lang (Perl6, Perl 6, MOAR-VM)


https://raku-advent.blog/2022/12/23/sigils-2/ jlf: not related to unicode, but good for general culture. A sigil is any non-alphabetic character that’s used at the front of a word, and that conveys meta information about the word. For example, hashtags are a sigil: the  #  in  #nofilter  is a sigil that communicates that “nofilter” is a tag (not a regular word of text). The Raku programming language uses sigils to mark its variables; Raku has four sigils: @  (normally associated with arrays), can only be used for types that implement the  Positional  (“array-like”) role %  (normally associated with hashes), can only be used for types that implement the  Associative  (“hash-like”) role &  (normally associated with functions) can only be used for types that implement the  Callable  (“function-like”) role $  (for other variables, such as numbers and strings). https://dev.to/lizmat/series/24075 Migrating Perl to Raku Series' Articles jlf: not related to unicode, but good for general culture. http://docs.p6c.org/routine.html Raku Routines This is a list of all built-in routines that are documented here as part of the Raku language. jlf: not related to unicode, but good for general culture. https://www.learningraku.com/2016/11/26/quick-tip-11-number-strings-and-numberstring-allomorphs/ Quick Tip #11: Number, Strings, and NumberString Allomorphs jlf: maybe the same as ooRexx string numbers? https://docs.raku.org/type/Stringy String or object that can act as a string (role) https://rakudocs.github.io/type/Allomorph Dual value number and string (class) https://docs.raku.org/type/IntStr Dual value integer and string (class) https://docs.raku.org/type/RatStr Dual value rational number and string (class) https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc MoarVM string documentation. jlf: little intro, no detailled API. https://docs.raku.org/type/Str class Str Built-in class for strings. Objects of type Str are immutable. https://docs.raku.org/type/Uni class Uni A string of Unicode codepoints Unlike Str, which is made of Grapheme clusters, Uni is string strictly made of Unicode codepoints. That is, base characters and combining characters are separate elements of a Uni instance. Uni presents itself with a list-like interface of integer Codepoints. Typical usage of Uni is through one of its subclasses, NFC, NFD, NFKD and NFKC, which represent strings in one of the Unicode Normalization Forms of the same name. https://course.raku.org/essentials/strings/string-concatenation/ String concatenation jlf: strange... the concatenation is not described in the doc of Str. In Raku, you concatenate strings using concatenation operator. This operator is a tilde: ~. my $greeting = 'Hello, '; my $who = 'World!'; say $greeting ~ $who; Concatenation with assignment $str = $str ~ $another-str; $str ~= $another-str; https://www.codesections.com/blog/raku-unicode/ A deep dive into Raku's Unicode support Grepping for "Unicode Character Database" brings us to unicode_db.c. https://github.com/MoarVM/MoarVM/blob/master/src/strings/unicode_db.c 29/05/2021 http://moarvm.com/releases.html 2017.07 Greatly reduce the cases when string concatenation needs renormalization Use normalize_should_break to decide if concat needs normalization Rename should_break to MVM_unicode_normalize_should_break Fix memory leak in MVM_nfg_is_concat_stable If both last_a and first_b during concat are non-0 CCC, re-NFG --> maybe to review : the last sentence seems to be an optimization of concatenation. 
2017.02 Implement support for synthetic graphemes in MVM_unicode_string_compare Implement configurable collation_mode for MVM_unicode_string_compare 2017.01 Add a new unicmp_s op, which compares using the Unicode Collation Algorithm Add support for Grapheme_Cluster_Break=Prepend from Unicode 9.0 Add a script to download the latest version of all of the Unicode data --> should review this script 2015.11 NFG now uses Unicode Grapheme Cluster algorithm; "\r\n" is now one grapheme --> ??? [later] ah, I had a bug! Was not analyzing an UTF-8 ASCII string... Now fixed: "0A0D"x~text~description= -- UTF-8 ASCII ( 2 graphemes, 2 codepoints, 2 bytes ) "0D0A"x~text~description= -- UTF-8 ASCII ( 1 grapheme, 2 codepoints, 2 bytes ) 29/05/2021 https://news.ycombinator.com/item?id=26591373 String length functions for single emoji characters evaluate to greater than 1 --> to check : MOAR VM really concatenate a 8bit string with a 32bit string using a string concatenation object ? You could do it the way Raku does. It's implementation defined. (Rakudo on MoarVM) The way MoarVM does it is that it does NFG, which is sort of like NFC except that it stores grapheme clusters as if they were negative codepoints. If a string is ASCII it uses an 8bit storage format, otherwise it uses a 32bit one. It also creates a tree of immutable string objects. If you do a substring operation it creates a substring object that points at an existing string object. If you combine two strings it creates a string concatenation object. Which is useful for combining an 8bit string with a 32bit one. All of that is completely opaque at the Raku level of course. my $str = "\c[FACE PALM, EMOJI MODIFIER FITZPATRICK TYPE-3, ZWJ, MALE SIGN, VARIATION SELECTOR-16]"; say $str.chars; # 1 say $str.codes; # 5 say $str.encode('utf16').elems; # 7 say $str.encode('utf16').bytes; # 14 say $str.encode.elems; # 17 say $str.encode.bytes; # 17 say $str.codes * 4; # 20 #(utf32 encode/decode isn't implemented in MoarVM yet) say for $str.uninames; # FACE PALM # EMOJI MODIFIER FITZPATRICK TYPE-3 # ZERO WIDTH JOINER # MALE SIGN # VARIATION SELECTOR-16 The reason we have utf8-c8 encode/decode is because filenames, usernames, and passwords are not actually Unicode. (I have 4 files all named rèsumè in the same folder on my computer.) utf8-c8 uses the same synthetic codepoint system as grapheme clusters. https://andrewshitov.com/2018/10/31/unicode-in-perl-6/ Unicode in Raku https://docs.raku.org/language/unicode Raku applies normalization by default to all input and output except for file names, which are read and written as UTF8-C8 UTF-8 Clean-8 is an encoder/decoder that primarily works as the UTF-8 one. However, upon encountering a byte sequence that will either not decode as valid UTF-8, or that would not round-trip due to normalization, it will use NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they originally existed. https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc Strings in MoarVM Strands Strands are a type of MVMString which instead of being a flat string with contiguous data, actually contains references to other strings. Strands are created during concatenation or substring operations. When two flat strings are concatenated together, a Strand with references to both string a and string b is created. 
If string a and string b were strands themselves, the references of string a and references of string b are copied one after another into the Strand. Synthetic’s Synthetics are graphemes which contain multiple codepoints. In MoarVM these are stored and accessed using a trie, while the actual data itself stores the base character seprately and then the combiners are stored in an array. Currently the maximum number of combiners in a synthetic is 1024. MoarVM will throw an exception if you attempt to create a grapheme with more than 1024 codepoints in it. Normalization MoarVM normalizes into NFG form all input text. NFG Normalization Form Grapheme. Similar to NFC except graphemes which contain multiple codepoints are stored in Synthetic graphemes. https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/ Types Str type: graphemes say "नि".codes; # returns 2 say "नि".chars; # returns 1 say "\r\n".chars; # returns 1 NFC, NFD, NFKC, NFKD: types (jlf: types? really?) Uni: work with codepoints, no normalization (keep text as-is) Blob: family of types to work at the binary level Unicode source code say 0 ∈ «42 -5 1».map(&log ∘ &abs); say 0.1e0 + 0.2e0 ≅ 0.3e0; say 「There is no \escape in here!」 "Texas" source code say 0 (elem) <<42 -5 1>>.map(&log o &abs); say 0.1e0 + 0.2e0 =~= 0.3e0; say Q[[[There is no \escape in here!]]] https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/#comment-14302 jlf: interesting critics about graphemes. See also the comment after, which provides answers to the critics. https://lwn.net/Articles/667036/ Unicode, Perl 6, and You jlf: interesting opinions. https://en.wikipedia.org/wiki/Devanagari#Conjunct_consonants jlf: this is executable code (what is this notation < षि > ?) < षि > .NFC .say # NFC:0x<0937 093f> < षि > .NFKC .say # NFD:0x<0937 093f> < षि > .NFD .say # NFKC:0x<0937 093f> < षि > .NFKD .say # NFKD:0x<0937 093f> Particularly interesting, this subthread: https://lwn.net/Articles/667669/ Is the current Unicode design impractical? jlf tests # Returns a list of Unicode codepoint numbers that describe the codepoints making up the string "aå«".ords # (97 229 171) # Returns the codepoint number of the base characters of the first grapheme in the string "å«".ord # 229 "Bundesstraße im Freiland".lc # bundesstraße im freiland "Bundesstraße im Freiland".uc # BUNDESSTRASSE IM FREILAND "Bundesstraße im Freiland".fc # bundesstrasse im freiland "Bundesstraße im Freiland".index("Freiland") # 16 (start at 0) (executor: 17) "Bundesstraße im Freiland".index("freiland", :ignorecase) # 16 # Bundesstraße sss sßs ss # 01234567890123456789012 # | | || || | "Bundesstraße sss sßs ss".indices("ss") # (5 13 21) "Bundesstraße sss sßs ss".indices("ss", :overlap) # (5 13 14 21) "Bundesstraße sss sßs ss".indices("ss", :ignorecase) # (5 10 13 18 21) "Bundesstraße sss sßs ss".indices("ss", :ignorecase, :overlap) # (5 10 13 14 18 21) not 17? 
"Bundesstraße sss sßs ss".indices("s", :ignorecase, :overlap) # (5 6 13 14 15 17 19 21 22) "Bundesstraße sss sßs ss".indices("sSs", :ignorecase, :overlap) # (13 17 18) "Bundesstraße sss sßs ss".indices("sSsS", :ignorecase, :overlap) # (17) "Bündesstraße sss sßs ss".fc # bundesstrasse sss ssss ss # 0123456789012345678901234 # | | || ||| | "Bündëssträßë sss sßs ss".fc.indices("ss") # (5 10 14 18 20 23) "Bündëssträßë sss sßs ss".fc.indices("ss", :overlap) # (5 10 14 15 18 19 20 23) # straßssßßssse # 0123456789012 # || |||| "straßssßßssse".indices("Ss", :ignorecase) # (4 7 9) "straßssßßssse".indices("Ss", :ignorecase, :overlap) # (4 5 7 8 9 10) "TÊt\c[TAG SPACE]e".chars # 4, "t" + "TAG SPACE" is one grapheme "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc # TÊt󠀠e sss ssss ss t󠀠êTE # 012345678901234567890 # ^ ^ || ||| | ^ ^ "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".indices("ss") # (5 13) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".indices("ss", :ignorecase) # (5 10 13) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("ss") # (5 9 11 14) 11? why not 10? because no overlap "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("ss", :overlap) # (5 6 9 10 11 14) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te") # () "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te", :ignorecase) # (19) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te", :ignoremark) # (0 2 17 19) so TAG SPACE is ignored when :ignoremark # Matching inside a grapheme "noël👩‍👨‍👩‍👧🎅".indices("👧🎅") # () "noël👩‍👨‍👩‍👧🎅".indices("👨‍👩") # () # Matching a ligature # bâfflé # 012 3 "bâfflé".indices("é") # (3) "bâfflé".indices("ffl") # () "bâfflé".indices("ffl", :ignorecase) # (2) https://raku-advent.blog/2022/12/22/day-22-hes-making-a-list-part-1/ Unicode’s CLDR (Common Linguistic Data Repository) jlf: to read... https://www.nu42.com/2015/12/perl6-newline-translation-broken.html Newline translation in Perl6 is broken A. Sinan Unur December 11, 2015 --- jlf: Referenced from https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/#comment-14382 I reference this URL in case \r\n versus \r is a problem for Rexx Unicodified. For Unicode, \r\n is one grapheme. Maybe no relation with the failed test cases. Was fixed like that: https://github.com/Raku/old-issue-tracker/issues/4849#issuecomment-570873506 * We do translation of \r\n graphemes to \n on all input read as text except sockets, independent of platform * We do translation of all \n graphemes to \r\n on text output to handles except sockets, on Windows only * \n is now, unless `use newline` is in force, always \x0A * We don't do any such translation when using .encode/.decode, and of course when reading/writing Bufs to files, providing an escape hatch from translation if needed https://6guts.wordpress.com/2015/11/21/what-one-christmas-elf-has-been-up-to/ jlf: referenced for the section NFG improvements. https://6guts.wordpress.com/2015/10/15/last-week-unicode-case-fixes-and-much-more/ jlf: referenced for the section A case of Unicode. Testing the JMB's example "ς".uc # Σ "σ".uc # Σ "ὈΔΥΣΣΕΎΣ".lc # ὀδυσσεύς last Σ becomes ς "ὈΔΥΣΣΕΎΣA".lc # ὀδυσσεύσa last Σ becomes σ # Humm... the concatenation doesn't change ς to σ "ὈΔΥΣΣΕΎΣ".lc ~ "A" # ὀδυσσεύςA ("ὈΔΥΣΣΕΎΣ".lc ~ "A").uc # ὈΔΥΣΣΕΎΣA ("ὈΔΥΣΣΕΎΣ".lc ~ "A").uc.lc # ὀδυσσεύσa https://stackoverflow.com/questions/39663846/how-can-i-make-perl-6-be-round-trip-safe-for-unicode-data How can I make Perl 6 be round-trip safe for Unicode data? 
Answer: UTF8-C8 isn't really a good solution (but is probably the only solution currently). jlf: asked in 2016-09-23, maybe the situation is better today. https://rosettacode.org/wiki/String_comparison#Raku String comparisons never do case folding because that's a very complicated subject in the modern world of Unicode. (You can explicitly apply an appropriate case-folding function to the arguments before doing the comparison, or for "equality" testing you can do matching with a case-insensitive regex, assuming Unicode's language-neutral case-folding rules are okay.) --- Be aware that Raku applies normalization (Unicode NFC form (Normalization Form Canonical)) by default to all input and output except for file names See docs. Raku follows the Unicode spec. Raku follows all of the Unicode spec, including parts that some people don't like. There are some graphemes for which the Unicode consortium has specified that the NFC form is a different (though usually visually identical) grapheme. Referred to in Unicode standard annex #15 as Canonical Equivalence. Raku adheres to that spec. https://docs.raku.org/language/traps#Traps_to_avoid Some problems that might arise when dealing with strings https://raku.guide/#_unicode Escape characters say "\x0061"; say "\c[LATIN SMALL LETTER A]"; Numbers say (٤,٥,٦,1,2,3).sort; # (1 2 3 4 5 6) say 1 + ٩; # 10 Raku has methods/operators that implement the Unicode Collation Algorithm. say 'a' unicmp 'B'; # Less Raku provides a collate method that implements the Unicode Collation Algorithm. say ('a','b','c','D','E','F').sort; # (D E F a b c) say ('a','b','c','D','E','F').collate; # (a b c D E F)
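For side-by-side reference with the Raku .chars/.codes/.encode example above, the same counts obtained from Python; grapheme clusters need the third-party regex module (recent versions follow the UAX #29 emoji rules, older ones may split the ZWJ sequence).
import regex   # third-party; stdlib re has no \X (grapheme cluster) support
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"     # FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VS-16
len(regex.findall(r"\X", s))                     # 1  grapheme cluster (Raku .chars)
len(s)                                           # 5  code points      (Raku .codes)
len(s.encode("utf-16-le")) // 2                  # 7  UTF-16 code units
len(s.encode("utf-8"))                           # 17 UTF-8 bytes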

Rexx lang


11/08/2021 http://nokix.sourceforge.net/help/learn_rexx/funcs5.htm#VALUEIN Reads in a numeric value from a binary (ie, non-text) file. value = VALUEIN(stream, position, length, options) Args stream is the name of the stream. It can include the full path to the stream (ie, any drive and directory names). If omitted, the default is to read from STDIN. position specifies at what character position (within the stream) to start reading from, where 1 means to start reading at the very first character in the stream. If omitted, the default is to resume reading at where a previous call to CHARIN() or VALUEIN() left off (ie, where you current read character position is). length is a 1 to read in the next binary byte (ie, 8-bit value), a 2 to read in the next binary short (ie, 16-bit value), or a 4 to read in the next binary long (ie, 32-bit value). If length is omitted, VALUEIN() defaults to reading a byte. options can be any of the following: M The value is stored (in the stream) in Motorola (big endian) byte order, rather than Intel (little endian) byte order. The effects only long and short values. H Read in the value as hexadecimal (rather than the default of base 10, or decimal, which is the base that REXX uses to express numbers). The value can later be converted with X2D(). B Read in the value as binary (base 2). - The value is signed (as opposed to unsigned). V stream is the actual data string from which to extract a value. You can now replace calls to SUBSTR and C2D with a single, faster call to VALUEIN. If omitted, options defaults to none of the above. Returns The value, if successful. If an error, an empty string is returned (unless the NOTREADY condition is trapped via CALL method. Then, a '0' is returned). http://nokix.sourceforge.net/help/learn_rexx/funcs5.htm#VALUEOUT Write out numeric values to a binary (ie, non-text) file (ie, in non-text format). result = VALUEOUT(stream, values, position, size, options) Args stream is the name of the stream. It can include the full path to the stream (ie, any drive and directory names). If omitted, the default is to write to STDOUT (typically, display the data in the console window). position specifies at what character position (within the stream) to start writing the data, where 1 means to start writing at the very first character in the stream. If omitted, the default is to resume writing at where a previous call to CHAROUT() or VALUEOUT() left off (or where the "write character pointer" was set via STREAM's SEEK). values are the numeric values (ie, data) to write out. Each value is separated by one space. size is a 1 if each value is to be written as a byte (ie, 8-bit value), 2 if each value is to be written as a short (16-bit value), or 4 if each value is to be written as a long (32-bit value). If omitted, size defaults to 1. options can be any of the following: M Write out the values in Motorola (big endian) byte order, rather than Intel (little endian) byte order. The effects only long and short values. H The values you supplied are specified in hexadecimal. B The values you supplied are specified in binary (base 2). V stream is the name of a variable, and the data will be overlaid onto that variable's value. You can now replace calls to D2C and OVERLAY with a single, faster call to VALUEOUT, especially when a variable has a large amount of non-text data. If omitted, options defaults to none of the above. Returns 0 if the string was written out successfully. If an error, VALUEOUT() returns non-zero. 
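Not Rexx, but as a reference point for the size (1/2/4 bytes) and byte-order ('M' = Motorola / big endian) options above, the same choices expressed with Python's struct module; 't.bin' is just a made-up example file.
import struct
with open("t.bin", "wb") as f:                      # write 0x0102 and 0x01020304, both big endian
    f.write(struct.pack(">H", 0x0102) + struct.pack(">I", 0x01020304))
with open("t.bin", "rb") as f:
    short_be = struct.unpack(">H", f.read(2))[0]    # like VALUEIN(stream, position, 2, 'M')
    long_le  = struct.unpack("<I", f.read(4))[0]    # default Intel order, like VALUEIN(stream, , 4)
hex(short_be), hex(long_le)                         # ('0x102', '0x4030201')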
http://www.dg77.net/tekno/manuel/rexxendian.htm Test de l’endianité (endianness test) /* Check endianness */ /* For processing information encoded in UTF-8 */ /* Adapt if another encoding is used */ CALL CONV8_16 ' ' IF c2x(sortie) = '2000' THEN DO endian = 'LE' /* little endian */ blanx = '2000' END ELSE DO endian = 'BE' /* big endian */ blanx = '0020' END return endian blanx /* ********************************************************************** */ /* Conversion UTF-8 -> UNICODE */ CONV8_16: parse arg entree sortie = '' ZONESORTIE.='NUL'; ZONESORTIE.0=0 err = systounicode(entree, 'UTF8', , ZONESORTIE.) if err == 0 then sortie = ZONESORTIE.!TEXT else say 'probleme car., code ' err return http://www.dg77.net/tekno/xhtml/codage.htm Le codage des caractères (character encoding) To read; some info about the code pages could be useful. Regina doc EXPORT(address, [string], [length] [,pad]) - (AREXX) Copies data from the (optional) string into a previously-allocated memory area, which must be specified as a 4-byte address. The length parameter specifies the maximum number of characters to be copied; the default is the length of the string. If the specified length is longer than the string, the remaining area is filled with the pad character or nulls('00'x). The returned value is the number of characters copied. Caution is advised in using this function. Any area of memory can be overwritten, possibly causing a system crash. See also STORAGE() and IMPORT(). Note that the address specified is subject to a machine's endianness. EXPORT('0004 0000'x,'The answer') '10' IMPORT(address [,length]) - (AREXX) Creates a string by copying data from the specified 4-byte address. If the length parameter is not supplied, the copy terminates when a null byte is found. See also EXPORT() Note that the address specified is subject to a machine's endianness. IMPORT('0004 0000'x,10) 'The answer' /* maybe */
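The CONV8_16 routine above infers endianness from the UTF-16 bytes produced for a space; roughly the same check in Python for comparison (the 'utf-16' codec emits a BOM in native byte order), plus the direct sys.byteorder answer.
import sys
encoded = " ".encode("utf-16")                  # native order, BOM first
endian = "LE" if encoded.startswith(b"\xff\xfe") else "BE"
endian, encoded.hex(), sys.byteorder            # e.g. ('LE', 'fffe2000', 'little')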

Ruby lang


jlf note: still searching articles/blogs comparing the Ruby's approach (multi-encodings) with languages that force the conversion to Unicode (be it utf-8 or Unicode scalars). https://docs.ruby-lang.org/en/3.2/String.html class String --- jlf: focus on comparison. I did not find the definition of "compatible". Methods for Comparing ==, ===: Returns true if a given other string has the same content as self. eql?: Returns true if the content is the same as the given other string. <=>: Returns -1, 0, or 1 as a given other string is smaller than, equal to, or larger than self. casecmp: Ignoring case, returns -1, 0, or 1 as a given other string is smaller than, equal to, or larger than self. casecmp?: Returns true if the string is equal to a given string after Unicode case folding; false otherwise. Returns false if the two strings’ encodings are not compatible: "\u{e4 f6 fc}" == ("\u{e4 f6 fc}") # => true "\u{e4 f6 fc}".encode("ISO-8859-1") == ("\u{e4 f6 fc}") # => false "\u{e4 f6 fc}".eql?("\u{e4 f6 fc}") # => true "\u{e4 f6 fc}".encode("ISO-8859-1").eql?("\u{e4 f6 fc}") # => false # "äöü" "ÄÖÜ" "\u{e4 f6 fc}".casecmp("\u{c4 d6 dc}") # => 1 "\u{e4 f6 fc}".encode("ISO-8859-1").casecmp("\u{c4 d6 dc}") # => nil https://yehudakatz.com/2010/05/17/encodings-unabridged/ Encodings, Unabridged jlf: this article explains why the Ruby team consider that Unicode is not a good solution for CJK. https://ruby-doc.org/current/Encoding.html https://github.com/ruby/ruby/blob/master/encoding.c jlf: search "compat" https://docs.ruby-lang.org/en/master/encodings_rdoc.html Encodings --- jlf: Executor has a similar support of encodings, with less defaults and less supported encodings. Otherwise the technical solution is the same: all encodings are equal, there is no forced internal encoding, no forced conversion. --- Default encodings: - Encoding.default_external: the default external encoding - Encoding.default_internal: the default internal encoding (may be nil) - locale: the default encoding for a string from the environment - filesystem: the default encoding for a string from the filesystem String encoding A Ruby String object has an encoding that is an instance of class Encoding. The encoding may be retrieved by method String#encoding. 's'.encoding # => #<Encoding:UTF-8> The default encoding for a string literal is the script encoding The encoding for a string may be changed: s = "R\xC3\xA9sum\xC3\xA9" # => "Résumé" s.encoding # => #<Encoding:UTF-8> s.force_encoding('ISO-8859-1') # => "R\xC3\xA9sum\xC3\xA9" s.encoding # => #<Encoding:ISO-8859-1> Stream Encodings Certain stream objects can have two encodings; these objects include instances of: IO. File. ARGF. StringIO. The two encodings are: - An external encoding, which identifies the encoding of the stream. The default external encoding is: - UTF-8 for a text stream. - ASCII-8BIT for a binary stream. - An internal encoding, which (if not nil) specifies the encoding to be used for the string constructed from the stream. The default internal encoding is nil (no conversion). Script Encoding The default script encoding is UTF-8; a Ruby source file may set its script encoding with a magic comment on the first line of the file (or second line, if there is a shebang on the first). 
The comment must contain the word coding or encoding, followed by a colon, space and the Encoding name or alias: # encoding: ISO-8859-1 __ENCODING__ #=> #<Encoding:ISO-8859-1> This example writes a string to a file, encoding it as ISO-8859-1, then reads the file into a new string, encoding it as UTF-8: s = "R\u00E9sum\u00E9" path = 't.tmp' ext_enc = 'ISO-8859-1' int_enc = 'UTF-8' File.write(path, s, external_encoding: ext_enc) raw_text = File.binread(path) # "R\xE9sum\xE9" transcoded_text = File.read(path, external_encoding: ext_enc, internal_encoding: int_enc) # "Résumé" https://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/ 3 Steps to Fix Encoding Problems in Ruby The major difference between encode and force_encoding is that encode might change bytes, and force_encoding won’t. In ASCII-8BIT, every character is represented by a single byte. That is, str.chars.length == str.bytes.length. https://www.cloudbees.com/blog/how-ruby-string-encoding-benefits-developers Familiarize Yourself with Ruby String Encoding written August 14, 2018 Ruby encoding methods - String#force_encoding is a way of saying that we know the bits for the characters are correct and we simply want to properly define how those bits are to be interpreted to characters. - String#encode will transcode the bits themselves that form the characters from whatever the string is currently encoded as to our target encoding. Example of the byte size being different from the character length: "łał".size # => 3 "łał".bytesize # => 5 Different operating systems have different default character encodings so programming languages need to support these. Encoding.default_external # => #<Encoding:UTF-8> Ruby defaults to UTF-8 as its encoding so if it is opening up files from the operating system and the default is different from UTF-8, it will transcode the input from that encoding to UTF-8. If this isn't desirable, you may change the default internal encoding in Ruby with Encoding.default_internal. Otherwise you can use specific IO encodings in your Ruby code. File.open(filename, 'r:UTF-8', &:read) # or File.open(filename, external_encoding: "ASCII-8BIT", internal_encoding: "ASCII-8BIT") do |f| f.read end Lately, I've been integrating Ruby's encoding support to Rust with the library Rutie. Rutie allows you to write Rust that works in Ruby and Ruby that works in Rust. jlf: see Rutie in Rust lang. https://ruby-core.ruby-lang.narkive.com/RDKAvdS7/20483-encoding-of-symbols#post2 [ruby-core:20483] encoding of symbols --- jlf: AT LAST! I found a discussion about the comparison of strings. LONG thread, to carefully read. --- This message 2008-12-14 is a good summary! Is it still correct today? https://ruby-core.ruby-lang.narkive.com/RDKAvdS7/20483-encoding-of-symbols#post12 - String operations are done using the bytes in the strings - they are not converted to codepoints internally - String equality comparisons seem to be simply done on a byte-by-byte basis, without regard to the encoding - *However* other operations are not simply byte-by-byte. They are done character-by-character, but without converting to codepoints - eg: a 3 byte character is kept as 3 bytes. For example this means that when operating on a variable-length encoding, simple operations like indexing can be inefficient, as Ruby may have to scan through the string from the start. However Ruby does try to optimize this where possible. - There is also a concept of "compatible encodings".
Given 2 encodings e1 & e2, e1 is compatible with e2 if the representation of every character in e1 is the same as in e2. This implies that e2 must be a "bigger" encoding than e1 - ie: e2 is a superset of e1. Typically we are mainly talking about US-ASCII here, which is compatible with most other character sets that are either all single-byte (eg: all the ISO-8859 sets) or are variable-length multi-byte (eg: UTF-8). - When operating on encodings e1 & e2, if e1 is compatible with e2, then Ruby treats both strings as being in encoding e2. - String#> and String#< are a bit wierd. Normally they are just done on a byte-by-byte basis, UNLESS the strings are the same and are incompatible encodings, then they always seem to return FALSE. (I have to check this - it may be more complicated than this). - When operating on incompatible encodings, *normally* non-comparison operations (including regexp matches) raise an "Encoding Compatibility Error". - However there appears to be an exception to this: if operating on 2 incompatible encodings AND US-ASCII is compatible with both, AND both strings are US-ASCII strings, then the operation appears to proceed, treating both as US-ASCII. For example "abc" as an ISO-8859-1 and "abc" as UTF-8. I guess this is Ruby being "forgiving". (Personally I am not sure if this is good or bad). The encoding of the result (for example of a string concatenation) seems to be one of the 2 original encodings - I haven't figured out the logic to this yet :) --- jlf: this one seems ugly... Actually I just checked this, and this is wrong, sorry. I ended up looking at the source code of rb_str_cmp() in string.c, and here is what I think it does: - it does a byte-by-byte comparison. Assuming the strings are different, Ruby returns what you would expect based on this. - if the strings are byte for byte identical, but they have incompatible encodings and at least one of the strings contains a non-ASCII character, then it seems that the result is determined by the ordering of the encodings, based on ruby's "encoding index" - an internal ordering of the available encodings. Maybe I have got this wrong - it doesn't make a lot of sense to me! --- I don't mean to shoot you down in flames, but a lot of thought and effort has gone into Ruby's encoding support. Ruby could have followed the Python route of converting everything to Unicode, but that was rejected for various good reasons. Also automatic transcoding to solve issues of incompatible encodings was also rejected because it causes a number of problems, in particular I believe that transcoding isn't necessarilly accurate, because for example there may be multiple or ambiguous representations of the same character. --- Yukihiro Matsumoto UTF-8 + ASCII-8BIT makes ASCII-8BIT. Binary wins. jlf: hum... I do the opposite with Executor jlf 2023.08.09: I checked today with Ruby 3.2, the result is UTF-8 http://graysoftinc.com/character-encodings jlf: 12 articles about character encoding in Ruby. From 2008-10-14 to 2009-06-18 Old, but maybe interesting? todo: read https://docs.ruby-lang.org/en/3.2/case_mapping_rdoc.html Case Mapping By default, all of these methods use full Unicode case mapping, which is suitable for most languages. Non-ASCII case mapping and folding are supported for UTF-8, UTF-16BE/LE, UTF-32BE/LE, and ISO-8859-1~16 Strings/Symbols. Context-dependent case mapping is currently not supported (Unicode standard: Context Specification for Casing). In most cases, case conversions of a string have the same number of characters. 
There are exceptions (see also :fold below): s = "\u00DF" # => "ß" s.upcase # => "SS" s = "\u0149" # => "ʼn" s.upcase # => "ʼN" Case mapping may also depend on locale (see also :turkic below) s = "\u0049" # => "I" s.downcase # => "i" # Dot above. s.downcase(:turkic) # => "ı" # No dot above. Case changing methods may not maintain Unicode normalization. Except for casecmp and casecmp?, each of the case-mapping methods listed above accepts optional arguments, *options. The arguments may be: :ascii only. :fold only. :turkic or :lithuanian or both. https://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/ composition in the form of ligatures isn’t handled at all "baffle".upcase == "BAFFLE" # => false jlf: Has been fixed in a later version: "baffle".upcase # => "BAFFLE" BUT other things are still not good in Ruby 3.2.2 (March 30, 2023): "noël".reverse # => "l̈eon" "noël"[0..2] # => "noe" --- "baffle"~text~upper= -- T'BAfflE' 30/05/2023 Executor not good because utf8proc upper is not good "baffle"~text~caselessEquals("baffle")= -- 1 30/05/2023 Executor is good because utf8proc casefold is good "noël"~text~reverse= -- T'lëon' "noël"~text[1,3]= -- T'noë' https://github.com/jmhodges/rchardet Character encoding auto-detection in Ruby. jlf: no doc :-( Returns a confidence rate? cd = CharDet.detect(some_data) encoding = cd['encoding'] confidence = cd['confidence'] # 0.0 <= confidence <= 1.0 https://bugs.ruby-lang.org/issues/18949 Deprecate and remove replicate and dummy encodings Rejected by Naruse: String is a container and an encoding is a label of it. While data whose encoding is an encoding categorized in dummy encodings in Ruby, we cannot avoid such encodings. <reopened, lot of discussions> This is all done now, only https://github.com/ruby/ruby/pull/7079. Overall: We deprecated and removed Encoding#replicate We removed get_actual_encoding() We limited to 256 encodings and kept rb_define_dummy_encoding() with that constraint. There is a single flat array to lookup encodings, rb_enc_from_index() is fast now. https://github.com/ruby/ruby/pull/3803 Add string encoding IBM720 alias CP720 The mapping table is generated from the ICU project: https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/ibm-720_P100-1997.ucm https://speakerdeck.com/ima1zumi/dive-into-encoding slide 23: Code Set Independent (CSI), Treat all encodings fair slide 24: Each instance of string has encoding information slide 26: Universal Coded Set (UCS) https://shopify.engineering/code-ranges-ruby-strings Code Ranges: A Deeper Look at Ruby Strings Code ranges are a way for the VM to avoid repeated work and optimize operations on a per-string basis, guiding away from slow paths when that functionality isn't needed. jlf: not sure this article is useful. https://idiosyncratic-ruby.com/66-ruby-has-character.html Ruby has Character video: https://www.youtube.com/watch?v=hlryzsdGtZo (jlf: too small, not very readable, but good for pronuntiation: "Louby") --- jlf: this page is interesting for the one-liners. Tools implemented by the author https://github.com/janlelis/unibits Visualize different Unicode encodings in the terminal https://github.com/janlelis/uniscribe Know your Unicode ✀ https://idiosyncratic-ruby.com/41-proper-unicoding.html Proper Unicoding Ruby's Regexp engine has a powerful feature built in: It can match for Unicode character properties. 
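For comparison with Ruby's \p{...} property matching mentioned just above: in Python this needs the third-party regex module (stdlib re has no \p escapes); a minimal sketch with made-up sample text.
import regex
s = "Ὀδυσσεύς straße 42"
regex.findall(r"\p{Greek}+", s)    # ['Ὀδυσσεύς']
regex.findall(r"\p{Latin}+", s)    # ['straße']
regex.findall(r"\p{Nd}+", s)       # ['42']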
https://idiosyncratic-ruby.com/26-file-encoding-magic.html default source encoding # coding: cp1252 p "".encoding #=> #<Encoding:Windows-1252> https://tomdebruijn.com/posts/rust-string-length-width-calculations/ The article is about Rust, but there is an appendix about Ruby. Seems a good summary, so copy-paste here... --- When calling Ruby's String#length, it returns the length of characters like Rust's Chars.count. If you want the length in bytes you need to call String#bytesize. "abc".length # => 3 characters "abc".bytesize # => 3 bytes "é".length # => 1 characters "é".bytesize # => 2 bytes Calling the length on emoji will return the individual characters as the length. The 👩‍🔬 emoji is three characters and eleven bytes in Ruby as well. "👩‍🔬".length # => 3 characters "👩‍🔬".bytesize # => 11 bytes Do you want grapheme clusters? it's built-in to Ruby with String#grapheme_clusters. "👩‍🔬".grapheme_clusters.length # => 1 cluster To calculate the display with, we can use the unicode-display_width gem. The same multiple counting of emoji in the grapheme cluster still applies here. require "unicode/display_width" Unicode::DisplayWidth.of("👩‍🔬") # => 4 Unicode::DisplayWidth.of("❤️") # => 1 https://ruby-doc.org/3.2.2/File.html class File A File object is a representation of a file in the underlying platform. --- Data mode To specify whether data is to be treated as text or as binary data, either of the following may be suffixed to any of the string read/write modes above: 't': Text data; sets the default external encoding to Encoding::UTF_8; on Windows, enables conversion between EOL and CRLF and enables interpreting 0x1A as an end-of-file marker. 'b': Binary data; sets the default external encoding to Encoding::ASCII_8BIT; on Windows, suppresses conversion between EOL and CRLF and disables interpreting 0x1A as an end-of-file marker. --- Encodings Any of the string modes above may specify encodings - either external encoding only or both external and internal encodings - by appending one or both encoding names, separated by colons: f = File.new('t.dat', 'rb') f.external_encoding # => #<Encoding:ASCII-8BIT> f.internal_encoding # => nil f = File.new('t.dat', 'rb:UTF-16') f.external_encoding # => #<Encoding:UTF-16 (dummy)> f.internal_encoding # => nil f = File.new('t.dat', 'rb:UTF-16:UTF-16') f.external_encoding # => #<Encoding:UTF-16 (dummy)> f.internal_encoding # => #<Encoding:UTF-16> f.close - When the external encoding is set, strings read are tagged by that encoding when reading, and strings written are converted to that encoding when writing. - When both external and internal encodings are set, strings read are converted from external to internal encoding, and strings written are converted from internal to external encoding. For further details about transcoding input and output, see Encodings. https://ruby-doc.org/3.2.2/encodings_rdoc.html#label-Encodings String comparison If the encodings are different then the strings are different. So it's not a comparison of Unicode codepoints. irb(main):026:0> s1 = "hello" => "hello" irb(main):027:0> s1 => "hello" irb(main):028:0> s2 = "hello" => "hello" irb(main):029:0> s1 == s2 => true irb(main):030:0> s2.force_encoding("utf-16") => "\x68\x65\x6C\x6C\x6F" irb(main):031:0> s2 => "\x68\x65\x6C\x6C\x6F" irb(main):032:0> s1 == s2 => false https://bugs.ruby-lang.org/issues/9111 Encoding-free String comparison 14/11/2013 --- Description Currently, strings with the same content but with different encodings count as different strings. 
This causes strange behaviour as below (noted in StackOverflow question http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206): [128].pack("C") # => "\x80" [128].pack("C") == "\x80" # => false Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default) has the encoding UTF-8, the two strings are not equal. Also, comparison of strings with different encodings may end up with a messy, unintended result. I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison. --- nobu (Nobuyoshi Nakada) It's unacceptable to always convert all strings to UTF-8, should restrict to comparison with an ASCII-8BIT string. --- naruse (Yui NARUSE) The standard practice is NFD("â") == NFD("a" + "^"). To NFD, you can use some libraries. --- duerst (Martin Dürst) Lié à Feature #10084: Add Unicode String Normalization to String class ajouté https://bugs.ruby-lang.org/issues/10084 --- jlf 09/08/2023: ticket still opened... The test [128].pack("C") == "\x80" still returns false, so I assume they made no change. https://bugs.ruby-lang.org/issues/10084 Add Unicode String Normalization to String class 23/07/2014 --- nobu (Nobuyoshi Nakada) What will happen for a non-unicode string, raising an exception? --- duerst (Martin Dürst) This is a very good question. I'm okay with whatever Matz and the community think is best. There are many potential approaches. In general, these will be: 1. Make the operation a no-op. 2. Convert to UTF-8, normalize, then convert back. 3. Implement normalization directly in the encoding. 4. Raise an exception. There is also the question of what a "non-unicode" or "unicode" string is. UTF-8 is the preferred way to handle Unicode in Ruby, and is where normalization is really needed and will be used. For the other encodings, unless we go with 1) or 4), the following considerations apply. UTF8-Mac, UTF8-DoCoMo, UTF8-KDDI and UTF8-Softbank are essentially UTF-8 but with slightly different character conversions. For these encodings, the easiest thing to do is force_encoding to UTF-8, normalize, and force_encoding back. A C-level implementation may not actually need force_encoding, but a Ruby implementation does. There are some questions about what normalizing UTF8-Mac means, so that may have to be treated separately. The DoCoMo/KDDI/Softbank variants are mostly about emoji, which as far as I know are not affected by normalization. Then there are UTF-16LE/BE and UTF-32LE/BE. For these, it depends on the implementation. A Ruby-level implementation (unless very slow) may want to convert to UTF-8 and back. A C-level implementation may not need to do this. Then there is also GB18030. Conversion to UTF-8 and back seems to be the best solution. Doing normalization directly in GB18030 will need too much data. For other, truely non-unicode encodings, implementing noramlization directly in the encoding would mean the following: Analyze to what extent the normalization applies to the encoding in question, and apply this part. As an example, '①'.nfkc produces '1' in UTF-8, it could do the same in Windows-31J. The analysis might take some time (but can be automated), and the data needed for each encoding would mostly be just very small. --- matz (Yukihiro Matsumoto) First of all, I don't think normalize is the best name. I propose unicode_normalize instead, since this normalization is sort of unicode specific. 
It should raise an exception for non Unicode strings. It shouldn't convert to UTF-8 implicitly inside. https://www.honeybadger.io/blog/troubleshooting-encoding-errors-in-ruby/ Troubleshooting Encoding Errors in Ruby --- jlf: interesting for the one-liners --- "H".bytes # => [72] in decimal "H".bytes.map {|e| e.to_s 2} # => ["1001000"] convert in base 2 Encoding.name_list # => ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", "UTF8-MAC", "EUC-JP", "Windows-31J", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-KR", ...] "hellÔ!".encode("US-ASCII") # in `encode': U+00D4 from UTF-8 to US-ASCII (Encoding::UndefinedConversionError) "hellÔ!".force_encoding("US-ASCII"); # => "hell\xC3\x94!" "abc\xCF\x88\xCF\x88" # => "abcψψ" "abcψψ".force_encoding("US-ASCII").valid_encoding? # => false "abcψψ".encode("US-ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "") # => "abc" "abc\xA1z".encode("US-ASCII") # in `encode': "\xA1" on UTF-8 (Encoding::InvalidByteSequenceError) "abc\xA1z".force_encoding("US-ASCII").scrub("*") # => "abc*z" "abc\xA1z".force_encoding("US-ASCII").scrub("") # => "abcz" "abc\xA1z".force_encoding("US-ASCII").valid_encoding? # => false
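The issue #9111 comparison above ([128].pack("C") == "\x80" # => false) has no direct Python counterpart because Python keeps bytes and str apart; a small illustration (my own, not from the Ruby tracker) of how the same data compares once decoded.
raw = bytes([128])                      # the Ruby ASCII-8BIT "\x80"
raw == "\x80"                           # False: bytes never compare equal to str
raw.decode("latin-1") == "\x80"         # True: after decoding, only code points matter
"\x80".encode("utf-8")                  # b'\xc2\x80'  the UTF-8 bytes differ from raw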

Rust lang


Seen in a comment here: https://bugs.swift.org/browse/SR-7602 For reference, I think [Rust's model]( https://doc.rust-lang.org/std/string/struct.String.html ) is pretty good: `from_utf8` produces an error explaining why the code units were invalid `from_utf8_lossy` replaces encoding errors with U+FFFD `from_utf8_unchecked` which takes the bytes, but if there's an encoding error, then memory safety has been violated I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly. We may want to be very cautious about if/how we expose it. I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8. 17/07/2021 https://www.generacodice.com/en/articolo/120763/Unicode+Support+in+Various+Programming+Languages jlf: I learned something: OsStr/OsString Rust's strings (std::String and &str) are always valid UTF-8, and do not use null terminators, and as a result can not be indexed as an array, like they can be in C/C++, etc. They can be sliced somewhat like Go using .get since 1.20, with the caveat that it will fail if you try slicing the middle of a code point. Rust also has OsStr/OsString for interacting with the host OS. It's a byte array on Unix (containing any sequence of bytes). On Windows it's WTF-8 (a superset of UTF-8 that handles the improperly formed Unicode strings that are allowed in Windows and JavaScript). &str and String can be freely converted to OsStr or OsString, but require checks to convert the other way, either by failing on invalid Unicode or by replacing with the Unicode replacement char. (There is also Path/PathBuf, which are just wrappers around OsStr/OsString). There is also the CStr and CString types, which represent null-terminated C strings; like OsStr on Unix, they can contain arbitrary bytes. Rust doesn't directly support UTF-16, but can convert OsStr to UCS-2 on Windows. 22/07/2021 https://lib.rs/crates/ STFU-8: Sorta Text Format in UTF-8 STFU-8 is a hacky text encoding/decoding protocol for data that might be not quite UTF-8 but is still mostly UTF-8. Its primary purpose is to be able to allow a human to visualize and edit "data" that is mostly (or fully) visible UTF-8 text. It encodes all non-visible or non-UTF-8-compliant bytes as longform text (i.e. ESC becomes the full string r"\x1B"). It can also encode/decode ill-formed UTF-16. 28/07/2021 https://fasterthanli.me/articles/working-with-strings-in-rust 07/11/2021 https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html security concern affecting source code containing "bidirectional override" Unicode codepoints 10/03/2022 https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers. 10/09/2022 https://blog.burntsushi.net/bstr/ A byte string library for Rust Invalid UTF-8 doesn’t actually prevent one from applying Unicode-aware algorithms on the parts of the string that are valid UTF-8. The parts that are invalid UTF-8 are simply ignored. 15/10/2022 https://crates.io/crates/finl_unicode Library for handling Unicode functionality for finl (categories and grapheme segmentation) There are these comments in https://news.ycombinator.com/item?id=32700315 All with two-step tables instead of range- and binary search? Yes.
The two-step tables are really not that expensive and they enable features not possible with range and binary search, like identifying the category of a character cheaply. https://github.com/open-i18n/rust-unic UNIC: Unicode and Internationalization Crates for Rust jlf: seems stale since Oct 21, 2020. Killed by ICU4X? This fork is still alive: https://github.com/eyeplum/rust-unic https://github.com/logannc/fuzzywuzzy-rs port of https://github.com/seatgeek/fuzzywuzzy (Fuzzy String Matching in Python. This project has been renamed and moved to https://github.com/seatgeek/thefuzz) Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package. https://en.wikipedia.org/wiki/Levenshtein_distance https://hsivonen.fi/encoding_rs/ encoding_rs: a Web-Compatible Character Encoding Library in Rust encoding_rs is a high-decode-performance, low-legacy-encode-footprint and high-correctness implementation of the WHATWG Encoding Standard written in Rust. --- https://hsivonen.fi/modern-cpp-in-rust/ How I Wrote a Modern C++ Library in Rust Slides: https://hsivonen.fi/rustfest2018/ Video: https://media.ccc.de/v/rustfest18-5-a_rust_crate_that_also_quacks_like_a_modern_c_library https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2773r0.pdf (pdf...) Generally speaking, reducing the size of the tables has a direct impact on performance, if only because increasing cache locality is the most effective way to improve the performance of anything. I landed on a set of strategies developed by the rust team https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator/src https://www.youtube.com/watch?v=Mcuqzx3rBWc Strings in Rust FINALLY EXPLAINED! jlf: is there something to learn from 15:29 Indexing into a string? no. https://github.com/rust-lang/regex/blob/master/UNICODE.md regex Unicode conformance jlf: I found the URL above in this HN comment (related to awk support of Unicode) https://news.ycombinator.com/item?id=32538560 https://github.com/danielpclark/rutie Integrate Ruby with your Rust application. Or integrate Rust with your Ruby application. https://github.com/danielpclark/rutie/blob/master/src/class/string.rs https://tomdebruijn.com/posts/rust-string-length-width-calculations/ Calculating String length and width https://github.com/lintje/lintje/blob/501aab06e19008e787237438a69ac961f38bb4b7/src/utils.rs#L22-L71 // Return String display width as rendered in a monospace font according to the Unicode // specification. https://www.reddit.com/r/rust/comments/gpw2ra/how_is_the_rust_compiler_able_to_tell_the_visible/ How is the Rust compiler able to tell the visible width of unicode characters? --- jlf: some arbitrary excerpts - rustc uses the unicode-width crate (https://github.com/unicode-rs/unicode-width) - Now try it with the rainbow flag emoji. Unicode is hard :) - explanation: the rainbow flag emoji is actually just a white flag + zero width joiner + a rainbow, meaning it's technically three characters. - Sure but why doesn't the unicode-width crate handle that? - The unicode-width crate operates on scalar values. I don't believe Unicode has a way to determine whether a grapheme cluster is halfwidth/fullwidth. The most reasonable way to determine this would probably be the maximum width of any scalar value within a grapheme cluster, but this isn't part of any standard and probably isn't 100% accurate. - It is also dependent on the display platform.
A platform with support for displaying emojis but only in older unicode versions would indeed display multiple emojis on the screen. I don't believe there's a platform independent way to detect the visual length of any given series of unicode codepoints. For Rust this isn't a problem as we restrict the unicode identifiers only to things that are fairly homogeneous (namely, no emojis in your variable names!). - At the bottom of things is the unicode-width native Rust implementation, based off the Unicode 13.0 data tables. In C/POSIX land, we would use the function wcwidth(). Unfortunately, this isn't the whole story. The actual number of columns used is dependent upon your font and the font layout engine. See section 7.4 of my Free book, Hacking the Planet! with Notcurses, aka "Fixed-width Fonts Ain't So Fixed." https://nick-black.com/htp-notcurses.pdf#page=57 you want pages 47--49 (p49 has some good examples). https://github.com/unicode-rs/unicode-width Displayed width of Unicode characters and strings according to UAX#11 rules. NOTE: The computed width values may not match the actual rendered column width. For example, the woman scientist emoji comprises of a woman emoji, a zero-width joiner and a microscope emoji. extern crate unicode_width; use unicode_width::UnicodeWidthStr; fn main() { assert_eq!(UnicodeWidthStr::width("👩"), 2); // Woman assert_eq!(UnicodeWidthStr::width("🔬"), 2); // Microscope assert_eq!(UnicodeWidthStr::width("👩‍🔬"), 4); // Woman scientist } https://github.com/life4/textdistance.rs https://www.reddit.com/r/rust/comments/13lo6ne/textdistancers_rust_library_to_compare_strings_or/ textdistance.rs: Rust library to compare strings (or any sequences). 25+ algorithms, pure Rust, common interface, Unicode support. Based on popular and battle-tested textdistance Python library https://github.com/life4/textdistance https://github.com/dguo/strsim-rs Rust implementations of string similarity metrics: Hamming Levenshtein - distance & normalized Optimal string alignment Damerau-Levenshtein - distance & normalized Jaro and Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix length Sørensen-Dice https://docs.rs/xi-unicode/latest/xi_unicode/ Unicode utilities useful for text editing, including a line breaking iterator. https://github.com/BurntSushi/bstr A string type for Rust that is not required to be valid UTF-8. --- jlf: this crate is referenced by Stefan Karpinski in the section Filenames (search this URL). https://www.reddit.com/r/rust/comments/qr0rem/how_many_string_types_does_rust_have_maybe_its/ How many String types does Rust have? Maybe it's just 1 jlf: to read?

Saxon lang


https://www.saxonica.com/documentation12/#!localization/unicode-collation-algorithm Unicode Collation Algorithm https://www.saxonica.com/documentation12/index.html#!localization/sorting-and-collations Sorting and collations https://www.saxonica.com/documentation12/index.html#!changes/spi/10-11 Changes from 10 to 11 Strings Most uses of CharSequence have been replaced by a new class net.sf.saxon.str.UnicodeString (which also replaces the old class net.sf.saxon.regex.UnicodeString). The UnicodeString class has a number of implementations. All of them are designed to be codepoint-addressable: they expose an indexable array of 32-bit codepoint values, and never use surrogate pairs. The implementations of UnicodeString include: - Twine8: a string consisting entirely of codepoints in the range 1-255, held in an array with one byte per character. - Twine16: a string consisting entirely of codepoints in the range 1-65535, held in an array with two bytes per character. - Twine24: a string of arbitrary codepoints, held in an array with three bytes per character. - Slice8: a sub-range of an array using one byte per character. - Slice16: a sub-range of an array using two bytes per character. - Slice24: a sub-range of an array using three bytes per character. - BMPString: a wrapper around a Java/C# string known to contain no surrogate pairs. - ZenoString: a composite string held as a list of segments, each of which is itself a UnicodeString. The name derives from the algorithm used to combine segments, which results in segments having progressively decreasing lengths towards the end of the string. - StringView: a wrapper around an arbitrary Java/C# string. (This stores the string both in its native Java/C# form, and using a "real" codepoint-addressable implementation of UnicodeString, which is constructed lazily when it is first required.) Unicode normalization of strings (for example in the fn:normalize-unicode() function) now uses the JDK class java.text.Normalizer rather than code derived from the Unicode Consortium's implementation. This appears to be substantially faster. https://www.balisage.net/Proceedings/vol26/html/Kay01/BalisageVol26-Kay01.html ZenoString: A Data Structure for Processing XML Strings August 2 - 6, 2021 Compare with - Monolithic char arrays - Strings in Saxon - Ropes - Finger Trees https://www.cambridge.org/core/journals/journal-of-functional-programming/article/finger-trees-a-simple-generalpurpose-data-structure/BF419BCA07292DCAAF2A946E6BDF573B#article finger-trees-a-simple-general-purpose-data-structure.pdf

SQL lang


https://dev.mysql.com/doc/refman/8.0/en/charset-unicode.html Unicode Support BMP characters - can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes) - can be encoded in a fixed-length encoding using 16 bits (2 bytes). Supplementary characters take more space than BMP characters (up to 4 bytes per character). MySQL supports these Unicode character sets: - utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character. - utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utf8mb4 instead. - utf8: An alias for utf8mb3. In MySQL 8.0, this alias is deprecated; use utf8mb4 instead. utf8 is expected in a future release to become an alias for utf8mb4. https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb4.html jlf: I take note of this URL for this concatenation rule: utf8mb4 is a superset of utf8mb3, so for an operation such as the following concatenation, the result has character set utf8mb4 and the collation of utf8mb4_col: SELECT CONCAT(utf8mb3_col, utf8mb4_col); Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col: SELECT * FROM utf8mb3_tbl, utf8mb4_tbl WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col; https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html#data-types-storage-reqs-strings String Type Storage Requirements https://dev.mysql.com/doc/refman/8.0/en/charset-introducer.html Character Set Introducers A character string literal, hexadecimal literal, or bit-value literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation: [_charset_name] literal [COLLATE collation_name] The _charset_name expression is formally called an introducer. It tells the parser, “the string that follows uses character set charset_name.” An introducer does not change the string to the introducer character set like CONVERT() would do. It does not change the string value, although padding may occur. The introducer is just a signal. --- Examples: SELECT 'abc'; SELECT _latin1'abc'; SELECT _binary'abc'; SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci; SELECT _latin1 X'4D7953514C'; SELECT _utf8mb4 0x4D7953514C COLLATE utf8mb4_danish_ci; SELECT _latin1 b'1000001'; SELECT _utf8mb4 0b1000001 COLLATE utf8mb4_danish_ci; --- Character string literals can be designated as binary strings by using the _binary introducer. mysql> SET @v1 = X'000D' | X'0BC0'; mysql> SET @v2 = _binary X'000D' | X'0BC0'; mysql> SELECT HEX(@v1), HEX(@v2); +----------+----------+ | HEX(@v1) | HEX(@v2) | +----------+----------+ | BCD | 0BCD | +----------+----------+ --- Followed by rules to determine the character set and collation of a character string literal, hexadecimal literal, or bit-value literal. See the page for the details. https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/ MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4? MySQL decided that UTF-8 can only hold 3 bytes per character (as it's defined as an alias of utf8mb3). Why? No good reason that I can find documented anywhere. Few years later, when MySQL 5.5.3 was released, they introduced a new encoding called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.
https://www.percona.com/blog/migrating-to-utf8mb4-things-to-consider/ Migrating to utf8mb4: Things to Consider The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades. Migration to utf8mb4 has many advantages including: - It can store more symbols, including emojis - It has new collations for Asian languages - It is faster than utf8mb3

Swift lang


https://github.com/apple/swift-evolution/blob/main/proposals/0363-unicode-for-string-processing.md Proposal: Unicode for String Processing This proposal describes Regex's rich Unicode support during regex matching, along with the character classes and options that define and modify that behavior. This proposal is one component of a larger regex-powered string processing initiative. https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/ Strings and Characters Every string is composed of encoding-independent Unicode characters, and provides support for accessing those characters in various Unicode representations. When a Unicode string is written to a text file or some other storage, the Unicode scalars in that string are encoded in one of several Unicode-defined encoding forms. Each form encodes the string in small chunks known as code units. These include the UTF-8 encoding form (which encodes a string as 8-bit code units), the UTF-16 encoding form (which encodes a string as 16-bit code units), and the UTF-32 encoding form (which encodes a string as 32-bit code units). 03/08/2021 https://swiftdoc.org/v5.1/type/string/ Auto-generated documentation for Swift. A Unicode string value that is a collection of characters. https://developer.apple.com/documentation/swift/string https://www.simpleswiftguide.com/get-character-from-string-using-its-index-in-swift/ jlf: no direct access to a character Doesn't work: let input = "Swift Tutorials" let char = input[3] Work: let input = "Swift Tutorials" let char = input[input.index(input.startIndex, offsetBy: 3)] A "workaround" to have direct access extension StringProtocol { subscript(offset: Int) -> Character { self[index(startIndex, offsetBy: offset)] } } Which can be used just like that: let input = "Swift Tutorials" let char = input[3] https://gist.github.com/paultopia/6609780e7b53676b7dfc55736221cd23 paultopia/monkey_patch_slicing_into_string.swift Another "workaround" to have direct access to the characters like that: var s = "here is a boring string" print(s.getCharList()) print(s[1]) print(s[-1]) print(s[0, 5]) print(s[5, 0]) print(s[3...6]) print(s[2..<10]) print(s[...15]) print(s[2...]) print(s[..<15]) https://developer.apple.com/documentation/swift/unicode/canonicalcombiningclass Unicode.CanonicalCombiningClass The classification of a scalar used in the Canonical Ordering Algorithm defined by the Unicode Standard. --- Canonical combining classes are used by the ordering algorithm to determine if two sequences of combining marks should be considered canonically equivalent (that is, identical in interpretation). Two sequences are canonically equivalent if they are equal when sorting the scalars in ascending order by their combining class. --- aboveBeforeBelow = "\u{0041}\u{0301}\u{0316}"~text~unescape belowBeforeAbove = "\u{0041}\u{0316}\u{0301}"~text~unescape aboveBeforeBelow~compareTo(belowBeforeAbove)= -- 0 (good, means equal) aboveBeforeBelow == belowBeforeAbove= -- .true 15/07/2017 String Processing For Swift 4 https://github.com/apple/swift/blob/master/docs/StringManifesto.md https://swift.org/blog/utf8-string/ Swift 5 switches the preferred encoding of strings from UTF-16 to UTF-8 while preserving efficient Objective-C-interoperability. jlf: Search "breadcrumb". Notice that the article is about Swift Objective-C interoperability. The language Swift itself is not allowing random access to characters. 
--- Swift 5, like Rust, performs encoding validation once on creation, when it is far more efficient to do so. NSStrings, which are lazily bridged (zero-copy) into Swift and use UTF-16, may contain invalid content (i.e. isolated surrogates). As in Swift 4.2, these are lazily validated when read from. https://bugs.swift.org/browse/SR-7602 (redirect to next URL) https://github.com/apple/swift/issues/50144 UTF8 should be (one of) the fastest String encoding(s) --- Requirements: being able to copy UTF-8 encoded bytes from a String into a pre-allocated raw buffer must be allocation-free and as fast as memcpy can copy them creating a String from UTF-8 encoded bytes should just validate the encoding and store the bytes as they are (jlf: "and store the bytes as they are" --> YES!) slightly softer but still very strong requirement: currently (even with ASCII) only the stdlib seems to be able to get a pointer to the contiguous ASCII representation (if at all in that form). That works fine if you just want to copy the bytes (UnsafeMutableBufferPointer(start: destinationStart, count: destinationLength).initialize(from: string.utf8) which will use memcpy if in ASCII representation) but doesn't allow you to implement your own algorithms that are only performant on a contiguously stored [UInt8] --- jlf: this comment in the thread is particularly interesting, because it reminds me what was said on the ARB mailing list about byte versus string. https://github.com/apple/swift/issues/50144#issuecomment-1108303710 May 9, 2018 @milseman Virtually all of it comes down to `String(data: myData, encoding: .utf8)` and `myString.data(encoding: .utf8)`. When parsing protocols such as HTTP, Redis, MySQL, PostgreSQL, etc we will read data from the OS into an `UnsafeBufferPointer<UInt8>`. This is almost always via NIO's [`ByteBuffer`](https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html) type. We sometimes grab `String` from that directly or grab `Data` if we want to iterate over the bytes for additional parsing. In other words, from `UnsafePointer<UInt8>` we commonly read `FixedWidthInteger`, `BinaryFloatingPoint`, `Data`, and `String`. All are very performant except String which is the concern since the vast majority of bytes ends up being `String`s. Considering the DB use case specifically, the data transfer is usually emails, names, bios, comments, etc. Very few bytes are actually dedicated to binary numbers or data blobs. Strings everywhere. To summarize, the faster we can get from `Swift.Unsafe...Pointer<UInt8>` or `Foundation.Data` to `String` the better. That will affect (for the better!) quite literally our entire framework. --- jlf: this comment from the same thread shows which questions we should answer for Rexx: https://github.com/apple/swift/issues/50144#issuecomment-1108303720 Along the lines of potentially separable issues, what is your validation story? If the stream of bytes contains invalid UTF-8, do you want: 1) The initializer to fail resulting in nil 2) The initializer to fail producing an error 3) The invalid bytes to be replaced with U+FFFD 4) The bytes verbatim, and experience the emergent behavior / unspecified results / security hazard from those bytes. 
For reference, I think [Rust's model](https://doc.rust-lang.org/std/string/struct.String.html) is pretty good: `from_utf8` produces an error explaining why the code units were invalid `from_utf8_lossy` replaces encoding errors with U+FFFD `from_utf8_unchecked` which takes the bytes, but if there's an encoding error, then memory safety has been violated I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly. We may want to be very cautious about if/how we expose it. I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8. (jlf: I don't understand this last sentence. By "read-time", does he mean "when working with the string"?) milseman Michael Ilseman added a comment - 5 Nov 2018 3:44 PM It's now the fastest encoding. https://forums.swift.org/t/string-s-abi-and-utf-8/17676/1 https://github.com/apple/swift/pull/20315 https://github.com/apple/swift/blob/7e68e8f4a3cb1173e909dc22a3490c05e43fa592/stdlib/public/core/StringObject.swift swift/stdlib/public/core/StringObject.swift jlf: the link above is a frozen link. To have an up-to-date view, go to https://github.com/apple/swift/tree/main/stdlib/public/core A lot of code to review! String.swift StringBreadcrumbs.swift StringBridge.swift StringCharacterView.swift StringComparable.swift StringComparison.swift StringCreate.swift StringGraphemeBreaking.swift jlf: Apparently, there are some difficulties when going backwards. // When walking backwards, it's impossible to know whether we were in an emoji // sequence without walking further backwards. This walks the string backwards // enough until we figure out whether or not to break our // (.zwj, .extendedPictographic) question. // When walking backwards, it's impossible to know whether we break when we // see our first (.regionalIndicator, .regionalIndicator) without walking // further backwards. This walks the string backwards enough until we figure // out whether or not to break these RIs. StringGuts.swift StringGutsRangeReplaceable.swift StringGutsSlice.swift StringHashable.swift StringIndex.swift StringIndexConversions.swift StringIndexValidation.swift StringInterpolation.swift StringLegacy.swift StringNormalization.swift StringObject.swift StringProtocol.swift StringRangeReplaceableCollection.swift StringStorage.swift StringStorageBridge.swift StringSwitch.swift StringTesting.swift StringUTF16View.swift StringUTF8Validation.swift StringUTF8View.swift StringUnicodeScalarView.swift StringWordBreaking.swift Substring.swift https://github.com/apple/swift/blob/main/stdlib/public/core/StringBreadcrumbs.swift Breadcrumb optimization The distance between successive breadcrumbs, measured in UTF-16 code units, is 64. internal static var breadcrumbStride: Int { 64 } jlf: nothing sophisticated here... They scan the whole string by iterating over the UTF-16 indexes, and when i % stride == 0 then self.crumbs.append(curIdx) When searching the offset for a String.Index, they do a binary search. https://github.com/apple/swift/pull/20315/commits/2e368a3f6a25b5e84c0f682861ea0a5c9b3b26af [String] Introduce StringBreadcrumbs Breadcrumbs provide us amortized O(1) access to the UTF-16 view, which is vital for efficient Cocoa interoperability. --- jlf: this is the commit where breadcrumbs are added to Swift (Nov 4, 2018).
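Relating the four-option validation question quoted above (fail with nil, fail with an error, replace with U+FFFD, take the bytes verbatim) to initializers that Swift already ships: a minimal sketch, byte values invented for illustration. String(decoding:as:) takes the "replace with U+FFFD" route and never fails; the Foundation initializer String(data:encoding:) takes the "fail by returning nil" route.
import Foundation
// 0xC3 opens a 2-byte UTF-8 sequence, but 0x28 ("(") is not a continuation byte,
// so the input is ill-formed UTF-8.
let bytes: [UInt8] = [0x61, 0xC3, 0x28, 0x62]            // "a", <invalid>, "(", "b"
// Repairing initializer: the ill-formed subsequence becomes U+FFFD.
let repaired = String(decoding: bytes, as: UTF8.self)
print(repaired.unicodeScalars.map { $0.value })          // [97, 65533, 40, 98]
// Validating initializer: returns nil instead of repairing.
let strict = String(data: Data(bytes), encoding: .utf8)
print(strict == nil)                                     // true
These correspond to options 3) and 1) of the list; option 4), taking the bytes verbatim, is exactly what the quoted comment warns against exposing carelessly.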
https://stackoverflow.com/questions/55389444/whats-does-extended-grapheme-clusters-are-canonically-equivalent-means-in-term Whats does “extended grapheme clusters are canonically equivalent” means in terms of Swift String? jlf: They don't answer the question :-( no explanation of "canonically equivalent", just ONE poor example, no general definition. https://forums.swift.org/t/pitch-unicode-equivalence-for-swift-source/21576/6 Pitch: Unicode Equivalence for Swift Source jlf: interesting Mar 13, 2019 In short, there is a thorough set of rules already laid out in UAX#31 on how to normalize identifiers in programming languages. Several of us have written several versions of a proposal to adopt it, but each time it has failed because of issues with emoji. Recent versions of Unicode now have more robust classifications for emoji, so the proposal can be resurrected with better luck now, probably. No need to start from scratch; feel free to build on the work that we’ve already done. All of this applies only to identifiers. Literals should never be messed with by the compiler. They are, after all, supposed to be literals. 13/06/2021 https://github.com/apple/swift-evolution/blob/main/proposals/0211-unicode-scalar-properties.md Add Unicode Properties to Unicode.Scalar Issues Linking with ICU The Swift standard library uses the system's ICU libraries to implement its Unicode support. A third-party developer may expect that they could also link their application directly to the system ICU to access the functionality that they need, but this proves problematic on both Apple and Linux platforms. Apple On Apple operating systems, libicucore.dylib is built with function renaming disabled (function names lack the _NN version number suffix). This makes it fairly straightforward to import the C APIs and call them from Swift without worrying about which version the operating system is using. Unfortunately, libicucore.dylib is considered to be private API for submissions to the App Store, so applications doing this will be rejected. Instead, users must build their own copy of ICU from source and link that into their applications. This is significant overhead. Linux On Linux, system ICU libraries are built with function renaming enabled (the default), so function names have the _NN version number suffix. Function renaming makes it more difficult to use these APIs from Swift; even though the C header files contain #defines that map function names like u_foo_59 to u_foo, these #defines are not imported into Swift—only the suffixed function names are available. This means that Swift bindings would be fixed to a specific version of the library without some other intermediary layer. Again, this is significant overhead.
extension Unicode.Scalar.Properties { public var isAlphabetic: Bool { get } // Alphabetic public var isASCIIHexDigit: Bool { get } // ASCII_Hex_Digit public var isBidiControl: Bool { get } // Bidi_Control public var isBidiMirrored: Bool { get } // Bidi_Mirrored public var isDash: Bool { get } // Dash public var isDefaultIgnorableCodePoint: Bool { get } // Default_Ignorable_Code_Point public var isDeprecated: Bool { get } // Deprecated public var isDiacritic: Bool { get } // Diacritic public var isExtender: Bool { get } // Extender public var isFullCompositionExclusion: Bool { get } // Full_Composition_Exclusion public var isGraphemeBase: Bool { get } // Grapheme_Base public var isGraphemeExtend: Bool { get } // Grapheme_Extend public var isHexDigit: Bool { get } // Hex_Digit public var isIDContinue: Bool { get } // ID_Continue public var isIDStart: Bool { get } // ID_Start public var isIdeographic: Bool { get } // Ideographic public var isIDSBinaryOperator: Bool { get } // IDS_Binary_Operator public var isIDSTrinaryOperator: Bool { get } // IDS_Trinary_Operator public var isJoinControl: Bool { get } // Join_Control public var isLogicalOrderException: Bool { get } // Logical_Order_Exception public var isLowercase: Bool { get } // Lowercase public var isMath: Bool { get } // Math public var isNoncharacterCodePoint: Bool { get } // Noncharacter_Code_Point public var isQuotationMark: Bool { get } // Quotation_Mark public var isRadical: Bool { get } // Radical public var isSoftDotted: Bool { get } // Soft_Dotted public var isTerminalPunctuation: Bool { get } // Terminal_Punctuation public var isUnifiedIdeograph: Bool { get } // Unified_Ideograph public var isUppercase: Bool { get } // Uppercase public var isWhitespace: Bool { get } // Whitespace public var isXIDContinue: Bool { get } // XID_Continue public var isXIDStart: Bool { get } // XID_Start public var isCaseSensitive: Bool { get } // Case_Sensitive public var isSentenceTerminal: Bool { get } // Sentence_Terminal (S_Term) public var isVariationSelector: Bool { get } // Variation_Selector public var isNFDInert: Bool { get } // NFD_Inert public var isNFKDInert: Bool { get } // NFKD_Inert public var isNFCInert: Bool { get } // NFC_Inert public var isNFKCInert: Bool { get } // NFKC_Inert public var isSegmentStarter: Bool { get } // Segment_Starter public var isPatternSyntax: Bool { get } // Pattern_Syntax public var isPatternWhitespace: Bool { get } // Pattern_White_Space public var isCased: Bool { get } // Cased public var isCaseIgnorable: Bool { get } // Case_Ignorable public var changesWhenLowercased: Bool { get } // Changes_When_Lowercased public var changesWhenUppercased: Bool { get } // Changes_When_Uppercased public var changesWhenTitlecased: Bool { get } // Changes_When_Titlecased public var changesWhenCaseFolded: Bool { get } // Changes_When_Casefolded public var changesWhenCaseMapped: Bool { get } // Changes_When_Casemapped public var changesWhenNFKCCaseFolded: Bool { get } // Changes_When_NFKC_Casefolded public var isEmoji: Bool { get } // Emoji public var isEmojiPresentation: Bool { get } // Emoji_Presentation public var isEmojiModifier: Bool { get } // Emoji_Modifier public var isEmojiModifierBase: Bool { get } // Emoji_Modifier_Base } extension Unicode.Scalar.Properties { // Implemented in terms of ICU's `u_isdefined`. public var isDefined: Bool { get } } Case Mappings The properties below provide full case mappings for scalars. 
Since a handful of mappings result in multiple scalars (e.g., "ß" uppercases to "SS"), these properties are String-valued, not Unicode.Scalar. extension Unicode.Scalar.Properties { public var lowercaseMapping: String { get } // u_strToLower public var titlecaseMapping: String { get } // u_strToTitle public var uppercaseMapping: String { get } // u_strToUpper } Identification and Classification extension Unicode.Scalar.Properties { /// Corresponds to the `Age` Unicode property, when a code point was first /// defined. public var age: Unicode.Version? { get } /// Corresponds to the `Name` Unicode property. public var name: String? { get } /// Corresponds to the `Name_Alias` Unicode property. public var nameAlias: String? { get } /// Corresponds to the `General_Category` Unicode property. public var generalCategory: Unicode.GeneralCategory { get } /// Corresponds to the `Canonical_Combining_Class` Unicode property. public var canonicalCombiningClass: Unicode.CanonicalCombiningClass { get } } extension Unicode { /// Represents the version of Unicode in which a scalar was introduced. public typealias Version = (major: Int, minor: Int) /// General categories returned by /// `Unicode.Scalar.Properties.generalCategory`. Listed along with their /// two-letter code. public enum GeneralCategory { case uppercaseLetter // Lu case lowercaseLetter // Ll case titlecaseLetter // Lt case modifierLetter // Lm case otherLetter // Lo case nonspacingMark // Mn case spacingMark // Mc case enclosingMark // Me case decimalNumber // Nd case letterlikeNumber // Nl case otherNumber // No case connectorPunctuation //Pc case dashPunctuation // Pd case openPunctuation // Ps case closePunctuation // Pe case initialPunctuation // Pi case finalPunctuation // Pf case otherPunctuation // Po case mathSymbol // Sm case currencySymbol // Sc case modifierSymbol // Sk case otherSymbol // So case spaceSeparator // Zs case lineSeparator // Zl case paragraphSeparator // Zp case control // Cc case format // Cf case surrogate // Cs case privateUse // Co case unassigned // Cn } public struct CanonicalCombiningClass: Comparable, Hashable, RawRepresentable { public static let notReordered = CanonicalCombiningClass(rawValue: 0) public static let overlay = CanonicalCombiningClass(rawValue: 1) public static let nukta = CanonicalCombiningClass(rawValue: 7) public static let kanaVoicing = CanonicalCombiningClass(rawValue: 8) public static let virama = CanonicalCombiningClass(rawValue: 9) public static let attachedBelowLeft = CanonicalCombiningClass(rawValue: 200) public static let attachedBelow = CanonicalCombiningClass(rawValue: 202) public static let attachedAbove = CanonicalCombiningClass(rawValue: 214) public static let attachedAboveRight = CanonicalCombiningClass(rawValue: 216) public static let belowLeft = CanonicalCombiningClass(rawValue: 218) public static let below = CanonicalCombiningClass(rawValue: 220) public static let belowRight = CanonicalCombiningClass(rawValue: 222) public static let left = CanonicalCombiningClass(rawValue: 224) public static let right = CanonicalCombiningClass(rawValue: 226) public static let aboveLeft = CanonicalCombiningClass(rawValue: 228) public static let above = CanonicalCombiningClass(rawValue: 230) public static let aboveRight = CanonicalCombiningClass(rawValue: 232) public static let doubleBelow = CanonicalCombiningClass(rawValue: 233) public static let doubleAbove = CanonicalCombiningClass(rawValue: 234) public static let iotaSubscript = CanonicalCombiningClass(rawValue: 240) public let rawValue: 
UInt8 public init(rawValue: UInt8) } } Numerics Many Unicode scalars have associated numeric values. These are not only the common digits zero through nine, but also vulgar fractions and various other linguistic characters and ideographs that have an innate numeric value. These properties are exposed below. They can be useful for determining whether segments of text contain numbers or non-numeric data, and can also help in the design of algorithms to determine the values of such numbers. extension Unicode.Scalar.Properties { /// Corresponds to the `Numeric_Type` Unicode property. public var numericType: Unicode.NumericType? /// Corresponds to the `Numeric_Value` Unicode property. public var numericValue: Double? } extension Unicode { public enum NumericType { case decimal case digit case numeric } } 14/06/2021 https://lists.isocpp.org/sg16/2018/08/0121.php Feedback from swift team Swift strings now sort with NFC (currently UTF-16 code unit order, but likely changed to Unicode scalar value order). We didn't find FCC significantly more compelling in practice. Since NFC is far more frequent in the wild (why waste space if you don't have to), strings are likely to already be in NFC. We have fast-paths to detect on-the-fly normal sections of strings (e.g. all ASCII, all < U+0300, NFC_QC=yes, etc.). We lazily normalize portions of string during comparison when needed. Q: Swift strings support comparison via normalization. Has use of canonical string equality been a performance issue? Or been a source of surprise to programmers? A: This was a big performance issue on Linux, where we used to do UCA+DUCET based comparisons. We switch to lexicographical order of NFC-normalized UTF-16 code units (future: scalar values), and saw a very significant speed up there. The remaining performance work revolves around checking and tracking whether a string is known to already be in a normal form, so we can just memcmp. Q: I'm curious why this was a larger performance issue for Linux than for (presumably) macOS and/or iOS. A: There were two main factors. The first is that on Darwin platforms, CFString had an implementation that we used instead of UCA+DUCET which was faster. The second is that Darwin platforms are typically up-to-date and have very recent versions of ICU. On Linux, we still support Ubuntu LTS 14.04 which has a version of ICU which predates Swift and didn't have any fast-paths for ASCII or mostly-ASCII text. Switching to our own implementation based on NFC gave us many X improvement over CFString, which in turn was many X faster than UCA+DUCET (especially on older versions of ICU). Q: How firmly is the Swift string implementation tied to ICU? If the C++ standard library were to add suitable Unicode support, what would motivate reimplementing Swift strings on top of it? A: Swift's tie to ICU is less firm than it used to be If the C++ standard library provided these operations, sufficiently up-to-date with Unicode version and comparable or better to ICU in performance, we would be willing to switch. A big pain in interacting with ICU is their limited support for UTF-8. Some users who would like to use a lighter-weight Swift and are unhappy at having to link against ICU, as it's fairly large, and it can complicate security audits. https://forums.swift.org/t/pitch-unicode-for-string-processing/56907/6 [Pitch] Unicode for String Processing https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md jlf: surprising intro! 
Swift strings provide an obsessively Unicode-forward model of programming with strings. String processing with Collection's algorithms is woefully inadequate for many day-to-day tasks compared to other popular programming and scripting languages. We propose addressing this basic shortcoming through an effort we are calling regex. https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md Regex Proposals todo: read String processing algorithms https://forums.swift.org/t/pitch-regex-powered-string-processing-algorithms/55969 todo: read Unicode for String Processing https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md https://stackoverflow.com/questions/41059974/german-character-%C3%9F-uppercased-in-ss "ß" is converted to "SS" when using uppercased(). --- Use caseInsensitiveCompare() instead of converting the strings to upper or lowercase: let s1 = "gruß" let s2 = "GRUß" let eq = s1.caseInsensitiveCompare(s2) == .orderedSame print(eq) // true This compares the strings in a case-insensitive way according to the Unicode standard. There is also localizedCaseInsensitiveCompare() which does a comparison according to the current locale, and s1.compare(s2, options: .caseInsensitive, locale: ...) for a case-insensitive comparison according to an arbitrary given locale. https://www.kodeco.com/3418439-encoding-and-decoding-in-swift jlf: off topic, it's not related to strings. It's about serialization of data structures. https://github.com/apple/swift-evolution/blob/main/proposals/0241-string-index-explicit-encoding-offset.md Deprecate String Index Encoded Offsets Feb 23, 2019 jlf: I add this URL for this description, not for the topic covered by this proposal: String abstracts away details about the underlying encoding used in its storage. String.Index is opaque and represents a position within a String or Substring. This can make serializing a string alongside its indices difficult, and for that reason SE-0180 added a computed variable and initializer encodedOffset in Swift 4.0. String was always meant to be capable of handling multiple backing encodings for its contents, and this is realized in Swift 5. String now uses UTF-8 for its preferred “fast” native encoding, but has a resilient fallback for strings of different encodings. Currently, we only use this fall-back for lazily-bridged Cocoa strings, which are commonly encoded as UTF-16, though it can be extended in the future thanks to resilience. Unfortunately, SE-0180’s approach of a single notion of encodedOffset is flawed. A string can be serialized with a choice of encodings, and the offset is therefore encoding-dependent and requires access to the contents of the string to calculate.
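A minimal sketch of why a single encodedOffset was flawed: the same text has a different length in every view, so an offset is meaningless unless you also record which encoding it counts. The string literal is invented for illustration; the emoji is U+1F469 U+200D U+1F52C ("woman scientist").
let s = "a\u{E9}\u{1F469}\u{200D}\u{1F52C}"   // "a" + "é" + woman-scientist emoji
print(s.count)                 // 3   Characters (grapheme clusters)
print(s.unicodeScalars.count)  // 5   Unicode scalar values
print(s.utf16.count)           // 7   UTF-16 code units
print(s.utf8.count)            // 14  UTF-8 code units
// One opaque String.Index is shared by all views; to serialize the position
// of the emoji you have to choose a view and measure the offset in that view.
let i = s.index(s.startIndex, offsetBy: 2)                 // position of the emoji
print(s.utf16.distance(from: s.utf16.startIndex, to: i))   // 2
print(s.utf8.distance(from: s.utf8.startIndex, to: i))     // 3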
https://www.tutorialkart.com/swift-tutorial/swift-read-text-file/#gsc.tab=0 Read text file import Foundation let file = "sample.txt" var result = "" //if you get access to the directory if let dir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first { //prepare file url let fileURL = dir.appendingPathComponent(file) do { result = try String(contentsOf: fileURL, encoding: .utf8) } catch {/* handle if there are any errors */} } print(result) https://www.appsdeveloperblog.com/read-and-write-string-into-a-text-file/ Read and Write String Into a Text File let fileName = "myFileName.txt" var filePath = "" // Find documents directory on device let dirs : [String] = NSSearchPathForDirectoriesInDomains(FileManager.SearchPathDirectory.documentDirectory, FileManager.SearchPathDomainMask.allDomainsMask, true) if dirs.count > 0 { let dir = dirs[0] //documents directory filePath = dir.appending("/" + fileName) print("Local path = \(filePath)") } else { print("Could not find local directory to store file") return } // Set the contents let fileContentToWrite = "Text to be recorded into file" do { // Write contents to file try fileContentToWrite.write(toFile: filePath, atomically: false, encoding: String.Encoding.utf8) } catch let error as NSError { print("An error took place: \(error)") } // Read file content. Example in Swift do { // Read file content let contentFromFile = try NSString(contentsOfFile: filePath, encoding: String.Encoding.utf8.rawValue) print(contentFromFile) } catch let error as NSError { print("An error took place: \(error)") } Testing JMB's example "ς".uppercased() // "Σ" "σ".uppercased() // "Σ" "ὈΔΥΣΣΕΎΣ".lowercased() // "ὀδυσσεύσ" NOT SUPPORTED: the final Σ should become ς "ὈΔΥΣΣΕΎΣA".lowercased() // "ὀδυσσεύσa" last Σ becomes σ (correct, it is not word-final) https://developer.apple.com/documentation/swift/character/isnewline isNewline A Boolean value indicating whether this character represents a newline. For example, the following characters all represent newlines: “\n” (U+000A): LINE FEED (LF) U+000B: LINE TABULATION (VT) U+000C: FORM FEED (FF) “\r” (U+000D): CARRIAGE RETURN (CR) “\r\n” (U+000D U+000A): CR-LF U+0085: NEXT LINE (NEL) U+2028: LINE SEPARATOR U+2029: PARAGRAPH SEPARATOR --- jlf: this is related to Unicode properties of a character. But what are the impacts on file I/O?
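On the file I/O question: reading a file as shown above just yields one String; nothing is split automatically, and Character.isNewline only comes into play when the program asks for it. A minimal sketch, string literal invented for illustration, assuming Swift 5.2+ for the key-path-as-function syntax:
let text = "one\r\ntwo\u{2028}three\nfour"
// "\r\n" is a single Character in Swift, so it counts as one newline.
print(text.filter(\.isNewline).count)            // 3
// Splitting on the property treats CR-LF, LS, LF, NEL, ... uniformly.
print(text.split(whereSeparator: \.isNewline))   // ["one", "two", "three", "four"]
So the property by itself has no effect on reading or writing files; it only classifies characters once the text is in memory.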

Typst lang


https://github.com/typst/typst A new markup-based typesetting system that is powerful and easy to learn. --- jlf: uses ICU4X https://github.com/unicode-org/icu4x/issues/3811

XPath lang


https://www.w3.org/TR/xpath-functions-31/#string-functions Functions on strings jlf: to read. No "grapheme" in this document. Written by Michael Kay (XSLT WG), Saxonica <http://www.saxonica.com/> https://www.w3.org/TR/xpath-functions-31/#string.match String functions that use regular expressions jlf: part of the doc "Functions on strings" above, explicitly referenced for direct access. https://www.w3.org/TR/xpath-functions-31/#func-collation-key Referenced in https://github.com/unicode-org/icu4x/issues/2689#issuecomment-1743127855 hsivonen: I'm quite skeptical of processes that use XPath having the kind of lifetimes and numbers of comparisons that computing a sort key is justified, but whether or not exposing sort keys in XPath is a good idea, it's good to know that XPath has this dependency. faassen: I think the XPath spec (the library portion) has been influenced by the capabilities of ICU4J. The motivation for this facility is described in the "notes" section: https://www.w3.org/TR/xpath-functions-31/#func-collation-key and is basically to use this as a collation-dependent hashmap key. I can't judge myself how useful that is, so I'll defer to your skepticism. I'll note however that this same specification also provides the function library available to XQuery, and with XQuery the lifetimes and numbers of comparisons are likely to be much bigger.

Zig lang, Ziglyph


04/07/2021 https://github.com/jecolon/ziglyph Unicode text processing for the Zig programming language. https://devlog.hexops.com/2021/unicode-data-file-compression/ achieving 40-70% reduction over gzip alone https://github.com/jecolon/ziglyph/issues/3 More size-optimal grapheme cluster sorting 08/02/2023 https://github.com/natecraddock/zf a commandline fuzzy finder that prioritizes matches on filenames To review: uses ziglyph https://github.com/jecolon/ziglyph/issues/20 Grapheme segmentation with ZWJ sequences 10/02/2023 https://github.com/jecolon/ziglyph/issues/20 Grapheme segmentation with ZWJ sequences --- jlf: Executor is ok with utf8proc t = "🐻‍❄️🐻‍❄️"~text t~description= -- 'UTF-8 not-ASCII (2 graphemes, 8 codepoints, 26 bytes, 0 error)' t~characters== an Array (shape [8], 8 items) 1 : ( "🐻" U+1F43B So 2 "BEAR FACE" ) 2 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 3 : ( "❄" U+2744 So 1 "SNOWFLAKE" ) 4 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) 5 : ( "🐻" U+1F43B So 2 "BEAR FACE" ) 6 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 7 : ( "❄" U+2744 So 1 "SNOWFLAKE" ) 8 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) https://devlog.hexops.com/2021/unicode-sorting-why-browsers-added-special-emoji-matching/ Whether your application is in Go and has its own Unicode Collation Algorithm (UCA) implementation, or Rust and uses bindings to the popular ICU4C library - one thing is going to remain true: it requires large data files to work. The UCA algorithm depends on two quite large data table files to work: - UnicodeData.txt for normalization, a step required before sorting can take place. - allkeys.txt for weighting certain text above others. - And more, if you want truly locale-aware sorting and not just "the default" the UCA algorithm gives you. Together, these files can add up to over a half a megabyte. While WASM languages could shell out to JavaScript browser APIs for collation, I suspect they won't due to the lack of guarantees around those APIs. A more likely scenario is languages continuing to leave locale-aware sorting as an optional, opt-in feature - that also makes your application larger. I think this is a worthwhile problem to solve, so I am working on compression algorithms for these files specifically in Zig to reduce them to only a few tens of kilobytes. https://github.com/jecolon/ziglyph/issues/3

Knock, knock.


Knock, knock. Who’s there? You. You who? Yoo-hoo! It's You Nicode. Knock, knock. Who’s there? Sue. Sue who? It's Sue Nicode.