Accumulation of URLs about Unicode
Contents:
Unicode standard
Unicode general informations
U+ notation, Unicode escape sequence
Security title
Segmentation, Grapheme
Normalization, equivalence
Character set
String matching - Lower vs Casefold
String matching - Collation
Locale
CLDR Common Locale Data Repository
Case mappings
Collation, sorting
BIDI title
Emoji
Countries, flags
Evidence of partial or wrong support of Unicode
Optimization, SIMD
Variation sequence
Whitespaces, separators
Hyphenation
DNS title, Domain Name title, Domain Name System title
All languages
Classical languages
Arabic language
Indic languages
CJK
Korean
Japanese
Polish
IME - Input Method Editor
Text editing
Text rendering, Text shaping library
String Matching
Fuzzy String Matching
Levenshtein distance and string similarity
String comparison
JSON
TOML serialization format
CBOR Concise Binary Representation
Binary encoding in Unicode
Invalid format
Mojibake
Filenames
WTF8
Codepoint/grapheme indexation
Rope
Encoding title
ICU title
ICU demos
ICU bindings
ICU4X title
utf8proc title
Twitter text parsing
terminal / console / cmd
QT Title
IBM OS
IBM RPG Lang
IBM z/OS
macOS OS
Windows OS
Language comparison
Regular expressions
Test cases, test-cases, tests files
font bold, italic, strikethrough, underline, backwards, upside down
youtube
xxx lang
Ada lang
Awk lang
C++ lang, cpp lang, Boost
cRexx lang
DotNet, CoreFx
Dafny lang
Dart lang
Elixir lang
Factor lang
Fortran lang
GO lang
jRuby lang
Java lang
JavaScript lang
Julia lang
Kotlin lang
Lisp lang
Mathematica lang
netrexx lang
Oracle
Perl lang (Perl 6 has been renamed to Raku)
PHP lang
Python lang
R lang
RAKU lang Rakudo lang (Perl6, Perl 6, MOAR-VM)
Rexx lang
Ruby lang
Rust lang
Saxon lang
SQL lang
Swift lang
Typst lang
XPath lang
Zig lang, Ziglyph
Knock, knock.
Unicode standard
Remember
Don't know why, but the Unicode consortium has 2 different URLs:
https://unicode.org/
https://www.unicode.org/
To avoid doubling URLs, I use the 2nd form.
https://home.unicode.org/
https://www.unicode.org/ (same as home.unicode.org)
https://www.unicode.org/versions/
https://www.unicode.org/versions/latest/ (latest version)
https://www.unicode.org/versions/enumeratedversions.html (current and previous versions)
https://www.unicode.org/Public/ (data for current and previous versions)
https://www.unicode.org/ucd/
UCD = Unicode Character Database
https://www.unicode.org/Public/MAPPINGS (ISO8859)
These tables are considered to be authoritative mappings
between the Unicode Standard and different parts of
the ISO/IEC 8859 standard.
https://www.unicode.org/faq/specifications.html
https://www.unicode.org/reports/
Unicode® Technical Reports
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard,
but is published as a separate document. The Unicode Standard may require
conformance to normative content in a Unicode Standard Annex, if so specified
in the Conformance chapter of that version of the Unicode Standard.
A Unicode Technical Standard (UTS) is an independent specification.
Conformance to the Unicode Standard does not imply conformance to any UTS.
A Unicode Technical Report (UTR) contains informative material.
Conformance to the Unicode Standard does not imply conformance to any UTR.
Other specifications, however, are free to make normative references to a UTR.
Unicode Standard Annex (UAX)
UAX #9, The Unicode Bidirectional Algorithm
https://www.unicode.org/reports/tr9/
UAX #11, East Asian Width
https://www.unicode.org/reports/tr11/
UAX #14, Unicode Line Breaking Algorithm
https://www.unicode.org/reports/tr14/
UAX #15, Unicode Normalization Forms
https://www.unicode.org/reports/tr15/
UAX #24, Unicode Script Property
https://www.unicode.org/reports/tr24/
UAX #29, Unicode Text Segmentation
https://www.unicode.org/reports/tr29/
UAX #31, Unicode Identifier and Pattern Syntax
https://www.unicode.org/reports/tr31/
UAX #34, Unicode Named Character Sequences
https://www.unicode.org/reports/tr34/
UAX #38, Unicode Han Database (Unihan)
https://www.unicode.org/reports/tr38/
UAX #41, Common References for Unicode Standard Annexes
https://www.unicode.org/reports/tr41/
UAX #42, Unicode Character Database in XML
https://www.unicode.org/reports/tr42/
UAX #44, Unicode Character Database
https://www.unicode.org/reports/tr44/
UAX #45, U-Source Ideographs
https://www.unicode.org/reports/tr45/
UAX #50, Unicode Vertical Text Layout
https://www.unicode.org/reports/tr50/
Unicode Technical Standard (UTS)
UTS #22, UNICODE CHARACTER MAPPING MARKUP LANGUAGE (CharMapML)
https://www.unicode.org/reports/tr22/
This document specifies an XML format for the interchange of mapping data
for character encodings, and describes some of the issues connected with the
use of character conversion.
https://www.unicode.org/glossary
Code Point.
(1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF.
Not all code points are assigned to encoded characters. See code point type.
(2) A value, or position, for a character, in any coded character set.
Code Unit.
The minimal bit combination that can represent a unit of encoded text for processing or interchange.
The Unicode Standard uses 8-bit code units in the UTF-8 encoding form,
16-bit code units in the UTF-16 encoding form,
and 32-bit code units in the UTF-32 encoding form.
Unicode Scalar Value.
Any Unicode code point except high-surrogate and low-surrogate code points.
In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF inclusive.
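A small Python sketch (standard library only) of the code point / code unit distinction defined above; the sample character U+1D11E is my own choice:
    # one code point (one scalar value) seen through the three encoding forms
    s = "\U0001D11E"                          # MUSICAL SYMBOL G CLEF
    print(len(s.encode("utf-8")))             # 4 -> four 8-bit code units
    print(len(s.encode("utf-16-le")) // 2)    # 2 -> two 16-bit code units (a surrogate pair)
    print(len(s.encode("utf-32-le")) // 4)    # 1 -> one 32-bit code unit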
UNICODE COLLATION ALGORITHM
Unicode has an official string collation algorithm called UCA
https://www.unicode.org/reports/tr10/
https://www.unicode.org/reports/tr10/#S2.1.1
The Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table,
containing mapping data for characters. It produces a sort key, which is an array of
unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared
to give the correct comparison between the strings for which they were generated.
08/06/2021
Default Unicode Collation Element Table (DUCET)
For the latest version, see:
https://www.unicode.org/Public/UCA/latest/allkeys.txt
---
UTS10-D1. Collation Weight: A non-negative integer used in the UCA to establish
a means for systematic comparison of constructed sort keys.
UTS10-D2. Collation Element: An ordered list of collation weights.
UTS10-D3. Collation Level: The position of a collation weight in a collation element.
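A sketch of the sort-key mechanism described above, assuming the PyICU binding is installed (the en_US locale choice is arbitrary):
    from icu import Collator, Locale          # assumption: PyICU is available
    coll = Collator.createInstance(Locale("en_US"))
    k1 = coll.getSortKey("côte")              # bytes; two keys are binary-comparable
    k2 = coll.getSortKey("coté")
    print(k1 < k2)                            # same ordering as comparing the strings...
    print(coll.compare("côte", "coté") < 0)   # ...which compare() gives without building keys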
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
UNICODE NORMALIZATION FORMS
https://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax
https://www.unicode.org/reports/tr31/
UNICODE IDENTIFIER AND PATTERN SYNTAX
jlf: there is ONE (just ONE) occurrence of NFKC_CF:
Comparison and matching should be done after converting to NFKC_CF format.
Thus #MötleyCrüe should match #MÖTLEYCRÜE and other variants.
---
In the UnicodeStandard PDF:
- The mapping NFKC_Casefold (short alias NFKC_CF) is specified in the data
file DerivedNormalizationProps.txt in the Unicode Character Database.
- The derived binary property Changes_When_NFKC_Casefolded is also listed
in the data file DerivedNormalizationProps.txt in the Unicode Character Database.
Conformance 156 3.13 Default Case Algorithms
For more information on the use of NFKC_Casefold and caseless matching for identifiers,
see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax
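A rough standard-library approximation of that NFKC_Casefold comparison (the real toNFKC_Casefold also removes default-ignorable code points, which this sketch does not):
    import unicodedata
    def nfkc_cf_approx(s):
        # NFKC, then full case folding, then NFKC again
        return unicodedata.normalize("NFKC", unicodedata.normalize("NFKC", s).casefold())
    print(nfkc_cf_approx("#MötleyCrüe") == nfkc_cf_approx("#MÖTLEYCRÜE"))   # True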
https://www.unicode.org/reports/tr51/
Unicode emoji
23/05/2021
https://www.unicode.org/notes/tn28/
UNICODEMATH, A NEARLY PLAIN-TEXT ENCODING OF MATHEMATICS
Sample formulas from the note: 𝑎𝑏𝑐/𝑑, (𝑎 + 𝑐)/𝑑, (𝑎 + 𝑏)^𝑛 = ∑_𝑘 (𝑛¦𝑘) 𝑎^𝑘 𝑏^(𝑛−𝑘)
https://www.unicode.org/notes/tn5/
Unicode Technical Note #5
CANONICAL EQUIVALENCE IN APPLICATIONS
https://icu.unicode.org/design/normalizing-to-shortest-form
Canonically Equivalent Shortest Form (CESF)
This is usually, but not always, the NFC form.
Conformance
https://github.com/unicode-org/conformance
This repository provides tools and procedures for verifying that an
implementation is working correctly according to the data-based specifications.
The tests are implemented on several platforms including NodeJS (JavaScript),
ICU4X (RUST), ICU4C, etc.
Data Driven Test was initiated in 2022 at Google.
The first release of the package was delivered in October, 2022.
https://www.unicode.org/main.html
Unicode® Technical Site
https://www.unicode.org/faq/
https://www.unicode.org/faq/char_combmark.html
Characters and Combining Marks
https://codepoints.net/
Very detailed description of each character
Source of the WEB site:
https://github.com/Codepoints/codepoints.net
https://util.unicode.org/UnicodeJsps/
Lots of information about a character.
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/UTF-32
http://xahlee.info/comp/unicode_index.html
http://xahlee.info/comp/unicode_invert_text.html
Inverted text: :ʇxǝʇ pǝʇɹǝʌuI
http://xahlee.info/comp/unicode_animals.html
T-REXX: 🦖
https://www.fontspace.com/unicode/analyzer
https://www.compart.com/en/unicode/
22/05/2021
https://onlineunicodetools.com/
Online Unicode tools is a collection of useful browser-based utilities for manipulating Unicode text.
28/05/2021
https://unicode.scarfboy.com/
Search tool
Provides plenty of information about Unicode characters
but no UTF-16 encoding
https://unicode-table.com/en/ search by name
Provides the UTF-16 encoding
https://www.minaret.info/test/menu.msp
Minaret Unicode Tests
Case Folding
Character Type
Collation
Normalization
Sorting
Transliteration
https://www.gosecure.net/blog/2020/08/04/unicode-for-security-professionals/
Unicode for Security Professionals
by Philippe Arteau | Aug 4, 2020
jlf: this article covers many Unicode characteristics
https://github.com/bits/UTF-8-Unicode-Test-Documents
Every Unicode character / codepoint in files and a file generator
http://www.ltg.ed.ac.uk/~richard/utf-8.html
lets you convert UTF-8 to code points + symbolic names
https://blog.lunatech.com/posts/2009-02-03-what-every-web-developer-must-know-about-url-encoding
https://mothereff.in/utf-8
UTF-8 encoder/decoder
https://corp.unicode.org/pipermail/unicode/
The Unicode Archives
January 2, 2014 - current
https://www.unicode.org/mail-arch/unicode-ml/
March 21, 2001 - April 2, 2020
https://www.unicode.org/mail-arch/unicode-ml/Archives-Old/
October 11, 1994 - March 19, 2001
https://www.unicode.org/search/
Search Unicode.org
https://www.w3.org/TR/charmod/
Character Model for the World Wide Web 1.0: Fundamentals
https://www.johndcook.com/blog/2021/11/01/number-sets-html/
Number sets in HTML and Unicode
ℕ U+2115
ℤ U+2124
ℚ U+211A
ℝ U+211D
ℂ U+2102
ℍ U+210D
https://gregtatum.com/writing/2021/encoding-text-utf-32-utf-16-unicode/
https://gregtatum.com/writing/2021/encoding-text-utf-8-unicode/
https://lwn.net/Articles/667669/
Is the current Unicode design impractical?
jlf: this link is also in the section Raku Lang because it's about Perl6.
jlf: worth reading.
https://www.sciencedirect.com/science/article/pii/S1742287613000595
Unicode search of dirty data.
This paper discusses problems arising in digital forensics with regard to Unicode,
character encodings, and search. It describes how multipattern search can handle
the different text encodings encountered in digital forensics and a number of issues
pertaining to proper handling of Unicode in search patterns. Finally, we demonstrate
the feasibility of the approach and discuss the integration of our developed search
engine, lightgrep, with the popular bulk_extractor tool.
---
There are UTF-16LE strings which contain completely different UTF-8 strings as prefixes.
For example the byte sequence which is “nonsense” in UTF-8 is 潮獮湥敳 in UTF-16LE (!)
"nonsense"~c2x= -- '6E6F6E73656E7365'
"nonsense"~text("utf16be")~c2x= -- '6E6F 6E73 656E 7365'
"nonsense"~text("utf16be")~c2u= -- 'U+6E6F U+6E73 U+656E U+7365'
"nonsense"~text("utf16be")~utf8= -- T'湯湳敮獥' Le potage
"nonsense"~text("utf16le")~c2x= -- '6E6F 6E73 656E 7365'
"nonsense"~text("utf16le")~c2u= -- 'U+6F6E U+736E U+6E65 U+6573'
"nonsense"~text("utf16le")~utf8= -- T'潮獮湥敳' marée
https://github.com/simsong/bulk_extractor
http://t-a-w.blogspot.com/2008/12/funny-characters-in-unicode.html
SKULL AND CROSSBONES
SNOWMAN
POSTAL MARK FACE
APL FUNCTIONAL SYMBOL TILDE DIAERESIS
ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM
ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
THAI CHARACTER KHOMUT
GLAGOLITIC CAPITAL LETTER SPIDERY HA
VERY MUCH GREATER-THAN
NEITHER LESS-THAN NOR GREATER-THAN
HEAVY BLACK HEART
FLORAL HEART BULLET, REVERSED ROTATED
INTERROBANG
𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF
𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T
𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO
𠂊 (U+2008A) Han Character
https://www.unicode.org/udhr/
UDHR in Unicode
The goal of the UDHR in Unicode project is to demonstrate the use of Unicode
for a wide variety of languages, using the Universal Declaration of Human Rights
(UDHR) as a representative text.
https://github.com/jagracey/Awesome-Unicode
Awesome Unicode
https://cldr.unicode.org/index/charts
CLDR Charts
By-Type Chart: Numbers:Symbols
Question
I am using the following code excerpt to format numbers:
LocalizedNumberFormatter lnFmt = NumberFormatter.withLocale(Locale.US).unit(MeasureUnit.CELSIUS).unitWidth(NumberFormatter.UnitWidth.SHORT);
System.out.println(lnFmt.format(-10).toString());
In the resulting string, minus sign is represented as 0x2d (ASCII HYPHEN-MINUS).
Shouldn't it be U+2212 (Unicode MINUS SIGN)?
Answer
You can see the minus sign symbol being used for each locale here:
https://unicode-org.github.io/cldr-staging/charts/latest/by_type/numbers.symbols.html#2f08b5ebf85e1e8b
U+2212 is used in: ·fa· ·ps· ·uz_Arab· ·eo· ·et· ·eu· ·fi· ·fo· ·gsw· ·hr· ·kl· ·ksh· ·lt· ·nn· ·no· ·rm· ·se· ·sl· ·sv·
Question
Where this list of locales was taken from?
I am particularly interested in ‘ru’: why is U+2212 not used for it?
https://stackoverflow.com/questions/10143836/why-is-there-no-utf-24
Why is there no UTF-24? [duplicate]
Well, the truth is : UTF-24 was suggested in 2007 :
https://www.unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html
Possible Duplicate:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?
https://stackoverflow.com/questions/6339756/why-utf-32-exists-whereas-only-21-bits-are-necessary-to-encode-every-character
https://unicodebook.readthedocs.io/
Book "Programming with Unicode"
2010-2011, Victor Stinner
jlf: only one occurrence of the word "grapheme".
Maybe at that time, it was not obvious that it would become an important concept.
https://mcilloni.ovh/2023/07/23/unicode-is-hard/
Unicode is harder than you think
23 Jul 2023
---
jlf: good overview, with some ICU samples.
https://www.kermitproject.org/utf8.html
UTF-8 SAMPLER
Last update: Sun Mar 12 14:21:05 2023
http://www.inter-locale.com/whitepaper/learn/learn-to-test.html
International Testing Basics
Testing non-English and non-ASCII (and/or Unicode) support in a product requires
tests and test plans that exercise the edge cases in the software.
https://www.youtube.com/watch?v=gd5uJ7Nlvvo
Plain Text - Dylan Beattie - NDC Copenhagen 2022
---
jlf: many comments say it's a good talk
did not watch
todo: watch
https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/
Unicode, UTF8 & Character Sets: The Ultimate Guide
jlf: maybe to read
https://tonsky.me/blog/unicode/
The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)
https://news.ycombinator.com/item?id=37735801
What every software developer must know about Unicode in 2023
jlf: nothing new in this article, just reusing info from other sites.
jlf: did not read all the comments
https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode
This article covers all Unicode characters with a derived property of "Math".
U+ notation, Unicode escape sequence
29/05/2021
https://stackoverflow.com/questions/1273693/why-is-u-used-to-designate-a-unicode-code-point/8891355
The Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a unicode character denoted by eight hex digits
\N{name} Character named name in the Unicode database
\uxxxx Character with 16-bit hex value xxxx. Exactly four hex digits are required.
\Uxxxxxxxx Character with 32-bit hex value xxxxxxxx. Exactly eight hex digits are required.
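A quick Python illustration of those literal forms:
    print("\u00e9")                                  # é (exactly four hex digits)
    print("\U0001F600")                              # 😀 (exactly eight hex digits)
    print("\N{LATIN SMALL LETTER E WITH ACUTE}")     # é, looked up by Unicode name
    print("\u00e9" == "\N{LATIN SMALL LETTER E WITH ACUTE}")   # True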
https://www.perl.com/article/json-unicode-and-perl-oh-my-/
Its \uXXXX escapes support only characters within Unicode’s BMP;
to store emoji or other non-BMP characters you either have to encode to UTF-8 directly,
or indicate a UTF-16 surrogate pair in \uXXXX escapes.
https://corp.unicode.org/pipermail/unicode/2021-April/009410.html
Need reference to good ABNF for \uXXXX syntax
https://bit.ly/UnicodeEscapeSequences
Unicode Escape Sequences Across Various Languages and Platforms
Security title
https://www.unicode.org/reports/tr39
UNICODE SECURITY MECHANISMS
https://www.unicode.org/Public/security/latest/confusables.txt
https://en.wikipedia.org/wiki/Homoglyph
https://www.trojansource.codes/
https://api.mtr.pub/vhf/confusable_homoglyphs
https://util.unicode.org/UnicodeJsps/confusables.jsp
https://www.w3.org/TR/charmod-norm/#normalizationLimitations
Confusable characters:
"ΡРP"~text~characters==
an Array (shape [3], 3 items)
1 : ( "Ρ" U+03A1 Lu 1 "GREEK CAPITAL LETTER RHO" )
2 : ( "Р" U+0420 Lu 1 "CYRILLIC CAPITAL LETTER ER" )
3 : ( "P" U+0050 Lu 1 "LATIN CAPITAL LETTER P" )
These confusable characters are not impacted by the lump option:
"ΡРP"~text~nfc(lump:)~characters -- same result
https://www.unicode.org/reports/tr36/#visual_spoofing
UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr55/
Draft Unicode® Technical Standard #55
UNICODE SOURCE CODE HANDLING
---
While the normative material for computer language specifications is part of the
Unicode Standard, in Unicode Standard Annex #31, Unicode Identifiers and Syntax
[UAX31], the algorithms specific to the display of source code or to higher-level
diagnostics are specified in this document.
Note: While, for the sake of brevity, many of the examples in this document make
use of non-ASCII identifiers, most of the issues described here apply even if
non-ASCII characters are confined to strings and comments.
---
3.1.1 Normalization and Case
Case-insensitive languages should meet requirement UAX31-R4 with normalization
form KC, and requirement UAX31-R5 with full case folding. They should ignore
default ignorable code points in comparison. Conformance with these requirements
and ignoring of default ignorable code points may be achieved by comparing
identifiers after applying the transformation toNFKC_Casefold.
Note: Full case folding is preferable to simple case folding, as it better
matches expectations of case-insensitive equivalence.
The choice between Normalization Form C and Normalization Form KC should match
expectations of identifier equivalence for the language.
In a case-sensitive language, identifiers are the same if and only if they look
the same, so Normalization Form C (canonical equivalence) is appropriate, as
canonical equivalent sequences should display the same way.
In a case-insensitive language, the equivalence relation between identifiers is
based on a more abstract sense of character identity; for instance, e and E are
treated as the same letter. Normalization Form KC (compatibility equivalence) is
an equivalence between characters that share such an abstract identity.
Example: In a case-insensitive language, SO and so are the same identifier; if
that language uses Normalization Form KC, the identifiers so and 𝖘𝖔 are likewise
identical.
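The UTS #55 example above, replayed in standard Python with NFKC plus full case folding (a simplification of toNFKC_Casefold):
    import unicodedata
    def ident_key(s):
        return unicodedata.normalize("NFKC", s).casefold()
    print(ident_key("SO") == ident_key("so"))   # True
    print(ident_key("so") == ident_key("𝖘𝖔"))   # True: the styled letters NFKC-map to plain s, o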
Unicode 15.1
[icu-design] ICU 74 API proposal: bidiSkeleton and LTR- and RTL-confusabilities
The Source Code Working Group, a limited-duration working group under the
Properties & Algorithms Group of the Unicode Technical Committee, has added a
new bidi-aware concept of confusability to UTS #39 in Unicode Version 15.1;
until publication see the proposed update,
https://www.unicode.org/reports/tr39/tr39-27.html#Confusable_Detection.
The new UTS #55, Unicode Source Code Handling, to be published simultaneously
with Unicode Version 15.1, recommends the use of this new kind of confusability:
https://www.unicode.org/reports/tr55/tr55-2.html#Confusable-Detection.
https://semanticdiff.com/blog/pull-request-unicode-tricks/
Unicode tricks in pull requests: Do review tools warn us?
Segmentation, Grapheme
29/05/2021
https://github.com/alvinlindstam/grapheme
https://pypi.org/project/grapheme/
Here too, he says that CR+LF is a grapheme...
Same here:
https://www.reddit.com/r/programming/comments/m274cg/til_rn_crlf_is_a_single_grapheme_cluster/
https://www.unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters
01/06/2021
https://halt.software/optimizing-unicodes-grapheme-cluster-break-algorithm/
They claim this improvement:
For the simple data set, this was 0.38 of utf8proc time.
For the complex data set, this was 0.56 of utf8proc time.
01/06/2021
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/
GraphemeCursor Cursor-based segmenter for grapheme clusters.
GraphemeIndices External iterator for grapheme clusters and byte offsets.
Graphemes External iterator for a string's grapheme clusters.
USentenceBoundIndices External iterator for sentence boundaries and byte offsets.
USentenceBounds External iterator for a string's sentence boundaries.
UWordBoundIndices External iterator for word boundaries and byte offsets.
UWordBounds External iterator for a string's word boundaries.
UnicodeSentences An iterator over the substrings of a string which, after splitting
the string on sentence boundaries, contain any characters with the Alphabetic
property, or with General_Category=Number.
UnicodeWords An iterator over the substrings of a string which, after splitting
the string on word boundaries, contain any characters with the Alphabetic
property, or with General_Category=Number.
https://github.com/knighton/unicode
Minimalist Unicode normalization/segmentation library. Python and C++.
Abandoned, last commit 21/05/2015
https://hsivonen.fi/string-length/
First published: 2019-09-08
It’s Not Wrong that "🤦🏼♂️".length == 7
But It’s Better that "🤦🏼♂️".len() == 17 and Rather Useless that len("🤦🏼♂️") == 5
But I Want the Length to Be 1!
jlf:
"🤦🏼♂️"~text~length= -- 1
"🤦🏼♂️"~text~characters==
an Array (shape [5], 5 items)
1 : ( "🤦" U+1F926 So 2 "FACE PALM" )
2 : ( "🏼" U+1F3FC Sk 2 "EMOJI MODIFIER FITZPATRICK TYPE-3" )
3 : ( "" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" )
4 : ( "♂" U+2642 So 1 "MALE SIGN" )
5 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" )
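The counts reported in that article can be reproduced in standard Python (len counts code points; the other two count UTF-8 bytes and UTF-16 code units):
    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # 🤦🏼‍♂️
    print(len(s))                           # 5  code points
    print(len(s.encode("utf-8")))           # 17 UTF-8 code units (bytes)
    print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units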
07/06/2021
https://news.ycombinator.com/item?id=20914184
String lengths in Unicode
Claude Roux
We went through a lot of pain to get this right in Tamgu ( https://github.com/naver/tamgu ).
In particular, emojis can be encoded across 5 or 6 Unicode characters.
A "black thumb up" is encoded with 2 Unicode characters: the thumb glyph and its color.
This comes at a cost. Every time you extract a sub-string from a string,
you have to scan it first for its codepoints, then convert character positions
into byte positions. One way to speed up stuff a bit, is to check if the string
is in ASCII (see https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u )
and apply regular operator then.
We implemented many techniques based on "intrinsics" instructions to speed up
conversions and search in order to avoid scanning for codepoints.
See https://github.com/naver/tamgu/blob/master/src/conversion.cxx for more information.
https://github.com/naver/tamgu/wiki/4.-Speed-up-UTF8-string-processing-with-Intel's-%22intrinsics%22-instructions-(en)
jlf: they have specific support for Korean... Probably because the NAVER company is from the Republic of Korea?
08/06/2021
https://twitter.com/hashtag/tamgu?src=hashtag_click
https://twitter.com/hashtag/TAL?src=hashtag_click
#tamgu, the programming language (#langage_de_programmation) for Natural Language Processing (#TAL, Traitement Automatique des Langues).
jlf 30/09/2021
I have a doubt about this:
Is '👩👨👩👧' really a grapheme?
When moving the cursor in BBEdit, I see a boundary between each character.
[later]
Ok, when moving the cursor in Visual Studio Code, it's really a unique grapheme, no way to put the cursor "inside".
And the display is aligned with what I see in Google Chrome :
one WOMAN followed by a family, and no way to put the cursor between the WOMAN and the family.
---
https://www.unicode.org/review/pr-27.html (old, talk about Unicode 4)
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries (todo: review occurences of ZWJ)
29/10/2021
https://h3manth.com/posts/unicode-segmentation-in-javascript/
https://github.com/tc39/proposal-intl-segmenter
https://news.ycombinator.com/item?id=21690326
Tailored grapheme clusters
Grapheme clusters are locale-dependent, much like string collation is locale-dependent.
What Unicode gives you by default, the (extended) grapheme cluster, is as useful as
the DUCET (Default Unicode Collation Element Table); while you can live with them,
you would be unsatisfied. In fact there are tons of Unicode bugs that can't be corrected
for compatibility reasons, and can only be fixed via tailored locale-dependent schemes.
---
Hangul normalization and collation is broken in Unicode, albeit for slightly different reasons.
The Unicode Collation Algorithm explicitly devotes two sections related to Hangul; the first section, for "trailing weights" [1], is recommended for the detailed explanation.
The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3]
require the tailoring to grapheme clusters. Depending on the view, you can also consider
orthographic digraphs as examples (Dutch "ij" is sometimes considered a single character for example).
https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
What's the difference between a character, a code point, a glyph and a grapheme?
jlf: not very good...
https://github.com/clipperhouse/words
words is a command which splits strings into individual words, as defined by Unicode.
It accepts text from stdin, and writes one word (token) per line to stdout.
https://www.unicode.org/reports/tr29/#Random_Access
jlf: Executor uses indexers for random access (ako breadcrumbs).
Random access introduces a further complication. When iterating through a string
from beginning to end, a regular expression or state machine works well. From
each boundary to find the next boundary is very fast. By constructing a state
table for the reverse direction from the same specification of the rules,
reverse iteration is possible.
However, suppose that the user wants to iterate starting at a random point in
the text, or detect whether a random point in the text is a boundary. If the
starting point does not provide enough context to allow the correct set of rules
to be applied, then one could fail to find a valid boundary point. For example,
suppose a user clicked after the first space after the question mark in
“Are␣you␣there?␣ ␣No,␣I’m␣not”. On a forward iteration searching for a sentence
boundary, one would fail to find the boundary before the “N”, because the “?”
had not been seen yet.
A second set of rules to determine a “safe” starting point provides a solution.
Iterate backward with this second set of rules until a safe starting point is
located, then iterate forward from there. Iterate forward to find boundaries
that were located between the safe point and the starting point; discard these.
The desired boundary is the first one that is not less than the starting point.
The safe rules must be designed so that they function correctly no matter what
the starting point is, so they have to be conservative in terms of finding
boundaries, and only find those boundaries that can be determined by a small
context (a few neighboring characters).
This process would represent a significant performance cost if it had to be
performed on every search. However, this functionality can be wrapped up in an
iterator object, which preserves the information regarding whether it currently
is at a valid boundary point. Only if it is reset to an arbitrary location in
the text is this extra backup processing performed. The iterator may even cache
local values that it has already traversed.
Unicode 15.1
New rule GB9c for grapheme segmentation.
https://www.unicode.org/reports/tr29/
---
No longer available:
https://www.unicode.org/reports/tr29/proposed.html
---
jlf: saw this review note
"the new rule GB9c has been implemented in CLDR and ICU as a profile for some years"
What is a profile?
---
This specification defines default mechanisms; more sophisticated implementations
can and should tailor them for particular locales or environments and, for the
purpose of claiming conformance, document the tailoring in the form of a profile.
...
Note that a profile can both add and remove boundary positions, compared to the
results specified by UAX29-C1-1, UAX29-C2-1, or UAX29-C3-1.
https://github.com/unicode-org/lstm_word_segmentation
Python code for training an LSTM model for word segmentation in Thai, Burmese,
and similar languages.
Normalization, equivalence
https://www.unicode.org/faq/normalization.html
Normalization FAQ
https://www.macchiato.com/unicode-intl-sw/nfc-faq
NFC FAQ
jlf: MUST READ!
https://www.unicode.org/reports/tr15
UNICODE NORMALIZATION FORMS
26/11/2013
Text normalization in Go
https://blog.golang.org/normalization
27/11/2013
The string type is broken
https://mortoray.com/2013/11/27/the-string-type-is-broken/
https://news.ycombinator.com/item?id=6807524
https://www.reddit.com/r/programming/comments/1rkdip/the_string_type_is_broken/
In the comments
Objective-C’s NSString type does correctly upper-case baffle into BAFFLE.
(where the rectangle is a grapheme showing 2 small 'f')
Q: What about getting the first three characters of “baffle”? Is “baf” the correct answer?
A: That’s a good question. I suspect “baf” is the correct answer, and I wonder if there is any library that does it.
I suspect if you normalize it first (since the ffl would disappear I think).
A: The ligature disappears in NFK[CD] but not in NF[CD].
Whether normalization to NFK[CD] is a good idea depends (as always) on the situation.
For visual grapheme cluster counting, one would convert the entire text to NFKC.
For getting teaser text from an article I would not apply a normalization step
and let a ligature count as just one grapheme cluster even if it may resemble three of them logically.
I assume that articles are stored in NFC (the nondestructive normalization form with the smallest memory footprint).
The Unicode standard does not treat ligatures as containing more than one grapheme cluster for the normalization forms that permit them.
So “efflab” (jlf: efflab) is the correct result of reversing “baffle” (jlf: baffle)
and “baffle”[2] has to return “ffl” even when working on the grapheme cluster level!
There may or may not be a need for another grapheme cluster definition that permits splitting of ligatures in NF[CD].
A straightforward way to implement a reverse function adhering to that special definition would be to NFKC each Unicode grapheme cluster on the fly.
When that results in multiple Unicode grapheme clusters, those are used – else the original is preserved (so that “ℕ” does not become “N”).
The real problem is to find a good name for that special interpretation of a grapheme cluster…
Note :
see also the comment of Tom Christiansen about casing.
I don't copy-paste here, too long.
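The ligature behaviour discussed in these comments, checked with standard Python: NFC keeps U+FB04 LATIN SMALL LIGATURE FFL, NFKC expands it:
    import unicodedata
    s = "ba\uFB04e"                              # "baffle" spelled with the ffl ligature
    print(unicodedata.normalize("NFC", s) == s)  # True: the ligature is preserved
    print(unicodedata.normalize("NFKC", s))      # baffle (the ligature becomes f, f, l)
    print(len(s), len(unicodedata.normalize("NFKC", s)))   # 4 6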
https://github.com/blackwinter/unicode
Unicode normalization library. (Mirror of Yoshida-san's code base to maintain the RubyGem.)
Abandoned, last commit 07/07/2016
https://github.com/sjorek/unicode-normalization
An enhanced facade to existing unicode-normalization implementations
Last commit 25/03/2018
https://docs.microsoft.com/en-us/windows/win32/intl/using-unicode-normalization-to-represent-strings
Using Unicode Normalization to Represent Strings
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
String.prototype.normalize()
The normalize() method returns the Unicode Normalization Form of the string.
https://forums.swift.org/t/string-case-folding-and-normalization-apis/14663/3
For the comments
https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences
of code points represent essentially the same character. This feature was introduced in the standard
to allow compatibility with preexisting standard character sets, which often included similar or identical characters.
On Wed, Oct 28, 2020 at 9:54 AM Mark Davis ☕️ <mark@macchiato.com> wrote:
Re: [icu-support] Options for Immutable Collation?
I think your search for 'middle ground' is fruitless.
An NFKD ordering is not correct for any human language, and changes with each new Unicode version.
And even the default Unicode collation ordering is wrong for many languages, because there is no order that simultaneously satisfies all (eg German ordering and Swedish ordering are incompatible).
Your 'middle ground' would be correct for nobody, and yet be unstable across Unicode versions; or worse yet, fail for new characters.
IMO, the best practice for a file system (or like systems) is to store in codepoint order. When called upon to present a sorted list of files to a user, the displaying program should sort that list according to the user's language preferences.
You are right: for a deterministic/reproducible list sorting for a cross-platform filesystem API, anything more complex would be an implementation hazard.
However, after reviewing both developer discussions and implementation of Unicode handling in 6+ filesystems, IDNA200X, PRECIS and getting roped into work on an IETF i18n filesystem best-practices RFC ... I've got some thoughts. Thoughts that I will put into a new thread after I do some experimenting : ).
Thank you all so much!!!
-Zach Lym
08/06/2021
https://fr.wikipedia.org/wiki/Normalisation_Unicode
NFD Characters are decomposed by canonical equivalence and reordered
canonical decomposition
NFC Characters are decomposed by canonical equivalence, reordered, and composed by canonical equivalence
canonical decomposition followed by canonical composition
NFKD Characters are decomposed by canonical and compatibility equivalence, and reordered
compatibility decomposition
NFKC Characters are decomposed by canonical and compatibility equivalence, reordered, and composed by canonical equivalence
compatibility decomposition followed by canonical composition
FCD "Fast C or D" form; cf. UTN #5
FCC "Fast C Contiguous"; cf. UTN #5
09/06/2021
Rust
https://docs.rs/unicode-normalization
Decompositions External iterator for a string decomposition’s characters.
Recompositions External iterator for a string recomposition’s characters.
Replacements External iterator for replacements for a string’s characters.
StreamSafe UAX15-D4: This iterator keeps track of how many non-starters
there have been since the last starter in NFKD and will emit
a Combining Grapheme Joiner (U+034F) if the count exceeds 30.
is_nfc Authoritatively check if a string is in NFC.
is_nfc_quick Quickly check if a string is in NFC, potentially returning IsNormalized::Maybe if further checks are necessary. In this case a check like s.chars().nfc().eq(s.chars()) should suffice.
is_nfc_stream_safe Authoritatively check if a string is Stream-Safe NFC.
is_nfc_stream_safe_quick Quickly check if a string is Stream-Safe NFC.
is_nfd Authoritatively check if a string is in NFD.
is_nfd_quick Quickly check if a string is in NFD.
is_nfd_stream_safe Authoritatively check if a string is Stream-Safe NFD.
is_nfd_stream_safe_quick Quickly check if a string is Stream-Safe NFD.
is_nfkc Authoritatively check if a string is in NFKC.
is_nfkc_quick Quickly check if a string is in NFKC.
is_nfkd Authoritatively check if a string is in NFKD.
is_nfkd_quick Quickly check if a string is in NFKD.
Enums
IsNormalized The QuickCheck algorithm can quickly determine if a text is
or isn’t normalized without any allocations in many cases,
but it has to be able to return Maybe when a full decomposition
and recomposition is necessary.
08/06/2021
Pharo
https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43
https://github.com/duerst/eprun
Efficient Pure Ruby Unicode Normalization (eprun)
According to julia/utf8proc, the interesting part is the tests.
https://corp.unicode.org/pipermail/unicode/2020-December/009150.html
Normalization Generics (NFx, NFKx, NFxy)
https://6guts.wordpress.com/2015/04/12/this-week-unicode-normalization-many-rts/
https://gregtatum.com/writing/2021/diacritical-marks/
DIACRITICAL MARKS IN UNICODE
https://news.ycombinator.com/item?id=29751641
Unicode Normalization Forms: When ö ≠ ö
https://blog.opencore.ch/posts/unicode-normalization-forms/
https://unicode-org.github.io/icu/userguide/transforms/normalization/
ICU Documentation
Normalization
Has a few comments about NFKC_Casefold
- NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and
removing ignorable characters which was introduced with Unicode 5.2.
- Data Generation Tool
https://stackoverflow.com/questions/56995429/will-normalizing-a-string-give-the-same-result-as-normalizing-the-individual-gra
Will normalizing a string give the same result as normalizing the individual grapheme clusters?
---
No, that generally is not true. The Unicode Standard warns against the assumption that concatenating
normalised strings produces another normalised string. From UAX #15:
In using normalization functions, it is important to realize that none of the Normalization Forms
are closed under string concatenation. That is, even if two strings X and Y are normalized,
their string concatenation X+Y is not guaranteed to be normalized.
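That warning in executable form (unicodedata.is_normalized requires Python 3.8+):
    import unicodedata
    x, y = "e", "\u0301"                            # each string is in NFC on its own
    print(unicodedata.is_normalized("NFC", x))      # True
    print(unicodedata.is_normalized("NFC", y))      # True
    print(unicodedata.is_normalized("NFC", x + y))  # False: the concatenation composes to é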
https://stackoverflow.com/questions/7171377/separating-unicode-ligature-characters
NFKD is no panacea: there are plenty of ligatures and other notionally combined
forms it just does not work on at all. For example, it will not manage to decompose
ß or ẞ to SS (even though there is a casefold thither!), nor Æ to AE or æ to ae,
nor Œ to OE or œ to oe. It is also useless for turning ð or đ into d or ø into o.
For all those things, you need the UCA (Unicode Collation Algorithm), not NFKD.
NFD/NFKD also both have the annoying property of destroying singletons, if this
matters to you.
---
my understanding is that those decompositions you mention should not be done.
They are not simply ligatures in the typographical sense, but real separate
characters that are used differently! ß can be decomposed to ss if necessary
(for example if you can only store ASCII), but they are not equivalent. The ff
Ligature, on the other hand is only a typographical ligature.
Character set
https://www.gnu.org/software/libc/manual/html_mono/libc.html#Character-Set-Handling
String matching - Lower vs Casefold
https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase
https://www.w3.org/TR/charmod-norm/
Character Model for the World Wide Web: String Matching
MUST READ, PLENTY OF EXAMPLES FOR CORNER CASES
https://www.w3.org/TR/charmod-norm/#definitionCaseFolding
Very good explanation!
A few characters have a case folding that map one Unicode code point to two or more code points.
This set of case foldings are called the full case foldings.
character ß U+00DF LATIN SMALL LETTER SHARP S
- The full case folding and the lower case mapping of this character is to two ASCII letters 's'.
- The upper case mapping is to "SS".
Because some applications cannot allocate additional storage when performing a case fold operation,
Unicode provides a simple case folding that maps a code point that would normally fold to more or
fewer code points to use a single code point for comparison purposes instead.
Unlike the full folding, this folding invariably alters the content (and potentially the meaning) of the text.
Unicode simple case folding is not appropriate for use on the Web.
character ᾛ [U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI]
ᾛ ⇒ ἣι full case fold: U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA + U+03B9 GREEK SMALL LETTER IOTA
ᾛ ⇒ ᾓ simple case fold: U+1F93 GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
Language Sensitivity
Another aspect of case mapping and case folding is that it can be language sensitive.
Unicode defines default case mappings and case foldings for each encoded character,
but these are only defaults and are not appropriate in all cases. Some languages need
case mapping to be tailored to meet specific linguistic needs. One example of this are
Turkic languages written in the Latin script:
Default Folding
I ⇒ i Default folding of letter I
Turkic Language Folding
I ⇒ ı Turkic language folding of dotless (ASCII) letter I
İ ⇒ i Turkic language folding of dotted letter I
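For comparison, standard Python implements the default (untailored) full mappings; its str methods have no Turkic tailoring, so locale-sensitive folding needs something like ICU:
    print("ß".lower(), "ß".casefold(), "ß".upper())  # ß ss SS
    print("İ".lower(), len("İ".lower()))             # i̇ 2  (i + U+0307, the default mapping)
    print("I".casefold())                            # i    (default folding; Turkic ı needs tailoring)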
https://www.w3.org/TR/charmod-norm/#matchingAlgorithm
There are four choices for text normalization:
- Default.
This normalization step has no effect on the text and, as a result, is sensitive
to form differences involving both case and Unicode normalization.
- ASCII Case Fold.
Comparison of text with the characters case folded in the ASCII (Basic Latin, U+0000 to U+007F) range.
- Unicode Canonical Case Fold.
Comparison of text that is both case folded and has Unicode canonical normalization applied.
- Unicode Compatibility Case Fold.
Comparison of text that is both case folded and has Unicode compatibility normalization applied.
This normalization step is presented for completeness, but it is not generally appropriate for use on the Web.
https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html
Elasticsearch
Dealing with Human Language
https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison
Related to Python, but the comments are very general and worth reading.
---
Unicode Standard section 3.13 has two other definitions for caseless comparisons:
(D146, canonical) NFD(toCasefold(NFD(str))) on both sides and
(D147, compatibility) NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) on both sides.
It states the inner NFD is solely to handle a certain Greek accent character.
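The canonical caseless definition quoted above, written out in standard Python:
    import unicodedata
    def canonical_caseless(s):
        nfd = lambda t: unicodedata.normalize("NFD", t)
        return nfd(nfd(s).casefold())               # NFD(toCasefold(NFD(X)))
    print(canonical_caseless("Å") == canonical_caseless("A\u030A"))   # True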
https://boyter.org/posts/unicode-support-what-does-that-actually-mean/
https://news.ycombinator.com/item?id=23524400
ſecret == secret == Secret
ſatisfaction == satisfaction == ſatiſfaction == Satiſfaction == SatiSfaction === ſatiSfaction
Another good example to consider is the character Æ.
Under simple case folding rules the lower of Æ is ǣ.
However with full case folding rules this also matches ae.
Which one is correct? Well that depends on who you ask.
See also
https://github.com/unicode-org/icu4x/issues/3151
in the section "ICU4X title".
https://lwn.net/Articles/784316/
Working with UTF-8 in the kernel
jlf: interesting read about NTFS caseless support, and about a drama caused by the lack of
support for the Turkish case.
String matching - Collation
https://unicode-org.github.io/icu/userguide/collation/string-search.html
(ICU) String Search Service
jlf: they give 3 issues applicable to text searching. Accented letters and
conjoined letters are covered by Executor. But ignorable punctuation is not.
Locale
02/06/2021
https://www.php.net/manual/fr/function.setlocale.php
Warning
The locale information is maintained per process, not per thread.
If you are running PHP on a multithreaded server API , you may experience sudden changes
in locale settings while a script is running, though the script itself never called setlocale().
This happens due to other scripts running in different threads of the same process at the same time,
changing the process-wide locale using setlocale().
On Windows, locale information is maintained per thread as of PHP 7.0.5.
On Windows, setlocale(LC_ALL, '') sets the locale names from the system's regional/language settings (accessible via Control Panel).
https://www.gnu.org/software/libc/manual/html_mono/libc.html#Locales
Locales and Internationalization
https://pubs.opengroup.org/onlinepubs/9699919799/
IEEE Std 1003.1-2017
Locale
https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do/87763#87763
What does "LC_ALL=C" do?
https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe
stream_libarchive: workaround various types of locale braindeath
(legendary C locales rant)
https://stackoverflow.com/questions/30479607/explain-the-effects-of-export-lang-lc-ctype-and-lc-all
The LANG, LC_CTYPE and LC_ALL are special environment variables which, after
they are exported to the shell environment, are available and ready to be read
by certain programs that support a locale (natural language formatting for C).
Each variable sets the C library's notion of natural language formatting style
for particular sets of routines, for example:
- LC_ALL - Set the entire locale generically
- LC_CTYPE - Set a locale for the ctype and multibyte functions.
This controls recognition of upper and lower case, alphabetic or non-
alphabetic characters, and so on.
and other such as LC_COLLATE (for string collation routines),
LC_MESSAGES (for message catalogs),
LC_MONETARY (for formatting monetary values),
LC_NUMERIC (for formatting numbers),
LC_TIME (for formatting dates and times).
Regarding LANG, it is used as a substitute for any unset LC_* variable.
See: man setlocale (BSD), man locale
So when certain C functions are called (such as setlocale, ctype, multibyte, catopen, printf, etc.),
they read the locale settings from the configuration files and local environment in order to control
and format natural language formatting style as per C programming language standards.
see: setlocale http://www.unix.com/man-page/freebsd/3/setlocale/
see: ctype http://www.unix.com/man-page/freebsd/3/ctype/
see: multibyte http://www.unix.com/man-page/freebsd/3/multibyte/
see: catopen http://www.unix.com/man-page/freebsd/3/catopen/
see:printf http://www.unix.com/man-page/freebsd/3/printf/
see: ISO C99 https://en.wikipedia.org/wiki/C99
see: C Library - <locale.h> https://www.tutorialspoint.com/c_standard_library/locale_h.htm
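A small sketch of how those variables reach a program through setlocale, using Python's standard locale module (the output depends entirely on the environment's LANG/LC_* settings):
    import locale
    locale.setlocale(locale.LC_ALL, "")         # adopt the locale named by LC_ALL / LC_* / LANG
    print(locale.getlocale(locale.LC_COLLATE))  # e.g. ('en_US', 'UTF-8')
    print(locale.localeconv()["decimal_point"]) # LC_NUMERIC at work
    print(sorted(["zèbre", "zebra", "apple"], key=locale.strxfrm))   # LC_COLLATE at work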
AIX documentation
https://www.ibm.com/docs/en/aix/7.1?topic=globalization-locales
- Understanding locale
- Understanding locale categories
- Understanding locale environment variables
- Understanding the locale definition source file
- Multibyte subroutines
- Wide character subroutines
- Bidirectionality and character shaping
- Code set independence
- File name matching
- Radix character handling
- Programming model
https://bugzilla.mozilla.org/show_bug.cgi?id=1612379
Narrow down the list of ICU locales we ship
https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/data_management.md
Data management in ICU4X
https://pubs.opengroup.org/onlinepubs/9699919799/
localedef - define locale environment
If the locale value begins with a slash, it shall be interpreted as the pathname
of a file that was created in the output format used by the localedef utility;
see OUTPUT FILES under localedef. Referencing such a pathname shall result in
that locale being used for the indicated category.
CLDR Common Locale Data Repository
19/06/2021
https://github.com/twitter/twitter-cldr-rb
Ruby implementation of the ICU (International Components for Unicode) that uses
the Common Locale Data Repository to format dates, plurals, and more.
https://github.com/twitter/twitter-cldr-js
JavaScript implementation of the ICU (International Components for Unicode) that uses
the Common Locale Data Repository to format dates, plurals, and more. Based on twitter-cldr-rb.
https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?filter=allissues
CLDR tickets
Case mappings
Rule Final_Sigma in default case algorithms.
https://github.com/php/php-src/pull/10268
jlf: difficult to implement; it requires scanning arbitrarily far to the left and
right of a capital sigma.
https://www.unicode.org/faq/casemap_charprop.html
https://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java?noredirect=1&lq=1
Unicode-correct title case in Java
https://docs.rs/unicode-case-mapping/latest/unicode_case_mapping/
Example
assert_eq!(unicode_case_mapping::to_lowercase('İ'), ['i' as u32, 0x0307]);
assert_eq!(unicode_case_mapping::to_lowercase('ß'), ['ß' as u32, 0]);
assert_eq!(unicode_case_mapping::to_uppercase('ß'), ['S' as u32, 'S' as u32, 0]);
assert_eq!(unicode_case_mapping::to_titlecase('ß'), ['S' as u32, 's' as u32, 0]);
assert_eq!(unicode_case_mapping::to_titlecase('-'), [0; 3]);
assert_eq!(unicode_case_mapping::case_folded('I'), NonZeroU32::new('i' as u32));
assert_eq!(unicode_case_mapping::case_folded('ß'), None);
assert_eq!(unicode_case_mapping::case_folded('ẞ'), NonZeroU32::new('ß' as u32));
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/titlecase.html
fun Char.titlecase(): String
val chars = listOf('a', 'Dž', 'ʼn', '+', 'ß')
val titlecaseChar = chars.map { it.titlecaseChar() }
val titlecase = chars.map { it.titlecase() }
println(titlecaseChar) // [A, Dž, ʼn, +, ß]
println(titlecase) // [A, Dž, ʼN, +, Ss]
fun Char.titlecase(locale: Locale): String
val chars = listOf('a', 'Dž', 'ʼn', '+', 'ß', 'i')
val titlecase = chars.map { it.titlecase() }
val turkishLocale = Locale.forLanguageTag("tr")
val titlecaseTurkish = chars.map { it.titlecase(turkishLocale) }
println(titlecase) // [A, Dž, ʼN, +, Ss, I]
println(titlecaseTurkish) // [A, Dž, ʼN, +, Ss, İ]
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/177
jlf: good summary; it was not so obvious before I understood that there are simple and
full case mappings...
Also note that the Unicode standard only provides defaults for case mappings, but then
goes on to say that locale/language specific mappings should really be used.
The Unicode standard is very explicit that things like uppercase
transformations should be able to handle language specific issues such as
the Turkish dotted and dotless i, and that “ß” should be uppercased to
“SS” in German.
See:
Q: Is all of the Unicode case mapping information in UnicodeData.txt?
A: No. The UnicodeData.txt file includes all of the one-to-one case mappings.
Since many parsers were built with the expectation that UnicodeData.txt
would have at most a single character in each case mapping field, the
file SpecialCasing.txt was added to provide the one-to-many mappings,
such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S).
In addition, CaseFolding.txt contains additional mappings used in case
folding and caseless matching. For more information, see Section 5.18,
Case Mappings in The Unicode Standard.
and
A: The Unicode Standard defines the default case mapping for each
individual character, with each character considered in isolation. This
mapping does not provide for the context in which the character appears,
nor for the language-specific rules that must be applied when working in
natural language text.
https://www.b-list.org/weblog/2018/nov/26/case/
Truths programmers should know about case
Collation, sorting
https://www.unicode.org/reports/tr35/tr35-collation.html
UNICODE LOCALE DATA MARKUP LANGUAGE (LDML)
PART 5: COLLATION
01/06/2021
https://github.com/jgm/unicode-collation
https://hackage.haskell.org/package/unicode-collation
Haskell implementation of the Unicode Collation Algorithm
https://icu4c-demos.unicode.org/icu-bin/collation.html
ICU Collation Demo
https://www.enterprisedb.com/docs/epas/latest/epas_guide/03_database_administration/06_unicode_collation_algorithm/
Unicode Collation Algorithm
https://www.minaret.info/test/collate.msp
This page provides a means to convert a string of Unicode characters into a binary collation key using
the Java language version ("icu4j") of the IBM International Components for Unicode (ICU) library.
A collation key is the basis for sorting and comparing strings in a language-sensitive Unicode environment.
A collation key is built using a "locale" (a designation for a particular language or a variant) and a comparison level.
The levels supported here (Primary, Secondary, Tertiary, Quaternary and Identical) correspond to levels
"L1" through "Ln" as described in Unicode Technical Standard #10 - Unicode Collation Algorithm.
When comparing collation keys for two different strings, both keys must have been created using the same locale
and comparison level in order to be meaningful. The two keys are compared from left to right, byte for byte
until one of the bytes is not equal to the other. Whichever byte is numerically less than the other causes
the source string for that collation key to sort before the other string.
https://lemire.me/blog/2018/12/17/sorting-strings-properly-is-stupidly-hard/
It's the comments section which is interesting.
https://discourse.julialang.org/t/sorting-strings-by-unicode-collation-order/11195
Not supported
03/08/2022
https://discourse.julialang.org/t/unicode-15-0-beta-and-sorting-collation/83090
https://www.unicode.org/emoji/charts-15.0/emoji-ordering.html
https://en.wikipedia.org/wiki/Natural_sort_order
Natural sort order is an ordering of strings in alphabetical order,
except that multi-digit numbers are ordered as a single character.
Natural sort order has been promoted as being more human-friendly ("natural")
than the machine-oriented pure alphabetical order.
For example, in alphabetical sorting "z11" would be sorted before "z2"
because "1" is sorted as smaller than "2",
while in natural sorting "z2" is sorted before "z11" because "2" is sorted as smaller than "11".
Alphabetical sorting:
z11
z2
Natural sorting:
z2
z11
Functionality to sort by natural sort order is built into many programming languages and libraries.
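A common pure-Python way to obtain such a key: split digit runs out and compare them as integers (a simple sketch, not robust to every mixed input):
    import re
    def natural_key(s):
        return [int(run) if run.isdigit() else run.casefold()
                for run in re.split(r"(\d+)", s)]
    print(sorted(["z11", "z2"]))                   # ['z11', 'z2']  alphabetical
    print(sorted(["z11", "z2"], key=natural_key))  # ['z2', 'z11']  natural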
02/06/2021
https://www.postgresql.org/message-id/flat/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E%40tripadvisor.com
The dangers of streaming across versions of glibc: A cautionary tale
SELECT 'M' > 'ஐ';
'FULLWIDTH LATIN CAPITAL LETTER M' (U+FF2D)
'TAMIL LETTER AI' (U+0B90)
Across different machines, running the same version of postgres, and in databases
with identical character encodings and collations ('en_US.UTF-8') that select will
return different results if the version of glibc is different.
master:src/backend/utils/adt/varlena.c:1494,1497 These are the lines where postgres
calls strcoll_l and strcoll, in order to sort strings in a locale aware manner.
The reality is that there are different versions of glibc out there in the wild,
and they do not sort consistently across versions/environments.
https://collations.info/concepts/
a site devoted to working with Collations, Unicode, Encodings, Code Pages, etc in Microsoft SQL Server.
BIDI title
https://www.iamcal.com/understanding-bidirectional-text/
Understanding Bidirectional (BIDI) Text in Unicode
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
Unicode Bidirectional Algorithm basics
jlf: the examples are GIF images :-(( no way to copy-paste the characters.
https://www.unicode.org/notes/tn39/
BIDI BRACKETS FOR DUMMIES
https://stackoverflow.com/questions/5801820/how-to-solve-bidi-bracket-issues
How to solve BiDi bracket issues?
https://gist.github.com/mvidner/e96ac917d9a54e09d9730220a34b0d24
Problems with Bidirectional (BiDi) Text
https://www.w3.org/International/questions/qa-bidi-unicode-controls
How to use Unicode controls for bidi text
https://github.com/mvidner/bidi-test
Testing bidirectional text
https://terminal-wg.pages.freedesktop.org/bidi/
BiDi in Terminal Emulators
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
Unicode Bidirectional Algorithm basics
W3C
http://fribidi.org/
GNU FriBidi is an implementation of the Unicode Bidirectional Algorithm (bidi).
jlf: dead...
The latest release is fribidi-0.19.7.tar.bz2 from August 4, 2015. This release
is based on Unicode 6.2.0 character database.
---
jlf: maybe not dead, but low activity... v1.0.13
https://github.com/fribidi/fribidi
https://news.ycombinator.com/item?id=37990523
Ask HN: Bidirectional Text Navigation
Emoji
https://www.unicode.org/Public/emoji/15.0/emoji-test.txt
https://emojipedia.org/
http://xahlee.info/comp/unicode_emoji.html
29/05/2021
https://tonsky.me/blog/emoji/
27/02/2023
https://news.ycombinator.com/item?id=34925446
Discussion about emoji and graphemes (again...).
Nothing very interesting in this discussion.
Remember:
The "length" of a string in extended grapheme clusters is not stable across Unicode versions, which seems like a recipe for confusion.
The length in code units is unambiguous and constant across versions.
---
Executor:
NinjaCat = "🐱👤"
NinjaCat~description=
'UTF-8 not-ASCII (11 bytes)'
NinjaCat~text~characters==
an Array (shape [3], 3 items)
1 : ( "🐱" U+1F431 So 2 "CAT FACE" )
2 : ( "" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" )
3 : ( "👤" U+1F464 So 2 "BUST IN SILHOUETTE" )
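For comparison, a quick check in Python (the third-party regex module is assumed here; with a sufficiently recent Unicode/regex version the ZWJ sequence is a single extended grapheme cluster per UAX #29 rule GB11):
    # pip install regex   (the stdlib re module has no \X)
    import regex

    ninja_cat = "\U0001F431\u200D\U0001F464"  # CAT FACE + ZWJ + BUST IN SILHOUETTE
    len(ninja_cat)                            # 3 code points
    len(ninja_cat.encode("utf-8"))            # 11 bytes
    regex.findall(r"\X", ninja_cat)           # expected: ['🐱\u200d👤'] -> 1 grapheme cluster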
Countries, flags
22/05/2021
https://en.wikipedia.org/wiki/Regional_indicator_symbol
Regional indicator symbol
https://en.wikipedia.org/wiki/ISO_3166-1
ISO 3166-1 (Codes for the representation of names of countries and their subdivisions)
https://observablehq.com/@jobleonard/which-unicode-flags-are-reversible
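A flag emoji is just a pair of regional indicator symbols derived from the ISO 3166-1 alpha-2 code; a small Python sketch:
    def flag(alpha2):
        # Map 'A'..'Z' to REGIONAL INDICATOR SYMBOL LETTER A..Z (U+1F1E6..U+1F1FF).
        return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

    flag("FR")   # '🇫🇷' -> two code points; rendering them as one flag is up to the font
    flag("DE")   # '🇩🇪'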
Evidence of partial or wrong support of Unicode
13/08/2013
We don’t need a string type
https://mortoray.com/2013/08/13/we-dont-need-a-string-type/
01/12/2013
Strings in Ruby are UTF-8 now… right?
http://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/
14/07/2017
Testing Ruby's Unicode Support
http://blog.honeybadger.io/ruby-s-unicode-support/
22/05/2021
Emoji.length == 2
https://news.ycombinator.com/item?id=13830177
Lot of comments, did not read all, to continue
22/05/2021
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
Let's Stop Ascribing Meaning to Code Points
18/07/2021
https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/
Breaking Our Latin-1 Assumptions
Optimization, SIMD
08/06/2021
https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
[obsolete]
https://github.com/lemire/fastvalidate-utf-8
header-only library to validate utf-8 strings at high speeds (using SIMD instructions)
jlf 2023/06/16 (now obsolete)
NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: please adopt the simdutf library.
It is much more powerful, faster and better tested.
https://github.com/simdutf/simdutf
simdutf: Unicode at gigabytes per second
08/06/2021
https://github.com/simdjson/simdjson
simdjson : Parsing gigabytes of JSON per second
The simdjson library uses commonly available SIMD instructions and microparallel algorithms
to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++.
Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s
https://arxiv.org/abs/2010.03090
Validating UTF-8 In Less Than One Instruction Per Byte
John Keiser, Daniel Lemire
The majority of text is stored in UTF-8, which must be validated on ingestion.
We present the lookup algorithm, which outperforms UTF-8 validation routines used
in many libraries and languages by more than 10 times using commonly available SIMD instructions.
To ensure reproducibility, our work is freely available as open source software.
https://r-libre.teluq.ca/2178/
Recherche et analyse de solutions performantes pour le traitement de fichiers JSON dans un langage de haut niveau [r-libre/2178] (Research and analysis of high-performance solutions for processing JSON files in a high-level language)
Referenced from
https://lemire.me/blog/
Daniel Lemire's blog – Daniel Lemire is a computer science professor at the University of Quebec (TELUQ) in Montreal.
His research is focused on software performance and data engineering. He is a techno-optimist.
https://github.com/simdutf/simdutf
https://news.ycombinator.com/item?id=32700315
Unicode routines (UTF8, UTF16, UTF32): billions of characters per second using SSE2, AVX2, NEON, AVX-512.
https://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/
(jlf: also referenced in the section "String comparison")
How the JVM compares your strings using the craziest x86 instruction you've never heard of
---
Comment from a Swift thread:
https://forums.swift.org/t/string-s-abi-and-utf-8/17676/25
PCMPxSTRx is no longer faster than equivalent "simple" vector instruction sequences for straightforward comparisons
(this had already been the case for a few years when that article was written, which is curious).
It can be used productively (with some care) for some other operations like substring matching,
but that's not as much of a heavy-hitter. There's a bunch of string stuff that will benefit from general vectorization,
and which is absolutely on our roadmap to tackle, but using the PCMPxSTRx instructions specifically isn't a source of wins on the most important operations.
https://news.ycombinator.com/item?id=34267936
Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake (lemire.me)
https://www.reddit.com/r/java/comments/qafjtg/faster_charset_encoding/
Java 17 uses avx in both encoding and decoding
https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/
Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)
Variation sequence
https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt
https://www.unicode.org/Public/15.1.0/ucd/emoji/emoji-variation-sequences.txt
# emoji-variation-sequences.txt
22/05/2021
List of all code points that can display differently via a variation sequence
http://randomguy32.de/unicode/charts/standardized-variants/#emoji
Safari is better at displaying the characters.
Google Chrome and Opera have the same limitations: some characters are not supported (ex: section Phags-Pa).
https://sethmlarson.dev/unicode-variation-selectors
Mahjong tiles and Unicode variation selectors
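Quick reminder of how a variation selector changes the requested presentation (U+FE0E asks for text presentation, U+FE0F for emoji presentation); whether they actually render differently depends on the font/platform:
    heart = "\u2764"     # HEAVY BLACK HEART
    heart + "\ufe0e"     # text presentation requested
    heart + "\ufe0f"     # emoji presentation requested
    # both sequences are listed in emoji-variation-sequences.txt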
Whitespaces, separators
22/05/2021
https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/
A section about wcwidth.
A section about spaces:
There are actually two definitions of whitespace in Unicode.
Unicode assigns every codepoint a category, and has three categories for
what sounds like whitespace:
“Separator, space”;
“Separator, line”;
“Separator, paragraph”.
CR, LF, tab, and even vertical tab are all categorized as “Other, control”
and not as separators.
The only character in the “Separator, line” category is U+2028 LINE SEPARATOR,
and the only character in “Separator, paragraph” is U+2029 PARAGRAPH SEPARATOR.
Thankfully, all of these have the WSpace property.
As an added wrinkle, the lone oddball character “⠀” renders like a space in most fonts.
jlf: 2 cols x 3 lines of debossed dots.
But it’s not whitespace, it’s not categorized as a separator, and it doesn’t have WSpace.
It’s actually U+2800 BRAILLE PATTERN BLANK, the Braille character with none of the dots raised.
(I say “most fonts” because I’ve occasionally seen it rendered as a 2×4 grid of open circles.)
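The categories and the whitespace behaviour can be checked directly with Python's unicodedata:
    import unicodedata

    unicodedata.category(" ")        # 'Zs'  SPACE
    unicodedata.category("\u2028")   # 'Zl'  LINE SEPARATOR
    unicodedata.category("\u2029")   # 'Zp'  PARAGRAPH SEPARATOR
    unicodedata.category("\t")       # 'Cc'  a control, yet "\t".isspace() is True
    unicodedata.category("\u2800")   # 'So'  BRAILLE PATTERN BLANK
    "\u2800".isspace()               # False -> not whitespace despite looking blank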
Hyphenation
break words into syllables
I need to break words into syllables: astronomical --> as - tro - nom - ic - al
Is it possible to do this (in different languages) using the ICU library? (if not, maybe you can suggest other tools for it?)
Andreas Heigl:
While it looks like this is not something for ICU[1], there are libraries out there handling that - most of the time based on the thesis of Franklin Mark Liang.
I've built an implementation for PHP[2] but there are a lot of others out there[3].
[1] https://github.com/unicode-org/icu4x/issues/164#issuecomment-651410272
[2] https://github.com/heiglandreas/Org_Heigl_Hyphenator
[3] https://github.com/search?q=hyphenate&type=repositories
https://tug.org/docs/liang/liang-thesis.pdf
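Outside ICU, Liang-style pattern hyphenation is available in Python via the third-party pyphen package (a sketch, assuming pyphen and its en_US dictionary are installed; the exact break points depend on the pattern dictionary):
    # pip install pyphen
    import pyphen

    dic = pyphen.Pyphen(lang="en_US")
    dic.inserted("astronomical")     # e.g. 'as-tro-nom-i-cal'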
DNS title, Domain Name title, Domain Name System title
http://lambda-the-ultimate.org/node/5674#comment-97016
jlf: I created this section because of this comment
Have you ever looked at how international encoding of DNS names is done in URLs? It uses Punycode, and it's a disaster.
Here's a good starting point to read up on this: https://en.wikipedia.org/wiki/Internationalized_domain_name
https://en.wikipedia.org/wiki/Internationalized_domain_name
Internationalized domain name
ToASCII leaves ASCII labels unchanged. It fails if the label is unsuitable for
the Domain Name System. For labels containing at least one non-ASCII character,
ToASCII applies the Nameprep algorithm (https://en.wikipedia.org/wiki/Nameprep)
This converts the label to lowercase and performs other normalization. ToASCII
then translates the result to ASCII, using Punycode (https://en.wikipedia.org/wiki/Punycode)
Finally, it prepends the four-character string "xn--". This four-character string
is called the ASCII Compatible Encoding (ACE) prefix. It is used to distinguish
labels encoded in Punycode from ordinary ASCII labels. The ToASCII algorithm can
fail in several ways. For example, the final string could exceed the 63-character
limit of a DNS label. A label for which ToASCII fails cannot be used in an
internationalized domain name.
The function ToUnicode reverses the action of ToASCII, stripping off the ACE prefix
and applying the Punycode decode algorithm. It does not reverse the Nameprep processing,
since that is merely a normalization and is by nature irreversible. Unlike ToASCII,
ToUnicode always succeeds, because it simply returns the original string if decoding fails.
In particular, this means that ToUnicode has no effect on a string that does not
begin with the ACE prefix.
https://en.wikipedia.org/wiki/Punycode
Punycode is a representation of Unicode with the limited ASCII character subset
used for Internet hostnames. Using Punycode, host names containing Unicode characters
are transcoded to a subset of ASCII consisting of letters, digits, and hyphens,
which is called the letter–digit–hyphen (LDH) subset. For example, München
(German name for Munich) is encoded as Mnchen-3ya.
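Python ships both codecs, which is handy to check the examples above (the built-in 'idna' codec implements the older IDNA 2003 rules; the third-party idna package implements IDNA 2008):
    "München".encode("punycode")          # b'Mnchen-3ya'  raw Punycode, no prefix
    "münchen.de".encode("idna")           # b'xn--mnchen-3ya.de'  Nameprep + Punycode + ACE prefix
    b"xn--mnchen-3ya.de".decode("idna")   # 'münchen.de'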
All languages
https://www.omniglot.com/index.htm
The online encyclopedia of writing systems & languages
jlf: nothing about Unicode but good for general culture.
Classical languages
https://docs.cltk.org/en/latest/
https://github.com/cltk/cltk
The Classical Language Toolkit
Python library
The Classical Language Toolkit (CLTK) is a Python library offering natural
language processing (NLP) for pre-modern languages. Pre-configured pipelines are
available for 19 languages.
Akkadian
Arabic
Aramaic
Classical Chinese
Coptic
Gothic
Greek
Hindi
Latin
Middle High German
English
French
Old Church Slavonic
Old Norse
Pali
Panjabi
Sanskrit (Some parts of the Sanskrit library are forked from the Indic NLP Library)
Arabic language
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
Arabic script in Unicode
Indic languages
https://www.unicode.org/faq/indic.html
Indic scripts in the narrow sense are the nine major Brahmi-derived scripts of India.
In a wider sense, the term can cover all Brahmic scripts and Kharoshthi.
What is ISCII?
Indian Standard Code for Information Interchange (ISCII) is the character code
for Indian scripts that originate from the Brahmi script.
Keywords:
nukta
Vedic Sanskrit
vowel signs (matras)
vowel modifiers (candrabindu, anusvara)
the consonant modifier (nukta)
Tamil
Bengali (Bangla) / Assamese Script
Sindhi implosive consonants
FAQ: How do I collate Indic language data?
Collation order is not the same as code point order. A good treatment of some
issues specific to collation in Indic languages can be found in the paper
Issues in Indic Language Collation by Cathy Wissink (https://www.unicode.org/notes/tn1/)
Collation in general must proceed at the level of language or language variant,
not at the script or codepoint levels. See also UTS #10: Unicode Collation Algorithm.
Some Indic-specific issues are also discussed in that report.
This section illustrates that Unicode’s concepts like “extended grapheme cluster”
are meant to provide some low-level, general segmentation, and are not going
to be enough for an ideal experience for end users.
https://en.wikipedia.org/wiki/Devanagari#Conjunct_consonants
https://en.wikipedia.org/wiki/Devanagari_conjuncts
Conjunct consonants are a form of orthographic ligature characteristic of the
Brahmic scripts. They are constructed of two or more consonant letters.
Biconsonantal conjuncts are common, but longer conjuncts are increasingly
constrained by the languages' phonologies and the actual number of conjuncts
observed drops sharply. Ulrich Stiehl includes a five-letter Devanagari conjunct
र्त्स्न्य (rtsny)[1] among the top 360 most frequent conjuncts found in Classical
Sanskrit;[2] the complete list appears below. Conjuncts often span a syllable
boundary, and many of the conjuncts below occur only in the middle of words,
where the coda consonants of one syllable are conjoined with the onset consonants of the following syllable.
[1] As in Sanskrit word कार्त्स्न्य (In Bengali Script কার্ৎস্ন্য), meaning "The Whole, Entirety"
[2] Stiehl, Ulrich. "Devanagari-Schreibübungen" (PDF). www.sanskritweb.net.
http://www.sanskritweb.net/deutsch/devanagari.pdf
https://stackoverflow.com/questions/6805311/combining-devanagari-characters
Combining Devanagari characters
"बिक्रम मेरो नाम हो"~text~graphemes==
a GraphemeSupplier
1 : T'बि'
2 : T'क्' <-- According to the comments, these 2 graphemes should be only one: क्र
3 : T'र' <-- even ICU doesn't support that... it's a tailored grapheme cluster
4 : T'म'
5 : T' '
6 : T'मे'
7 : T'रो'
8 : T' '
9 : T'ना'
10 : T'म'
11 : T' '
12 : T'हो'
"बिक्रम मेरो नाम हो"~text~characters==
an Array (shape [18], 18 items)
1 : ( "ब" U+092C Lo 1 "DEVANAGARI LETTER BA" )
2 : ( "ि" U+093F Mc 0 "DEVANAGARI VOWEL SIGN I" )
3 : ( "क" U+0915 Lo 1 "DEVANAGARI LETTER KA" )
4 : ( "्" U+094D Mn 0 "DEVANAGARI SIGN VIRAMA" ) <-- influences segmentation
5 : ( "र" U+0930 Lo 1 "DEVANAGARI LETTER RA" )
6 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" )
7 : ( " " U+0020 Zs 1 "SPACE", "SP" )
8 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" )
9 : ( "े" U+0947 Mn 0 "DEVANAGARI VOWEL SIGN E" )
10 : ( "र" U+0930 Lo 1 "DEVANAGARI LETTER RA" )
11 : ( "ो" U+094B Mc 0 "DEVANAGARI VOWEL SIGN O" )
12 : ( " " U+0020 Zs 1 "SPACE", "SP" )
13 : ( "न" U+0928 Lo 1 "DEVANAGARI LETTER NA" )
14 : ( "ा" U+093E Mc 0 "DEVANAGARI VOWEL SIGN AA" )
15 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" )
16 : ( " " U+0020 Zs 1 "SPACE", "SP" )
17 : ( "ह" U+0939 Lo 1 "DEVANAGARI LETTER HA" )
18 : ( "ो" U+094B Mc 0 "DEVANAGARI VOWEL SIGN O" )
In Devanagari, each grapheme cluster consists of an initial letter, optional
pairs of virama (vowel killer) and letter, and an optional vowel sign.
    import unicodedata

    def devanagari_clusters(s):
        # Generator grouping an initial letter, optional virama+letter pairs,
        # and combining marks into one cluster (the rule stated above).
        virama = u'\N{DEVANAGARI SIGN VIRAMA}'
        cluster = u''
        last = None
        for c in s:
            cat = unicodedata.category(c)[0]
            if cat == 'M' or (cat == 'L' and last == virama):
                cluster += c
            else:
                if cluster:
                    yield cluster
                cluster = c
            last = c
        if cluster:
            yield cluster
---
Let's cover the grammar very quickly: The Devanagari Block.
As a developer, there are two character classes you'll want to concern yourself with:
Sign:
This is a character that affects a previously-occurring character.
Example, this character: ्. The light-colored circle indicates the location
of the center of the character it is to be placed upon.
Letter / Vowel / Other:
This is a character that may be affected by signs.
Example, this character: क.
Combination result of ् and क: क्. But combinations can extend, so क् and षति will
actually become क्षति (in this case, we right-rotate the first character by 90 degrees,
modify some of the stylish elements, and attach it at the left side of the second character).
https://news.ycombinator.com/item?id=20058454
If I type anything like किमपि (“kimapi”) and hit backspace, it turns into किमप (“kimapa”).
That is, the following sequence of codepoints:
0915 DEVANAGARI LETTER KA
093F DEVANAGARI VOWEL SIGN I
092E DEVANAGARI LETTER MA
092A DEVANAGARI LETTER PA
093F DEVANAGARI VOWEL SIGN I
made of three grapheme clusters (containing 2, 1, and 2 codepoints respectively),
turns after a single backspace into the following sequence:
0915 DEVANAGARI LETTER KA
093F DEVANAGARI VOWEL SIGN I
092E DEVANAGARI LETTER MA
092A DEVANAGARI LETTER PA
This is what I expect/find intuitive, too, as a user.
Similarly अन्यच्च is made of 3 grapheme clusters but you hit backspace 7 times to delete it
(though there I'd slightly have preferred अन्यच्च→अन्यच्→अन्य→अन्→अ instead of
अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that's seen, but one can live with this).
https://github.com/anoopkunchukuttan/indic_nlp_library
The goal of the Indic NLP Library is to build Python based libraries for common
text processing and Natural Language Processing in Indian languages.
The library provides the following functionalities:
Text Normalization
Script Information
Word Tokenization and Detokenization
Sentence Splitting
Word Segmentation
Syllabification
Script Conversion
Romanization
Indicization
Transliteration
Translation
https://github.com/AI4Bharat/indicnlp_catalog
The Indic NLP Catalog
jlf: way beyond Unicode, tons of URLs...
https://news.ycombinator.com/item?id=20056966
jlf: Devanagari seems to be an example where the grapheme cluster is not the right segmentation unit
What does "index" mean? (hindi) "इंडेक्स" का क्या अर्थ है?
Including the quote marks, spaces, and question mark, that's 18 characters.
as a native speaker, shouldn't they be considered 15 characters?
क्स, क्या and र्थ each form individual conjunct consonants.
Counting them as two would then beget the question as to why डे is not considered
two characters too, seeing as it is formed by combining ड and ए, much like क्स
is formed by combining क् and स.
...
Devnagari allows simple characters to form compound characters.
Regarding क्स and डे, the difference between them is that the former is a combination
of two consonants (pronounced "ks") while the latter is formed by a consonant and
a vowel ("de"). However, looking at the visual representation is wrong, since डा
(consonant+vowel) would also look like two characters.
https://slidetodoc.com/indic-text-segmentation-presented-by-swaran-lata-senior/
INDIC TEXT SEGMENTATION
https://github.com/w3c/iip/issues/34
the final rendered state of the text is what influences the segmentation,
rather than the sequence of code points used.
https://docs.microsoft.com/en-us/typography/
https://docs.microsoft.com/en-us/typography/script-development/tamil
Developing OpenType Fonts for Tamil Script
The first step is to analyze the input text and break it into syllable clusters.
Then apply font features, compute ligatures, and combine marks.
https://docs.microsoft.com/en-us/typography/script-development/devanagari
Developing OpenType Fonts for Devanagari Script
https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/
Picking Apart the Crashing iOS String
Posted by Manish Goregaokar on February 15, 2018
Indic scripts and consonant clusters
jlf: he's a black belt! or is it his native tongue?
https://stackoverflow.com/questions/75210512/how-to-split-devanagari-bi-tri-and-tetra-conjunct-consonants-as-a-whole-from-a-s
How to split Devanagari bi-tri and tetra conjunct consonants as a whole from a string?
"हिन्दी मुख्यमंत्री हिमंत"
Current output:
हि न् दी मु ख् य मं त् री हि मं त
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf
Proper Complex Script Support in Text Terminals
page 8
Characters in one line will further be grouped into terminal clusters.
A terminal cluster contains the characters that are combined together in the
terminal environment. It is an instance of the tailored grapheme cluster defined
in UAX #29. In Indic scripts, for example, syllables with virama conjoiners in
the middle will be considered one single terminal cluster, while they are
treated as multiple extended grapheme clusters in UAX #29.
---
page 9
In some writing systems, the form of a character may depend on the characters
that follow it. One example of this is Devanagari’s repha forms. This requires
the establishment of a work zone that contains the most recent characters, and
the property of the characters in the work zone is considered volatile and may
change depending on the incoming text from the guest.
When the terminal receives text, it will first append the text into the work
zone and measure the entire work zone to process potential property changes.
If the measurement result says that the text in the work zone could be broken
into multiple clusters, then the work zone will be shrunk to only contain the
last (maybe incomplete) cluster. The text before that will be committed, and its
properties will no longer change. As a result, at any time the work zone will
contain at most one cluster. When the cursor moves (via the terminal receiving a
cursor move command or a newline), all the text in the work zone will be
committed—even if it is incomplete—and the work zone will be cleared.
https://slideplayer.com/slide/11341056/
INDIC TEXT SEGMENTATION
todo: read
https://news.ycombinator.com/item?id=9219162
I Can Text You A Pile of Poo, But I Can’t Write My Name
March 17th, 2015
jlf: the article is about Bengali, but HN comments are also for other languages.
todo: read
https://www.unicode.org/L2/L2023/23140-graphemes-expectations.pdf
Unicode 15.1:
Unicode grapheme clusters tend to be closer to the larger user-perceived units.
Hangul text is clearly segmented into syllable blocks. For Brahmic scripts,
things are less clear. Grapheme clusters may contain several base-level units,
but up to Unicode 15 always broke after virama characters. This broke not only
within orthographic syllables, but for a number of scripts also within the
encoding of conjunct forms that users perceive as base-level units, such as
Khmer coengs (see subsection Subscript Consonant Signs of section 16.4 Khmer of
the Unicode Standard). In Unicode 15.1, this is being corrected for six scripts,
while leaving the others broken.
CJK
https://resources.oreilly.com/examples/9781565922242/blob/master/doc/cjk.inf
Version 2.1 (July 12, 1996)
Online Companion to "Understanding Japanese Information Processing"
This online document provides information on CJK (that is, Chinese, Japanese,
and Korean) character set standards and encoding systems.
---
jlf: 1996... but maybe some things to learn.
https://en.wikipedia.org/wiki/Cangjie_input_method
Cangjie input method
jlf: nothing about Unicode... but maybe some things to learn.
Korean
22/05/2021
http://gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html
The Korean Writing System
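Hangul syllables are composed and decomposed algorithmically; the precomposed syllables live in U+AC00..U+D7A3 and NFD splits them back into jamo. Checked with Python:
    import unicodedata

    unicodedata.name("한")                # 'HANGUL SYLLABLE HAN'
    decomposed = unicodedata.normalize("NFD", "한")
    [hex(ord(c)) for c in decomposed]     # ['0x1112', '0x1161', '0x11ab']  choseong + jungseong + jongseong
    unicodedata.normalize("NFC", decomposed) == "한"   # True
    # the mapping is purely arithmetic: U+AC00 + (L*588 + V*28 + T)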
Japanese
https://heistak.github.io/your-code-displays-japanese-wrong/
https://news.ycombinator.com/item?id=29022906
https://www.johndcook.com/blog/2022/09/25/katakana-hiragana-unicode/
https://news.ycombinator.com/item?id=32987710
Polish
https://www.twardoch.com/download/polishhowto/index.html
Polish diacritics how to?
https://hsivonen.fi/ime/
An IME is a piece of software that transforms user-generated input events
(mostly keyboard events, but some IMEs allow some auxiliary pointing device interaction)
into text in a manner more complex than a mere keyboard layout.
Basically, if the relationship between the keys that a user presses on a hardware keyboard
and the text that ends up in an application's text buffer is more complex than when writing French,
an IME is in use.
Text editing
https://lord.io/text-editing-hates-you-too/
TEXT EDITING HATES YOU TOO
Text rendering, Text shaping library
https://faultlore.com/blah/text-hates-you/
Text Rendering Hates You
Aria Beingessner
September 28th, 2019
jlf: general culture
todo: read
https://harfbuzz.github.io/
https://github.com/harfbuzz/harfbuzz
jlf: referenced by ICU
Users of ICU Layout are strongly encouraged to consider the HarfBuzz project as
a replacement for the ICU Layout Engine.
(or Uniscribe if you are writing Windows software, or CoreText on macOS)
String Matching
https://www.w3.org/TR/charmod-norm/
String matching
Case folding is the process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching.
This is distinct from case mapping, which is primarily meant for display purposes.
As with the default case mappings, Unicode defines default case fold mappings ("case folding") for each Unicode code point.
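In Python, str.casefold() applies Unicode full case folding, while str.lower() is a case mapping; they differ e.g. for the German sharp s:
    "Straße".lower()       # 'straße'
    "Straße".casefold()    # 'strasse'
    "STRASSE".casefold() == "Straße".casefold()   # True -> good for matching
    "MASSE".lower() == "Maße".lower()             # False
    # note: default case folding is locale-independent; the Turkish dotted/dotless i
    # rules need tailoring (locale-aware APIs such as ICU)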
Fuzzy String Matching
29/05/2021
https://github.com/logannc/fuzzywuzzy-rs
Rust port of the Python fuzzywuzzy
https://github.com/seatgeek/fuzzywuzzy --> moved to https://github.com/seatgeek/thefuzz
Levenshtein distance and string similarity
https://github.com/ztane/python-Levenshtein/
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
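For reference, the classic dynamic-programming definition as a plain-Python sketch (the C extension above is much faster, and like it this works on code points, not graphemes):
    def levenshtein(a, b):
        # Row-by-row DP: prev[j] = distance between a[:i-1] and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    levenshtein("kitten", "sitting")   # 3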
String comparison
31/05/2021
https://stackoverflow.com/questions/49662585/how-do-i-compare-a-unicode-string-that-has-different-bytes-but-the-same-value
A pair NFC considers different but a user might consider the same is 'µ' (MICRO SIGN) and 'μ' (GREEK SMALL LETTER MU).
NFKC will collapse these two.
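The µ/μ pair checked with Python's unicodedata:
    import unicodedata

    micro, mu = "\u00b5", "\u03bc"   # MICRO SIGN, GREEK SMALL LETTER MU
    unicodedata.normalize("NFC", micro) == unicodedata.normalize("NFC", mu)     # False
    unicodedata.normalize("NFKC", micro) == unicodedata.normalize("NFKC", mu)   # True
    unicodedata.normalize("NFKC", micro)   # 'μ' -> the compatibility mapping folds µ to μ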
https://www.unicode.org/reports/tr10/
Unicode® Technical Standard #10
UNICODE COLLATION ALGORITHM
Collation is the general term for the process and function of determining the sorting order of strings of characters.
Collation varies according to language and culture: Germans, French and Swedes sort the same characters differently.
It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices.
For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character.
Collation can also be customized according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on.
https://www.unicode.org/reports/tr10/#Common_Misperceptions
Collation is not aligned with character sets or repertoires of characters.
Collation is not code point (binary) order.
Collation is not a property of strings.
Collation order is not preserved under concatenation or substring operations, in general.
Collation order is not preserved when comparing sort keys generated from different collation sequences.
Collation order is not a stable sort.
Collation order is not fixed.
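Locale-tailored collation from Python via the third-party PyICU binding (a sketch, assuming PyICU is installed; the exact ordering depends on the CLDR data shipped with the ICU build):
    # pip install PyICU
    from icu import Collator, Locale

    words = ["cote", "coté", "côte", "côté"]
    coll_fr = Collator.createInstance(Locale("fr_FR"))
    sorted(words, key=coll_fr.getSortKey)   # locale-aware order, not code point order
    sorted(words)                           # plain code point order, for comparison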
https://en.wikipedia.org/wiki/Unicode_equivalence
Short definition of NFD, NFC, NFKD, NFKC
In this article, a short paragraph which confirms that it's important to keep
the original string unchanged !
Errors due to normalization differences
When two applications share Unicode data, but normalize them differently, errors and data loss can result.
In one specific instance, OS X normalized Unicode filenames sent from the Samba file- and printer-sharing software.
Samba did not recognise the altered filenames as equivalent to the original, leading to data loss.[4][5]
Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
http://sourceforge.net/p/netatalk/bugs/348/
#348 volcharset:UTF8 doesn't work from Mac
https://www.unicode.org/faq/normalization.html
More detailed description of normalization
PHP
http://php.net/manual/en/collator.compare.php
Collator::compare -- collator_compare — Compare two Unicode strings
Object oriented style
public int Collator::compare ( string $str1 , string $str2 )
Procedural style
int collator_compare ( Collator $coll , string $str1 , string $str2 )
http://php.net/manual/en/class.collator.php
Provides string comparison capability with support for appropriate locale-sensitive sort orderings.
Swift
https://developer.apple.com/library/prerelease/watchos/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent.
Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance,
even if they are composed from different Unicode scalars behind the scenes.
.characters.count
for character in dogString.characters
for codeUnit in dogString.utf8
for codeUnit in dogString.utf16
for scalar in dogString.unicodeScalars
Nothing about ordered comparison in the Swift doc?
http://oleb.net/blog/2014/07/swift-strings/
Ordering strings with the < and > operators uses the default Unicode collation algorithm.
In the example below, "é" is smaller than i because the collation algorithm specifies
that characters with combining marks follow right after their base character.
"résumé" < "risotto" // -> true
The String type does not (yet?) come with a method to specify the language to use for collation.
You should continue to use
-[NSString compare:options:range:locale:]
or
-[NSString localizedCompare:]
if you need to sort strings that are shown to the user.
In this example, specifying a locale that uses the German phonebook collation yields a different result than the default string ordering:
let muffe = "Muffe"
let müller = "Müller"
muffe < müller // -> true
// Comparison using an US English locale yields the same result
let muffeRange = muffe.startIndex..<muffe.endIndex
let en_US = NSLocale(localeIdentifier: "en_US")
muffe.compare(müller, options: nil, range: muffeRange, locale: en_US) // -> .OrderedAscending
// Germany phonebook ordering treats "ü" as "ue".
// Thus, "Müller" < "Muffe"
let de_DE_phonebook = NSLocale(localeIdentifier: "de_DE@collation=phonebook")
muffe.compare(müller, options: nil, range: muffeRange, locale: de_DE_phonebook) // -> .OrderedDescending
Java
https://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/
How the JVM compares your strings using the craziest x86 instruction you've never heard of.
---
A comment about this article:
PCMPxSTRx is no longer faster than equivalent "simple" vector instruction sequences for straightforward comparisons
(this had already been the case for a few years when that article was written, which is curious).
It can be used productively (with some care) for some other operations like substring matching,
but that's not as much of a heavy-hitter.
There's a bunch of string stuff that will benefit from general vectorization, and which is absolutely on our roadmap to tackle,
but using the PCMPxSTRx instructions specifically isn't a source of wins on the most important operations
C#
https://docs.microsoft.com/en-us/dotnet/standard/base-types/comparing
https://docs.microsoft.com/en-us/dotnet/core/extensions/performing-culture-insensitive-string-comparisons
JSON
https://www.reddit.com/r/programming/comments/q5vmxc/parsing_json_is_a_minefield_2018/
https://seriot.ch/projects/parsing_json.html
Parsing JSON is a Minefield
Search for "unicode"
30/05/2021
https://datatracker.ietf.org/doc/html/rfc8259
The JavaScript Object Notation (JSON) Data Interchange Format
See this section about strings and encoding:
https://datatracker.ietf.org/doc/html/rfc8259#section-7
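RFC 8259 strings are Unicode, and characters outside the BMP are escaped as UTF-16 surrogate pairs; a quick check with Python's json module:
    import json

    json.dumps("é")                      # escaped by default: '"\\u00e9"'
    json.dumps("é", ensure_ascii=False)  # '"é"'
    json.dumps("😀")                     # astral character -> surrogate pair: '"\\ud83d\\ude00"'
    json.loads('"\\ud83d\\ude00"')       # '😀'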
https://github.com/toml-lang/toml
Tom's Obvious, Minimal Language
TOML is a nice serialization format for human-maintained data structures.
It’s line-delimited and—of course!—allows comments, and any Unicode code point can be expressed in simple hexadecimal.
TOML is fairly new, and its specification is still in flux;
CBOR Concise Binary Representation
https://cbor.io/
RFC 8949 Concise Binary Object Representation
CBOR improves upon JSON’s efficiency and also allows for storage of binary strings.
Whereas JSON encoders must stringify numbers and escape all strings,
CBOR stores numbers “literally” and prefixes strings with their length,
which obviates the need to escape those strings.
https://www.rfc-editor.org/rfc/rfc8949.html
RFC 8949 Concise Binary Object Representation (CBOR)
In contrast to formats such as JSON, the Unicode characters in this type are never escaped.
Thus, a newline character (U+000A) is always represented in a string as the byte 0x0a,
and never as the bytes 0x5c6e (the characters "\" and "n")
nor as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and "a").
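The difference is easy to see on a string containing a newline; the CBOR bytes are hand-built here to avoid depending on a library (major type 3 = text string, with the length packed into the initial byte for short strings):
    import json

    s = "a\nb"
    json.dumps(s).encode("utf-8")   # b'"a\\nb"' -> 6 bytes, newline escaped as \n
    # CBOR: initial byte 0x60 | length, then the raw UTF-8 bytes, no escaping:
    bytes([0x60 | len(s.encode("utf-8"))]) + s.encode("utf-8")   # b'ca\nb' -> 0x63 0x61 0x0a 0x62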
Binary encoding in Unicode
10/07/2021
https://qntm.org/unicodings
Efficiently encoding binary data in Unicode
in UTF-8, use Base64 or Base85
in UTF-16, use Base32768
in UTF-32, use Base65536
https://qntm.org/safe
What makes a Unicode code point safe?
https://github.com/qntm/safe-code-point
Ascertains whether a Unicode code point is 'safe' for the purposes of encoding binary data
https://github.com/qntm/base2048
Binary encoding optimised for Twitter
Originally, Twitter allowed Tweets to be at most 140 characters.
On 26 September 2017, Twitter allowed 280 characters.
Maximum Tweet length is indeed 280 Unicode code points.
Twitter divides Unicode into 4,352 "light" code points (U+0000 to U+10FF inclusive)
and 1,109,760 "heavy" code points (U+1100 to U+10FFFF inclusive).
Base2048 solely uses light characters, which means a new "long" Tweet can contain
at most 280 characters of Base2048. Base2048 is an 11-bit encoding, so those 280
characters encode 3080 bits i.e. 385 octets of data, significantly better than Base65536.
https://github.com/qntm/base65536
Unicode's answer to Base64
Base2048 renders Base65536 obsolete for its original intended purpose of sending
binary data through Twitter.
However, Base65536 remains the state of the art for sending binary data through
text-based systems which naively count Unicode code points, particularly those
using the fixed-width UTF-32 encoding.
22/07/2021
https://stackoverflow.com/questions/52131881/does-the-winapi-ever-validate-utf-16
Does the WinApi ever validate UTF-16?
Windows wide characters are arbitrary 16-bit numbers (formerly called "UCS-2",
before the Unicode Standard Consortium purged that notation). So you cannot
assume that it will be a valid UTF-16 sequence. (MultiByteToWideChar is a
notable exception that does return only UTF-16)
28/07/2021
https://invisible-island.net/xterm/bad-utf8/
Unicode replacement character in the Linux console.
This test text examines, how UTF-8 decoders handle various types of
corrupted or otherwise interesting UTF-8 sequences.
jlf: difficult to understand what the conclusion is...
What I notice in this review is:
Unicode 10.0.0's chapter 3 (June 2017): each of the ill-formed code units is separately replaced by U+FFFD.
That recommendation first appeared in Unicode 6's chapter 3 on conformance (February 2011).
However the comments about “best practice” were removed in Unicode 11.0.0 (June 2018).
The W3C WHATWG page entitled Encoding Standard started in January 2013.
The constraints in the utf-8 decoder above match “Best Practices for Using
U+FFFD” from the Unicode standard. No other behavior is permitted per the
Encoding Standard (other algorithms that achieve the same result are
obviously fine, even encouraged).
Although Unicode withdrew the recommendation more than two years ago, to date (August 2020) that is not yet corrected in the WHATWG page.
30/07/2021
https://hsivonen.fi/broken-utf-8/
---
The Unicode Technical Committee retracted the change in its meeting on August 3
2017, so the concern expressed below is now moot.
---
Not all byte sequences are valid UTF-8. When decoding potentially invalid UTF-8
input into a valid Unicode representation, something has to be done about invalid input.
The naïve answer is to ignore invalid input until finding valid input again (i.e.
finding the next byte that has a lead-byte value), but this is dangerous and
should never be done. The danger is that silently dropping bogus bytes might
make a string that didn’t look dangerous with the bogus bytes present become
valid active content. Most simply, <scr�ipt> (� standing in for a bogus byte)
could become <script> if the error is ignored. So it’s non-controversial that
every sequence of bogus bytes should result in at least one REPLACEMENT CHARACTER
and that the next lead-valued byte is the first byte that’s no longer part of
the invalid sequence.
But how many REPLACEMENT CHARACTERs should be generated for a sequence of
multiple bogus bytes?
jlf: the answer is not clear to me...
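The <script> danger is easy to reproduce in Python: errors='ignore' silently drops the bogus byte, errors='replace' inserts U+FFFD (how many U+FFFD are generated for a longer invalid run is exactly the "best practice" question above):
    b = b"<scr\xc0ipt>"              # a bogus byte in the middle
    b.decode("utf-8", "replace")     # '<scr\ufffdipt>' -> the bogus byte becomes U+FFFD
    b.decode("utf-8", "ignore")      # '<script>' -> the dangerous behaviour described above
    # b.decode("utf-8")              # would raise UnicodeDecodeError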
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
UTF-8 decoder capability and stress test
Mojibake
https://github.com/LuminosoInsight/python-ftfy
ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters
that were clearly meant to be UTF-8 but were decoded as something else
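Mojibake in one line (UTF-8 bytes mis-decoded as Latin-1), plus the ftfy repair (third-party; this is the pattern ftfy is designed to detect):
    "café".encode("utf-8").decode("latin-1")   # 'cafÃ©'  <- classic mojibake

    # pip install ftfy
    import ftfy
    ftfy.fix_text("cafÃ©")                     # should give back 'café'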
03/07/2021
Notebook in python-ftfy:
Services such as Slack and Discord don't use Unicode for their emoji.
They use ASCII strings like :green-heart: and turn them into images.
These won't help you test anything.
I recommend getting emoji for your test cases by copy-pasting them from emojipedia.org.
https://emojipedia.org/
https://en.wikipedia.org/wiki/Mojibake
Filenames
https://opensource.apple.com/source/subversion/subversion-52/subversion/notes/unicode-composition-for-filenames.auto.html
2 problems follow:
1) We can't generally depend on the OS to give us back the
exact filename we gave it
2) The same filename may be encoded in different codepoints
https://linux.die.net/man/1/convmv
convmv - converts filenames from one encoding to another
https://news.ycombinator.com/item?id=33986655
jlf: discussion about text vs byte for filenames
https://news.ycombinator.com/item?id=33991506
Python already has the "surrogateescape" error handler [0] that performs
something similar to what you described: undecodable bytes are translated into
unpaired U+DC80 to U+DCFF surrogates. Of course, this isn't standardized in any
way, but I've found it useful myself for smuggling raw pathnames through Java.
[0] https://peps.python.org/pep-0383/
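The surrogateescape round trip in Python:
    raw = b"caf\xe9"                                  # Latin-1 bytes, not valid UTF-8
    name = raw.decode("utf-8", "surrogateescape")     # 'caf\udce9' -> 0xE9 smuggled as U+DCE9
    name.encode("utf-8", "surrogateescape") == raw    # True, lossless round trip
    # but the string now contains a lone surrogate:
    # name.encode("utf-8")  would raise UnicodeEncodeError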
https://news.ycombinator.com/item?id=33988943
I’m a little confused, how can a file name be non-decodable? A file with that
name exists, so someone somewhere knows how to decode it. Why wouldn’t Python
just always use the same encoding as the OS it’s running on? Is this some
locale-related thing?
---
> A file with that name exists, so someone somewhere knows how to decode it.
No. A unix filename is just a bunch of bytes (two of them being off-limits).
There is no requirement that it be in any encoding.
You can always use a fallback encoding (an iso-8859) to get something out of
the garbage, but it's just that, garbage.
Windows has a similar issue, NTFS paths are sequences of UCS2 code units, but
there's no guarantee that they form any sort of valid UTF-16 string, you can
find random lone surrogates for instance.
And I'm sure network filesystems have invented their own even worse issues,
because being awful is what they do.
> Why wouldn’t Python just always use the same encoding as the OS it’s running on?
1. because OS don't really have encodings, Python has a function to try and
retrieve FS encoding[0] but per the above there's no requirement that it
is correct for any file, let alone the one you actually want to open
(hell technically speaking it's not even a property of the FS)
2. because OS lie and user configurations are garbage, you can't even trust
the user's locale to be configured properly for reading files (an other
mistake Python 3 made, incidentally)
3. because the user may not even have created the file, it might come from a
broken archive, or some random download from someone having fun with
filenames, or from fetching crap from an FTP or network share
There are a few FS / FS configurations which are reliable, in that case they
either error or pre-mangle the files on intake.
IIRC ZFS can be configured to only accept valid UTF-8 filenames, HFS(+)
requires valid unicode (stored as UTF-16) and APFS does as well (stored as UTF-8).
[0] https://docs.python.org/3/library/sys.html#sys.getfilesystem...
https://news.ycombinator.com/item?id=33986421
Stefan Karpinski:
On UNIX, paths are UTF-8 by convention, but not forced to be valid. Treating
paths as UTF-8 works very well as long as you haven't also made the mistake of
requiring your UTF-8 strings to be valid (which Python did, unfortunately).
On Windows, paths are UTF-16 by convention, but also not forced to be valid.
However, invalid UTF-16 can be faithfully converted to WTF-8 and converted back
losslessly, so you can translate Windows paths to WTF-16 and everything Just
Works™ [1].
There aren't any operating systems I'm aware of where paths are actually
Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by
convention" strings works on all modern OSes.
[1] Ok, here's why the WTF-8 thing works so well. If we write WTF-16 for
potentially invalid UTF-16 (just arbitrary sequences of 16-bit code units), then
the mapping between WTF-16 and WTF-8 space is a bijection because it's
losslessly round-trippable. But more importantly, this WTF-8/16 bijection is
also a homomorphism with respect to pretty much any string operation you can
think of. For example `utf16_concat(a, b) == utf8_concat(wtf8(a), wtf8(b))` for
arbitrary UTF-16 strings a and b. Similar identities hold for other string
operations like searching for substrings or splitting on specific strings.
---
> There aren't any operating systems I'm aware of where paths are actually
Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by
convention" strings works on all modern OSes.
Nonsense. Unix paths use the system locale by convention, and it's entirely
normal for that to be Shift-JIS.
https://news.ycombinator.com/item?id=33985510
Stefan Karpinski:
Absolutely right. Deprecating direct string indexing would have been the right
move. Require writing `str.chars()` to get something that lets you slice by
Unicode characters (i.e. code points); provide `str.graphemes()` and
`str.grapheme_clusters()` to get something that lets you slice by graphemes and
grapheme clusters, respectively. Cache an index structure that lets you do that
kind of indexing efficiently once you've asked for it the first time. Provide an
API to clear the caches.
Not allowing strings to represent invalid Unicode is also a huge mistake (and
essentially forced by the representation strategy that they adopted). It forces
any programmer who wants to robustly handle potentially invalid string data to
use byte vectors instead. Which is exactly what they did with OS paths, but
that's far from the only place you can get invalid strings. You can get invalid
strings almost anywhere! Worse, since it's incredibly inconvenient to work with
byte vectors when you want to do stringlike stuff, no one does it unless forced
to, so this design choice effectively guarantees that all Python code that works
with strings will blow up if it encounters anything invalid—which is a very
common occurrence.
If only there was a type that behaves like a string and supports all the handy
string operations but which handles invalid data gracefully. Then you could
write robust string code conveniently. But at that point, you should just make
that the standard string type! This isn't hypothetical, it's exactly how Burnt
Sushi's bstr type [1] works in Rust and how the standard String type works in
Julia.
[1] https://github.com/BurntSushi/bstr
---
Jasper_
It's worth noting that Python str's are sequences of code points, not scalar
values. This was a truly horrendous mistake made mostly out of ignorance, but
now they rely upon it in surrogateescape to hide "invalid" data, so...
I have ranted for long hours to friends about the insanity of Python 3's text
model before. It's mostly the blind leading the blind.
---
Animats:
Unicode string indexing should have been made lazy, rather than deprecated.
Random access to strings is rare. Mostly, operations are moving forward linearly
or using saved positions.
So, only build the index for random access if needed. Optimize "advance one
glyph" and "back up one glyph" expressed as indexing, and you'll get most of the
frequently used cases. Have the "index" functions that return a string index
return an opaque type that's a byte index. Attempting to convert that to an
integer forces creation of the string index.
This preserves the user visible semantics but keeps performance.
PyPy does something like this.
WTF8
https://news.ycombinator.com/item?id=9611710
The WTF-8 encoding (simonsapin.github.io)
https://news.ycombinator.com/item?id=9613971
https://simonsapin.github.io/wtf-8/#acknowledgments
Thanks to Coralie Mercier for coining the name WTF-8.
---
The name is unserious but the project is very serious, its writer has responded
to a few comments and linked to a presentation of his on the subject[0].
It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates:
while UTF8 is the modern encoding you have to interact with legacy systems,
for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed)
but a number of other legacy systems used UCS2 and added visible surrogates
(rather than proper UTF-16) afterwards.
Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates.
Having to interact with those systems from a UTF8-encoded world is an issue
because they don't guarantee well-formed UTF-16, they might contain unpaired
surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32
(neither allows unpaired surrogates, for obvious reasons).
WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only,
paired surrogates from valid UTF16 are decoded and re-encoded to a proper
UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.
WTF8 exists solely as an internal encoding (in-memory representation),
but it's very useful there.
[0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf
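Python can show the WTF-8-style byte sequence for a lone surrogate with the 'surrogatepass' error handler (not an official WTF-8 implementation, just a way to see the bytes):
    lone = "\ud800"                          # unpaired high surrogate
    lone.encode("utf-8", "surrogatepass")    # b'\xed\xa0\x80' -> what WTF-8 stores for it
    # strict UTF-8 refuses it:
    # lone.encode("utf-8")  would raise UnicodeEncodeError
    # a well-formed pair is instead re-encoded as one 4-byte code point:
    "\U00010000".encode("utf-8")             # b'\xf0\x90\x80\x80'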
https://twitter.com/koalie/status/506821684687413248
Coralie Mercier
@koalie
I have a hunch we use "wtf-8" encoding.
Appreciate the irony of:
" the future of publishing at W3C"
16/07/2021
Windows allows unpaired surrogates in filenames
https://github.com/golang/go/issues/32334
syscall: Windows filenames with unpaired surrogates are not handled correctly #32334
https://github.com/rust-lang/rust/issues/12056
path: Windows paths may contain non-utf8-representable sequences #12056
I don't know the precise details, but there exist portions of Windows in which
paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going
to be an issue but at some point someone (and I wish I could remember who) showed
me some output that showed that they were actually getting a UCS2 path from some
Windows call and Path was unable to parse it.
---
JLF: this is the birth of WTF-8 in 2014.
The result is:
https://simonsapin.github.io/wtf-8
Codepoint/grapheme indexation
https://nullprogram.com/blog/2019/05/29/
ObjectIcon
http://objecticon.sourceforge.net/Unicode.html
ucs (standing for Unicode character string) is a new builtin type, whose behaviour closely mirrors
that of the conventional Icon string. It operates by providing a wrapper around a conventional
Icon string, which must be in utf-8 format. This has several advantages, and only one
serious disadvantage, namely that a utf-8 string is not randomly accessible, in the sense that one
cannot say where the representation for unicode character i begins. To alleviate this disadvantage,
the ucs type maintains an index of offsets into the utf-8 string to make random access faster. The
size of the index is only a few percent of the total allocation for the ucs object.
jlf: I reviewed the code, but could not understand how they do it :-(
Not clear if it's a codepoint indexation or a grapheme indexation.
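The idea as I understand it, sketched in Python over a UTF-8 bytes object: store the byte offset of every Nth code point, then random access only has to scan at most N-1 code points (this is a code point index, not a grapheme index):
    class Utf8Index:
        """Sparse code point -> byte offset index over UTF-8 bytes."""
        def __init__(self, data: bytes, step: int = 64):
            self.data, self.step = data, step
            self.offsets = [0]                 # byte offsets of code points 0, step, 2*step, ...
            count = 0
            for ofs, byte in enumerate(data):
                if byte & 0xC0 != 0x80:        # not a continuation byte -> starts a code point
                    if count and count % step == 0:
                        self.offsets.append(ofs)
                    count += 1

        def byte_offset(self, i: int) -> int:
            # Jump to the nearest indexed code point, then scan forward.
            ofs = self.offsets[i // self.step]
            remaining = i % self.step
            while remaining:
                ofs += 1
                if self.data[ofs] & 0xC0 != 0x80:
                    remaining -= 1
            return ofs

    data = ("déjà vu " * 50).encode("utf-8")
    s = data.decode("utf-8")
    idx = Utf8Index(data)
    data[idx.byte_offset(101):].decode("utf-8") == s[101:]   # True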
https://lwn.net/Articles/864994/
jlf: discussion about Raku NFG and its technical limitations.
It's also the traditional discussion about "why do you need a direct access to the graphemes".
Rope
See also ZenoString (from Michael Kay - Saxonica)
https://github.com/josephg/librope
Little C library for heavyweight utf-8 strings (rope).
https://news.ycombinator.com/item?id=8065608
Discussion about ropes, ideal of strings...
https://github.com/xi-editor/xi-editor/blob/e8065a3993b80af0aadbca0e50602125d60e4e38/doc/rope_science/rope_science_03.md
https://news.ycombinator.com/item?id=34948308
Several references to older papers
https://news.ycombinator.com/item?id=37820532
Text showdown: Gap Buffers vs. Ropes
https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation
Text Buffer Reimplementation
https://en.wikipedia.org/wiki/Piece_table
In computing, a piece table is a data structure typically used to represent a
text document while it is edited in a text editor.
Encoding title
https://www.iana.org/assignments/character-sets/character-sets.xhtml
Character Sets
(IANA Character Sets registry)
These are the official names for character sets that may be used in
the Internet and may be referred to in Internet documentation. These
names are expressed in ANSI_X3.4-1968 which is commonly called
US-ASCII or simply ASCII. The character set most commonly used in the
Internet, and especially in protocol standards, is US-ASCII; this
is strongly encouraged. The use of the name US-ASCII is also
encouraged.
---
jlf: see encoding.spec.whatwg.org elsewhere in this document. They say:
"User agents have also significantly deviated from the labels listed in the
IANA Character Sets registry. To stop spreading legacy encodings further,
this specification is exhaustive about the aforementioned details and therefore
has no need for the registry."
https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape
OCTOBER 12, 2022
JeanHeyd Meneide
Project Editor for ISO/IEC JTC1 SC22 WG14 - Programming Languages, C.
The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust)
---
jlf:
Is he criticizing the work of Zach Laine? ( https://github.com/tzlaine/text )
"someone was doing something wrong on the internet and I couldn’t let that pass:"
Same person:
https://github.com/ThePhD
https://github.com/soasis
Any Encoding, Ever - ztd.text and Unicode for C++ - JUNE 30, 2021 : https://thephd.dev/any-encoding-ever-ztd-text-unicode-cpp
Starting a Basis - Shepherd's Oasis and Text - MAY 01, 2020: https://thephd.dev/basis-shepherds-oasis-text-encoding
https://ztdtext.readthedocs.io/en/latest/index.html
ztd.text
The premiere library for handling text in different encoding forms and reducing
transcoding bugs in your C++ software.
List of encodings: https://ztdtext.readthedocs.io/en/latest/encodings.html
List of Unicode encodings: https://ztdtext.readthedocs.io/en/latest/known%20unicode%20encodings.html
Design Goals and Philosophy: https://ztdtext.readthedocs.io/en/latest/design.html
---
jlf: don't know what to think about that...
related to https://github.com/soasis
https://github.com/soasis/text
JeanHeyd Meneide
This repository is an implementation of an up and coming proposal percolating
through SG16, P1629 - Standard Text Encoding
( https://thephd.dev/_vendor/future_cxx/papers/d1629.html )
---
https://github.com/soasis
Shepherd's Oasis
Software Services and Consulting.
https://encoding.spec.whatwg.org/
Encoding
The Encoding Standard defines encodings and their JavaScript API.
---
The table below lists all encodings and their labels user agents must support.
User agents must not support any other encodings or labels.
<table>
---
Most legacy encodings make use of an index. An index is an ordered list of entries,
each entry consisting of a pointer and a corresponding code point. Within an index
pointers are unique and code points can be duplicated.
Note: An efficient implementation likely has two indexes per encoding.
One optimized for its decoder and one for its encoder.
https://www.git-tower.com/help/guides/faq-and-tips/faq/encoding/windows
Character encoding for commit messages
---
When Git creates and stores a commit, the commit message entered by the user is
stored as binary data and there is no conversion between encodings. The encoding
of your commit message is determined by the client you are using to compose the
commit message.
Git stores the name of the commit encoding if the config key "i18n.commitEncoding"
is set (and if it's not the default value "utf-8").
If you commit changes from the command line, this value must match the encoding
set in your shell environment. Otherwise, a wrong encoding is stored with the
commit and can result in garbled output when viewing the commit history.
If you view the commit log on the command line, the config value "i18n.logOutputEncoding"
(which defaults to "i18n.commitEncoding") needs to match your shell encoding as well.
The command converts messages from the commit encoding to the output encoding.
If your shell encoding does not match the output encoding, you will again receive
garbled output!
https://www.git-scm.com/docs/gitattributes/2.18.0#_working_tree_encoding
gitattributes - Defining attributes per path
working-tree-encoding
Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8,
ISO-8859-1, …) as text files. Files encoded in certain other encodings (e.g.
UTF-16) are interpreted as binary and consequently built-in Git text processing
tools (e.g. git diff) as well as most Git web front ends do not visualize the
contents of these files by default.
In these cases you can tell Git the encoding of a file in the working directory
with the working-tree-encoding attribute. If a file with this attribute is added
to Git, then Git reencodes the content from the specified encoding to UTF-8.
Finally, Git stores the UTF-8 encoded content in its internal data structure
(called "the index"). On checkout the content is reencoded back to the specified
encoding.
---
jlf: there is a number of pitfalls, read the article.
https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text
How to determine the encoding of text
jlf: for Python, not reviewed, may bring interesting infos.
https://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file
How can I detect the encoding/codepage of a text file?
jlf: for C#, not reviewed, may bring interesting infos.
https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1
What is the difference between UTF-8 and ISO-8859-1?
jlf: the interesting part are the comments about ISO-8859-1.
---
ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters.
---
cp1252 is a superset of the ISO-8859-1, containing additional printable
characters in the 0x80-0x9F range, notably the Euro symbol € and the much
maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can
be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1,
but will misbehave when one of those extra symbols shows up.
---
jlf: so the previous comment says that ISO-8859-1 is not defined in the 0x80-0x9F
range... IS IT or IS IT NOT???
---
One more important thing to realise: if you see iso-8859-1, it probably refers
to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F,
where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible
characters instead.
For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085 NEXT LINE),
while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …).
The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be
a label for windows-1252, and web browsers do not support ISO 8859-1 in any way.
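The 0x85 example, checked in Python:
    b"\x85".decode("latin-1")    # '\x85' -> U+0085 NEXT LINE (a C1 control)
    b"\x85".decode("cp1252")     # '…'    -> U+2026 HORIZONTAL ELLIPSIS
    b"\x80".decode("cp1252")     # '€'    -> U+20AC EURO SIGN
    # "€".encode("latin-1")      would raise UnicodeEncodeError: latin-1 has no Euro sign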
https://www.mobilefish.com/tutorials/character_encoding/character_encoding_quickguide_iso8859_1.html
jlf: not sure this page is a good reference. The fact they wrote "Unicode, a 16-bit character set."
brings a doubt about the rest of their page...
I reference it for their definition of ISO-8859-1.
---
HTML and HTTP protocols make frequent reference to ISO Latin-1 and the character
code ISO-8859-1. The HTTP specification mandates the use of the code ISO-8859-1
as the default character code that is passed over the network.
ISO-8859-1 explicitly does not define displayable characters for positions 0-31
and 127-159, and the HTML standard does not allow those to be used for displayable
characters. The only characters in this range that are used are 9, 10 and 13,
which are tab, newline and carriage return respectively.
Note: ISO-8859-1 is also known as Latin-1.
---
jlf: so they say
- 00..1F is not defined except 09, 0A, 0D (so they are different from
https://en.wikipedia.org/wiki/ISO/IEC_8859-1) where all 00..1F is undefined.
- 7F..9F is not defined
Confirmed by their text file:
https://www.mobilefish.com/download/character_set/iso8859_1.txt
ICU title
https://icu.unicode.org
https://unicode-org.github.io/icu/
ICU documentation
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/
Entry point of API Reference
https://icu-project.org/docs/
ICU Documents and Papers
jlf: old?
https://unicode-org.atlassian.net/jira/software/c/projects/ICU/issues/?filter=allissues
ICU tickets
https://github.com/microsoft/icu
jlf: fork by Microsoft
http://stackoverflow.com/questions/8253033/what-open-source-c-or-c-libraries-can-convert-arbitrary-utf-32-to-nfc
What open source C or C++ libraries can convert arbitrary UTF-32 to NFC?
#include <cassert>
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

std::string normalize(const std::string &unnormalized_utf8) {
    // FIXME: until ICU supports doing normalization over a UText
    // interface directly on our UTF-8, we'll use the insanely less
    // efficient approach of converting to UTF-16, normalizing, and
    // converting back to UTF-8.
    // Convert to UTF-16 string
    auto unnormalized_utf16 = icu::UnicodeString::fromUTF8(unnormalized_utf8);
    // Get a pointer to the global NFC normalizer
    UErrorCode icu_error = U_ZERO_ERROR;
    const auto *normalizer = icu::Normalizer2::getInstance(nullptr, "nfc", UNORM2_COMPOSE, icu_error);
    assert(U_SUCCESS(icu_error));
    // Normalize our string
    icu::UnicodeString normalized_utf16;
    normalizer->normalize(unnormalized_utf16, normalized_utf16, icu_error);
    assert(U_SUCCESS(icu_error));
    // Convert back to UTF-8
    std::string normalized_utf8;
    normalized_utf16.toUTF8String(normalized_utf8);
    return normalized_utf8;
}
https://begriffs.com/posts/2019-05-23-unicode-icu.html
Unicode programming, with examples
https://en.wikipedia.org/wiki/Trie
Tries are a form of string-indexed look-up data structure, used to store a dictionary
of words that can be searched in a manner that allows for efficient generation of completion lists.
Tries can be more efficient than binary search trees for string-searching algorithms
such as predictive text, approximate string matching, and spell checking.
A trie can be seen as a tree-shaped deterministic finite automaton.
https://icu.unicode.org/design/struct/utrie
ICU Code Point Tries
We use a form of "trie" adapted to single code points.
The bits in the code point integer are divided into two or more parts.
The first part is used as an array offset, the value there is used as a start offset into another array.
The next code point bit field is used as an additional offset into that array, to fetch another value.
The final part yields the data for the code point.
Non-final arrays are called index arrays or tables.
---
For a general-purpose structure, we want to be able to store a unique value for every character.
This determines the number of bits needed in the last index table.
With 136,690 characters assigned in Unicode 10, we need at least 18 bits.
We allocate data values in blocks aligned at multiples of 4, and we use 16-bit index words shifted left by 2 bits.
This leads to a small loss in how densely the data table can be used, and how well it can be compacted, but not nearly as much as if we were using 32-bit index words.
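A toy C++ sketch of that two-stage lookup (only the idea, not ICU's real UTrie/UCPTrie
layout: block size, index width and the compaction are all simplified; identical data
blocks sharing storage through the index is where the compaction mentioned above comes from):

#include <cstdint>
#include <cstdio>
#include <vector>

struct TwoStageTrie {
    static constexpr int SHIFT = 6;                     // 64 code points per data block
    static constexpr uint32_t MASK = (1u << SHIFT) - 1;
    std::vector<uint32_t> index;                        // one entry per block of 64 code points
    std::vector<uint16_t> data;                         // data blocks; identical blocks can be shared

    uint16_t get(char32_t cp) const {
        uint32_t block = index[cp >> SHIFT];            // stage 1: start offset of the data block
        return data[block + (cp & MASK)];               // stage 2: value inside the block
    }
};

int main() {
    TwoStageTrie t;
    t.data.assign(64, 0);                               // block 0: shared all-zero block
    t.index.assign(0x110000 >> TwoStageTrie::SHIFT, 0); // every 64-code-point range -> block 0
    // Give the range containing 'A'..'Z' its own block, with value 1 for the uppercase letters.
    t.index[U'A' >> TwoStageTrie::SHIFT] = (uint32_t)t.data.size();
    t.data.resize(t.data.size() + 64, 0);
    for (char32_t c = U'A'; c <= U'Z'; ++c)
        t.data[t.index[c >> TwoStageTrie::SHIFT] + (c & TwoStageTrie::MASK)] = 1;
    std::printf("A->%d a->%d U+0391->%d\n", t.get(U'A'), t.get(U'a'), t.get(0x0391));
    return 0;
}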
https://icu.unicode.org/design/struct/tries/bytestrie
It maps from arbitrary byte sequences to 32-bit integers.
(Small non-negative integers are stored more efficiently. Negative integers are the least efficient.)
The BytesTrie and UCharsTrie structures are nearly the same, except that the UCharsTrie uses fewer, larger units.
https://icu.unicode.org/design/struct/tries/ucharstrie
Same design as a BytesTrie, but mapping any UnicodeString (any sequence of 16-bit units) to 32-bit integer values.
https://icu.unicode.org/charts/charset
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt
ICU alias table
jlf: the ultimate reference?
---
# Here is the file format using BNF-like syntax:
#
# converterTable ::= tags { converterLine* }
# converterLine ::= converterName [ tags ] { taggedAlias* }'\n'
# taggedAlias ::= alias [ tags ]
# tags ::= '{' { tag+ } '}'
# tag ::= standard['*']
# converterName ::= [0-9a-zA-Z:_'-']+
# alias ::= converterName
---
standard
# The * after the standard tag denotes that the previous alias is the
# preferred (default) charset name for that standard. There can only
# be one of these default charset names per converter.
---
Affinity tags
If an alias is given to more than one converter, it is considered to be an
ambiguous alias, and the affinity list will choose the converter to use when
a standard isn't specified with the alias.
The general ordering is from specific and frequently used to more general
or rarely used at the bottom.
{ UTR22 # Name format specified by https://www.unicode.org/reports/tr22/
IBM # The IBM CCSID number is specified by ibm-*
WINDOWS # The Microsoft code page identifier number is specified by windows-*. The rest are recognized IE names.
JAVA # Source: Sun JDK. Alias name case is ignored, but dashes are not ignored.
IANA # Source: http://www.iana.org/assignments/character-sets
MIME # Source: http://www.iana.org/assignments/character-sets
}
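A small ICU4C check of how an alias resolves against this table: ucnv_open() accepts
any alias, ucnv_getName() returns the canonical converter name, and ucnv_getStandardName()
the name registered for a given standard tag (a sketch; error handling kept minimal):

#include <unicode/ucnv.h>
#include <cstdio>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    UConverter* cnv = ucnv_open("latin1", &status);     // "latin1" is just an alias
    if (U_FAILURE(status)) return 1;
    std::printf("canonical name: %s\n", ucnv_getName(cnv, &status));   // e.g. ISO-8859-1
    const char* mime = ucnv_getStandardName(ucnv_getName(cnv, &status), "MIME", &status);
    std::printf("MIME name: %s\n", mime ? mime : "(none)");
    ucnv_close(cnv);
    return 0;
}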
https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings
Encodings
https://unicode-org.atlassian.net/browse/ICU-22422
Collation folding
jlf: see Markus Scherer feedback
https://sourceforge.net/p/icu/mailman/icu-design/thread/SN6PR00MB04468327B475F4D6A19CF26FAFFFA%40SN6PR00MB0446.namprd00.prod.outlook.com/#msg38268251
[icu-design] Collation Folding Tables
jlf: this is a discussion related to ICU-22422
https://www.unicode.org/reports/tr10/#Collation_Folding
Collation Folding
Matching can be done by using the collation elements, directly, as discussed above.
However, because matching does not use any of the ordering information, the same
result can be achieved by a folding. That is, two strings would fold to the same
string if and only if they would match according to the (tailored) collation.
For example, a folding for a Danish collation would map both "Gård" and "gaard"
to the same value. A primary-strength folding would map "Resume"
and "résumé" to the same value. That folded value is typically a lowercase string,
such as "resume".
jlf:
Chrome matches "Gård" with "gard", but not with "gaard".
A comparison between folded strings cannot be used for an ordering of strings,
but it can be applied to searching and matching quite effectively. The data for
the folding can be smaller, because the ordering information does not need to be
included. The folded strings are typically much shorter than a sort key, and are
human-readable, unlike the sort key. The processing necessary to produce the
folding string can also be faster than that used to create the sort key.
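ICU does not expose a public collation-folding API (that is what ICU-22422 above is about),
but the matching behaviour described here can be observed with a primary-strength
comparison; a minimal ICU4C (C++) sketch:

#include <unicode/coll.h>
#include <unicode/unistr.h>
#include <cassert>
#include <memory>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale::getRoot(), status));
    assert(U_SUCCESS(status));
    coll->setStrength(icu::Collator::PRIMARY);          // ignore case and diacritics
    UCollationResult r = coll->compare(icu::UnicodeString::fromUTF8("Resume"),
                                       icu::UnicodeString::fromUTF8("résumé"),
                                       status);
    assert(U_SUCCESS(status) && r == UCOL_EQUAL);       // matches, as the folding example above implies
    return 0;
}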
Transliterate "micro sign" to "u" using Transliterator from icu4j
jlf: next is an answer on icu-support@lists.sourceforge.net
https://sourceforge.net/p/icu/mailman/message/58712806/
On Wed, Dec 13, 2023 at 7:52 PM <go.al.ni@gmail.com> wrote:
> Micro sign transliterated to "m" in one case, but not in another.
While I don't know enough about the Any-Latin transliteration rules to
be able to tell you why this happens, the thing that happens is that
when you have any preceding Greek letter the transliterator will
afterwards treat also the micro sign (U+00B5) as a Greek letter, while
it otherwise will leave it as-is, as any other symbol.
If you want to transliterate only Greek letters you could explicitly
create a Greek transliterator, which then will always treat also the
micro sign (U+00B5) as a Greek letter:
var tr = Transliterator.getInstance("Greek-Latin");
Or, if you want to first treat any symbols that are also Greek letters
explicitly as Greek letters and then perform the Any-Latin
transliteration:
var tr = Transliterator.getInstance("Greek-Latin; Any-Latin;");
Or, if you want just Any-Latin but with a special case for the micro
sign (U+00B5):
var tr = Transliterator.createFromRules("MyAnyLatin", "µ > m; ::Any-Latin;", Transliterator.FORWARD);
[icu-support] CollationKey for efficient collation-aware in-place substring comparison
Question
https://sourceforge.net/p/icu/mailman/message/58741675/
I have a question regarding the use of CollationKey
<https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/CollationKey.html>
to check whether one string "contains" the other (i.e. right string is
found anywhere in the left string, accounting for any specified rule-based
collation using ICU4J). With this, my use case in Java would be something
like: *contains(String left, String right, String collation)*. Suppose that
*collation* here is a parameter indicating the collation at hand (for
example: "Latin1_General_CS_AI"), and is used to get the appropriate
instance of *com.ibm.icu.text.Collator* (exact routing for this collation
is handled elsewhere in the codebase).
Problem description
Due to the nature of this operation, using *Collator.compare(String,
String)* proves inefficient for this problem, because it would require
allocating O(N) substrings of *left *before calling *compare(left.substring(),
right)*. Suppose N here is the length of the *left* string.
Example: *contains*("Abć", "a", "Latin1_General_CS_AI"); // returns false
- calls: *collator.compare("A", "a")* // returns false ("A" here is
"Abć".substring(0,1))
- calls: *collator.compare("b", "a")* // returns false ("b" here is
"Abć".substring(1,2))
- calls: *collator.compare("ć", "a")* // returns false ("ć" here is
"Abć".substring(2,3))
Here, this approach allocates *3 new strings* in order to do the
comparisons.
Using CollationKey
As I understood, *com.ibm.icu.text.CollationKey* is the way to go for
repeated comparison of strings. Here, I would like to compare strings in a
way that only requires generating one key for *left* (let's call it
*leftKey*) and one key for *right* (let's call it *rightKey*), and then
comparing these arrays in-place, byte per byte.
However, it doesn't seem that this operation is supported out-of-the-box
with *CollationKey*. While one can easily use two collation keys
for equality comparison and collation-aware ordering, I'm not sure if this
holds for substring operations as well? Given a collation key for "Abć", is
there a constant-time way to obtain collation keys for "A", "b", and "ć"?
Ideally, I would want to only traverse the "Abć" collation key (*leftKey*)
as a plain byte array, and do in-place comparison with the "ć" collation
key (*rightKey*) as a plain byte array. However, it doesn't seem
straightforward given the structure of the collation key (suffixes, etc.)
public boolean contains(String left, String right, String collation) {
    Collator collator = ...(collation);
    // get collation keys
    CollationKey leftKey = collator.getCollationKey(left);
    CollationKey rightKey = collator.getCollationKey(right);
    // get byte arrays
    byte[] lBytes = leftKey.toByteArray();
    byte[] rBytes = rightKey.toByteArray();
    // in-place comparison
    for (int i = 0; i <= lBytes.length - rBytes.length; i++) {
        if (compareKeys(lBytes, rBytes, i)) {
            return true;
        }
    }
    return false;
}
Suppose there's a simple helper function such as:
private boolean compareKeys(byte[] lBytes, byte[] rBytes, int offset) {
    int len = rBytes.length;
    // compare lBytes[i, i+len] to rBytes[0, len] in-place, byte by byte...
}
Could you please provide any support regarding how to implement this
solution so that it fully takes into account the collation key byte array
structure? As of now, this simple comparison doesn't work because there are
some suffixes in both *leftKey* and *rightKey*, so exact comparison is not
possible, but I'm wondering if there is a way to go around this.
Alternative
It turns out that making use of *Collator.compare(Object, **Object**)* instead
of *Collator.compare(String, **String**)* doesn't prove to be any better
either, because it does *toString()* anyway, regressing the performance in
a similar fashion. Ideally, an implementation such as
*Collator.compare(Character, **Character**)* could do the trick, however
only under the condition that it would *not allocate* a new *String* for
the two arguments. This would allow traversing *left* and *right* strings
and comparing individual characters just by using *String.charAt* (with no
extra *String* allocation whatsoever).
However, I don't believe there is currently anything like
*Collator.compare(**Character**, **Character**)* that works exactly like
this. So for now, I'm trying to implement this functionality using
*CollationKey*.
Answer from Markus Scherer
https://sourceforge.net/p/icu/mailman/message/58741856/
Yes, but CollationKey is too low-level, and you would have to compute and
store the CollationKey for the entire left string at once, which could be
large.
“Don't do this at home” :-)
Please use class StringSearch
<https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html>
https://unicode-org.github.io/icu/userguide/collation/string-search.html
I don't remember if StringSearch automatically loads "search" tailorings;
it's possible that you may have to request that explicitly.
https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback
https://www.unicode.org/reports/tr10/#Searching
https://docs.google.com/document/d/1_sbbFermCe24uK9Q59AOSY2XKnl2qQPCTrmUExrcDaI/edit#heading=h.86zz4h2lnsqb
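The answer is about ICU4J, but ICU4C has the same StringSearch class; a C++ sketch of
the recommended approach (the function name and the primary-strength setting are only
illustrative, not part of the answer):

#include <unicode/coll.h>
#include <unicode/stsearch.h>
#include <unicode/tblcoll.h>
#include <unicode/unistr.h>
#include <cassert>
#include <memory>

// Collation-aware "contains": is there a substring of `haystack` that matches `needle`
// under the given locale's collation rules?
bool collationContains(const icu::UnicodeString& haystack,
                       const icu::UnicodeString& needle,
                       const icu::Locale& locale) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::RuleBasedCollator> coll(
        static_cast<icu::RuleBasedCollator*>(icu::Collator::createInstance(locale, status)));
    assert(U_SUCCESS(status));
    coll->setStrength(icu::Collator::PRIMARY);          // accent-/case-insensitive matching
    icu::StringSearch search(needle, haystack, coll.get(), nullptr, status);
    assert(U_SUCCESS(status));
    return search.first(status) != USEARCH_DONE;        // no sort keys, no substring allocations
}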
[icu-design] Mixed locales
Mail from Rich Williams
The topic of mixed locales (aka region-based inheritance/fallback) came up in
this morning’s CLDR/ICU design meeting, and it sounded like maybe it was time to
restart the discussion we had on this topic a couple years ago.
I was asked to again send around the proposal I wrote up for this back then.
So here it is.
[icu-support] Rounding collated strings up/down
On Thu, May 9, 2024 at 2:56 PM Stefan Kandic <stefan.kandic@databricks.com> wrote:
| Let's say you are working with a string "abcdef" with utf8 binary collation
| and you want to truncate it to only 3 characters,
| rounded down version would just be "abc" => all strings that are greater than
| the original are also greater than the new one
Markus Scherer
The complication here would be contractions and expansions, and interactions with
normalization.
It would probably be best if you normalized the input to NFD, or at least to
something that fits "FCD" (which is supported in ICU), in order to avoid canonical
reordering.
Then you should be able to use a CollationElementIterator and observe when its
source text index moves. When you are inside an expansion, it should remain the
same for several collation elements.
And when it moves, it will have consumed multiple characters for a contraction.
For example, in Slovak it will move from before "ch" to after it, not in between.
| rounded up version would be "abd" or even "ab" + utf8 max character
| => all strings that are less than the original are also less than the new one
This kind of "rounding up" is supported in CLDR/ICU collation.
Take your lower bound and append a U+FFFF.
See https://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights
| Basically, idea is to get truncated values to save on storage but also to have
| sortKey(roundedDown) < sortKey(original) < sortKey(roundedUp) be as tight bound
| as possible.
Try to avoid "rounding down" because it's messy.
ICU uses several tricks to "compress" sort keys while maintaining their binary order.
Note that sort keys are not stable across Unicode/CLDR/ICU versions.
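A C++ (ICU4C) sketch of the "append U+FFFF" upper bound described above, checking that
the sort key of any string starting with the truncated prefix sorts between the two
bound keys (the prefix "abc" and the comparison string are arbitrary examples):

#include <unicode/coll.h>
#include <unicode/unistr.h>
#include <cassert>
#include <memory>
#include <vector>

static std::vector<uint8_t> sortKey(const icu::Collator& coll, const icu::UnicodeString& s) {
    std::vector<uint8_t> key(coll.getSortKey(s, nullptr, 0));   // preflight: required size
    coll.getSortKey(s, key.data(), (int32_t)key.size());
    return key;
}

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale::getRoot(), status));
    assert(U_SUCCESS(status));

    icu::UnicodeString prefix("abc");
    icu::UnicodeString upper(prefix);
    upper.append((UChar)0xFFFF);    // sorts above everything in CLDR/ICU collation (see the TR35 link above)

    std::vector<uint8_t> loKey  = sortKey(*coll, prefix);
    std::vector<uint8_t> hiKey  = sortKey(*coll, upper);
    std::vector<uint8_t> midKey = sortKey(*coll, icu::UnicodeString("abcdef"));

    assert(loKey < midKey && midKey < hiKey);           // lexicographic byte order of the sort keys
    return 0;
}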
ICU demos
https://icu4c-demos.unicode.org/icu-bin/icudemos
todo: review
https://icu4c-demos.unicode.org/icu-bin/collation.html
ICU Collation Demo
https://icu4c-demos.unicode.org/icu-bin/convexp
Demo Converter Explorer
https://icu4c-demos.unicode.org/icu-bin/scompare
ICU Unicode String Comparison
Interactive demo application
ICU bindings
02/06/2021
https://gitlab.pyicu.org/main/pyicu
Python extension wrapping the ICU C++ libraries.
02/06/2021
https://docs.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu-
In Windows 10 Creators Update, ICU was integrated into Windows, making the C APIs and data publicly accessible.
The version of ICU in Windows only exposes the C APIs.
It is impossible to ever expose the C++ APIs due to the lack of a stable ABI in C++.
Getting started
1) Your application needs to target Windows 10 Version 1703 (Creators Update) or higher.
2) Add in the header:
#include <icu.h>
3) Link to:
icu.lib
Example:
void FormatDateTimeICU()
{
    UErrorCode status = U_ZERO_ERROR;

    // Create an ICU date formatter, using only the 'short date' style format.
    UDateFormat* dateFormatter = udat_open(UDAT_NONE, UDAT_SHORT, nullptr, nullptr, -1, nullptr, 0, &status);
    if (U_FAILURE(status))
    {
        ErrorMessage(L"Failed to create date formatter.");
        return;
    }

    // Get the current date and time.
    UDate currentDateTime = ucal_getNow();

    int32_t stringSize = 0;
    // Determine how large the formatted string from ICU would be.
    stringSize = udat_format(dateFormatter, currentDateTime, nullptr, 0, nullptr, &status);
    if (status == U_BUFFER_OVERFLOW_ERROR)
    {
        status = U_ZERO_ERROR;
        // Allocate space for the formatted string.
        auto dateString = std::make_unique<UChar[]>(stringSize + 1);
        // Format the date time into the string.
        udat_format(dateFormatter, currentDateTime, dateString.get(), stringSize + 1, nullptr, &status);
        if (U_FAILURE(status))
        {
            ErrorMessage(L"Failed to format the date time.");
            return;
        }
        // Output the formatted date time.
        OutputMessage(dateString.get());
    }
    else
    {
        ErrorMessage(L"An error occurred while trying to determine the size of the formatted date time.");
        return;
    }

    // We need to close the ICU date formatter.
    udat_close(dateFormatter);
}
http://www.boost.org/doc/libs/1_58_0/libs/locale/doc/html/index.html
Boost.Locale creates the natural glue between the C++ locales framework, iostreams, and the powerful ICU library
http://blog.lukhnos.org/post/6441462604/using-os-xs-built-in-icu-library-in-your-own
Using OS X’s Built-in ICU Library in Your Own Project
ICU4X title
https://icu4x.unicode.org/
led by Shane Carr (https://www.sffc.xyz)
https://github.com/unicode-org/icu4x
https://docs.rs
jlf: if there is a version number in the path, you can replace it with "latest"
https://www.unicode.org/faq/unicode_license.html
jlf: ICU4X uses UNICODE LICENSE V3
The Unicode License is a permissive MIT type of license. However, there are
several additional considerations identified separately in the associated
Unicode Terms of Use (https://www.unicode.org/copyright.html).
---
Comparison with other licenses:
https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses
jlf: hum... the "unicode license" is not in this table...
https://www.reddit.com/r/rust/comments/q4xaig/icu_vs_rust_icu/
icu vs rust_icu
Oct 10, 2021
---
jlf: here "icu" is ICU4X and rust_icu is another crate.
Well... it's a mess: plenty of separate crates, more or less finalized.
There is a comment from an ICU4X committer saying "ICU4X does not have normalization".
Normalization is supported now, but it shows that ICU4X is far from being as complete
as ICU.
https://news.ycombinator.com/item?id=35608997
ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices
MONDAY, APRIL 17, 2023
http://blog.unicode.org/2022/09/announcing-icu4x-10.html
SEPTEMBER 29, 2022
Announcing ICU4X 1.0
This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners,
the Unicode Consortium has published ICU4X 1.0, its first stable release.
Lightweight:
ICU4X is Unicode's first library to support static data slicing and dynamic data loading.
Portable:
ICU4X supports multiple programming languages out of the box. ICU4X can be used
in the Rust programming language natively, with official wrappers in C++ via the
foreign function interface (FFI) and JavaScript via WebAssembly.
ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number
of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written
to bring i18n to new programming languages and resource-constrained environments
One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an
explicit data provider argument on most constructor functions.
ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers.
https://manishearth.github.io/blog/2022/08/03/zero-copy-1-not-a-yoking-matter/
https://manishearth.github.io/blog/2022/08/03/zero-copy-2-zero-copy-all-the-things/
https://manishearth.github.io/blog/2022/08/03/zero-copy-3-so-zero-its-dot-dot-dot-negative/
(jlf: related to ICU4X, but should I read that? It's internal Rust stuff.)
https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md
Using ICU4X from C++
https://www.reddit.com/r/programming/comments/xrmine/the_unicode_consortium_announces_icu4x_10_its_new/
The C and C++ APIs are header-only, you use them by linking to the icu_capi crate (more on this here).
https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md
The C API is just not that idiomatic, so we don't advertise it as much.
It exists more as a crutch for other languages to be able to call in, and it's optimized for cross language interop.
That said, it has been pointed out to me that it's not that unidiomatic when you compare it with other large C libraries,
so perhaps that's okay. We do have some tests that use it directly and it's .... fine to work with.
Not an amazing experience, not terrible either.
---
jlf: to investigate
The C wrapper is probably better to use from Executor, because there is no hidden magic for memory management.
The C++ wrapper is difficult to understand (at least to me, for the moment) because it's modern C++.
https://www.reddit.com/r/rust/comments/xrh7h6/announcing_icu4x_10_new_internationalization/
icu_segmenter implements rule based segmentation, so you can actually customize
the segmentation rules based on your needs by writing some toml and feeding it to datagen.
The concept of a "character" or "word" has no single cross-linguistic meaning;
it is not uncommon to need to tailor these algorithms by use case or even just
the language being used. E.g. handling viramas in Indic scripts as a part of
grapheme segmentation is a thing people might need, but may also not need,
and UAX29 doesn't support that at the moment¹. CLDR contains a bunch of common
tailorings for specific locales here, but as I mentioned folks may tailor further based on use case.
Furthermore, icu_segmenter supports dictionary-based segmentation:
for languages like Japanese and Thai where spaces are not typically used,
you need a large dictionary to be able to segment them accurately
(and again, it's language-specific).
ICU4X's flexible data model means that you don't need to ship your application
with this data and instead fetch it when it's actually necessary.
We both support using dictionaries and an LSTM model depending on your code size/data size needs.
https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit#heading=h.xy9pq2mk1ypz
ICU4X Segmenter Investigation
https://github.com/unicode-org/icu4x/issues/1397
Character names
jlf: Not yet supported by ICU4X, too bad... I need that for Executor.
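For comparison, ICU4C exposes character names via u_charName() in uchar.h (this is the
functionality that the issue above says is still missing in ICU4X):

#include <unicode/uchar.h>
#include <cstdio>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    char name[256];
    u_charName(0x1F1FA, U_UNICODE_CHAR_NAME, name, (int32_t)sizeof(name), &status);
    if (U_SUCCESS(status)) {
        std::printf("U+1F1FA = %s\n", name);    // REGIONAL INDICATOR SYMBOL LETTER U
    }
    return 0;
}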
https://github.com/unicode-org/icu4x/issues/545
Reconsider UTF-32 support
jlf: see also the comments about PyICU
https://github.com/unicode-org/icu4x/issues/131
Port BytesTrie to ICU4X #131
with feedback from Markus Scherer (ICU)
https://github.com/unicode-org/icu4x/issues/2721
Specialized zerovec collections for stringy types
Sketch of a potential AsciiTrie.
https://github.com/unicode-org/icu4x/pull/2722
Experimental AsciiTrie implementation
https://github.com/unicode-org/icu4x/issues/2755
Get word break type
When you iterate through text using the WordBreakIterator, you get the boundaries of words, spaces, punctuation, etc.
It does not appear to tell you what kind of token or break it has found.
The C-language version of ICU has a function on the iterator called getRuleStatus()
that returns an enum that describes the last break it found. The documentation is here:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/
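A short ICU4C (C++) illustration of the getRuleStatus() feature mentioned above: the
word break iterator tags each boundary with the kind of token that precedes it
(letters, numbers, kana, ideographs, or none for spaces/punctuation):

#include <unicode/brkiter.h>
#include <unicode/ubrk.h>      // UBRK_WORD_* rule status ranges
#include <unicode/unistr.h>
#include <cstdio>
#include <memory>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createWordInstance(icu::Locale::getEnglish(), status));
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString text = icu::UnicodeString::fromUTF8("ICU 74, words and 123 numbers");
    bi->setText(text);
    for (int32_t end = bi->next(); end != icu::BreakIterator::DONE; end = bi->next()) {
        int32_t tag = bi->getRuleStatus();      // status of the break preceding this position
        const char* kind = (tag >= UBRK_WORD_NUMBER && tag < UBRK_WORD_NUMBER_LIMIT) ? "number"
                         : (tag >= UBRK_WORD_LETTER && tag < UBRK_WORD_LETTER_LIMIT) ? "letter"
                         : "other";
        std::printf("boundary at %d (%s)\n", (int)end, kind);
    }
    return 0;
}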
https://github.com/unicode-org/icu4x/pull/2777/files
added initial benchmarks for normalizer.
https://github.com/unicode-org/icu4x/discussions/2877
How to use segmenter
https://github.com/unicode-org/icu4x/issues/2886
Consider providing mapping to Standardized Variants for CJK Compatibility Ideographs
Across GitHub, I found 3 users of this feature in unicode-normalization:
https://github.com/sunfishcode/basic-text (by the implementor of the unicode-normalization feature)
https://github.com/logannc/fuzzywuzzy-rs (unclear to me why you'd want this for a fuzzy match; I'd expect a fuzzy match not to want to distinguish the variations)
https://github.com/crlf0710/runestr-rs
https://github.com/unicode-org/icu4x/issues/2975
How supported do we consider non-keyextract users?
https://github.com/unicode-org/icu4x/issues/2908
Time zone needs for calendar application
Use case by team member of Mozilla Thunderbird
Not related to Unicode, but related to the fact that I put the ICU4X cdylib in the Executor github repo...
https://github.com/ankane/polars-ruby/blob/master/ext/polars/Cargo.toml
Is this a way to avoid bundling the original Rust lib?
https://news.ycombinator.com/item?id=34425233
---
Not clear to me: for Python, are the lib binaries installed by
https://pypi.org/project/polars/
?
apparently yes, see https://pypi.org/project/polars/#files
---
For ruby, is it built by a github workflow?
https://github.com/ankane/polars-ruby/blob/master/.github/workflows/release.yml
https://github.com/unicode-org/icu4x/pull/2779/files
add collator initial bench
https://github.com/unicode-org/icu4x/issues/3151
icu_casemapping feature request: methods fold and full_fold should apply Turkic mappings depending on locale
---
Markus Scherer:
Applying Turkic case foldings automatically is dangerous.
While case mappings are intended for human consumption and take a locale parameter,
case foldings are used for processing (case-insensitive matching) not for display,
and in most cases it is very surprising when "IBM" and "ibm" don't match when the
locale is Turkish or Azerbaijani.
It is much safer to let the developer control this explicitly. (By comparison,
ICU4C/ICU4J have folding functions that take a boolean parameter for default vs.
Turkic foldings. This also models the boolean condition in the relevant Unicode
data file.)
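(Aside, not part of the quoted comment: what that boolean parameter looks like in ICU4C.
UnicodeString::foldCase() takes U_FOLD_CASE_DEFAULT or U_FOLD_CASE_EXCLUDE_SPECIAL_I,
the latter being the Turkic folding.)

#include <unicode/uchar.h>      // U_FOLD_CASE_DEFAULT, U_FOLD_CASE_EXCLUDE_SPECIAL_I
#include <unicode/unistr.h>
#include <cstdio>
#include <string>

static std::string foldUtf8(const char* s, uint32_t options) {
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(s);
    u.foldCase(options);
    std::string out;
    u.toUTF8String(out);
    return out;
}

int main() {
    // Default folding: "IBM" folds to "ibm", whatever the language.
    std::printf("%s\n", foldUtf8("IBM", U_FOLD_CASE_DEFAULT).c_str());              // ibm
    // Turkic folding: I folds to dotless i, so "IBM" no longer matches "ibm".
    std::printf("%s\n", foldUtf8("IBM", U_FOLD_CASE_EXCLUDE_SPECIAL_I).c_str());    // ıbm
    return 0;
}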
---
lucatrv
If I understand correctly, icu_collator should be used when strings need to be sorted,
while a case-folding method of icu_casemapping should be used when strings need
just to be matched. However icu_collator can also be used to match strings, see
for instance examples using Ordering::Equal here, so it is not clear to me which
one to use in this case.
Finally, another source of confusion (at least for me) is that icu_casemapping
can be used for both case mapping and case folding, but its documentation mentions
only "Case mapping for Unicode characters and strings".
---
sffc
The collator does a fuzzier match. The example you cited shows that it considers
"às" and "as" to be equal, for example.
@markusicu is it safe to say that most users who are looking for a fuzzy string
comparison utility should favor the collator over casefold+nfd?
---
sffc
See also https://github.com/tc39/ecma402/issues/256
---
hsivonen
Casefold+NFD and ignoring combining diacritics after the NFD operation gives a
general case-insensitive, diacritic-insensitive match. To further match the root
search collation (apart from the Hangul aspect for which I don't understand the
use case), you'd have to also ignore certain Arabic marks and the Thai phinthu
(virama). (The Hebrew aspect of the search root is gone from CLDR trunk already.)
Apart from Turkic case-insensitivity, the key thing that the search collation
tailorings provide on top of the above is being able to have a diacritic-insensitive
mode where certain things that technically are diacritics but that are on a
per-language basis considered to form a distinct base letter are not ignored on
a locale-sensitive basis. For example, o and ö are distinct for Finnish, Swedish,
Icelandic, and Turkish (not sure if them being equal for Estonian search is
intentional or a CLDR bug) in collator-based search even when ignoring diacritics.
Based on observing the performance of Firefox's ctrl/cmd-f (not collator based)
relative to Chrome's and Safari's (collator-based), I believe that casefold+NFD
and ignoring certain things post-NFD will be faster than collator-based search.
However, if you also want not to ignore certain diacritics on a per-locale basis,
it's up to you to implement those rules. That is, ICU4X doesn't do it for you.
You can find out what the rules are by reading the CLDR search collation sources.
(FWIW, Firefox's ctrl/cmd-f does not have locale-dependent rules for diacritics.
The checkbox either ignores all of them or none.)
ECMA-402 and ICU4X don't have API surface for collator-based substring match.
You can only do full-string comparison, so you can search in the sense of
filtering a set/list of items by a search key.
---
Markus Scherer
> If I understand correctly, CaseMapping::to_full_fold applies full case folding
> + NFD and ignores combining diacritics.
I think not. I believe it just applies the “full” Case_Folding mappings to each
character, as opposed to the Simple_Case_Folding. Normalization and removing
diacritics etc. would be separate steps / function calls.
https://www.unicode.org/reports/tr44/#Case_Folding
> Therefore it actually provides the fuzziest match (general case-insensitive
> and diacritic-insensitive match). To my understanding this should be equivalent
> to the icu_collator primary strength level,
> https://icu4x.unicode.org/doc/icu_collator/enum.Strength.html#variant.Primary
No. Similar in effect, but as Henri said, collation mappings do a lot more, such
as ignoring control codes and variation selectors.
> which I guess is independent from locale
Not really. There are language-specific collation mappings, such as German "ä"="ae"
(on primary level), but of course for the majority of Unicode characters each
tailoring behaves like the Unicode default.
Collation also provides for a number of parametric settings, although most of
those are relevant for sorting, not for matching and searching. They do let you
select things like “ignore punctuation” and “ignore diacritics but not case”.
https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
---
lucatrv
Referring to Section 3.13, Default Case Algorithms in the Unicode Standard,
now I understand that CaseMapping::full_fold applies the toCasefold(X)
operation (R4 page 155), which is the Case_Folding property.
To allow proper caseless matching of strings interpreted as identifiers,
in my opinion another method CaseMapping::NFKC_full_fold should be added,
to apply the toNFKC_Casefold(X) operation (R5 page 155), which is the
NFKC_Casefold property. Then another method should be added to allow
identifier caseless matching, which could be either the combined function
toNFKC_Casefold(NFD(X)) (D147 page 158) or the lower level NFD(X)
normalization function. Otherwise to keep things simpler, maybe just a
method named CaseMapping::caseless could be added which applies
toNFKC_Casefold(NFD(X)) (D147 page 158). Do you agree, or otherwise how can
I perform proper caseless categorization and matching?
---
eggrobin
For case-insensitive identifier comparison (identifiers include programming
language identifiers, but also things like usernames: @EGGROBIN and @eggrobin
are the same person), Unicode provides the operation toNFKC_Casefold, used
in the definition of identifier caseless match (D147 in Default Caseless Matching).
Earlier versions of Unicode (prior to 5.2) recommended the use of NFKC and
casefolding directly, without the removal of default ignorable code points
performed by toNFKC_Casefold.
The foldings thus have stability guarantees that make them suitable for usage
in identifier comparison in conjunction with NFKC
(see https://www.unicode.org/policies/stability_policy.html#Case_Folding).
As @markusicu wrote above, since identifier systems typically need to use a
locale-independent comparison, the Turkic foldings need to be used with great
care: whether @eggrobin is the same as @EGGROBIN should not depend on
someone’s language.
@markusicu is it safe to say that most users who are looking for a fuzzy
string comparison utility should favor the collator over casefold+nfd?
^ @macchiati for advice on the most recommended way to perform fuzzy string
matching.
I am neither Markus nor Mark, but I would say that for general-purpose
matching that does not have stability requirements, something collation-based
is more appropriate. In particular, Chrome’s Ctrl+F search uses that.
This is, as has been mentioned, fuzzier (beyond the accents already mentioned,
note that ŒUF and œuf are primary-equal to oeuf, whereas they are not
identifier caseless matches).
An important consideration is that, being unstable (there is a somewhat
squishy stability policy, see https://www.unicode.org/policies/collation_stability.html
and https://www.unicode.org/collation/ducet-changes.html), fuzzy matching
based on collation can be improved. Most recently the UTC approved (in
consensus 174-C4) a change to the collation of punctuation marks that look
like the ASCII ' and ", which has the effect that O'Connor will now be
primary-equal to O’Connor.
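ICU4C exposes the toNFKC_Casefold(X) operation discussed above directly through
Normalizer2; a minimal sketch of an identifier caseless match (strictly, D147 also
applies NFD first, omitted here for brevity):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <cassert>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfkcCf = icu::Normalizer2::getNFKCCasefoldInstance(status);
    assert(U_SUCCESS(status));
    icu::UnicodeString a = nfkcCf->normalize(icu::UnicodeString::fromUTF8("EGGROBIN"), status);
    icu::UnicodeString b = nfkcCf->normalize(icu::UnicodeString::fromUTF8("eggrobin"), status);
    assert(U_SUCCESS(status) && a == b);    // identifier caseless match, locale-independent
    return 0;
}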
https://github.com/unicode-org/icu4x/issues/3178
Consider supporting three layers of collation data for search collations
Markus Scherer
Outside of ICU4X we usually try to make code & data work according to the
algorithms, not according to what the known data looks like right now.
ICU4C/J allow users to build custom tailorings at build time and at runtime.
It should be possible to tailor relative to something that is tailored in the
intermediate root search.
https://github.com/unicode-org/icu4x/issues/3174#issuecomment-1624080765
Should search collation be a different data key + constructor? #3174
---
jlf
Don't know if this long comment brings something useful for Rexx.
They are searching for use cases:
whole-string matching, collation, substring or prefix matching.
https://www.unicode.org/reports/tr10/#Searching: It's typically used for a substring
match, like Ctrl-F in a browser.
Why is collation the way it is? There's a use case for diacritic-insensitive string
matching. And there is also the observation that you need special handling for certain
diacritics like German umlauts.
It seems weird that Thai for example has certain tailorings that are not in other
Brahmic languages.
https://github.com/unicode-org/icu4x/discussions/3981#discussioncomment-6882618
String search with collators
references this ICU link:
https://unicode-org.github.io/icu/userguide/collation/string-search.html
https://github.com/unicode-org/icu4x/issues/3174#issuecomment-1624080765
Should search collation be a different data key + constructor?
jlf: referenced from #3981 with this comment:
We've had discussions about search collations in the past, such as #3174
Basically, we need a client with a clear and compelling use case who ideally can
make some contributions, and then the team can provide mentorship to help land
this type of feature.
icu_collator version 1.3.3 is released.
https://github.com/unicode-org/icu4x/releases/tag/ind%2Ficu_collator%401.3.3
https://docs.rs/icu_collator/latest/icu_collator/
Comparing strings according to language-dependent conventions.
jlf: with examples
jlf: implementation notes. https://docs.rs/icu_collator/latest/icu_collator/docs/index.html
They use NFD?
"The key design difference between ICU4C and ICU4X is that ICU4C puts the
canonical closure in the data (larger data) to enable lookup directly by
precomposed characters while ICU4X always omits the canonical closure and always
normalizes to NFD on the fly."
jlf: ok, on the fly, so part of their algorithm.
https://github.com/unicode-org/icu4x/discussions/3231#discussioncomment-5599221
@sffc , Will ICU4X Test Data provider give correct results for Lao language?
I was running segment_utf16 on Lao string but its results are not inline with ICU4C results.
The ICU4X Test Data provider supports Japanese and Thai. For the other languages,
you should follow the steps in the tutorial to generate your own data; in general
the testdata provider is intended for testing. You can also track #2945 which
will make it possible to get full data without needing to build it using the tool.
https://www.youtube.com/watch?v=ZzsbN7HBd7E
Rust Zürisee, Dec 2022: Next Generation i18n with Rust Using ICU4X
Talk by Shane Carr (starts at 11:20, with some intros from the organizers first)
https://github.com/unicode-org/icu4x/discussions/3522
Some word segmentation results are different than we get in ICU4C
- Khmer string មនុស្សទាំងអស់ is giving 13 index as a breakpoint in ICU4X while ICU4C gives 6
- ຮ່ສົ່ສີ 5 in ICU4C while 7 in ICU4X
- กระเพรา 3 in ICU4C while 7 in ICU4X
I'm using the full data blob with all keys and locales.
jlf: see the discussion, there is some code.
https://github.com/unicode-org/icu4x/issues/2945
Default constructors with full data
jlf: remember "close #2743 in favour of #2945. the solution we're working on there trivially extends to FFI."
sffc
We have built data providers as a first-class feature in ICU4X. We currently tutor
clients on how to build their data file and detail all the knobs at their disposal,
which is essential to ICU4X's mission.
https://github.com/unicode-org/icu4x/issues/3552#issuecomment-1600050638
/// ICU4C's TestGreekUpper
#[test]
fn test_greek_upper() {
    let cm = CaseMapping::new_with_locale(&locale!("el"));
    // https://unicode-org.atlassian.net/browse/ICU-5456
    assert_eq!(cm.to_full_uppercase_string("άδικος, κείμενο, ίριδα"), "ΑΔΙΚΟΣ, ΚΕΙΜΕΝΟ, ΙΡΙΔΑ");
    // https://bugzilla.mozilla.org/show_bug.cgi?id=307039
    // https://bug307039.bmoattachments.org/attachment.cgi?id=194893
    assert_eq!(cm.to_full_uppercase_string("Πατάτα"), "ΠΑΤΑΤΑ");
    assert_eq!(cm.to_full_uppercase_string("Αέρας, Μυστήριο, Ωραίο"), "ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ, ΩΡΑΙΟ");
    assert_eq!(cm.to_full_uppercase_string("Μαΐου, Πόρος, Ρύθμιση"), "ΜΑΪΟΥ, ΠΟΡΟΣ, ΡΥΘΜΙΣΗ");
    assert_eq!(cm.to_full_uppercase_string("ΰ, Τηρώ, Μάιος"), "Ϋ, ΤΗΡΩ, ΜΑΪΟΣ");
    assert_eq!(cm.to_full_uppercase_string("άυλος"), "ΑΫΛΟΣ");
    assert_eq!(cm.to_full_uppercase_string("ΑΫΛΟΣ"), "ΑΫΛΟΣ");
    assert_eq!(cm.to_full_uppercase_string("Άκλιτα ρήματα ή άκλιτες μετοχές"), "ΑΚΛΙΤΑ ΡΗΜΑΤΑ Ή ΑΚΛΙΤΕΣ ΜΕΤΟΧΕΣ");
    // http://www.unicode.org/udhr/d/udhr_ell_monotonic.html
    assert_eq!(cm.to_full_uppercase_string("Επειδή η αναγνώριση της αξιοπρέπειας"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ ΤΗΣ ΑΞΙΟΠΡΕΠΕΙΑΣ");
    assert_eq!(cm.to_full_uppercase_string("νομικού ή διεθνούς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ");
    // http://unicode.org/udhr/d/udhr_ell_polytonic.html
    assert_eq!(cm.to_full_uppercase_string("Ἐπειδὴ ἡ ἀναγνώριση"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ");
    assert_eq!(cm.to_full_uppercase_string("νομικοῦ ἢ διεθνοῦς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ");
    // From Google bug report
    assert_eq!(cm.to_full_uppercase_string("Νέο, Δημιουργία"), "ΝΕΟ, ΔΗΜΙΟΥΡΓΙΑ");
    // http://crbug.com/234797
    assert_eq!(cm.to_full_uppercase_string("Ελάτε να φάτε τα καλύτερα παϊδάκια!"), "ΕΛΑΤΕ ΝΑ ΦΑΤΕ ΤΑ ΚΑΛΥΤΕΡΑ ΠΑΪΔΑΚΙΑ!");
    assert_eq!(cm.to_full_uppercase_string("Μαΐου, τρόλεϊ"), "ΜΑΪΟΥ, ΤΡΟΛΕΪ");
    assert_eq!(cm.to_full_uppercase_string("Το ένα ή το άλλο."), "ΤΟ ΕΝΑ Ή ΤΟ ΑΛΛΟ.");
    // http://multilingualtypesetting.co.uk/blog/greek-typesetting-tips/
    assert_eq!(cm.to_full_uppercase_string("ρωμέικα"), "ΡΩΜΕΪΚΑ");
    assert_eq!(cm.to_full_uppercase_string("ή."), "Ή.");
}
https://github.com/unicode-org/icu4x/discussions/3688#discussioncomment-6456010
Recommended data provider type for libraries depending on ICU4X
---
I finished creating a library that uses ICU4X as its backend, while learning Rust.
For my library I used the DataProvider for as the interface to CLDR data
(currently just using icu_testdata, though seen the page to generate customised
datasets).
So now I am wondering what would be the recommended data provider to use for a
library using ICU4X as its backend?
---
If you know the data you want at build time, I suggest using a baked data provider,
otherwise use a Blob one with postcard.
You can generate data using these steps
https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/data_management.md
In the 1.3 release there will be a compiled_data feature that lets you include
data by default, kinda like testdata but intended for production.
---
compiled_data feature may just be what my library could use without the need for
users to supply data provider for my library, if I understand the intended
purpose of this upcoming feature. Where is this feature located in master,
so I may start looking at it for design purposes, while waiting for 1.3 release?
---
jlf: this answer is seriously incomprehensible!
The feature is present on all of the component crates and it exposes functions
like DateTimeFormatter::try_new() that don't have a provider argument.
https://unicode-org.github.io/icu4x/docs/icu/datetime/struct.DateTimeFormatter.html#method.try_new
The crate also does contain an unstable baked provider that users can pass in
themselves, but note that it only implements data stuff from that particular
crate and they'll need to combine it with providers from other crates if the
type they are using uses data from everywhere (like DateTimeFormat: it uses
plurals and decimal data too)
https://unicode-org.github.io/icu4x/docs/icu/datetime/provider/struct.Baked.html
---
This is a good question; what should intermediate libraries expose to their
users? I'll schedule this for a discussion at an upcoming developers call.
https://github.com/unicode-org/icu4x/issues/3709
Chinese and Dangi inconsistent with ICU implementations for extreme dates
The current implementation of the Chinese calendar, as well as the Dangi calendar
in #3694, are not consistent with ICU for all dates; based on writing a number of
manual test cases (see the aforementioned PR), this seems to only be an issue for
dates very far in the past or far in the future (ex. year -3000 ISO).
Furthermore, the ICU4X Chinese/Dangi and astronomy functions are newly-written
and have several algorithms based on the most recent edition of Calendrical
Calculations, while the existing ICU code seems to be from 2000, incorporating
algorithms from the 1997 edition of Calendrical Calculations.
---
jlf: I take note of this because it's interesting to see the differences with ICU.
Calendars
in https://github.com/unicode-org/icu4x/pull/3744#discussion_r1277062568
they reference this common lisp code
https://github.com/EdReingold/calendar-code2/blob/main/calendar.l#L2352
---
jlf: I take note of this to remember
;;;; The Functions (code, comments, and definitions) contained in this
;;;; file (the "Program") were written by Edward M. Reingold and Nachum
;;;; Dershowitz (the "Authors")
;;;; These Functions are explained in the Authors'
;;;; book, "Calendrical Calculations", 4th ed. (Cambridge University
;;;; Press, 2016)
---
https://en.wikipedia.org/wiki/Calendrical_Calculations
https://reingold.co/calendars.shtml
The resource page for the book makes all the source code for the book available for download.
https://www.cambridge.org/ch/universitypress/subjects/computer-science/computing-general-interest/calendrical-calculations-ultimate-edition-4th-edition?format=PB&isbn=9781107683167#resources
The code has been ported to Python
https://github.com/espinielli/pycalcal
https://github.com/uni-algo/uni-algo/issues/31
L with stroke letter (U+0141, U+0142) doesn't normalize.
auto const polish = std::string{"ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"};
auto norm = una::norm::to_unaccent_utf8(polish);
Everything is normalized except 'ł' and 'Ł'.
---
Strokes are not accents. As far as I know there is no data table in Unicode that
maps L with stroke to L, so no plans to implement it; you need to do it
manually if needed.
--
jlf: idem with utf8proc
"ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"~text~nfc(stripmark:)= -- T'acełnoszz ACEŁNOSZZ'
"ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"~text~nfd(stripmark:)= -- T'acełnoszz ACEŁNOSZZ'
---
https://en.wikipedia.org/wiki/%C5%81
Character                    Ł (dec  hex)   ł (dec  hex)
Unicode                      321   0141     322   0142
CP 852                       157   9D       136   88
CP 775                       173   AD       136   88
Mazovia                      156   9C       146   92
Windows-1250, ISO-8859-2     163   A3       179   B3
Windows-1257, ISO-8859-13    217   D9       249   F9
Mac Central European         252   FC       184   B8
https://github.com/unicode-org/icu4x/issues/2715
Minor and patch release policy
https://docs.google.com/document/d/1yg_2l5FFo0aAuNi4jpgcIhIYjHqJyUoJWtMduyQ0vR8/edit
ICU4X Data Versioning Design
This document has been migrated to Markdown in
https://github.com/unicode-org/icu4x/pull/2919
jlf: I don't see any markdown...
https://github.com/unicode-org/icu4x/issues/1471
Decide on data file versioning policy
jlf: For the comment of Markus Scherer
https://github.com/unicode-org/icu4x/issues/165
Data Version
jlf: maybe to read
As far as semantic versioning, I no longer give deference to it as the preferred
way to do versioning or see the topic so singularly after seeing this talk.
https://www.youtube.com/watch?v=oyLBGkS5ICk
jlf: Spec-ulation Keynote - Rich Hickey
The comments say it's good, did not watch.
DateTime
https://github.com/unicode-org/icu4x/issues/3347
DateTimeFormatter still lacks power user APIs
jlf: this ticket contains potentially interesting links:
Class hierarchy: https://github.com/unicode-org/icu4x/issues/380
Design doc: https://docs.google.com/document/d/1vJKR1s--RBmXLNIJSCtiTNPp08mab7ZwcTGxIZ9-ytI/edit#
https://github.com/unicode-org/icu4x/pull/4334#discussion_r1403198515
Add is_normalized_up_to to Normalizer
#4334
jlf remember:
the Web-exposed ICU4C-backed behavior of current String.prototype.normalize in
both SpiderMonkey and V8 retains unpaired surrogates in the normalization process
(even after the first point in the string that needs to change under normalization).
We've previously decided that ICU4X operates on the Unicode Scalar Value / Rust
char value space and, therefore, will perform replacement of unpaired surrogates
with the REPLACEMENT CHARACTER.
https://github.com/unicode-org/icu4x/issues/4365
Segmenter does not work correctly in some languages
"as `নমস্কাৰ, আপোনাৰ কি খবৰ?`"'0D'x"hi `हैलो, क्या हाल हैं?`"'0D'x"mai `नमस्ते अहाँ केना छथि?`"'0D'x"mr `नमस्कार, कसे आहात?`"'0D'x"ne `नमस्ते, कस्तो हुनुहुन्छ?`"'0D'x"or `ନମସ୍କାର ତୁମେ କେମିତି ଅଛ?`"'0D'x"sa `हे त्वं किदं असि?`"'0D'x"te `హాయ్, ఎలా ఉన్నారు?`"
icu4c: 151
rust: 161
executor: 151
---
ICU4X and ICU4C are just using different definitions of EGCs; ICU4C has had a
tailoring for years which has just been incorporated into Unicode 15.1, whereas
ICU4X implements the 15.0 version without that tailoring.
The difference is the handling of aksaras in some indic scripts:
in Unicode 15.1 (and in any recent ICU4C) क्या is one EGC, but it is two EGCs
(क्, या) in untailored Unicode 15.0 (and in ICU4X).
---
eggrobin
(For what it’s worth, क्या would be three legacy grapheme clusters, namely क्, य,
and ा, see Table 1a of UAX29, whereas it is two 15.0 extended grapheme clusters
and a single 15.1 extended grapheme cluster.)
---
Fixed by #4536
https://github.com/unicode-org/icu4x/pull/4334
is_normalized_up_to and unpaired surrogates
---
jlf: interesting discussion about the support of ill-formed strings
https://github.com/unicode-org/icu4x/pull/4389
Line breaking
---
jlf: they don't want to support a tailored line breaking, because this requires
more than one code point of lookahead.
https://github.com/unicode-org/icu4x/issues/4342
Add functions to get ICU4X, CLDR, and Unicode versions
---
jlf: strange that they did not consider that earlier...
https://github.com/unicode-org/icu4x/issues/2689
Consider exposing sort keys
---
jlf : interesting for the description of the use cases (encryption, xpath)
I created a section Xpath with their comments.
https://github.com/unicode-org/icu4x/issues/3336
Add support for Unicode BCP 47 locale identifiers
---
jlf: what is that?
it's defined in https://www.unicode.org/reports/tr35/
UNICODE LOCALE DATA MARKUP LANGUAGE (LDML)
Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among
languages, locales, regions, currencies, time zones, transforms, and so on.
https://www.rfc-editor.org/rfc/bcp/bcp47.txt
https://github.com/unicode-org/icu4x/issues/3247#issuecomment-1856577508
This month @anba landed Intl.Segmenter in Firefox based on the ICU4X Segmenter impl, reviewed by @dminor
https://phabricator.services.mozilla.com/D195803
I had been under the impression that Intl.Segmenter was not implementable without
support for random access in order to implement the containing() function.
It looks like @anba's implementation loops from the start of the string and
repeatedly calls next() until we reach the index. While this strategy gets the
job done, I'm concerned about the performance of this with large strings where
we need to reach an index deep into the string. I therefore hope that we can
continue to prioritize this issue on the basis of 402 compatibility.
---
jlf: to watch
https://github.com/unicode-org/icu4x/issues/4523
Linebreak generated before CL (Close Punctuation)
---
https://www.unicode.org/reports/tr14/#CL
UNICODE LINE BREAKING ALGORITHM
https://github.com/typst/typst/issues/3082
Chinese punctuation is placed at the beginning of the line in some cases
---
jlf: Linebreak referenced from icu4x/issues/4523
The example is wrong, a better example is provided in icu4x/issues/4523.
https://github.com/unicode-org/icu4x/pull/4389
Fix Unicode 15.0 line breaking
jlf: Linebreak
https://github.com/unicode-org/icu4x/issues/4146
icu_segmenter::LineSegmenter incorrectly applies rule LB8a
---
jlf: Linebreak, for the examples of line breaks.
https://github.com/unicode-org/icu4x/discussions/4525#discussioncomment-8155602
Mapping between browser Intl and ICU4X
jlf: I don't understand what they are talking about, but there may be some good-to-know
information in this thread. In particular this URL:
"Sensitivity" in browsers maps to a combination of strength and case level.
https://searchfox.org/mozilla-central/rev/1aa61dcd48e128a8cbfbe59b7ba43d31bd3c248a/intl/components/src/Collator.cpp#171-185
https://github.com/unicode-org/icu4x/issues/3284#issuecomment-1911226051
Should the Segmenter types accept a locale?
---
Steven Loomis:
Please put it into the API.
I was doing planning on a work item to move this forward.
This is for example languages that want to keep "ch" together etc.
---
jlf: so it appears from the discussion that ICU4C implements specific rules that
are not part of UAX #29.
---
Markus Scherer:
No language parameter for grapheme cluster segmenter
+1
Language parameter for the other three segmenters
+1
---
sffc
The conclusions from the discussion of this issue with the CLDR design group:
- Grapheme clusters should not be language-specific; baked into much low-level
processing (e.g., Swift, font mappings) which we don’t want to be language-specific
- Content locale/text language parameter (not UI locale): Potential for accuracy;
make it optional, name it well
- Ok to leave the locale on the constructor; benefit: more specific data loading
even for existing dictionaries & models
My suggested path forward for this issue, then, is to add an options bag to the
WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional
content_locale field of type &LanguageIdentifier.
---
Steven Loomis
This makes no sense and contradicts the long standing requests.
I would have joined, did not realize this was coming up today.
---
sffc
Based on additional discussion in the email thread, I would like to move forward
with the recommendation in #3284 (comment), with the additional understanding
that we may add support for locale-based grapheme segmentation in the future if
CLDR adds data for this, but it might take the form of another (fifth) segmenter type.
Concretely:
- All segmenters retain a new or try_new function without an options bag
- Word, Sentence, and Line segmenters get a try_new_with_options function that
includes a content_locale option
https://github.com/unicode-org/icu4x/issues/58
Design a cohesive solution for supported locales
https://github.com/tc39/proposal-intl-segmenter/issues/133
Custom Dictionaries
and a political point of view from a Hong Kong immigrant.
https://github.com/unicode-org/icu4x/issues/3990
Consider supporting retrieval of the language preference list from the system
---
jlf: some info and pointers, for general culture.
https://github.com/unicode-org/icu4x/issues/4705
Bridge the gap between icu::properties::Script and icu::locid::subtags::Script
---
jlf: this is about script names
---
Markus Scherer
Conversion is probably fine, but in the end they are just script codes, so
it also makes sense to define the full set once and have Unicode APIs use a
subset of the values.
The ones in the UCD are a subset of the full set.
And only the ones in the UCD have Unicode-defined long value names (identifiers).
Eggrobin
https://unicode.org/iso15924/codelists.html
https://unicode.org/iso15924/iso15924.txt
The PVA column is from https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.
Markus Scherer
Also https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
look for Type: script
which becomes this in CLDR:
https://github.com/unicode-org/cldr/blob/main/common/validity/script.xml
Note that the CLDR list includes one or more private use script subtags:
https://www.unicode.org/reports/tr35/#unicode_script_subtag_validity
https://www.unicode.org/reports/tr35/#Private_Use_Codes
Qaag is current but yucky... Don't include Qaai which has become an alias for Zinh
https://github.com/unicode-org/icu4x/issues/3014
Provide the Numeric_Value character property
ICU4X is missing an API for querying the Numeric_Value property of a character.
Markus Scherer
Note that Numeric_Value is easy when Numeric_Type=Decimal or Numeric_Type=Digit.
And maybe you need/want it only if Numeric_Type=Decimal.
When Numeric_Type=Numeric, then the Numeric_Value can be negative, huge, or a fraction.
These are rarely useful. https://www.unicode.org/reports/tr44/#Numeric_Value
I would start with an API that returns the value of a decimal digit.
Markus Scherer
Most of the nt=digit characters are not part of a contiguous 0..9 range of characters.
In particular, there is often no zero.
Some of them are simply nt=digit because their nv is 0..9 although they are
part of a larger set of "numbered list bullets" where the nv>9 numbers have nt=numeric.
In UTS46, they are variously disallowed/mapped/valid.
See https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ant%3Ddigit%3A%5D&g=uts46&i=
It makes sense to me to have an API that returns the nv of nt=decimal but
the nv of other characters is rarely useful to programmers.
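The ICU4C counterparts of the properties being discussed, for comparison:
u_getIntPropertyValue() for Numeric_Type, u_charDigitValue() for the 0..9 value of
decimal digits, u_getNumericValue() for the general Numeric_Value:

#include <unicode/uchar.h>
#include <cstdio>

static void show(UChar32 c) {
    int nt = u_getIntPropertyValue(c, UCHAR_NUMERIC_TYPE);   // U_NT_NONE/DECIMAL/DIGIT/NUMERIC
    std::printf("U+%04X  nt=%d  digit=%d  nv=%g\n",
                (unsigned)c, nt, (int)u_charDigitValue(c), u_getNumericValue(c));
}

int main() {
    show(0x0037);   // '7'  Numeric_Type=Decimal, value 7
    show(0x0665);   // ARABIC-INDIC DIGIT FIVE, Decimal, value 5
    show(0x00BD);   // VULGAR FRACTION ONE HALF, Numeric_Type=Numeric, value 0.5, digit value -1
    return 0;
}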
https://github.com/unicode-org/icu4x/issues/4771
LineBreakStrictness::Anywhere gives the wrong breakpoints for Arabic in icu_segmenter
I am aware this is probably a unicode spec issue, rather than a rust library issue, but I thought I would point it out regardless.
This is the minimal application I was using to test this behavior:
use icu_segmenter::{LineBreakOptions, LineBreakStrictness, LineSegmenter};

fn main() {
    let test = "الخيل والليل";
    let mut options = LineBreakOptions::default();
    options.strictness = LineBreakStrictness::Anywhere;
    let segmenter = LineSegmenter::new_auto_with_options(options);
    let breakpoints = segmenter.segment_str(test);
    for bp in breakpoints {
        println!("{bp}: {}", &test[bp..]);
    }
}
This gives the following output:
(jlf: bbedit doesn't handle this text well: can't indent the whole block, can't
indent a single line)
0: الخيل والليل
2: لخيل والليل
4: خيل والليل
6: يل والليل
8: ل والليل
10: والليل
11: والليل
13: الليل
15: لليل
17: ليل
19: يل
21: ل
23:
as you can tell, it is breaking after every single letter, without respect to the letters' connections. However, as I am sure you are aware, the letters' connections are not optional.
The output I expected is the following:
0: الخيل والليل
2: لخيل والليل
10: والليل
11: والليل
13: الليل
15: لليل
23:
Putting the break points across the visual boundaries of the letters. This is not the current orthodoxy, but any looser breaks than that and you'd be rendering the text illegible and unnatural.
Note: This is how old written manuscripts break their words.
---
Closed as not planned
https://github.com/unicode-org/icu4x/issues/4780
Unexpected grapheme boundary with regional indicators (GB12)
use icu::segmenter::GraphemeClusterSegmenter;
fn main() {
let segmenter = GraphemeClusterSegmenter::new();
let text = "🇺🇸🏴";
segmenter
.segment_str(text)
.for_each(|i| println!("{}", i));
}
Reports the following break points:
0
4
8
36
which means "🇺🇸" is split into two graphemes, which should be disallowed per GB12
---
This is fixed by #4536.
---
jlf: utf8proc is ok
"🇺🇸"~graphemes==
a CharacterSupplier
1 : T'🇺🇸'
"🇺🇸"~unicodecharacters==
an Array (shape [2], 2 items)
1 : ( "🇺" U+1F1FA So 1 "REGIONAL INDICATOR SYMBOL LETTER U" )
2 : ( "🇸" U+1F1F8 So 1 "REGIONAL INDICATOR SYMBOL LETTER S" )
#4536
https://github.com/unicode-org/icu4x/pull/4536
Update grapheme cluster break rules to Unicode 15.1
jlf: lot of discussions about stability that I did not try to understand.
#4859
https://github.com/unicode-org/icu4x/issues/4859
Make the normalizer work with new Unicode 16 normalization behaviors
---
jlf: they have this reference
See topic 5.1 in https://www.unicode.org/L2/L2024/24009r-utc178-properties-recs.pdf
utf8proc title
https://codeberg.org/dnkl/foot/pulls/100
Grapheme shaping using libutf8proc #100
jlf tag: character width
jlf: to read?
https://github.com/twitter/twitter-text
Twitter Text Libraries. This code is used at Twitter to tokenize and parse text
to meet the expectations for what can be used on the platform.
https://swiftpack.co/package/nysander/twitter-text
This is the Swift implementation of the twitter-text parsing library.
The library has methods to parse Tweets and calculate length, validity, parse @mentions, #hashtags, URLs, and more.
terminal / console / cmd
https://www.reddit.com/r/bash/comments/wfbf3w/determine_if_the_termconsole_supports_utf8/
Determine if the term/console supports UTF8?
https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line
jlf: with my current version of Windows (21H2 - 10.0.19044), I have the input bug described below:
In general using codepage 65001 will only work without bugs in Windows 10 with the Creators update.
In Windows 7 it will have both output and input bugs.
In Windows 8 and older versions of Windows 10 it only has the input bug, which limits input to 7-bit ASCII.
Eryk Sun Sep 9, 2017 at 13:43
jlf: the sentence above is not true, I have the input bug with my version of Windows, which is AFTER the Creators update.
http://archives.miloush.net/michkap/archive/2006/03/13/550191.html
Who broke the UTF-8 support?
by Michael S. Kaplan, published on 2006/03/13 03:21 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/13/550191.aspx
---
jlf : we are in 2022 and the UTF-8 support in cmd is still broken...
https://stackoverflow.com/questions/39736901/chcp-65001-codepage-results-in-program-termination-without-any-error
jlf : Thanks to this post, I suddenly understood why ooRexxShell no longer supports UTF-8 input.
It's because I deactivated readline on Dec 20, 2020.
When readline is on, ooRexxShell delegates to cmd to read a line:
set /p inputrx="My prompt> "
This input mode is not impacted by the UTF-8 input bug!
https://stackoverflow.com/questions/10651975/unicode-utf-8-with-git-bash
git-bash (Windows)
https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window
Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)
Describes how to set the system locale (language for non-Unicode programs) to UTF-8.
Optional reading: Why the Windows PowerShell ISE is a poor choice
---
jlf: this is a clear description of the UTF-8 input bug.
For ReadFile from the console, even in Windows 10, you'll be limited to 7-bit ASCII if the
input codepage is set to UTF-8, due to buggy assumptions in the console host, conhost.exe.
In Windows 10, it returns non-ASCII characters as null ("\0") in the buffer.
In older versions, the read succeeds with 0 bytes read, which looks like EOF.
Eryk Sun Jul 21, 2019 at 13:31
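A minimal C++ sketch (mine, not from the linked posts) of the usual workaround: read console input as UTF-16 with ReadConsoleW and convert it to UTF-8 yourself, instead of relying on ReadFile with the 65001 input code page. Requires C++17 (writable std::string::data()); the helper name is made up for the example.
#include <windows.h>
#include <string>
std::string ReadConsoleLineUtf8()   // hypothetical helper, not a Windows API
{
    HANDLE hIn = GetStdHandle(STD_INPUT_HANDLE);
    wchar_t wbuf[512];
    DWORD read = 0;
    // fails if stdin is redirected (not an interactive console)
    if (!ReadConsoleW(hIn, wbuf, 512, &read, nullptr)) return {};
    int n = WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)read, nullptr, 0, nullptr, nullptr);
    std::string out(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)read, out.data(), n, nullptr, nullptr);
    return out;   // includes the CR/LF typed by the user
}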
https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797
Displaying Unicode in Powershell
https://akr.am/blog/posts/using-utf-8-in-the-windows-terminal
Using UTF-8 in the Windows Terminal
https://github.com/microsoft/terminal
https://github.com/Microsoft/Cascadia-Code
https://github.com/PowerShell/PowerShell/issues/7233
Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms
mklement0 opened this issue on Jul 5, 2018
---
jlf: still open on 2023.08.08
https://github.com/contour-terminal/terminal-unicode-core
Unicode Core specification for Terminal (grapheme clusters, character widths, ...)
jlf: only a bare TeX file... dead? no commits in 2 years.
https://news.ycombinator.com/item?id=37804829
ZERO comments on HN
QT Title
https://bugreports.qt.io/browse/QTBUG-48726
Combining diacritics misplaced when using monospace fonts
jlf tag: character width
IBM OS
https://www.ibm.com/docs/en/personal-communications/15.0?topic=pages-contents#ToC
Host Code Page Reference
Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII"
>>There are a few layers to getting the codepages right for using a terminal
>>emulator and ISPF Edit and Browse on the host.
>>For example, in Personal Communications I first define my host codepage. I
>>have a lot of choices. From 420 (Arabic) to 1130 (Vietnamese). I tend to
>>use 1047 (U.S.) to get my square brackets right.
jlf: tables of character codes
https://www.ibm.com/docs/en/zos/3.1.0?topic=317-zos-unix-directory-list-utility-line-commands
z/OS UNIX directory list utility line commands
Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII"
>>Then on the host side. If you are using the ISPF UDLIST interface to Unix
>>(OMVS) you can use either EBCDIC, ASCII, or UTF8 for EDIT or VIEW.
Actions:
E—edit regular file
EA—edit ASCII file
EU—edit UTF-8 file
V—view regular file
VA—view ASCII file
VU—view UTF8 file
https://www.ibm.com/docs/en/zos/3.1.0?topic=information-pdf-browse-primary-commands
PDF Browse primary commands
Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII"
>>In ISPF Browse, you can use the DISPLAY command to view data as UTF8,
>>UTF32, UCS2, UNICODE, ASCII, USASCII, and EBCDIC, or specify the numeric
>>CCSID.
Syntax diagram DISPLAY CCSIDccsid_number
ASCII
USASCII
EBCDIC
UCS2
UTF8
UTF16
UTF32
Syntax diagram FIND
UTF8
ASCII
USASCII
https://www.ibm.com/docs/en/zos/3.1.0?topic=sequences-ebcdic
Table 1. EBCDIC Collating Sequence
Table 1 shows the collating sequence for EBCDIC character and unsigned decimal data.
The collating sequence ranges from low (00000000) to high (11111111).
The bit configurations which do not correspond to symbols (that is, 0 through 73,
81 through 89, and so forth) are not shown. Some of these correspond to control
commands for the printer and other devices.
ALTSEQ, CHALT, and LOCALE can be used to select alternate collating sequences
for character data.
Packed decimal, zoned decimal, fixed-point, and normalized floating-point data
are collated algebraically, that is, each quantity is interpreted as having a sign.
IBM RPG Lang
https://www.ibm.com/docs/en/i/7.4?topic=cdt-processing-string-data-by-natural-size-each-character
Processing string data by the natural size of each character
String data can have characters of different sizes.
- UTF-8 data can have characters with 1, 2, 3, or 4 bytes.
For example, the character 'a' has one byte, and the character 'á' has two bytes.
UTF-8 data is defined as alphanumeric with CCSID(*UTF8) or CCSID(1208).
- UTF-16 data can have characters with 2 or 4 bytes.
UTF-16 data is defined as UCS-2 with CCSID(*UTF16) or CCSID(1200).
- EBCDIC mixed SBCS/DBCS data can have characters with 1 or 2 bytes.
Additionally, double-byte data is surrounded by shift bytes.
The shift-out byte x'0E' begins a section of DBCS data and the shift-in
byte x'0F' ends the section of DBCS data.
- ASCII mixed SBCS/DBCS data can have characters with 1 or 2 bytes.
ASCII mixed SBCS/DBCS data is defined as alphanumeric with a CCSID that
represents mixed SBCS/DBCS data such as 950.
Default behaviour, CHARCOUNT STDCHARSIZE
By default, data is processed using the standard-character-size mode.
The compiler processes string data by bytes or double bytes without regard
for size of each character.
When CHARCOUNT NATURAL is in effect:
The compiler processes string operations by the natural size of each character.
The compiler sets the CHARCOUNT NATURAL mode for a file if the CHARCOUNT is
not specified for the file.
The CHARCOUNT mode for the file affects the movement of data from RPG fields
to the output buffer and key buffer used for the file operations.
https://www.ibm.com/docs/en/i/7.4?topic=fdk-charcountnatural-stdcharsize
CHARCOUNT(*NATURAL | *STDCHARSIZE)
The CHARCOUNT keyword controls how RPG handles string truncation when moving
data from RPG program variables to the output buffer and key buffer for the file.
*NATURAL
If the data type of the field in the output buffer or key buffer is relevant
according to the CHARCOUNTTYPES Control keyword, any necessary truncation
when data is moved is done according to the CHARCOUNT NATURAL mode for assignment.
*STDCHARSIZE
Any necessary truncation when data is moved is done by bytes or double bytes,
without regard for the size of each character.
When the CHARCOUNT keyword is not specified, the current CHARCOUNT setting
is used for the file, as determined by the CHARCOUNT Control keyword or the
most recent /CHARCOUNT directive preceding the definition for the file.
https://www.ibm.com/docs/en/i/7.4?topic=keywords-charcounttypesutf8-utf16-jobrun-mixedebcdic-mixedascii
CHARCOUNTTYPES(*UTF8 *UTF16 *JOBRUN *MIXEDEBCDIC *MIXEDASCII)
The Control keyword CHARCOUNTTYPES specifies the types of data that are
processed by characters rather than by bytes or double bytes when
CHARCOUNT NATURAL mode is in effect.
*UTF8
Specify *UTF8 if your module might work with UTF-8 data which has characters
of different lengths. For example, the UTF-8 character 'a' has one byte, and
the UTF-8 character 'á' has two bytes.
*UTF16
Specify *UTF16 if your module might work with UTF-16 data which has some
4-byte characters.
*JOBRUN
Specify *JOBRUN if your job CCSID might support mixed SBCS and DBCS data,
and the RPG variables in your module defined to have the job CCSID might
contain some DBCS data.
*MIXEDEBCDIC
Specify *MIXEDEBCDIC if your module might work with EBCDIC data which
supports both SBCS and DBCS characters. This includes data defined with
CCSID(*JOBRUNMIX) and data defined with a mixed SBCS/DBCS CCSID such as 937.
*MIXEDASCII
Specify *MIXEDASCII if your module might work with ASCII data which supports
both SBCS and DBCS characters.
IBM z/OS
https://www.ibm.com/docs/en/zos/2.5.0?topic=mvs-zos-unicode-services-users-guide-reference
Unicode services
https://www.ibm.com/docs/en/zos/2.5.0?topic=reference-application-programmer-information
Character conversion
Case conversion
Normalization
Collation
Bidi transformation
Stringprep conversion
---
jlf:
There is this note at the beginning of the page "Bidi transformation":
"IBM does not intend to enhance the bidi transformation service. Instead, it is
recommended that you use the character conversion 'extended bidi support' for all
new development and for the highest level of bidi support."
Can't find where this 'extended bidi support' is described.
https://www-40.ibm.com/servers/resourcelink/svc00100.nsf/pages/zOSV2R5IndexFile/$file/index.html
search Ctrl+F "unicode": only one result:
cunu100_v2r5.pdf SA38-0680-50 z/OS Unicode Services User's Guide and Reference
https://www.ibm.com/docs/en/zos/2.5.0
Search "Unicode" in z/OS 2.5 documentation:
https://www.ibm.com/docs/en/search/unicode?scope=SSLTBW_2.5.0
jlf: not sure it's very interesting... All the links are just single pages with little information.
https://listserv.ua.edu/cgi-bin/wa?A2=IBM-MAIN;5304fbc3.2304&S=
Re: TSO Rexx C2X Incorrect Output
Events such as this affirm my belief in minimal munging of user data by default.
jlf: this sentence is worth remembering when designing how Unicode should be supported
by Rexx...
https://stackoverflow.com/questions/76569347/what-are-the-supported-code-points-for-special-characters-for-valid-z-os-datas
What are the supported code points for 'special characters' for valid z/OS datasets?
jlf: the link above was given in this IBM-MAIN thread
https://listserv.ua.edu/cgi-bin/wa?A2=ind2307&L=IBM-MAIN&D=0&P=121856
---
Matt Hogstrom:
I did some testing by creating a file in USS in CP047 with the characters “@#$”
and then used iconv to convert them to a variety of code pages and compare the
results. Some conversions failed but when looking at the code pages that failed
they didn’t appear to me to be what I would consider mainstream. For the ones
I’m familiar with they all converted correctly.
The command was
'iconv -f 1047 -t 37 special > converted;chtag -t -c 37 converted;cmp special converted’
I changed to the encoding of 37 to other code pages and most worked fine.
You can find the list of cps supported by issuing 'iconv -l’ and there are a lot
of them.
https://listserv.ua.edu/cgi-bin/wa?A2=ind2307&L=IBM-MAIN&D=0&P=183611
Python 3.11 on z/OS - UTF-8 errors
---
I am trying to get a python package (psutil) to run on z/OS.
I downloaded the package from github and then tar'ed it and uploaded it binary
to my home-dir in OMVS.
In my homedir I untar'ed to files and ran the command "chtag -tc IBM-1047 *' to
set the files to UTF-8.
I got make to work by converting the tab char to x'05' - no problem - and I got
the C compiler to work also.
Now my problem is that I can not make Python compile the setup.py file.
It dies with a UTF-error on a char x'97' in statement 48 pos 2:
from _common import AIX # NOQA
---
It's this package
https://github.com/giampaolo/psutil/blob/master/INSTALL.rst
---
I believe UTF-8 is IBM-1208.
---
Have you tried the z/OS Open Tools phytonport - https://github.com/ZOSOpenTools
---
Have you considered cloning the repository and utilizing Git's file
tagging feature? It can handle the tagging process for you. If you don't
have internet access, a suggestion would be to tag all the files as
ISO8859-1. It's advisable to avoid using UTF-8, as it may cause issues
with some ported tools that will not work. That includes the majority of
Rocket ported tools. If you list the IBM Python runtime library you will
notice that all source files are tagged "iso8859-1" even though Python
mandates UTF-8.
---
I'm doing this on the company sandbox so I can not make a git clone.
And trying 8859-1 (cp 819) does not change anything:
/home/bc6608/psutil:chtag -p setup.py
t ISO8859-1 T=on setup.py
PYTHONWARNINGS=all python3 setup.py build_ext -i `python3 -c "import sys, os; py36 = sys.version_info[:2] >= (3, 6); cpus = os.cpu_count() or 1 if py36 else 1; print('--parallel %s' % cpus if cpus > 1 else '')"`
Traceback (most recent call last):
File "/home/bc6608/psutil/setup.py", line 47, in <module>
from _common import AIX # NOQA
^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 2: invalid start byte
---
Found the error.
The error was not the codepage of the setup.py, but the codepage of the imported file _common .
Once it got chtag -tc 1047 _common.py I got further.
---
I can’t recreate your problem but I used a different method. I downloaded a zip file from Github, uploaded it to z/OS and followed these steps:
jar xf psutill-master.zip
cd psutil-master
chtag -R -tc iso8859-1 .
python3 setup.py
---
A quick question -
Will the same chtag command work for, say, Java packages/projects?
Answer: yes
Or, would I have to use chtag -R -tc UTF-8 if a project expects to things to be in UTF8?
Answer:
I'd like to understand your reasons for wanting to encode your Java source
files in UTF-8. It's important to note that the default encoding on z/OS is
IBM-1047 (EBCDIC). We typically use ISO8859-1 and have to specify the
"-encoding iso8859-1" option when using the javac compiler. As mentioned
earlier, tagging files as UTF-8 can lead to unexpected issues, which is why
it's not commonly done. If you examine the file attributes of modern
languages like Python, Node.js, Go, etc., you'll notice that their source
files are tagged as ISO8859-1.
A while ago, one of our ported tools developers provided me with a detailed
explanation regarding the challenges associated with UTF-8 for ported tools.
Although I don't recall all the specifics, it had something to do with
double conversions. Therefore, the general rule of thumb is to avoid using
UTF-8 unless it is necessary, such as when embedding a YAML document into a
Java JAR file.
---
We specify <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in
our Maven builds as most of the time we are building off host on machines with
UTF8 locales. However, we tag our files ISO8859-1 on z/OS other then some YAML
docs that must be tagged UTF-8 or else SnakeYaml barfs when reading it from the
class path which doesn’t support tags :). The server runs with file.encoding=ISO8859-1
as well. If we cared about the euro sign we could change it to ISO8859-15 which
is still an 8-bit character set. It’s those pesky codes above 0x7F in UTF-8 that
cause the issues.
https://www.ibm.com/support/pages/system/files/inline-files/Managing%20the%20code%20page%20conversion%20when%20migrating%20zOS%20source%20files%20to%20Git%20-%201.0.pdf
(PDF)
Managing the code page conversion when migrating z/OS source files to Git
---
Git has proven to be the de-facto standard in the Open Source world, and the
z/OS platform can interact with Git through the z/OS Git client, which is
maintained by Rocket Software in its “Open Source Languages and Tools for z/OS”
package.
https://www.ibm.com/docs/en/zos/2.2.0?topic=server-different-end-line-characters-in-text-files
Different end of line characters in text files
---
In general, z/OS UNIX text files contain a newline character at the end of each
line. In ASCII, newline is X'0A'. In EBCDIC, newline is X'15'. (For example,
ASCII code page ISO8859-1 and EBCDIC code page IBM-1047 translate back and forth
between these characters.) Windows programs normally use a carriage return
followed by a line feed character at the end of each line of a text file. In
ASCII, carriage return/line feed is X'0D'/X'0A'. In EBCDIC, carriage return/line
feed is X'0D'/X'15'. The tr command shown in the preceding example deletes all
of the carriage return characters. (Line feed and newline characters have the
same hexadecimal value.) The SMB server can translate end of line characters
from ASCII to EBCDIC and back but it does not change the type of delimiter (PC
versus z/OS UNIX) nor the number of characters in the file.
https://www.ibm.com/docs/en/zos/2.5.0?topic=options-record-format-recfm
Record Format (RECFM)
RECFM specifies the characteristics of the records in the data set as fixed-length (F),
variable-length (V), ASCII variable-length (D), or undefined-length (U). Blocked
records are specified as FB, VB, or DB. Spanned records are specified as VS, VBS,
DS, or DBS. You can also specify the records as fixed-length standard by using
FS or FBS. You can request track overflow for records other than standard format
by adding a T to the RECFM parameter (for example, by coding FBT). Track overflow
is ignored for PDSEs.
The type of print control can be specified to be in ANSI format-A, or in machine code
format-M. See
Using Optional Control Characters (https://www.ibm.com/docs/en/SSLTBW_2.5.0/com.ibm.zos.v2r5.idad400/occ.htm#occ)
and z/OS DFSMS Macro Instructions for Data Sets (https://www.ibm.com/docs/en/SSLTBW_2.5.0/com.ibm.zos.v2r5.idad500/abstract.htm)
for information about control characters.
https://docs.tibco.com/pub/mftps-zos/8.0.0/doc/html/GUID-A0CF702B-C126-43BE-86B2-8DF589FAD6BF.html
TIBCO® Managed File Transfer Platform Server for z/OS
RECFM={ F | FB | V | VB | U | VS | VBS}
Default=V
This parameter defines the significance of the character logical record length
(semantics of LRECL boundaries). You can specify fixed, variable, or system default
The valid values are as follows:
- F: each string contains exactly the number of characters defined by the string length parameter.
- FB: all blocks and all logical record are fixed in size.
One or more logical records reside in each block.
- V: the length of each string is less than or equal to the string length parameter.
- VB: blocks as well as logical record length can be of any size.
One or more logical records reside in each block.
- U: blocks are of variable size. No logical records are used.
The logical record length is displayed as zero.
This record format is usually only used in load libraries.
Block size must be used if you are specifying U.
- VS: records are variable and can span logical blocks.
RECFM=VS is not supported when checkpoint restart is used.
- VBS: blocks as well as logical record length can be of any size.
One or more logical records reside in each block.
Records are variable and can span logical blocks.
RECFM=VBS is not supported when checkpoint restart is used.
macOS OS
you can enter emoji (and other Unicode characters) using standard operating
system tools—like ctrl cmd space.
https://eclecticlight.co/2021/05/08/explainer-unicode-normalization-and-apfs/
Explainer: Unicode, normalization and APFS
hoakley May 8, 2021
---
One of the oldest problems with Apple’s APFS file system is how it encodes file
and directory names using Unicode.
Windows OS
https://learn.microsoft.com/en-us/windows/win32/intl/international-support
jlf: I'm searching for which functionalities are available only to Unicode apps...
- can be multilingual without managing code pages
- IME? not sure if it's only for unicode apps
- other?
https://stackoverflow.com/questions/59404120/what-is-the-difference-in-using-cstringw-cstringa-and-ct2w-ct2a-to-convert-strin
What is the difference in using CStringW/CStringA and CT2W/CT2A to convert strings?
CString offers a number of conversion constructors to convert between ANSI and
Unicode encoding. They are as convenient as they are dangerous, often masking bugs.
By contrast, the Cs2d macros (where s = source, d = destination) work on raw
C-style strings; no CString instances are created in the process of converting
between character encodings.
Both of the above perform a conversion with an implied ANSI codepage (either
CP_THREAD_ACP or CP_ACP in case the _CONVERSION_DONT_USE_THREAD_LOCALE
preprocessor symbol is defined). CP_ACP is particularly troublesome, as it's a
process-global setting, that any thread can change at any time.
Which one should you choose for your conversions? Neither of the above. Use the
EX versions instead (see string and text classes for a full list).
https://learn.microsoft.com/en-us/cpp/atl/string-and-text-classes?view=msvc-170
String and Text Classes
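A minimal sketch (my example, not from the Microsoft page) of using the EX conversion classes with an explicit code page instead of the implied ANSI code page; CA2WEX / CW2AEX and their buffer-size template parameter come from ATL (atlconv.h).
#include <atlbase.h>
#include <atlconv.h>
#include <string>
std::wstring Utf8ToWide(const char* utf8)
{
    // 128 is the size of the stack buffer tried before falling back to the heap
    return std::wstring(CA2WEX<128>(utf8, CP_UTF8));
}
std::string WideToUtf8(const wchar_t* wide)
{
    return std::string(CW2AEX<128>(wide, CP_UTF8));
}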
https://stackoverflow.com/questions/15362859/getclipboarddata-cf-unicodetext
GetClipboardData (CF_UNICODETEXT)
https://jerrington.me/posts/2015-12-31-windows-debugging-for-fun-and-profit.html
jlf: I reference this page for the code related to the clipboard. Search for "locale".
https://learn.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats
Standard Clipboard Formats
CF_LOCALE
Locale identifier (LCID) associated with text in the clipboard.
The system uses the code page associated with CF_LOCALE to implicitly
convert from CF_TEXT to CF_UNICODETEXT.
CF_TEXT
Text format. Each line ends with a carriage return/linefeed (CR-LF) combination.
A null character signals the end of the data. Use this format for ANSI text.
CF_UNICODETEXT
Unicode text format. Each line ends with a carriage return/linefeed (CR-LF)
combination. A null character signals the end of the data.
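A minimal sketch (my code, not from the linked pages) of reading CF_UNICODETEXT; the data is NUL-terminated UTF-16 with CR-LF line ends, as described above.
#include <windows.h>
#include <string>
std::wstring GetClipboardTextW(HWND owner)
{
    std::wstring text;
    if (OpenClipboard(owner)) {
        if (HANDLE h = GetClipboardData(CF_UNICODETEXT)) {
            if (const wchar_t* p = static_cast<const wchar_t*>(GlobalLock(h))) {
                text = p;          // copies up to the terminating NUL
                GlobalUnlock(h);
            }
        }
        CloseClipboard();
    }
    return text;
}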
Locale
https://learn.microsoft.com/en-us/windows/win32/intl/language-identifiers
A language identifier is a standard international numeric abbreviation
for the language in a country or geographical region. Each language has
a unique language identifier (data type LANGID), a 16-bit value that
consists of a primary language identifier and a sublanguage identifier.
+-------------------------+-------------------------+
| SubLanguage ID | Primary Language ID |
+-------------------------+-------------------------+
15 10 9 0 bit
https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers
A sort order identifier is defined in the form "_sortorder", at the end
of the locale name used in the identifier, for example, "de-DE_phoneb",
where "phoneb" is the sort order.
The corresponding locale identifier is created as follows:
MAKELCID(MAKELANGID(LANG_GERMAN, SUBLANG_GERMAN), SORT_GERMAN_PHONE_BOOK).
https://learn.microsoft.com/en-us/windows/win32/intl/locale-identifiers
Each locale has a unique identifier, a 32-bit value that consists of a
language identifier and a sort order identifier.
+-------------+---------+-------------------------+
| Reserved | Sort ID | Language ID |
+-------------+---------+-------------------------+
31 20 19 16 15 0 bit
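A minimal sketch (my code) showing how the Win32 macros compose and decompose these identifiers; all names used below (MAKELANGID, MAKELCID, PRIMARYLANGID, SUBLANGID, LANGIDFROMLCID, SORTIDFROMLCID) are standard Windows SDK macros.
#include <windows.h>
#include <cstdio>
int main()
{
    LANGID lang = MAKELANGID(LANG_GERMAN, SUBLANG_GERMAN);
    LCID lcid = MAKELCID(lang, SORT_GERMAN_PHONE_BOOK);   // de-DE_phoneb
    std::printf("primary=0x%03x sub=0x%02x sort=0x%x\n",
                (unsigned)PRIMARYLANGID(LANGIDFROMLCID(lcid)),
                (unsigned)SUBLANGID(LANGIDFROMLCID(lcid)),
                (unsigned)SORTIDFROMLCID(lcid));
    return 0;
}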
https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuilanguageinfo
https://learn.microsoft.com/en-us/previous-versions/windows/embedded/ms930130(v=msdn.10)?redirectedfrom=MSDN
Locale Code Table
jlf: obsolete, but for the moment I don't have better.
Correspondence: Locale identifier (LCID) <--> Default code page
---
LCID Code page Language: sublanguage
0x0436 1252 Afrikaans: South Africa
0x041c 1250 Albanian: Albania
0x1401 1256 Arabic: Algeria
0x3c01 1256 Arabic: Bahrain
etc...
https://devblogs.microsoft.com/oldnewthing/20161007-00/?p=94475
How can I get the default code page for a locale?
UINT GetAnsiCodePageForLocale(LCID lcid)
{
    UINT acp;
    int sizeInChars = sizeof(acp) / sizeof(TCHAR);
    if (GetLocaleInfo(lcid,
                      LOCALE_IDEFAULTANSICODEPAGE |
                      LOCALE_RETURN_NUMBER,
                      reinterpret_cast<LPTSTR>(&acp),
                      sizeInChars) != sizeInChars) {
        // Oops - something went wrong
    }
    return acp;
}
https://www.w3.org/TR/ltli/#dfn-locale-neutral
Locale neutral
jlf: I don't understand a thing
Locale-neutral. A non-linguistic field is said to be locale-neutral when it is
stored or exchanged in a format that is not specifically appropriate for any
given language, locale, or culture and which can be interpreted unambiguously
for presentation in a locale aware way.
Many specifications use a serialization scheme, such as those provided by
[XMLSCHEMA11-2] or [JSON-LD], to provide a locale neutral encoding of
non-linguistic fields in document formats or protocols.
A locale-neutral representation might itself be linked to a specific cultural
preference, but such linkages should be minimized.
http://archives.miloush.net/michkap/archive/2005/04/18/409095.html
A few of the gotchas of WideCharToMultiByte
by Michael S. Kaplan, published on 2005/04/18 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/18/409095.aspx
http://archives.miloush.net/michkap/archive/2005/04/19/409566.html
A few of the gotchas of MultiByteToWideChar
by Michael S. Kaplan, published on 2005/04/19 04:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/19/409566.aspx
---
jlf: I reached this page because the flag MB_COMPOSITE is not working!
This page provides the answer: the Microsoft doc has this note
Note For UTF-8 or code page 54936 (GB18030, starting with Windows Vista),
dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function
fails with ERROR_INVALID_FLAGS.
Uh?
http://archives.miloush.net/michkap/archive/2005/02/26/381020.html
What the &%#$ does MB_USEGLYPHCHARS do?
by Michael S. Kaplan, published on 2005/02/26 15:26 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/26/381020.aspx
https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
Use UTF-8 code pages in Windows apps
https://mastodon.gamedev.place/@AshleyGullen/111109299141510319
what it takes to pass a file path to a Windows API in C++
https://github.com/neacsum/utf8
This library simplifies usage of UTF-8 encoded strings under Win32
Related articles:
https://www.codeproject.com//Articles/5252037/Doing-UTF-8-in-Windows
https://www.codeproject.com/Articles/5259868/Doing-UTF-8-in-Windows-Part-2-Tolower-or-Not-to-Lo
https://www.codeproject.com/Tips/5263944/UTF-8-in-Windows-INI-Files
---
Reddit review: https://www.reddit.com/r/cpp/comments/174ee8q/doing_utf8_in_windows/
---
This article about UTF-8 in Windows that does not discuss how to use a manifest
to get UTF-8 process ANSI codepage, directs people back to the 1990's.
Or pre-2019, at any rate.
---
Something else to note, if you're in the habit of keeping UTF-8 strings in
`std::string`, is that the Visual C++ version of `std::filesystem::path`
initialized from a `std::string` will use the default codepage for the process
to convert the path to UTF-16. That will result in interesting failures on
systems whose default codepage is MBCS. All without a single Windows API to be
seen in your source.
The solution to this is to upgrade to C++20 and use `std::u8string`, or to keep
filenames in `std::wstring` if you don't want to deal with the odd and
occasionally surprising limitations of `std::u8string`.
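A minimal sketch (my code, assuming MSVC) of the pitfall described above: a path built from a narrow std::string goes through the process code page, while std::u8string (C++20) or std::wstring keeps the conversion well defined.
#include <filesystem>
#include <string>
int main()
{
    std::string bytes = "No\xC3\xABl.txt";                    // UTF-8 bytes in a plain std::string
    std::filesystem::path risky(bytes);                       // decoded with the process ANSI code page on MSVC
    std::filesystem::path ok1(std::u8string(u8"Noël.txt"));   // char8_t (C++20): always decoded as UTF-8
    std::filesystem::path ok2(L"Noël.txt");                   // wide string: native UTF-16 on Windows
    (void)risky; (void)ok1; (void)ok2;
    return 0;
}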
https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activeCodePage
Application manifests - activeCodePage
---
On Windows 10, this element forces a process to use UTF-8 as the process code page.
On Windows 10, the only valid value for activeCodePage is UTF-8.
Starting in Windows 11, this element also allows selection of either the legacy
non-UTF-8 code page, or code pages for a specific locale for legacy application
compatibility. Modern applications are strongly encouraged to use Unicode.
On Windows 11, activeCodePage may also be set to the value Legacy or a locale
name such as en-US or ja-JP.
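A minimal sketch (my code) to check at run time whether the manifest entry took effect: with UTF-8 forced, GetACP() returns 65001 (CP_UTF8).
#include <windows.h>
#include <cstdio>
int main()
{
    UINT acp = GetACP();
    std::printf("ACP = %u (%s)\n", acp, acp == CP_UTF8 ? "UTF-8" : "legacy code page");
    return 0;
}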
https://devblogs.microsoft.com/oldnewthing/20210527-00/?p=105255
How can I convert between IANA time zones and Windows registry-based time zones?
A copy of ICU has been included with Windows since Windows 10 Version 1703 (build 15063).
All you have to do is include icu.h, and you’re off to the races.
An advantage of using the version that comes with Windows is that it is actively
maintained and updated by the Windows team. If you need to run on older systems,
you can build your own copy from their fork of the ICU repo,
https://github.com/microsoft/icu
but the job of servicing the project is now on you.
Language comparison
https://blog.kdheepak.com/my-unicode-cheat-sheet
Vim, Python, Julia and Rust.
Regular expressions
https://regex101.com/
Testing a regular expression.
There is even a debugger!
https://www.regular-expressions.info/unicode.html
\X matches a grapheme
https://www.regular-expressions.info/posixbrackets.html
POSIX Bracket Expressions
jlf: see the table in the section Character Classes
https://pypi.org/project/regex/
>>> a = "बिक्रम मेरो नाम हो"
>>> regex.findall(r'\X', a)
['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
---
https://regex101.com/r/eD0eZ9/1
---
jlf: the results above are correct extended grapheme clusters, but tailored
grapheme clusters will group 'क्' 'र' in one cluster क्र
https://blog.burntsushi.net/ripgrep/
ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}
search for "unicode" and read...
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions
Character classes in regular expressions
https://github.com/micromatch/posix-character-classes
POSIX character classes for creating regular expressions.
jlf: careful, not official. Looks similar to the table at
https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
POSIX class Equivalent to Matches
[:alnum:] [A-Za-z0-9] digits, uppercase and lowercase letters
[:alpha:] [A-Za-z] upper- and lowercase letters
[:ascii:] [\x00-\x7F] ASCII characters
[:blank:] [ \t] space and TAB characters only
[:cntrl:] [\x00-\x1F\x7F] Control characters
[:digit:] [0-9] digits
[:graph:] [^ [:cntrl:]] graphic characters (all characters which have graphic representation)
[:lower:] [a-z] lowercase letters
[:print:] [[:graph:] ] graphic characters and space
[:punct:] [-!"#$%&'()*+,./:;<=>?@[]^_`{|}~] all punctuation characters (all graphic characters except letters and digits)
[:space:] [ \t\n\r\f\v] all blank (whitespace) characters, including spaces, tabs, new lines, carriage returns, form feeds, and vertical tabs
[:upper:] [A-Z] uppercase letters
[:word:] [A-Za-z0-9_] word characters
[:xdigit:] [0-9A-Fa-f] hexadecimal digits
https://unicode-org.github.io/icu/userguide/icu/posix.html
C/POSIX Migration
Character classes, point 7:
For more about the problems with POSIX character classes in a Unicode context
see Annex C: Compatibility Properties in Unicode Technical Standard #18: Unicode Regular Expressions
http://www.unicode.org/reports/tr18/#Compatibility_Properties
and see the mailing list archives for the unicode list (on unicode.org).
See also the ICU design document about C/POSIX character classes
https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/main/design/posix_classes.html
https://stackoverflow.com/questions/50570322/regex-pattern-matching-in-right-to-left-languages
Regex pattern matching in right-to-left languages
---
jlf: only one answer. Why control characters?
What I understand is that the bytes are in the spelling order of the characters.
The "/"
ooRexx returns the same sequence of bytes under macOS.
---
/Store/عرمنتجات/عرع
2F53746F72652F D8B9D8B1D985D986D8AAD8ACD8A7D8AA 2F D8B9D8B1D8B9
|--------------| |--------------------------------| |--| |------------|
"/Store/" عرمنتجات / i عرع
/Store/عرع/عرمنتجات
2F53746F72652F D8B9D8B1D8B9 2F D8B9D8B1D985D986D8AAD8ACD8A7D8AA
|--------------| |------------| |--| |--------------------------------|
"/Store/" عرع / i عرمنتجات
/Store/عرمنتجات/whatever
2F53746F72652F D8B9D8B1D985D986D8AAD8ACD8A7D8AA 2F 7768617465766572
|------------| |------------------------------| |--| |--------------|
"/Store/" عرمنتجات / whatever
https://stackoverflow.com/questions/20641297/unicode-characters-in-regex
Unicode characters in Regex
Test cases, test-cases, tests files
https://github.com/lemire/unicode_lipsum
font bold, italic, strikethrough, underline, backwards, upside down
I remember seeing an open-source implementation, but forgot to note it.
The URLs below do not provide a link to an open-source implementation; to
remove sooner or later.
https://convertcase.net/unicode-text-converter/
https://yaytext.com/
https://capitalizemytitle.com/
https://capitalizemytitle.com/fancy-text-generator/
http://slothsoft.net/UnicodeMapper/
https://www.fontgenerator.org/
https://peterwunder.de/projects/prettify/
https://texteditor.com/
https://gwern.net/utext
https://news.ycombinator.com/item?id=38016735
Utext: Rich Unicode Documents (gwern.net)
An esoteric document proposal: abuse Unicode to create the fanciest possible ‘plain text’ documents.
https://fonts.google.com/noto
https://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0121.html
Encoding italic (was: A last missing link)
youtube
https://www.youtube.com/playlist?list=PLMc927ywQmTNQrscw7yvaJbAbMJDIjeBh
Videos from Unicode's Overview of Internationalization and Unicode Projects
xxx lang
https://rosettacode.org/wiki/Unicode_strings
https://langdev.stackexchange.com/questions/1493/how-have-modern-language-designs-dealt-with-unicode-strings
How have modern language designs dealt with Unicode strings?
Asked 2023-06-13
Answer for
- Swift
- Rust
- Python 3
- Treat it as a (mostly) library issue
jlf: the Swift part is interesting, the rest is meh.
In order to speed up repeated accesses to utf16, UTF-8 strings may put a breadcrumbs pointer after the null terminator:
https://github.com/apple/swift/blob/1532fb188c55f29de7bf8aaee94896557b3a3db1/stdlib/public/core/StringStorage.swift#L157
The breadcrumbs are a list of the UTF-8 offsets of every 64th UTF-16 code unit:
https://github.com/apple/swift/blob/483087a47dfb56e78fcc20ef2b43085ebfb48ea0/stdlib/public/core/StringBreadcrumbs.swift
A string stores whether it has breadcrumbs in an unused bit in its capacity field:
https://github.com/apple/swift/blob/1532fb188c55f29de7bf8aaee94896557b3a3db1/stdlib/public/core/StringStorage.swift#L45
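A rough C++ sketch of the breadcrumbs idea as I read the description above (not Swift's actual code): keep one crumb for roughly every 64th UTF-16 code unit, so mapping a UTF-16 index to a UTF-8 offset only rescans at most about 64 code units from the nearest crumb. The real implementation stores the crumbs lazily after the NUL terminator and records exact code-unit offsets; this sketch ignores that.
#include <cstddef>
#include <string>
#include <vector>
struct Breadcrumbs {
    static constexpr std::size_t stride = 64;
    struct Crumb { std::size_t utf16Index, utf8Offset; };
    std::vector<Crumb> crumbs;                    // crumbs[k] is at or just before UTF-16 index k*stride
    const std::string& utf8;
    static std::size_t scalarLen(unsigned char b) // bytes in the UTF-8 sequence starting with b
    { return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4; }
    explicit Breadcrumbs(const std::string& s) : utf8(s)
    {
        std::size_t u16 = 0;
        for (std::size_t i = 0; i < s.size(); ) {
            std::size_t len = scalarLen(s[i]);
            std::size_t units = (len == 4) ? 2 : 1;      // supplementary scalars use a surrogate pair
            if (crumbs.size() * stride < u16 + units)    // this scalar contains code unit k*stride
                crumbs.push_back({u16, i});
            u16 += units;
            i += len;
        }
    }
    // UTF-8 offset of the scalar containing UTF-16 code unit n (n assumed in range)
    std::size_t utf8OffsetOf(std::size_t n) const
    {
        Crumb c = crumbs[n / stride];
        std::size_t u16 = c.utf16Index, i = c.utf8Offset;
        for (;;) {
            std::size_t len = scalarLen(utf8[i]);
            std::size_t units = (len == 4) ? 2 : 1;
            if (n < u16 + units) return i;
            u16 += units;
            i += len;
        }
    }
};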
http://xahlee.info/comp/unicode_essays_index.html
Unicode for Programers
jlf: this page contains several URLs for programming languages. Short articles,
but there is maybe something to learn. [later] After review, not so many things
to learn, the articles are very very short...
Ada lang
https://docs.adacore.com/live/wave/xmlada/html/xmlada_ug/unicode.html
http://www.dmitry-kazakov.de/ada/strings_edit.htm
UXStrings Ada Unicode Extended Strings
https://www.reddit.com/r/ada/comments/t4hpip/ann_uxstrings_package_available_uxs_20220226/
https://github.com/Blady-Com/UXStrings
---
2023.10.14 https://groups.google.com/g/comp.lang.ada/c/rWqDxiOwa1g
[ANN] Release of UXStrings 0.6.0
- Add string convenient subprograms [2]: Contains, Ends_With,Starts_With,
[2] https://github.com/Blady-Com/UXStrings/blob/master/src/uxstrings3.ads#L346
jlf: see https://github.com/Blady-Com/UXStrings/blob/master/src/uxstrings3.adb
After a quick look, I still don't know which kind of position is managed.
There is a parameter Case_Sensitivity, but I never see it used with a position
(that's the tricky part)
https://github.com/AdaForge/Thematics/wiki/Unicode-and-String-manipulations
Unicode and String manipulations in UTF-8, UTF-16, ...
https://stackoverflow.com/questions/48829940/utf-8-on-windows-with-ada
UTF-8 on Windows with Ada
https://github.com/AdaCore/VSS/
High level string and text processing library
https://blog.adacore.com/vss-cursors-iterators-and-markers
VSS (Virtual String Subsystem): Cursors, Iterators and Markers
jlf: meh...
Awk lang
Brian Kernighan adds Unicode support to Awk
https://github.com/onetrueawk/awk/commit/9ebe940cf3c652b0e373634d2aa4a00b8395b636
https://github.com/onetrueawk/awk/tree/unicode-support
https://news.ycombinator.com/item?id=32534173
C++ lang, cpp lang, Boost
https://en.cppreference.com/w/cpp/language/string_literal
String literal
(referenced by Adrian)
Some examples:
https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp
https://www.youtube.com/watch?v=iQWtiYNK3kQ
A Crash Course in Unicode for C++ Developers - Steve Downey - [CppNow 2021]
jlf: good video for pronunciation
57:16 Algorithms
1:12:27 The future for C++ (you can stop here, not very interesting)
02/06/2021
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html
SG16 initial Unicode direction and guidance for C++20 and beyond.
https://github.com/sg16-unicode/sg16
SG16 is an ISO/IEC JTC1/SC22/WG21 C++ study group tasked with improving Unicode and text processing support within the C++ standard.
https://github.com/sg16-unicode/sg16-meetings
Summaries of SG16 virtual meetings
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
SG16 mailing list
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1629r1.html
P1629R1
Transcoding the 🌐 - Standard Text Encoding
Published Proposal, 2020-03-02
---
jlf: referenced by Zach Laine in P2728R0.
[P1629R1] from JeanHeyd Meneide is a much more ambitious proposal that aims to
standardize a general-purpose text encoding conversion mechanism. This proposal
is not at odds with P1629; the two proposals have largely orthogonal aims. This
proposal only concerns itself with UTF interconversions, which is all that is
required for Unicode support. P1629 is concerned with those conversions, plus a
lot more. Accepting both proposals would not cause problems; in fact, the APIs
proposed here could be used to implement parts of the P1629 design.
01/06/2021
Zach Laine
https://www.youtube.com/watch?v=944GjKxwMBo
https://tzlaine.github.io/text/doc/html/boost_text__proposed_/the_text_layer.html
https://tzlaine.github.io/text/doc/html/index.html
The Text Layer
https://tzlaine.github.io/text/doc/html/
Chapter 1. Boost.Text (Proposed) - 2018
https://github.com/tzlaine/text
last commit :
master 26/09/2020
boost_serialization 24/10/2019
coroutines 25/08/2020
experimental 13/11/2019
gh-pages 04/09/2020
optimization 27/10/2019
rope_free_fn_reimplementation 26/07/2020
No longer working on this project ?
---
Restarted working on 22/03/2022
Zach's library was last discussed at the 2023-05-10 SG16 meeting; see
https://github.com/sg16-unicode/sg16-meetings#may-10th-2023.
---
https://www.youtube.com/watch?v=AoLl_ZZqyOk
Applying the Lessons of std::ranges to Unicode in the C++ Standard Library - Zach Laine CppNow 2023
https://isocpp.org/files/papers/P2728R0.html (see more recent version below)
Unicode in the Library, Part 1: UTF Transcoding
Document #: P2728R0
Date: 2022-11-20
Reply-to: Zach Laine <whatwasthataddress@gmail.com>
---
New version:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r5.html
Document #: P2728R5
Date: 2023-07-05
---
latest published version:
https://wg21.link/p2728
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2729r0.html
Unicode in the Library, Part 2: Normalization
Document #: P2729R0
Date: 2022-11-20
Reply-to: Zach Laine <whatwasthataddress@gmail.com>
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2773r0.pdf
paper D2773R0 by Corentin Jabot
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html
C++ Identifier Syntax using Unicode Standard Annex 31
Document #: P1949R7
Date: 2021-04-12
---
Adopt Unicode Annex 31 as part of C++ 23.
- That C++ identifiers match the pattern (XID_Start + _ ) + XID_Continue*.
- That portable source is required to be normalized as NFC.
- That using unassigned code points be ill-formed.
This proposal also recommends adoption of Unicode normalization form C (NFC)
for identifiers to ensure that when compared, identifiers intended to be the
same will compare as equal. Legacy encodings are generally naturally in NFC when
converted to Unicode. Most tools will, by default, produce NFC text.
Some scripts require the use of characters as joiners that are not allowed by
base UAX #31, these will no longer be available as identifiers in C++.
As a side-effect of adopting the identifier characters from UAX #31, using emoji
in or as identifiers becomes ill-formed.
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2528r0.html
C++ Identifier Security using Unicode Standard Annex 39
Document #: P2528R0
Date: 2022-01-22
14/06/2021
https://hsivonen.fi/non-unicode-in-cpp/
Same contents in sg16 mailing list + feedbacks
https://lists.isocpp.org/sg16/2019/04/0309.php
03/07/2021
https://news.ycombinator.com/item?id=27695412
Any Encoding, Ever – ztd.text and Unicode for C++
14/07/2021
https://hsivonen.fi/non-unicode-in-cpp/
It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++
The Microsoft Code Page 932 Issue
https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428.
What is the printf() formatting character for char8_t *?
jlf: todo read it? not sure yet if it's useful to read.
Referenced from https://corp.unicode.org/pipermail/unicode/2020-April/008579.html
Basic Unicode character/string support absent even in modern C++
https://github.com/nemtrif/utfcpp/
referenced from https://corp.unicode.org/pipermail/unicode/2020-April/008582.html
Basic Unicode character/string support absent even in modern C++
https://www.boost.org/doc/libs/1_80_0/libs/locale/doc/html/index.html
Boost.Locale
Boost.Locale uses the-state-of-the-art Unicode and Localization library: ICU - International Components for Unicode.
https://github.com/uni-algo/uni-algo
Unicode Algorithms Implementation for C/C++
https://www.reddit.com/r/cpp/comments/xspvn4/unialgo_v050_modern_unicode_library/
uni-algo v0.5.0: Modern Unicode Library
https://www.reddit.com/r/cpp/comments/vtgckq/new_unicode_library/
Older post with more infos
https://github.com/uni-algo/uni-algo-single-include
Single include version for Unicode Algorithms Implementation
This repository contains single include version of uni-algo library.
https://www.reddit.com/r/cpp/comments/14t2lzm/unialgo_v100_modern_unicode_library/
uni-algo v1.0.0: Modern Unicode Library
---
jlf: see the criticisms of Zach Laine's library... mg152 has good arguments.
---
jlf: this library is referenced in the comments
https://github.com/hikogui/hikogui/tree/main/src/hikogui/unicode
https://github.com/hikogui/hikogui/tree/main/tools/ucd
https://github.com/hikogui/hikogui
Modern accelerated GUI
jlf: the point is not the GUI, but the tools to parse Unicode UCD.
See https://github.com/hikogui/hikogui/tree/main/tools
---
Comment of the author in https://www.reddit.com/r/cpp/comments/vtgckq/new_unicode_library/
I recently discovered a way to compress the unicode-data-set, while still being
able to do quick lookups, with a single associative indirection.
Basically you chunk the data in groups of 32 entries. Then you de-duplicate
these chunks and make a index table (about 64kbyte) that points to the chunks.
This works because a code-point is only 21 bits, which you can split in 16 bit
msb and 5 bit lsb. This means that the index table has less than 64k uint16_t
entries.
My data is including the index around 700 KByte. With the following data:
general category: 5 bit
grapheme cluster break: 4
line break class: 6
word break property: 5
sentence break property: 4
east asian width: 3
bidi class:5
bidi bracket type: 2
bidi mirroring glyph: 16
ccc: 8
script: 8
decomposition type: 5
decomposition index: 21 (decomposition table not included in the 700kbyte)
composition index: 14 (composition table not included in the 700kbyte)
Of the 128 bits per entry, 22 bits are currently unused. It is also possible to
compress a single entry. For example ccc is always zero for non-composing
code-points, so it could share those bits with properties that are only allowed
for non-composing code-points.
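A rough sketch of the two-level lookup described in that comment (my reading of it, not hikogui's actual code): the code point is split into a 16-bit chunk index and a 5-bit offset, and a ~35K-entry uint16_t index table points at de-duplicated chunks of 32 packed property records. Building the table is just: slice the full per-code-point array into blocks of 32, de-duplicate the blocks, and record for each block the index of its unique representative.
#include <cstdint>
#include <vector>
struct Record { std::uint32_t bits[4]; };    // ~128 bits of packed properties per code point
struct PropertyTable {
    std::vector<std::uint16_t> chunkIndex;   // 0x110000 / 32 = 34,816 entries, one per 32-code-point block
    std::vector<Record>        chunks;       // de-duplicated chunks, 32 records each
    const Record& lookup(char32_t cp) const
    {
        std::uint16_t chunk = chunkIndex[cp >> 5];   // upper 16 bits of the code point select the chunk
        return chunks[chunk * 32 + (cp & 31)];       // lower 5 bits select the record within the chunk
    }
};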
https://news.ycombinator.com/item?id=38424689
Bjarne Stroustrup Quotes (stroustrup.com)
---
Interesting discussion about strings (not limited to C++): search for "string".
https://www.sandordargo.com/blog/2023/11/29/cpp23-unicode-support
C++23: Growing unicode support
---
The standardization committee has accepted (at least) four papers which clearly
show a growing Unicode support in C++23.
- C++ Identifier Syntax using Unicode Standard Annex 31
- Remove non-encodable wide character literals and multicharacter wide character literals
- Delimited escape sequences
- Named universal character escapes
U'\N{LATIN CAPITAL LETTER A WITH MACRON}' // Equivalent to U'\u0100'
u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}" // Equivalent to u8"\u0100\u0300"
One of the concerns was the sheer size of the Unicode name database that
contains the codes (e.g. U+0100) and the names (e.g. {LATIN CAPITAL LETTER A
WITH MACRON}). It’s around 1.5 MiB which can significantly impact the size
of compiler distributions. The authors proved that a non-naive
implementation can be around 300 KiB or even less.
jlf: next point sounds debatable, no?
Another open question was how to accept the Unicode-assigned names.
Is {latin capital letter a with macron} just as good as
{LATIN CAPITAL LETTER A WITH MACRON}?
Or what about {LATIN_CAPITAL_LETTER_A_WITH_MACRON}?
While the Unicode consortium standardized an algorithm called UAX44-LM2 for
that purpose and it’s quite permissive, language implementors barely follow
it. C++ is going to require an exact match with the database therefore the
answer to the previous question is no, {latin capital letter a with macron}
is not the same as {LATIN CAPITAL LETTER A WITH MACRON}. On the other hand,
if there will be a strong need, the requirements can be relaxed in a later
version.
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2071r2.html
Named universal character escapes
---
jlf: they don't want to support UAX44-LM2
jlf: todo, read the section "Design considerations"
cRexx lang
cRexx uses this library:
https://github.com/sheredom/utf8.h
---
Codepoint Case
Various functions provided will do case insensitive compares, or transform utf8
strings from one case to another. Given the vastness of unicode, and the authors
lack of understanding beyond latin codepoints on whether case means anything,
the following categories are the only ones that will be checked in case insensitive code:
ASCII
Latin-1 Supplement
Latin Extended-A
Latin Extended-B
Greek and Coptic
Cyrillic
DotNet, CoreFx
28/07/2021
https://github.com/dotnet/corefxlab/issues/2368
Scenarios and Design Philosophy - UTF-8 string support
https://gist.github.com/GrabYourPitchforks/901684d0aa1d2440eb378d847cfc8607 (jlf: replaced by the following URL)
https://github.com/dotnet/corefx/issues/34094 (go directly to next URL)
https://github.com/dotnet/runtime/issues/28204
Motivations and driving principles behind the Utf8Char proposal
https://github.com/dotnet/runtime/issues/933
The NuGet package generally follows the proposal in dotnet/corefxlab#2350, which
is where most of the discussion has taken place. It's a bit aggravating that the
discussion is split across so many different forums, I know. :(
ceztko
I noticed dotnet/corefxlab#2350 just got closed. Did the discussion moved
somewhere else about more UTF8 first citizen support efforts?
@ceztko The corefxlab repo was archived, so open issues were closed to
support that effort. That thread also got so large that it was difficult
to follow. @krwq is working on restructuring the conversation so that we
can continue the discussion in a better forum.
jlf
Not clear where the discussion is continued...
This URL just shows some tags, one of them being "Future".
https://github.com/orgs/dotnet/projects/7#card-33368432
https://github.com/dotnet/corefxlab/issues/2350
Utf8String design discussion - last edited 14-Sep-19
Tons of comments, with this conclusion:
The discussion in this issue is too long and github has troubles rendering it.
I think we should close this issue and start a new one in dotnet/runtime.
https://github.com/dotnet/runtime/tree/main
.Net runtime
jlf: could be useful
https://github.com/dotnet/runtime/blob/main/src/libraries/System.Console/src/System/Console.cs
Dafny lang
https://corp.unicode.org/pipermail/unicode/2021-May/009434.html
Dafny natively supports expressing statements about sets
and contract programming and a toy implementation turned out to be a fairly
rote translation of the Unicode spec. Dafny is also transpilation focused,
so the primary interface must be highly functional and encoding neutral.
Dart lang
Dart SDK uses ICU4X?
jlf: to investigate...
---
On Fuchsia, the Dart SDK uses createTimeZone() with metazone names obtained from the OS (usage site).
ICU4X currently only supports this stuff with BCP-47 ids. We should have a way to go from metazone names to BCP-47 ids.
I suspect this is already part of the plan but I'm not sure if there's a specific issue filed (@nordzilla?)
---
In the link you posted, it shows "America/New_York", which is an IANA time zone name, not a metazone name.
Did you mean to ask about IANA-to-BCP47 mapping? That would be #2909
https://github.com/dart-lang/sdk/blob/main/sdk/lib/core/string.dart
https://github.com/dart-lang/sdk/blob/e995cb5f7cd67d39c1ee4bdbe95c8241db36725f/pkg/analyzer/lib/source/source_range.dart
https://github.com/dart-lang/
https://github.com/dart-lang/language
https://github.com/dart-lang/sdk
https://dart.dev/guides/language/language-tour#strings
A Dart string (String object) holds a sequence of UTF-16 code units.
https://dart.dev/guides/language/language-tour#runes-and-grapheme-clusters
In Dart, runes expose the Unicode code points of a string.
You can use the characters package to view or manipulate user-perceived
characters, also known as Unicode (extended) grapheme clusters.
https://dart.dev/guides/libraries/library-tour#strings-and-regular-expressions
https://pub.dev/packages/characters
Characters are strings viewed as sequences of user-perceived characters,
also known as Unicode (extended) grapheme clusters.
The Characters class allows access to the individual characters of a string,
and a way to navigate back and forth between them using a CharacterRange.
https://medium.com/dartlang/dart-string-manipulation-done-right-5abd0668ba3e
Like many other programming languages designed before emojis started to dominate
our daily communications and the rise of multilingual support in commercial apps,
Dart represents a string as a sequence of UTF-16 code units.
---
jlf: they say that the Dart users are not aware of the Characters package.
They try to improve the situation in the Flutter framework, but they are not
very happy with the situation:
Those mitigations can help, but they are limited to string manipulations
performed in the context of a Flutter project. We need to carefully measure
their effectiveness after they become available. A more complete solution at the
Dart language level will likely require migration of at least some existing code,
although a few options (for example, static extension types) might make breaking
changes manageable.
More technical investigation is needed to fully understand the trade-offs.
https://github.com/robertbastian/icu4x/tree/dart/ffi/capi/dart/package
jlf: A fork with DART FFI
Elixir lang
https://elixir-lang.org/
"Elixir" |> String.graphemes() |> Enum.frequencies() %{"E" => 1, "i" => 2, "l" => 1, "r" => 1, "x" => 1}
---
"Elixir"~text~reduce(by: "characters", initial: .stem~new~~put(0)){accu[item] += 1}=
a Stem (5 items)
'E' : 1
'i' : 2
'l' : 1
'r' : 1
'x' : 1
https://hexdocs.pm/elixir/String.html
Strings in Elixir are UTF-8 encoded binaries.
Works at grapheme level.
The functions in this module rely on the Unicode Standard, but do not contain
any of the locale specific behaviour.
To act according to the Unicode Standard, many functions in this module run in
linear time, as they need to traverse the whole string considering the proper
Unicode code points.
For example, String.length/1 will take longer as the input grows.
On the other hand, Kernel.byte_size/1 always runs in constant time (i.e.
regardless of the input size).
---
Interesting: they manage correctly the upper/lower without using a locale.
upcase(string, mode \\ :default)
Converts all characters in the given string to uppercase according to mode.
mode may be :default, :ascii, :greek or :turkic.
The :default mode considers all non-conditional transformations outlined in the Unicode standard.
:ascii uppercases only the letters a to z.
:greek includes the context sensitive mappings found in Greek.
:turkic properly handles the letter i with the dotless variant.
https://hexdocs.pm/elixir/unicode-syntax.html
Strings are UTF-8 encoded.
Charlists are lists of Unicode code points. In such cases, the contents are kept
as written by developers, without any transformation.
Elixir allows Unicode characters in its variables, atoms, and calls.
From now on, we will refer to those terms as identifiers.
The characters allowed in identifiers are the ones specified by Unicode.
Elixir normalizes all characters to be in the NFC form.
Mixed-script identifiers are not supported for security reasons.
аdmin
"аdmin"~text~unicodecharacters==
an Array (shape [5], 5 items)
1 : ( "а" U+0430 Ll 1 "CYRILLIC SMALL LETTER A" )
2 : ( "d" U+0064 Ll 1 "LATIN SMALL LETTER D" )
3 : ( "m" U+006D Ll 1 "LATIN SMALL LETTER M" )
4 : ( "i" U+0069 Ll 1 "LATIN SMALL LETTER I" )
5 : ( "n" U+006E Ll 1 "LATIN SMALL LETTER N" )
The character must either be all in Cyrillic or all in Latin.
The only mixed-scripts that Elixir allows, according to the Highly Restrictive
Unicode recommendations, are:
Latin and Han with Bopomofo
Latin and Japanese
Latin and Korean
Elixir will also warn on confusable identifiers in the same file.
For example, Elixir will emit a warning if you use both variables а (Cyrillic)
and а (Latin) in your code.
Elixir implements the requirements outlined in the Unicode Annex #31
(https://www.unicode.org/reports/tr31/)
Elixir does not allow the use of ZWJ or ZWNJ in identifiers and therefore does
not implement R1a.
Bidirectional control characters are also not supported.
R1b is guaranteed for backwards compatibility purposes.
Elixir supports only code points \t (0009), \n (000A), \r (000D) and \s (0020)
as whitespace and therefore does not follow requirement R3.
R3 requires a wider variety of whitespace and syntax characters to be supported.
Factor lang
http://docs.factorcode.org/content/article-unicode.html
http://useless-factor.blogspot.fr/2007/02/doing-unicode-right-part-1.html
JLF : meh...
http://useless-factor.blogspot.fr/2007/02/doing-unicode-right-part-2.html
http://useless-factor.blogspot.fr/2007/08/unicode-implementers-guide-part-3.html
http://useless-factor.blogspot.fr/2007/08/unicode-implementers-guide-part-4.html
grapheme breaking
http://useless-factor.blogspot.fr/2007/08/r-597-rs-unicode-library-is-broken.html
http://useless-factor.blogspot.fr/2007/02/more-string-parsing.html
UTF-8/16 encoder/decoder
I used a design pattern known as a sentinel, which helps me cross-cut pointcutting concerns
by instantiating objects which encapsulate the state of the parser. I never mutate these,
and the program is purely functional except for the use of make (which could trivially be
changed into a less efficient map [ ] subset, sacrificing efficiency and some terseness
but making it functional).
TUPLE: new ;
TUPLE: double val ;
TUPLE: quad2 val ;
TUPLE: quad3 val ;
: bad-char CHAR: ? ;
GENERIC: (utf16le) ( char state -- state )
M: new (utf16le)
drop <double> ;
M: double (utf16le)
over -3 shift BIN: 11011 = [
over BIN: 100 bitand 0 =
[ double-val swap BIN: 11 bitand 8 shift bitor <quad2> ]
[ 2drop bad-char , <new> ] if
] [ double-val swap 8 shift bitor , <new> ] if ;
M: quad2 (utf16le)
quad2-val 10 shift bitor <quad3> ;
M: quad3 (utf16le)
over -2 shift BIN: 110111 = [
swap BIN: 11 bitand 8 shift
swap quad3-val bitor HEX: 10000 + , <new>
] [ 2drop bad-char , <new> ] if ;
: utf16le ( state string -- state string )
[ [ swap (utf16le) ] each ] { } make ;
https://re.factorcode.org/2023/05/unicode.html
jlf: very basic, but may be useful to write little tests
https://re.factorcode.org/2023/05/case-conversion.html
snake_case
camelCase
kebab-case
PascalCase
Ada_Case
Train-Case
COBOL-CASE
MACRO_CASE
UPPER CASE
lower case
Title Case
Sentence case
dot.case
Fortran lang
https://fortran-lang.discourse.group/t/using-unicode-characters-in-fortran/2764
jlf: hum... it's blind UTF-8 support, as we do with current Rexx.
There is no real Unicode support.
In the unicode_len.f90 example:
chars = 'Fortran is 💪, 😎, 🔥!'
if (len(chars) /= 28) error stop
28 is the length in bytes...
In the unicode_index.f90 example:
chars = '📐: 4.0·tan⁻¹(1.0) = π'
i = index(chars, 'n')
if (i /= 14) error stop
i = index(chars, '¹')
if (i /= 18) error stop
14 and 18 are byte positions...
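---
Not from the Fortran thread: a Python sketch of the same pitfall, to make the
byte/codepoint distinction explicit (Fortran's len() here counts bytes, like a
byte-oriented Rexx LENGTH would).
s = 'Fortran is 💪, 😎, 🔥!'
print(len(s.encode('utf-8')))   # 28 -> what the Fortran example's len() reports (bytes)
print(len(s))                   # 19 code points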
GO lang
https://go.dev/
https://go.dev/ref/spec#Conversions_to_and_from_a_string_type
jlf: worth reading, they cover all the possible conversions between bytes, rune
and string.
https://go.dev/play/
The Go Playground
https://github.com/traefik/yaegi
Another Elegant Go Interpreter
---
rlwrap yaegi
https://yourbasic.org/golang/
Tutorial, a selection related to strings
---
[]byte("Noël") // [78 111 195 171 108]
// 1. Using the string() constructor
string([]byte{78, 111, 195, 171, 108}) // Noël
// 2. Go provides a package called bytes with a function called NewBuffer(), which
// creates a new Buffer and then uses the String() method to get the string output.
bytes.NewBuffer([]byte{78, 111, 195, 171, 108}).String() // Noël
// 3. Using fmt.Sprintf() function
fmt.Sprintf("%s", []byte{78, 111, 195, 171, 108}) // Noël
// String building
fmt.Sprintf("Size: %d MB.", 85) // Size: 85 MB.
// High-performance string concatenation
var b strings.Builder
b.Grow(32) // preallocate memory when the maximum size of the string is known
for i, p := range []int{2, 3, 5, 7, 11, 13} {
fmt.Fprintf(&b, "%d:%d, ", i+1, p)
}
s := b.String() // no copying
s = s[:b.Len()-2] // no copying (removes trailing ", ")
fmt.Println(s) // 1:2, 2:3, 3:5, 4:7, 5:11, 6:13
// Convert string to runes
// For an invalid UTF-8 sequence, the rune value will be 0xFFFD for each invalid byte.
[]rune("Noël") // [78 111 235 108]
// Convert runes to string
// When you convert a slice of runes to a string, you get a new string that
// is the concatenation of the runes converted to UTF-8 encoded strings.
// Values outside the range of valid Unicode code points are converted to
// \uFFFD, the Unicode replacement character �.
string([]rune{'\u004E', '\u006F', '\u00EB', '\u006C'}) // Noël
// String iteration by runes
// the range loop iterates over Unicode code points.
// The index is the first byte of a UTF-8-encoded code point;
// the second value, of type rune, is the value of the code point.
// For an invalid UTF-8 sequence, the second value will be 0xFFFD,
// and the iteration will advance a single byte.
for i, ch := range "Noël" {
fmt.Printf("%#U starts at byte position %d\n", ch, i)
}
// Output:
U+004E 'N' starts at byte position 0
U+006F 'o' starts at byte position 1
U+00EB 'ë' starts at byte position 2
U+006C 'l' starts at byte position 4
// String iteration by bytes
const s = "Noël"
for i := 0; i < len(s); i++ {
fmt.Printf("%x ", s[i])
}
// Output: 4e 6f c3 ab 6c
https://pkg.go.dev/strings
Package strings implements simple functions to manipulate UTF-8 encoded strings.
jlf: BIFs
https://go.dev/blog/slices
Arrays, slices (and strings): The mechanics of 'append'
Rob Pike
26 September 2013
---
jlf: prerequisite to understand how strings are managed
Next blog also helps (no relation with Unicode, but...)
https://teivah.medium.com/slice-length-vs-capacity-in-go-af71a754b7d8
https://go.dev/blog/strings
Strings, bytes, runes and characters in Go
Rob Pike
23 October 2013
---
In Go, a string is in effect a read-only slice of bytes.
A string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text,
or any other predefined format.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
Indexing a string accesses individual bytes, not characters.
for i := 0; i < len(sample); i++ {
fmt.Printf("%x ", sample[i]) # bd b2 3d bc 20 e2 8c 98
}
A shorter way to generate presentable output for a messy string is to use the %x
(hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes
of the string as hexadecimal digits, two per byte.
fmt.Printf("%x\n", sample) # bdb23dbc20e28c98
fmt.Printf("% x\n", sample) # bd b2 3d bc 20 e2 8c 98
The %q (quoted) verb will escape any non-printable byte sequences in a string so
the output is unambiguous.
fmt.Printf("%q\n", sample) # "\xbd\xb2=\xbc ⌘"
fmt.Printf("%+q\n", sample) # "\xbd\xb2=\xbc \u2318"
The Go language defines the word rune as an alias for the type int32, so programs
can be clear when an integer value represents a code point.
A for range loop decodes one UTF-8-encoded rune on each iteration. Each time around
the loop, the index of the loop is the starting position of the current rune, measured
in bytes, and the code point is its value.
const nihongo = "日本語"
for index, runeValue := range nihongo {
fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
The output shows how each code point occupies multiple bytes:
U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
https://go.dev/pkg/unicode/utf8/
Unicode/utf8 package
https://go.dev/blog/normalization
Text normalization in Go
Marcel van Lohuizen
26 November 2013
---
To write your text as NFC, use the https://pkg.go.dev/golang.org/x/text/unicode/norm
package to wrap your io.Writer of choice:
wc := norm.NFC.Writer(w)
defer wc.Close()
// write as before...
If you have a small string and want to do a quick conversion, you can use this simpler form:
norm.NFC.Bytes(b)
https://cs.opensource.google/go/x/text
This repository holds supplementary Go libraries for text processing, many involving Unicode.
https://pkg.go.dev/golang.org/x/text/collate
The collate package, which can sort strings in a language-specific way, works
correctly even with unnormalized strings
https://pkg.go.dev/golang.org/x/text/encoding
Package encoding defines an interface for character encodings, such as Shift JIS
and Windows 1252, that can convert to and from UTF-8.
Encoding implementations are provided in other packages, such as
golang.org/x/text/encoding/charmap
golang.org/x/text/encoding/japanese.
A Decoder converts bytes to UTF-8. It implements transform.Transformer.
Transforming source bytes that are not of that encoding will not result in
an error per se. Each byte that cannot be transcoded will be represented in
the output by the UTF-8 encoding of '\uFFFD', the replacement rune.
---
jlf: strange... I was expecting a more conservative conversion, since the
core language supports any bytes in a string.
An Encoder converts bytes from UTF-8. It implements transform.Transformer.
Each rune that cannot be transcoded will result in an error. In this case,
the transform will consume all source bytes up to, not including the offending
rune. Transforming source bytes that are not valid UTF-8 will be replaced by
`\uFFFD`.
---
jlf: the previous description seems contradictory.
"up to, not including the offending rune"
"not valid UTF-8 will be replaced by `\uFFFD`"
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/charmap
Package charmap provides simple character encodings such as IBM Code Page 437
and Windows 1252.
CodePage037 is the IBM Code Page 037 encoding.
CodePage1047 is the IBM Code Page 1047 encoding.
CodePage1140 is the IBM Code Page 1140 encoding.
CodePage437 is the IBM Code Page 437 encoding.
CodePage850 is the IBM Code Page 850 encoding.
CodePage852 is the IBM Code Page 852 encoding.
CodePage855 is the IBM Code Page 855 encoding.
CodePage858 is the Windows Code Page 858 encoding.
CodePage860 is the IBM Code Page 860 encoding.
CodePage862 is the IBM Code Page 862 encoding.
CodePage863 is the IBM Code Page 863 encoding.
CodePage865 is the IBM Code Page 865 encoding.
CodePage866 is the IBM Code Page 866 encoding.
ISO8859_1 is the ISO 8859-1 encoding.
ISO8859_10 is the ISO 8859-10 encoding.
ISO8859_13 is the ISO 8859-13 encoding.
ISO8859_14 is the ISO 8859-14 encoding.
ISO8859_15 is the ISO 8859-15 encoding.
ISO8859_16 is the ISO 8859-16 encoding.
ISO8859_2 is the ISO 8859-2 encoding.
ISO8859_3 is the ISO 8859-3 encoding.
ISO8859_4 is the ISO 8859-4 encoding.
ISO8859_5 is the ISO 8859-5 encoding.
ISO8859_6 is the ISO 8859-6 encoding.
ISO8859_7 is the ISO 8859-7 encoding.
ISO8859_8 is the ISO 8859-8 encoding.
ISO8859_9 is the ISO 8859-9 encoding.
KOI8R is the KOI8-R encoding.
KOI8U is the KOI8-U encoding.
Macintosh is the Macintosh encoding.
MacintoshCyrillic is the Macintosh Cyrillic encoding.
Windows1250 is the Windows 1250 encoding.
Windows1251 is the Windows 1251 encoding.
Windows1252 is the Windows 1252 encoding.
Windows1253 is the Windows 1253 encoding.
Windows1254 is the Windows 1254 encoding.
Windows1255 is the Windows 1255 encoding.
Windows1256 is the Windows 1256 encoding.
Windows1257 is the Windows 1257 encoding.
Windows1258 is the Windows 1258 encoding.
Windows874 is the Windows 874 encoding.
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/japanese
Package japanese provides Japanese encodings such as EUC-JP and Shift JIS.
EUCJP is the EUC-JP encoding.
ISO2022JP is the ISO-2022-JP encoding.
ShiftJIS is the Shift JIS encoding, also known as Code Page 932 and Windows-31J.
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/korean
Package korean provides Korean encodings such as EUC-KR.
EUCKR is the EUC-KR encoding, also known as Code Page 949.
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/simplifiedchinese
Package simplifiedchinese provides Simplified Chinese encodings such as GBK.
HZGB2312 is the HZ-GB2312 encoding.
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/traditionalchinese
Package traditionalchinese provides Traditional Chinese encodings such as Big5.
Big5 is the Big5 encoding, also known as Code Page 950.
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/unicode
Package unicode provides Unicode encodings such as UTF-16.
UTF8 is the UTF-8 encoding. It neither removes nor adds byte order marks.
UTF8BOM is an UTF-8 encoding where the decoder strips a leading byte order mark while the encoder adds one.
UTF16 returns a UTF-16 Encoding for the given default endianness and byte order mark (BOM) policy.
func UTF16(e Endianness, b BOMPolicy) encoding.Encoding
https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/unicode/utf32
Package utf32 provides the UTF-32 Unicode encoding.
UTF32 returns a UTF-32 Encoding for the given default endianness and byte order mark (BOM) policy.
func UTF32(e Endianness, b BOMPolicy) encoding.Encoding
https://go.dev/blog/matchlang
Language and Locale Matching in Go
The Go package https://golang.org/x/text/language implements the BCP 47 standard
for language tags and adds support for deciding which language to use based on
data published in the Unicode Common Locale Data Repository (CLDR).
https://github.com/unicode-org/icu4x/issues/2882
https://cs.opensource.google/go/x/text
The golang x-text library has re-implemented most of ICU from scratch,
and some of their algorithms and data structures might be interesting
for the icu4x project
(afaik x-text was not just a port of the ICU codebase to another language,
but an actual re-implementation).
You might want to have a look at their code, or talk to @mpvl who wrote most of it.
https://github.com/golang/go/blob/master/src/cmd/compile/internal/syntax/scanner.go
Implementation of Golang’s lexer
An identifier is made up of letters and digits (the first character must be a letter),
where letter means any Unicode letter (category L) or the underscore.
package main
import "fmt"
func 隨機名稱() {
fmt.Println("It works!")
}
func main() {
隨機名稱()
źdźbło := 1
fmt.Println(źdźbło)
}
https://henvic.dev/posts/go-utf8/
UTF-8 strings with Go: len(s) isn't enough
jlf: in his initial post, the guy was not aware of graphemes
and it was only after feedback on Reddit that he added content about graphemes.
https://github.com/rivo/uniseg
Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go
https://pkg.go.dev/github.com/rivo/uniseg#hdr-Monospace_Width
Monospace Width
Monospace width, as referred to in this package, is the width of a string in a monospace font.
This package differs from wcswidth() in a number of ways, presumably to generate more visually pleasing results.
Note that whether these widths appear correct depends on your application's render engine,
to which extent it conforms to the Unicode Standard, and its choice of font.
---
Rules implemented by uniseg:
we assume that every code point has a width of 1, with the following exceptions:
- Code points with grapheme cluster break properties Control, CR, LF, Extend, and ZWJ have a width of 0.
- U+2E3A, Two-Em Dash, has a width of 3.
- U+2E3B, Three-Em Dash, has a width of 4.
- Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide" (W)
have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both have a width of 1.)
- Code points with grapheme cluster break property Regional Indicator have a width of 2.
- Code points with grapheme cluster break property Extended Pictographic have
a width of 2, unless their Emoji Presentation flag is "No", in which case the width is 1.
- For Hangul grapheme clusters composed of conjoining Jamo and for Regional Indicators
(flags), all code points except the first one have a width of 0.
- For grapheme clusters starting with an Extended Pictographic, any additional
code point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1.
- Grapheme clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.
---
jlf: meh, in conclusion there is no guarantee that the result will be good.
---
uniseg.StringWidth("🇩🇪🏳️🌈!") -- uniseg returns 5
utf8proc:
"🇩🇪🏳️🌈!"~text~unicodeCharacters~each("charWidth")= -- [ 1, 1, 1, 0, 0, 2, 1]
"🇩🇪🏳️🌈!"~text~unicodeCharacters==
an Array (shape [7], 7 items)
1 : ( "🇩" U+1F1E9 So 1 "REGIONAL INDICATOR SYMBOL LETTER D" )
2 : ( "🇪" U+1F1EA So 1 "REGIONAL INDICATOR SYMBOL LETTER E" )
3 : ( "🏳" U+1F3F3 So 1 "WAVING WHITE FLAG" )
4 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" )
5 : ( "" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" )
6 : ( "🌈" U+1F308 So 2 "RAINBOW" )
7 : ( "!" U+0021 Po 1 "EXCLAMATION MARK" )
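---
A small stdlib-Python probe (not uniseg, not utf8proc) to look at the raw properties
such width heuristics are built on; it just prints the East_Asian_Width class and
General_Category of each code point of the same test string.
import unicodedata
s = '\U0001F1E9\U0001F1EA\U0001F3F3\uFE0F\u200D\U0001F308!'   # same 7 code points as listed above
for ch in s:
    print(f'U+{ord(ch):04X}', unicodedata.east_asian_width(ch), unicodedata.category(ch))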
https://www.reddit.com/r/golang/comments/1d19uon/why_can_the_go_string_type_contain_invalid_utf8/
Why can the Go `string` type contain invalid UTF-8 data?
jlf: nothing interesting, it's Go vs Rust, and nobody wins.
jRuby lang
https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyString.java
jlf: big file, more than 7000 lines.
https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyEncoding.java
https://github.com/jruby/jruby/blob/master/lib/ruby/stdlib/unicode_normalize/normalize.rb
https://github.com/jruby/jruby/blob/master/spec/ruby/core/string/unicode_normalize_spec.rb
Java lang
https://docs.oracle.com/en/java/javase/
https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/text/BreakIterator.html
java.text.BreakIterator
The default implementation of the character boundary analysis conforms to the
Unicode Consortium's Extended Grapheme Cluster breaks. For more detail, refer to
Grapheme Cluster Boundaries section in the Unicode Standard Annex #29.
https://docs.oracle.com/en/java/javase/20/intl/internationalization-overview.html
Internationalization Overview
https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/
Java has supported Unicode since its first release and strings are internally
represented using UTF-16 encoding. UTF-16 is a variable length encoding scheme.
For characters that can fit into the 16 bits space, it uses 2 bytes to represent
them. For all other characters, it uses 4 bytes.
For a character that requires more than 16 bits, like these emojis 👦👩, the
char methods like someString.charAt(0) or someString.substring(0,1) will break
and give you only half the code point.
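---
Not Java, but the same arithmetic shown in Python: 👦 is one code point yet two
UTF-16 code units, so any API that hands out 16-bit units one at a time can return
half of it (an unpaired surrogate).
boy = '\U0001F466'                 # 👦
print(len(boy))                    # 1 code point
units = boy.encode('utf-16-be')
print(len(units) // 2)             # 2 UTF-16 code units
print(units.hex(' ', 2))           # 'd83d dc66' -> a surrogate pair; either half alone is meaningless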
https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html
When the specification for the Java language was created, the Unicode standard
was accepted and the char primitive was defined as a 16-bit data type, with
characters in the hexadecimal range from 0x0000 to 0xFFFF.
Because 16-bit encoding supports 2^16 (65,536) characters, which is insufficient
to define all characters in use throughout the world, the Unicode standard was
extended to 0x10FFFF, which supports over one million characters. The definition
of a character in the Java programming language could not be changed from 16
bits to 32 bits without causing millions of Java applications to no longer run
properly. To correct the definition, a scheme was developed to handle characters
that could not be encoded in 16 bits.
The characters with values that are outside of the 16-bit range, and within the
range from 0x10000 to 0x10FFFF, are called supplementary characters and are
defined as a pair of char values.
https://openjdk.org/jeps/400
JEP 400: UTF-8 by Default
A quick way to see the default charset of the current JDK is with the following command:
java -XshowSettings:properties -version 2>&1 | grep file.encoding
As envisaged by the specification of Charset.defaultCharset(), the JDK will allow
the default charset to be configured to something other than UTF-8.
java -Dfile.encoding=COMPAT
the default charset will be the charset chosen by the algorithm in JDK 17 and earlier,
based on the user's operating system, locale, and other factors.
The value of file.encoding will be set to the name of that charset.
java -Dfile.encoding=UTF-8
the default charset will be UTF-8.
This no-op value is defined in order to preserve the behavior of existing command lines.
The treatment of values other than "COMPAT" and "UTF-8" is not specified.
They are not supported, but if such a value worked in JDK 17 then it will likely continue to work in JDK 18.
https://www.baeldung.com/java-remove-accents-from-text
Remove Accents and Diacritics From a String in Java
- We will perform the compatibility decomposition represented as the Java enum NFKD.
because it decomposes more ligatures than the canonical method (for example, ligature “fi”).
- We will remove all characters matching the Unicode Mark category using the \p{M} regex expression.
Test:
assertEquals("\\u0066 \\u0069", StringNormalizer.unicodeValueOfNormalizedString("fi"));
assertEquals("\\u0061 \\u0304", StringNormalizer.unicodeValueOfNormalizedString("ā"));
assertEquals("\\u0069 \\u0308", StringNormalizer.unicodeValueOfNormalizedString("ï"));
assertEquals("\\u006e \\u0301", StringNormalizer.unicodeValueOfNormalizedString("ń"));
Compare Strings Including Accents Using Collator.
Java provides four strength values for a Collator:
PRIMARY: comparison omitting case and accents
SECONDARY: comparison omitting case but including accents and diacritics
TERTIARY: comparison including case and accents
IDENTICAL: all differences are significant
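---
The same accent-stripping idea in Python (stdlib only, not the Baeldung Java code):
decompose with NFKD, then drop every code point whose General_Category is a Mark.
import unicodedata

def strip_marks(s):
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.category(c).startswith('M'))

print(strip_marks('āïń ﬁ'))   # 'ain fi'  (NFKD also splits the ﬁ ligature)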
https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/io/DataInput.html#modified-utf-8
Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format
that is a slight modification of UTF-8.
- Characters in the range '\u0001' to '\u007F' are represented by a single byte.
- The null character '\u0000' and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes.
- Characters in the range '\u0800' to '\uFFFF' are represented by three bytes.
The differences between this format and the standard UTF-8 format are the following:
- The null byte '\u0000' is encoded in 2-byte format rather than 1-byte,
so that the encoded strings never have embedded nulls.
- Only the 1-byte, 2-byte, and 3-byte formats are used.
- Supplementary characters are represented in the form of surrogate pairs.
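---
A Python sketch of those two differences in byte form (hand-rolled, since Python
ships no modified-UTF-8 codec; 'surrogatepass' is only used here to emit the
CESU-8-style surrogate bytes).
print('\x00'.encode('utf-8').hex())          # '00' -> standard UTF-8 uses one byte for NUL
try:
    b'\xc0\x80'.decode('utf-8')              # modified UTF-8's overlong NUL, rejected by strict UTF-8
except UnicodeDecodeError:
    print('overlong C0 80 rejected by standard UTF-8')
print('\U0001F600'.encode('utf-8').hex())    # 'f09f9880' -> 4 bytes in standard UTF-8
cesu = '\ud83d\ude00'.encode('utf-8', 'surrogatepass')   # encode the surrogate pair itself
print(cesu.hex())                            # 'eda0bdedb880' -> 6 bytes, CESU-8 / modified UTF-8 style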
Decomposition of ligature
In Java, you'll need to use the Normalizer class and the NFKC form:
---
String ff = "\uFB00";
String normalized = Normalizer.normalize(ff, Form.NFKC);
System.out.println(ff + " = " + normalized);
---
This will print
ff = ff
https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16
You can confirm the following by looking at the source code of the relevant version of the java.lang.String class in OpenJDK.
Prior to Java 9, the standard in-memory representation for a Java String is UTF-16 code-units held in a char[].
private final char value[];
With Java 6 update 21 and later, there was a non-standard option (-XX:UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7.
For Java 9 and later, the implementation of String has been changed to use a compact representation by default.
private final byte[] value;
private final byte coder; // LATIN1 (0) or UTF16 (1)
https://docs.oracle.com/en/java/javase/20/docs/specs/man/java.html#advanced-runtime-options-for-java
-XX:-CompactStrings
Disables the Compact Strings feature.
By default, this option is enabled.
When this option is enabled, Java Strings containing only single-byte characters are internally represented
and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding.
This reduces, by 50%, the amount of space required for Strings containing only single-byte characters.
For Java Strings containing at least one multibyte character:
these are represented and stored as 2 bytes per character using UTF-16 encoding.
Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings.
As of 2023, see JEP 254: Compact Strings https://openjdk.org/jeps/254
https://howtodoinjava.com/java9/compact-strings/
https://stackoverflow.com/questions/44178432/difference-between-compact-strings-and-compressed-strings-in-java-9
In Java 9 on the other hand, compact strings are fully integrated into the JDK source.
String is always backed by byte[], where characters use one byte if they are Latin-1 and otherwise two.
Most operations do a check to see which is the case, e.g. charAt:
public char charAt(int index) {
if (isLatin1()) {
return StringLatin1.charAt(value, index);
} else {
return StringUTF16.charAt(value, index);
}
}
Compact strings are enabled by default and can be partially disabled - "partially"
because they are still backed by a byte[] and operations returning chars must still
put them together from two separate bytes
public int length() {
return value.length >> coder();
}
If our String is Latin1 only, coder is going to be zero, so length of value (the byte array) is the size of chars.
For non-Latin1 divide by two.
https://www.baeldung.com/java-string-encode-utf-8
Encoding With Core Java
// First, we get the String bytes, and then we create a new one using the retrieved bytes and the desired charset:
String rawString = "Entwickeln Sie mit Vergnügen";
byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8);
String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);
assertEquals(rawString, utf8EncodedString);
Encoding With Java 7 StandardCharsets
// First, we'll encode the String into bytes, and second, we'll decode it into a UTF-8 String:
String rawString = "Entwickeln Sie mit Vergnügen";
ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString);
String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString();
assertEquals(rawString, utf8EncodedString);
https://www.baeldung.com/java-string-to-byte-array
Convert String to Byte Array and Reverse in Java
Converting a String to Byte Array
A String is stored as an array of Unicode characters in Java.
To convert it to a byte array, we translate the sequence of characters into a sequence of bytes.
For this translation, we use an instance of Charset.
This class specifies a mapping between a sequence of chars and a sequence of bytes.
We refer to the above process as encoding.
Using String.getBytes()
The String class provides three overloaded getBytes methods to encode a String into a byte array:
- getBytes() – encodes using platform's default charset
---
String inputString = "Hello World!";
byte[] byteArrray = inputString.getBytes();
---
The above method is platform-dependent, as it uses the platform's default charset. We can get this charset by calling Charset.defaultCharset().
- getBytes (String charsetName) – encodes using the named charset
- getBytes (Charset charset) – encodes using the provided charset
Using Charset.encode()
The Charset class provides encode(), a convenient method that encodes Unicode characters into bytes.
This method always replaces invalid input and unmappable-characters using the charset's default replacement byte array.
---
String inputString = "Hello ਸੰਸਾਰ!";
Charset charset = StandardCharsets.US_ASCII;
byte[] byteArrray = charset.encode(inputString).array();
---
CharsetEncoder
CharsetEncoder transforms Unicode characters into a sequence of bytes for a given charset.
Moreover, it provides fine-grained control over the encoding process.
---
String inputString = "Hello ਸੰਸਾਰ!";
CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
encoder.onMalformedInput(CodingErrorAction.IGNORE)
.onUnmappableCharacter(CodingErrorAction.REPLACE)
.replaceWith(new byte[] { 0 });
byte[] byteArrray = encoder.encode(CharBuffer.wrap(inputString)).array();
---
Converting a Byte Array to String
We refer to the process of converting a byte array to a String as decoding.
Similar to encoding, this process requires a Charset.
However, we can't just use any charset for decoding a byte array.
In particular, we should use the charset that encoded the String into the byte array.
https://retrocomputing.stackexchange.com/questions/26535/why-do-java-classfiles-and-jni-use-a-frankensteins-monster-encoding-crossin
Why do Java classfiles (and JNI) use a "Frankenstein's monster" encoding crossing UTF-8 and UTF-16?
jlf: interesting for the history. If I understand correctly, Java uses the CESU-8
encoding to store strings in classfiles and JNI payloads.
https://en.wikipedia.org/wiki/CESU-8
CESU-8 = Compatibility Encoding Scheme for UTF-16: 8-Bit
- A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point
in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8
- A Unicode supplementary character, i.e. a code point in the range U+10000 to
U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each
surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3
bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four.
CESU-8 is not an official part of the Unicode Standard, because Unicode Technical
Reports are informative documents only. It should be used exclusively for internal
processing and never for external data exchange.
Supporting CESU-8 in HTML documents is prohibited by the W3C and WHATWG HTML
standards, as it would present a cross-site scripting vulnerability.
What is the level of support of surrogates?
java.lang.Character.isSurrogatePair()
java.lang.Character.toCodePoint(char high, char low) : int
String.codePointAt()
Character.codePointAt()
http://hauchee.blogspot.com/2015/05/surrogate-characters-mechanism.html
Neither String nor StringBuilder works properly. To avoid the issue above, use
java.text.BreakIterator to determine the correct position.
jlf: the code below shows how to map a logical position to a real position.
public static void main(String[] args) {
String text = "a\uD834\uDD60s\uD834\uDD60\uD834\uDD60©₂"; // text: a텠s텠텠©₂
int startIndex = 2;
int endIndex = 5;
BreakIterator charIterator = BreakIterator.getCharacterInstance();
System.out.println(
subString(charIterator, text, startIndex, endIndex)); // output: s텠텠
}
private static String subString(BreakIterator charIterator,
String target, int start, int end) {
int realStart = 0;
int realEnd = 0;
charIterator.setText(target);
int boundary = charIterator.first();
int i = 0;
while (boundary != BreakIterator.DONE) {
if (i == start) {
realStart = boundary;
}
if (i == end) {
realEnd = boundary;
break;
}
boundary = charIterator.next();
i++;
}
return target.substring(realStart, realEnd);
}
https://github.com/s-u/rJava/issues/51
R to Java interface
Error on UTF-16 surrogate pairs
Java uses UTF-16 internally and encodes Unicode characters above U+FFFF with
surrogate pairs. When strings containing such characters are converted to UTF-8
by rJava they are encoded as a pair of 3 byte sequences rather than as the correct
4 byte sequence. This is not valid UTF-8 and will result in "invalid multibyte string" errors.
https://www.unicode.org/faq/utf_bom.html#utf8-4
https://bugs.openjdk.org/browse/JDK-8291660
https://youtrack.jetbrains.com/issue/IDEA-197555
\b{g} not supported in regex
In the docs for java.util.regex.Pattern (https://docs.oracle.com/javase/10/docs/api/java/util/regex/Pattern.html):
\b{g} is listed under the “Boundary matchers” section: “\b{g} A Unicode extended grapheme cluster boundary”
https://www.reddit.com/r/LanguageTechnology/comments/af0ice/seeking_lightweight_java_graphemetophoneme_g2p/
Seeking lightweight Java grapheme-to-phoneme (G2P) model
Succeeded at getting jg2p working. It's doing pretty well in terms of
pronunciation quality but the model is very large for an Android app and takes
forever to load.
https://github.com/steveash/jg2p/
jg2p
Java implementation of a general grapheme to phoneme toolkit using a pipeline of
CRFs, a log-loss re-ranker, and a joint "graphone" language model.
https://horstmann.com/unblog/2023-10-03/index.html
Stop Using char in Java. And Code Points
jlf: moderately interesting...
jlf: idem for the related HN comments https://news.ycombinator.com/item?id=37822967
Since Java 20, there is a way of iterating over the grapheme clusters of a string,
using the BreakIterator class from Java 1.1.
String s = "Ciao 🇮🇹!";
BreakIterator iter = BreakIterator.getCharacterInstance();
iter.setText(s);
int start = iter.first();
int end = iter.next();
while (end != BreakIterator.DONE) {
String gc = s.substring(start, end);
start = end;
end = iter.next();
process(gc);
}
Here is a much simpler way, clearly not as efficient. I was stunned to find out
that this worked since Java 9!
s.split("\\b{g}"); // An array with elements "C", "i", "a", "o", " ", "🇮🇹", "!"
Or, to get a stream:
Pattern.compile("\\X").matcher(s).results().map(MatchResult::group)
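---
For comparison, a Python sketch of the same split with the third-party regex module
(the stdlib re module has no \X); shown only as a cross-check, expected to yield the
same seven clusters.
import regex   # third-party (pip install regex)
print(regex.findall(r'\X', 'Ciao 🇮🇹!'))
# expected: ['C', 'i', 'a', 'o', ' ', '🇮🇹', '!']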
JavaScript lang
https://certitude.consulting/blog/en/invisible-backdoor/
THE INVISIBLE JAVASCRIPT BACKDOOR
https://www.npmjs.com/package/tty-strings
A one stop shop for working with text displayed in the terminal.
The goal of this project is to alleviate the headache of working with Javascript's
internal representation of unicode characters, particularly within the context of
displaying text in the terminal for command line applications.
---
jlf tag: character width
https://github.com/foliojs/linebreak
A JS implementation of the Unicode Line Breaking Algorithm (UAX #14)
It is used by PDFKit (https://github.com/foliojs/pdfkit) for line wrapping text
in PDF documents.
https://github.com/codebox/homoglyph
A big list of homoglyphs and some code to detect them
Julia lang
Remember: search in issues with "utf8proc in:title,body"
https://bkamins.github.io/julialang/2020/08/13/strings.html
The String, or There and Back Again
https://docs.julialang.org/en/v1/manual/strings/
You can input any Unicode character in single quotes using \u followed by up to
four hexadecimal digits or \U followed by up to eight hexadecimal digits
(the longest valid value only requires six):
julia> '\u0'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)
julia> '\u78'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> '\u2200'
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> '\U10ffff'
'\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
https://docs.julialang.org/en/v1/base/strings/
jlf: search for "ß" in this page with Chrome, you will see it matches with "ss"
It doesn't match the β here: isless("β", "α")
"β"~text~characters= -- ( "β" U+03B2 Ll 1 "GREEK SMALL LETTER BETA" )
https://juliapackages.com/p/strs
jlf: the string implementation of Scott P Jones
Seems quiet since last year...
This uses Swift-style \ escape sequences, such as \u{xxxx} for Unicode constants,
instead of \uXXXX and \UXXXXXXXX, which have the advantage of not having to worry
about some digit or letter A-F or a-f occurring after the last hex digit of the Unicode constant.
It also means that $, a very common character for LaTeX strings or output of currencies,
does not need to be in a string quoted as '$'
It uses \(expr) for interpolation like Swift, instead of $name or $(expr), which
also has the advantage of not having to worry about the next character in the
string someday being allowed in a name.
It allows for embedding Unicode characters using a variety of easy to remember
names, instead of hex codes: \:emojiname: \<latexname> \N{unicodename} \&htmlname;
Examples of this are:
f"\<dagger> \¥ \N{ACCOUNT OF} \:snake:", which returns the string: "† ¥ ℀ 🐍 "
https://discourse.julialang.org/t/stupid-question-on-unicode/27674/7
Discussion about escape sequence
https://docs.julialang.org/en/v1/stdlib/Unicode/
Unicode.julia_chartransform(c::Union{Char,Integer})
Unicode.isassigned(c) -> Bool
isequal_normalized(s1::AbstractString, s2::AbstractString; casefold=false, stripmark=false, chartransform=identity)
Unicode.normalize(s::AbstractString; keywords...)
boolean keyword options (which all default to false except for compose)
- compose=false: do not perform canonical composition
- decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present)
- compat=true: compatibility equivalents are canonicalized
- casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison
- newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively
- stripmark=true: strip diacritical marks (e.g. accents)
- stripignore=true: strip Unicode's "default ignorable" characters (e.g. the soft hyphen or the left-to-right marker)
- stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified
- rejectna=true: throw an error if unassigned code points are found
- stable=true: enforce Unicode versioning stability (never introduce characters missing from earlier Unicode versions)
Unicode.normalize(s::AbstractString, normalform::Symbol)
normalform can be :NFC, :NFD, :NFKC, or :NFKD.
utf8proc doesn't support language-sensitive case-folding
Julia, which uses utf8proc, has decided to remain locale-independent.
See https://github.com/JuliaLang/julia/issues/7848
https://github.com/JuliaLang/julia/pull/42493
This PR adds a function isequal_normalized to the Unicode stdlib to check whether
two strings are canonically equivalent (optionally casefolding and/or stripping combining marks).
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/13
julia> '\ub5'
'µ': Unicode U+00b5 (category Ll: Letter, lowercase)
julia> '\uff'
'ÿ': Unicode U+00ff (category Ll: Letter, lowercase)
julia> Base.Unicode.uppercase("ÿ")[1]
'Ÿ': Unicode U+0178 (category Lu: Letter, uppercase)
julia> Base.Unicode.uppercase("µ")[1]
'Μ': Unicode U+039c (category Lu: Letter, uppercase)
jlf: I find the next thread interesting from a social point of view...
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/40
Yet another Stefan Karpinski against Scott P Jones...
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/42
jlf: helping Scott P Jones
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/46
Referencing https://github.com/JuliaLang/julia/pull/25021
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/72
jlf: Stefan Karpinski not happy
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/79
jlf: Stefan Karpinski not happy
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/88
jlf: Scott P Jones not happy
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/130
Using a hybrid encoding like Python 3’s strings or @ScottPJones’s UniStr
means that not only do you need to look at every byte of incoming data, but
you also have to transcode it in general. This is a total performance
nightmare for dealing with large text files.
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/133
jlf: Interesting points of Stefan Karpinski regarding the validation of strings.
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/138
jlf: not sure if Scott P Jones says that graphemes are not needed...
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/144
jlf: revolt!
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/148
jlf: "This is a plea for the thread to stop."
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/154
jlf: very upset guy
https://github.com/JuliaLang/julia/pull/25021
Move Unicode-related functions to new Unicode stdlib package
jlf: nothing interesting in the comments, but it is this PR that Scott P Jones
describes as a bomb.
https://github.com/JuliaLang/julia/pull/19469#issuecomment-264810748
AFAICT the currently implemented lowercase also does not follow the spec.
I do not know anything about Turkish but the following behaviour in Greek
julia> lowercase("OΔΥΣΣΕΥΣ")
"oδυσσευσ" # wrong
"oδυσσευς" # would be correct
is wrong, i.e. the lowercase sigma at the end is the non-final form σ but should be the final form ς instead.
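---
For comparison (not Julia): CPython's str.lower() does apply the context-dependent
final-sigma rule, so the last sigma comes out in its final form.
print('OΔΥΣΣΕΥΣ'.lower())   # 'oδυσσευς' -- final ς at the end, non-final σ elsewhere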
https://github.com/JuliaStrings/utf8proc/issues/54
Feature request: Full Case Folding #54
opened in 2015, still open in 2022
jlf: related to utf8proc
---
https://github.com/JuliaStrings/utf8proc/issues/54#issuecomment-141545196
our case is to make a perfect search in MAPS.ME :)
In general, we need to preprocess a lot of raw strings added by community of OpenStreetMap,
and match these strings effectively on mobile device, for any language and any input.
This includes stripping out all diacritics, full case folding, and even some special
conversions which are not covered in Unicode standard but are important for users trying to find something.
I've already mentioned ß=>ss conversion,
there are also non-standard Ł=>L, й=>и,
famous turkish İ and ı conversions,
all very important if you don't have a Ł key on your keyboard, for example,
and trying to enter it as L (and find some Polish street for example).
Now we have our own highly-optimized implementation for NFKD and Case Folding.
---
jlf: made a search in https://github.com/mapsme/omim, but could not find where
they handle Ł=>L
Found NormalizeAndSimplifyString, but it doesn't simplify Ł=>L.
https://github.com/JuliaStrings/utf8proc/pull/102
Fixes allowing for “Full” folding and NFKC_CaseFold compliance. #102
---
jlf: this is the creation of the function NFKC_Casefold in utf8proc
---
https://github.com/JuliaStrings/utf8proc/pull/133
Case folding fixes #133
Updated version of #102:
Restores the original behavior of IGNORE so that this PR is non-breaking, adds new STRIPNA flag.
Renames the new function to utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF
Adds a minimal test.
Updates the utf8proc_data.c file.
jlf: this explains why the options in utf8proc are like that.
jlf: "NFKC_CF" seems the name to search to get useful infos about utf8proc_NFKC_Casefold.
https://unicode-org.github.io/icu/userguide/transforms/normalization/
NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding
and removing ignorable characters which was introduced with Unicode 5.2.
https://docs.tibco.com/pub/enterprise-runtime-for-R/5.0.0/doc/html/Language_Reference/terrUtils/normalizeUnicode.html
normalizeUnicode(x, form = "NFC")
form: a character string specifying the type of Unicode normalization to be used.
Should be one of the strings "NFC", "NFD", "NFKC", "NFKD", "NFKC_CF" or "NFKC_Casefold".
The forms "NFKC_CF" or "NFKC_Casefold" (which are equivalent) are described in https://www.unicode.org/reports/tr31/.
https://www.lanqiao.cn/library/elasticsearch-definitive-guide-cn/220_Token_normalization/40_Case_folding
Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.
jlf: they say "The default normalization form that the icu_normalizer token filter uses is nfkc_cf"
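---
A quick Python illustration of case folding vs. lowercasing (str.casefold() does
full folding, but not the locale-dependent Turkish dotless-i rules):
print('ß'.lower(), 'ß'.casefold())                  # ß ss   (full folding expands sharp s)
print('İ'.lower() == 'i', 'İ'.casefold() == 'i')    # False False (both map to 'i' + U+0307, not plain 'i')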
https://github.com/JuliaLang/julia/issues/52408
isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false
---
jlf: see the comments and new code
---
This string is really not well supported by BBEdit!
"בְּ"~text~unicodeCharacters==
an Array (shape [3], 3 items)
1 : ( "ב" U+05D1 Lo 1 "HEBREW LETTER BET" )
2 : ( "ּ" U+05BC Mn 0 "HEBREW POINT DAGESH OR MAPIQ" )
3 : ( "ְ" U+05B0 Mn 0 "HEBREW POINT SHEVA" )
"בְּ"~text~c2x= -- D791 D6BC D6B0
"בְּ"~text~nfc~c2x= -- D791 D6B0 D6BC
https://github.com/JuliaStrings/utf8proc/issues/257
normalization does not commute with case-folding?
julia> using Unicode: normalize
julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J"
julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false
julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false
Not sure if this is a bug or just a weird behavior of Unicode.
---
I get something similar in Python 3:
>>> import unicodedata
>>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold()
False
>>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold())
False
So I guess this is a weird quirk of Unicode?
---
Executor idem:
s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"~text~unescape
s~nfc(casefold:.true) == s~nfc~nfc(casefold:.true)= -- 0
s~nfc(casefold:.true)~c2x= -- 6A E0BB88 E0BDB2 CEB9 CCB4 D6B0 D6BB D6BF F09D85AD CC95 CD9C
s~nfc~nfc(casefold:.true)~c2x= -- 6A CCB4 D6B0 D6BB D6BF E0BB88 E0BDB2 F09D85AD CC95 CD9C CEB9
s~nfc(casefold:.true)~nfc == s~nfc~nfc(casefold:.true)= -- 0
s~nfc(casefold:.true)~nfc~c2x= -- 6A E0BB88 E0BDB2 CEB9 CCB4 D6B0 D6BB D6BF F09D85AD CC95 CD9C
s~nfc~nfc(casefold:.true)~c2x= -- 6A CCB4 D6B0 D6BB D6BF E0BB88 E0BDB2 F09D85AD CC95 CD9C CEB9
https://github.com/JuliaStrings/utf8proc/issues/101#issuecomment-1876151702
jlf: maybe this example of Julia code could be useful for Executor?
function _isequal_normalized!
> I agree that a fast case-folded/normalized comparison function that requires
> no buffers seems possible to write and could be useful, even for Julia;
Note that such a function was implemented in Julia, and could be ported to C:
https://github.com/JuliaLang/julia/blob/0f6c72c71bc947282ae18715c09f93a22828aab7/stdlib/Unicode/src/Unicode.jl#L268-L340
Kotlin lang
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/
https://github.com/JetBrains/kotlin/tree/master/libraries/stdlib/jvm/src/kotlin/text
Lisp lang
14/09/2021
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
name
Corresponds to the Name Unicode property.
The value is a string consisting of upper-case Latin letters A to Z, digits,
spaces, and hyphen ‘-’ characters. For unassigned codepoints, the value is nil.
general-category
Corresponds to the General_Category Unicode property.
The value is a symbol whose name is a 2-letter abbreviation of the character’s
classification. For unassigned codepoints, the value is Cn.
canonical-combining-class
Corresponds to the Canonical_Combining_Class Unicode property.
The value is an integer. For unassigned codepoints, the value is zero.
bidi-class
Corresponds to the Unicode Bidi_Class property.
The value is a symbol whose name is the Unicode directional type of the
character. Emacs uses this property when it reorders bidirectional text for
display (see Bidirectional Display). For unassigned codepoints, the value
depends on the code blocks to which the codepoint belongs: most unassigned
codepoints get the value of L (strong L), but some get values of AL (Arabic
letter) or R (strong R).
decomposition
Corresponds to the Unicode properties Decomposition_Type and Decomposition_Value.
The value is a list, whose first element may be a symbol representing a
compatibility formatting tag, such as small; the other elements are
characters that give the compatibility decomposition sequence of this
character. For characters that don’t have decomposition sequences, and for
unassigned codepoints, the value is a list with a single member, the
character itself.
decimal-digit-value
Corresponds to the Unicode Numeric_Value property for characters whose
Numeric_Type is ‘Decimal’. The value is an integer, or nil if the character
has no decimal digit value. For unassigned codepoints, the value is nil,
which means NaN, or “not a number”.
digit-value
Corresponds to the Unicode Numeric_Value property for characters whose
Numeric_Type is ‘Digit’. The value is an integer. Examples of such characters
include compatibility subscript and superscript digits, for which the value
is the corresponding number. For characters that don’t have any numeric value,
and for unassigned codepoints, the value is nil, which means NaN.
numeric-value
Corresponds to the Unicode Numeric_Value property for characters whose
Numeric_Type is ‘Numeric’. The value of this property is a number. Examples
of characters that have this property include fractions, subscripts,
superscripts, Roman numerals, currency numerators, and encircled numbers.
For example, the value of this property for the character U+2155 VULGAR
FRACTION ONE FIFTH is 0.2. For characters that don’t have any numeric value,
and for unassigned codepoints, the value is nil, which means NaN.
mirrored
Corresponds to the Unicode Bidi_Mirrored property.
The value of this property is a symbol, either Y or N. For unassigned
codepoints, the value is N.
mirroring
Corresponds to the Unicode Bidi_Mirroring_Glyph property.
The value of this property is a character whose glyph represents the mirror
image of the character’s glyph, or nil if there’s no defined mirroring glyph.
All the characters whose mirrored property is N have nil as their mirroring
property; however, some characters whose mirrored property is Y also have nil
for mirroring, because no appropriate characters exist with mirrored glyphs.
Emacs uses this property to display mirror images of characters when
appropriate (see Bidirectional Display). For unassigned codepoints, the
value is nil.
paired-bracket
Corresponds to the Unicode Bidi_Paired_Bracket property.
The value of this property is the codepoint of a character’s paired bracket,
or nil if the character is not a bracket character. This establishes a
mapping between characters that are treated as bracket pairs by the Unicode
Bidirectional Algorithm; Emacs uses this property when it decides how to
reorder for display parentheses, braces, and other similar characters (see
Bidirectional Display).
bracket-type
Corresponds to the Unicode Bidi_Paired_Bracket_Type property.
For characters whose paired-bracket property is non-nil, the value of this
property is a symbol, either o (for opening bracket characters) or c (for
closing bracket characters). For characters whose paired-bracket property is
nil, the value is the symbol n (None). Like paired-bracket, this property is
used for bidirectional display.
old-name
Corresponds to the Unicode Unicode_1_Name property.
The value is a string. For unassigned codepoints, and characters that have
no value for this property, the value is nil.
iso-10646-comment
Corresponds to the Unicode ISO_Comment property.
The value is either a string or nil. For unassigned codepoints, the value
is nil.
uppercase
Corresponds to the Unicode Simple_Uppercase_Mapping property.
The value of this property is a single character. For unassigned codepoints,
the value is nil, which means the character itself.
lowercase
Corresponds to the Unicode Simple_Lowercase_Mapping property.
The value of this property is a single character. For unassigned codepoints,
the value is nil, which means the character itself.
titlecase
Corresponds to the Unicode Simple_Titlecase_Mapping property.
Title case is a special form of a character used when the first character of
a word needs to be capitalized. The value of this property is a single
character. For unassigned codepoints, the value is nil, which means the
character itself.
special-uppercase
Corresponds to Unicode language- and context-independent special upper-casing
rules. The value of this property is a string (which may be empty). For
example mapping for U+00DF LATIN SMALL LETTER SHARP S is "SS". For characters
with no special mapping, the value is nil which means uppercase property
needs to be consulted instead.
special-lowercase
Corresponds to Unicode language- and context-independent special lower-casing
rules. The value of this property is a string (which may be empty). For
example mapping for U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE the value is
"i\u0307" (i.e. 2-character string consisting of LATIN SMALL LETTER I followed
by U+0307 COMBINING DOT ABOVE). For characters with no special mapping, the
value is nil which means lowercase property needs to be consulted instead.
special-titlecase
Corresponds to Unicode unconditional special title-casing rules.
The value of this property is a string (which may be empty). For example
mapping for U+FB01 LATIN SMALL LIGATURE FI the value is "Fi". For characters
with no special mapping, the value is nil which means titlecase property
needs to be consulted instead.
Mathematica lang
https://www.youtube.com/watch?v=yiwLBvirm7A
Live CEOing Ep 426: Language Design in Wolfram Language [Unicode Characters & WFR Suggestions]
At the beginning, there are a few minutes about character properties.
https://writings.stephenwolfram.com/2022/06/launching-version-13-1-of-wolfram-language-mathematica/#emojis-and-more-multilingual-support
Launching Version 13.1 of Wolfram Language & Mathematica 🙀🤠🥳
Emojis! And More Multilingual Support
Original 16-bit Unicode is “plane 0”. Now there are up to 16 additional planes.
Not quite 32-bit characters, but given the way computers work, the approach now
is to allow characters to be represented by 32-bit objects. It’s far from trivial
to do that uniformly and efficiently. And for us it’s been a long process to
upgrade everything in our system—from string manipulation to notebook rendering—
to handle full 32-bit characters. And that’s finally been achieved in Version 13.1.
---
You can have wolf and ram variables:
In:= Expand[(🐺 + 🐏)^8]
In:= Expand[(\|01f43a + \|01f40f)^8]
Out= 🐏^8 + 8 🐏^7 🐺 + 28 🐏^6 🐺^2 + 56 🐏^5 🐺^3 + 70 🐏^4 🐺^4 + 56 🐏^3 🐺^5 + 28 🐏^2 🐺^6 + 8 🐏 🐺^7 + 🐺^8
The 🐏 sorts before the 🐺 because it happens to have a numerically smaller character code:
In:= ToCharacterCode["🐺🐏"]
In:= ToCharacterCode["\|01f43a\|01f40f"]
Out= {128058, 128015}
---
In:= Grid[Outer[StringJoin[#1, "\:200d", #2] & , {"👩", "👨"}, {"🔬", "🏫", "🎓", "🍳", "🚀", "🔧"}]]
In:= Grid[Outer[StringJoin[#1, "\:200d", #2] & , {"\|01f469", "\|01f468"}, {"\|01f52c", "\|01f3eb", "\|01f393", "\|01f373", "\|01f680", "\|01f527"}]]
Out= 👩🔬 👩🏫 👩🎓 👩🍳 👩🚀 👩🔧
👨🔬 👨🏫 👨🎓 👨🍳 👨🚀 👨🔧
---
No outer product in Executor, only element-wise operators
("👩", "👨")~each{(item || .Unicode["zero width joiner"]~text) || ("🔬", "🏫", "🎓", "🍳", "🚀", "🔧")}==
an Array (shape [2], 2 items)
1 : [T'👩🔬',T'👩🏫',T'👩🎓',T'👩🍳',T'👩🚀',T'👩🔧']
2 : [T'👨🔬',T'👨🏫',T'👨🎓',T'👨🍳',T'👨🚀',T'👨🔧']
---
In:= CharacterRange[74000, 74050]
Out= {𒄐, 𒄑, 𒄒, 𒄓, 𒄔, 𒄕, 𒄖, 𒄗, 𒄘, 𒄙, 𒄚, 𒄛, 𒄜, 𒄝, 𒄞, 𒄟, 𒄠, 𒄡, 𒄢, 𒄣,
> 𒄤, 𒄥, 𒄦, 𒄧, 𒄨, 𒄩, 𒄪, 𒄫, 𒄬, 𒄭, 𒄮, 𒄯, 𒄰, 𒄱, 𒄲, 𒄳, 𒄴, 𒄵, 𒄶, 𒄷,
> 𒄸, 𒄹, 𒄺, 𒄻, 𒄼, 𒄽, 𒄾, 𒄿, 𒅀, 𒅁, 𒅂}
In:= FromCharacterCode[{2361, 2367}]
Out= हि
In:= Characters["हि"]
In:= Characters["\:0939\:093f"]
Out= {ह, ि}
In:= Characters["o\:0308"]
Out= {o, ̈}
In:= CharacterNormalize["o\:0308", "NFC"]
Out= ö
In:= ToCharacterCode[%]
Out= {246}
netrexx lang
https://groups.io/g/netrexx/topic/93734685
Unicode Examples
(this not NetRexx, but this answer is useful for NetRexx)
https://stackoverflow.com/questions/63410278/code-point-and-utf-16-code-units-are-the-same-thing
code point and UTF-16 code units are the same thing?
No, they are different.
I know, MDN uses the rarely used "code units" term, which confuses people a lot.
Code points are the number given to a Unicode element (character).
This is independent to the encoding, and it can be as high as 0x10FFFF.
UTF-32 code units are equivalent to Unicode code points (if you are using the correct endianess).
Code units in UTF-16 are units of 16bit data.
UTF-16 uses 1 or 2 code units to describe a code point, depending on its value.
Code points below (or equal) to 0xFFFF (the old limit/expectation of Unicode
that such numbers were enough to encode all characters) use just 1 code unit,
and its value is the same as the code point.
Unicode expanded the code point space, so now code points between 0x010000..0x10FFFF require 2 code units
(and we use "surrogates" to encode such characters), 4 bytes total.
So, code points are not the same as code units.
For UTF-16, code units are 16bit long, and code points could be 1 or 2 code units.
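---
To make the arithmetic concrete, a small Python sketch (this mirrors the UTF-16
scheme itself; it is not a NetRexx API):
cp = 0x1F600                        # 😀, a supplementary code point
v = cp - 0x10000                    # 20 bits to spread over two code units
high = 0xD800 + (v >> 10)
low  = 0xDC00 + (v & 0x3FF)
print(hex(high), hex(low))          # 0xd83d 0xde00
print('\U0001F600'.encode('utf-16-be').hex(' ', 2))   # 'd83d de00'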
(this is JavaScript, but this answer is useful for NetRexx)
https://exploringjs.com/impatient-js/ch_unicode.html#:~:text=Code%20units%20are%20numbers%20that,has%208%2Dbit%20code%20units.
each UTF-16 code unit is always either a leading surrogate, a trailing surrogate, or encodes a BMP code point
BMP = Basic Multilingual Plane (0x0000–0xFFFF)
(this is JavaScript, but this answer is useful for NetRexx)
https://www.w3schools.com/jsref/jsref_codepointat.asp#:~:text=Difference%20Between%20charCodeAt()%20and%20codePointAt()&text=charCodeAt()%20returns%20a%20number,value%20greather%200xFFFF%20(65535).
Difference Between charCodeAt() and codePointAt()
charCodeAt() is UTF-16, codePointAt() is Unicode.
charCodeAt() returns a number between 0 and 65535.
Both methods return an integer representing the UTF-16 code of a character,
but only codePointAt() can return the full value of a Unicode value greater than 0xFFFF (65535).
(this is Unicode, but this answer is useful for NetRexx)
https://www.unicode.org/faq/utf_bom.html#:~:text=Surrogates%20are%20code%20points%20from,DC0016%20to%20DFFF16.
What are surrogates?
Surrogates are code points from two special ranges of Unicode values, reserved
for use as the leading, and trailing values of paired code units in UTF-16.
Leading surrogates, also called high surrogates, are encoded from D800 to DBFF,
and trailing surrogates, or low surrogates, from DC00 to DFFF.
They are called surrogates, since they do not represent characters directly, but only as a pair.
What is the difference between UCS-2 and UTF-16?
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1,
before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations.
However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.
Sometimes in the past an implementation has been labeled “UCS-2” to indicate that it does not support supplementary characters
and doesn’t interpret pairs of surrogate code points as characters.
Such an implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters, nor would it
be able to support most emoji, for example. [AF]
(this is Unicode, but this answer is useful for NetRexx)
Unicode standard
How the word "surrogate" is used
surrogate pair
surrogate code unit
surrogate code point
leading surrogate
trailing surrogate
high-surrogate code point
high-surrogate code unit
low-surrogate code point
low-surrogate code unit
Surrogates
D71
High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
D72
High-surrogate code unit: A 16-bit code unit in the range D800 to DBFF, used in
UTF-16 as the leading code unit of a surrogate pair.
D73
Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
D74
Low-surrogate code unit: A 16-bit code unit in the range DC00 to DFFF,
used in UTF-16 as the trailing code unit of a surrogate pair.
UTF-16
In the UTF-16 encoding form, non-surrogate code points in the range U+0000..U+FFFF
are represented as a single 16-bit code unit; code points in the supplementary planes,
in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units.
These pairs of special code units are known as surrogate pairs. The values of the code units
used for surrogate pairs are completely disjunct from the code units used for the
single code unit representations, thus maintaining non-overlap for all code point representations in UTF-16.
Code Points Unassigned to Abstract Characters
C1
A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.
• The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form.
They are unassigned to any abstract character.
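A minimal Python sketch of the surrogate arithmetic described above (just to illustrate the ranges; real UTF-16 encoders do this for you):
    def to_surrogate_pair(cp):
        # Only supplementary code points (U+10000..U+10FFFF) need a surrogate pair.
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)     # high (leading) surrogate, D800..DBFF
        low = 0xDC00 + (cp & 0x3FF)    # low (trailing) surrogate, DC00..DFFF
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F385)])  # ['0xd83c', '0xdf85']
    print("🎅".encode("utf-16-be").hex())                 # 'd83cdf85', the same pair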
(this is Java, but this answer is useful for NetRexx)
https://stackoverflow.com/questions/39955169/which-encoding-does-java-uses-utf-8-or-utf-16/39957184#39957184
Which encoding does Java use, UTF-8 or UTF-16?
Note that
new String(bytes, StandardCharsets.UTF_16);
does not "convert it to UTF-16 explicitly".
This string constructor takes a sequence of bytes, which is supposed to be in
the encoding that you have given in the second argument, and converts it to the
UTF-16 representation of whatever characters those bytes represent in that encoding.
You can't tell Java how to internally store strings.
It always stores them as UTF-16.
The constructor String(byte[],Charset) tells Java to create a UTF-16 string from
an array of bytes that is supposed to be in the given character set.
The method getBytes(Charset) tells Java to give you a sequence of bytes that
represent the string in the given encoding (charset).
And the method getBytes() without an argument does the same - but uses your
platform's default character set for the conversion.
Edit: in fact, Java 9 introduced just such a change in internal representation
of strings, where, by default, strings whose characters all fall in the ISO-8859-1
range are internally represented in ISO-8859-1, whereas strings with at least one
character outside that range are internally represented in UTF-16 as before.
So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.
(this thread contains answers useful for NetRexx)
https://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/
UCS vs UTF-8 as Internal String Encoding
jlf: Very good introduction 100% applicable to NetRexx
https://news.ycombinator.com/item?id=9618306
jlf: comments about the blog above.
Interesting comments about the need or non-need to have direct access to code
units or "characters" in constant time.
---
Unicode provides 3 classes of grapheme clusters (legacy, extended and tailored)
at least one of which (tailored) is locale-dependent (`ch` is a single tailored
grapheme cluster under the Slovak locale, because it's the ch digraph).
---
A text-editing control is thinking in terms of "grapheme clusters", not in terms
of codepoints.
jlf: not true. BBEdit works at codepoint level.
👩👨👩👧🎅
2 graphemes, 8 codepoints, 29 bytes
c2x = 'F09F91A9 E2808D F09F91A8 E2808D F09F91A9 E2808D F09F91A7 F09F8E85'
c2u = 'U+1F469 U+200D U+1F468 U+200D U+1F469 U+200D U+1F467 U+1F385'
c2g = 'F09F91A9E2808DF09F91A8E2808DF09F91A9E2808DF09F91A7 F09F8E85'
In BBEdit, I see 8 "characters" and can move the cursor between each "character".
The ZERO WIDTH JOINER codepoints are visible.
In VSCode, I see 2 "characters".
---
What is the practicality of an unbounded number of possible graphemes?
The standard itself doesn't mention any bounds but there is Unicode Standard Annex #15 -
Unicode Normalization Forms which defines the Stream-Safe Text Format.
UAX15-D3. Stream-Safe Text Format: A Unicode string is said to be in
Stream-Safe Text Format if it would not contain any sequences of
non-starters longer than 30 characters in length when normalized
to NFKD.
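A rough Python sketch of that definition, assuming unicodedata.combining(ch) != 0 is a good enough test for "non-starter":
    import unicodedata

    def is_stream_safe(s, limit=30):
        # UAX #15 Stream-Safe Text Format: no run of more than `limit`
        # non-starters (canonical combining class != 0) after NFKD.
        run = 0
        for ch in unicodedata.normalize("NFKD", s):
            if unicodedata.combining(ch) != 0:
                run += 1
                if run > limit:
                    return False
            else:
                run = 0
        return True

    print(is_stream_safe("noël"))               # True
    print(is_stream_safe("e" + "\u0301" * 31))  # False: 31 combining acute accents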
---
This sub-part of the thread is exactly what we are discussing for NetRexx
https://news.ycombinator.com/item?id=9620112
---
jlf: next description is exactly what I do in the Executor prototype.
What you really want is constant-time dereferencing of designators for semantically
meaningful substrings. But no language AFAIK actually has that. The fundamental
problem is that most languages have painted themselves into a corner by carving
into stone the fact that strings can be dereferenced by integers. Once you've
done that, you're pretty much screwed. It's not that you can't make it work,
it's just that it requires an awful lot of machinery. You basically need to build
an index for every string you construct, and that can get very expensive.
Fixed-width representations are a less-than-perfect-but-still-not-entirely-unreasonable
engineering solution to this problem.
(this is Python, but this link could be useful for NetRexx c2x and x2c)
https://docs.python.org/3/library/struct.html
Interpret bytes as packed binary data
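For example (the format string below is arbitrary, just to show the pack/unpack round trip):
    import struct

    data = struct.pack(">HI", 0x1F, 0x1F385)  # big-endian unsigned short + unsigned int
    print(data.hex())                          # '001f0001f385'
    print(struct.unpack(">HI", data))          # (31, 127877)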
Oracle
https://docs.oracle.com/database/121/NLSPG/ch5lingsort.htm#NLSPG288
Database Globalization Support Guide
5 Linguistic Sorting and Matching
Complex! Did not read in detail, maybe I should...
https://docs.oracle.com/database/121/NLSPG/ch6unicode.htm#NLSPG323
Database Globalization Support Guide
6 Supporting Multilingual Databases with Unicode
https://docs.oracle.com/database/121/NLSPG/ch7progrunicode.htm#NLSPG346
Database Globalization Support Guide
7 Programming with Unicode
Perl lang (Perl 6 has been renamed to Raku)
https://swigunicode.wordpress.com/2021/10/18/example-post-3/
SWIG and Perl: Unicode C Library
Part 1. Small Intro to SWIG
https://swigunicode.wordpress.com/2021/10/22/part-2-c-header-file/
Part 2. C Header File
https://swigunicode.wordpress.com/2021/10/24/part-3-c-source-file/
Part 3. C Source File
https://swigunicode.wordpress.com/2021/10/25/part-4-perl-source-file/
Part 4. Perl Source File
https://swigunicode.wordpress.com/2021/10/26/part-5-build-and-run-scripts/
Part 5. Build and Run Scripts
https://swigunicode.wordpress.com/2021/10/27/part-6-swig-interface-file/
Part 6. SWIG Interface File
https://lwn.net/Articles/667684/
An article about NFG.
Unless one specifies otherwise, Perl 6 normalizes a text string to NFC when it's not NFG.
PHP lang
https://github.com/nicolas-grekas/Patchwork-UTF8
Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP
https://kunststube.net/encoding/
jlf: First, a general introduction to encoding. Then a focus on PHP.
https://www.php.net/manual/en/function.iconv.php
iconv — Convert a string from one character encoding to another
iconv(string $from_encoding, string $to_encoding, string $string): string|false
https://www.php.net/manual/en/book.mbstring.php
Multibyte String
replicates all important string functions in a multi-byte aware fashion.
Because the mb_ functions now have to actually think about what they're doing,
they need to know what encoding they're working on. Therefore every mb_ function
accepts an $encoding parameter as well. Alternatively, this can be set globally
for all mb_ functions using mb_internal_encoding.
https://www.php.net/manual/en/mbstring.overload.php
Warning
This feature has been DEPRECATED as of PHP 7.2.0, and REMOVED as of PHP 8.0.0.
Relying on this feature is highly discouraged.
---
mbstring supports a 'function overloading' feature which enables you to add
multibyte awareness to such an application without code modification by
overloading multibyte counterparts on the standard string functions. For
example, mb_substr() is called instead of substr() if function overloading is
enabled. This feature makes it easy to port applications that only support
single-byte encodings to a multibyte environment in many cases.
---
jlf: the few user comments are all negative.
hum... this is one of the choices we foresee for Rexx. Bad idea?
Example: "In short, only use mbstring.func_overload if you are 100% certain that
nothing on your site relies on manipulating binary data in PHP."
Search PHP sources: grapheme
https://heap.space/search?project=PHP-8.2&full=grapheme&defs=&refs=&path=&hist=&type=
https://news-web.php.net/group.php?group=php.i18n
php.i18n
Most recent: 08 Feb 2018 (?)
https://news-web.php.net/php.i18n/1439
Unicode support with UString abstraction layer
21/03/2012 by Umberto Salsi
jlf: no URL
https://wiki.php.net/rfc/ustring
UString is much quicker than mbstring thanks to the use of ICU.
https://www.reddit.com/r/PHP/comments/2jvvol/rfc_ustring/
Low enthusiasm on reddit...
https://github.com/krakjoe/ustring
UnicodeString for PHP7
dead, last commit on Mar 17, 2016
https://github.com/nicolas-grekas/Patchwork-UTF8
Patchwork-UTF8
Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP
Dead? last commit was on May 18, 2016
https://blog.internet-formation.fr/2022/08/nettoyer-et-remplacer-les-homographes-et-homoglyphes-dun-texte-en-php/
Cleaning and replacing homographs (and homoglyphs) in a text, in PHP (article in French)
Python lang
https://github.com/dabeaz-course/practical-python/blob/master/Notes/01_Introduction/04_Strings.md
Practical Python Programming. A course by David Beazley
jlf: good introduction to Python strings.
https://www.youtube.com/watch?v=Nfqh6lr3frQ
The Guts of Unicode in Python
Benjamin Peterson
This talk will examine how Python's internal Unicode representation has changed
from its introduction through the latest major changes in Python 3.3.
jlf: not too long (28 min), good overview.
10/08/2021
List of Python PEPs related to strings.
https://www.python.org/dev/peps/
Other Informational PEPs
I 257 Docstring Conventions Goodger, GvR
I 287 reStructuredText Docstring Format Goodger
Accepted PEPs (accepted; may not be implemented yet)
SA 675 Arbitrary Literal String Type
SA 686 Make UTF-8 mode default
SA 701 Syntactic formalization of f-strings
Open PEPs (under consideration)
Finished PEPs (done, with a stable interface)
SF 100 Python Unicode Integration Lemburg
SF 260 Simplify xrange() GvR
SF 261 Support for "wide" Unicode characters Prescod
SF 263 Defining Python Source Code Encodings Lemburg, von Löwis
SF 277 Unicode file name support for Windows NT Hodgson
SF 278 Universal Newline Support Jansen
SF 292 Simpler String Substitutions Warsaw
SF 331 Locale-Independent Float/String Conversions Reis
SF 383 Non-decodable Bytes in System Character Interfaces von Löwis
SF 393 Flexible String Representation v. Löwis
SF 414 Explicit Unicode Literal for Python 3.3 Ronacher, Coghlan
SF 498 Literal String Interpolation Smith
SF 515 Underscores in Numeric Literals Brandl, Storchaka
SF 528 Change Windows console encoding to UTF-8 Dower
SF 529 Change Windows filesystem encoding to UTF-8 Dower
SF 538 Coercing the legacy C locale to a UTF-8 based locale Coghlan
SF 540 Add a new UTF-8 Mode Stinner
SF 597 Add optional EncodingWarning Naoki
SF 616 String methods to remove prefixes and suffixes Sweeney
SF 623 Remove wstr from Unicode Naoki
SF 624 Remove Py_UNICODE encoder APIs Naoki
SF 3101 Advanced String Formatting Talin
SF 3112 Bytes literals in Python 3000 Orendorff
SF 3120 Using UTF-8 as the default source encoding von Löwis
SF 3127 Integer Literal Support and Syntax Maupin
SF 3131 Supporting Non-ASCII Identifiers von Löwis
SF 3137 Immutable Bytes and Mutable Buffer GvR
SF 3138 String representation in Python 3000 Ishimoto
Deferred PEPs (postponed pending further research or updates)
SD 501 General purpose string interpolation Coghlan
SD 536 Final Grammar for Literal String Interpolation Angerer
SD 558 Defined semantics for locals() Coghlan
Abandoned, Withdrawn, and Rejected PEPs
SS 215 String Interpolation Yee
IR 216 Docstring Format Zadka
SR 224 Attribute Docstrings Lemburg
SR 256 Docstring Processing System Framework Goodger
SR 295 Interpretation of multiline string constants Koltsov
SR 332 Byte vectors and String/Unicode Unification Montanaro
SR 349 Allow str() to return unicode strings Schemenauer
IR 502 String Interpolation - Extended Discussion Miller
SR 3126 Remove Implicit String Concatenation Jewett, Hettinger
15/07/2021
review
https://docs.python.org/3/howto/unicode.html
Escape sequences in string literals
"\N{GREEK CAPITAL LETTER DELTA}" # Using the character name '\u0394'
"\u0394" # Using a 16-bit hex value '\u0394'
"\U00000394" # Using a 32-bit hex value '\u0394'
One can create a string using the decode() method of bytes.
This method takes an encoding argument, such as UTF-8, and optionally an errors argument.
The errors argument specifies the response when the input string can’t be converted
according to the encoding’s rules. Legal values for this argument are
'strict' (raise a UnicodeDecodeError exception),
'replace' (use U+FFFD, REPLACEMENT CHARACTER),
'ignore' (just leave the character out of the Unicode result),
'backslashreplace' (inserts a \xNN escape sequence).
Examples:
b'\x80abc'.decode("utf-8", "strict") # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0
b'\x80abc'.decode("utf-8", "replace") # '\ufffdabc'
b'\x80abc'.decode("utf-8", "backslashreplace") # '\\x80abc'
b'\x80abc'.decode("utf-8", "ignore") # 'abc'
Encodings are specified as strings containing the encoding’s name.
Python comes with roughly 100 different encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
One-character Unicode strings can also be created with the chr() built-in function,
which takes integers and returns a Unicode string of length 1 that contains
the corresponding code point:
chr(57344) # '\ue000'
The reverse operation is the built-in ord() function that takes a one-character
Unicode string and returns the code point value:
ord('\ue000') # 57344
The opposite method of bytes.decode() is str.encode(), which returns a bytes
representation of the Unicode string, encoded in the requested encoding.
The errors parameter is the same as the parameter of the decode() method
but supports a few more possible handlers.
'strict' (raise a UnicodeEncodeError exception),
'replace' inserts a question mark instead of the unencodable character,
'ignore' (just leave the character out of the Unicode result),
'backslashreplace' (inserts a \uNNNN escape sequence)
'xmlcharrefreplace' (inserts an XML character reference),
'namereplace' (inserts a \N{...} escape sequence).
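A quick illustration of these encode error handlers:
    s = "no\u00ebl \u20ac"   # "noël €"
    print(s.encode("ascii", "xmlcharrefreplace"))  # b'no&#235;l &#8364;'
    print(s.encode("ascii", "namereplace"))        # b'no\\N{LATIN SMALL LETTER E WITH DIAERESIS}l \\N{EURO SIGN}'
    print(s.encode("ascii", "backslashreplace"))   # b'no\\xebl \\u20ac'
    print(s.encode("ascii", "replace"))            # b'no?l ?'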
Unicode code points can be written using the \u escape sequence, which is
followed by four hex digits giving the code point. The \U escape sequence
is similar, but expects eight hex digits, not four
>>> s = "a\xac\u1234\u20ac\U00008000"
... # ^^^^ two-digit hex escape
... # ^^^^^^ four-digit Unicode escape
... # ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]
Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = 'abcdé'
https://www.python.org/dev/peps/pep-0263/
PEP 263 -- Defining Python Source Code Encodings
Comparing Strings
The casefold() string method converts a string to a case-insensitive
form following an algorithm described by the Unicode Standard. This
algorithm has special handling for characters such as the German letter ‘ß’
(code point U+00DF), which becomes the pair of lowercase letters ‘ss’.
>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'
The unicodedata module’s normalize() function converts strings to one of
several normal forms: ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
def compare_strs(s1, s2):
def NFD(s):
return unicodedata.normalize('NFD', s)
return NFD(s1) == NFD(s2)
The Unicode Standard also specifies how to do caseless comparisons:
def compare_caseless(s1, s2):
def NFD(s):
return unicodedata.normalize('NFD', s)
return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())
Why is NFD() invoked twice? Because there are a few characters that make
casefold() return a non-normalized string, so the result needs to be
normalized again. See section 3.13 of the Unicode Standard
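Usage sketch of the two helpers above (they assume `import unicodedata`):
    print(compare_strs("\u00EA", "e\u0302"))                        # True: 'ê' precomposed vs decomposed
    print(compare_caseless("Gürzenichstraße", "GÜRZENICHSTRASSE"))  # True: 'ß' folds to 'ss'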
https://docs.python.org/3/library/unicodedata.html
unicodedata.lookup(name)
Look up character by name.
If a character with the given name is found, return the corresponding character.
If not found, KeyError is raised.
Changed in version 3.3: Support for name aliases and named sequences has been added.
unicodedata.name(chr[, default])
Returns the name assigned to the character chr as a string.
unicodedata.decimal(chr[, default])
Returns the decimal value assigned to the character chr as integer.
unicodedata.digit(chr[, default])
Returns the digit value assigned to the character chr as integer.
unicodedata.numeric(chr[, default])
Returns the numeric value assigned to the character chr as float.
unicodedata.category(chr)
Returns the general category assigned to the character chr as string.
unicodedata.bidirectional(chr)
Returns the bidirectional class assigned to the character chr as string.
unicodedata.combining(chr)
Returns the canonical combining class assigned to the character chr as integer.
Returns 0 if no combining class is defined.
unicodedata.east_asian_width(chr)
Returns the east asian width assigned to the character chr as string.
unicodedata.mirrored(chr)
Returns the mirrored property assigned to the character chr as integer.
Returns 1 if the character has been identified as a “mirrored” character in bidirectional text, 0 otherwise.
unicodedata.decomposition(chr)
Returns the character decomposition mapping assigned to the character chr as string.
An empty string is returned in case no such mapping is defined.
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr.
Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
unicodedata.is_normalized(form, unistr)
Return whether the Unicode string unistr is in the normal form form.
Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
unicodedata.unidata_version
The version of the Unicode database used in this module.
unicodedata.ucd_3_2_0
This is an object that has the same methods as the entire module,
but uses the Unicode database version 3.2 instead
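A few calls illustrating the functions listed above:
    import unicodedata

    ch = "\u00E9"                            # é, precomposed
    print(unicodedata.name(ch))              # LATIN SMALL LETTER E WITH ACUTE
    print(unicodedata.category(ch))          # Ll
    print(unicodedata.decomposition(ch))     # 0065 0301
    print(unicodedata.combining("\u0301"))   # 230
    print(unicodedata.lookup("EURO SIGN"))   # €
    print(unicodedata.numeric("¾"))          # 0.75
    print(unicodedata.unidata_version)       # depends on the Python build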
https://www.python.org/dev/peps/pep-0393/
PEP 393 -- Flexible String Representation
When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristics failed.
With this PEP, this is now less practical, as you need not only a heuristics
for the length of the string, but also for the maximum character.
In order to avoid heuristics, you need to make two passes over the input:
once to determine the output length, and the maximum character; then
allocate the target string with PyUnicode_New and iterate over the input
a second time to produce the final output. While this may sound expensive,
it could actually be cheaper than having to copy the result again as in
the following approach.
If you take the heuristical route, avoid allocating a string meant to be
resized, as resizing strings won't work for their canonical representation.
Instead, allocate a separate buffer to collect the characters, and then
construct a unicode object from that using PyUnicode_FromKindAndData.
One option is to use Py_UCS4 as the buffer element, assuming for the worst
case in character ordinals. This will allow for pointer arithmetics, but
may require a lot of memory. Alternatively, start with a 1-byte buffer,
and increase the element size as you encounter larger characters.
In any case, PyUnicode_FromKindAndData will scan over the buffer to
verify the maximum character.
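The flexible representation can be observed with sys.getsizeof (exact byte counts vary by CPython version and platform, so treat them as indicative only):
    import sys

    print(sys.getsizeof("a" * 1000))           # ~1 byte per character (ASCII)
    print(sys.getsizeof("\u00E9" * 1000))      # still 1 byte per character (Latin-1 range)
    print(sys.getsizeof("\u0394" * 1000))      # ~2 bytes per character (BMP, beyond Latin-1)
    print(sys.getsizeof("\U0001F385" * 1000))  # ~4 bytes per character (supplementary plane)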
15/07/2021
https://docs.python.org/3/library/codecs.html
Codec registry and base classes
Most standard codecs are text encodings, which encode text to bytes, but there
are also codecs provided that encode text to text, and bytes to bytes.
errors string argument:
strict
ignore
replace
xmlcharrefreplace
backslashreplace
namereplace
surrogateescape
surrogatepass
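For example, surrogateescape (the handler Python itself uses for OS data that is not valid UTF-8) round-trips invalid bytes:
    raw = b"caf\xe9"                                     # Latin-1 bytes, not valid UTF-8
    s = raw.decode("utf-8", "surrogateescape")           # 'caf\udce9'
    print(s.encode("utf-8", "surrogateescape") == raw)   # True: lossless round trip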
15/07/2021
https://discourse.julialang.org/t/a-python-rant-about-types/43294/22
A Python rant about types
jlf: the main discussion is about invalid string data.
Stefan Karpinski describes the Julia strings:
1. You can read and write any data, valid or not.
2. It is interpreted as UTF-8 where possible and as invalid characters otherwise.
3. You can simply check if strings or chars are valid UTF-8 or not.
4. You can work with individual characters easily, even invalid ones.
5. You can losslessly read and write any string data, valid or not, as strings or chars.
6. You only get an error when you try to ask for the code point of an invalid char.
Most Julia code that works with strings is automatically robust with respect to
invalid UTF-8 data. Only code that needs to look at the code points of individual
characters will fail on invalid data; in order to do that robustly, you simply
need to check if the character is valid before taking its code point and handle
that appropriately.
jlf: I think that all the Julia methods working at character level will raise an error,
not just when looking at the code point.
jlf: Stefan Karpinski explains why Python design is problematic.
Python 3 has to be able to represent any input string in terms of code points.
Needing to turn every string into a fixed-width sequence of code points puts them
in a tough position with respect to invalid strings where there is simply no
corresponding sequence of code points.
17/07/2021
https://groups.google.com/g/python-ideas/c/wStIS1_NVJQ
Fix default encodings on Windows
jlf: did not read in detail; too long, too many replies.
Maybe some comments are interesting, so I save this URL.
https://djangocas.dev/blog/python-unicode-string-lowercase-casefold-caseless-match/
Interesting infos about caseless matching
https://gist.github.com/dpk/8325992
PyICU cheat sheet
10/05/2023
https://github.com/python/cpython/issues/56938
original URL before migration to github:
https://bugs.python.org/issue12729
Python lib re cannot handle Unicode properly due to narrow/wide bug
jlf: TODO not yet read, but seems interesting.
I found this link thanks to https://news.ycombinator.com/item?id=9618306 (referenced
in the NetRexx section)
https://peps.python.org/pep-0414/
PEP 414 – Explicit Unicode Literal for Python 3.3
Specifically, the Python 3 definition for string literal prefixes will be expanded to allow:
"u" | "U"
in addition to the currently supported:
"r" | "R"
The following will all denote ordinary Python 3 strings:
'text'
"text"
'''text'''
"""text"""
u'text'
u"text"
u'''text'''
u"""text"""
U'text'
U"text"
U'''text'''
U"""text"""
Types of string and their methods:
string "H" "H"[0] # "H"
unicode string u"H" u"H"[0] # "H"
byte string b"H" b"H"[0] # 72 string of 8-bit bytes
raw string r"H" r"H"[0] # "H" string literals with an uninterpreted backslash.
f-string f"H" f"H"[0] # "H" string with formatted expression substitution.
dir(""), dir(f""), dir(r"") dir(b"")
-------------------------------------------------
__add__ __add__
__bytes__
__class__ __class__
__contains__ __contains__
__delattr__ __delattr__
__dir__ __dir__
__doc__ __doc__
__eq__ __eq__
__format__ __format__
__ge__ __ge__
__getattribute__ __getattribute__
__getitem__ __getitem__
__getnewargs__ __getnewargs__
__getstate__ __getstate__
__gt__ __gt__
__hash__ __hash__
__init__ __init__
__init_subclass__ __init_subclass__
__iter__ __iter__
__le__ __le__
__len__ __len__
__lt__ __lt__
__mod__ __mod__
__mul__ __mul__
__ne__ __ne__
__new__ __new__
__reduce__ __reduce__
__reduce_ex__ __reduce_ex__
__repr__ __repr__
__rmod__ __rmod__
__rmul__ __rmul__
__setattr__ __setattr__
__sizeof__ __sizeof__
__str__ __str__
__subclasshook__ __subclasshook__
capitalize capitalize
casefold
center center
count count
decode
encode
endswith endswith
expandtabs expandtabs
find find
format
format_map
fromhex
hex
index index
isalnum isalnum
isalpha isalpha
isascii isascii
isdecimal
isdigit isdigit
isidentifier
islower islower
isnumeric
isprintable
isspace isspace
istitle istitle
isupper isupper
join join
ljust ljust
lower lower
lstrip lstrip
maketrans maketrans
partition partition
removeprefix removeprefix
removesuffix removesuffix
replace replace
rfind rfind
rindex rindex
rjust rjust
rpartition rpartition
rsplit rsplit
rstrip rstrip
split split
splitlines splitlines
startswith startswith
strip strip
swapcase swapcase
title title
translate translate
upper upper
zfill zfill
https://stackoverflow.com/questions/72371202/unicodedecodeerror-utf-8-codec-cant-decode-byte-0x97-in-position-3118-inval
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte Simple text file [duplicate]
(jlf: just keeping a note for the example)
It seems like the file is not encoded in utf-8. Could you try open the file using
io.open with latin-1 encoding instead?
https://docs.python.org/3/library/functions.html#open
--- (example)
from textblob import TextBlob   # imported in the original answer, unused in this snippet
import io
with io.open("positive.txt", encoding='latin-1') as f:
    for line in f.read().split('\n'):
        pass   # do what you want with line
---
https://github.com/life4/textdistance
Compute distance between sequences.
30+ algorithms, pure python implementation, common interface, optional external libs usage.
Reimplemented in Rust by the same author: https://github.com/life4/textdistance.rs
Testing the JMB's example
"ς".upper() # 'Σ'
"σ".upper() # 'Σ'
"ὈΔΥΣΣΕΎΣ".lower() # 'ὀδυσσεύς' last Σ becomes ς
"ὈΔΥΣΣΕΎΣA".lower() # 'ὀδυσσεύσa' last Σ becomes σ
# Humm... the concatenation doesn't change ς to σ
"ὈΔΥΣΣΕΎΣ".lower() + "A" # 'ὀδυσσεύςA'
("ὈΔΥΣΣΕΎΣ".lower() + "A").upper() # 'ὈΔΥΣΣΕΎΣA'
("ὈΔΥΣΣΕΎΣ".lower() + "A").upper().lower() # 'ὀδυσσεύσa'
https://news.ycombinator.com/item?id=33984308
The History and rationale of the Python 3 Unicode model for the operating system (vstinner.github.io)
jlf: HN comments about this old blog
https://vstinner.github.io/python30-listdir-undecodable-filenames.html
https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h
(search "Unicode Type")
CPython source code of Unicode string
This URL comes from
https://blog.vito.nyc/posts/gil-balm/
Fast string construction for CPython extensions
https://python.developpez.com/tutoriels/plonger-au-coeur-de-python/?page=chapitre-4-moins-strings
jlf: todo read (in French)
A translation from English; could not find the original article.
R lang
https://stringi.gagolewski.com/index.html
stringi: Fast and Portable Character String Processing in R
stringi (pronounced “stringy”, IPA [strinɡi]) is THE R package for very fast, portable,
correct, consistent, and convenient string/text processing in any locale or character encoding.
Thanks to ICU, stringi fully supports a wide range of Unicode standards.
Paper (PDF): https://www.jstatsoft.org/index.php/jss/article/view/v103i02/4324
https://github.com/gagolews/stringi
Fast and Portable Character String Processing in R (with the Unicode ICU)
RAKU lang Rakudo lang (Perl6, Perl 6, MOAR-VM)
https://raku-advent.blog/2022/12/23/sigils-2/
jlf: not related to unicode, but good for general culture.
A sigil is any non-alphabetic character that’s used at the front of a word, and
that conveys meta information about the word. For example, hashtags are a sigil:
the # in #nofilter is a sigil that communicates that “nofilter” is a tag
(not a regular word of text). The Raku programming language uses sigils to mark
its variables; Raku has four sigils:
@ (normally associated with arrays),
can only be used for types that implement the Positional (“array-like”) role
% (normally associated with hashes),
can only be used for types that implement the Associative (“hash-like”) role
& (normally associated with functions)
can only be used for types that implement the Callable (“function-like”) role
$ (for other variables, such as numbers and strings).
https://dev.to/lizmat/series/24075
Migrating Perl to Raku Series' Articles
jlf: not related to unicode, but good for general culture.
http://docs.p6c.org/routine.html
Raku Routines
This is a list of all built-in routines that are documented here as part of the Raku language.
jlf: not related to unicode, but good for general culture.
https://www.learningraku.com/2016/11/26/quick-tip-11-number-strings-and-numberstring-allomorphs/
Quick Tip #11: Number, Strings, and NumberString Allomorphs
jlf: maybe the same as ooRexx string numbers?
https://docs.raku.org/type/Stringy String or object that can act as a string (role)
https://rakudocs.github.io/type/Allomorph Dual value number and string (class)
https://docs.raku.org/type/IntStr Dual value integer and string (class)
https://docs.raku.org/type/RatStr Dual value rational number and string (class)
https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc
MoarVM string documentation.
jlf: little intro, no detailed API.
https://docs.raku.org/type/Str
class Str
Built-in class for strings. Objects of type Str are immutable.
https://docs.raku.org/type/Uni
class Uni
A string of Unicode codepoints
Unlike Str, which is made of grapheme clusters, Uni is a string strictly made of
Unicode codepoints. That is, base characters and combining characters are
separate elements of a Uni instance.
Uni presents itself with a list-like interface of integer Codepoints.
Typical usage of Uni is through one of its subclasses, NFC, NFD, NFKD and NFKC,
which represent strings in one of the Unicode Normalization Forms of the same name.
https://course.raku.org/essentials/strings/string-concatenation/
String concatenation
jlf: strange... the concatenation is not described in the doc of Str.
In Raku, you concatenate strings using concatenation operator.
This operator is a tilde: ~.
my $greeting = 'Hello, ';
my $who = 'World!';
say $greeting ~ $who;
Concatenation with assignment
$str = $str ~ $another-str;
$str ~= $another-str;
https://www.codesections.com/blog/raku-unicode/
A deep dive into Raku's Unicode support
Grepping for "Unicode Character Database" brings us to unicode_db.c.
https://github.com/MoarVM/MoarVM/blob/master/src/strings/unicode_db.c
29/05/2021
http://moarvm.com/releases.html
2017.07
Greatly reduce the cases when string concatenation needs renormalization
Use normalize_should_break to decide if concat needs normalization
Rename should_break to MVM_unicode_normalize_should_break
Fix memory leak in MVM_nfg_is_concat_stable
If both last_a and first_b during concat are non-0 CCC, re-NFG
--> maybe to review: the last sentence seems to be an optimization of concatenation.
2017.02
Implement support for synthetic graphemes in MVM_unicode_string_compare
Implement configurable collation_mode for MVM_unicode_string_compare
2017.01
Add a new unicmp_s op, which compares using the Unicode Collation Algorithm
Add support for Grapheme_Cluster_Break=Prepend from Unicode 9.0
Add a script to download the latest version of all of the Unicode data
--> should review this script
2015.11
NFG now uses Unicode Grapheme Cluster algorithm; "\r\n" is now one grapheme
--> ??? [later] ah, I had a bug! Was not analyzing a UTF-8 ASCII string... Now fixed:
"0A0D"x~text~description= -- UTF-8 ASCII ( 2 graphemes, 2 codepoints, 2 bytes )
"0D0A"x~text~description= -- UTF-8 ASCII ( 1 grapheme, 2 codepoints, 2 bytes )
29/05/2021
https://news.ycombinator.com/item?id=26591373
String length functions for single emoji characters evaluate to greater than 1
--> to check: does MoarVM really concatenate an 8-bit string with a 32-bit string using a string concatenation object?
You could do it the way Raku does. It's implementation defined. (Rakudo on MoarVM)
The way MoarVM does it is that it does NFG, which is sort of like NFC except that it stores grapheme clusters as if they were negative codepoints.
If a string is ASCII it uses an 8bit storage format, otherwise it uses a 32bit one.
It also creates a tree of immutable string objects.
If you do a substring operation it creates a substring object that points at an existing string object.
If you combine two strings it creates a string concatenation object. Which is useful for combining an 8bit string with a 32bit one.
All of that is completely opaque at the Raku level of course.
my $str = "\c[FACE PALM, EMOJI MODIFIER FITZPATRICK TYPE-3, ZWJ, MALE SIGN, VARIATION SELECTOR-16]";
say $str.chars; # 1
say $str.codes; # 5
say $str.encode('utf16').elems; # 7
say $str.encode('utf16').bytes; # 14
say $str.encode.elems; # 17
say $str.encode.bytes; # 17
say $str.codes * 4; # 20
#(utf32 encode/decode isn't implemented in MoarVM yet)
say for $str.uninames;
# FACE PALM
# EMOJI MODIFIER FITZPATRICK TYPE-3
# ZERO WIDTH JOINER
# MALE SIGN
# VARIATION SELECTOR-16
The reason we have utf8-c8 encode/decode is because filenames, usernames, and passwords are not actually Unicode.
(I have 4 files all named rèsumè in the same folder on my computer.)
utf8-c8 uses the same synthetic codepoint system as grapheme clusters.
https://andrewshitov.com/2018/10/31/unicode-in-perl-6/
Unicode in Raku
https://docs.raku.org/language/unicode
Raku applies normalization by default to all input and output except for file names,
which are read and written as UTF8-C8
UTF-8 Clean-8 is an encoder/decoder that primarily works as the UTF-8 one. However,
upon encountering a byte sequence that will either not decode as valid UTF-8, or
that would not round-trip due to normalization, it will use NFG synthetics to
keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8
will be able to recreate the bytes as they originally existed.
https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc
Strings in MoarVM
Strands
Strands are a type of MVMString which instead of being a flat string with contiguous data,
actually contains references to other strings. Strands are created during concatenation
or substring operations. When two flat strings are concatenated together, a Strand with
references to both string a and string b is created. If string a and string b were strands
themselves, the references of string a and references of string b are copied one after another
into the Strand.
Synthetics
Synthetics are graphemes which contain multiple codepoints. In MoarVM these are stored
and accessed using a trie, while the actual data itself stores the base character separately
and then the combiners are stored in an array.
Currently the maximum number of combiners in a synthetic is 1024.
MoarVM will throw an exception if you attempt to create a grapheme with more than 1024 codepoints in it.
Normalization
MoarVM normalizes into NFG form all input text.
NFG
Normalization Form Grapheme. Similar to NFC except graphemes which contain multiple codepoints
are stored in Synthetic graphemes.
https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/
Types
Str type: graphemes
say "नि".codes; # returns 2
say "नि".chars; # returns 1
say "\r\n".chars; # returns 1
NFC, NFD, NFKC, NFKD: types (jlf: types? really?)
Uni: work with codepoints, no normalization (keep text as-is)
Blob: family of types to work at the binary level
Unicode source code
say 0 ∈ «42 -5 1».map(&log ∘ &abs);
say 0.1e0 + 0.2e0 ≅ 0.3e0;
say 「There is no \escape in here!」
"Texas" source code
say 0 (elem) <<42 -5 1>>.map(&log o &abs);
say 0.1e0 + 0.2e0 =~= 0.3e0;
say Q[[[There is no \escape in here!]]]
https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/#comment-14302
jlf: interesting criticism of graphemes.
See also the following comment, which provides answers to that criticism.
https://lwn.net/Articles/667036/
Unicode, Perl 6, and You
jlf: interesting opinions.
https://en.wikipedia.org/wiki/Devanagari#Conjunct_consonants
jlf: this is executable code (what is this notation < षि > ?)
< षि > .NFC .say # NFC:0x<0937 093f>
< षि > .NFD .say # NFD:0x<0937 093f>
< षि > .NFKC .say # NFKC:0x<0937 093f>
< षि > .NFKD .say # NFKD:0x<0937 093f>
Particularly interesting, this subthread:
https://lwn.net/Articles/667669/
Is the current Unicode design impractical?
jlf tests
# Returns a list of Unicode codepoint numbers that describe the codepoints making up the string
"aå«".ords # (97 229 171)
# Returns the codepoint number of the base characters of the first grapheme in the string
"å«".ord # 229
"Bundesstraße im Freiland".lc # bundesstraße im freiland
"Bundesstraße im Freiland".uc # BUNDESSTRASSE IM FREILAND
"Bundesstraße im Freiland".fc # bundesstrasse im freiland
"Bundesstraße im Freiland".index("Freiland") # 16 (start at 0) (executor: 17)
"Bundesstraße im Freiland".index("freiland", :ignorecase) # 16
# Bundesstraße sss sßs ss
# 01234567890123456789012
# | | || || |
"Bundesstraße sss sßs ss".indices("ss") # (5 13 21)
"Bundesstraße sss sßs ss".indices("ss", :overlap) # (5 13 14 21)
"Bundesstraße sss sßs ss".indices("ss", :ignorecase) # (5 10 13 18 21)
"Bundesstraße sss sßs ss".indices("ss", :ignorecase, :overlap) # (5 10 13 14 18 21) not 17?
"Bundesstraße sss sßs ss".indices("s", :ignorecase, :overlap) # (5 6 13 14 15 17 19 21 22)
"Bundesstraße sss sßs ss".indices("sSs", :ignorecase, :overlap) # (13 17 18)
"Bundesstraße sss sßs ss".indices("sSsS", :ignorecase, :overlap) # (17)
"Bündesstraße sss sßs ss".fc # bundesstrasse sss ssss ss
# 0123456789012345678901234
# | | || ||| |
"Bündëssträßë sss sßs ss".fc.indices("ss") # (5 10 14 18 20 23)
"Bündëssträßë sss sßs ss".fc.indices("ss", :overlap) # (5 10 14 15 18 19 20 23)
# straßssßßssse
# 0123456789012
# || ||||
"straßssßßssse".indices("Ss", :ignorecase) # (4 7 9)
"straßssßßssse".indices("Ss", :ignorecase, :overlap) # (4 5 7 8 9 10)
"TÊt\c[TAG SPACE]e".chars # 4, "t" + "TAG SPACE" is one grapheme
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc # TÊte sss ssss ss têTE
# 012345678901234567890
# ^ ^ || ||| | ^ ^
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".indices("ss") # (5 13)
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".indices("ss", :ignorecase) # (5 10 13)
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("ss") # (5 9 11 14) 11? why not 10? because no overlap
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("ss", :overlap) # (5 6 9 10 11 14)
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te") # ()
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te", :ignorecase) # (19)
"TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te", :ignoremark) # (0 2 17 19) so TAG SPACE is ignored when :ignoremark
# Matching inside a grapheme
"noël👩👨👩👧🎅".indices("👧🎅") # ()
"noël👩👨👩👧🎅".indices("👨👩") # ()
# Matching a ligature
# bâfflé
# 012 3
"bâfflé".indices("é") # (3)
"bâfflé".indices("ffl") # ()
"bâfflé".indices("ffl", :ignorecase) # (2)
https://raku-advent.blog/2022/12/22/day-22-hes-making-a-list-part-1/
Unicode’s CLDR (Common Locale Data Repository)
jlf: to read...
https://www.nu42.com/2015/12/perl6-newline-translation-broken.html
Newline translation in Perl6 is broken
A. Sinan Unur
December 11, 2015
---
jlf:
Referenced from https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/#comment-14382
I reference this URL in case \r\n versus \r is a problem for Rexx Unicodified.
For Unicode, \r\n is one grapheme. Maybe no relation with the failed test cases.
Was fixed like that:
https://github.com/Raku/old-issue-tracker/issues/4849#issuecomment-570873506
* We do translation of \r\n graphemes to \n on all input read as text except
sockets, independent of platform
* We do translation of all \n graphemes to \r\n on text output to handles
except sockets, on Windows only
* \n is now, unless `use newline` is in force, always \x0A
* We don't do any such translation when using .encode/.decode, and of course
when reading/writing Bufs to files, providing an escape hatch from translation if needed
https://6guts.wordpress.com/2015/11/21/what-one-christmas-elf-has-been-up-to/
jlf: referenced for the section NFG improvements.
https://6guts.wordpress.com/2015/10/15/last-week-unicode-case-fixes-and-much-more/
jlf: referenced for the section A case of Unicode.
Testing the JMB's example
"ς".uc # Σ
"σ".uc # Σ
"ὈΔΥΣΣΕΎΣ".lc # ὀδυσσεύς last Σ becomes ς
"ὈΔΥΣΣΕΎΣA".lc # ὀδυσσεύσa last Σ becomes σ
# Humm... the concatenation doesn't change ς to σ
"ὈΔΥΣΣΕΎΣ".lc ~ "A" # ὀδυσσεύςA
("ὈΔΥΣΣΕΎΣ".lc ~ "A").uc # ὈΔΥΣΣΕΎΣA
("ὈΔΥΣΣΕΎΣ".lc ~ "A").uc.lc # ὀδυσσεύσa
https://stackoverflow.com/questions/39663846/how-can-i-make-perl-6-be-round-trip-safe-for-unicode-data
How can I make Perl 6 be round-trip safe for Unicode data?
Answer: UTF8-C8 isn't really a good solution (but is probably the only solution currently).
jlf: asked in 2016-09-23, maybe the situation is better today.
https://rosettacode.org/wiki/String_comparison#Raku
String comparisons never do case folding because that's a very complicated subject
in the modern world of Unicode. (You can explicitly apply an appropriate case-folding
function to the arguments before doing the comparison, or for "equality" testing you
can do matching with a case-insensitive regex, assuming Unicode's language-neutral
case-folding rules are okay.)
---
Be aware that Raku applies normalization (Unicode NFC form (Normalization Form Canonical))
by default to all input and output except for file names See docs. Raku follows the Unicode spec.
Raku follows all of the Unicode spec, including parts that some people don't like.
There are some graphemes for which the Unicode consortium has specified that the
NFC form is a different (though usually visually identical) grapheme. Referred to
in Unicode standard annex #15 as Canonical Equivalence. Raku adheres to that spec.
https://docs.raku.org/language/traps#Traps_to_avoid
Some problems that might arise when dealing with strings
https://raku.guide/#_unicode
Escape characters
say "\x0061";
say "\c[LATIN SMALL LETTER A]";
Numbers
say (٤,٥,٦,1,2,3).sort; # (1 2 3 4 5 6)
say 1 + ٩; # 10
Raku has methods/operators that implement the Unicode Collation Algorithm.
say 'a' unicmp 'B'; # Less
Raku provides a collate method that implements the Unicode Collation Algorithm.
say ('a','b','c','D','E','F').sort; # (D E F a b c)
say ('a','b','c','D','E','F').collate; # (a b c D E F)
Rexx lang
11/08/2021
http://nokix.sourceforge.net/help/learn_rexx/funcs5.htm#VALUEIN
Reads in a numeric value from a binary (ie, non-text) file.
value = VALUEIN(stream, position, length, options)
Args
stream is the name of the stream.
It can include the full path to the stream (ie, any drive and directory names).
If omitted, the default is to read from STDIN.
position specifies at what character position (within the stream) to start
reading from, where 1 means to start reading at the very first character
in the stream. If omitted, the default is to resume reading at where a
previous call to CHARIN() or VALUEIN() left off (ie, where your current
read character position is).
length is a 1 to read in the next binary byte (ie, 8-bit value), a 2 to
read in the next binary short (ie, 16-bit value), or a 4 to read in the
next binary long (ie, 32-bit value). If length is omitted, VALUEIN() defaults to reading a byte.
options can be any of the following:
M The value is stored (in the stream) in Motorola (big endian) byte order,
rather than Intel (little endian) byte order.
This affects only long and short values.
H Read in the value as hexadecimal (rather than the default of base 10,
or decimal, which is the base that REXX uses to express numbers).
The value can later be converted with X2D().
B Read in the value as binary (base 2).
- The value is signed (as opposed to unsigned).
V stream is the actual data string from which to extract a value.
You can now replace calls to SUBSTR and C2D with a single, faster call to VALUEIN.
If omitted, options defaults to none of the above.
Returns
The value, if successful.
If an error, an empty string is returned (unless the NOTREADY condition
is trapped via CALL method. Then, a '0' is returned).
http://nokix.sourceforge.net/help/learn_rexx/funcs5.htm#VALUEOUT
Write out numeric values to a binary (ie, non-text) file (ie, in non-text format).
result = VALUEOUT(stream, values, position, size, options)
Args
stream is the name of the stream.
It can include the full path to the stream (ie, any drive and directory names).
If omitted, the default is to write to STDOUT (typically, display the data in the console window).
position specifies at what character position (within the stream) to start writing the data,
where 1 means to start writing at the very first character in the stream.
If omitted, the default is to resume writing at where a previous call to
CHAROUT() or VALUEOUT() left off (or where the "write character pointer" was set via STREAM's SEEK).
values are the numeric values (ie, data) to write out.
Each value is separated by one space.
size is a 1 if each value is to be written as a byte (ie, 8-bit value),
2 if each value is to be written as a short (16-bit value),
or 4 if each value is to be written as a long (32-bit value). If omitted, size defaults to 1.
options can be any of the following:
M Write out the values in Motorola (big endian) byte order,
rather than Intel (little endian) byte order. This affects only long and short values.
H The values you supplied are specified in hexadecimal.
B The values you supplied are specified in binary (base 2).
V stream is the name of a variable, and the data will be overlaid
onto that variable's value. You can now replace calls to D2C and
OVERLAY with a single, faster call to VALUEOUT, especially when
a variable has a large amount of non-text data.
If omitted, options defaults to none of the above.
Returns
0 if the string was written out successfully.
If an error, VALUEOUT() returns non-zero.
http://www.dg77.net/tekno/manuel/rexxendian.htm
Endianness test
/* Check endianness */
/* For processing data encoded in UTF-8 */
/* Adapt if another encoding is used */
CALL CONV8_16 ' '
IF c2x(sortie) = '2000' THEN DO
endian = 'LE' /* little endian */
blanx = '2000'
END
ELSE DO
endian = 'BE' /* big endian */
blanx = '0020'
END
return endian blanx
/* ********************************************************************** */
/* Conversion UTF-8 -> UNICODE */
CONV8_16:
parse arg entree
sortie = ''
ZONESORTIE.='NUL'; ZONESORTIE.0=0
err = systounicode(entree, 'UTF8', , ZONESORTIE.)
if err == 0 then sortie = ZONESORTIE.!TEXT
else say 'probleme car., code ' err
return
http://www.dg77.net/tekno/xhtml/codage.htm
Le codage des caractères (character encoding; article in French)
To read; some information about code pages could be useful.
Regina doc
EXPORT(address, [string], [length] [,pad]) - (AREXX)
Copies data from the (optional) string into a previously-allocated memory area, which must be
specified as a 4-byte address. The length parameter specifies the maximum number of characters to
be copied; the default is the length of the string. If the specified length is longer than the string, the
remaining area is filled with the pad character or nulls('00'x). The returned value is the number
of characters copied.
Caution is advised in using this function. Any area of memory can be overwritten, possibly
causing a system crash.
See also STORAGE() and IMPORT().
Note that the address specified is subject to a machine's endianness.
EXPORT('0004 0000'x,'The answer') '10'
IMPORT(address [,length]) - (AREXX)
Creates a string by copying data from the specified 4-byte address. If the length parameter is not
supplied, the copy terminates when a null byte is found.
See also EXPORT()
Note that the address specified is subject to a machine's endianness.
IMPORT('0004 0000'x,10) 'The answer' /* maybe */
Ruby lang
jlf note:
still searching for articles/blogs comparing Ruby's approach (multiple encodings)
with languages that force a conversion to Unicode (be it UTF-8 or Unicode scalars).
https://docs.ruby-lang.org/en/3.2/String.html
class String
---
jlf: focus on comparison.
I did not find the definition of "compatible".
Methods for Comparing
==, ===: Returns true if a given other string has the same content as self.
eql?: Returns true if the content is the same as the given other string.
<=>: Returns -1, 0, or 1 as a given other string is smaller than, equal to, or larger than self.
casecmp: Ignoring case, returns -1, 0, or 1 as a given other string is smaller than, equal to, or larger than self.
casecmp?: Returns true if the string is equal to a given string after Unicode case folding; false otherwise.
Returns false if the two strings’ encodings are not compatible:
"\u{e4 f6 fc}" == ("\u{e4 f6 fc}") # => true
"\u{e4 f6 fc}".encode("ISO-8859-1") == ("\u{e4 f6 fc}") # => false
"\u{e4 f6 fc}".eql?("\u{e4 f6 fc}") # => true
"\u{e4 f6 fc}".encode("ISO-8859-1").eql?("\u{e4 f6 fc}") # => false
# "äöü" "ÄÖÜ"
"\u{e4 f6 fc}".casecmp("\u{c4 d6 dc}") # => 1
"\u{e4 f6 fc}".encode("ISO-8859-1").casecmp("\u{c4 d6 dc}") # => nil
https://yehudakatz.com/2010/05/17/encodings-unabridged/
Encodings, Unabridged
jlf: this article explains why the Ruby team consider that Unicode is not a
good solution for CJK.
https://ruby-doc.org/current/Encoding.html
https://github.com/ruby/ruby/blob/master/encoding.c
jlf: search "compat"
https://docs.ruby-lang.org/en/master/encodings_rdoc.html
Encodings
---
jlf: Executor has similar support for encodings, with fewer defaults and fewer
supported encodings. Otherwise the technical solution is the same: all
encodings are equal, there is no forced internal encoding, no forced
conversion.
---
Default encodings:
- Encoding.default_external: the default external encoding
- Encoding.default_internal: the default internal encoding (may be nil)
- locale: the default encoding for a string from the environment
- filesystem: the default encoding for a string from the filesystem
String encoding
A Ruby String object has an encoding that is an instance of class Encoding.
The encoding may be retrieved by method String#encoding.
's'.encoding # => #<Encoding:UTF-8>
The default encoding for a string literal is the script encoding
The encoding for a string may be changed:
s = "R\xC3\xA9sum\xC3\xA9" # => "Résumé"
s.encoding # => #<Encoding:UTF-8>
s.force_encoding('ISO-8859-1') # => "R\xC3\xA9sum\xC3\xA9"
s.encoding # => #<Encoding:ISO-8859-1>
Stream Encodings
Certain stream objects can have two encodings; these objects include instances of:
IO.
File.
ARGF.
StringIO.
The two encodings are:
- An external encoding, which identifies the encoding of the stream.
The default external encoding is:
- UTF-8 for a text stream.
- ASCII-8BIT for a binary stream.
- An internal encoding, which (if not nil) specifies the encoding to be used
for the string constructed from the stream.
The default internal encoding is nil (no conversion).
Script Encoding
The default script encoding is UTF-8; a Ruby source file may set its script
encoding with a magic comment on the first line of the file (or second line,
if there is a shebang on the first).
The comment must contain the word coding or encoding, followed by a colon,
space and the Encoding name or alias:
# encoding: ISO-8859-1
__ENCODING__ #=> #<Encoding:ISO-8859-1>
This example writes a string to a file, encoding it as ISO-8859-1, then reads
the file into a new string, encoding it as UTF-8:
s = "R\u00E9sum\u00E9"
path = 't.tmp'
ext_enc = 'ISO-8859-1'
int_enc = 'UTF-8'
File.write(path, s, external_encoding: ext_enc)
raw_text = File.binread(path) # "R\xE9sum\xE9"
transcoded_text = File.read(path, external_encoding: ext_enc, internal_encoding: int_enc) # "Résumé"
https://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
3 Steps to Fix Encoding Problems in Ruby
The major difference between encode and force_encoding is that encode might
change bytes, and force_encoding won’t.
In ASCII-8BIT, every character is represented by a single byte.
That is, str.chars.length == str.bytes.length.
https://www.cloudbees.com/blog/how-ruby-string-encoding-benefits-developers
Familiarize Yourself with Ruby String Encoding
written August 14, 2018
Ruby encoding methods
- String#force_encoding is a way of saying that we know the bits for the characters
are correct and we simply want to properly define how those bits are to be
interpreted to characters.
- String#encode will transcode the bits themselves that form the characters from
whatever the string is currently encoded as to our target encoding.
Example of the byte size being different from the character length:
"łał".size
# => 3
"łał".bytesize
# => 5
Different operating systems have different default character encodings so
programming languages need to support these.
Encoding.default_external
# => #<Encoding:UTF-8>
Ruby defaults to UTF-8 as its encoding so if it is opening up files from the
operating system and the default is different from UTF-8, it will transcode the
input from that encoding to UTF-8. If this isn't desirable, you may change the
default internal encoding in Ruby with Encoding.default_internal. Otherwise you
can use specific IO encodings in your Ruby code.
File.open(filename, 'r:UTF-8', &:read)
# or
File.open(filename, external_encoding: "ASCII-8BIT", internal_encoding: "ASCII-8BIT") do |f| f.read end
Lately, I've been integrating Ruby's encoding support to Rust with the library Rutie.
Rutie allows you to write Rust that works in Ruby and Ruby that works in Rust.
jlf: see Rutie in Rust lang.
https://ruby-core.ruby-lang.narkive.com/RDKAvdS7/20483-encoding-of-symbols#post2
[ruby-core:20483] encoding of symbols
---
jlf: AT LAST! I found a discussion about the comparison of strings.
LONG thread, to carefully read.
---
This message 2008-12-14 is a good summary!
Is it still correct today?
https://ruby-core.ruby-lang.narkive.com/RDKAvdS7/20483-encoding-of-symbols#post12
- String operations are done using the bytes in the strings - they are not
converted to codepoints internally
- String equality comparisons seem to be simply done on a byte-by-byte
basis, without regard to the encoding
- *However* other operations are not simply byte-by-byte. They are done
character-by-character, but without converting to codepoints - eg: a 3
byte character is kept as 3 bytes. For example this means that when
operating on a variable-length encoding, simple operations like indexing
can be inefficient, as Ruby may have to scan through the string from the
start. However Ruby does try to optimize this where possible.
- There is also a concept of "compatible encodings". Given 2 encodings e1
& e2, e1 is compatible with e2 if the representation of every character in
e1 is the same as in e2. This implies that e2 must be a "bigger" encoding
than e1 - ie: e2 is a superset of e1. Typically we are mainly talking
about US-ASCII here, which is compatible with most other character sets
that are either all single-byte (eg: all the ISO-8859 sets) or are
variable-length multi-byte (eg: UTF-8).
- When operating on encodings e1 & e2, if e1 is compatible with e2, then
Ruby treats both strings as being in encoding e2.
- String#> and String#< are a bit weird. Normally they are just done on a
byte-by-byte basis, UNLESS the strings are the same and are incompatible
encodings, then they always seem to return FALSE. (I have to check this -
it may be more complicated than this).
- When operating on incompatible encodings, *normally* non-comparison
operations (including regexp matches) raise an "Encoding Compatibility
Error".
- However there appears to be an exception to this: if operating on 2
incompatible encodings AND US-ASCII is compatible with both, AND both
strings are US-ASCII strings, then the operation appears to proceed,
treating both as US-ASCII. For example "abc" as an ISO-8859-1 and "abc" as
UTF-8. I guess this is Ruby being "forgiving". (Personally I am not sure
if this is good or bad). The encoding of the result (for example of a
string concatenation) seems to be one of the 2 original encodings - I
haven't figured out the logic to this yet :)
---
jlf: this one seems ugly...
Actually I just checked this, and this is wrong, sorry. I ended up looking
at the source code of rb_str_cmp() in string.c, and here is what I think
it does:
- it does a byte-by-byte comparison. Assuming the strings are different,
Ruby returns what you would expect based on this.
- if the strings are byte for byte identical, but they have incompatible
encodings and at least one of the strings contains a non-ASCII character,
then it seems that the result is determined by the ordering of the
encodings, based on ruby's "encoding index" - an internal ordering of the
available encodings. Maybe I have got this wrong - it doesn't make a lot
of sense to me!
---
I don't mean to shoot you down in flames, but a lot of thought and effort
has gone into Ruby's encoding support. Ruby could have followed the Python
route of converting everything to Unicode, but that was rejected for various
good reasons. Also automatic transcoding to solve issues of incompatible
encodings was also rejected because it causes a number of problems, in
particular I believe that transcoding isn't necessarilly accurate, because
for example there may be multiple or ambiguous representations of the same
character.
---
Yukihiro Matsumoto
UTF-8 + ASCII-8BIT makes ASCII-8BIT. Binary wins.
jlf: hum... I do the opposite with Executor
jlf 2023.08.09: I checked today with Ruby 3.2, the result is UTF-8
http://graysoftinc.com/character-encodings
jlf: 12 articles about character encoding in Ruby.
From 2008-10-14 to 2009-06-18
Old, but maybe interesting?
todo: read
https://docs.ruby-lang.org/en/3.2/case_mapping_rdoc.html
Case Mapping
By default, all of these methods use full Unicode case mapping, which is suitable for most languages.
Non-ASCII case mapping and folding are supported for UTF-8, UTF-16BE/LE, UTF-32BE/LE, and ISO-8859-1~16 Strings/Symbols.
Context-dependent case mapping is currently not supported (Unicode standard: Context Specification for Casing).
In most cases, case conversions of a string have the same number of characters. There are exceptions (see also :fold below):
s = "\u00DF" # => "ß"
s.upcase # => "SS"
s = "\u0149" # => "ʼn"
s.upcase # => "ʼN"
Case mapping may also depend on locale (see also :turkic below)
s = "\u0049" # => "I"
s.downcase # => "i" # Dot above.
s.downcase(:turkic) # => "ı" # No dot above.
Case changing methods may not maintain Unicode normalization.
Except for casecmp and casecmp?, each of the case-mapping methods listed above accepts optional arguments, *options.
The arguments may be:
:ascii only.
:fold only.
:turkic or :lithuanian or both.
https://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/
composition in the form of ligatures isn’t handled at all
"baffle".upcase == "BAFFLE" # => false
jlf: Has been fixed in a later version:
"baffle".upcase # => "BAFFLE"
BUT
other things are still not good in Ruby 3.2.2 (March 30, 2023):
"noël".reverse # => "l̈eon"
"noël"[0..2] # => "noe"
---
"baffle"~text~upper= -- T'BAfflE' 30/05/2023 Executor not good because utf8proc upper is not good
"baffle"~text~caselessEquals("baffle")= -- 1 30/05/2023 Executor is good because utf8proc casefold is good
"noël"~text~reverse= -- T'lëon'
"noël"~text[1,3]= -- T'noë'
https://github.com/jmhodges/rchardet
Character encoding auto-detection in Ruby.
jlf: no doc :-(
Returns a confidence rate?
cd = CharDet.detect(some_data)
encoding = cd['encoding']
confidence = cd['confidence'] # 0.0 <= confidence <= 1.0
https://bugs.ruby-lang.org/issues/18949
Deprecate and remove replicate and dummy encodings
Rejected by Naruse:
String is a container and an encoding is a label on it. As long as there is
data whose encoding is categorized as a dummy encoding in Ruby, we cannot
avoid such encodings.
<reopened, lot of discussions>
This is all done now, only https://github.com/ruby/ruby/pull/7079.
Overall:
We deprecated and removed Encoding#replicate
We removed get_actual_encoding()
We limited to 256 encodings and kept rb_define_dummy_encoding() with that constraint.
There is a single flat array to lookup encodings, rb_enc_from_index() is fast now.
https://github.com/ruby/ruby/pull/3803
Add string encoding IBM720 alias CP720
The mapping table is generated from the ICU project:
https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/ibm-720_P100-1997.ucm
https://speakerdeck.com/ima1zumi/dive-into-encoding
slide 23: Code Set Independent (CSI), Treat all encodings fair
slide 24: Each instance of string has encoding information
slide 26: Universal Coded Set (UCS)
https://shopify.engineering/code-ranges-ruby-strings
Code Ranges: A Deeper Look at Ruby Strings
Code ranges are a way for the VM to avoid repeated work and optimize operations
on a per-string basis, guiding away from slow paths when that functionality
isn't needed.
jlf: not sure this article is useful.
https://idiosyncratic-ruby.com/66-ruby-has-character.html
Ruby has Character
video: https://www.youtube.com/watch?v=hlryzsdGtZo
(jlf: too small, not very readable, but good for pronunciation: "Louby")
---
jlf: this page is interesting for the one-liners.
Tools implemented by the author
https://github.com/janlelis/unibits Visualize different Unicode encodings in the terminal
https://github.com/janlelis/uniscribe Know your Unicode ✀
https://idiosyncratic-ruby.com/41-proper-unicoding.html
Proper Unicoding
Ruby's Regexp engine has a powerful feature built in: It can match for Unicode
character properties.
https://idiosyncratic-ruby.com/26-file-encoding-magic.html
default source encoding
# coding: cp1252
p "".encoding #=> #<Encoding:Windows-1252>
https://tomdebruijn.com/posts/rust-string-length-width-calculations/
The article is about Rust, but there is an appendix about Ruby.
Seems a good summary, so copy-paste here...
---
When calling Ruby's String#length, it returns the length of characters like
Rust's Chars.count. If you want the length in bytes you need to call String#bytesize.
"abc".length # => 3 characters
"abc".bytesize # => 3 bytes
"é".length # => 1 characters
"é".bytesize # => 2 bytes
Calling the length on emoji will return the individual characters as the length.
The 👩🔬 emoji is three characters and eleven bytes in Ruby as well.
"👩🔬".length # => 3 characters
"👩🔬".bytesize # => 11 bytes
Do you want grapheme clusters? It's built-in to Ruby with String#grapheme_clusters.
"👩🔬".grapheme_clusters.length # => 1 cluster
To calculate the display width, we can use the unicode-display_width gem. The same
multiple counting of emoji in the grapheme cluster still applies here.
require "unicode/display_width"
Unicode::DisplayWidth.of("👩🔬") # => 4
Unicode::DisplayWidth.of("❤️") # => 1
https://ruby-doc.org/3.2.2/File.html
class File
A File object is a representation of a file in the underlying platform.
---
Data mode
To specify whether data is to be treated as text or as binary data, either of
the following may be suffixed to any of the string read/write modes above:
't': Text data; sets the default external encoding to Encoding::UTF_8;
on Windows, enables conversion between EOL and CRLF and enables
interpreting 0x1A as an end-of-file marker.
'b': Binary data; sets the default external encoding to Encoding::ASCII_8BIT;
on Windows, suppresses conversion between EOL and CRLF and disables
interpreting 0x1A as an end-of-file marker.
---
Encodings
Any of the string modes above may specify encodings - either external encoding
only or both external and internal encodings - by appending one or both encoding
names, separated by colons:
f = File.new('t.dat', 'rb')
f.external_encoding # => #<Encoding:ASCII-8BIT>
f.internal_encoding # => nil
f = File.new('t.dat', 'rb:UTF-16')
f.external_encoding # => #<Encoding:UTF-16 (dummy)>
f.internal_encoding # => nil
f = File.new('t.dat', 'rb:UTF-16:UTF-16')
f.external_encoding # => #<Encoding:UTF-16 (dummy)>
f.internal_encoding # => #<Encoding:UTF-16>
f.close
- When the external encoding is set, strings read are tagged by that encoding
when reading, and strings written are converted to that encoding when writing.
- When both external and internal encodings are set, strings read are converted
from external to internal encoding, and strings written are converted from
internal to external encoding. For further details about transcoding input and
output, see Encodings.
https://ruby-doc.org/3.2.2/encodings_rdoc.html#label-Encodings
String comparison
If the encodings are different then the strings are different.
So it's not a comparison of Unicode codepoints.
irb(main):026:0> s1 = "hello"
=> "hello"
irb(main):027:0> s1
=> "hello"
irb(main):028:0> s2 = "hello"
=> "hello"
irb(main):029:0> s1 == s2
=> true
irb(main):030:0> s2.force_encoding("utf-16")
=> "\x68\x65\x6C\x6C\x6F"
irb(main):031:0> s2
=> "\x68\x65\x6C\x6C\x6F"
irb(main):032:0> s1 == s2
=> false
https://bugs.ruby-lang.org/issues/9111
Encoding-free String comparison
14/11/2013
---
Description
Currently, strings with the same content but with different encodings count
as different strings. This causes strange behaviour as below (noted in
StackOverflow question
http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206):
[128].pack("C") # => "\x80"
[128].pack("C") == "\x80" # => false
Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default)
has the encoding UTF-8, the two strings are not equal.
Also, comparison of strings with different encodings may end up with a messy,
unintended result.
I suggest that the comparison String#<=> should not be based on the respective
encoding of the strings, but all the strings should be internally converted
to UTF-8 for the purpose of comparison.
---
nobu (Nobuyoshi Nakada)
It's unacceptable to always convert all strings to UTF-8, should restrict to
comparison with an ASCII-8BIT string.
---
naruse (Yui NARUSE)
The standard practice is NFD("â") == NFD("a" + "^").
To NFD, you can use some libraries.
---
duerst (Martin Dürst)
Lié à Feature #10084: Add Unicode String Normalization to String class ajouté
https://bugs.ruby-lang.org/issues/10084
---
jlf 09/08/2023: ticket still opened...
The test [128].pack("C") == "\x80" still returns false, so I assume they made
no change.
https://bugs.ruby-lang.org/issues/10084
Add Unicode String Normalization to String class
23/07/2014
---
nobu (Nobuyoshi Nakada)
What will happen for a non-unicode string, raising an exception?
---
duerst (Martin Dürst)
This is a very good question. I'm okay with whatever Matz and the community
think is best.
There are many potential approaches. In general, these will be:
1. Make the operation a no-op.
2. Convert to UTF-8, normalize, then convert back.
3. Implement normalization directly in the encoding.
4. Raise an exception.
There is also the question of what a "non-unicode" or "unicode" string is.
UTF-8 is the preferred way to handle Unicode in Ruby, and is where normalization
is really needed and will be used.
For the other encodings, unless we go with 1) or 4), the following considerations
apply.
UTF8-Mac, UTF8-DoCoMo, UTF8-KDDI and UTF8-Softbank are essentially UTF-8 but
with slightly different character conversions. For these encodings, the easiest
thing to do is force_encoding to UTF-8, normalize, and force_encoding back.
A C-level implementation may not actually need force_encoding, but a Ruby
implementation does. There are some questions about what normalizing UTF8-Mac
means, so that may have to be treated separately. The DoCoMo/KDDI/Softbank
variants are mostly about emoji, which as far as I know are not affected by
normalization.
Then there are UTF-16LE/BE and UTF-32LE/BE. For these, it depends on the
implementation. A Ruby-level implementation (unless very slow) may want to
convert to UTF-8 and back. A C-level implementation may not need to do this.
Then there is also GB18030. Conversion to UTF-8 and back seems to be the best
solution. Doing normalization directly in GB18030 will need too much data.
For other, truly non-Unicode encodings, implementing normalization directly
in the encoding would mean the following: Analyze to what extent the normalization
applies to the encoding in question, and apply this part.
As an example, '①'.nfkc produces '1' in UTF-8, it could do the same in Windows-31J.
The analysis might take some time (but can be automated), and the data needed
for each encoding would mostly be just very small.
---
matz (Yukihiro Matsumoto)
First of all, I don't think normalize is the best name.
I propose unicode_normalize instead, since this normalization is sort of
unicode specific.
It should raise an exception for non Unicode strings.
It shouldn't convert to UTF-8 implicitly inside.
https://www.honeybadger.io/blog/troubleshooting-encoding-errors-in-ruby/
Troubleshooting Encoding Errors in Ruby
---
jlf: interesting for the one-liners
---
"H".bytes # => [72] in decimal
"H".bytes.map {|e| e.to_s 2} # => ["1001000"] convert in base 2
Encoding.name_list # => ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", "UTF8-MAC", "EUC-JP", "Windows-31J", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-KR", ...]
"hellÔ!".encode("US-ASCII") # in `encode': U+00D4 from UTF-8 to US-ASCII (Encoding::UndefinedConversionError)
"hellÔ!".force_encoding("US-ASCII"); # => "hell\xC3\x94!"
"abc\xCF\x88\xCF\x88" # => "abcψψ"
"abcψψ".force_encoding("US-ASCII").valid_encoding? # => false
"abcψψ".encode("US-ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "") # => "abc"
"abc\xA1z".encode("US-ASCII") # in `encode': "\xA1" on UTF-8 (Encoding::InvalidByteSequenceError)
"abc\xA1z".force_encoding("US-ASCII").scrub("*") # => "abc*z"
"abc\xA1z".force_encoding("US-ASCII").scrub("") # => "abcz"
"abc\xA1z".force_encoding("US-ASCII").valid_encoding? # => false
Rust lang
Seen in a comment here : https://bugs.swift.org/browse/SR-7602
For reference, I think [Rust's model]( https://doc.rust-lang.org/std/string/struct.String.html ) is pretty good:
`from_utf8` produces an error explaining why the code units were invalid
`from_utf8_lossy` replaces encoding errors with U+FFFD
`from_utf8_unchecked` which takes the bytes, but if there's an encoding error, then memory safety has been violated
I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly.
We may want to be very cautious about if/how we expose it.
I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8.
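A minimal Rust sketch exercising the first two of these constructors (standard-library APIs; the byte values are just made-up examples):
fn main() {
    let good = vec![0x61, 0x62, 0x63];   // "abc"
    let bad = vec![0x61, 0xFF, 0x63];    // 0xFF can never appear in UTF-8
    assert_eq!(String::from_utf8(good).unwrap(), "abc");
    // from_utf8 reports where and why validation failed
    let err = String::from_utf8(bad.clone()).unwrap_err();
    println!("invalid UTF-8 after byte {}", err.utf8_error().valid_up_to()); // 1
    // from_utf8_lossy replaces the offending bytes with U+FFFD
    assert_eq!(String::from_utf8_lossy(&bad), "a\u{FFFD}c");
    // from_utf8_unchecked (unsafe) would accept the bytes verbatim; not shown here.
}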
17/07/2021
https://www.generacodice.com/en/articolo/120763/Unicode+Support+in+Various+Programming+Languages
jlf: I learned something: OsStr/OsString
Rust's strings (std::String and &str) are always valid UTF-8, and do not use null
terminators, and as a result can not be indexed as an array, like they can be in C/C++, etc.
They can be sliced somewhat like Go using .get since 1.20, with the caveat that
it will fail if you try slicing the middle of a code point.
Rust also has OsStr/OsString for interacting with the Host OS.
It's a byte array on Unix (containing any sequence of bytes).
On Windows it's WTF-8 (a superset of UTF-8 that handles the improperly
formed Unicode strings that are allowed in Windows and JavaScript).
&str and String can be freely converted to OsStr or OsString, but require
checks to convert the other way: either by failing on invalid Unicode, or
replacing with the Unicode replacement char. (There is also Path/PathBuf,
which are just wrappers around OsStr/OsString).
There is also the CStr and CString types, which represent Null terminated C
strings, like OsStr on Unix they can contain arbitrary bytes.
Rust doesn't directly support UTF-16, but OsStr can be converted to UCS-2 on Windows.
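A small sketch of the &str <-> OsStr round trip described above (standard-library APIs only):
use std::ffi::OsString;
use std::path::PathBuf;
fn main() {
    let os = OsString::from("héllo");               // &str -> OsString is infallible
    let back = os.to_str().expect("valid Unicode");  // OsStr -> &str is checked
    assert_eq!(back, "héllo");
    // to_string_lossy() substitutes U+FFFD when the OS string is not valid Unicode
    let p = PathBuf::from(&os);
    println!("{}", p.to_string_lossy());
}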
22/07/2021
https://lib.rs/crates/
STFU-8: Sorta Text Format in UTF-8
STFU-8 is a hacky text encoding/decoding protocol for data that might be not
quite UTF-8 but is still mostly UTF-8.
Its primary purpose is to be able to allow a human to visualize and edit "data"
that is mostly (or fully) visible UTF-8 text. It encodes all non visible or non
UTF-8 compliant bytes as longform text (i.e. ESC becomes the full string r"\x1B").
It can also encode/decode ill-formed UTF-16.
28/07/2021
https://fasterthanli.me/articles/working-with-strings-in-rust
07/11/2021
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
security concern affecting source code containing "bidirectional override" Unicode codepoints
10/03/2022
https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html
Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers.
10/09/2022
https://blog.burntsushi.net/bstr/
A byte string library for Rust
Invalid UTF-8 doesn’t actually prevent one from applying Unicode-aware algorithms on the parts
of the string that are valid UTF-8. The parts that are invalid UTF-8 are simply ignored.
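A short sketch of that idea, assuming the bstr crate as a dependency: iteration is Unicode-aware and the lone invalid byte simply comes out as U+FFFD.
use bstr::ByteSlice;
fn main() {
    let bytes = &b"foo\xFFbar"[..];   // not valid UTF-8 as a whole
    let decoded: String = bytes.chars().collect();
    assert_eq!(decoded, "foo\u{FFFD}bar");
}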
15/10/2022
https://crates.io/crates/finl_unicode
Library for handling Unicode functionality for finl (categories and grapheme segmentation)
There are these comments in https://news.ycombinator.com/item?id=32700315
All with two-step tables instead of range- and binary search?
Yes. The two-step tables are really not that expensive and they enable features not possible with range and binary search, like identifying the category of a character cheaply.
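A hypothetical Rust sketch of such a two-step table (the table contents are dummy zeros here; real tables are emitted by a generator, e.g. the rust-lang unicode-table-generator referenced further down, which also deduplicates identical blocks):
const BLOCK: usize = 256;
// stage 1: one entry per 256-code-point block, pointing into stage 2
static STAGE1: [u8; 0x110000 / BLOCK] = [0; 0x110000 / BLOCK];
// stage 2: the deduplicated blocks, one "category" byte per code point
static STAGE2: [[u8; BLOCK]; 1] = [[0; BLOCK]];
// two array lookups instead of a binary search over ranges
fn category(cp: char) -> u8 {
    let cp = cp as usize;
    STAGE2[STAGE1[cp / BLOCK] as usize][cp % BLOCK]
}
fn main() {
    assert_eq!(category('A'), 0); // dummy table: everything maps to 0
}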
https://github.com/open-i18n/rust-unic
UNIC: Unicode and Internationalization Crates for Rust
jlf: seems stale since Oct 21, 2020. Killed by ICU4X?
This fork is still alive: https://github.com/eyeplum/rust-unic
https://github.com/logannc/fuzzywuzzy-rs
port of https://github.com/seatgeek/fuzzywuzzy
(Fuzzy String Matching in Python
This project has been renamed and moved to https://github.com/seatgeek/thefuzz)
Fuzzy string matching like a boss.
It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
https://en.wikipedia.org/wiki/Levenshtein_distance
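For reference, a minimal single-row dynamic-programming Levenshtein distance over Unicode scalar values (not grapheme clusters), roughly what such libraries compute under the hood:
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b.len()).collect(); // distances from the empty prefix of a
    for (i, ca) in a.chars().enumerate() {
        let mut prev = row[0]; // row[j] of the previous row
        row[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cur = row[j + 1];
            row[j + 1] = if ca == cb {
                prev                            // no edit needed
            } else {
                1 + prev.min(cur).min(row[j])   // substitution / deletion / insertion
            };
            prev = cur;
        }
    }
    row[b.len()]
}
fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
}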
https://hsivonen.fi/encoding_rs/
encoding_rs: a Web-Compatible Character Encoding Library in Rust
encoding_rs is a high-decode-performance, low-legacy-encode-footprint and high-correctness implementation
of the WHATWG Encoding Standard written in Rust.
---
https://hsivonen.fi/modern-cpp-in-rust/
How I Wrote a Modern C++ Library in Rust
Slides: https://hsivonen.fi/rustfest2018/
Video: https://media.ccc.de/v/rustfest18-5-a_rust_crate_that_also_quacks_like_a_modern_c_library
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2773r0.pdf
(pdf...)
Generally speaking, reducing the size of the tables has a direct impact on
performance, if only because increasing cache locality is the most effective
way to improve the performance of anything.
I landed on a set of strategies developed by the rust team
https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator/src
https://www.youtube.com/watch?v=Mcuqzx3rBWc
Strings in Rust FINALLY EXPLAINED!
jlf: is there something to learn from 15:29 Indexing into a string? no.
https://github.com/rust-lang/regex/blob/master/UNICODE.md
regex Unicode conformance
jlf: I found the URL above in this HN comment (related to awk support of Unicode)
https://news.ycombinator.com/item?id=32538560
https://github.com/danielpclark/rutie
Integrate Ruby with your Rust application. Or integrate Rust with your Ruby application.
https://github.com/danielpclark/rutie/blob/master/src/class/string.rs
https://tomdebruijn.com/posts/rust-string-length-width-calculations/
Calculating String length and width
https://github.com/lintje/lintje/blob/501aab06e19008e787237438a69ac961f38bb4b7/src/utils.rs#L22-L71
// Return String display width as rendered in a monospace font according to the Unicode
// specification.
https://www.reddit.com/r/rust/comments/gpw2ra/how_is_the_rust_compiler_able_to_tell_the_visible/
How is the Rust compiler able to tell the visible width of unicode characters?
---
jlf: some arbitrary excerpts
- rustc uses the unicode-width crate (https://github.com/unicode-rs/unicode-width)
- Now try it with the rainbow flag emoji. Unicode is hard :)
- explanation:
the rainbow flag emoji is actually just a white flag + zero width joiner + a rainbow, meaning it's technically three characters.
- Sure but why doesn't the unicode-width crate handle that?
- The unicode-width crate operates on scalar values. I don't believe Unicode has
a way to determine whether a grapheme cluster is halfwidth/fullwidth. The most
reasonable way to determine this would probably be the maximum width of any scalar
value within a grapheme cluster, but this isn't part of any standard and probably
isn't 100% accurate. (A Rust sketch of this heuristic follows these excerpts.)
- It is also dependent on the display platform. A platform with support for displaying
emojis but only in older unicode versions would indeed display multiple emojis
on the screen. I don't believe there's a platform independent way to detect the
visual length of any given series of unicode codepoints. For Rust this isn't a
problem as we restrict the unicode identifiers only to things that are fairly
homogeneous (namely, no emojis in your variable names!).
- At the bottom of things is the unicode-width native Rust implementation, based
off the Unicode 13.0 data tables. In C/POSIX land, we would use the function
wcwidth(). Unfortunately, this isn't the whole story. The actual number of
columns used is dependent upon your font and the font layout engine.
See section 7.4 of my Free book, Hacking the Planet! with Notcurses, aka "Fixed-width Fonts Ain't So Fixed."
https://nick-black.com/htp-notcurses.pdf#page=57
you want pages 47--49 (p49 has some good examples).
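A sketch (not any crate's actual API) of the "maximum scalar width per grapheme cluster" heuristic mentioned above, assuming the unicode-segmentation and unicode-width crates as dependencies:
use unicode_segmentation::UnicodeSegmentation;
use unicode_width::UnicodeWidthChar;
fn cluster_width(s: &str) -> usize {
    s.graphemes(true)
        .map(|g| g.chars().filter_map(|c| c.width()).max().unwrap_or(0))
        .sum()
}
fn main() {
    // woman + ZWJ + microscope: counted once with width 2,
    // instead of 2 + 2 when summing per scalar (see the unicode-width example below)
    println!("{}", cluster_width("👩\u{200D}🔬")); // 2
}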
https://github.com/unicode-rs/unicode-width
Displayed width of Unicode characters and strings according to UAX#11 rules.
NOTE: The computed width values may not match the actual rendered column width.
For example, the woman scientist emoji comprises a woman emoji, a zero-width
joiner and a microscope emoji.
extern crate unicode_width;
use unicode_width::UnicodeWidthStr;
fn main() {
assert_eq!(UnicodeWidthStr::width("👩"), 2); // Woman
assert_eq!(UnicodeWidthStr::width("🔬"), 2); // Microscope
assert_eq!(UnicodeWidthStr::width("👩🔬"), 4); // Woman scientist
}
https://github.com/life4/textdistance.rs
https://www.reddit.com/r/rust/comments/13lo6ne/textdistancers_rust_library_to_compare_strings_or/
textdistance.rs: Rust library to compare strings (or any sequences).
25+ algorithms, pure Rust, common interface, Unicode support.
Based on popular and battle-tested textdistance Python library https://github.com/life4/textdistance
https://github.com/dguo/strsim-rs
Rust implementations of string similarity metrics:
Hamming
Levenshtein - distance & normalized
Optimal string alignment
Damerau-Levenshtein - distance & normalized
Jaro and Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix length
Sørensen-Dice
https://docs.rs/xi-unicode/latest/xi_unicode/
Unicode utilities useful for text editing, including a line breaking iterator.
https://github.com/BurntSushi/bstr
A string type for Rust that is not required to be valid UTF-8.
---
jlf: this crate is referenced by Stefan Karpinski in the section Filenames
(search this URL).
https://www.reddit.com/r/rust/comments/qr0rem/how_many_string_types_does_rust_have_maybe_its/
How many String types does Rust have? Maybe it's just 1
jlf: to read?
Saxon lang
https://www.saxonica.com/documentation12/#!localization/unicode-collation-algorithm
Unicode Collation Algorithm
https://www.saxonica.com/documentation12/index.html#!localization/sorting-and-collations
Sorting and collations
https://www.saxonica.com/documentation12/index.html#!changes/spi/10-11
Changes from 10 to 11
Strings
Most uses of CharSequence have been replaced by a new class
net.sf.saxon.str.UnicodeString (which also replaces the old class
net.sf.saxon.regex.UnicodeString).
The UnicodeString class has a number of implementations.
All of them are designed to be codepoint-addressable: they expose an
indexable array of 32-bit codepoint values, and never use surrogate pairs.
The implementations of UnicodeString include:
- Twine8:
a string consisting entirely of codepoints in the range 1-255, held in
an array with one byte per character.
- Twine16:
a string consisting entirely of codepoints in the range 1-65535, held
in an array with two bytes per character.
- Twine24:
a string of arbitrary codepoints, held in an array with three bytes
per character.
- Slice8:
a sub-range of an array using one byte per character.
- Slice16:
a sub-range of an array using two bytes per character.
- Slice24:
a sub-range of an array using three bytes per character.
- BMPString:
a wrapper around a Java/C# string known to contain no surrogate pairs.
- ZenoString:
a composite string held as a list of segments, each of which is itself
a UnicodeString. The name derives from the algorithm used to combine
segments, which results in segments having progressively decreasing
lengths towards the end of the string.
- StringView:
a wrapper around an arbitrary Java/C# string. (This stores the string
both in its native Java/C# form, and using a "real" codepoint-
addressable implementation of UnicodeString, which is constructed
lazily when it is first required.)
Unicode normalization of strings (for example in the fn:normalize-unicode()
function) now uses the JDK class java.text.Normalizer rather than code
derived from the Unicode Consortium's implementation.
This appears to be substantially faster.
https://www.balisage.net/Proceedings/vol26/html/Kay01/BalisageVol26-Kay01.html
ZenoString: A Data Structure for Processing XML Strings
August 2 - 6, 2021
Compare with
- Monolithic char arrays
- Strings in Saxon
- Ropes
- Finger Trees
https://www.cambridge.org/core/journals/journal-of-functional-programming/article/finger-trees-a-simple-generalpurpose-data-structure/BF419BCA07292DCAAF2A946E6BDF573B#article
finger-trees-a-simple-general-purpose-data-structure.pdf
SQL lang
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode.html
Unicode Support
BMP characters
- can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes)
- can be encoded in a fixed-length encoding using 16 bits (2 bytes).
Supplementary characters take more space than BMP characters (up to 4 bytes per character).
MySQL supports these Unicode character sets:
- utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
- utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
This character set is deprecated in MySQL 8.0, and you should use utf8mb4 instead.
- utf8: An alias for utf8mb3. In MySQL 8.0, this alias is deprecated; use utf8mb4 instead.
utf8 is expected in a future release to become an alias for utf8mb4.
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb4.html
jlf: I take note of this URL for this concatenation rule:
utf8mb4 is a superset of utf8mb3, so for an operation such as the following
concatenation, the result has character set utf8mb4 and the collation of
utf8mb4_col:
SELECT CONCAT(utf8mb3_col, utf8mb4_col);
Similarly, the following comparison in the WHERE clause works according to the
collation of utf8mb4_col:
SELECT * FROM utf8mb3_tbl, utf8mb4_tbl
WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col;
https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html#data-types-storage-reqs-strings
String Type Storage Requirements
https://dev.mysql.com/doc/refman/8.0/en/charset-introducer.html
Character Set Introducers
A character string literal, hexadecimal literal, or bit-value literal may have
an optional character set introducer and COLLATE clause, to designate it as a
string that uses a particular character set and collation:
[_charset_name] literal [COLLATE collation_name]
The _charset_name expression is formally called an introducer. It tells the
parser, “the string that follows uses character set charset_name.” An introducer
does not change the string to the introducer character set like CONVERT() would
do. It does not change the string value, although padding may occur. The
introducer is just a signal.
---
Examples:
SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci;
SELECT _latin1 X'4D7953514C';
SELECT _utf8mb4 0x4D7953514C COLLATE utf8mb4_danish_ci;
SELECT _latin1 b'1000001';
SELECT _utf8mb4 0b1000001 COLLATE utf8mb4_danish_ci;
---
Character string literals can be designated as binary strings by using the
_binary introducer.
mysql> SET @v1 = X'000D' | X'0BC0';
mysql> SET @v2 = _binary X'000D' | X'0BC0';
mysql> SELECT HEX(@v1), HEX(@v2);
+----------+----------+
| HEX(@v1) | HEX(@v2) |
+----------+----------+
| BCD | 0BCD |
+----------+----------+
---
Followed by rules to determines the character set and collation of a character
string literal, hexadecimal literal, or bit-value literal.
See the page for the details.
https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/
MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4?
MySQL decided that UTF-8 can only hold 3 bytes per character (as it's defined
as an alias of utf8mb3). Why? No good reason that I can find documented anywhere.
A few years later, when MySQL 5.5.3 was released, they introduced a new encoding
called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.
https://www.percona.com/blog/migrating-to-utf8mb4-things-to-consider/
Migrating to utf8mb4: Things to Consider
The utf8mb4 character set is the new default as of MySQL 8.0, and this change
neither affects existing data nor forces any upgrades.
Migration to utf8mb4 has many advantages including:
- It can store more symbols, including emojis
- It has new collations for Asian languages
- It is faster than utf8mb3
Swift lang
https://github.com/apple/swift-evolution/blob/main/proposals/0363-unicode-for-string-processing.md
Proposal: Unicode for String Processing
This proposal describes Regex's rich Unicode support during regex matching,
along with the character classes and options that define and modify that behavior.
This proposal is one component of a larger regex-powered string processing initiative.
https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/
Strings and Characters
Every string is composed of encoding-independent Unicode characters, and provides
support for accessing those characters in various Unicode representations.
When a Unicode string is written to a text file or some other storage, the Unicode
scalars in that string are encoded in one of several Unicode-defined encoding forms.
Each form encodes the string in small chunks known as code units. These include the
UTF-8 encoding form (which encodes a string as 8-bit code units), the UTF-16 encoding
form (which encodes a string as 16-bit code units), and the UTF-32 encoding form
(which encodes a string as 32-bit code units).
03/08/2021
https://swiftdoc.org/v5.1/type/string/
Auto-generated documentation for Swift.
A Unicode string value that is a collection of characters.
https://developer.apple.com/documentation/swift/string
https://www.simpleswiftguide.com/get-character-from-string-using-its-index-in-swift/
jlf: no direct access to a character
Doesn't work:
let input = "Swift Tutorials"
let char = input[3]
Works:
let input = "Swift Tutorials"
let char = input[input.index(input.startIndex, offsetBy: 3)]
A "workaround" to have direct access
extension StringProtocol {
subscript(offset: Int) -> Character {
self[index(startIndex, offsetBy: offset)]
}
}
Which can be used just like that:
let input = "Swift Tutorials"
let char = input[3]
https://gist.github.com/paultopia/6609780e7b53676b7dfc55736221cd23
paultopia/monkey_patch_slicing_into_string.swift
Another "workaround" to have direct access to the characters like that:
var s = "here is a boring string"
print(s.getCharList())
print(s[1])
print(s[-1])
print(s[0, 5])
print(s[5, 0])
print(s[3...6])
print(s[2..<10])
print(s[...15])
print(s[2...])
print(s[..<15])
https://developer.apple.com/documentation/swift/unicode/canonicalcombiningclass
Unicode.CanonicalCombiningClass
The classification of a scalar used in the Canonical Ordering Algorithm defined by the Unicode Standard.
---
Canonical combining classes are used by the ordering algorithm to determine if
two sequences of combining marks should be considered canonically equivalent
(that is, identical in interpretation). Two sequences are canonically equivalent
if they are equal when sorting the scalars in ascending order by their combining class.
---
aboveBeforeBelow = "\u{0041}\u{0301}\u{0316}"~text~unescape
belowBeforeAbove = "\u{0041}\u{0316}\u{0301}"~text~unescape
aboveBeforeBelow~compareTo(belowBeforeAbove)= -- 0 (good, means equal)
aboveBeforeBelow == belowBeforeAbove= -- .true
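For comparison, the same check in Rust, assuming the unicode-normalization crate: canonical ordering puts the below mark (ccc 220) before the above mark (ccc 230), so both spellings normalize to the same string.
use unicode_normalization::UnicodeNormalization;
fn main() {
    let above_before_below = "\u{0041}\u{0301}\u{0316}";
    let below_before_above = "\u{0041}\u{0316}\u{0301}";
    let a: String = above_before_below.nfc().collect();
    let b: String = below_before_above.nfc().collect();
    assert_eq!(a, b); // canonically equivalent
}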
15/07/2017
String Processing For Swift 4
https://github.com/apple/swift/blob/master/docs/StringManifesto.md
https://swift.org/blog/utf8-string/
Swift 5 switches the preferred encoding of strings from UTF-16 to UTF-8 while preserving efficient Objective-C-interoperability.
jlf: Search "breadcrumb".
Notice that the article is about Swift Objective-C interoperability.
The Swift language itself does not allow random access to characters.
---
Swift 5, like Rust, performs encoding validation once on creation, when it is far
more efficient to do so. NSStrings, which are lazily bridged (zero-copy) into
Swift and use UTF-16, may contain invalid content (i.e. isolated surrogates).
As in Swift 4.2, these are lazily validated when read from.
https://bugs.swift.org/browse/SR-7602 (redirect to next URL)
https://github.com/apple/swift/issues/50144
UTF8 should be (one of) the fastest String encoding(s)
---
Requirements:
being able to copy UTF-8 encoded bytes from a String into a pre-allocated raw buffer
must be allocation-free and as fast as memcpy can copy them
creating a String from UTF-8 encoded bytes should just validate the encoding and store the bytes as they are
(jlf: "and store the bytes as they are" --> YES!)
slightly softer but still very strong requirement: currently (even with ASCII)
only the stdlib seems to be able to get a pointer to the contiguous ASCII representation
(if at all in that form). That works fine if you just want to copy the bytes
(UnsafeMutableBufferPointer(start: destinationStart, count: destinationLength).initialize(from: string.utf8)
which will use memcpy if in ASCII representation) but doesn't allow you to implement
your own algorithms that are only performant on a contiguously stored [UInt8]
---
jlf: this comment in the thread is particularly interesting, because it reminds
me of what was said on the ARB mailing list about byte versus string.
https://github.com/apple/swift/issues/50144#issuecomment-1108303710
May 9, 2018
@milseman Virtually all of it comes down to `String(data: myData, encoding: .utf8)`
and `myString.data(encoding: .utf8)`.
When parsing protocols such as HTTP, Redis, MySQL, PostgreSQL, etc we will read data from
the OS into an `UnsafeBufferPointer<UInt8>`. This is almost always via NIO's
[`ByteBuffer`](https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html) type.
We sometimes grab `String` from that directly or grab `Data` if we want to iterate over the bytes
for additional parsing.
In other words, from `UnsafePointer<UInt8>` we commonly read `FixedWidthInteger`,
`BinaryFloatingPoint`, `Data`, and `String`. All are very performant except String
which is the concern since the vast majority of bytes ends up being `String`s.
Considering the DB use case specifically, the data transfer is usually emails,
names, bios, comments, etc. Very few bytes are actually dedicated to binary
numbers or data blobs. Strings everywhere.
To summarize, the faster we can get from `Swift.Unsafe...Pointer<UInt8>` or
`Foundation.Data` to `String` the better. That will affect (for the better!)
quite literally our entire framework.
---
jlf: this comment from the same thread shows which questions we should answer for Rexx:
https://github.com/apple/swift/issues/50144#issuecomment-1108303720
Along the lines of potentially separable issues, what is your validation story?
If the stream of bytes contains invalid UTF-8, do you want:
1) The initializer to fail resulting in nil
2) The initializer to fail producing an error
3) The invalid bytes to be replaced with U+FFFD
4) The bytes verbatim, and experience the emergent behavior / unspecified results / security hazard from those bytes.
For reference, I think [Rust's model](https://doc.rust-lang.org/std/string/struct.String.html) is pretty good:
`from_utf8` produces an error explaining why the code units were invalid
`from_utf8_lossy` replaces encoding errors with U+FFFD
`from_utf8_unchecked` which takes the bytes, but if there's an encoding error, then memory safety has been violated
I'm not entirely sure if accepting invalid bytes requires voiding memory safety
(assuming bounds checking always happens), but it is totally a security hazard if used improperly.
We may want to be very cautious about if/how we expose it.
I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8.
(jlf: I don't understand this last sentence. By "read-time", does he mean "when working with the string"?)
milseman Michael Ilseman added a comment - 5 Nov 2018 3:44 PM
It's now the fastest encoding.
https://forums.swift.org/t/string-s-abi-and-utf-8/17676/1
https://github.com/apple/swift/pull/20315
https://github.com/apple/swift/blob/7e68e8f4a3cb1173e909dc22a3490c05e43fa592/stdlib/public/core/StringObject.swift
swift/stdlib/public/core/StringObject.swift
jlf: the link above is a frozen link
To have an up-to-date view, go to
https://github.com/apple/swift/tree/main/stdlib/public/core
Lots of code to review!
String.swift
StringBreadcrumbs.swift
StringBridge.swift
StringCharacterView.swift
StringComparable.swift
StringComparison.swift
StringCreate.swift
StringGraphemeBreaking.swift
jlf: Apparently, there are some difficulties when going backwards.
// When walking backwards, it's impossible to know whether we were in an emoji
// sequence without walking further backwards. This walks the string backwards
// enough until we figure out whether or not to break our
// (.zwj, .extendedPictographic) question.
// When walking backwards, it's impossible to know whether we break when we
// see our first (.regionalIndicator, .regionalIndicator) without walking
// further backwards. This walks the string backwards enough until we figure
// out whether or not to break these RIs.
StringGuts.swift
StringGutsRangeReplaceable.swift
StringGutsSlice.swift
StringHashable.swift
StringIndex.swift
StringIndexConversions.swift
StringIndexValidation.swift
StringInterpolation.swift
StringLegacy.swift
StringNormalization.swift
StringObject.swift
StringProtocol.swift
StringRangeReplaceableCollection.swift
StringStorage.swift
StringStorageBridge.swift
StringSwitch.swift
StringTesting.swift
StringUTF16View.swift
StringUTF8Validation.swift
StringUTF8View.swift
StringUnicodeScalarView.swift
StringWordBreaking.swift
Substring.swift
https://github.com/apple/swift/blob/main/stdlib/public/core/StringBreadcrumbs.swift
Breadcrumb optimization
The distance between successive breadcrumbs, measured in UTF-16 code units is 64.
internal static var breadcrumbStride: Int { 64 }
jlf: nothing sophisticated here...
They scan the whole string by iterating over the UTF-16 indexes and when i % stride == 0 then self.crumbs.append(curIdx)
When searching the offset for a String.Index, they do a binary search.
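A rough sketch of the breadcrumb idea, written in Rust rather than Swift (names and details are made up; only the stride of 64 UTF-16 code units comes from the Swift source): record a byte offset at every stride boundary, then resolve a UTF-16 offset with a direct crumb lookup plus a short scan, and resolve a byte offset with a binary search over the crumbs plus a short scan.
const STRIDE: usize = 64; // Swift's breadcrumbStride
struct Breadcrumbs {
    crumbs: Vec<usize>, // crumbs[k] = byte offset of UTF-16 offset k * STRIDE
}
impl Breadcrumbs {
    fn new(s: &str) -> Self {
        let mut crumbs = vec![0];
        let mut utf16 = 0;
        for (byte_idx, ch) in s.char_indices() {
            while utf16 >= crumbs.len() * STRIDE {
                crumbs.push(byte_idx); // reached the next stride boundary
            }
            utf16 += ch.len_utf16();
        }
        Breadcrumbs { crumbs }
    }
    // UTF-16 offset -> byte offset: O(STRIDE) instead of O(string length)
    fn byte_offset(&self, s: &str, target: usize) -> usize {
        let k = (target / STRIDE).min(self.crumbs.len() - 1);
        let start = self.crumbs[k];
        let mut utf16 = k * STRIDE;
        for (i, ch) in s[start..].char_indices() {
            if utf16 >= target {
                return start + i;
            }
            utf16 += ch.len_utf16();
        }
        s.len()
    }
    // byte offset -> UTF-16 offset: binary search over the crumbs, then a short scan
    fn utf16_offset(&self, s: &str, byte_target: usize) -> usize {
        let k = self.crumbs.partition_point(|&b| b <= byte_target) - 1;
        let mut utf16 = k * STRIDE;
        for ch in s[self.crumbs[k]..byte_target].chars() {
            utf16 += ch.len_utf16();
        }
        utf16
    }
}
fn main() {
    let s = "a".repeat(100) + "🐻" + &"b".repeat(100);
    let bc = Breadcrumbs::new(&s);
    assert_eq!(bc.byte_offset(&s, 102), 104);      // the bear is 2 UTF-16 units, 4 bytes
    assert_eq!(bc.utf16_offset(&s, s.len()), 202); // 100 + 2 + 100
}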
https://github.com/apple/swift/pull/20315/commits/2e368a3f6a25b5e84c0f682861ea0a5c9b3b26af
[String] Introduce StringBreadcrumbs
Breadcrumbs provide us amortized O(1) access to the UTF-16 view, which
is vital for efficient Cocoa interoperability.
---
jlf: this is the commit where breadcrumbs are added to Swift (Nov 4, 2018).
https://stackoverflow.com/questions/55389444/whats-does-extended-grapheme-clusters-are-canonically-equivalent-means-in-term
Whats does “extended grapheme clusters are canonically equivalent” means in terms of Swift String?
jlf:
They don't answer the question :-(
no explanation about "canonically equivalent", just ONE poor example, no general definition.
https://forums.swift.org/t/pitch-unicode-equivalence-for-swift-source/21576/6
Pitch: Unicode Equivalence for Swift Source
jlf: interesting
Mar 13,2019
In short, there is a thorough set of rules already laid out in UAX#31 on how to normalize identifiers in programming languages.
Several of us have written several versions of a proposal to adopt it, but each time it has failed because of issues with emoji.
Recent versions of Unicode now have more robust classifications for emoji, so the proposal can be resurrected with better luck now, probably.
No need to start from scratch; feel free to build on the work that we’ve already done.
All of this applies only to identifiers. Literals should never be messed with by the compiler.
They are, after all, supposed to be literals.
13/06/2021
https://github.com/apple/swift-evolution/blob/main/proposals/0211-unicode-scalar-properties.md
Add Unicode Properties to Unicode.Scalar
Issues Linking with ICU
The Swift standard library uses the system's ICU libraries to implement its Unicode support.
A third-party developer may expect that they could also link their application directly to the system ICU
to access the functionality that they need, but this proves problematic on both Apple and Linux platforms.
Apple
On Apple operating systems, libicucore.dylib is built with function renaming disabled
(function names lack the _NN version number suffix). This makes it fairly straightforward to import the C APIs
and call them from Swift without worrying about which version the operating system is using.
Unfortunately, libicucore.dylib is considered to be private API for submissions to the App Store,
so applications doing this will be rejected. Instead, users must build their own copy of ICU from source
and link that into their applications. This is significant overhead.
Linux
On Linux, system ICU libraries are built with function renaming enabled (the default),
so function names have the _NN version number suffix. Function renaming makes it more difficult
to use these APIs from Swift; even though the C header files contain #defines that map function names
like u_foo_59 to u_foo, these #defines are not imported into Swift—only the suffixed function names are available.
This means that Swift bindings would be fixed to a specific version of the library without some other intermediary layer.
Again, this is significant overhead.
extension Unicode.Scalar.Properties {
public var isAlphabetic: Bool { get } // Alphabetic
public var isASCIIHexDigit: Bool { get } // ASCII_Hex_Digit
public var isBidiControl: Bool { get } // Bidi_Control
public var isBidiMirrored: Bool { get } // Bidi_Mirrored
public var isDash: Bool { get } // Dash
public var isDefaultIgnorableCodePoint: Bool { get } // Default_Ignorable_Code_Point
public var isDeprecated: Bool { get } // Deprecated
public var isDiacritic: Bool { get } // Diacritic
public var isExtender: Bool { get } // Extender
public var isFullCompositionExclusion: Bool { get } // Full_Composition_Exclusion
public var isGraphemeBase: Bool { get } // Grapheme_Base
public var isGraphemeExtend: Bool { get } // Grapheme_Extend
public var isHexDigit: Bool { get } // Hex_Digit
public var isIDContinue: Bool { get } // ID_Continue
public var isIDStart: Bool { get } // ID_Start
public var isIdeographic: Bool { get } // Ideographic
public var isIDSBinaryOperator: Bool { get } // IDS_Binary_Operator
public var isIDSTrinaryOperator: Bool { get } // IDS_Trinary_Operator
public var isJoinControl: Bool { get } // Join_Control
public var isLogicalOrderException: Bool { get } // Logical_Order_Exception
public var isLowercase: Bool { get } // Lowercase
public var isMath: Bool { get } // Math
public var isNoncharacterCodePoint: Bool { get } // Noncharacter_Code_Point
public var isQuotationMark: Bool { get } // Quotation_Mark
public var isRadical: Bool { get } // Radical
public var isSoftDotted: Bool { get } // Soft_Dotted
public var isTerminalPunctuation: Bool { get } // Terminal_Punctuation
public var isUnifiedIdeograph: Bool { get } // Unified_Ideograph
public var isUppercase: Bool { get } // Uppercase
public var isWhitespace: Bool { get } // Whitespace
public var isXIDContinue: Bool { get } // XID_Continue
public var isXIDStart: Bool { get } // XID_Start
public var isCaseSensitive: Bool { get } // Case_Sensitive
public var isSentenceTerminal: Bool { get } // Sentence_Terminal (S_Term)
public var isVariationSelector: Bool { get } // Variation_Selector
public var isNFDInert: Bool { get } // NFD_Inert
public var isNFKDInert: Bool { get } // NFKD_Inert
public var isNFCInert: Bool { get } // NFC_Inert
public var isNFKCInert: Bool { get } // NFKC_Inert
public var isSegmentStarter: Bool { get } // Segment_Starter
public var isPatternSyntax: Bool { get } // Pattern_Syntax
public var isPatternWhitespace: Bool { get } // Pattern_White_Space
public var isCased: Bool { get } // Cased
public var isCaseIgnorable: Bool { get } // Case_Ignorable
public var changesWhenLowercased: Bool { get } // Changes_When_Lowercased
public var changesWhenUppercased: Bool { get } // Changes_When_Uppercased
public var changesWhenTitlecased: Bool { get } // Changes_When_Titlecased
public var changesWhenCaseFolded: Bool { get } // Changes_When_Casefolded
public var changesWhenCaseMapped: Bool { get } // Changes_When_Casemapped
public var changesWhenNFKCCaseFolded: Bool { get } // Changes_When_NFKC_Casefolded
public var isEmoji: Bool { get } // Emoji
public var isEmojiPresentation: Bool { get } // Emoji_Presentation
public var isEmojiModifier: Bool { get } // Emoji_Modifier
public var isEmojiModifierBase: Bool { get } // Emoji_Modifier_Base
}
extension Unicode.Scalar.Properties {
// Implemented in terms of ICU's `u_isdefined`.
public var isDefined: Bool { get }
}
Case Mappings
The properties below provide full case mappings for scalars. Since a handful of mappings result in multiple scalars (e.g., "ß" uppercases to "SS"), these properties are String-valued, not Unicode.Scalar.
extension Unicode.Scalar.Properties {
public var lowercaseMapping: String { get } // u_strToLower
public var titlecaseMapping: String { get } // u_strToTitle
public var uppercaseMapping: String { get } // u_strToUpper
}
Identification and Classification
extension Unicode.Scalar.Properties {
/// Corresponds to the `Age` Unicode property, when a code point was first
/// defined.
public var age: Unicode.Version? { get }
/// Corresponds to the `Name` Unicode property.
public var name: String? { get }
/// Corresponds to the `Name_Alias` Unicode property.
public var nameAlias: String? { get }
/// Corresponds to the `General_Category` Unicode property.
public var generalCategory: Unicode.GeneralCategory { get }
/// Corresponds to the `Canonical_Combining_Class` Unicode property.
public var canonicalCombiningClass: Unicode.CanonicalCombiningClass { get }
}
extension Unicode {
/// Represents the version of Unicode in which a scalar was introduced.
public typealias Version = (major: Int, minor: Int)
/// General categories returned by
/// `Unicode.Scalar.Properties.generalCategory`. Listed along with their
/// two-letter code.
public enum GeneralCategory {
case uppercaseLetter // Lu
case lowercaseLetter // Ll
case titlecaseLetter // Lt
case modifierLetter // Lm
case otherLetter // Lo
case nonspacingMark // Mn
case spacingMark // Mc
case enclosingMark // Me
case decimalNumber // Nd
case letterlikeNumber // Nl
case otherNumber // No
case connectorPunctuation //Pc
case dashPunctuation // Pd
case openPunctuation // Ps
case closePunctuation // Pe
case initialPunctuation // Pi
case finalPunctuation // Pf
case otherPunctuation // Po
case mathSymbol // Sm
case currencySymbol // Sc
case modifierSymbol // Sk
case otherSymbol // So
case spaceSeparator // Zs
case lineSeparator // Zl
case paragraphSeparator // Zp
case control // Cc
case format // Cf
case surrogate // Cs
case privateUse // Co
case unassigned // Cn
}
public struct CanonicalCombiningClass:
Comparable, Hashable, RawRepresentable
{
public static let notReordered = CanonicalCombiningClass(rawValue: 0)
public static let overlay = CanonicalCombiningClass(rawValue: 1)
public static let nukta = CanonicalCombiningClass(rawValue: 7)
public static let kanaVoicing = CanonicalCombiningClass(rawValue: 8)
public static let virama = CanonicalCombiningClass(rawValue: 9)
public static let attachedBelowLeft = CanonicalCombiningClass(rawValue: 200)
public static let attachedBelow = CanonicalCombiningClass(rawValue: 202)
public static let attachedAbove = CanonicalCombiningClass(rawValue: 214)
public static let attachedAboveRight = CanonicalCombiningClass(rawValue: 216)
public static let belowLeft = CanonicalCombiningClass(rawValue: 218)
public static let below = CanonicalCombiningClass(rawValue: 220)
public static let belowRight = CanonicalCombiningClass(rawValue: 222)
public static let left = CanonicalCombiningClass(rawValue: 224)
public static let right = CanonicalCombiningClass(rawValue: 226)
public static let aboveLeft = CanonicalCombiningClass(rawValue: 228)
public static let above = CanonicalCombiningClass(rawValue: 230)
public static let aboveRight = CanonicalCombiningClass(rawValue: 232)
public static let doubleBelow = CanonicalCombiningClass(rawValue: 233)
public static let doubleAbove = CanonicalCombiningClass(rawValue: 234)
public static let iotaSubscript = CanonicalCombiningClass(rawValue: 240)
public let rawValue: UInt8
public init(rawValue: UInt8)
}
}
Numerics
Many Unicode scalars have associated numeric values.
These are not only the common digits zero through nine, but also vulgar fractions
and various other linguistic characters and ideographs that have an innate numeric value.
These properties are exposed below. They can be useful for determining whether segments
of text contain numbers or non-numeric data, and can also help in the design of algorithms
to determine the values of such numbers.
extension Unicode.Scalar.Properties {
/// Corresponds to the `Numeric_Type` Unicode property.
public var numericType: Unicode.NumericType?
/// Corresponds to the `Numeric_Value` Unicode property.
public var numericValue: Double?
}
extension Unicode {
public enum NumericType {
case decimal
case digit
case numeric
}
}
14/06/2021
https://lists.isocpp.org/sg16/2018/08/0121.php
Feedback from swift team
Swift strings now sort with NFC (currently UTF-16 code unit order, but likely changed to Unicode scalar value order).
We didn't find FCC significantly more compelling in practice. Since NFC is far more frequent in the wild
(why waste space if you don't have to), strings are likely to already be in NFC.
We have fast-paths to detect on-the-fly normal sections of strings (e.g. all ASCII, all < U+0300, NFC_QC=yes, etc.).
We lazily normalize portions of string during comparison when needed.
Q: Swift strings support comparison via normalization. Has use of canonical string equality been a performance issue?
Or been a source of surprise to programmers?
A: This was a big performance issue on Linux, where we used to do UCA+DUCET based comparisons.
We switch to lexicographical order of NFC-normalized UTF-16 code units (future: scalar values),
and saw a very significant speed up there. The remaining performance work revolves around checking
and tracking whether a string is known to already be in a normal form, so we can just memcmp.
Q: I'm curious why this was a larger performance issue for Linux than for (presumably) macOS and/or iOS.
A: There were two main factors.
The first is that on Darwin platforms, CFString had an implementation that we used instead of UCA+DUCET which was faster.
The second is that Darwin platforms are typically up-to-date and have very recent versions of ICU.
On Linux, we still support Ubuntu LTS 14.04 which has a version of ICU which predates Swift and didn't have any fast-paths for ASCII or mostly-ASCII text.
Switching to our own implementation based on NFC gave us many X improvement over CFString, which in turn was many X faster than UCA+DUCET (especially on older versions of ICU).
Q: How firmly is the Swift string implementation tied to ICU?
If the C++ standard library were to add suitable Unicode support, what would motivate reimplementing Swift strings on top of it?
A: Swift's tie to ICU is less firm than it used to be
If the C++ standard library provided these operations, sufficiently up-to-date with Unicode version and comparable or better to ICU in performance,
we would be willing to switch. A big pain in interacting with ICU is their limited support for UTF-8.
Some users who would like to use a lighter-weight Swift and are unhappy at having to link against ICU, as it's fairly large, and it can complicate security audits.
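A rough Rust sketch of the comparison strategy described in this answer, assuming the unicode-normalization crate: an ASCII fast path where bytewise comparison is enough, otherwise compare the NFC-normalized scalar sequences.
use unicode_normalization::UnicodeNormalization;
fn canonically_equal(a: &str, b: &str) -> bool {
    if a.is_ascii() && b.is_ascii() {
        return a == b; // fast path: plain bytewise comparison
    }
    a.nfc().eq(b.nfc()) // compare normalized scalar values lazily
}
fn main() {
    assert!(canonically_equal("noe\u{0308}l", "no\u{00EB}l")); // decomposed vs composed ë
}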
https://forums.swift.org/t/pitch-unicode-for-string-processing/56907/6
[Pitch] Unicode for String Processing
https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md
jlf: surprising intro!
Swift strings provide an obsessively Unicode-forward model of programming with strings.
String processing with Collection's algorithms is woefully inadequate for many day-to-day
tasks compared to other popular programming and scripting languages.
We propose addressing this basic shortcoming through an effort we are calling regex.
https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md
Regex Proposals
todo: read String processing algorithms https://forums.swift.org/t/pitch-regex-powered-string-processing-algorithms/55969
todo: read Unicode for String Processing https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md
https://stackoverflow.com/questions/41059974/german-character-%C3%9F-uppercased-in-ss
"ß" is converted to "SS" when using uppercased().
---
Use caseInsensitiveCompare() instead of converting the strings to upper or lowercase:
let s1 = "gruß"
let s2 = "GRUß"
let eq = s1.caseInsensitiveCompare(s2) == .orderedSame
print(eq) // true
This compares the strings in a case-insensitive way according to the Unicode standard.
There is also localizedCaseInsensitiveCompare() which does a comparison according to the current locale, and
s1.compare(s2, options: .caseInsensitive, locale: ...)
for a case-insensitive comparison according to an arbitrary given locale.
https://www.kodeco.com/3418439-encoding-and-decoding-in-swift
jlf: off topic, it's not related to strings. It's about serialization of data structures.
https://github.com/apple/swift-evolution/blob/main/proposals/0241-string-index-explicit-encoding-offset.md
Deprecate String Index Encoded Offsets
Feb 23, 2019
jlf: I add this URL for this description, not for the topic covered by this proposal:
String abstracts away details about the underlying encoding used in its storage.
String.Index is opaque and represents a position within a String or Substring.
This can make serializing a string alongside its indices difficult, and for that
reason SE-0180 added a computed variable and initializer encodedOffset in Swift 4.0.
String was always meant to be capable of handling multiple backing encodings for
its contents, and this is realized in Swift 5. String now uses UTF-8 for its
preferred “fast” native encoding, but has a resilient fallback for strings of
different encodings. Currently, we only use this fall-back for lazily-bridged
Cocoa strings, which are commonly encoded as UTF-16, though it can be extended
in the future thanks to resilience.
Unfortunately, SE-0180’s approach of a single notion of encodedOffset is flawed.
A string can be serialized with a choice of encodings, and the offset is therefore
encoding-dependent and requires access to the contents of the string to calculate.
https://www.tutorialkart.com/swift-tutorial/swift-read-text-file/#gsc.tab=0
Read text file
import Foundation
let file = "sample.txt"
var result = ""
//if you get access to the directory
if let dir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first {
//prepare file url
let fileURL = dir.appendingPathComponent(file)
do {
result = try String(contentsOf: fileURL, encoding: .utf8)
}
catch {/* handle if there are any errors */}
}
print(result)
https://www.appsdeveloperblog.com/read-and-write-string-into-a-text-file/
Read and Write String Into a Text File
let fileName = "myFileName.txt"
var filePath = ""
// Find documents directory on device
let dirs : [String] = NSSearchPathForDirectoriesInDomains(FileManager.SearchPathDirectory.documentDirectory, FileManager.SearchPathDomainMask.allDomainsMask, true)
if dirs.count > 0 {
let dir = dirs[0] //documents directory
filePath = dir.appending("/" + fileName)
print("Local path = \(filePath)")
} else {
print("Could not find local directory to store file")
return
}
// Set the contents
let fileContentToWrite = "Text to be recorded into file"
do {
// Write contents to file
try fileContentToWrite.write(toFile: filePath, atomically: false, encoding: String.Encoding.utf8)
}
catch let error as NSError {
print("An error took place: \(error)")
}
// Read file content. Example in Swift
do {
// Read file content
let contentFromFile = try NSString(contentsOfFile: filePath, encoding: String.Encoding.utf8.rawValue)
print(contentFromFile)
}
catch let error as NSError {
print("An error took place: \(error)")
}
Testing JMB's example
"ς".uppercased() // "Σ"
"σ".uc // "Σ"
"ὈΔΥΣΣΕΎΣ".lowercased() // "ὀδυσσεύσ" NOT SUPPORTED last Σ becomes ς
"ὈΔΥΣΣΕΎΣA".lowercased() // "ὀδυσσεύσa" last Σ becomes σ
https://developer.apple.com/documentation/swift/character/isnewline
isNewline
A Boolean value indicating whether this character represents a newline.
For example, the following characters all represent newlines:
“\n” (U+000A): LINE FEED (LF)
U+000B: LINE TABULATION (VT)
U+000C: FORM FEED (FF)
“\r” (U+000D): CARRIAGE RETURN (CR)
“\r\n” (U+000D U+000A): CR-LF
U+0085: NEXT LINE (NEL)
U+2028: LINE SEPARATOR
U+2029: PARAGRAPH SEPARATOR
---
jlf: this is related to Unicode properties of a character.
But what are the impacts on file I/O?
Typst lang
https://github.com/typst/typst
A new markup-based typesetting system that is powerful and easy to learn.
---
jlf: uses ICU4X
https://github.com/unicode-org/icu4x/issues/3811
XPath lang
https://www.w3.org/TR/xpath-functions-31/#string-functions
Functions on strings
jlf:
to read
no "grapheme" in this document.
written by Michael Kay (XSLT WG), Saxonica <http://www.saxonica.com/>
https://www.w3.org/TR/xpath-functions-31/#string.match
String functions that use regular expressions
jlf: part of the doc "Functions on strings" above, explicitly referenced for
direct access.
https://www.w3.org/TR/xpath-functions-31/#func-collation-key
Referenced in https://github.com/unicode-org/icu4x/issues/2689#issuecomment-1743127855
hsivonen:
I'm quite skeptical that processes using XPath have the kinds of lifetimes
and numbers of comparisons that would justify computing a sort key, but whether
or not exposing sort keys in XPath is a good idea, it's good to know that XPath
has this dependency.
faassen:
I think the XPath spec (the library portion) has been influenced by the capabilities of ICU4J.
The motivation for this facility is described in the "notes" section:
https://www.w3.org/TR/xpath-functions-31/#func-collation-key
and is basically to use this as a collation-dependent hashmap key.
I can't judge myself how useful that is, so I'll defer to your skepticism.
I'll note however that this same specification also provides the function
library available to XQuery, and with XQuery the lifetimes and numbers of
comparisons are likely to be much bigger.
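To illustrate the collation-dependent hashmap-key idea (a rough Swift sketch of mine;
Foundation's folding(options:locale:) is only a crude stand-in for a real UCA sort key
such as fn:collation-key or an ICU sort key):
import Foundation

// Strings that are "equal" under the chosen fold collide on the same dictionary key,
// so the folding cost is paid once per string instead of once per pairwise comparison.
func lookupKey(_ s: String) -> String {
    s.folding(options: [.caseInsensitive, .diacriticInsensitive],
              locale: Locale(identifier: "en"))
}

var counts: [String: Int] = [:]
counts[lookupKey("Résumé"), default: 0] += 1
counts[lookupKey("resume"), default: 0] += 1
print(counts)   // ["resume": 2]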
Zig lang, Ziglyph
04/07/2021
https://github.com/jecolon/ziglyph
Unicode text processing for the Zig programming language.
https://devlog.hexops.com/2021/unicode-data-file-compression/
achieving 40-70% reduction over gzip alone
https://github.com/jecolon/ziglyph/issues/3
More size-optimal grapheme cluster sorting
08/02/2023
https://github.com/natecraddock/zf
a commandline fuzzy finder that prioritizes matches on filenames
To review: uses ziglyph
10/02/2023
https://github.com/jecolon/ziglyph/issues/20
Grapheme segmentation with ZWJ sequences
---
jlf: Executor is ok with utf8proc
t = "🐻❄️🐻❄️"~text
t~description= -- 'UTF-8 not-ASCII (2 graphemes, 8 codepoints, 26 bytes, 0 error)'
t~characters==
an Array (shape [8], 8 items)
1 : ( "🐻" U+1F43B So 2 "BEAR FACE" )
2 : ( "" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" )
3 : ( "❄" U+2744 So 1 "SNOWFLAKE" )
4 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" )
5 : ( "🐻" U+1F43B So 2 "BEAR FACE" )
6 : ( "" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" )
7 : ( "❄" U+2744 So 1 "SNOWFLAKE" )
8 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" )
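For comparison, the same ZWJ sequence in Swift (my quick check; Swift's Character is an
extended grapheme cluster, so each polar-bear ZWJ sequence counts as one Character on a
recent Swift/Unicode version):
let t = "\u{1F43B}\u{200D}\u{2744}\u{FE0F}\u{1F43B}\u{200D}\u{2744}\u{FE0F}"   // 🐻‍❄️🐻‍❄️

print(t.count)                  // 2  Characters (grapheme clusters)
print(t.unicodeScalars.count)   // 8  code points
print(t.utf8.count)             // 26 bytes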
https://devlog.hexops.com/2021/unicode-sorting-why-browsers-added-special-emoji-matching/
Whether your application is in Go and has its own Unicode Collation Algorithm
(UCA) implementation, or Rust and uses bindings to the popular ICU4C library -
one thing is going to remain true: it requires large data files to work.
The UCA algorithm depends on two quite large data table files to work:
- UnicodeData.txt for normalization, a step required before sorting can take place.
- allkeys.txt for weighting certain text above others.
- And more, if you want truly locale-aware sorting and not just “the default”
the UCA algorithm gives you.
Together, these files can add up to over half a megabyte.
While WASM languages could shell out to JavaScript browser APIs for collation,
I suspect they won’t due to the lack of guarantees around those APIs.
A more likely scenario is languages continuing to leave locale-aware sorting
as an optional, opt-in feature - that also makes your application larger.
I think this is a worthwhile problem to solve, so I am working on compression
algorithms for these files specifically in Zig to reduce them to only a few
tens of kilobytes.
https://github.com/jecolon/ziglyph/issues/3
Knock, knock.
Knock, knock.
Who’s there?
You.
You who?
Yoo-hoo! It's You Nicode.
Knock, knock.
Who’s there?
Sue.
Sue who?
It's Sue Nicode.