Accumulation of URLs about Unicode


Contents: Unicode standard Unicode general informations U+ notation, Unicode escape sequence Security title Segmentation, Grapheme Normalization, equivalence Character set String matching - Lower vs Casefold String matching - Collation Locale CLDR Common Locale Data Repository Case mappings Collation, sorting BIDI title Emoji Countries, flags Evidence of partial or wrong support of Unicode Optimization, SIMD Variation sequence Whitespaces, separators Hyphenation DNS title, Domain Name title, Domain Name System title All languages Classical languages Arabic language Indic languages CJK Korean Japanese Polish IME - Input Method Editor Text editing Text rendering, Text shaping library String Matching Fuzzy String Matching Levenshtein distance and string similarity String comparison JSON TOML serialization format CBOR Concise Binary Representation Binary encoding in Unicode Invalid format Mojibake Filenames WTF8 Codepoint/grapheme indexation Rope Encoding title ICU title ICU demos ICU bindings ICU4X title utf8proc title Twitter text parsing terminal / console / cmd QT Title IBM OS IBM RPG Lang IBM z/OS macOS OS Windows OS Language comparison Regular expressions Test cases, test-cases, tests files font bold, italic, strikethrough, underline, backwards, upside down youtube xxx lang Ada lang Awk lang C++ lang, cpp lang, Boost cRexx lang DotNet, CoreFx Dafny lang Dart lang Elixir lang Factor lang Fortran lang GO lang jRuby lang Java lang JavaScript lang Julia lang Kotlin lang Lisp lang Mathematica lang netrexx lang Oracle Perl lang (Perl 6 has been renamed to Raku) PHP lang Python lang R lang RAKU lang Rakudo lang (Perl6, Perl 6, MOAR-VM) Rexx lang Ruby lang Rust lang Saxon lang SQL lang Swift lang Typst lang XPath lang Zig lang, Ziglyph Knock, knock.

Unicode standard


Remember Don't know why, but the Unicode consortium has 2 different URLs: https://unicode.org/ https://www.unicode.org/ To avoid doubling URLs, I use the 2nd form. https://home.unicode.org/ https://www.unicode.org/ (same as home.unicode.org) https://www.unicode.org/versions/ https://www.unicode.org/versions/latest/ (latest version) https://www.unicode.org/versions/enumeratedversions.html (current and previous versions) https://www.unicode.org/Public/ (datas for current and previous versions) https://www.unicode.org/ucd/ UCD = Unicode Character Database https://www.unicode.org/Public/MAPPINGS (ISO8859) These tables are considered to be authoritative mappings between the Unicode Standard and different parts of the ISO/IEC 8859 standard. https://www.unicode.org/faq/specifications.html https://www.unicode.org/reports/ Unicode® Technical Reports A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR. Unicode Standard Annex (UAX) UAX #9, The Unicode Bidirectional Algorithm https://www.unicode.org/reports/tr9/ UAX #11, East Asian Width https://www.unicode.org/reports/tr11/ UAX #14, Unicode Line Breaking Algorithm https://www.unicode.org/reports/tr14/ UAX #15, Unicode Normalization Forms https://www.unicode.org/reports/tr15/ UAX #24, Unicode Script Property https://www.unicode.org/reports/tr24/ UAX #29, Unicode Text Segmentation https://www.unicode.org/reports/tr29/ UAX #31, Unicode Identifier and Pattern Syntax https://www.unicode.org/reports/tr31/ UAX #34, Unicode Named Character Sequences https://www.unicode.org/reports/tr34/ UAX #38, Unicode Han Database (Unihan) https://www.unicode.org/reports/tr38/ UAX #41, Common References for Unicode Standard Annexes https://www.unicode.org/reports/tr41/ UAX #42, Unicode Character Database in XML https://www.unicode.org/reports/tr42/ UAX #44, Unicode Character Database https://www.unicode.org/reports/tr44/ UAX #45, U-Source Ideographs https://www.unicode.org/reports/tr45/ UAX #50, Unicode Vertical Text Layout https://www.unicode.org/reports/tr50/ Unicode Technical Standard (UTS) UTS #22, UNICODE CHARACTER MAPPING MARKUP LANGUAGE (CharMapML) https://www.unicode.org/reports/tr22/ This document specifies an XML format for the interchange of mapping data for character encodings, and describes some of the issues connected with the use of character conversion. https://www.unicode.org/glossary Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF. Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set. Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. Unicode Scalar Value. 
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF inclusive.
UNICODE COLLATION ALGORITHM
Unicode has an official string collation algorithm called UCA
https://www.unicode.org/reports/tr10/
https://www.unicode.org/reports/tr10/#S2.1.1 The Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.
08/06/2021 Default Unicode Collation Element Table (DUCET) For the latest version, see: https://www.unicode.org/Public/UCA/latest/allkeys.txt
---
UTS10-D1. Collation Weight: A non-negative integer used in the UCA to establish a means for systematic comparison of constructed sort keys.
UTS10-D2. Collation Element: An ordered list of collation weights.
UTS10-D3. Collation Level: The position of a collation weight in a collation element.
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms UNICODE NORMALIZATION FORMS
https://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax
https://www.unicode.org/reports/tr31/ UNICODE IDENTIFIER AND PATTERN SYNTAX
jlf: there is ONE (just ONE) occurrence of NFKC_CF: Comparison and matching should be done after converting to NFKC_CF format. Thus #MötleyCrüe should match #MÖTLEYCRÜE and other variants.
---
In the Unicode Standard PDF:
- The mapping NFKC_Casefold (short alias NFKC_CF) is specified in the data file DerivedNormalizationProps.txt in the Unicode Character Database.
- The derived binary property Changes_When_NFKC_Casefolded is also listed in the data file DerivedNormalizationProps.txt in the Unicode Character Database.
Conformance, §3.13 Default Case Algorithms (p. 156): For more information on the use of NFKC_Casefold and caseless matching for identifiers, see Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax”.
https://www.unicode.org/reports/tr51/ Unicode emoji 23/05/2021
https://www.unicode.org/notes/tn28/ UNICODEMATH, A NEARLY PLAIN-TEXT ENCODING OF MATHEMATICS
Example expressions: abc/d, a + c/d, and the binomial theorem (a + b)^n = ∑_k (n choose k) a^k b^(n−k).
https://www.unicode.org/notes/tn5/ Unicode Technical Note #5 CANONICAL EQUIVALENCE IN APPLICATIONS
https://icu.unicode.org/design/normalizing-to-shortest-form Canonically Equivalent Shortest Form (CESF) This is usually, but not always, the NFC form.
Conformance
https://github.com/unicode-org/conformance This repository provides tools and procedures for verifying that an implementation is working correctly according to the data-based specifications. The tests are implemented on several platforms including NodeJS (JavaScript), ICU4X (RUST), ICU4C, etc. Data Driven Test was initiated in 2022 at Google. The first release of the package was delivered in October, 2022.
https://www.unicode.org/main.html Unicode® Technical Site
https://www.unicode.org/faq/
https://www.unicode.org/faq/char_combmark.html Characters and Combining Marks
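A minimal Python 3 sketch of the code point / code unit distinction defined in the glossary entries above (the sample string is my own, not taken from the glossary):

s = "a\u00E9\u20AC\U0001F600"          # 'a', 'é', '€', '😀'
print(len(s))                           # 4 code points (Python strings are sequences of code points)
print(len(s.encode("utf-8")))           # 10 8-bit code units: 1 + 2 + 3 + 4 bytes
print(len(s.encode("utf-16-le")) // 2)  # 5 16-bit code units: the emoji needs a surrogate pair
print(len(s.encode("utf-32-le")) // 4)  # 4 32-bit code units: one per code point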

Unicode general information


https://codepoints.net/ Very detailed description of each character. Source of the WEB site: https://github.com/Codepoints/codepoints.net
https://util.unicode.org/UnicodeJsps/ A lot of information about a character.
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/UTF-32
http://xahlee.info/comp/unicode_index.html
http://xahlee.info/comp/unicode_invert_text.html Inverted text: :ʇxǝʇ pǝʇɹǝʌuI
http://xahlee.info/comp/unicode_animals.html T-REXX: 🦖
https://www.fontspace.com/unicode/analyzer
https://www.compart.com/en/unicode/
22/05/2021 https://onlineunicodetools.com/ Online Unicode tools is a collection of useful browser-based utilities for manipulating Unicode text.
28/05/2021 https://unicode.scarfboy.com/ Search tool. Provides plenty of information about Unicode characters but not the UTF-16 encoding.
https://unicode-table.com/en/ Search by name. Provides the UTF-16 encoding.
https://www.minaret.info/test/menu.msp Minaret Unicode Tests: Case Folding, Character Type, Collation, Normalization, Sorting, Transliteration
https://www.gosecure.net/blog/2020/08/04/unicode-for-security-professionals/ Unicode for Security Professionals by Philippe Arteau | Aug 4, 2020 jlf: this article covers many of the Unicode characteristics
https://github.com/bits/UTF-8-Unicode-Test-Documents Every Unicode character / codepoint in files and a file generator
http://www.ltg.ed.ac.uk/~richard/utf-8.html Lets you convert UTF-8 to codepoint + symbolic name
https://blog.lunatech.com/posts/2009-02-03-what-every-web-developer-must-know-about-url-encoding
https://mothereff.in/utf-8 UTF-8 encoder/decoder
https://corp.unicode.org/pipermail/unicode/ The Unicode Archives January 2, 2014 - current
https://www.unicode.org/mail-arch/unicode-ml/ March 21, 2001 - April 2, 2020
https://www.unicode.org/mail-arch/unicode-ml/Archives-Old/ October 11, 1994 - March 19, 2001
https://www.unicode.org/search/ Search Unicode.org
https://www.w3.org/TR/charmod/ Character Model for the World Wide Web 1.0: Fundamentals
https://www.johndcook.com/blog/2021/11/01/number-sets-html/ Number sets in HTML and Unicode: ℕ U+2115, ℤ U+2124, ℚ U+211A, ℝ U+211D, ℂ U+2102, ℍ U+210D
https://gregtatum.com/writing/2021/encoding-text-utf-32-utf-16-unicode/
https://gregtatum.com/writing/2021/encoding-text-utf-8-unicode/
https://lwn.net/Articles/667669/ Is the current Unicode design impractical? jlf: this link is also in the section Raku Lang because it's about Perl6. jlf: worth reading.
https://www.sciencedirect.com/science/article/pii/S1742287613000595 Unicode search of dirty data. This paper discusses problems arising in digital forensics with regard to Unicode, character encodings, and search. It describes how multipattern search can handle the different text encodings encountered in digital forensics and a number of issues pertaining to proper handling of Unicode in search patterns. Finally, we demonstrate the feasibility of the approach and discuss the integration of our developed search engine, lightgrep, with the popular bulk_extractor tool.
--- There are UTF-16LE strings which contain completely different UTF-8 strings as prefixes. For example the byte sequence which is “nonsense” in UTF-8 is 潮獮湥敳 in UTF-16LE (!)
"nonsense"~c2x= -- '6E6F6E73656E7365' "nonsense"~text("utf16be")~c2x= -- '6E6F 6E73 656E 7365' "nonsense"~text("utf16be")~c2u= -- 'U+6E6F U+6E73 U+656E U+7365' "nonsense"~text("utf16be")~utf8= -- T'湯湳敮獥' Le potage "nonsense"~text("utf16le")~c2x= -- '6E6F 6E73 656E 7365' "nonsense"~text("utf16le")~c2u= -- 'U+6F6E U+736E U+6E65 U+6573' "nonsense"~text("utf16le")~utf8= -- T'潮獮湥敳' marée https://github.com/simsong/bulk_extractor http://t-a-w.blogspot.com/2008/12/funny-characters-in-unicode.html SKULL AND CROSSBONES SNOWMAN POSTAL MARK FACE APL FUNCTIONAL SYMBOL TILDE DIAERESIS ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM THAI CHARACTER KHOMUT GLAGOLITIC CAPITAL LETTER SPIDERY HA VERY MUCH GREATER-THAN NEITHER LESS-THAN NOR GREATER-THAN HEAVY BLACK HEART FLORAL HEART BULLET, REVERSED ROTATED INTERROBANG 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO 𠂊 (U+2008A) Han Character https://www.unicode.org/udhr/ UDHR in Unicode The goal of the UDHR in Unicode project is to demonstrate the use of Unicode for a wide variety of languages, using the Universal Declaration of Human Rights (UDHR) as a representative text. https://github.com/jagracey/Awesome-Unicode Awesome Unicode https://cldr.unicode.org/index/charts CLDR Charts By-Type Chart: Numbers:Symbols Question I am using the following code excerpt to format numbers: LocalizedNumberFormatter lnFmt = NumberFormatter.withLocale(Locale.US).unit(MeasureUnit.CELSIUS).unitWidth(NumberFormatter.UnitWidth.SHORT); System.out.println(lnFmt.format(-10).toString()); In the resulting string, minus sign is represented as 0x2d (ASCII HYPHEN-MINUS). Shouldn't it be U+2212 (Unicode MINUS SIGN)? Answer You can see the minus sign symbol being used for each locale here: https://unicode-org.github.io/cldr-staging/charts/latest/by_type/numbers.symbols.html#2f08b5ebf85e1e8b U+2212 is used in: ·fa· ·ps· ·uz_Arab· ·eo· ·et· ·eu· ·fi· ·fo· ·gsw· ·hr· ·kl· ·ksh· ·lt· ·nn· ·no· ·rm· ·se· ·sl· ·sv· Question Where this list of locales was taken from? I am particulary interested in ‘ru’: why U+2212 is not used for it? https://stackoverflow.com/questions/10143836/why-is-there-no-utf-24 Why is there no UTF-24? [duplicate] Well, the truth is : UTF-24 was suggested in 2007 : https://www.unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html Possible Duplicate: Why UTF-32 exists whereas only 21 bits are necessary to encode every character? https://stackoverflow.com/questions/6339756/why-utf-32-exists-whereas-only-21-bits-are-necessary-to-encode-every-character https://unicodebook.readthedocs.io/ Book "Programming with Unicode" 2010-2011, Victor Stinner jlf: only one occurrence of the word "grapheme". Maybe at that time, it was not obvious that it would become an important concept. https://mcilloni.ovh/2023/07/23/unicode-is-hard/ Unicode is harder than you think 23 Jul 2023 --- jlf: good overview, with some ICU samples. https://www.kermitproject.org/utf8.html UTF-8 SAMPLER Last update: Sun Mar 12 14:21:05 2023 http://www.inter-locale.com/whitepaper/learn/learn-to-test.html International Testing Basics Testing non-English and non-ASCII (and/or Unicode) support in a product requires tests and test plans that exercise the edge cases in the software. 
https://www.youtube.com/watch?v=gd5uJ7Nlvvo Plain Text - Dylan Beattie - NDC Copenhagen 2022 --- jlf: many comments say it's a good talk; did not watch yet. todo: watch
https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/ Unicode, UTF8 & Character Sets: The Ultimate Guide jlf: maybe to read
https://tonsky.me/blog/unicode/ The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)
https://news.ycombinator.com/item?id=37735801 What every software developer must know about Unicode in 2023 jlf: nothing new in this article, just reusing info from other sites. jlf: did not read all the comments.

U+ notation, Unicode escape sequence


29/05/2021 https://stackoverflow.com/questions/1273693/why-is-u-used-to-designate-a-unicode-code-point/8891355
The Python language defines the following string literals:
u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a Unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a Unicode character denoted by eight hex digits
\N{name} Character named name in the Unicode database
\uxxxx Character with 16-bit hex value xxxx. Exactly four hex digits are required.
\Uxxxxxxxx Character with 32-bit hex value xxxxxxxx. Exactly eight hex digits are required.
https://www.perl.com/article/json-unicode-and-perl-oh-my-/ JSON's \uXXXX escapes support only characters within Unicode's BMP; to store emoji or other non-BMP characters you either have to encode to UTF-8 directly, or indicate a UTF-16 surrogate pair in \uXXXX escapes.
https://corp.unicode.org/pipermail/unicode/2021-April/009410.html Need reference to good ABNF for \uXXXX syntax
https://bit.ly/UnicodeEscapeSequences Unicode Escape Sequences Across Various Languages and Platforms
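A small Python 3 sketch of the escape notations listed above:

s1 = "\u00E9"                                # 4 hex digits: U+00E9
s2 = "\U0001F600"                            # 8 hex digits, needed beyond the BMP: U+1F600
s3 = "\N{LATIN SMALL LETTER E WITH ACUTE}"   # by Unicode character name
print(s1, s2, s3)                            # é 😀 é
print(s1 == s3)                              # True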

Security title


https://www.unicode.org/reports/tr39 UNICODE SECURITY MECHANISMS https://www.unicode.org/Public/security/latest/confusables.txt https://en.wikipedia.org/wiki/Homoglyph https://www.trojansource.codes/ https://api.mtr.pub/vhf/confusable_homoglyphs https://util.unicode.org/UnicodeJsps/confusables.jsp https://www.w3.org/TR/charmod-norm/#normalizationLimitations Confusable characters: "ΡРP"~text~characters== an Array (shape [3], 3 items) 1 : ( "Ρ" U+03A1 Lu 1 "GREEK CAPITAL LETTER RHO" ) 2 : ( "Р" U+0420 Lu 1 "CYRILLIC CAPITAL LETTER ER" ) 3 : ( "P" U+0050 Lu 1 "LATIN CAPITAL LETTER P" ) These confusable characters are not impacted by the lump option: "ΡРP"~text~nfc(lump:)~characters -- same result https://www.unicode.org/reports/tr36/#visual_spoofing UNICODE SECURITY CONSIDERATIONS http://www.unicode.org/reports/tr55/ Draft Unicode® Technical Standard #55 UNICODE SOURCE CODE HANDLING --- While the normative material for computer language specifications is part of the Unicode Standard, in Unicode Standard Annex #31, Unicode Identifiers and Syntax [UAX31], the algorithms specific to the display of source code or to higher-level diagnostics are specified in this document. Note: While, for the sake of brevity, many of the examples in this document make use of non-ASCII identifiers, most of the issues described here apply even if non-ASCII characters are confined to strings and comments. --- 3.1.1 Normalization and Case Case-insensitive languages should meet requirement UAX31-R4 with normalization form KC, and requirement UAX31-R5 with full case folding. They should ignore default ignorable code points in comparison. Conformance with these requirements and ignoring of default ignorable code points may be achieved by comparing identifiers after applying the transformation toNFKC_Casefold. Note: Full case folding is preferable to simple case folding, as it better matches expectations of case-insensitive equivalence. The choice between Normalization Form C and Normalization Form KC should match expectations of identifier equivalence for the language. In a case-sensitive language, identifiers are the same if and only if they look the same, so Normalization Form C (canonical equivalence) is appropriate, as canonical equivalent sequences should display the same way. In a case-insensitive language, the equivalence relation between identifiers is based on a more abstract sense of character identity; for instance, e and E are treated as the same letter. Normalization Form KC (compatibility equivalence) is an equivalence between characters that share such an abstract identity. Example: In a case-insensitive language, SO and so are the same identifier; if that language uses Normalization Form KC, the identifiers so and 𝖘𝖔 are likewise identical. Unicode 15.1 [icu-design] ICU 74 API proposal: bidiSkeleton and LTR- and RTL-confusabilities The Source Code Working Group, a limited-duration working group under the Properties & Algorithms Group of the Unicode Technical Committee, has added a new bidi-aware concept of confusability to UTS #39 in Unicode Version 15.1; until publication see the proposed update, https://www.unicode.org/reports/tr39/tr39-27.html#Confusable_Detection. The new UTS #55, Unicode Source Code Handling, to be published simultaneously with Unicode Version 15.1, recommends the use of this new kind of confusability: https://www.unicode.org/reports/tr55/tr55-2.html#Confusable-Detection. 
https://semanticdiff.com/blog/pull-request-unicode-tricks/ Unicode tricks in pull requests: Do review tools warn us?
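A Python 3 sketch (standard unicodedata module only) of the confusable characters shown above with Executor:

import unicodedata
for ch in "ΡРP":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+03A1 GREEK CAPITAL LETTER RHO
# U+0420 CYRILLIC CAPITAL LETTER ER
# U+0050 LATIN CAPITAL LETTER P
# NFC does not merge them: they stay three distinct, confusable code points.
print(unicodedata.normalize("NFC", "ΡРP") == "ΡРP")   # True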

Segmentation, Grapheme


29/05/2021 https://github.com/alvinlindstam/grapheme https://pypi.org/project/grapheme/ Here too, he says that CR+LF is a grapheme... Same here: https://www.reddit.com/r/programming/comments/m274cg/til_rn_crlf_is_a_single_grapheme_cluster/ https://www.unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters 01/06/2021 https://halt.software/optimizing-unicodes-grapheme-cluster-break-algorithm/ They claim this improvement: For the simple data set, this was 0.38 of utf8proc time. For the complex data set, this was 0.56 of utf8proc time. 01/06/2021 https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ GraphemeCursor Cursor-based segmenter for grapheme clusters. GraphemeIndices External iterator for grapheme clusters and byte offsets. Graphemes External iterator for a string's grapheme clusters. USentenceBoundIndices External iterator for sentence boundaries and byte offsets. USentenceBounds External iterator for a string's sentence boundaries. UWordBoundIndices External iterator for word boundaries and byte offsets. UWordBounds External iterator for a string's word boundaries. UnicodeSentences An iterator over the substrings of a string which, after splitting the string on sentence boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. UnicodeWords An iterator over the substrings of a string which, after splitting the string on word boundaries, contain any characters with the Alphabetic property, or with General_Category=Number. https://github.com/knighton/unicode Minimalist Unicode normalization/segmentation library. Python and C++. Abandonned, last commit 21/05/2015 https://hsivonen.fi/string-length/ First published: 2019-09-08 It’s Not Wrong that "🤦🏼‍♂️".length == 7 But It’s Better that "🤦🏼‍♂️".len() == 17 and Rather Useless that len("🤦🏼‍♂️") == 5 But I Want the Length to Be 1! jlf: "🤦🏼‍♂️"~text~length= -- 1 "🤦🏼‍♂️"~text~characters== an Array (shape [5], 5 items) 1 : ( "🤦" U+1F926 So 2 "FACE PALM" ) 2 : ( "🏼" U+1F3FC Sk 2 "EMOJI MODIFIER FITZPATRICK TYPE-3" ) 3 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 4 : ( "♂" U+2642 So 1 "MALE SIGN" ) 5 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) 07/06/2021 https://news.ycombinator.com/item?id=20914184 String lengths in Unicode Claude Roux We went through a lot of pain to get this right in Tamgu ( https://github.com/naver/tamgu ). In particular, emojis can be encoded across 5 or 6 Unicode characters. A "black thumb up" is encoded with 2 Unicode characters: the thumb glyph and its color. This comes at a cost. Every time you extract a sub-string from a string, you have to scan it first for its codepoints, then convert character positions into byte positions. One way to speed up stuff a bit, is to check if the string is in ASCII (see https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u ) and apply regular operator then. We implemented many techniques based on "intrinsics" instructions to speed up conversions and search in order to avoid scanning for codepoints. See https://github.com/naver/tamgu/blob/master/src/conversion.cxx for more information. https://github.com/naver/tamgu/wiki/4.-Speed-up-UTF8-string-processing-with-Intel's-%22intrinsics%22-instructions-(en) jlf: they have specific support for Korean... Probably because the NAVER company is from Republic of Korea ? 
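A sketch of the string-length comparison above, using the grapheme package linked at the start of this section (grapheme.length as documented in its README) plus the standard encodings:

import grapheme                         # pip install grapheme
s = "🤦🏼‍♂️"                               # FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VS16
print(len(s))                           # 5  code points
print(len(s.encode("utf-8")))           # 17 UTF-8 code units
print(len(s.encode("utf-16-le")) // 2)  # 7  UTF-16 code units
print(grapheme.length(s))               # 1  extended grapheme cluster (with current segmentation rules)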
08/06/2021 https://twitter.com/hashtag/tamgu?src=hashtag_click https://twitter.com/hashtag/TAL?src=hashtag_click #tamgu le #langage_de_programmation pour le Traitement Automatique des Langues (#TAL). jlf 30/09/2021 I have a doubt about that: Is 👩‍👨‍👩‍👧' really a grapheme? When moving the cursor in BBEdit, I see a boundary between each character. [later] Ok, when moving the cursor in Visual Studio Code, it's really a unique grapheme, no way to put the cursor "inside". And the display is aligned with what I see in Google Chrome : one WOMAN followed by a family, and no way to put the cursor between the WOMAN and the family. --- https://www.unicode.org/review/pr-27.html (old, talk about Unicode 4) https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries (todo: review occurences of ZWJ) 29/10/2021 https://h3manth.com/posts/unicode-segmentation-in-javascript/ https://github.com/tc39/proposal-intl-segmenter https://news.ycombinator.com/item?id=21690326 Tailored grapheme clusters Grapheme clusters are locale-dependent, much like string collation is locale-dependent. What Unicode gives you by default, the (extended) grapheme cluster, is as useful as the DUCET (Default Unicode Collation Element Table); while you can live with them, you would be unsatisfied. In fact there are tons of Unicode bugs that can't be corrected due to the compatibility reason, and can only be fixed via tailored locale-dependent schemes. --- Hangul normalization and collation is broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explictly devotes two sections related to Hangul; the first section, for "trailing weights" [1], is recommended for the detailed explanation. The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require the tailoring to grapheme clusters. Depending on the view, you can also consider orthographic digraphs as examples (Dutch "ij" is sometimes considered a single character for example). https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme What's the difference between a character, a code point, a glyph and a grapheme? jlf: not very good... https://github.com/clipperhouse/words words is a command which splits strings into individual words, as defined by Unicode. It accepts text from stdin, and writes one word (token) per line to stdout. https://www.unicode.org/reports/tr29/#Random_Access jlf: Executor uses indexers for random access (ako breadcrumbs). Random access introduces a further complication. When iterating through a string from beginning to end, a regular expression or state machine works well. From each boundary to find the next boundary is very fast. By constructing a state table for the reverse direction from the same specification of the rules, reverse iteration is possible. However, suppose that the user wants to iterate starting at a random point in the text, or detect whether a random point in the text is a boundary. If the starting point does not provide enough context to allow the correct set of rules to be applied, then one could fail to find a valid boundary point. For example, suppose a user clicked after the first space after the question mark in “Are␣you␣there?␣ ␣No,␣I’m␣not”. On a forward iteration searching for a sentence boundary, one would fail to find the boundary before the “N”, because the “?” had not been seen yet. A second set of rules to determine a “safe” starting point provides a solution. 
Iterate backward with this second set of rules until a safe starting point is located, then iterate forward from there. Iterate forward to find boundaries that were located between the safe point and the starting point; discard these. The desired boundary is the first one that is not less than the starting point. The safe rules must be designed so that they function correctly no matter what the starting point is, so they have to be conservative in terms of finding boundaries, and only find those boundaries that can be determined by a small context (a few neighboring characters). This process would represent a significant performance cost if it had to be performed on every search. However, this functionality can be wrapped up in an iterator object, which preserves the information regarding whether it currently is at a valid boundary point. Only if it is reset to an arbitrary location in the text is this extra backup processing performed. The iterator may even cache local values that it has already traversed. Unicode 15.1 New rule GB9c for grapheme segmentation. https://www.unicode.org/reports/tr29/ --- No longer available: https://www.unicode.org/reports/tr29/proposed.html --- jlf: saw this review note "the new rule GB9c has been implemented in CLDR and ICU as a profile for some years" What is a profile? --- This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments and, for the purpose of claiming conformance, document the tailoring in the form of a profile. ... Note that a profile can both add and remove boundary positions, compared to the results specified by UAX29-C1-1, UAX29-C2-1, or UAX29-C3-1. https://github.com/unicode-org/lstm_word_segmentation Python code for training an LSTM model for word segmentation in Thai, Burmese, and similar languages.

Normalization, equivalence


https://www.unicode.org/faq/normalization.html Normalization FAQ https://www.macchiato.com/unicode-intl-sw/nfc-faq NFC FAQ jlf: MUST READ! https://www.unicode.org/reports/tr15 UNICODE NORMALIZATION FORMS 26/11/2013 Text normalization in Go https://blog.golang.org/normalization 27/11/2013 The string type is broken https://mortoray.com/2013/11/27/the-string-type-is-broken/ https://news.ycombinator.com/item?id=6807524 https://www.reddit.com/r/programming/comments/1rkdip/the_string_type_is_broken/ In the comments Objective-C’s NSString type does correctly upper-case baffle into BAFFLE. (where the rectangle is a grapheme showing 2 small 'f') Q: What about getting the first three characters of “baffle”? Is “baf” the correct answer? A: That’s a good question. I suspect “baf” is the correct answer, and I wonder if there is any library that does it. I suspect if you normalize it first (since the ffl would disappear I think). A: The ligarture disappears in NFK[CD] but not in NF[CD]. Whether normalization to NFK[CD] is a good idea depends (as always) on the situation. For visual grapheme cluster counting, one would convert the entire text to NFKC. For getting teaser text from an article i would not a normalization step and let a ligature count as just one grapheme cluster even if it may resemble three of them logically. I assume, that articles are stored in NFC (the nondestructive normalization form with smallest memory footprint). The Unicode standard does not treat ligatures as containing more than one grapheme cluster for that normalization forms that permits them. So “efflab” (jlf: efflab) is the correct result of reversing “baffle” (jlf: baffle) and “baffle”[2] has to return “ffl” even when working on the grapheme cluster level! There may or may not be a need for another grapheme cluster definition that permits splitting of ligatures in NF[CD]. A straight forward way to implement a reverse function adhering to that special definition would NFKC each Unicode grapheme cluster on the fly. When that results in multiple Unicode grapheme clusters, that are used – else the original is preserved (so that “ℕ” does not become “N”). The real problem is to find a good name for that special interpretation of a grapheme cluster… Note : see also the comment of Tom Christiansen about casing. I don't copy-paste here, too long. https://github.com/blackwinter/unicode Unicode normalization library. (Mirror of Yoshida-san's code base to maintain the RubyGem.) Abandonned, last commit 07/07/2016 https://github.com/sjorek/unicode-normalization An enhanced facade to existing unicode-normalization implementations Last commit 25/03/2018 https://docs.microsoft.com/en-us/windows/win32/intl/using-unicode-normalization-to-represent-strings Using Unicode Normalization to Represent Strings https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize String.prototype.normalize() The normalize() method returns the Unicode Normalization Form of the string. https://forums.swift.org/t/string-case-folding-and-normalization-apis/14663/3 For the comments https://en.wikipedia.org/wiki/Unicode_equivalence Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters. 
On Wed, Oct 28, 2020 at 9:54 AM Mark Davis ☕️ <mark@macchiato.com> wrote: Re: [icu-support] Options for Immutable Collation? I think your search for 'middle ground' is fruitless. An NFKD ordering is not correct for any human language, and changes with each new Unicode version. And even the default Unicode collation ordering is wrong for many languages, because there is no order that simultaneously satisfies all (eg German ordering and Swedish ordering are incompatible). Your 'middle ground' would be correct for nobody, and yet be unstable across Unicode versions; or worse yet, fail for new characters. IMO, the best practice for a file system (or like systems) is to store in codepoint order. When called upon to present a sorted list of files to a user, the displaying program should sort that list according to the user's language preferences.
You are right: for a deterministic/reproducible list sorting for a cross-platform filesystem API, anything more complex would be an implementation hazard. However, after reviewing both developer discussions and implementation of Unicode handling in 6+ filesystems, IDNA200X, PRECIS and getting roped into work on an IETF i18n filesystem best-practices RFC ... I've got some thoughts. Thoughts that I will put into a new thread after I do some experimenting : ). Thank you all so much!!! -Zach Lym
08/06/2021 https://fr.wikipedia.org/wiki/Normalisation_Unicode
NFD: characters are decomposed by canonical equivalence and reordered (canonical decomposition)
NFC: characters are decomposed by canonical equivalence, reordered, then composed by canonical equivalence (canonical decomposition followed by canonical composition)
NFKD: characters are decomposed by canonical and compatibility equivalence, and reordered (compatibility decomposition)
NFKC: characters are decomposed by canonical and compatibility equivalence, reordered, then composed by canonical equivalence (compatibility decomposition followed by canonical composition)
FCD: "Fast C or D" form; cf. UTN #5
FCC: "Fast C Contiguous"; cf. UTN #5
09/06/2021 Rust https://docs.rs/unicode-normalization
Decompositions: External iterator for a string decomposition's characters.
Recompositions: External iterator for a string recomposition's characters.
Replacements: External iterator for replacements for a string's characters.
StreamSafe: UAX15-D4: This iterator keeps track of how many non-starters there have been since the last starter in NFKD and will emit a Combining Grapheme Joiner (U+034F) if the count exceeds 30.
is_nfc: Authoritatively check if a string is in NFC.
is_nfc_quick: Quickly check if a string is in NFC, potentially returning IsNormalized::Maybe if further checks are necessary. In this case a check like s.chars().nfc().eq(s.chars()) should suffice.
is_nfc_stream_safe: Authoritatively check if a string is Stream-Safe NFC.
is_nfc_stream_safe_quick: Quickly check if a string is Stream-Safe NFC.
is_nfd: Authoritatively check if a string is in NFD.
is_nfd_quick: Quickly check if a string is in NFD.
is_nfd_stream_safe: Authoritatively check if a string is Stream-Safe NFD.
is_nfd_stream_safe_quick: Quickly check if a string is Stream-Safe NFD.
is_nfkc: Authoritatively check if a string is in NFKC.
is_nfkc_quick: Quickly check if a string is in NFKC.
is_nfkd: Authoritatively check if a string is in NFKD.
is_nfkd_quick: Quickly check if a string is in NFKD.
Enums IsNormalized The QuickCheck algorithm can quickly determine if a text is or isn’t normalized without any allocations in many cases, but it has to be able to return Maybe when a full decomposition and recomposition is necessary. 08/06/2021 Pharo https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43 https://github.com/duerst/eprun Efficient Pure Ruby Unicode Normalization (eprun) According to julia/utf8proc, the interesting part is the tests. https://corp.unicode.org/pipermail/unicode/2020-December/009150.html Normalization Generics (NFx, NFKx, NFxy) https://6guts.wordpress.com/2015/04/12/this-week-unicode-normalization-many-rts/ https://gregtatum.com/writing/2021/diacritical-marks/ DIACRITICAL MARKS IN UNICODE https://news.ycombinator.com/item?id=29751641 Unicode Normalization Forms: When ö ≠ ö https://blog.opencore.ch/posts/unicode-normalization-forms/ https://unicode-org.github.io/icu/userguide/transforms/normalization/ ICU Documentation Normalization Has a few comments about NFKC_Casefold - NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and removing ignorable characters which was introduced with Unicode 5.2. - Data Generation Tool https://stackoverflow.com/questions/56995429/will-normalizing-a-string-give-the-same-result-as-normalizing-the-individual-gra Will normalizing a string give the same result as normalizing the individual grapheme clusters? --- No, that generally is not true. The Unicode Standard warns against the assumption that concatenating normalised strings produces another normalised string. From UAX #15: In using normalization functions, it is important to realize that none of the Normalization Forms are closed under string concatenation. That is, even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized. https://stackoverflow.com/questions/7171377/separating-unicode-ligature-characters NFKD is no panacea: there are plenty of ligatures and other notionally combined forms it just does not work on at all. For example, it will not manage to decompose ß or ẞ to SS (even those there is a casefold thither!), nor Æ to AE or æ to ae, nor Œ to OE or œ to oe. It is also useless for turning ð or đ into d or ø into o. For all those things, you need the UCA (Unicode Collation Algorithm), not NFKD. NFD/NFKD also both have the annoying property of destroying singletons, if this matters to you. --- my understanding is that those decompositions you mention should not be done. They are not simply ligatures in the typographical sense, but real separate characters that are used differently! ß can be decomposed to ss if necessary (for example if you can only store ASCII), but they are not equivalent. The ff Ligature, on the other hand is only a typographical ligature.
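A minimal Python 3 sketch (unicodedata; is_normalized needs Python 3.8+) of the normalization behaviour discussed in this section, including the UAX #15 warning quoted above that NFC is not closed under concatenation:

import unicodedata as ud
s = "ba\uFB04e"                       # "baffle" written with the U+FB04 ffl ligature
print(ud.normalize("NFC", s) == s)    # True  - canonical forms keep the ligature
print(ud.normalize("NFKC", s))        # 'baffle' - compatibility forms expand it to f+f+l
a, b = "e", "\u0301"                  # each string is in NFC on its own
print(ud.is_normalized("NFC", a), ud.is_normalized("NFC", b))   # True True
print(ud.is_normalized("NFC", a + b))                           # False - 'e' + combining acute composes to U+00E9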

Character set


https://www.gnu.org/software/libc/manual/html_mono/libc.html#Character-Set-Handling

String matching - Lower vs Casefold


https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase https://www.w3.org/TR/charmod-norm/ Character Model for the World Wide Web: String Matching MUST READ, PLENTY OF EXAMPLES FOR CORNER CASES https://www.w3.org/TR/charmod-norm/#definitionCaseFolding Very good explanation! A few characters have a case folding that map one Unicode code point to two or more code points. This set of case foldings are called the full case foldings. character ß U+00DF LATIN SMALL LETTER SHARP S - The full case folding and the lower case mapping of this character is to two ASCII letters 's'. - The upper case mapping is to "SS". Because some applications cannot allocate additional storage when performing a case fold operation, Unicode provides a simple case folding that maps a code point that would normally fold to more or fewer code points to use a single code point for comparison purposes instead. Unlike the full folding, this folding invariably alters the content (and potentially the meaning) of the text. Unicode simple is not appropriate for use on the Web. character ᾛ [U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI] ᾛ ⇒ ἣι full case fold: U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA + U+03B9 GREEK SMALL LETTER IOTA ᾛ ⇒ ᾓ simple case fold: U+1F93 GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI Language Sensitivity Another aspect of case mapping and case folding is that it can be language sensitive. Unicode defines default case mappings and case foldings for each encoded character, but these are only defaults and are not appropriate in all cases. Some languages need case mapping to be tailored to meet specific linguistic needs. One example of this are Turkic languages written in the Latin script: Default Folding I ⇒ i Default folding of letter I Turkic Language Folding I ⇒ ı Turkic language folding of dotless (ASCII) letter I İ ⇒ i Turkic language folding of dotted letter I https://www.w3.org/TR/charmod-norm/#matchingAlgorithm There are four choices for text normalization: - Default. This normalization step has no effect on the text and, as a result, is sensitive to form differences involving both case and Unicode normalization. - ASCII Case Fold. Comparison of text with the characters case folded in the ASCII (Basic Latin, U+0000 to U+007F) range. - Unicode Canonical Case Fold. Comparison of text that is both case folded and has Unicode canonical normalization applied. - Unicode Compatibility Case Fold. Comparison of text that is both case folded and has Unicode compatibility normalization applied. This normalization step is presented for completeness, but it is not generally appropriate for use on the Web. https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html Elasticsearch Dealing with Human Language https://stackoverflow.com/questions/319426/how-do-i-do-a-case-insensitive-string-comparison Related to Python, but the comments are very general and worth reading. --- Unicode Standard section 3.13 has two other definitions for caseless comparisons: (D146, canonical) NFD(toCasefold(NFD(str))) on both sides and (D147, compatibility) NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) on both sides. It states the inner NFD is solely to handle a certain Greek accent character. 
https://boyter.org/posts/unicode-support-what-does-that-actually-mean/ https://news.ycombinator.com/item?id=23524400 ſecret == secret == Secret ſatisfaction == satisfaction == ſatiſfaction == Satiſfaction == SatiSfaction === ſatiSfaction Another good example to consider is the character Æ. Under simple case folding rules the lower of Æ is ǣ. However with full case folding rules this also matches ae. Which one is correct? Well that depends on who you ask. See also https://github.com/unicode-org/icu4x/issues/3151 in the section "ICU4X title". https://lwn.net/Articles/784316/ Working with UTF-8 in the kernel jlf: interesting read about NTFS caseless, and about a drama because of lack of support for the turkish case.
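A Python 3 sketch of the lower vs casefold distinction described above; note that the str methods use the default, locale-independent mappings, so the Turkish tailoring mentioned above needs a locale-aware library such as ICU:

print("ß".lower())       # 'ß'  - lowercase mapping leaves sharp s unchanged
print("ß".casefold())    # 'ss' - full case folding maps it to 'ss'
print("ſecret".casefold() == "secret".casefold())   # True - LATIN SMALL LETTER LONG S folds to 's'
print("İ".lower())       # 'i̇'  - i + U+0307 COMBINING DOT ABOVE (2 code points), default mapping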

String matching - Collation


https://unicode-org.github.io/icu/userguide/collation/string-search.html (ICU) String Search Service jlf: they give 3 issues applicable to text searching. Accented letters and conjoined letters are covered by Executor. But ignorable punctuation is not.

Locale


02/06/2021 https://www.php.net/manual/fr/function.setlocale.php Warning The locale information is maintained per process, not per thread. If you are running PHP on a multithreaded server API , you may experience sudden changes in locale settings while a script is running, though the script itself never called setlocale(). This happens due to other scripts running in different threads of the same process at the same time, changing the process-wide locale using setlocale(). On Windows, locale information is maintained per thread as of PHP 7.0.5. On Windows, setlocale(LC_ALL, '') sets the locale names from the system's regional/language settings (accessible via Control Panel). https://www.gnu.org/software/libc/manual/html_mono/libc.html#Locales Locales and Internationalization https://pubs.opengroup.org/onlinepubs/9699919799/ IEEE Std 1003.1-2017 Locale https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do/87763#87763 What does "LC_ALL=C" do? https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f027338b0fab0f5078971fbe stream_libarchive: workaround various types of locale braindeath (legendary C locales rant) https://stackoverflow.com/questions/30479607/explain-the-effects-of-export-lang-lc-ctype-and-lc-all The LANG, LC_CTYPE and LC_ALL are special environment variables which after they got exported to the shell environment, are available and ready to be rea by certain programs which supports a locale (natural language formatting for C). Each variable sets the C library's notion of natural language formatting style for particular sets of routines, for example: - LC_ALL - Set the entire locale generically - LC_CTYPE - Set a locale for the ctype and multibyte functions. This controls recognition of upper and lower case, alphabetic or non- alphabetic characters, and so on. and other such as LC_COLLATE (for string collation routines), LC_MESSAGES (for message catalogs), LC_MONETARY (for formatting monetary values), LC_NUMERIC (for formatting numbers), LC_TIME (for formatting dates and times). Regarding LANG, it is used as a substitute for any unset LC_* variable. See: man setlocale (BSD), man locale So when certain C functions are called (such as setlocale, ctype, multibyte, catopen, printf, etc.), they read the locale settings from the configuration files and local environment in order to control and format natural language formatting style as per C programming language standards. 
see: setlocale http://www.unix.com/man-page/freebsd/3/setlocale/ see: ctype http://www.unix.com/man-page/freebsd/3/ctype/ see: multibyte http://www.unix.com/man-page/freebsd/3/multibyte/ see: catopen http://www.unix.com/man-page/freebsd/3/catopen/ see:printf http://www.unix.com/man-page/freebsd/3/printf/ see: ISO C99 https://en.wikipedia.org/wiki/C99 see: C Library - <locale.h> https://www.tutorialspoint.com/c_standard_library/locale_h.htm AIX documentation https://www.ibm.com/docs/en/aix/7.1?topic=globalization-locales - Understanding locale - Understanding locale categories - Understanding locale environment variables - Understanding the locale definition source file - Multibyte subroutines - Wide character subroutines - Bidirectionality and character shaping - Code set independence - File name matching - Radix character handling - Programming model https://bugzilla.mozilla.org/show_bug.cgi?id=1612379 Narrow down the list of ICU locales we ship https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/data_management.md Data management in ICU4X https://pubs.opengroup.org/onlinepubs/9699919799/ localedef - define locale environment If the locale value begins with a slash, it shall be interpreted as the pathname of a file that was created in the output format used by the localedef utility; see OUTPUT FILES under localedef. Referencing such a pathname shall result in that locale being used for the indicated category.
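A Python 3 sketch of the process-wide locale machinery described above (the printed values depend on the system's LANG / LC_* settings):

import locale
locale.setlocale(locale.LC_ALL, "")          # adopt the environment's locale; affects the whole process, not one thread
print(locale.getlocale(locale.LC_COLLATE))   # e.g. ('en_US', 'UTF-8')
words = ["cote", "côte", "coté", "côté"]
print(sorted(words, key=locale.strxfrm))     # locale-aware order; plain sorted(words) uses code point order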

CLDR Common Locale Data Repository


19/06/2021 https://github.com/twitter/twitter-cldr-rb Ruby implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more. https://github.com/twitter/twitter-cldr-js JavaScript implementation of the ICU (International Components for Unicode) that uses the Common Locale Data Repository to format dates, plurals, and more. Based on twitter-cldr-rb. https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?filter=allissues CLDR tickets

Case mappings


Rule Final_Sigma in default case algorithms. https://github.com/php/php-src/pull/10268 jlf: difficult to implement, involves to scan arbitrarily far to the left and right of capital sigma. https://www.unicode.org/faq/casemap_charprop.html https://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java?noredirect=1&lq=1 Unicode-correct title case in Java https://docs.rs/unicode-case-mapping/latest/unicode_case_mapping/ Example assert_eq!(unicode_case_mapping::to_lowercase('İ'), ['i' as u32, 0x0307]); assert_eq!(unicode_case_mapping::to_lowercase('ß'), ['ß' as u32, 0]); assert_eq!(unicode_case_mapping::to_uppercase('ß'), ['S' as u32, 'S' as u32, 0]); assert_eq!(unicode_case_mapping::to_titlecase('ß'), ['S' as u32, 's' as u32, 0]); assert_eq!(unicode_case_mapping::to_titlecase('-'), [0; 3]); assert_eq!(unicode_case_mapping::case_folded('I'), NonZeroU32::new('i' as u32)); assert_eq!(unicode_case_mapping::case_folded('ß'), None); assert_eq!(unicode_case_mapping::case_folded('ẞ'), NonZeroU32::new('ß' as u32)); https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/titlecase.html fun Char.titlecase(): String val chars = listOf('a', 'Dž', 'ʼn', '+', 'ß') val titlecaseChar = chars.map { it.titlecaseChar() } val titlecase = chars.map { it.titlecase() } println(titlecaseChar) // [A, Dž, ʼn, +, ß] println(titlecase) // [A, Dž, ʼN, +, Ss] fun Char.titlecase(locale: Locale): String val chars = listOf('a', 'Dž', 'ʼn', '+', 'ß', 'i') val titlecase = chars.map { it.titlecase() } val turkishLocale = Locale.forLanguageTag("tr") val titlecaseTurkish = chars.map { it.titlecase(turkishLocale) } println(titlecase) // [A, Dž, ʼN, +, Ss, I] println(titlecaseTurkish) // [A, Dž, ʼN, +, Ss, İ] https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/177 jlf: good summary, was not so obvious before I understand there are simple and full case mappings... Also note that the Unicode standard only provides defaults for, but then goes on to say that locale/language specific mappings should really be used. The Unicode standard is very explicit that things like uppercase transformations should be able to handle language specific issues such as the Turkish dotted and dotless i, and that “ß” should be uppercased to “SS” in German. See: Q: Is all of the Unicode case mapping information in UnicodeData.txt? A: No. The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide the one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard. and A: The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text. https://www.b-list.org/weblog/2018/nov/26/case/ Truths programmers should know about case
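A Python 3 sketch of the one-to-many and titlecase mappings discussed above:

print("ß".upper())       # 'SS'  - full uppercase mapping, from SpecialCasing.txt
print("ß".casefold())    # 'ss'
print("ﬁn".upper())      # 'FIN' - the U+FB01 fi ligature uppercases to two letters
print("ǆ".title())       # 'ǅ'   - titlecase is a third case, distinct from upper and lower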

Collation, sorting


https://www.unicode.org/reports/tr35/tr35-collation.html UNICODE LOCALE DATA MARKUP LANGUAGE (LDML) PART 5: COLLATION 01/06/2021 https://github.com/jgm/unicode-collation https://hackage.haskell.org/package/unicode-collation Haskell implementation of the Unicode Collation Algorithm https://icu4c-demos.unicode.org/icu-bin/collation.html ICU Collation Demo https://www.enterprisedb.com/docs/epas/latest/epas_guide/03_database_administration/06_unicode_collation_algorithm/ Unicode Collation Algorithm https://www.minaret.info/test/collate.msp This page provides a means to convert a string of Unicode characters into a binary collation key using the Java language version ("icu4j") of the IBM International Components for Unicode (ICU) library. A collation key is the basis for sorting and comparing strings in a language-sensitive Unicode environment. A collation key is built using a "locale" (a designation for a particular laguage or a variant) and a comparison level. The levels supported here (Primary, Secondary, Tertiary, Quaternary and Identical) correspond to levels "L1" through "Ln" as described in Unicode Technical Standard #10 - Unicode Collation Algorithm. When comparing collation keys for two different strings, both keys must have been created using the same locale and comparison level in order to be meaningful. The two keys are compared from left to right, byte for byte until one of the bytes is not equal to the other. Whichever byte is numerically less than the other causes the source string for that collation key to sort before the other string. https://lemire.me/blog/2018/12/17/sorting-strings-properly-is-stupidly-hard/ It's the comments section which is interesting. https://discourse.julialang.org/t/sorting-strings-by-unicode-collation-order/11195 Not supported 03/08/2022 https://discourse.julialang.org/t/unicode-15-0-beta-and-sorting-collation/83090 https://www.unicode.org/emoji/charts-15.0/emoji-ordering.html https://en.wikipedia.org/wiki/Natural_sort_order Natural sort order is an ordering of strings in alphabetical order, except that multi-digit numbers are ordered as a single character. Natural sort order has been promoted as being more human-friendly ("natural") than the machine-oriented pure alphabetical order. For example, in alphabetical sorting "z11" would be sorted before "z2" because "1" is sorted as smaller than "2", while in natural sorting "z2" is sorted before "z11" because "2" is sorted as smaller than "11". Alphabetical sorting: z11 z2 Natural sorting: z2 z11 Functionality to sort by natural sort order is built into many programming languages and libraries. 02/06/2021 https://www.postgresql.org/message-id/flat/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E%40tripadvisor.com The dangers of streaming across versions of glibc: A cautionary tale SELECT 'M' > 'ஐ'; 'FULLWIDTH LATIN CAPITAL LETTER M' (U+FF2D) 'TAMIL LETTER AI' (U+0B90) Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. master:src/backend/utils/adt/varlena.c:1494,1497 These are the lines where postgres calls strcoll_l and strcoll, in order to sort strings in a locale aware manner. The reality is that there are different versions of glibc out there in the wild, and they do not sort consistently across versions/environments. 
https://collations.info/concepts/ a site devoted to working with Collations, Unicode, Encodings, Code Pages, etc in Microsoft SQL Server.
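The natural sort order described above, as a small illustrative key function (a hypothetical helper, not taken from any of the linked libraries):

import re
def natural_key(s: str):
    # split into digit and non-digit runs; compare the digit runs numerically
    return [int(part) if part.isdigit() else part.casefold()
            for part in re.split(r"(\d+)", s)]

names = ["z11", "z2"]
print(sorted(names))                    # ['z11', 'z2'] - plain code point order
print(sorted(names, key=natural_key))   # ['z2', 'z11'] - natural order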

BIDI title


https://www.iamcal.com/understanding-bidirectional-text/ Understanding Bidirectional (BIDI) Text in Unicode
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics Unicode Bidirectional Algorithm basics (W3C) jlf: the examples are GIF images :-(( no way to copy-paste the characters.
https://www.unicode.org/notes/tn39/ BIDI BRACKETS FOR DUMMIES
https://stackoverflow.com/questions/5801820/how-to-solve-bidi-bracket-issues How to solve BiDi bracket issues?
https://gist.github.com/mvidner/e96ac917d9a54e09d9730220a34b0d24 Problems with Bidirectional (BiDi) Text
https://www.w3.org/International/questions/qa-bidi-unicode-controls How to use Unicode controls for bidi text
https://github.com/mvidner/bidi-test Testing bidirectional text
https://terminal-wg.pages.freedesktop.org/bidi/ BiDi in Terminal Emulators
http://fribidi.org/ GNU FriBidi is an implementation of the Unicode Bidirectional Algorithm (bidi). jlf: dead... The latest release is fribidi-0.19.7.tar.bz2 from August 4, 2015. This release is based on Unicode 6.2.0 character database. --- jlf: maybe not dead, but low activity... v1.0.13 https://github.com/fribidi/fribidi
https://news.ycombinator.com/item?id=37990523 Ask HN: Bidirectional Text Navigation
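A Python 3 sketch showing the Bidi_Class property (unicodedata.bidirectional) that UAX #9, linked above, takes as input:

import unicodedata
for ch in ["a", "א", "ع", "1", "(", " "]:
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>3} {unicodedata.name(ch)}")
# L = left-to-right, R / AL = right-to-left, EN = European number, ON = other neutral, WS = whitespace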

Emoji


https://www.unicode.org/Public/emoji/15.0/emoji-test.txt https://emojipedia.org/ http://xahlee.info/comp/unicode_emoji.html 29/05/2021 https://tonsky.me/blog/emoji/ 27/02/2023 https://news.ycombinator.com/item?id=34925446 Discussion about emoji and graphemes (again...). Nothing very interesting in this discussion. Remember: The "length" of a string in extended grapheme clusters is not stable across Unicode versions, which seems like a recipe for confusion. The length in code units is unambiguous and constant across versions. --- Executor: NinjaCat = "🐱‍👤" NinjaCat~description= 'UTF-8 not-ASCII (11 bytes)' NinjaCat~text~characters== an Array (shape [3], 3 items) 1 : ( "🐱" U+1F431 So 2 "CAT FACE" ) 2 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 3 : ( "👤" U+1F464 So 2 "BUST IN SILHOUETTE" )
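A Python 3 equivalent (sketch) of the Executor decomposition above: the "ninja cat" is a ZWJ sequence of two emoji, not a single code point.

import unicodedata
NinjaCat = "\U0001F431\u200D\U0001F464"
for ch in NinjaCat:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+1F431 CAT FACE
# U+200D  ZERO WIDTH JOINER
# U+1F464 BUST IN SILHOUETTE
print(len(NinjaCat))                   # 3 code points
print(len(NinjaCat.encode("utf-8")))   # 11 bytes, matching the ~description output above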

Countries, flags


22/05/2021 https://en.wikipedia.org/wiki/Regional_indicator_symbol Regional indicator symbol https://en.wikipedia.org/wiki/ISO_3166-1 ISO 3166-1 (Codes for the representation of names of countries and their subdivisions) https://observablehq.com/@jobleonard/which-unicode-flags-are-reversible
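A tiny Python sketch of how flag emoji are built from ISO 3166-1 alpha-2 codes with regional indicator symbols (pure arithmetic on code points).

def flag(alpha2: str) -> str:
    # map 'A'..'Z' to REGIONAL INDICATOR SYMBOL LETTER A..Z (U+1F1E6..U+1F1FF)
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

print(flag("FR"), flag("JP"), flag("UA"))   # whether a flag glyph is actually shown is up to the font/platform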

Evidence of partial or wrong support of Unicode


13/08/2013 We don’t need a string type https://mortoray.com/2013/08/13/we-dont-need-a-string-type/ 01/12/2013 Strings in Ruby are UTF-8 now… right? http://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/ 14/07/2017 Testing Ruby's Unicode Support http://blog.honeybadger.io/ruby-s-unicode-support/ 22/05/2021 Emoji.length == 2 https://news.ycombinator.com/item?id=13830177 Lot of comments, did not read all, to continue 22/05/2021 https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ Let's Stop Ascribing Meaning to Code Points 18/07/2021 https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/ Breaking Our Latin-1 Assumptions

Optimization, SIMD


08/06/2021 https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/ [obsolete] https://github.com/lemire/fastvalidate-utf-8 header-only library to validate utf-8 strings at high speeds (using SIMD instructions) jlf 2023/06/16 (now obsolete) NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: please adopt the simdutf library. It is much more powerful, faster and better tested. https://github.com/simdutf/simdutf simdutf: Unicode at gigabytes per second 08/06/2021 https://github.com/simdjson/simdjson simdjson : Parsing gigabytes of JSON per second The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++. Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s https://arxiv.org/abs/2010.03090 Validating UTF-8 In Less Than One Instruction Per Byte John Keiser, Daniel Lemire The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software. https://r-libre.teluq.ca/2178/ Recherche et analyse de solutions performantes pour le traitement de fichiers JSON dans un langage de haut niveau [r-libre/2178] Referenced from https://lemire.me/blog/ Daniel Lemire's blog – Daniel Lemire is a computer science professor at the University of Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist. https://github.com/simdutf/simdutf https://news.ycombinator.com/item?id=32700315 Unicode routines (UTF8, UTF16, UTF32): billions of characters per second using SSE2, AVX2, NEON, AVX-512. https://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/ (jlf: also referenced in the section "String comparison") How the JVM compares your strings using the craziest x86 instruction you've never heard of --- Comment from a Swift thread: https://forums.swift.org/t/string-s-abi-and-utf-8/17676/25 PCMPxSTRx is no longer faster than equivalent "simple" vector instruction sequences for straightforward comparisons (this had already been the case for a few years when that article was written, which is curious). It can be used productively (with some care) for some other operations like substring matching, but that's not as much of a heavy-hitter. There's a bunch of string stuff that will benefit from general vectorization, and which is absolutely on our roadmap to tackle, but using the PCMPxSTRx instructions specifically isn't a source of wins on the most important operations. https://news.ycombinator.com/item?id=34267936 Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake (lemire.me) https://www.reddit.com/r/java/comments/qafjtg/faster_charset_encoding/ Java 17 uses avx in both encoding and decoding https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/ Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)

Variation sequence


https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt https://www.unicode.org/Public/15.1.0/ucd/emoji/emoji-variation-sequences.txt # emoji-variation-sequences.txt 22/05/2021 List of all code points that can display differently via a variation sequence http://randomguy32.de/unicode/charts/standardized-variants/#emoji Safari is better to display the characters. Google Chrome and Opera have the same limitations: some characters are not supported (ex: section Phags-Pa). https://sethmlarson.dev/unicode-variation-selectors Mahjong tiles and Unicode variation selectors
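A quick Python illustration of the standardized variation sequences listed in emoji-variation-sequences.txt above: VS15 (U+FE0E) requests text presentation, VS16 (U+FE0F) requests emoji presentation; the actual rendering still depends on the font/platform.

UMBRELLA = "\u2602"              # U+2602 UMBRELLA has both a text and an emoji presentation
print(UMBRELLA + "\ufe0e")       # text presentation requested (VS15)
print(UMBRELLA + "\ufe0f")       # emoji presentation requested (VS16)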

Whitespaces, separators


22/05/2021 https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ A section about wcwidth. A section about spaces: There are actually two definitions of whitespace in Unicode. Unicode assigns every codepoint a category, and has three categories for what sounds like whitespace: “Separator, space”; “Separator, line”; “Separator, paragraph”. CR, LF, tab, and even vertical tab are all categorized as “Other, control” and not as separators. The only character in the “Separator, line” category is U+2028 LINE SEPARATOR, and the only character in “Separator, paragraph” is U+2029 PARAGRAPH SEPARATOR. Thankfully, all of these have the WSpace property. As an added wrinkle, the lone oddball character “⠀” renders like a space in most fonts. jlf: 2 cols x 3 lines of debossed dots. But it’s not whitespace, it’s not categorized as a separator, and it doesn’t have WSpace. It’s actually U+2800 BRAILLE PATTERN BLANK, the Braille character with none of the dots raised. (I say “most fonts” because I’ve occasionally seen it rendered as a 2×4 grid of open circles.)
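The category and whitespace properties mentioned above, checked with Python's standard unicodedata module.

import unicodedata

for ch, name in [("\u2028", "LINE SEPARATOR"),
                 ("\u2029", "PARAGRAPH SEPARATOR"),
                 ("\t", "CHARACTER TABULATION"),
                 ("\u2800", "BRAILLE PATTERN BLANK")]:
    print(f"U+{ord(ch):04X} {name}: category={unicodedata.category(ch)}, isspace={ch.isspace()}")
# U+2028 Zl True, U+2029 Zp True, tab Cc True (a control, but with the whitespace property),
# U+2800 So False (looks blank in many fonts, but is not whitespace)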

Hyphenation


break words into syllables I need to break words into syllables: astronomical --> as - tro - nom - ic - al Is it possible to do this (in different languages) using the ICU library? (if not, maybe you can suggest other tools for it?) Andreas Heigl: While it looks like this is not something for ICU[1], there are libraries out there handling that - most of the time based on the thesis of Franklin Mark Liang. I've built an implementation for PHP[2] but there are a lot of others out there[3]. [1] https://github.com/unicode-org/icu4x/issues/164#issuecomment-651410272 [2] https://github.com/heiglandreas/Org_Heigl_Hyphenator [3] https://github.com/search?q=hyphenate&type=repositories https://tug.org/docs/liang/liang-thesis.pdf
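A minimal sketch using the third-party Pyphen library (an assumption: it implements Liang-style hyphenation patterns, in the spirit of the libraries listed in [3] above); the exact break points depend on the pattern dictionary shipped for the chosen language.

import pyphen

dic = pyphen.Pyphen(lang="en_US")
print(dic.inserted("astronomical"))   # inserts '-' at the allowed hyphenation points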

DNS title, Domain Name title, Domain Name System title


http://lambda-the-ultimate.org/node/5674#comment-97016 jlf: I created this section because of this comment Have you ever looked at how international encoding of DNS names are done in URLs? It uses Punycode, and it's a disaster. Here's a good starting point to read up on this: https://en.wikipedia.org/wiki/Internationalized_domain_name https://en.wikipedia.org/wiki/Internationalized_domain_name Internationalized domain name ToASCII leaves ASCII labels unchanged. It fails if the label is unsuitable for the Domain Name System. For labels containing at least one non-ASCII character, ToASCII applies the Nameprep algorithm (https://en.wikipedia.org/wiki/Nameprep) This converts the label to lowercase and performs other normalization. ToASCII then translates the result to ASCII, using Punycode (https://en.wikipedia.org/wiki/Punycode) Finally, it prepends the four-character string "xn--". This four-character string is called the ASCII Compatible Encoding (ACE) prefix. It is used to distinguish labels encoded in Punycode from ordinary ASCII labels. The ToASCII algorithm can fail in several ways. For example, the final string could exceed the 63-character limit of a DNS label. A label for which ToASCII fails cannot be used in an internationalized domain name. The function ToUnicode reverses the action of ToASCII, stripping off the ACE prefix and applying the Punycode decode algorithm. It does not reverse the Nameprep processing, since that is merely a normalization and is by nature irreversible. Unlike ToASCII, ToUnicode always succeeds, because it simply returns the original string if decoding fails. In particular, this means that ToUnicode has no effect on a string that does not begin with the ACE prefix. https://en.wikipedia.org/wiki/Punycode Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München (German name for Munich) is encoded as Mnchen-3ya.
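The München example above, reproduced with Python's built-in codecs (note: the built-in idna codec implements IDNA 2003/Nameprep; the third-party idna package implements IDNA 2008).

print("München".encode("punycode"))          # b'Mnchen-3ya'  (raw Punycode, RFC 3492, no Nameprep)
print("münchen.de".encode("idna"))           # b'xn--mnchen-3ya.de'  (ToASCII: Nameprep + Punycode + ACE prefix)
print(b"xn--mnchen-3ya.de".decode("idna"))   # 'münchen.de'  (ToUnicode)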

All languages


https://www.omniglot.com/index.htm The online encyclopedia of writing systems & languages jlf: nothing about Unicode, but good for general knowledge.

Classical languages


https://docs.cltk.org/en/latest/ https://github.com/cltk/cltk The Classical Language Toolkit Python library The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for pre-modern languages. Pre-configured pipelines are available for 19 languages. Akkadian Arabic Aramaic Classical Chinese Coptic Gothic Greek Hindi Latin Middle High German English French Old Church Slavonic Old Norse Pali Panjabi Sanskrit (Some parts of the Sanskrit library are forked from the Indic NLP Library)

Arabic language


https://en.wikipedia.org/wiki/Arabic_script_in_Unicode Arabic script in Unicode

Indic languages


https://www.unicode.org/faq/indic.html Indic scripts in the narrow sense are the nine major Brahmi-derived scripts of India. In a wider sense, the term can cover all Brahmic scripts and Kharoshthi. What is ISCII? Indian Standard Code for Information Interchange (ISCII) is the character code for Indian scripts that originate from the Brahmi script. Keywords: nukta Vedic Sanskrit vowel signs (matras) vowel modifiers (candrabindu, anusvara) the consonant modifier (nukta) Tamil Bengali (Bangla) / Assamese Script Sindhi implosive consonants FAQ: How do I collate Indic language data? Collation order is not the same as code point order. A good treatment of some issues specific to collation in Indic languages can be found in the paper Issues in Indic Language Collation by Cathy Wissink (https://www.unicode.org/notes/tn1/) Collation in general must proceed at the level of language or language variant, not at the script or codepoint levels. See also UTS #10: Unicode Collation Algorithm. Some Indic-specific issues are also discussed in that report. This section illustrates that Unicode’s concepts like “extended grapheme cluster” are meant to provide some low-level, general segmentation, and are not going to be enough for ideal experience for end users. https://en.wikipedia.org/wiki/Devanagari#Conjunct_consonants https://en.wikipedia.org/wiki/Devanagari_conjuncts Conjunct consonants are a form of orthographic ligature characteristic of the Brahmic scripts. They are constructed of two or more consonant letters. Biconsonantal conjuncts are common, but longer conjuncts are increasingly constrained by the languages' phonologies and the actual number of conjuncts observed drops sharply. Ulrich Stiehl includes a five-letter Devanagari conjunct र्त्स्न्य (rtsny)[1] among the top 360 most frequent conjuncts found in Classical Sanskrit;[2] the complete list appears below. Conjuncts often span a syllable boundary, and many of the conjuncts below occur only in the middle of words, where the coda consonants of one syllable are conjoined with the onset consonants of the following syllable. [1] As in Sanskrit word कार्त्स्न्य (In Bengali Script কার্ৎস্ন্য), meaning "The Whole, Entirety" [2] Stiehl, Ulrich. "Devanagari-Schreibübungen" (PDF). www.sanskritweb.net. http://www.sanskritweb.net/deutsch/devanagari.pdf https://stackoverflow.com/questions/6805311/combining-devanagari-characters Combining Devanagari characters "बिक्रम मेरो नाम हो"~text~graphemes== a GraphemeSupplier 1 : T'बि' 2 : T'क्' <-- According to the comments, these 2 graphemes should be only one: क्र 3 : T'र' <-- even ICU doesn't support that... 
it's a tailored grapheme cluster 4 : T'म' 5 : T' ' 6 : T'मे' 7 : T'रो' 8 : T' ' 9 : T'ना' 10 : T'म' 11 : T' ' 12 : T'हो' "बिक्रम मेरो नाम हो"~text~characters== an Array (shape [18], 18 items) 1 : ( "ब" U+092C Lo 1 "DEVANAGARI LETTER BA" ) 2 : ( "ि" U+093F Mc 0 "DEVANAGARI VOWEL SIGN I" ) 3 : ( "क" U+0915 Lo 1 "DEVANAGARI LETTER KA" ) 4 : ( "्" U+094D Mn 0 "DEVANAGARI SIGN VIRAMA" ) <-- influence segmentation 5 : ( "र" U+0930 Lo 1 "DEVANAGARI LETTER RA" ) 6 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" ) 7 : ( " " U+0020 Zs 1 "SPACE", "SP" ) 8 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" ) 9 : ( "े" U+0947 Mn 0 "DEVANAGARI VOWEL SIGN E" ) 10 : ( "र" U+0930 Lo 1 "DEVANAGARI LETTER RA" ) 11 : ( "ो" U+094B Mc 0 "DEVANAGARI VOWEL SIGN O" ) 12 : ( " " U+0020 Zs 1 "SPACE", "SP" ) 13 : ( "न" U+0928 Lo 1 "DEVANAGARI LETTER NA" ) 14 : ( "ा" U+093E Mc 0 "DEVANAGARI VOWEL SIGN AA" ) 15 : ( "म" U+092E Lo 1 "DEVANAGARI LETTER MA" ) 16 : ( " " U+0020 Zs 1 "SPACE", "SP" ) 17 : ( "ह" U+0939 Lo 1 "DEVANAGARI LETTER HA" ) 18 : ( "ो" U+094B Mc 0 "DEVANAGARI VOWEL SIGN O" ) In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. virama = u'\N{DEVANAGARI SIGN VIRAMA}' cluster = u'' last = None for c in s: cat = unicodedata.category(c)[0] if cat == 'M' or cat == 'L' and last == virama: cluster += c else: if cluster: yield cluster cluster = c last = c if cluster: yield cluster --- Let's cover the grammar very quickly: The Devanagari Block. As a developer, there are two character classes you'll want to concern yourself with: Sign: This is a character that affects a previously-occurring character. Example, this character: ्. The light-colored circle indicates the location of the center of the character it is to be placed upon. Letter / Vowel / Other: This is a character that may be affected by signs. Example, this character: क. Combination result of ् and क: क्. But combinations can extend, so क् and षति will actually become क्षति (in this case, we right-rotate the first character by 90 degrees, modify some of the stylish elements, and attach it at the left side of the second character). https://news.ycombinator.com/item?id=20058454 If I type anything like किमपि (“kimapi”) and hit backspace, it turns into किमप (“kimapa”). That is, the following sequence of codepoints: ‎0915 DEVANAGARI LETTER KA ‎093F DEVANAGARI VOWEL SIGN I ‎092E DEVANAGARI LETTER MA ‎092A DEVANAGARI LETTER PA ‎093F DEVANAGARI VOWEL SIGN I made of three grapheme clusters (containing 2, 1, and 2 codepoints respectively), turns after a single backspace into the following sequence: ‎0915 DEVANAGARI LETTER KA ‎093F DEVANAGARI VOWEL SIGN I ‎092E DEVANAGARI LETTER MA ‎092A DEVANAGARI LETTER PA This is what I expect/find intuitive, too, as a user. Similarly अन्यच्च is made of 3 grapheme clusters but you hit backspace 7 times to delete it (though there I'd slightly have preferred अन्यच्च→अन्यच्→अन्य→अन्→अ instead of अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that's seen, but one can live with this). https://github.com/anoopkunchukuttan/indic_nlp_library The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. 
The library provides the following functionalities: Text Normalization Script Information Word Tokenization and Detokenization Sentence Splitting Word Segmentation Syllabification Script Conversion Romanization Indicization Transliteration Translation https://github.com/AI4Bharat/indicnlp_catalog The Indic NLP Catalog jlf: way beyond Unicode, tons of URLs... https://news.ycombinator.com/item?id=20056966 jlf: Devanagari seems to be an example where grapheme is not the right segmentation What does "index" mean? (Hindi) "इंडेक्स" का क्या अर्थ है? Including the quote marks, spaces, and question mark, that's 18 characters. As a native speaker, shouldn't they be considered 15 characters? क्स, क्या and र्थ each form individual conjunct consonants. Counting them as two would then beget the question as to why डे is not considered two characters too, seeing as it is formed by combining ड and ए, much like क्स is formed by combining क् and स. ... Devnagari allows simple characters to form compound characters. Regarding क्स and डे, the difference between them is that the former is a combination of two consonants (pronounced "ks") while the latter is formed by a consonant and a vowel ("de"). However, looking at the visual representation is wrong, since डा (consonant+vowel) would also look like two characters. https://slidetodoc.com/indic-text-segmentation-presented-by-swaran-lata-senior/ INDIC TEXT SEGMENTATION https://github.com/w3c/iip/issues/34 the final rendered state of the text is what influences the segmentation, rather than the sequence of code points used. https://docs.microsoft.com/en-us/typography/ https://docs.microsoft.com/en-us/typography/script-development/tamil Developing OpenType Fonts for Tamil Script The first step is to analyze the input text and break it into syllable clusters. Then apply font features, compute ligatures, and combine marks. https://docs.microsoft.com/en-us/typography/script-development/devanagari Developing OpenType Fonts for Devanagari Script https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/ Picking Apart the Crashing iOS String Posted by Manish Goregaokar on February 15, 2018 Indic scripts and consonant clusters jlf: he's a black belt! or is it his native tongue? https://stackoverflow.com/questions/75210512/how-to-split-devanagari-bi-tri-and-tetra-conjunct-consonants-as-a-whole-from-a-s How to split Devanagari bi-tri and tetra conjunct consonants as a whole from a string? "हिन्दी मुख्यमंत्री हिमंत" Current output: हि न् दी मु ख् य मं त् री हि मं त Desired output: हि न्दी मु ख्य मं त्री हि मं त https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf Proper Complex Script Support in Text Terminals page 8 Characters in one line will further be grouped into terminal clusters. A terminal cluster contains the characters that are combined together in the terminal environment. It is an instance of the tailored grapheme cluster defined in UAX #29. In Indic scripts, for example, syllables with virama conjoiners in the middle will be considered one single terminal cluster, while they are treated as multiple extended grapheme clusters in UAX #29. --- page 9 In some writing systems, the form of a character may depend on the characters that follow it. One example of this is Devanagari’s repha forms. This requires the establishment of a work zone that contains the most recent characters, and the property of the characters in the work zone is considered volatile and may change depending on the incoming text from the guest. 
When the terminal receives text, it will first append the text into the work zone and measure the entire work zone to process potential property changes. If the measurement result says that the text in the work zone could be broken into multiple clusters, then the work zone will be shrunk to only contain the last (maybe incomplete) cluster. The text before that will be committed, and its properties will no longer change. As a result, at any time the work zone will contain at most one cluster. When the cursor moves (via the terminal receiving a cursor move command or a newline), all the text in the work zone will be committed—even if it is incomplete—and the work zone will be cleared. https://slideplayer.com/slide/11341056/ INDIC TEXT SEGMENTATION todo: read https://news.ycombinator.com/item?id=9219162 I Can Text You A Pile of Poo, But I Can’t Write My Name March 17th, 2015 jlf: the article is about Bengali, but HN comments are also for other languages. todo: read https://www.unicode.org/L2/L2023/23140-graphemes-expectations.pdf Unicode 15.1: Unicode grapheme clusters tend to be closer to the larger user-perceived units. Hangul text is clearly segmented into syllable blocks. For Brahmic scripts, things are less clear. Grapheme clusters may contain several base-level units, but up to Unicode 15 always broke after virama characters. This broke not only within orthographic syllables, but for a number of scripts also within the encoding of conjunct forms that users perceive as base-level units, such as Khmer coengs (see subsection Subscript Consonant Signs of section 16.4 Khmer of the Unicode Standard). In Unicode 15.1, this is being corrected for six scripts, while leaving the others broken.

CJK


https://resources.oreilly.com/examples/9781565922242/blob/master/doc/cjk.inf Version 2.1 (July 12, 1996) Online Companion to "Understanding Japanese Information Processing" This online document provides information on CJK (that is, Chinese, Japanese, and Korean) character set standards and encoding systems. --- jlf: 1996... but maybe some things to learn. https://en.wikipedia.org/wiki/Cangjie_input_method Cangjie input method jlf: nothing about Unicode... but maybe some things to learn.

Korean


22/05/2021 http://gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html The Korean Writing System

Japanese


https://heistak.github.io/your-code-displays-japanese-wrong/ https://news.ycombinator.com/item?id=29022906 https://www.johndcook.com/blog/2022/09/25/katakana-hiragana-unicode/ https://news.ycombinator.com/item?id=32987710

Polish


https://www.twardoch.com/download/polishhowto/index.html Polish diacritics how to?

IME - Input Method Editor


https://hsivonen.fi/ime/ An IME is a piece of software that transforms user-generated input events (mostly keyboard events, but some IMEs allow some auxiliary pointing device interaction) into text in a manner more complex than a mere keyboard layout. Basically, if the relationship between the keys that a user presses on a hardware keyboard and the text that ends up in an application's text buffer is more complex than when writing French, an IME is in use.

Text editing


https://lord.io/text-editing-hates-you-too/ TEXT EDITING HATES YOU TOO

Text rendering, Text shaping library


https://faultlore.com/blah/text-hates-you/ Text Rendering Hates You Aria Beingessner September 28th, 2019 jlf: general knowledge; todo: read https://harfbuzz.github.io/ https://github.com/harfbuzz/harfbuzz jlf: referenced by ICU Users of ICU Layout are strongly encouraged to consider the HarfBuzz project as a replacement for the ICU Layout Engine. Alternatives: Uniscribe if you are writing Windows software, CoreText on macOS.

String Matching


https://www.w3.org/TR/charmod-norm/ String matching Case folding is the process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching. This is distinct from case mapping, which is primarily meant for display purposes. As with the default case mappings, Unicode defines default case fold mappings ("case folding") for each Unicode code point.
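The lower-vs-casefold distinction above, in Python terms: str.lower() is a case mapping (for display), str.casefold() is the case folding meant for matching.

print("Straße".lower())                              # 'straße'   : case mapping, for display
print("Straße".casefold())                           # 'strasse'  : case folding, for matching
print("Straße".casefold() == "STRASSE".casefold())   # True
print("µ".casefold() == "μ".casefold())              # True: MICRO SIGN folds to GREEK SMALL LETTER MU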

Fuzzy String Matching


29/05/2021 https://github.com/logannc/fuzzywuzzy-rs Rust port of the Python fuzzywuzzy https://github.com/seatgeek/fuzzywuzzy --> moved to https://github.com/seatgeek/thefuzz

Levenshtein distance and string similarity


https://github.com/ztane/python-Levenshtein/ The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
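For reference, the textbook dynamic-programming formulation that the library above optimizes (a minimal, unoptimized Python sketch). Note that it counts code points, so normalizing both strings first (e.g. to NFC) avoids spurious differences between precomposed and decomposed input.

def levenshtein(a: str, b: str) -> int:
    # O(len(a) * len(b)) edit distance over code points
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3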

String comparison


31/05/2021 https://stackoverflow.com/questions/49662585/how-do-i-compare-a-unicode-string-that-has-different-bytes-but-the-same-value A pair NFC considers different but a user might consider the same is 'µ' (MICRO SIGN) and 'μ' (GREEK SMALL LETTER MU). NFKC will collapse these two. https://www.unicode.org/reports/tr10/ Unicode® Technical Standard #10 UNICODE COLLATION ALGORITHM Collation is the general term for the process and function of determining the sorting order of strings of characters. Collation varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation can also be customized according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on. https://en.wikipedia.org/wiki/Unicode_equivalence Short definition of NFD, NFC, NFKD, NFKC In this article, a short paragraph which confirms that it's important to keep the original string unchanged! Errors due to normalization differences When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Samba file- and printer-sharing software. Samba did not recognise the altered filenames as equivalent to the original, leading to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible. http://sourceforge.net/p/netatalk/bugs/348/ #348 volcharset:UTF8 doesn't work from Mac https://www.unicode.org/faq/normalization.html More detailed description of normalization PHP http://php.net/manual/en/collator.compare.php Collator::compare -- collator_compare — Compare two Unicode strings Object oriented style public int Collator::compare ( string $str1 , string $str2 ) Procedural style int collator_compare ( Collator $coll , string $str1 , string $str2 ) http://php.net/manual/en/class.collator.php Provides string comparison capability with support for appropriate locale-sensitive sort orderings. Swift https://developer.apple.com/library/prerelease/watchos/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode scalars behind the scenes. .characters.count for character in dogString.characters for codeUnit in dogString.utf8 for codeUnit in dogString.utf16 for scalar in dogString.unicodeScalars Nothing about ordered comparison in the Swift doc? http://oleb.net/blog/2014/07/swift-strings/ Ordering strings with the < and > operators uses the default Unicode collation algorithm. In the example below, "é" is smaller than i because the collation algorithm specifies that characters with combining marks follow right after their base character. "résumé" < "risotto" // -> true The String type does not (yet?) come with a method to specify the language to use for collation. 
You should continue to use -[NSString compare:options:range:locale:] or -[NSString localizedCompare:] if you need to sort strings that are shown to the user. In this example, specifying a locale that uses the German phonebook collation yields a different result than the default string ordering: let muffe = "Muffe" let müller = "Müller" muffe < müller // -> true // Comparison using an US English locale yields the same result let muffeRange = muffe.startIndex..<muffe.endIndex let en_US = NSLocale(localeIdentifier: "en_US") muffe.compare(müller, options: nil, range: muffeRange, locale: en_US) // -> .OrderedAscending // Germany phonebook ordering treats "ü" as "ue". // Thus, "Müller" < "Muffe" let de_DE_phonebook = NSLocale(localeIdentifier: "de_DE@collation=phonebook") muffe.compare(müller, options: nil, range: muffeRange, locale: de_DE_phonebook) // -> .OrderedDescending Java https://jcdav.is/2016/09/01/How-the-JVM-compares-your-strings/ How the JVM compares your strings using the craziest x86 instruction you've never heard of. --- A comment about this article: PCMPxSTRx is no longer faster than equivalent "simple" vector instruction sequences for straightforward comparisons (this had already been the case for a few years when that article was written, which is curious). It can be used productively (with some care) for some other operations like substring matching, but that's not as much of a heavy-hitter. There's a bunch of string stuff that will benefit from general vectorization, and which is absolutely on our roadmap to tackle, but using the PCMPxSTRx instructions specifically isn't a source of wins on the most important operations C# https://docs.microsoft.com/en-us/dotnet/standard/base-types/comparing https://docs.microsoft.com/en-us/dotnet/core/extensions/performing-culture-insensitive-string-comparisons
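The µ/μ pair and canonical equivalence from the start of this section, checked with Python's standard unicodedata module.

import unicodedata

micro, mu = "\u00B5", "\u03BC"                        # MICRO SIGN, GREEK SMALL LETTER MU
print(unicodedata.normalize("NFC", micro) == mu)      # False: not canonically equivalent
print(unicodedata.normalize("NFKC", micro) == mu)     # True: compatibility-equivalent

print("e\u0301" == "\u00E9")                               # False: different code point sequences
print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9") # True: canonically equivalent after composition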

JSON


https://www.reddit.com/r/programming/comments/q5vmxc/parsing_json_is_a_minefield_2018/ https://seriot.ch/projects/parsing_json.html Parsing JSON is a Minefield Search for "unicode" 30/05/2021 https://datatracker.ietf.org/doc/html/rfc8259 The JavaScript Object Notation (JSON) Data Interchange Format See this section about strings and encoding: https://datatracker.ietf.org/doc/html/rfc8259#section-7
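A small Python illustration of RFC 8259 section 7: non-ASCII characters may be emitted raw or escaped, and characters outside the BMP are escaped as UTF-16 surrogate pairs.

import json

print(json.dumps("é 😀"))                      # "\u00e9 \ud83d\ude00"  (escaped; U+1F600 becomes a surrogate pair)
print(json.dumps("é 😀", ensure_ascii=False))  # "é 😀"  (raw UTF-8 output)
print(json.loads('"\\ud83d\\ude00"'))          # 😀  (the surrogate pair decodes back to one code point)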

TOML serialization format


https://github.com/toml-lang/toml Tom's Obvious, Minimal Language TOML is a nice serialization format for human-maintained data structures. It’s line-delimited and—of course!—allows comments, and any Unicode code point can be expressed in simple hexadecimal. TOML is fairly new, and its specification is still in flux (TOML 1.0.0, released in January 2021, has since stabilized the specification).
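The hexadecimal escape mentioned above, parsed with tomllib (in the Python standard library since 3.11).

import tomllib

doc = tomllib.loads(r'city = "M\u00fcnchen"')   # basic strings accept \uXXXX / \UXXXXXXXX escapes
print(doc["city"])                              # München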

CBOR Concise Binary Representation


https://cbor.io/ RFC 8949 Concise Binary Object Representation CBOR improves upon JSON’s efficiency and also allows for storage of binary strings. Whereas JSON encoders must stringify numbers and escape all strings, CBOR stores numbers “literally” and prefixes strings with their length, which obviates the need to escape those strings. https://www.rfc-editor.org/rfc/rfc8949.html RFC 8949 Concise Binary Object Representation (CBOR) In contrast to formats such as JSON, the Unicode characters in this type are never escaped. Thus, a newline character (U+000A) is always represented in a string as the byte 0x0a, and never as the bytes 0x5c6e (the characters "\" and "n") nor as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and "a").
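A minimal by-hand illustration of the point above (RFC 8949 major type 3): the string's UTF-8 bytes are stored as-is behind a length prefix, nothing is escaped; short strings (length <= 23) encode the length directly in the initial byte.

s = "a\nb"                                         # the newline stays a literal 0x0A byte
payload = s.encode("utf-8")
assert len(payload) <= 23
encoded = bytes([0x60 | len(payload)]) + payload   # 0x60 = major type 3 (text string)
print(encoded)                                     # b'\x63a\nb'
# longer strings carry an extra length field (initial byte 0x78, 0x79, ...), the payload is still raw UTF-8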

Binary encoding in Unicode


10/07/2021 https://qntm.org/unicodings Efficiently encoding binary data in Unicode in UTF-8, use Base64 or Base85 in UTF-16, use Base32768 in UTF-32, use Base65536 https://qntm.org/safe What makes a Unicode code point safe? https://github.com/qntm/safe-code-point Ascertains whether a Unicode code point is 'safe' for the purposes of encoding binary data https://github.com/qntm/base2048 Binary encoding optimised for Twitter Originally, Twitter allowed Tweets to be at most 140 characters. On 26 September 2017, Twitter allowed 280 characters. Maximum Tweet length is indeed 280 Unicode code points. Twitter divides Unicode into 4,352 "light" code points (U+0000 to U+10FF inclusive) and 1,109,760 "heavy" code points (U+1100 to U+10FFFF inclusive). Base2048 solely uses light characters, which means a new "long" Tweet can contain at most 280 characters of Base2048. Base2048 is an 11-bit encoding, so those 280 characters encode 3080 bits i.e. 385 octets of data, significantly better than Base65536. https://github.com/qntm/base65536 Unicode's answer to Base64 Base2048 renders Base65536 obsolete for its original intended purpose of sending binary data through Twitter. However, Base65536 remains the state of the art for sending binary data through text-based systems which naively count Unicode code points, particularly those using the fixed-width UTF-32 encoding.

Invalid format


22/07/2021 https://stackoverflow.com/questions/52131881/does-the-winapi-ever-validate-utf-16 Does the WinApi ever validate UTF-16? Windows wide characters are arbitrary 16-bit numbers (formerly called "UCS-2", before the Unicode Standard Consortium purged that notation). So you cannot assume that it will be a valid UTF-16 sequence. (MultiByteToWideChar is a notable exception that does return only UTF-16) 28/07/2021 https://invisible-island.net/xterm/bad-utf8/ Unicode replacement character in the Linux console. This test text examines, how UTF-8 decoders handle various types of corrupted or otherwise interesting UTF-8 sequences. jlf : difficult to understand what is the conclusion... What I notice in this review is : Unicode 10.0.0's chapter 3 (June 2017): each of the ill-formed code units is separately replaced by U+FFFD. That recommendation first appeared in Unicode 6's chapter 3 on conformance (February 2011). However the comments about “best practice” were removed in Unicode 11.0.0 (June 2018). The W3C WHATWG page entitled Encoding Standard started in January 2013. The constraints in the utf-8 decoder above match “Best Practices for Using U+FFFD” from the Unicode standard. No other behavior is permitted per the Encoding Standard (other algorithms that achieve the same result are obviously fine, even encouraged). Although Unicode withdrew the recommendation more than two years ago, to date (August 2020) that is not yet corrected in the WHATWG page. 30/07/2021 https://hsivonen.fi/broken-utf-8/ --- The Unicode Technical Committee retracted the change in its meeting on August 3 2017, so the concern expressed below is now moot. --- Not all byte sequences are valid UTF-8. When decoding potentially invalid UTF-8 input into a valid Unicode representation, something has to be done about invalid input. The naïve answer is to ignore invalid input until finding valid input again (i.e. finding the next byte that has a lead-byte value), but this is dangerous and should never be done. The danger is that silently dropping bogus bytes might make a string that didn’t look dangerous with the bogus bytes present become valid active content. Most simply, <scr�ipt> (� standing in for a bogus byte) could become <script> if the error is ignored. So it’s non-controversial that every sequence of bogus bytes should result in at least one REPLACEMENT CHARACTER and that the next lead-valued byte is the first byte that’s no longer part of the invalid sequence. But how many REPLACEMENT CHARACTERs should be generated for a sequence of multiple bogus bytes? jlf: the answer is not clear to me... https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt UTF-8 decoder capability and stress test
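The <scr�ipt> example above, in Python: ignoring bogus bytes is the dangerous option, replacing them with U+FFFD is the safe one.

data = b"<scr\x80ipt>"                           # 0x80 is a lone continuation byte, invalid UTF-8
print(data.decode("utf-8", errors="replace"))    # <scr�ipt>   (U+FFFD REPLACEMENT CHARACTER)
print(data.decode("utf-8", errors="ignore"))     # <script>    (silently becomes active content)
# errors="strict" (the default) raises UnicodeDecodeError instead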

Mojibake


https://github.com/LuminosoInsight/python-ftfy ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else 03/07/2021 Notebook in python-ftfy: Services such as Slack and Discord don't use Unicode for their emoji. They use ASCII strings like :green-heart: and turn them into images. These won't help you test anything. I recommend getting emoji for your test cases by copy-pasting them from emojipedia.org. https://emojipedia.org/ https://en.wikipedia.org/wiki/Mojibake
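A classic mojibake round-trip in plain Python (UTF-8 bytes mis-decoded as Latin-1, then repaired); ftfy's fix_text() automates detecting and undoing this kind of mix-up.

good = "café"
mangled = good.encode("utf-8").decode("latin-1")     # 'cafÃ©' : UTF-8 bytes read as Latin-1
repaired = mangled.encode("latin-1").decode("utf-8")
assert repaired == good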

Filenames


https://opensource.apple.com/source/subversion/subversion-52/subversion/notes/unicode-composition-for-filenames.auto.html 2 problems follow: 1) We can't generally depend on the OS to give us back the exact filename we gave it 2) The same filename may be encoded in different codepoints https://linux.die.net/man/1/convmv convmv - converts filenames from one encoding to another https://news.ycombinator.com/item?id=33986655 jlf: discussion about text vs byte for filenames https://news.ycombinator.com/item?id=33991506 Python already has the "surrogateescape" error handler [0] that performs something similar to what you described: undecodable bytes are translated into unpaired U+DC80 to U+DCFF surrogates. Of course, this isn't standardized in any way, but I've found it useful myself for smuggling raw pathnames through Java. [0] https://peps.python.org/pep-0383/ https://news.ycombinator.com/item?id=33988943 I’m a little confused, how can a file name be non-decodable? A file with that name exists, so someone somewhere knows how to decode it. Why wouldn’t Python just always use the same encoding as the OS it’s running on? Is this some locale-related thing? --- > A file with that name exists, so someone somewhere knows how to decode it. No. A unix filename is just a bunch of bytes (two of them being off-limits). There is no requirement that it be in any encoding. You can always use a fallback encoding (an iso-8859) to get something out of the garbage, but it's just that, garbage. Windows has a similar issue, NTFS paths are sequences of UCS2 code units, but there's no guarantee that they form any sort of valid UTF-16 string, you can find random lone surrogates for instance. And I'm sure network filesystems have invented their own even worse issues, because being awful is what they do. > Why wouldn’t Python just always use the same encoding as the OS it’s running on? 1. because OS don't really have encodings, Python has a function to try and retrieve FS encoding[0] but per the above there's no requirement that it is correct for any file, let alone the one you actually want to open (hell technically speaking it's not even a property of the FS) 2. because OS lie and user configurations are garbage, you can't even trust the user's locale to be configured properly for reading files (an other mistake Python 3 made, incidentally) 3. because the user may not even have created the file, it might come from a broken archive, or some random download from someone having fun with filenames, or from fetching crap from an FTP or network share There are a few FS / FS configurations which are reliable, in that case they either error or pre-mangle the files on intake. IIRC ZFS can be configured to only accept valid UTF-8 filenames, HFS(+) requires valid unicode (stored as UTF-16) and APFS does as well (stored as UTF-8). [0] https://docs.python.org/3/library/sys.html#sys.getfilesystem... https://news.ycombinator.com/item?id=33986421 Stefan Karpinski: On UNIX, paths are UTF-8 by convention, but not forced to be valid. Treating paths as UTF-8 works very well as long as you hadn't also make the mistake of requiring your UTF-8 strings to be valid (which Python did, unfortunately). On Windows, paths are UTF-16 by convention, but also not forced to be valid. However, invalid UTF-16 can be faithfully converted to WTF-8 and converted back losslessly, so you can translate Windows paths to WTF-16 and everything Just Works™ [1]. 
There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes. [1] Ok, here's why the WTF-8 thing works so well. If we write WTF-16 for potentially invalid UTF-16 (just arbitrary sequences of 16-bit code units), then the mapping between WTF-16 and WTF-8 space is a bijection because it's losslessly round-trippable. But more importantly, this WTF-8/16 bijection is also a homomorphism with respect to pretty much any string operation you can think of. For example `utf16_concat(a, b) == utf8_concat(wtf8(a), wtf8(b))` for arbitrary UTF-16 strings a and b. Similar identities hold for other string operations like searching for substrings or splitting on specific strings. --- > There aren't any operating systems I'm aware of where paths are actually Shift-JIS by convention, so that seems like a non-issue. Using "UTF-8 by convention" strings works on all modern OSes. Nonsense. Unix paths use the system locale by convention, and it's entirely normal for that to be Shift-JIS. https://news.ycombinator.com/item?id=33985510 Stefan Karpinski: Absolutely right. Deprecating direct string indexing would have been the right move. Require writing `str.chars()` to get something that lets you slice by Unicode characters (i.e. code points); provide `str.graphemes()` and `str.grapheme_clusters()` to get something that lets you slice by graphemes and grapheme clusters, respectively. Cache an index structure that lets you do that kind of indexing efficiently once you've asked for it the first time. Provide an API to clear the caches. Not allowing strings to represent invalid Unicode is also a huge mistake (and essentially forced by the representation strategy that they adopted). It forces any programmer who wants to robustly handle potentially invalid string data to use byte vectors instead. Which is exactly what they did with OS paths, but that's far from the only place you can get invalid strings. You can get invalid strings almost anywhere! Worse, since it's incredibly inconvenient to work with byte vectors when you want to do stringlike stuff, no one does it unless forced to, so this design choice effectively guarantees that all Python code that works with strings will blow up if it encounters anything invalid—which is a very common occurrence. If only there was a type that behaves like a string and supports all the handy string operations but which handles invalid data gracefully. Then you could write robust string code conveniently. But at that point, you should just make that the standard string type! This isn't hypothetical, it's exactly how Burnt Sushi's bstr type [1] works in Rust and how the standard String type works in Julia. [1] https://github.com/BurntSushi/bstr --- Jasper_ It's worth noting that Python str's are sequences of code points, not scalar values. This was a truly horrendous mistake made mostly out of ignorance, but now they rely upon it in surrogateescape to hide "invalid" data, so... I have ranted for long hours go friends about the insanity of Python 3's text model before. It's mostly the blind leading the blind. --- Animats: Unicode string indexing should have been made lazy, rather than deprecated. Random access to strings is rare. Mostly, operations are moving forward linearly or using saved positions. So, only build the index for random access if needed. 
Optimize "advance one glyph" and "back up one glyph" expressed as indexing, and you'll get most of the frequently used cases. Have the "index" functions that return a string index return an opaque type that's a byte index. Attempting to convert that to an integer forces creation of the string index. This preserves the user visible semantics but keeps performance. PyPy does something like this.

WTF8


https://news.ycombinator.com/item?id=9611710 The WTF-8 encoding (simonsapin.github.io) https://news.ycombinator.com/item?id=9613971 https://simonsapin.github.io/wtf-8/#acknowledgments Thanks to Coralie Mercier for coining the name WTF-8. --- The name is unserious but the project is very serious, its writer has responded to a few comments and linked to a presentation of his on the subject[0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates: while UTF8 is the modern encoding you have to interact with legacy systems, for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed) but a number of other legacy systems used UCS2 and added visible surrogates (rather than proper UTF-16) afterwards. Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates. Having to interact with those systems from a UTF8-encoded world is an issue because they don't guarantee well-formed UTF-16, they might contain unpaired surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32 (neither allows unpaired surrogates, for obvious reasons). WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems. WTF8 exists solely as an internal encoding (in-memory representation), but it's very useful there. [0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf https://twitter.com/koalie/status/506821684687413248 Coralie Mercier @koalie I have a hunch we use "wtf-8" encoding. Appreciate the irony of: " the future of publishing at W3C" 16/07/2021 Windows allows unpaired surrogates in filenames https://github.com/golang/go/issues/32334 syscall: Windows filenames with unpaired surrogates are not handled correctly #32334 https://github.com/rust-lang/rust/issues/12056 path: Windows paths may contain non-utf8-representable sequences #12056 I don't know the precise details, but there exist portions of Windows in which paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue but at some point someone (and I wish I could remember who) showed me some output that showed that they were actually getting a UCS2 path from some Windows call and Path was unable to parse it. --- JLF: this is the birth of WTF-8 in 2014. The result is: https://simonsapin.github.io/wtf-8
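What WTF-8 permits, demonstrated with Python's surrogatepass error handler: a lone surrogate gets the 3-byte encoding that well-formed UTF-8 forbids, while a proper surrogate pair must be encoded as the single supplementary code point.

lone = "\ud83d"                                   # unpaired high surrogate (legal in WTF-16 data)
# lone.encode("utf-8")                            # would raise UnicodeEncodeError
print(lone.encode("utf-8", "surrogatepass"))      # b'\xed\xa0\xbd' : the generalized (WTF-8-style) encoding
print("\U0001F4A9".encode("utf-8"))               # b'\xf0\x9f\x92\xa9' : paired surrogates become one 4-byte code point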

Codepoint/grapheme indexation


https://nullprogram.com/blog/2019/05/29/ ObjectIcon http://objecticon.sourceforge.net/Unicode.html ucs (standing for Unicode character string) is a new builtin type, whose behaviour closely mirrors that of the conventional Icon string. It operates by providing a wrapper around a conventional Icon string, which must be in utf-8 format. This has several advantages, and only one serious disadvantage, namely that a utf-8 string is not randomly accessible, in the sense that one cannot say where the representation for unicode character i begins. To alleviate this disadvantage, the ucs type maintains an index of offsets into the utf-8 string to make random access faster. The size of the index is only a few percent of the total allocation for the ucs object. jlf: I made a code review, but could not understand how they do that :-( Not clear whether it's codepoint indexation or grapheme indexation. https://lwn.net/Articles/864994/ jlf: discussion about Raku NFG and its technical limitations. It's also the traditional discussion about "why do you need a direct access to the graphemes".
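A minimal Python sketch (not ObjectIcon's actual code) of the technique described above: record the byte offset of every k-th code point, so random access into UTF-8 only has to scan forward from the nearest checkpoint, and the index costs a few percent of the string.

class IndexedUTF8:
    def __init__(self, data: bytes, step: int = 64):
        self.data, self.step = data, step
        self.offsets = []                        # byte offset of code points 0, step, 2*step, ...
        n = 0
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:                 # lead byte => start of a code point
                if n % step == 0:
                    self.offsets.append(i)
                n += 1
        self.length = n

    def codepoint(self, index: int) -> str:
        i = self.offsets[index // self.step]     # jump to the checkpoint...
        for _ in range(index % self.step):       # ...then scan forward code point by code point
            i += 1
            while self.data[i] & 0xC0 == 0x80:
                i += 1
        j = i + 1
        while j < len(self.data) and self.data[j] & 0xC0 == 0x80:
            j += 1
        return self.data[i:j].decode("utf-8")

s = "aé漢😀" * 100
assert IndexedUTF8(s.encode("utf-8")).codepoint(5) == s[5]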

Rope


See also ZenoString (from Alan Kay - Saxonica) https://github.com/josephg/librope Little C library for heavyweight utf-8 strings (rope). https://news.ycombinator.com/item?id=8065608 Discussion about ropes, ideal of strings... https://github.com/xi-editor/xi-editor/blob/e8065a3993b80af0aadbca0e50602125d60e4e38/doc/rope_science/rope_science_03.md https://news.ycombinator.com/item?id=34948308 Several references to older papers https://news.ycombinator.com/item?id=37820532 Text showdown: Gap Buffers vs. Ropes https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation Text Buffer Reimplementation https://en.wikipedia.org/wiki/Piece_table In computing, a piece table is a data structure typically used to represent a text document while it is edited in a text editor.

Encoding title


https://www.iana.org/assignments/character-sets/character-sets.xhtml Character Sets (IANA Character Sets registry) These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation. These names are expressed in ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The character set most commonly use in the Internet and used especially in protocol standards is US-ASCII, this is strongly encouraged. The use of the name US-ASCII is also encouraged. --- jlf: see encoding.spec.whatwg.org elsewhere in this document. They say: "User agents have also significantly deviated from the labels listed in the IANA Character Sets registry. To stop spreading legacy encodings further, this specification is exhaustive about the aforementioned details and therefore has no need for the registry." https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape OCTOBER 12, 2022 JeanHeyd Meneide Project Editor for ISO/IEC JTC1 SC22 WG14 - Programming Languages, C. The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust) --- jlf: Is he criticizing the work of Zach Laine? ( https://github.com/tzlaine/text ) "someone was doing something wrong on the internet and I couldn’t let that pass:" Same person: https://github.com/ThePhD https://github.com/soasis Any Encoding, Ever - ztd.text and Unicode for C++ - JUNE 30, 2021 : https://thephd.dev/any-encoding-ever-ztd-text-unicode-cpp Starting a Basis - Shepherd's Oasis and Text - MAY 01, 2020: https://thephd.dev/basis-shepherds-oasis-text-encoding https://ztdtext.readthedocs.io/en/latest/index.html ztd.text The premiere library for handling text in different encoding forms and reducing transcoding bugs in your C++ software. List of encodings: https://ztdtext.readthedocs.io/en/latest/encodings.html List of Unicode encodings: https://ztdtext.readthedocs.io/en/latest/known%20unicode%20encodings.html Design Goals and Philosophy: https://ztdtext.readthedocs.io/en/latest/design.html --- jlf: don't know what to think about that... related to https://github.com/soasis https://github.com/soasis/text JeanHeyd Meneide This repository is an implementation of an up and coming proposal percolating through SG16, P1629 - Standard Text Encoding ( https://thephd.dev/_vendor/future_cxx/papers/d1629.html ) --- https://github.com/soasis Shepherd's Oasis Software Services and Consulting. https://encoding.spec.whatwg.org/ Encoding The Encoding Standard defines encodings and their JavaScript API. --- The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels. <table> --- Most legacy encodings make use of an index. An index is an ordered list of entries, each entry consisting of a pointer and a corresponding code point. Within an index pointers are unique and code points can be duplicated. Note: An efficient implementation likely has two indexes per encoding. One optimized for its decoder and one for its encoder. https://www.git-tower.com/help/guides/faq-and-tips/faq/encoding/windows Character encoding for commit messages --- When Git creates and stores a commit, the commit message entered by the user is stored as binary data and there is no conversion between encodings. The encoding of your commit message is determined by the client you are using to compose the commit message. Git stores the name of the commit encoding if the config key "i18n.commitEncoding" is set (and if it's not the default value "utf-8"). 
If you commit changes from the command line, this value must match the encoding set in your shell environment. Otherwise, a wrong encoding is stored with the commit and can result in garbled output when viewing the commit history. If you view the commit log on the command line, the config value "i18n.logOutputEncoding" (which defaults to "i18n.commitEncoding") needs to match your shell encoding as well. The command converts messages from the commit encoding to the output encoding. If your shell encoding does not match the output encoding, you will again receive garbled output! https://www.git-scm.com/docs/gitattributes/2.18.0#_working_tree_encoding gitattributes - Defining attributes per path working-tree-encoding Git recognizes files encoded in ASCII or one of its supersets (e.g. UTF-8, ISO-8859-1, …​) as text files. Files encoded in certain other encodings (e.g. UTF-16) are interpreted as binary and consequently built-in Git text processing tools (e.g. git diff) as well as most Git web front ends do not visualize the contents of these files by default. In these cases you can tell Git the encoding of a file in the working directory with the working-tree-encoding attribute. If a file with this attribute is added to Git, then Git reencodes the content from the specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded content in its internal data structure (called "the index"). On checkout the content is reencoded back to the specified encoding. --- jlf: there is a number of pitfalls, read the article. https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text How to determine the encoding of text jlf: for Python, not reviewed, may bring interesting infos. https://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file How can I detect the encoding/codepage of a text file? jlf: for C#, not reviewed, may bring interesting infos. https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1 What is the difference between UTF-8 and ISO-8859-1? jlf: the interesting part are the comments about ISO-8859-1. --- ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. --- cp1252 is a superset of the ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol € and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up. --- jlf: so the previous comment says that ISO-8859-1 is not defined in the 0x80-0x9F range... IS IT or IS IT NOT??? --- One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead. For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085, ``), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …). The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way. https://www.mobilefish.com/tutorials/character_encoding/character_encoding_quickguide_iso8859_1.html jlf: not sure this page is a good reference. The fact they wrote "Unicode, a 16-bit character set." 
brings a doubt about the rest of their page... I reference it for their definition of ISO-8859-1. --- HTML and HTTP protocols make frequent reference to ISO Latin-1 and the character code ISO-8859-1. The HTTP specification mandates the use of the code ISO-8859-1 as the default character code that is passed over the network. ISO-8859-1 explicitly does not define displayable characters for positions 0-31 and 127-159, and the HTML standard does not allow those to be used for displayable characters. The only characters in this range that are used are 9, 10 and 13, which are tab, newline and carriage return respectively. Note: ISO-8859-1 is also known as Latin-1. --- jlf: so they say - 00..1F is not defined except 09, 0A, 0D (so they are different from https://en.wikipedia.org/wiki/ISO/IEC_8859-1) where all 00..1F is undefined. - 7F..9F is not defined Confirmed by their text file: https://www.mobilefish.com/download/character_set/iso8859_1.txt
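The 0x80-0x9F ambiguity discussed above, made concrete in Python: the latin-1 codec maps those bytes to C1 controls, while windows-1252 (cp1252) maps most of them to printable characters, which is why the WHATWG Encoding spec treats the iso-8859-1 label as windows-1252.

for b in (b"\x80", b"\x85", b"\x93"):
    print(b.hex(),
          repr(b.decode("latin-1")),      # C1 control characters (U+0080, U+0085, U+0093)
          repr(b.decode("cp1252")))       # '€', '…', '“'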

ICU title


https://icu.unicode.org https://unicode-org.github.io/icu/ ICU documentation https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ Entry point of API Reference https://icu-project.org/docs/ ICU Documents and Papers jlf: old? https://unicode-org.atlassian.net/jira/software/c/projects/ICU/issues/?filter=allissues ICU tickets https://github.com/microsoft/icu jlf: fork by Microsoft http://stackoverflow.com/questions/8253033/what-open-source-c-or-c-libraries-can-convert-arbitrary-utf-32-to-nfc What open source C or C++ libraries can convert arbitrary UTF-32 to NFC? std::string normalize(const std::string &unnormalized_utf8) { // FIXME: until ICU supports doing normalization over a UText // interface directly on our UTF-8, we'll use the insanely less // efficient approach of converting to UTF-16, normalizing, and // converting back to UTF-8. // Convert to UTF-16 string auto unnormalized_utf16 = icu::UnicodeString::fromUTF8(unnormalized_utf8); // Get a pointer to the global NFC normalizer UErrorCode icu_error = U_ZERO_ERROR; const auto *normalizer = icu::Normalizer2::getInstance(nullptr, "nfc", UNORM2_COMPOSE, icu_error); assert(U_SUCCESS(icu_error)); // Normalize our string icu::UnicodeString normalized_utf16; normalizer->normalize(unnormalized_utf16, normalized_utf16, icu_error); assert(U_SUCCESS(icu_error)); // Convert back to UTF-8 std::string normalized_utf8; normalized_utf16.toUTF8String(normalized_utf8); return normalized_utf8; } https://begriffs.com/posts/2019-05-23-unicode-icu.html Unicode programming, with examples https://en.wikipedia.org/wiki/Trie Tries are a form of string-indexed look-up data structure, which is used to store a dictionary list of words that can be searched on in a manner that allows for efficient generation of completion lists. Tries can be efficacious on string-searching algorithms such as predictive text, approximate string matching, and spell checking in comparison to a binary search trees. A trie can be seen as a tree-shaped deterministic finite automaton. https://icu.unicode.org/design/struct/utrie ICU Code Point Tries We use a form of "trie" adapted to single code points. The bits in the code point integer are divided into two or more parts. The first part is used as an array offset, the value there is used as a start offset into another array. The next code point bit field is used as an additional offset into that array, to fetch another value. The final part yields the data for the code point. Non-final arrays are called index arrays or tables. --- For a general-purpose structure, we want to be able to be able to store a unique value for every character. This determines the number of bits needed in the last index table. With 136,690 characters assigned in Unicode 10, we need at least 18 bits. We allocate data values in blocks aligned at multiples of 4, and we use 16-bit index words shifted left by 2 bits. This leads to a small loss in how densely the data table can be used, and how well it can be compacted, but not nearly as much as if we were using 32-bit index words. https://icu.unicode.org/design/struct/tries/bytestrie It maps from arbitrary byte sequences to 32-bit integers. (Small non-negative integers are stored more efficiently. Negative integers are the least efficient.) The BytesTrie and UCharsTrie structures are nearly the same, except that the UCharsTrie uses fewer, larger units. 
https://icu.unicode.org/design/struct/tries/ucharstrie Same design as a BytesTrie, but mapping any UnicodeString (any sequence of 16-bit units) to 32-bit integer values. https://icu.unicode.org/charts/charset https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt ICU alias table jlf: the ultimate reference? --- # Here is the file format using BNF-like syntax: # # converterTable ::= tags { converterLine* } # converterLine ::= converterName [ tags ] { taggedAlias* }'\n' # taggedAlias ::= alias [ tags ] # tags ::= '{' { tag+ } '}' # tag ::= standard['*'] # converterName ::= [0-9a-zA-Z:_'-']+ # alias ::= converterName --- standard # The * after the standard tag denotes that the previous alias is the # preferred (default) charset name for that standard. There can only # be one of these default charset names per converter. --- Affinity tags If an alias is given to more than one converter, it is considered to be an ambiguous alias, and the affinity list will choose the converter to use when a standard isn't specified with the alias. The general ordering is from specific and frequently used to more general or rarely used at the bottom. { UTR22 # Name format specified by https://www.unicode.org/reports/tr22/ IBM # The IBM CCSID number is specified by ibm-* WINDOWS # The Microsoft code page identifier number is specified by windows-*. The rest are recognized IE names. JAVA # Source: Sun JDK. Alias name case is ignored, but dashes are not ignored. IANA # Source: http://www.iana.org/assignments/character-sets MIME # Source: http://www.iana.org/assignments/character-sets } https://github.com/unicode-org/icu/tree/main/icu4c/source/data/mappings Encodings https://unicode-org.atlassian.net/browse/ICU-22422 Collation folding jlf: see Markus Scherer feedback https://sourceforge.net/p/icu/mailman/icu-design/thread/SN6PR00MB04468327B475F4D6A19CF26FAFFFA%40SN6PR00MB0446.namprd00.prod.outlook.com/#msg38268251 [icu-design] Collation Folding Tables jlf: this is a discussion related to ICU-22422 https://www.unicode.org/reports/tr10/#Collation_Folding Collation Folding Matching can be done by using the collation elements, directly, as discussed above. However, because matching does not use any of the ordering information, the same result can be achieved by a folding. That is, two strings would fold to the same string if and only if they would match according to the (tailored) collation. For example, a folding for a Danish collation would map both "Gård" and "gaard" to the same value. A folding for a primary-strength folding would map "Resume" and "résumé" to the same value. That folded value is typically a lowercase string, such as "resume". jlf: Chrome matches "Gård" with "gard", but not with "gaard". A comparison between folded strings cannot be used for an ordering of strings, but it can be applied to searching and matching quite effectively. The data for the folding can be smaller, because the ordering information does not need to be included. The folded strings are typically much shorter than a sort key, and are human-readable, unlike the sort key. The processing necessary to produce the folding string can also be faster than that used to create the sort key. Transliterate "micro sign" to "u" using Transliterator from icu4j jlf: next is an answer on icu-support@lists.sourceforge.net https://sourceforge.net/p/icu/mailman/message/58712806/ On Wed, Dec 13, 2023 at 7:52 PM <go.al.ni@gmail.com> wrote: > Micro sign transliterated to "m" in one case, but not in another. 
While I don't know enough about the Any-Latin transliteration rules to be able to tell you why this happens, the thing that happens is that when you have any preceding Greek letter the transliterator will afterwards treat also the micro sign (U+00B5) as a Greek letter, while it otherwise will leave it as-is, as any other symbol. If you want to transliterate only Greek letters you could explicitly create a Greek transliterator, which then will always treat also the micro sign (U+00B5) as a Greek letter: var tr = Transliterator.getInstance("Greek-Latin"); Or, if you want to first treat any symbols that are also Greek letters explicitly as Greek letters and then perform the Any-Latin transliteration: var tr = Transliterator.getInstance("Greek-Latin; Any-Latin;"); Or, if you want just Any-Latin but with a special case for the micro sign (U+00B5): var tr = Transliterator.createFromRules("MyAnyLatin", "µ > m; ::Any-Latin;", Transliterator.FORWARD); [icu-support] CollationKey for efficient collation-aware in-place substring comparison Question https://sourceforge.net/p/icu/mailman/message/58741675/ I have a question regarding the use of CollationKey <https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/CollationKey.html> to check whether one string "contains" the other (i.e. right string is found anywhere in the left string, accounting for any specified rule-based collation using ICU4J). With this, my use case in Java would be something like: *contains(String left, String right, String collation)*. Suppose that *collation* here is a parameter indicating the collation at hand (for example: "Latin1_General_CS_AI"), and is used to get the appropriate instance of *com.ibm.icu.text.Collator* (exact routing for this collation is handled elsewhere in the codebase). Problem description Due to the nature of this operation, using *Collator.compare(String, String)* proves inefficient for this problem, because it would require allocating O(N) substrings of *left *before calling *compare(left.substring(), right)*. Suppose N here is the length of the *left* string. Example: *contains*("Abć", "a", "Latin1_General_CS_AI"); // returns false - calls: *collator.compare("A", "a")* // returns false ("A" here is "Abć".substring(0,1)) - calls: *collator.compare("b", "a")* // returns false ("b" here is "Abć".substring(1,2)) - calls: *collator.compare("ć", "a")* // returns false ("ć" here is "Abć".substring(2,3)) Here, this approach allocates *3 new strings* in order to do the comparisons. Using CollationKey As I understood, *com.ibm.icu.text.CollationKey* is the way to go for repeated comparison of strings. Here, I would like to compare strings in a way that only requires generating one key for *left* (let's call it *leftKey*) and one key for *right* (let's call it *rightKey*), and then comparing these arrays in-place, byte per byte. However, it doesn't seem that this operation is supported out-of-the-box with *CollationKey*. While one can easily use two collation keys for equality comparison and collation-aware ordering, I'm not sure if this holds for substring operations as well? Given a collation key for "Abć", is there a constant-time way to obtain collation keys for "A", "b", and "ć"? Ideally, I would want to only traverse the "Abć" collation key (*leftKey*) as a plain byte array, and do in-place comparison with the "ć" collation key (*rightKey*) as a plain byte array. However, it doesn't seem straightforward given the structure of the collation key (suffixes, etc.) 
public boolean contains(String left, String right, String collation) { > Collator collator = ...(collation); > // get collation keys > CollationKey leftKey = collator.getCollationKey(left); > CollationKey rightKey = collator.getCollationKey(right); > // get byte arrays > byte[] lBytes = leftKey.toByteArray(); > byte[] rBytes = rightKey.toByteArray(); > // in-place comparison > for (int i = 0; i <= lBytes.length - rBytes.length; i++) { > if (compareKeys(lBytes, rBytes, i)) { > return true; > } > } > return false; > } Suppose there's a simple helper function such as: > private boolean compareKeys(byte[] lBytes, byte[] rBytes, int offset) { > int len = rBytes.length; > // compare lBytes[i, i+len] to rBytes[0, len] in-place, byte by byte... > } Could you please provide any support regarding how to implement this solution so that it fully takes into account the collation key byte array structure? As of now, this simple comparison doesn't work because there are some suffixes in both *leftKey* and *rightKey*, so exact comparison is not possible, but I'm wondering if there is a way to go around this. Alternative It turns out that making use of *Collator.compare(Object, **Object**)* instead of *Collator.compare(String, **String**)* doesn't prove to be any better either, because it does *toString()* anyway, regressing the performance in a similar fashion. Ideally, an implementation such as *Collator.compare(Character, **Character**)* could do the trick, however only under the condition that it would *not allocate* a new *String* for the two arguments. This would allow traversing *left* and *right* strings and comparing individual characters just by using *String.charAt* (with no extra *String* allocation whatsoever). However, I don't believe there is currently anything like *Collator.compare(**Character**, **Character**)* that works exactly like this. So for now, I'm trying to implement this functionality using *CollationKey*. Answer from Markus Sherer https://sourceforge.net/p/icu/mailman/message/58741856/ Yes, but CollationKey is too low-level, and you would have to compute and store the CollationKey for the entire left string at once, which could be large. “Don't do this at home” :-) Please use class <https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html> StringSearch <https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html> https://unicode-org.github.io/icu/userguide/collation/string-search.html I don't remember if StringSearch automatically loads "search" tailorings; it's possible that you may have to request that explicitly. https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback https://www.unicode.org/reports/tr10/#Searching
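The thread above is ICU4J, but the same two ideas are easy to demo with ICU4C: primary strength makes case/accent differences irrelevant for matching, and icu::StringSearch (the class Markus Scherer points to) does collation-aware substring search without building per-substring keys. Hedged C++ sketch, assumes a standard ICU build and a UTF-8 source file:

    #include <unicode/coll.h>
    #include <unicode/stsearch.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main() {
        UErrorCode status = U_ZERO_ERROR;

        // 1) Primary strength: "Resume" and "résumé" compare equal (case and accents ignored).
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale::getRoot(), status));
        coll->setStrength(icu::Collator::PRIMARY);
        UCollationResult r = coll->compare(icu::UnicodeString::fromUTF8("Resume"),
                                           icu::UnicodeString::fromUTF8("résumé"), status);
        std::cout << "primary-equal: " << (r == UCOL_EQUAL) << "\n";   // 1

        // 2) Collation-aware "contains": find "a" inside "Abć" (the example from the question).
        icu::UnicodeString haystack = icu::UnicodeString::fromUTF8("Abć");
        icu::UnicodeString needle = icu::UnicodeString::fromUTF8("a");
        icu::StringSearch search(needle, haystack, icu::Locale::getRoot(), nullptr, status);
        search.getCollator()->setStrength(icu::Collator::PRIMARY);
        search.reset();                                     // re-apply after changing the collator
        int32_t pos = search.first(status);                 // USEARCH_DONE when not found
        std::cout << "found at UTF-16 index: " << pos << "\n";   // expected: 0
        return U_SUCCESS(status) ? 0 : 1;
    }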

ICU demos


https://icu4c-demos.unicode.org/icu-bin/icudemos todo: review https://icu4c-demos.unicode.org/icu-bin/collation.html ICU Collation Demo https://icu4c-demos.unicode.org/icu-bin/convexp Demo Converter Explorer https://icu4c-demos.unicode.org/icu-bin/scompare ICU Unicode String Comparison Interactive demo application

ICU bindings


02/06/2021 https://gitlab.pyicu.org/main/pyicu Python extension wrapping the ICU C++ libraries. 02/06/2021 https://docs.microsoft.com/en-us/windows/win32/intl/international-components-for-unicode--icu- In Windows 10 Creators Update, ICU was integrated into Windows, making the C APIs and data publicly accessible. The version of ICU in Windows only exposes the C APIs. It is impossible to ever expose the C++ APIs due to the lack of a stable ABI in C++. Getting started 1) Your application needs to target Windows 10 Version 1703 (Creators Update) or higher. 2) Add in the header: #include <icu.h> 3) Link to: icu.lib Example: void FormatDateTimeICU() { UErrorCode status = U_ZERO_ERROR; // Create a ICU date formatter, using only the 'short date' style format. UDateFormat* dateFormatter = udat_open(UDAT_NONE, UDAT_SHORT, nullptr, nullptr, -1, nullptr, 0, &status); if (U_FAILURE(status)) { ErrorMessage(L"Failed to create date formatter."); return; } // Get the current date and time. UDate currentDateTime = ucal_getNow(); int32_t stringSize = 0; // Determine how large the formatted string from ICU would be. stringSize = udat_format(dateFormatter, currentDateTime, nullptr, 0, nullptr, &status); if (status == U_BUFFER_OVERFLOW_ERROR) { status = U_ZERO_ERROR; // Allocate space for the formatted string. auto dateString = std::make_unique<UChar[]>(stringSize + 1); // Format the date time into the string. udat_format(dateFormatter, currentDateTime, dateString.get(), stringSize + 1, nullptr, &status); if (U_FAILURE(status)) { ErrorMessage(L"Failed to format the date time."); return; } // Output the formatted date time. OutputMessage(dateString.get()); } else { ErrorMessage(L"An error occured while trying to determine the size of the formatted date time."); return; } // We need to close the ICU date formatter. udat_close(dateFormatter); } http://www.boost.org/doc/libs/1_58_0/libs/locale/doc/html/index.html Boost.Locale creates the natural glue between the C++ locales framework, iostreams, and the powerful ICU library http://blog.lukhnos.org/post/6441462604/using-os-xs-built-in-icu-library-in-your-own Using OS X’s Built-in ICU Library in Your Own Project

ICU4X title


https://icu4x.unicode.org/ lead by Shane Carr (https://www.sffc.xyz) https://github.com/unicode-org/icu4x https://docs.rs jlf: if there is a version number in the path, you can replace it with "latest" https://www.unicode.org/faq/unicode_license.html jlf: ICU4X uses UNICODE LICENSE V3 The Unicode License is a permissive MIT type of license. However, there are several additional considerations identified separately in the associated Unicode Terms of Use (https://www.unicode.org/copyright.html). --- Comparison with other licenses: https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses jlf: hum... the "unicode license" is not in this table... https://www.reddit.com/r/rust/comments/q4xaig/icu_vs_rust_icu/ icu vs rust_icu Oct 10, 2021 --- jlf : here "icu" is ICU4X and rust_icu is another crate. Well... it's a mess, plenty of separated crates more or less finalized. There is a comment from an ICU4X committer saying "ICU4X does not have normalization". Of course, it's now supported but it's to say that ICU4X is far to be as complete as ICU. https://news.ycombinator.com/item?id=35608997 ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices MONDAY, APRIL 17, 2023 http://blog.unicode.org/2022/09/announcing-icu4x-10.html SEPTEMBER 29, 2022 Announcing ICU4X 1.0 This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Lightweight: ICU4X is Unicode's first library to support static data slicing and dynamic data loading. Portable: ICU4X supports multiple programming languages out of the box. ICU4X can be used in the Rust programming language natively, with official wrappers in C++ via the foreign function interface (FFI) and JavaScript via WebAssembly. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an explicit data provider argument on most constructor functions. ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers. https://manishearth.github.io/blog/2022/08/03/zero-copy-1-not-a-yoking-matter/ https://manishearth.github.io/blog/2022/08/03/zero-copy-2-zero-copy-all-the-things/ https://manishearth.github.io/blog/2022/08/03/zero-copy-3-so-zero-its-dot-dot-dot-negative/ (jlf: Related to ICU4X, but should I read that ? It's internal Rust stuff) https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md Using ICU4X from C++ https://www.reddit.com/r/programming/comments/xrmine/the_unicode_consortium_announces_icu4x_10_its_new/ The C and C++ APIs are header-only, you use them by linking to the icu_capi crate (more on this here). https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/cpp.md The C API is just not that idiomatic, so we don't advertise it as much. It exists more as a crutch for other languages to be able to call in, and it's optimized for cross language interop. That said, it has been pointed out to me that it's not that unidiomatic when you compare it with other large C libraries, so perhaps that's okay. We do have some tests that use it directly and it's .... fine to work with. Not an amazing experience, not terrible either. 
--- jlf: to investigate The C wrapper is probably better to use from Executor, because there is no hidden magic for memory management. The C++ wrapper is difficult to understand (at least to me, for the moment) because it's modern C++. https://www.reddit.com/r/rust/comments/xrh7h6/announcing_icu4x_10_new_internationalization/ icu_segmenter implements rule based segmentation, so you can actually customize the segmentation rules based on your needs by writing some toml and feeding it to datagen. The concept of a "character" or "word" has no single cross-linguistic meaning; it is not uncommon to need to tailor these algorithms by use case or even just the language being used. E.g. handling viramas in Indic scripts as a part of grapheme segmentation is a thing people might need, but may also not need, and UAX29 doesn't support that at the moment¹. CLDR contains a bunch of common tailorings for specific locales here, but as I mentioned folks may tailor further based on use case. Furthermore, icu_segmenter supports dictionary-based segmentation: for languages like Japanese and Thai where spaces are not typically used, you need a large dictionary to be able to segment them accurately (and again, it's language-specific). ICU4X's flexible data model means that you don't need to ship your application with this data and instead fetch it when it's actually necessary. We both support using dictionaries and an LSTM model depending on your code size/data size needs. https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit#heading=h.xy9pq2mk1ypz ICU4X Segmenter Investigation https://github.com/unicode-org/icu4x/issues/1397 Character names jlf: Not yet supported by ICU4X, too bad... I need that for Executor. https://github.com/unicode-org/icu4x/issues/545 Reconsider UTF-32 support jlf: see also the comments about PyICU https://github.com/unicode-org/icu4x/issues/131 Port BytesTrie to ICU4X #131 with feedback from Markus Scherer (ICU) https://github.com/unicode-org/icu4x/issues/2721 Specialized zerovec collections for stringy types Sketch of a potential AsciiTrie. https://github.com/unicode-org/icu4x/pull/2722 Experimental AsciiTrie implementation https://github.com/unicode-org/icu4x/issues/2755 Get word break type When you iterate through text using the WordBreakIterator, you get the boundaries of words, spaces, punctuation, etc. It does not appear to tell you what kind of token or break that is has found. The C-language version of ICU has a function on the iterator called getRuleStatus() that returns an enum that describes the last break it found. The documentation is here: https://unicode-org.github.io/icu/userguide/boundaryanalysis/ https://github.com/unicode-org/icu4x/pull/2777/files added initial benchmarks for normalizer. https://github.com/unicode-org/icu4x/discussions/2877 How to use segmenter https://github.com/unicode-org/icu4x/issues/2886 Consider providing mapping to Standardized Variants for CJK Compatibility Ideographs Across GitHub, I found 3 users of this feature in unicode-normalization: https://github.com/sunfishcode/basic-text (by the implementor of the unicode-normalization feature) https://github.com/logannc/fuzzywuzzy-rs (unclear to me why you'd want this for a fuzzy match; I'd expect a fuzzy match not to want to distinguish the variations) https://github.com/crlf0710/runestr-rs https://github.com/unicode-org/icu4x/issues/2975 How supported do we consider non-keyextract users? 
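About "Get word break type" (icu4x#2755) above: this is what ICU4C's getRuleStatus() looks like on a word BreakIterator (hedged C++ sketch):

    #include <unicode/brkiter.h>
    #include <unicode/ubrk.h>          // UBRK_WORD_* rule-status ranges
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> bi(
            icu::BreakIterator::createWordInstance(icu::Locale::getEnglish(), status));
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("Hello, world 42");
        bi->setText(text);

        int32_t start = bi->first();
        for (int32_t end = bi->next(); end != icu::BreakIterator::DONE; start = end, end = bi->next()) {
            int32_t tag = bi->getRuleStatus();             // classifies the segment [start, end)
            const char *kind = "none (space/punctuation)";
            if (tag >= UBRK_WORD_NUMBER && tag < UBRK_WORD_NUMBER_LIMIT) kind = "number";
            else if (tag >= UBRK_WORD_LETTER && tag < UBRK_WORD_LETTER_LIMIT) kind = "letter";
            else if (tag >= UBRK_WORD_KANA && tag < UBRK_WORD_KANA_LIMIT) kind = "kana";
            else if (tag >= UBRK_WORD_IDEO && tag < UBRK_WORD_IDEO_LIMIT) kind = "ideographic";
            std::string piece;
            text.tempSubStringBetween(start, end).toUTF8String(piece);
            std::cout << "[" << piece << "] -> " << kind << "\n";
        }
        return U_SUCCESS(status) ? 0 : 1;
    }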
https://github.com/unicode-org/icu4x/issues/2908 Time zone needs for calendar application Use case by team member of Mozilla Thunderbird Not related to Unicode, but related to the fact I put the ICU4X cdylib in Executor github... https://github.com/ankane/polars-ruby/blob/master/ext/polars/Cargo.toml Is it a way to avoid bundling the original rust lib? https://news.ycombinator.com/item?id=34425233 --- Not clear to me: for Python, are the lib binaries installed by https://pypi.org/project/polars/ ? apparently yes, see https://pypi.org/project/polars/#files --- For ruby, is it built by a github workflow? https://github.com/ankane/polars-ruby/blob/master/.github/workflows/release.yml https://github.com/unicode-org/icu4x/pull/2779/files add collator initial bench https://github.com/unicode-org/icu4x/issues/3151 icu_casemapping feature request: methods fold and full_fold should apply Turkic mappings depending on locale --- Markus Scherer: Applying Turkic case foldings automatically is dangerous. While case mappings are intended for human consumption and take a locale parameter, case foldings are used for processing (case-insensitive matching) not for display, and in most cases it is very surprising when "IBM" and "ibm" don't match when the locale is Turkish or Azerbaijani. It is much safer to let the developer control this explicitly. (By comparison, ICU4C/ICU4J have folding functions that take a boolean parameter for default vs. Turkic foldings. This also models the boolean condition in the relevant Unicode data file.) --- lucatrv If I understand correctly, icu_collator should be used when strings need to be sorted, while a case-folding method of icu_casemapping should be used when strings need just to be matched. However icu_collator can also be used to match strings, see for instance examples using Ordering::Equal here, so it is not clear to me which one to use in this case. Finally, another source of confusion (at least for me) is that icu_casemapping can be used for both case mapping and case folding, but its documentation mentions only "Case mapping for Unicode characters and strings". --- sffc The collator does a fuzzier match. The example you cited shows that it considers "às" and "as" to be equal, for example. @markusicu is it safe to say that most users who are looking for a fuzzy string comparison utility should favor the collator over casefold+nfd? --- sffc See also https://github.com/tc39/ecma402/issues/256 --- hsivonen Casefold+NFD and ignoring combining diacritics after the NFD operation gives a general case-insensitive, diacritic-insensitive match. To further match the root search collation (apart from the Hangul aspect for which I don't understand the use case), you'd have to also ignore certain Arabic marks and the Thai phinthu (virama). (The Hebrew aspect of the search root is gone from CLDR trunk already.) Apart from Turkic case-insensitivity, the key thing that the search collation tailorings provide on top of the above is being able to have a diacritic-insensitive mode where certain things that technically are diacritics but that are on a per-language basis considered to form a distinct base letter are not ignored on a locale-sensitive basis. For example, o and ö are distinct for Finnish, Swedish, Icelandic, and Turkish (not sure if them being equal for Estonian search is intentional or a CLDR bug) in collator-based search even when ignoring diacritics. 
Based on observing the performance of Firefox's ctrl/cmd-f (not collator based) relative to Chrome's and Safari's (collator-based), I believe that casefold+NFD and ignoring certain things post-NFD will be faster than collator-based search. However, if you also want not to ignore certain diacritics on a per-locale basis, it's up to you to implement those rules. That is, ICU4X doesn't do it for you. You can find out what the rules are by reading the CLDR search collation sources. (FWIW, Firefox's ctrl/cmd-f does not have locale-dependent rules for diacritics. The checkbox either ignores all of them or none.) ECMA-402 and ICU4X don't have API surface for collator-based substring match. You can only do full-string comparison, so you can search in the sense of filtering a set/list of items by a search key. --- Markus Scherer > If I understand correctly, CaseMapping::to_full_fold applies full case folding > + NFD and ignores combining diacritics. I think not. I believe it just applies the “full” Case_Folding mappings to each character, as opposed to the Simple_Case_Folding. Normalization and removing diacritics etc. would be separate steps / function calls. https://www.unicode.org/reports/tr44/#Case_Folding > Therefore it actually provides the fuzziest match (general case-insensitive > and diacritic-insensitive match). To my understanding this should be equivalent > to the icu_collator primary strength level, > https://icu4x.unicode.org/doc/icu_collator/enum.Strength.html#variant.Primary No. Similar in effect, but as Henri said, collation mappings do a lot more, such as ignoring control codes and variation selectors. > which I guess is independent from locale Not really. There are language-specific collation mappings, such as German "ä"="ae" (on primary level), but of course for the majority of Unicode characters each tailoring behaves like the Unicode default. Collation also provides for a number of parametric settings, although most of those are relevant for sorting, not for matching and searching. They do let you select things like “ignore punctuation” and “ignore diacritics but not case”. https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options --- lucatrv Referring to Section 3.13, Default Case Algorithms in the Unicode Standard, now I understand that CaseMapping::full_fold applies the toCasefold(X) operation (R4 page 155), which is the Case_Folding property. To allow proper caseless matching of strings interpreted as identifiers, in my opinion another method CaseMapping::NFKC_full_fold should be added, to apply the toNFKC_Casefold(X) operation (R5 page 155), which is the NFKC_Casefold property. Then another method should be added to allow identifier caseless matching, which could be either the combined function toNFKC_Casefold(NFD(X)) (D147 page 158) or the lower level NFD(X) normalization function. Otherwise to keep things simpler, maybe just a method named CaseMapping::caseless could be added which applies toNFKC_Casefold(NFD(X)) (D147 page 158). Do you agree, or otherwise how can I perform proper caseless categorization and matching? --- eggrobin For case-insensitive identifier comparison (identifiers include programming language identifiers, but also things like usernames: @EGGROBIN and @eggrobin are the same person), Unicode provides the operation toNFKC_Casefold, used in the definition of identifier caseless match (D147 in Default Caseless Matching). 
Earlier versions of Unicode (prior to 5.2) recommended the use of NFKC and casefolding directly, without the removal of default ignorable code points performed by toNFKC_Casefold. The foldings thus have stability guarantees that make them suitable for usage in identifier comparison in conjunction with NFKC (see https://www.unicode.org/policies/stability_policy.html#Case_Folding). As @markusicu wrote above, since identifier systems typically need to use a locale-independent comparison, the Turkic foldings need to be used with great care: whether @eggrobin is the same as @EGGROBIN should not depend on someone’s language. @markusicu is it safe to say that most users who are looking for a fuzzy string comparison utility should favor the collator over casefold+nfd? ^ @macchiati for advice on the most recommended way to perform fuzzy string matching. I am neither Markus nor Mark, but I would say that for general-purpose matching that does not have stability requirements, something collation-based is more appropriate. In particular, Chrome’s Ctrl+F search uses that. This is, as has been mentioned, fuzzier (beyond the accents already mentioned, note that ŒUF and œuf are primary-equal to oeuf, whereas they are not identifier caseless matches). An important consideration is that, being unstable (there is a somewhat squishy stability policy, see https://www.unicode.org/policies/collation_stability.html and https://www.unicode.org/collation/ducet-changes.html), fuzzy matching based on collation can be improved. Most recently the UTC approved (in consensus 174-C4) a change to the collation of punctuation marks that look like the ASCII ' and ", which has the effect that O'Connor will now be primary-equal to O’Connor. https://github.com/unicode-org/icu4x/issues/3178 Consider supporting three layers of collation data for search collations Markus Scherer Outside of ICU4X we usually try to make code & data work according to the algorithms, not according to what the known data looks like right now. ICU4C/J allow users to build custom tailorings at build time and at runtime. It should be possible to tailor relative to something that is tailored in the intermediate root search. https://github.com/unicode-org/icu4x/issues/3174#issuecomment-1624080765 Should search collation be a different data key + constructor? #3174 --- jlf Don't know if this long comment brings something useful for Rexx. They are searching for use-cases. whole-string matching, collation, substring or prefix matching. https://www.unicode.org/reports/tr10/#Searching: It's typically used for a substring match, like Ctrl-F in a browser. Why is collation the way it is? There's a use case for diacritic-insensitive string matching. And there is also the observation that you need special handling for certain diacritics like German umlauts. It seems weird that Thai for example has certain tailorings that are not in other Brahmic languages. https://github.com/unicode-org/icu4x/discussions/3981#discussioncomment-6882618 String search with collators references this ICU link: https://unicode-org.github.io/icu/userguide/collation/string-search.html https://github.com/unicode-org/icu4x/issues/3174#issuecomment-1624080765 Should search collation be a different data key + constructor? 
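Hedged ICU4C sketch of the two approaches discussed in the #3151 thread above: the "casefold + NFD + ignore combining marks" fuzzy match that hsivonen describes (my own minimal version, not the search collation itself), and toNFKC_Casefold for identifier-style caseless matching:

    #include <unicode/normalizer2.h>
    #include <unicode/uchar.h>
    #include <unicode/unistr.h>
    #include <unicode/utf16.h>
    #include <iostream>
    #include <string>

    // Case-insensitive, diacritic-insensitive key: casefold, NFD, drop Mn marks.
    static icu::UnicodeString fuzzyKey(const icu::UnicodeString &s, UErrorCode &status) {
        const icu::Normalizer2 *nfd = icu::Normalizer2::getNFDInstance(status);
        icu::UnicodeString folded(s);
        folded.foldCase(U_FOLD_CASE_DEFAULT);              // not the Turkic variant (see above)
        icu::UnicodeString decomposed = nfd->normalize(folded, status);
        icu::UnicodeString out;
        for (int32_t i = 0; i < decomposed.length(); ) {
            UChar32 c = decomposed.char32At(i);
            if (u_charType(c) != U_NON_SPACING_MARK) out.append(c);
            i += U16_LENGTH(c);
        }
        return out;
    }

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString a = fuzzyKey(icu::UnicodeString::fromUTF8("Gård"), status);
        icu::UnicodeString b = fuzzyKey(icu::UnicodeString::fromUTF8("gard"), status);
        std::cout << "fuzzy match: " << (a == b) << "\n";  // 1, like Ctrl+F in a browser

        // Identifier caseless match: toNFKC_Casefold(X) (stable, locale-independent).
        const icu::Normalizer2 *nfkcCf = icu::Normalizer2::getNFKCCasefoldInstance(status);
        std::string id;
        nfkcCf->normalize(icu::UnicodeString::fromUTF8("ＥｇｇRobin"), status).toUTF8String(id);
        std::cout << "NFKC_Casefold: " << id << "\n";      // expected: "eggrobin"
        return U_SUCCESS(status) ? 0 : 1;
    }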
jlf: referenced from #3981 with this comment: We've had discussions about search collations in the past, such as #3174 Basically, we need a client with a clear and compelling use case who ideally can make some contributions, and then the team can provide mentorship to help land this type of feature. icu_collator version 1.3.3 is released. https://github.com/unicode-org/icu4x/releases/tag/ind%2Ficu_collator%401.3.3 https://docs.rs/icu_collator/latest/icu_collator/ Comparing strings according to language-dependent conventions. jlf: with examples jlf: implementation notes. https://docs.rs/icu_collator/latest/icu_collator/docs/index.html They use NFD? "The key design difference between ICU4C and ICU4X is that ICU4C puts the canonical closure in the data (larger data) to enable lookup directly by precomposed characters while ICU4X always omits the canonical closure and always normalizes to NFD on the fly." jlf: ok, on the fly, so part of their algorithm. https://github.com/unicode-org/icu4x/discussions/3231#discussioncomment-5599221 @sffc , Will ICU4X Test Data provider give correct results for Lao language? I was running segment_utf16 on Lao string but its results are not inline with ICU4C results. The ICU4X Test Data provider supports Japanese and Thai. For the other languages, you should follow the steps in the tutorial to generate your own data; in general the testdata provider is intended for testing. You can also track #2945 which will make it possible to get full data without needing to build it using the tool. https://www.youtube.com/watch?v=ZzsbN7HBd7E Rust Zürisee, Dec 2022: Next Generation i18n with Rust Using ICU4X Talk by Shane Carr (starts at 11:20, with some intros from the organizers first) https://github.com/unicode-org/icu4x/discussions/3522 Some word segmentation results are different than we get in ICU4C - Khmer string មនុស្សទាំងអស់ is giving 13 index as a breakpoint in ICU4X while ICU4C gives 6 - ຮ່ສົ່ສີ 5 in ICU4C while 7 in ICU4X - กระเพรา 3 in ICU4C while 7 in ICU4X I'm using the full data blob with all keys and locales. jlf: see the discussion, there is some code. https://github.com/unicode-org/icu4x/issues/2945 Default constructors with full data jlf: remember "close #2743 in favour of #2945. the solution we're working on there trivially extends to FFI." sffc We have built data providers as a first-class feature in ICU4X. We currently tutor clients on how to build their data file and detail all the knobs at their disposal, which is essential to ICU4X's mission. 
https://github.com/unicode-org/icu4x/issues/3552#issuecomment-1600050638 /// ICU4C's TestGreekUpper #[test] fn test_greek_upper() { let cm = CaseMapping::new_with_locale(&locale!("el")); // https://unicode-org.atlassian.net/browse/ICU-5456 assert_eq!(cm.to_full_uppercase_string("άδικος, κείμενο, ίριδα"), "ΑΔΙΚΟΣ, ΚΕΙΜΕΝΟ, ΙΡΙΔΑ"); // https://bugzilla.mozilla.org/show_bug.cgi?id=307039 // https://bug307039.bmoattachments.org/attachment.cgi?id=194893 assert_eq!(cm.to_full_uppercase_string("Πατάτα"), "ΠΑΤΑΤΑ"); assert_eq!(cm.to_full_uppercase_string("Αέρας, Μυστήριο, Ωραίο"), "ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ, ΩΡΑΙΟ"); assert_eq!(cm.to_full_uppercase_string("Μαΐου, Πόρος, Ρύθμιση"), "ΜΑΪΟΥ, ΠΟΡΟΣ, ΡΥΘΜΙΣΗ"); assert_eq!(cm.to_full_uppercase_string("ΰ, Τηρώ, Μάιος"), "Ϋ, ΤΗΡΩ, ΜΑΪΟΣ"); assert_eq!(cm.to_full_uppercase_string("άυλος"), "ΑΫΛΟΣ"); assert_eq!(cm.to_full_uppercase_string("ΑΫΛΟΣ"), "ΑΫΛΟΣ"); assert_eq!(cm.to_full_uppercase_string("Άκλιτα ρήματα ή άκλιτες μετοχές"), "ΑΚΛΙΤΑ ΡΗΜΑΤΑ Ή ΑΚΛΙΤΕΣ ΜΕΤΟΧΕΣ"); // http://www.unicode.org/udhr/d/udhr_ell_monotonic.html assert_eq!(cm.to_full_uppercase_string("Επειδή η αναγνώριση της αξιοπρέπειας"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ ΤΗΣ ΑΞΙΟΠΡΕΠΕΙΑΣ"); assert_eq!(cm.to_full_uppercase_string("νομικού ή διεθνούς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ"); // http://unicode.org/udhr/d/udhr_ell_polytonic.html assert_eq!(cm.to_full_uppercase_string("Ἐπειδὴ ἡ ἀναγνώριση"), "ΕΠΕΙΔΗ Η ΑΝΑΓΝΩΡΙΣΗ"); assert_eq!(cm.to_full_uppercase_string("νομικοῦ ἢ διεθνοῦς"), "ΝΟΜΙΚΟΥ Ή ΔΙΕΘΝΟΥΣ"); // From Google bug report assert_eq!(cm.to_full_uppercase_string("Νέο, Δημιουργία"), "ΝΕΟ, ΔΗΜΙΟΥΡΓΙΑ"); // http://crbug.com/234797 assert_eq!(cm.to_full_uppercase_string("Ελάτε να φάτε τα καλύτερα παϊδάκια!"), "ΕΛΑΤΕ ΝΑ ΦΑΤΕ ΤΑ ΚΑΛΥΤΕΡΑ ΠΑΪΔΑΚΙΑ!"); assert_eq!(cm.to_full_uppercase_string("Μαΐου, τρόλεϊ"), "ΜΑΪΟΥ, ΤΡΟΛΕΪ"); assert_eq!(cm.to_full_uppercase_string("Το ένα ή το άλλο."), "ΤΟ ΕΝΑ Ή ΤΟ ΑΛΛΟ."); // http://multilingualtypesetting.co.uk/blog/greek-typesetting-tips/ assert_eq!(cm.to_full_uppercase_string("ρωμέικα"), "ΡΩΜΕΪΚΑ"); assert_eq!(cm.to_full_uppercase_string("ή."), "Ή."); } https://github.com/unicode-org/icu4x/discussions/3688#discussioncomment-6456010 Recommended data provider type for libraries depending on ICU4X --- I finished creating a library that uses ICU4X as its backend, while learning Rust. For my library I used the DataProvider for as the interface to CLDR data (currently just using icu_testdata, though seen the page to generate customised datasets). So now I am wondering what would be the recommended data provider to use for a library using ICU4X as its backend? --- If you know the data you want at build time, I suggest using a baked data provider, otherwise use a Blob one with postcard. You can generate data using these steps https://github.com/unicode-org/icu4x/blob/main/docs/tutorials/data_management.md In the 1.3 release there will be a compiled_data feature that lets you include data by default, kinda like testdata but intended for production. --- compiled_data feature may just be what my library could use without the need for users to supply data provider for my library, if I understand the intended purpose of this up coming feature. Where is this feature located in the master, so I may start looking at it for design purposes, while waiting for 1.3 release? --- jlf: this answer is seriously incomprehensible! The feature is present on all of the component crates and it exposes functions like DateTimeFormatter::try_new() that don't have a provider argument. 
https://unicode-org.github.io/icu4x/docs/icu/datetime/struct.DateTimeFormatter.html#method.try_new The crate also does contain an unstable baked provider that users can pass in themselves, but note that it only implements data stuff from that particular crate and they'll need to combine it with providers from other crates if the type they are using uses data from everywhere (like DateTimeFormat: it uses plurals and decimal data too) https://unicode-org.github.io/icu4x/docs/icu/datetime/provider/struct.Baked.html --- This is a good question; what should intermediate libraries expose to their users? I'll schedule this for a discussion at an upcoming developers call. https://github.com/unicode-org/icu4x/issues/3709 Chinese and Dangi inconsistent with ICU implementations for extreme dates The current implementation of the Chinese calendar, as well as the Dangi calendar in #3694, are not consistent with ICU for all dates; based on writing a number of manual test cases (see the aforementioned PR), this seems to only be an issue for dates very far in the past or far in the future (ex. year -3000 ISO). Furthermore, the ICU4X Chinese/Dangi and astronomy functions are newly-written and have several algorithms based on the most recent edition of Calendrical Calculations, while the existing ICU code seems to be from 2000, incorporating algorithms from the 1997 edition of Calendrical Calculations. --- jlf: I take note of this because it's interesting to see the differences with ICU. Calendars in https://github.com/unicode-org/icu4x/pull/3744#discussion_r1277062568 they reference this common lisp code https://github.com/EdReingold/calendar-code2/blob/main/calendar.l#L2352 --- jlf: I take note of this to remember ;;;; The Functions (code, comments, and definitions) contained in this ;;;; file (the "Program") were written by Edward M. Reingold and Nachum ;;;; Dershowitz (the "Authors") ;;;; These Functions are explained in the Authors' ;;;; book, "Calendrical Calculations", 4th ed. (Cambridge University ;;;; Press, 2016) --- https://en.wikipedia.org/wiki/Calendrical_Calculations https://reingold.co/calendars.shtml The resource page for the book makes all the source code for the book available for download. https://www.cambridge.org/ch/universitypress/subjects/computer-science/computing-general-interest/calendrical-calculations-ultimate-edition-4th-edition?format=PB&isbn=9781107683167#resources The code has been ported to Python https://github.com/espinielli/pycalcal https://github.com/uni-algo/uni-algo/issues/31 L with stroke letter (U+0141, U+0142) doesn't normalize. auto const polish = std::string{"ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"}; auto norm = una::norm::to_unaccent_utf8(polish); Everything is normalized except 'ł' and 'Ł'. Everything is normalized except 'ł' and 'Ł'. --- Strokes are not accents. As far as I know there is no data table in Unicode that maps L with stroke to L so no plans to implemented it, you need to do it manually if needed. 
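Quick ICU4C check of the point above (hedged sketch): U+0142 has no canonical decomposition, so NFD + strip-marks cannot touch it; ICU's Latin-ASCII transform is one "manual" way to get a plain l:

    #include <unicode/normalizer2.h>
    #include <unicode/translit.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        UErrorCode status = U_ZERO_ERROR;

        // U+0142 (ł) has no canonical decomposition, so there is no mark to strip.
        const icu::Normalizer2 *nfd = icu::Normalizer2::getNFDInstance(status);
        icu::UnicodeString decomp;
        std::cout << "U+0142 decomposes: "
                  << (nfd->getDecomposition(0x0142, decomp) ? "yes" : "no") << "\n";  // no

        // The usual "manual" route: the Latin-ASCII transform, which also handles strokes.
        std::unique_ptr<icu::Transliterator> latinAscii(
            icu::Transliterator::createInstance("Latin-ASCII", UTRANS_FORWARD, status));
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ");
        latinAscii->transliterate(s);
        std::string out;
        s.toUTF8String(out);
        std::cout << out << "\n";   // expected: "acelnoszz ACELNOSZZ"
        return U_SUCCESS(status) ? 0 : 1;
    }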
-- jlf: idem with utf8proc "ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"~text~nfc(stripmark:)= -- T'acełnoszz ACEŁNOSZZ' "ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ"~text~nfd(stripmark:)= -- T'acełnoszz ACEŁNOSZZ' --- https://en.wikipedia.org/wiki/%C5%81 Character Ł ł Unicode 321 0141 322 0142 CP 852 157 9D 136 88 CP 775 173 AD 136 88 Mazovia 156 9C 146 92 Windows-1250, ISO-8859-2 163 A3 179 B3 Windows-1257, ISO-8859-13 217 D9 249 F9 Mac Central European 252 FC 184 B8 https://github.com/unicode-org/icu4x/issues/2715 Minor and patch release policy https://docs.google.com/document/d/1yg_2l5FFo0aAuNi4jpgcIhIYjHqJyUoJWtMduyQ0vR8/edit ICU4X Data Versioning Design This document has been migrated to Markdown in https://github.com/unicode-org/icu4x/pull/2919 jlf: I don't see any markdown... https://github.com/unicode-org/icu4x/issues/1471 Decide on data file versioning policy jlf: For the comment of Marcus Scherer https://github.com/unicode-org/icu4x/issues/165 Data Version jlf: maybe to read As far as semantic versioning, I no longer give deference to it as the preferred way to do versioning or see the topic so singularly after seeing this talk. https://www.youtube.com/watch?v=oyLBGkS5ICk jlf: Spec-ulation Keynote - Rich Hickey The comments say it's good, did not watch. DateTime https://github.com/unicode-org/icu4x/issues/3347 DateTimeFormatter still lacks power user APIs jlf: this ticket contains potentially interesting links: Class hiearchy: https://github.com/unicode-org/icu4x/issues/380 Design doc: https://docs.google.com/document/d/1vJKR1s--RBmXLNIJSCtiTNPp08mab7ZwcTGxIZ9-ytI/edit# https://github.com/unicode-org/icu4x/pull/4334#discussion_r1403198515 Add is_normalized_up_to to Normalizer #4334 jlf remember: the Web-exposed ICU4C-backed behavior of current String.prototype.normalize in both SpiderMonkey and V8 retains unpaired surrogates in the normalization process (even after the first point in the string that needs to change under normalization). We've previously decided that ICU4X operates on the Unicode Scalar Value / Rust char value space and, therefore, will perform replacement of unpaired surrogates with the REPLACEMENT CHARACTER. https://github.com/unicode-org/icu4x/issues/4365 Segmenter does not work correctly in some languages "as `নমস্কাৰ, আপোনাৰ কি খবৰ?`"'0D'x"hi `हैलो, क्या हाल हैं?`"'0D'x"mai `नमस्ते अहाँ केना छथि?`"'0D'x"mr `नमस्कार, कसे आहात?`"'0D'x"ne `नमस्ते, कस्तो हुनुहुन्छ?`"'0D'x"or `ନମସ୍କାର ତୁମେ କେମିତି ଅଛ?`"'0D'x"sa `हे त्वं किदं असि?`"'0D'x"te `హాయ్, ఎలా ఉన్నారు?`" icu4c: 151 rust: 161 executor: 151 --- ICU4X and ICU4C are just using different definitions of EGCs; ICU4C has had a tailoring for years which has just been incorporated into Unicode 15.1, whereas ICU4X implements the 15.0 version without that tailoring. The difference is the handling of aksaras in some indic scripts: in Unicode 15.1 (and in any recent ICU4C) क्या is one EGC, but it is two EGCs (क्, या) in untailored Unicode 15.0 (and in ICU4X). --- eggrobin (For what it’s worth, क्या would be three legacy grapheme clusters, namely क्, य, and ा, see Table 1a of UAX29, whereas it is two 15.0 extended grapheme clusters and a single 15.1 extended grapheme cluster.) --- Fixed by #4536 https://github.com/unicode-org/icu4x/pull/4334 is_normalized_up_to and unpaired surrogates --- jlf: interesting discussion about the support of ill-formed strings https://github.com/unicode-org/icu4x/pull/4389 Line breaking --- jlf: they don't want to support a tailored line breaking, because this requires more than one code point of lookahead. 
https://github.com/unicode-org/icu4x/issues/4342 Add functions to get ICU4X, CLDR, and Unicode versions --- jlf: strange that they did not consider that earlier... https://github.com/unicode-org/icu4x/issues/2689 Consider exposing sort keys --- jlf : interesting for the description of the use cases (encryption, xpath) I created a section Xpath with their comments. https://github.com/unicode-org/icu4x/issues/3336 Add support for Unicode BCP 47 locale identifiers --- jlf: what is that? it's defined in https://www.unicode.org/reports/tr35/ UNICODE LOCALE DATA MARKUP LANGUAGE (LDML) Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. https://www.rfc-editor.org/rfc/bcp/bcp47.txt https://github.com/unicode-org/icu4x/issues/3247#issuecomment-1856577508 This month @anba landed Intl.Segmenter in Firefox based on the ICU4X Segmenter impl, reviewed by @dminor https://phabricator.services.mozilla.com/D195803 I had been under the impression that Intl.Segmenter was not implementable without support for random access in order to implement the containing() function. It looks like @anba's implementation loops from the start of the string and repeatedly calls next() until we reach the index. While this strategy gets the job done, I'm concerned about the performance of this with large strings where we need to reach an index deep into the string. I therefore hope that we can continue to prioritize this issue on the basis of 402 compatibility. --- jlf: to watch https://github.com/unicode-org/icu4x/issues/4523 Linebreak generated before CL (Close Punctuation) --- https://www.unicode.org/reports/tr14/#CL UNICODE LINE BREAKING ALGORITHM https://github.com/typst/typst/issues/3082 Chinese punctuation is placed at the beginning of the line in some cases --- jlf: Linebreak referenced from icu4x/issues/4523 The example is wrong, a better example is provided in icu4x/issues/4523. https://github.com/unicode-org/icu4x/pull/4389 Fix Unicode 15.0 line breaking jlf: Linebreak https://github.com/unicode-org/icu4x/issues/4146 icu_segmenter::LineSegmenter incorrectly applies rule LB8a --- jlf: Linebreak, for the examples of line breaks. https://github.com/unicode-org/icu4x/discussions/4525#discussioncomment-8155602 Mapping between browser Intl and ICU4X jlf: I don't understand what they are talking about, but there are maybe good to know informations in this thread. In particular this URL: "Sensitivity" in browsers maps to a combination of strength and case level. https://searchfox.org/mozilla-central/rev/1aa61dcd48e128a8cbfbe59b7ba43d31bd3c248a/intl/components/src/Collator.cpp#171-185 https://github.com/unicode-org/icu4x/issues/3284#issuecomment-1911226051 Should the Segmenter types accept a locale? --- Steven Loomis: Please put it into the API. I was doing planning on a work item to move this forward. This is for example languages that want to keep "ch" together etc. --- jlf: so it appears from the discussion that ICU4C implements specific rules that are not part of UAX #29. 
--- sffc The conclusions from the discussion of this issue with the CLDR design group: - Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific - Content locale/text language parameter (not UI locale): Potential for accuracy; make it optional, name it well - Ok to leave the locale on the constructor; benefit: more specific data loading even for existing dictionaries & models My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional content_locale field of type &LanguageIdentifier. --- Steven Loomis This makes no sense and contradicts the long standing requests. I would have joined, did not realize this was coming up today. https://github.com/unicode-org/icu4x/issues/58 Design a cohesive solution for supported locales https://github.com/tc39/proposal-intl-segmenter/issues/133 Custom Dictionaries and a political point of view from a Hong Kong immigrant. https://github.com/unicode-org/icu4x/issues/3284 Should the Segmenter types accept a locale? Markus Scherer: No language parameter for grapheme cluster segmenter +1 Language parameter for the other three segmenters +1 https://github.com/unicode-org/icu4x/issues/3990 Consider supporting retrieval of the language preference list from the system --- jlf: some infos and pointers, for general culture. https://github.com/unicode-org/icu4x/issues/4705 Bridge the gap between icu::properties::Script and icu::locid::subtags::Script --- jlf: this is about script names --- Markus Scherer Conversion is probably fine, but in the end they are just script codes, so it also makes sense to define the full set once and have Unicode APIs use a subset of the values. The ones in the UCD are a subset of the full set. And only the ones in the UCD have Unicode-defined long value names (identifiers). Eggrobin https://unicode.org/iso15924/codelists.html https://unicode.org/iso15924/iso15924.txt The PVA column is from https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt. Markus Scherer Also https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry look for Type: script which becomes this in CLDR: https://github.com/unicode-org/cldr/blob/main/common/validity/script.xml Note that the CLDR list includes one or more private use script subtags: https://www.unicode.org/reports/tr35/#unicode_script_subtag_validity https://www.unicode.org/reports/tr35/#Private_Use_Codes Qaag is current but yucky... Don't include Qaai which has become an alias for Zinh https://github.com/unicode-org/icu4x/issues/3014 Provide the Numeric_Value character property ICU4X is missing an API for querying the Numeric_Value property of a character. Markus Scherer Note that Numeric_Value is easy when Numeric_Type=Decimal or Numeric_Type=Digit. And maybe you need/want it only if Numeric_Type=Decimal. When Numeric_Type=Numeric, then the Numeric_Value can be negative, huge, or a fraction. These are rarely useful. https://www.unicode.org/reports/tr44/#Numeric_Value I would start with an API that returns the value of a decimal digit. Markus Scherer Most of the nt=digit characters are not part of a contiguous 0..9 range of characters. In particular, there is often no zero. Some of them are simply nt=digit because their nv is 0..9 although they are part of a larger set of "numbered list bullets" where the nv>9 numbers have nt=numeric. 
In UTS46, they are variously disallowed/mapped/valid. See https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ant%3Ddigit%3A%5D&g=uts46&i= It makes sense to me to have an API that returns the nv of nt=decimal but the nv of other characters is rarely useful to programmers. https://github.com/unicode-org/icu4x/issues/4771 LineBreakStrictness::Anywhere gives the wrong breakpoints for Arabic in icu_segmenter I am aware this is probably a unicode spec issue, rather than a rust library issue, but I thought I would point it out regardless. This is the minimal application I was using to test this behavior: use icu_segmenter::{LineBreakOptions, LineBreakStrictness, LineSegmenter}; fn main() { let test = "الخيل والليل"; let mut options = LineBreakOptions::default(); options.strictness = LineBreakStrictness::Anywhere; let segmenter = LineSegmenter::new_auto_with_options(options); let breakpoints = segmenter.segment_str(test); for bp in breakpoints { println!("{bp}: {}", &test[bp..]); } } This gives the following output: (jlf: bbedit doesn't support well this text, can't indent the whole block, can't indent a single line) 0: الخيل والليل 2: لخيل والليل 4: خيل والليل 6: يل والليل 8: ل والليل 10: والليل 11: والليل 13: الليل 15: لليل 17: ليل 19: يل 21: ل 23: as you can tell, it is breaking after every single letter, without respect to the letters' connections. However, as I am sure you are aware, the letters' connections are not optional. The output I expected is the following: 0: الخيل والليل 2: لخيل والليل 10: والليل 11: والليل 13: الليل 15: لليل 23: Putting the break points across the visual boundaries of the letters. This is not the current orthodoxy, but any looser breaks than that and you'd be rendering the text illegible and unnatural. Note: This is how old written manuscripts break their words. --- Closed as not planned https://github.com/unicode-org/icu4x/issues/4780 Unexpected grapheme boundary with regional indicators (GB12) use icu::segmenter::GraphemeClusterSegmenter; fn main() { let segmenter = GraphemeClusterSegmenter::new(); let text = "🇺🇸🏴󠁧󠁢󠁥󠁮󠁧󠁿"; segmenter .segment_str(text) .for_each(|i| println!("{}", i)); } Reports the following break points: 0 4 8 36 which means "🇺🇸" is split into two graphemes, which should be disallowed per GB12 --- This is fixed by #4536. --- jlf: utf8proc is ok "🇺🇸"~graphemes== a CharacterSupplier 1 : T'🇺🇸' "🇺🇸"~unicodecharacters== an Array (shape [2], 2 items) 1 : ( "🇺" U+1F1FA So 1 "REGIONAL INDICATOR SYMBOL LETTER U" ) 2 : ( "🇸" U+1F1F8 So 1 "REGIONAL INDICATOR SYMBOL LETTER S" ) #4536 https://github.com/unicode-org/icu4x/pull/4536 Update grapheme cluster break rules to Unicode 15.1 jlf: lot of discussions about stability that I did not try to understand.
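For comparison with the GB12 report above: ICU4C's character BreakIterator keeps the regional-indicator pair together (hedged C++ sketch):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> bi(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
        icu::UnicodeString text = icu::UnicodeString::fromUTF8("🇺🇸");   // U+1F1FA U+1F1F8
        bi->setText(text);
        bi->first();
        int clusters = 0;
        while (bi->next() != icu::BreakIterator::DONE) clusters++;
        std::cout << "grapheme clusters: " << clusters << "\n";          // expected: 1 (GB12)
        return U_SUCCESS(status) ? 0 : 1;
    }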

utf8proc title


https://codeberg.org/dnkl/foot/pulls/100 Grapheme shaping using libutf8proc #100 jlf tag: character width jlf: to read?
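Hedged sketch of calling utf8proc directly for the two things that pull request is about, grapheme clustering and width estimation (assumes utf8proc is installed, link with -lutf8proc; the width part naively sums per-code-point widths):

    #include <utf8proc.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        // "é" written as e + U+0301, followed by the flag 🇺🇸 (two regional indicators).
        const char *s = "e\xCC\x81" "\xF0\x9F\x87\xBA\xF0\x9F\x87\xB8";
        const utf8proc_uint8_t *p = (const utf8proc_uint8_t *)s;
        utf8proc_ssize_t remaining = (utf8proc_ssize_t)strlen(s);
        utf8proc_int32_t prev = -1, state = 0;
        int graphemes = 0, width = 0;
        while (remaining > 0) {
            utf8proc_int32_t cp;
            utf8proc_ssize_t n = utf8proc_iterate(p, remaining, &cp);
            if (n < 0) break;                             // invalid UTF-8
            if (prev < 0 || utf8proc_grapheme_break_stateful(prev, cp, &state))
                graphemes++;                              // a new cluster starts here
            width += utf8proc_charwidth(cp);              // naive: per-code-point widths
            prev = cp;
            p += n;
            remaining -= n;
        }
        printf("graphemes=%d width=%d\n", graphemes, width);  // graphemes expected: 2
        return 0;
    }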

Twitter text parsing


https://github.com/twitter/twitter-text Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform. https://swiftpack.co/package/nysander/twitter-text This is the Swift implementation of the twitter-text parsing library. The library has methods to parse Tweets and calculate length, validity, parse @mentions, #hashtags, URLs, and more.

terminal / console / cmd


https://www.reddit.com/r/bash/comments/wfbf3w/determine_if_the_termconsole_supports_utf8/ Determine if the term/console supports UTF8? https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line jlf: with my current version of Windows (21H2 - 10.0.19044), I have the input bug described below: In general using codepage 65001 will only work without bugs in Windows 10 with the Creators update. In Windows 7 it will have both output and input bugs. In Windows 8 and older versions of Windows 10 it only has the input bug, which limits input to 7-bit ASCII. Eryk Sun Sep 9, 2017 at 13:43 jlf: the sentence above is not true, I have the input bug with my version of Windows, which is AFTER the Creators update. http://archives.miloush.net/michkap/archive/2006/03/13/550191.html Who broke the UTF-8 support? by Michael S. Kaplan, published on 2006/03/13 03:21 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/13/550191.aspx --- jlf: we are in 2022 and the UTF-8 support in cmd is still broken... https://stackoverflow.com/questions/39736901/chcp-65001-codepage-results-in-program-termination-without-any-error jlf: Thanks to this post, I suddenly understood why ooRexxShell no longer supports UTF-8 input. It's because I deactivated readline on Dec 20, 2020. When readline is on, ooRexxShell delegates to cmd to read a line: set /p inputrx="My prompt> " This input mode is not impacted by the UTF-8 input bug! https://stackoverflow.com/questions/10651975/unicode-utf-8-with-git-bash git-bash (Windows) https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10) Describes how to set the system locale (language for non-Unicode programs) to UTF-8. Optional reading: Why the Windows PowerShell ISE is a poor choice --- jlf: this is a clear description of the UTF-8 input bug. For ReadFile from the console, even in Windows 10, you'll be limited to 7-bit ASCII if the input codepage is set to UTF-8, due to buggy assumptions in the console host, conhost.exe. In Windows 10, it returns non-ASCII characters as null ("\0") in the buffer. In older versions, the read succeeds with 0 bytes read, which looks like EOF. Eryk Sun Jul 21, 2019 at 13:31 https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797 Displaying Unicode in Powershell https://akr.am/blog/posts/using-utf-8-in-the-windows-terminal Using UTF-8 in the Windows Terminal https://github.com/microsoft/terminal https://github.com/Microsoft/Cascadia-Code https://github.com/PowerShell/PowerShell/issues/7233 Make console windows fully UTF-8 by default on Windows, in line with the behavior on Unix-like platforms mklement0 opened this issue on Jul 5, 2018 --- jlf: still open as of 2023.08.08 https://github.com/contour-terminal/terminal-unicode-core Unicode Core specification for Terminal (grapheme clusters, character widths, ...) jlf: only a bare tex file... dead? no commits in 2 years. https://news.ycombinator.com/item?id=37804829 ZERO comments on HN
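Hedged Win32/C++ sketch of the usual workaround for the input bug described above: keep the output code page at UTF-8 but read input through the UTF-16 console API (ReadConsoleW) and convert it yourself (assumes Windows 10, error handling omitted):

    #include <windows.h>
    #include <cstring>
    #include <string>

    int main() {
        SetConsoleOutputCP(CP_UTF8);                     // output side works on Windows 10
        // SetConsoleCP(CP_UTF8);                        // input side is the buggy part, avoid

        const char utf8[] = "h\xC3\xA9llo\r\n";          // "héllo" as raw UTF-8 bytes
        DWORD written = 0;
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8, (DWORD)strlen(utf8), &written, nullptr);

        // Read a line as UTF-16 with ReadConsoleW, then convert to UTF-8 ourselves.
        wchar_t wbuf[512];
        DWORD read = 0;
        ReadConsoleW(GetStdHandle(STD_INPUT_HANDLE), wbuf, 512, &read, nullptr);
        int len = WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)read, nullptr, 0, nullptr, nullptr);
        std::string line(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wbuf, (int)read, &line[0], len, nullptr, nullptr);
        // 'line' now holds the typed text as UTF-8, including non-ASCII characters.
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), line.data(), (DWORD)line.size(), &written, nullptr);
        return 0;
    }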

QT Title


https://bugreports.qt.io/browse/QTBUG-48726 Combining diacritics misplaced when using monospace fonts jlf tag: character width

IBM OS


https://www.ibm.com/docs/en/personal-communications/15.0?topic=pages-contents#ToC Host Code Page Reference Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII" >>There are a few layers to getting the codepages right for using a terminal >>emulator and ISPF Edit and Browse on the host. >>For example, in Personal Communications I first define my host codepage. I >>have a lot of choices. From 420 (Arabic) to 1130 (Vietnamese). I tend to >>use 1047 (U.S.) to get my square brackets right. jlf: tables of character codes https://www.ibm.com/docs/en/zos/3.1.0?topic=317-zos-unix-directory-list-utility-line-commands z/OS UNIX directory list utility line commands Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII" >>Then on the host side. If you are using the ISPF UDLIST interface to Unix >>(OMVS) you can use either EBCDIC, ASCII, or UTF8 for EDIT or VIEW. Actions: E—edit regular file EA—edit ASCII file EU—edit UTF-8 file V—view regular file VA—view ASCII file VU—view UTF8 file https://www.ibm.com/docs/en/zos/3.1.0?topic=information-pdf-browse-primary-commands PDF Browse primary commands Referenced in an IBM-MAIN thread about "TN3270, EBCDIC and ASCII" >>In ISPF Browse, you can use the DISPLAY command to view data as UTF8, >>UTF32, UCS2, UNICODE, ASCII, USASCII, and EBCDIC, or specify the numeric >>CCSID. Syntax diagram DISPLAY CCSID ccsid_number ASCII USASCII EBCDIC UCS2 UTF8 UTF16 UTF32 Syntax diagram FIND UTF8 ASCII USASCII

IBM RPG Lang


https://www.ibm.com/docs/en/i/7.4?topic=cdt-processing-string-data-by-natural-size-each-character Processing string data by the natural size of each character String data can have characters of different sizes. - UTF-8 data can have characters with 1, 2, 3, or 4 bytes. For example, the character 'a' has one byte, and the character 'á' has two bytes. UTF-8 data is defined as alphanumeric with CCSID(*UTF8) or CCSID(1208). - UTF-16 data can have characters with 2 or 4 bytes. UTF-16 data is defined as UCS-2 with CCSID(*UTF16) or CCSID(1200). - EBCDIC mixed SBCS/DBCS data can have characters with 1 or 2 bytes. Additionally, double-byte data is surrouned by shift bytes. The shift-out byte x'0E' begins a section of DBCS data and the shift-in byte x'0F' ends the section of DBCS data. - ASCII mixed SBCS/DBCS data can have characters with 1 or 2 bytes. ASCII mixed SBCS/DBCS data is defined as alphanumeric with a CCSID that represents mixed SBCS/DBCS data such as 950. Default behaviour, CHARCOUNT STDCHARSIZE By default, data is processed using the standard-character-size mode. The compiler processes string data by bytes or double bytes without regard for size of each character. When CHARCOUNT NATURAL is in effect: The compiler processes string operations by the natural size of each character. The compiler sets the CHARCOUNT NATURAL mode for a file if the CHARCOUNT is not specified for the file. The CHARCOUNT mode for the file affects the movement of data from RPG fields to the output buffer and key buffer used for the file operations. https://www.ibm.com/docs/en/i/7.4?topic=fdk-charcountnatural-stdcharsize CHARCOUNT(*NATURAL | *STDCHARSIZE) The CHARCOUNT keyword controls how RPG handles string truncation when moving data from RPG program variables to the output buffer and key buffer for the file. *NATURAL If the data type of the field in the output buffer or key buffer is relevant according to the CHARCOUNTTYPES Control keyword, any necessary truncation when data is moved is done according to the CHARCOUNT NATURAL mode for assignment. *STDCHARSIZE Any necessary truncation when data is moved is done by bytes or double bytes, without regard for the size of each character. When the CHARCOUNT keyword is not specified, the current CHARCOUNT setting is used for the file, as determined by the CHARCOUNT Control keyword or the most recent /CHARCOUNT directive preceding the definition for the file. https://www.ibm.com/docs/en/i/7.4?topic=keywords-charcounttypesutf8-utf16-jobrun-mixedebcdic-mixedascii CHARCOUNTTYPES(*UTF8 *UTF16 *JOBRUN *MIXEDEBCDIC *MIXEDASCII) The Control keyword CHARCOUNTTYPES specifies the types of data that are processed by characters rather than by bytes or double bytes when CHARCOUNT NATURAL mode is in effect. *UTF8 Specify *UTF8 if your module might work with UTF-8 data which has characters of different lengths. For example, the UTF-8 character 'a' has one byte, and the UTF-8 character 'á' has two bytes. *UTF16 Specify *UTF16 if your module might work with UTF-16 data which has some 4-byte characters. *JOBRUN Specify *JOBRUN if your job CCSID might support mixed SBCS and DBCS data, and the RPG variables in your module defined to have the job CCSID might contain some DBCS data. *MIXEDEBCDIC Specify *MIXEDEBCDIC if your module might work with EBCDIC data which supports both SBCS and DBCS characters. This includes data defined with CCSID(*JOBRUNMIX) and data defined with a mixed SBCS/DBCS CCSID such as 937. 
*MIXEDASCII Specify *MIXEDASCII if your module might work with ASCII data which supports both SBCS and DBCS characters.
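Not RPG, but a minimal C++ sketch (assuming well-formed UTF-8 input) of the distinction described above: STDCHARSIZE-style processing counts bytes, NATURAL-style processing counts whole characters, so the 2-byte character 'á' counts as 1 instead of 2. The helper name is made up.
#include <cstdio>
#include <string>

// Count UTF-8 characters by counting lead bytes only
// (continuation bytes have the bit pattern 10xxxxxx).
static std::size_t naturalLength(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++count;   // skip continuation bytes
    return count;
}

int main()
{
    std::string s = "a\xC3\xA1";                         // "aá" in UTF-8
    std::printf("bytes   : %zu\n", s.size());            // 3 (STDCHARSIZE-style count)
    std::printf("natural : %zu\n", naturalLength(s));    // 2 (NATURAL-style count)
}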

IBM z/OS


https://www.ibm.com/docs/en/zos/2.5.0?topic=mvs-zos-unicode-services-users-guide-reference Unicode services https://www.ibm.com/docs/en/zos/2.5.0?topic=reference-application-programmer-information Character conversion Case conversion Normalization Collation Bidi transformation Stringprep conversion --- jlf: There is this note at the beginning of the page "Bidi transformation": "IBM does not intend to enhance the bidi transformation service. Instead, it is recommended that you use the character conversion 'extended bidi support' for all new development and for the highest level of bidi support." Can't find where this 'extended bidi support' is described. https://www-40.ibm.com/servers/resourcelink/svc00100.nsf/pages/zOSV2R5IndexFile/$file/index.html search Ctrl+F "unicode": only one result: cunu100_v2r5.pdf SA38-0680-50 z/OS Unicode Services User's Guide and Reference https://www.ibm.com/docs/en/zos/2.5.0 Search "Unicode" in z/OS 2.5 documentation: https://www.ibm.com/docs/en/search/unicode?scope=SSLTBW_2.5.0 jlf: not sure it's very interesting... All the links are just one page with little information. https://listserv.ua.edu/cgi-bin/wa?A2=IBM-MAIN;5304fbc3.2304&S= Re: TSO Rexx C2X Incorrect Output Events such as this affirm my belief in minimal munging of user data by default. jlf: this sentence is worth remembering when designing how Unicode should be supported by Rexx... https://stackoverflow.com/questions/76569347/what-are-the-supported-code-points-for-special-characters-for-valid-z-os-datas What are the supported code points for 'special characters' for valid z/OS datasets? jlf: the link above was given in this IBM-MAIN thread https://listserv.ua.edu/cgi-bin/wa?A2=ind2307&L=IBM-MAIN&D=0&P=121856 --- Matt Hogstrom: I did some testing by creating a file in USS in CP047 with the characters “@#$” and then used iconv to convert them to a variety of code pages and compare the results. Some conversions failed but when looking at the code pages that failed they didn’t appear to me to be what I would consider mainstream. For the ones I’m familiar with they all converted correctly. The command was 'iconv -f 1047 -t 37 special > converted;chtag -t -c 37 converted;cmp special converted’ I changed to the encoding of 37 to other code pages and most worked fine. You can find the list of cps supported by issuing 'iconv -l’ and there are a lot of them. https://listserv.ua.edu/cgi-bin/wa?A2=ind2307&L=IBM-MAIN&D=0&P=183611 Python 3.11 on z/OS - UTF-8 errors --- I am trying to get a python package (psutil) to run on z/OS. I downloaded the package from github and then tar'ed it and uploaded it binary to my home-dir in OMVS. In my homedir I untar'ed to files and ran the command "chtag -tc IBM-1047 *' to set the files to UTF-8. I got make to work by converting the tab char to x'05' - no problem - and I got the C compiler to work also. Now my problem is that I can not make Python compile the setup.py file. It dies with a UTF-error on a char x'97' in statement 48 pos 2: from _common import AIX # NOQA --- It's this package https://github.com/giampaolo/psutil/blob/master/INSTALL.rst --- I believe UTF-8 is IBM-1208. --- Have you tried the z/OS Open Tools phytonport - https://github.com/ZOSOpenTools --- Have you considered cloning the repository and utilizing Git's file tagging feature? It can handle the tagging process for you. If you don't have internet access, a suggestion would be to tag all the files as ISO8859-1. It's advisable to avoid using UTF-8, as it may cause issues with some ported tools that will not work.
That includes the majority of Rocket ported tools. If you list the IBM Python runtime library you will notice that all source files are tagged "iso8859-1" even though Python mandates UTF-8. --- I'm doing this on the company sandbox so I can not make a git clone. And trying 8859-1 (cp 819) does not change anything: /home/bc6608/psutil:chtag -p setup.py t ISO8859-1 T=on setup.py PYTHONWARNINGS=all python3 setup.py build_ext -i `python3 -c "import sys, os; py36 = sys.version_info[:2] >= (3, 6); cpus = os.cpu_count() or 1 if py36 else 1; print('--parallel %s' % cpus if cpus > 1 else '')"` Traceback (most recent call last): File "/home/bc6608/psutil/setup.py", line 47, in <module> from _common import AIX # NOQA ^^^^^^^^^^^^^^^^^^^^^^^ UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 2: invalid start byte --- Found the error. The error was not the codepage of the setup.py, but the codepage of the imported file _common . Once it got chtag -tc 1047 _common.py I got further. --- I can’t recreate your problem but I used a different method. I downloaded a zip file from Github, uploaded it to z/OS and followed these steps: jar xf psutill-master.zip cd psutil-master chtag -R -tc iso8859-1 . python3 setup.py --- A quick question - Will the same chtag command work for, say, Java packages/projects? Answer: yes Or, would I have to use chtag -R -tc UTF-8 if a project expects to things to be in UTF8? Answer: I'd like to understand your reasons for wanting to encode your Java source files in UTF-8. It's important to note that the default encoding on z/OS is IBM-1047 (EBCDIC). We typically use ISO8859-1 and have to specify the "-encoding iso8859-1" option when using the javac compiler. As mentioned earlier, tagging files as UTF-8 can lead to unexpected issues, which is why it's not commonly done. If you examine the file attributes of modern languages like Python, Node.js, Go, etc., you'll notice that their source files are tagged as ISO8859-1. A while ago, one of our ported tools developers provided me with a detailed explanation regarding the challenges associated with UTF-8 for ported tools. Although I don't recall all the specifics, it had something to do with double conversions. Therefore, the general rule of thumb is to avoid using UTF-8 unless it is necessary, such as when embedding a YAML document into a Java JAR file. --- We specify <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> in our Maven builds as most of the time we are building off host on machines with UTF8 locales. However, we tag our files ISO8859-1 on z/OS other then some YAML docs that must be tagged UTF-8 or else SnakeYaml barfs when reading it from the class path which doesn’t support tags :). The server runs with file.encoding=ISO8859-1 as well. If we cared about the euro sign we could change it to ISO8859-15 which is still an 8-bit character set. It’s those pesky codes above 0x7F in UTF-8 that cause the issues. https://www.ibm.com/support/pages/system/files/inline-files/Managing%20the%20code%20page%20conversion%20when%20migrating%20zOS%20source%20files%20to%20Git%20-%201.0.pdf (PDF) Managing the code page conversion when migrating z/OS source files to Git --- Git has proven to be the de-facto standard in the Open Source world, and the z/OS platform can interact with Git through the z/OS Git client, which is maintained by Rocket Software in its “Open Source Languages and Tools for z/OS” package. 
https://www.ibm.com/docs/en/zos/2.2.0?topic=server-different-end-line-characters-in-text-files Different end of line characters in text files --- In general, z/OS UNIX text files contain a newline character at the end of each line. In ASCII, newline is X'0A'. In EBCDIC, newline is X'15'. (For example, ASCII code page ISO8859-1 and EBCDIC code page IBM-1047 translate back and forth between these characters.) Windows programs normally use a carriage return followed by a line feed character at the end of each line of a text file. In ASCII, carriage return/line feed is X'0D'/X'0A'. In EBCDIC, carriage return/line feed is X'0D'/X'15'. The tr command shown in the preceding example deletes all of the carriage return characters. (Line feed and newline characters have the same hexadecimal value.) The SMB server can translate end of line characters from ASCII to EBCDIC and back but it does not change the type of delimiter (PC versus z/OS UNIX) nor the number of characters in the file. https://www.ibm.com/docs/en/zos/2.5.0?topic=options-record-format-recfm Record Format (RECFM) RECFM specifies the characteristics of the records in the data set as fixed-length (F), variable-length (V), ASCII variable-length (D), or undefined-length (U). Blocked records are specified as FB, VB, or DB. Spanned records are specified as VS, VBS, DS, or DBS. You can also specify the records as fixed-length standard by using FS or FBS. You can request track overflow for records other than standard format by adding a T to the RECFM parameter (for example, by coding FBT). Track overflow is ignored for PDSEs. The type of print control can be specified to be in ANSI format-A, or in machine code format-M. See Using Optional Control Characters (https://www.ibm.com/docs/en/SSLTBW_2.5.0/com.ibm.zos.v2r5.idad400/occ.htm#occ) and z/OS DFSMS Macro Instructions for Data Sets (https://www.ibm.com/docs/en/SSLTBW_2.5.0/com.ibm.zos.v2r5.idad500/abstract.htm) for information about control characters. https://docs.tibco.com/pub/mftps-zos/8.0.0/doc/html/GUID-A0CF702B-C126-43BE-86B2-8DF589FAD6BF.html TIBCO® Managed File Transfer Platform Server for z/OS RECFM={ F | FB | V | VB | U | VS | VBS} Default=V This parameter defines the significance of the character logical record length (semantics of LRECL boundaries). You can specify fixed, variable, or system default The valid values are as follows: - F: each string contains exactly the number of characters defined by the string length parameter. - FB: all blocks and all logical record are fixed in size. One or more logical records reside in each block. - V: the length of each string is less than or equal to the string length parameter. - VB: blocks as well as logical record length can be of any size. One or more logical records reside in each block. - U: blocks are of variable size. No logical records are used. The logical record length is displayed as zero. This record format is usually only used in load libraries. Block size must be used if you are specifying U. - VS: records are variable and can span logical blocks. RECFM=VS is not supported when checkpoint restart is used. - VBS: blocks as well as logical record length can be of any size. One or more logical records reside in each block. Records are variable and can span logical blocks. RECFM=VBS is not supported when checkpoint restart is used.
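A minimal C++ sketch (the helper name is mine, not a z/OS service) of the effect of the tr command mentioned above: delete every carriage return byte (X'0D') and keep the line feed / newline bytes.
#include <algorithm>
#include <string>

// Delete every carriage return byte from a text buffer, leaving the
// line feed / newline bytes alone -- the same effect as `tr -d '\r'`.
static std::string stripCarriageReturns(std::string text)
{
    text.erase(std::remove(text.begin(), text.end(), '\r'), text.end());
    return text;
}
// Example: stripCarriageReturns("line1\r\nline2\r\n") == "line1\nline2\n"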

macOS OS


You can enter emoji (and other Unicode characters) using standard operating system tools, like Ctrl+Cmd+Space. https://eclecticlight.co/2021/05/08/explainer-unicode-normalization-and-apfs/ Explainer: Unicode, normalization and APFS hoakley May 8, 2021 --- One of the oldest problems with Apple’s APFS file system is how it encodes file and directory names using Unicode.

Windows OS


https://learn.microsoft.com/en-us/windows/win32/intl/international-support jlf: I'm looking for which functionalities are available only to Unicode apps...
- can be multilingual without managing code pages
- IME? not sure if it's only for Unicode apps
- other?
https://stackoverflow.com/questions/59404120/what-is-the-difference-in-using-cstringw-cstringa-and-ct2w-ct2a-to-convert-strin What is the difference in using CStringW/CStringA and CT2W/CT2A to convert strings? CString offers a number of conversion constructors to convert between ANSI and Unicode encoding. They are as convenient as they are dangerous, often masking bugs. By contrast, the Cs2d macros (where s = source, d = destination) work on raw C-style strings; no CString instances are created in the process of converting between character encodings. Both of the above perform a conversion with an implied ANSI codepage (either CP_THREAD_ACP or CP_ACP in case the _CONVERSION_DONT_USE_THREAD_LOCALE preprocessor symbol is defined). CP_ACP is particularly troublesome, as it's a process-global setting, that any thread can change at any time. Which one should you choose for your conversions? Neither of the above. Use the EX versions instead (see string and text classes for a full list). https://learn.microsoft.com/en-us/cpp/atl/string-and-text-classes?view=msvc-170 String and Text Classes https://stackoverflow.com/questions/15362859/getclipboarddata-cf-unicodetext GetClipboardData (CF_UNICODETEXT) https://jerrington.me/posts/2015-12-31-windows-debugging-for-fun-and-profit.html jlf: I reference this page for the code related to clipboard. Search for "locale". https://learn.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats Standard Clipboard Formats CF_LOCALE Locale identifier (LCID) associated with text in the clipboard. The system uses the code page associated with CF_LOCALE to implicitly convert from CF_TEXT to CF_UNICODETEXT. CF_TEXT Text format. Each line ends with a carriage return/linefeed (CR-LF) combination. A null character signals the end of the data. Use this format for ANSI text. CF_UNICODETEXT Unicode text format. Each line ends with a carriage return/linefeed (CR-LF) combination. A null character signals the end of the data. Locale https://learn.microsoft.com/en-us/windows/win32/intl/language-identifiers A language identifier is a standard international numeric abbreviation for the language in a country or geographical region. Each language has a unique language identifier (data type LANGID), a 16-bit value that consists of a primary language identifier and a sublanguage identifier.
+-------------------------+-------------------------+
|     SubLanguage ID      |   Primary Language ID   |
+-------------------------+-------------------------+
 15                     10 9                       0   bit
https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers A sort order identifier is defined in the form "_sortorder", at the end of the locale name used in the identifier, for example, "de-DE_phoneb", where "phoneb" is the sort order. The corresponding locale identifier is created as follows: MAKELCID(MAKELANGID(LANG_GERMAN, SUBLANG_GERMAN), SORT_GERMAN_PHONE_BOOK). https://learn.microsoft.com/en-us/windows/win32/intl/locale-identifiers Each locale has a unique identifier, a 32-bit value that consists of a language identifier and a sort order identifier.
+-------------+---------+-------------------------+
|  Reserved   | Sort ID |       Language ID       |
+-------------+---------+-------------------------+
 31         20 19     16 15                      0   bit
https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuilanguageinfo https://learn.microsoft.com/en-us/previous-versions/windows/embedded/ms930130(v=msdn.10)?redirectedfrom=MSDN Locale Code Table jlf: obsolete, but for the moment I don't have anything better. Correspondence Locale identifier (LCID) <--> Default code page ---
LCID    Code page  Language: sublanguage
0x0436  1252       Afrikaans: South Africa
0x041c  1250       Albanian: Albania
0x1401  1256       Arabic: Algeria
0x3c01  1256       Arabic: Bahrain
etc...
https://devblogs.microsoft.com/oldnewthing/20161007-00/?p=94475 How can I get the default code page for a locale?
UINT GetAnsiCodePageForLocale(LCID lcid)
{
    UINT acp;
    int sizeInChars = sizeof(acp) / sizeof(TCHAR);
    if (GetLocaleInfo(lcid,
                      LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                      reinterpret_cast<LPTSTR>(&acp),
                      sizeInChars) != sizeInChars) {
        // Oops - something went wrong
    }
    return acp;
}
https://www.w3.org/TR/ltli/#dfn-locale-neutral Locale neutral jlf: I don't get this at all. Locale-neutral. A non-linguistic field is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale aware way. Many specifications use a serialization scheme, such as those provided by [XMLSCHEMA11-2] or [JSON-LD], to provide a locale neutral encoding of non-linguistic fields in document formats or protocols. A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. http://archives.miloush.net/michkap/archive/2005/04/18/409095.html A few of the gotchas of WideCharToMultiByte by Michael S. Kaplan, published on 2005/04/18 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/18/409095.aspx http://archives.miloush.net/michkap/archive/2005/04/19/409566.html A few of the gotchas of MultiByteToWideChar by Michael S. Kaplan, published on 2005/04/19 04:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/19/409566.aspx --- jlf: I reached this page because the flag MB_COMPOSITE is not working! This page provides the answer: the Microsoft doc has this note Note For UTF-8 or code page 54936 (GB18030, starting with Windows Vista), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS. Uh? http://archives.miloush.net/michkap/archive/2005/02/26/381020.html What the &%#$ does MB_USEGLYPHCHARS do? by Michael S.
Kaplan, published on 2005/02/26 15:26 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/26/381020.aspx https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page Use UTF-8 code pages in Windows apps https://mastodon.gamedev.place/@AshleyGullen/111109299141510319 what it takes to pass a file path to a Windows API in C++ https://github.com/neacsum/utf8 This library simplifies usage of UTF-8 encoded strings under Win32 Related articles: https://www.codeproject.com//Articles/5252037/Doing-UTF-8-in-Windows https://www.codeproject.com/Articles/5259868/Doing-UTF-8-in-Windows-Part-2-Tolower-or-Not-to-Lo https://www.codeproject.com/Tips/5263944/UTF-8-in-Windows-INI-Files --- Reddit review: https://www.reddit.com/r/cpp/comments/174ee8q/doing_utf8_in_windows/ --- This article about UTF-8 in Windows that does not discuss how to use a manifest to get UTF-8 process ANSI codepage, directs people back to the 1990's. Or pre-2019, at any rate. --- Something else to note, if you're in the habit of keeping UTF-8 strings in `std::string`, is that the Visual C++ version of `std::filesystem::path` initialized from a `std::string` will use the default codepage for the process to convert the path to UTF-16. That will result in interesting failures on systems whose default codepage is MBCS. All without a single Windows API to be seen in your source. The solution to this is to upgrade to C++20 and use `std::u8string`, or to keep filenames in `std::wstring` if you don't want to deal with the odd and occasionally surprising limitations of `std::u8string`. https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activeCodePage Application manifests - activeCodePage --- On Windows 10, this element forces a process to use UTF-8 as the process code page. On Windows 10, the only valid value for activeCodePage is UTF-8. Starting in Windows 11, this element also allows selection of either the legacy non-UTF-8 code page, or code pages for a specific locale for legacy application compatibility. Modern applications are strongly encouraged to use Unicode. On Windows 11, activeCodePage may also be set to the value Legacy or a locale name such as en-US or ja-JP. https://devblogs.microsoft.com/oldnewthing/20210527-00/?p=105255 How can I convert between IANA time zones and Windows registry-based time zones? A copy of ICU has been included with Windows since Windows 10 Version 1703 (build 15063). All you have to do is include icu.h, and you’re off to the races. An advantage of using the version that comes with Windows is that it is actively maintained and updated by the Windows team. If you need to run on older systems, you can build your own copy from their fork of the ICU repo, https://github.com/microsoft/icu but the job of servicing the project is now on you.
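A minimal C++ sketch (Windows only, error handling reduced to returning an empty string) of the UTF-8 <-> UTF-16 conversions discussed above; note the restriction quoted above: for CP_UTF8 the flags must be either 0 or MB_ERR_INVALID_CHARS.
#include <windows.h>
#include <string>

// UTF-8 -> UTF-16. For CP_UTF8 the only valid flags are 0 and
// MB_ERR_INVALID_CHARS, otherwise the call fails with ERROR_INVALID_FLAGS.
static std::wstring utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    if (len <= 0) return std::wstring();             // invalid UTF-8
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &utf16[0], len);
    return utf16;
}

// UTF-16 -> UTF-8. The last two arguments must be NULL for CP_UTF8.
static std::string utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}
An application that sets the UTF-8 activeCodePage in its manifest (see the link above) can avoid many of these conversions, since the ANSI entry points then accept UTF-8 directly.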

Language comparison


https://blog.kdheepak.com/my-unicode-cheat-sheet Vim, Python, Julia and Rust.

Regular expressions


https://regex101.com/ Testing a regular expression. There is even a debugger! https://www.regular-expressions.info/unicode.html \X matches a grapheme https://www.regular-expressions.info/posixbrackets.html POSIX Bracket Expressions jlf: see the table in the section Character Classes https://pypi.org/project/regex/
>>> a = "बिक्रम मेरो नाम हो"
>>> regex.findall(r'\X', a)
['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
--- https://regex101.com/r/eD0eZ9/1 --- jlf: the results above are correct extended grapheme clusters, but tailored grapheme clusters will group 'क्' 'र' in one cluster क्र https://blog.burntsushi.net/ripgrep/ ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} search for "unicode" and read... https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions Character classes in regular expressions https://github.com/micromatch/posix-character-classes POSIX character classes for creating regular expressions. jlf: careful, not official. Looks similar to the table at https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
POSIX class   Equivalent to                       Matches
[:alnum:]     [A-Za-z0-9]                         digits, uppercase and lowercase letters
[:alpha:]     [A-Za-z]                            upper- and lowercase letters
[:ascii:]     [\x00-\x7F]                         ASCII characters
[:blank:]     [ \t]                               space and TAB characters only
[:cntrl:]     [\x00-\x1F\x7F]                     Control characters
[:digit:]     [0-9]                               digits
[:graph:]     [^ [:cntrl:]]                       graphic characters (all characters which have graphic representation)
[:lower:]     [a-z]                               lowercase letters
[:print:]     [[:graph:] ]                        graphic characters and space
[:punct:]     [-!"#$%&'()*+,./:;<=>?@[]^_`{|}~]   all punctuation characters (all graphic characters except letters and digits)
[:space:]     [ \t\n\r\f\v]                       all blank (whitespace) characters, including spaces, tabs, new lines, carriage returns, form feeds, and vertical tabs
[:upper:]     [A-Z]                               uppercase letters
[:word:]      [A-Za-z0-9_]                        word characters
[:xdigit:]    [0-9A-Fa-f]                         hexadecimal digits
https://unicode-org.github.io/icu/userguide/icu/posix.html C/POSIX Migration Character classes, point 7: For more about the problems with POSIX character classes in a Unicode context see Annex C: Compatibility Properties in Unicode Technical Standard #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/#Compatibility_Properties and see the mailing list archives for the unicode list (on unicode.org). See also the ICU design document about C/POSIX character classes https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/main/design/posix_classes.html https://stackoverflow.com/questions/50570322/regex-pattern-matching-in-right-to-left-languages Regex pattern matching in right-to-left languages --- jlf: only one answer. Why control characters? What I understand is that the bytes are in the spelling order of the characters. The "/" ooRexx returns the same sequence of bytes under macOS.
---
/Store/عرمنتجات/عرع
    2F53746F72652F                    "/Store/"
    D8B9D8B1D985D986D8AAD8ACD8A7D8AA  عرمنتجات
    2F                                /
    D8B9D8B1D8B9                      عرع
/Store/عرع/عرمنتجات
    2F53746F72652F                    "/Store/"
    D8B9D8B1D8B9                      عرع
    2F                                /
    D8B9D8B1D985D986D8AAD8ACD8A7D8AA  عرمنتجات
/Store/عرمنتجات/whatever
    2F53746F72652F                    "/Store/"
    D8B9D8B1D985D986D8AAD8ACD8A7D8AA  عرمنتجات
    2F                                /
    7768617465766572                  whatever
https://stackoverflow.com/questions/20641297/unicode-characters-in-regex Unicode characters in Regex
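For what it's worth, the POSIX bracket expressions from the table above are also accepted inside std::regex character classes, but they stay byte/locale oriented and know nothing about graphemes (there is no \X in std::regex); a minimal sketch:
#include <iostream>
#include <regex>
#include <string>

int main()
{
    // [[:digit:]] is the POSIX class for [0-9] (see the table above).
    // std::regex works on char units with the current locale; it has no notion
    // of grapheme clusters, so \X-style matching needs ICU, Boost.Regex or the
    // Python regex module shown above.
    std::string s = "Item-42 costs 7 euros";
    std::regex digits("[[:digit:]]+");
    for (std::sregex_iterator it(s.begin(), s.end(), digits), end; it != end; ++it)
        std::cout << it->str() << "\n";              // prints 42 then 7
}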

Test cases, test-cases, tests files


https://github.com/lemire/unicode_lipsum

font bold, italic, strikethrough, underline, backwards, upside down


I remember seeing an open-source implementation, but I forgot to note it. The URLs below do not provide a link to an open-source implementation; to be removed sooner or later. https://convertcase.net/unicode-text-converter/ https://yaytext.com/ https://capitalizemytitle.com/ https://capitalizemytitle.com/fancy-text-generator/ http://slothsoft.net/UnicodeMapper/ https://www.fontgenerator.org/ https://peterwunder.de/projects/prettify/ https://texteditor.com/ https://gwern.net/utext https://news.ycombinator.com/item?id=38016735 Utext: Rich Unicode Documents (gwern.net) An esoteric document proposal: abuse Unicode to create the fanciest possible ‘plain text’ documents. https://fonts.google.com/noto https://www.unicode.org/mail-arch/unicode-ml/y2019-m01/0121.html Encoding italic (was: A last missing link)

youtube


https://www.youtube.com/playlist?list=PLMc927ywQmTNQrscw7yvaJbAbMJDIjeBh Videos from Unicode's Overview of Internationalization and Unicode Projects

xxx lang


https://rosettacode.org/wiki/Unicode_strings https://langdev.stackexchange.com/questions/1493/how-have-modern-language-designs-dealt-with-unicode-strings How have modern language designs dealt with Unicode strings? Asked 2023-06-13 Answers for:
- Swift
- Rust
- Python 3
- Treat it as a (mostly) library issue
jlf: the Swift part is interesting, the rest is meh. In order to speed up repeated accesses to utf16, UTF-8 strings may put a breadcrumbs pointer after the null terminator: https://github.com/apple/swift/blob/1532fb188c55f29de7bf8aaee94896557b3a3db1/stdlib/public/core/StringStorage.swift#L157 The breadcrumbs are a list of the UTF-8 offsets of every 64th UTF-16 code unit (see the sketch at the end of this section): https://github.com/apple/swift/blob/483087a47dfb56e78fcc20ef2b43085ebfb48ea0/stdlib/public/core/StringBreadcrumbs.swift A string stores whether it has breadcrumbs in an unused bit in its capacity field: https://github.com/apple/swift/blob/1532fb188c55f29de7bf8aaee94896557b3a3db1/stdlib/public/core/StringStorage.swift#L45 http://xahlee.info/comp/unicode_essays_index.html Unicode for Programers jlf: this page contains several URLs for programming languages. Short articles, but maybe there is something to learn. [later] After review, not so many things to learn; the articles are very, very short...
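A rough C++ sketch of the breadcrumbs idea described above (not Swift's actual code; the names are made up): record the UTF-8 byte offset of every 64th UTF-16 code unit, so mapping a UTF-16 index back to a UTF-8 offset only has to scan at most one stride. Assumes well-formed UTF-8 and an in-range index.
#include <cstddef>
#include <string>
#include <vector>

struct Breadcrumbs {
    static constexpr std::size_t STRIDE = 64;
    struct Crumb { std::size_t utf16Unit; std::size_t byteOffset; };
    std::vector<Crumb> crumbs;   // crumbs[k]: first scalar boundary at or after UTF-16 unit k*STRIDE

    // Length in bytes of a UTF-8 sequence, from its lead byte.
    static std::size_t seqLen(unsigned char lead) {
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;   // 110xxxxx
        if (lead < 0xF0) return 3;   // 1110xxxx
        return 4;                    // 11110xxx -> one surrogate pair in UTF-16
    }

    explicit Breadcrumbs(const std::string& utf8) {
        crumbs.push_back({0, 0});
        std::size_t byte = 0, unit = 0;
        while (byte < utf8.size()) {
            std::size_t n = seqLen((unsigned char)utf8[byte]);
            byte += n;
            unit += (n == 4) ? 2 : 1;
            while (crumbs.size() * STRIDE <= unit)   // we just passed a multiple of STRIDE
                crumbs.push_back({unit, byte});
        }
    }

    // UTF-8 byte offset of the scalar containing UTF-16 code unit utf16Index.
    std::size_t utf8Offset(const std::string& utf8, std::size_t utf16Index) const {
        std::size_t k = utf16Index / STRIDE;
        if (crumbs[k].utf16Unit > utf16Index) --k;   // stride boundary fell inside a surrogate pair
        std::size_t byte = crumbs[k].byteOffset;
        std::size_t unit = crumbs[k].utf16Unit;
        for (;;) {
            std::size_t n = seqLen((unsigned char)utf8[byte]);
            std::size_t units = (n == 4) ? 2 : 1;
            if (unit + units > utf16Index) return byte;
            byte += n;
            unit += units;
        }
    }
};
Swift builds this table lazily and remembers whether it exists in a spare bit of the capacity field (see the StringStorage.swift link above); this sketch builds it eagerly for simplicity.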

Ada lang


https://docs.adacore.com/live/wave/xmlada/html/xmlada_ug/unicode.html http://www.dmitry-kazakov.de/ada/strings_edit.htm UXStrings Ada Unicode Extended Strings https://www.reddit.com/r/ada/comments/t4hpip/ann_uxstrings_package_available_uxs_20220226/ https://github.com/Blady-Com/UXStrings --- 2023.10.14 https://groups.google.com/g/comp.lang.ada/c/rWqDxiOwa1g [ANN] Release of UXStrings 0.6.0 - Add string convenient subprograms [2]: Contains, Ends_With,Starts_With, [2] https://github.com/Blady-Com/UXStrings/blob/master/src/uxstrings3.ads#L346 jlf: see https://github.com/Blady-Com/UXStrings/blob/master/src/uxstrings3.adb After a quick look, I still don't know which kind of position is managed. There is a parameter Case_Sensitivity, but I never see it used with a position (that's the tricky part) https://github.com/AdaForge/Thematics/wiki/Unicode-and-String-manipulations Unicode and String manipulations in UTF-8, UTF-16, ... https://stackoverflow.com/questions/48829940/utf-8-on-windows-with-ada UTF-8 on Windows with Ada https://github.com/AdaCore/VSS/ High level string and text processing library https://blog.adacore.com/vss-cursors-iterators-and-markers VSS (Virtual String Subsystem): Cursors, Iterators and Markers jlf: meh...

Awk lang


Brian Kernighan adds Unicode support to Awk https://github.com/onetrueawk/awk/commit/9ebe940cf3c652b0e373634d2aa4a00b8395b636 https://github.com/onetrueawk/awk/tree/unicode-support https://news.ycombinator.com/item?id=32534173

C++ lang, cpp lang, Boost


https://en.cppreference.com/w/cpp/language/string_literal String literal (referenced by Adrian) Some examples: https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp https://www.youtube.com/watch?v=iQWtiYNK3kQ A Crash Course in Unicode for C++ Developers - Steve Downey - [CppNow 2021] jlf: good video for pronunciation 57:16 Algorithms 1:12:27 The future for C++ (you can stop here, not very interesting) 02/06/2021 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1238r1.html SG16 initial Unicode direction and guidance for C++20 and beyond. https://github.com/sg16-unicode/sg16 SG16 is an ISO/IEC JTC1/SC22/WG21 C++ study group tasked with improving Unicode and text processing support within the C++ standard. https://github.com/sg16-unicode/sg16-meetings Summaries of SG16 virtual meetings https://lists.isocpp.org/mailman/listinfo.cgi/sg16 SG16 mailing list https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1629r1.html P1629R1 Transcoding the 🌐 - Standard Text Encoding Published Proposal, 2020-03-02 --- jlf: referenced by Zach Laine in P2728R0. [P1629R1] from JeanHeyd Meneide is a much more ambitious proposal that aims to standardize a general-purpose text encoding conversion mechanism. This proposal is not at odds with P1629; the two proposals have largely orthogonal aims. This proposal only concerns itself with UTF interconversions, which is all that is required for Unicode support. P1629 is concerned with those conversions, plus a lot more. Accepting both proposals would not cause problems; in fact, the APIs proposed here could be used to implement parts of the P1629 design. 01/06/2021 Zach Laine https://www.youtube.com/watch?v=944GjKxwMBo https://tzlaine.github.io/text/doc/html/boost_text__proposed_/the_text_layer.html https://tzlaine.github.io/text/doc/html/index.html The Text Layer https://tzlaine.github.io/text/doc/html/ Chapter 1. Boost.Text (Proposed) - 2018 https://github.com/tzlaine/text last commit : master 26/09/2020 boost_serialization 24/10/2019 coroutines 25/08/2020 experimental 13/11/2019 gh-pages 04/09/2020 optimization 27/10/2019 rope_free_fn_reimplementation 26/07/2020 No longer working on this project ? --- Restart working on 22/03/2022 Zach's library was last discussed at the 2023-05-10 SG16 meeting; see https://github.com/sg16-unicode/sg16-meetings#may-10th-2023. --- https://www.youtube.com/watch?v=AoLl\_ZZqyOk Applying the Lessons of std::ranges to Unicode in the C++ Standard Library - Zach Laine CppNow 2023 https://isocpp.org/files/papers/P2728R0.html (see more recent version below) Unicode in the Library, Part 1: UTF Transcoding Document #: P2728R0 Date: 2022-11-20 Reply-to: Zach Laine <whatwasthataddress@gmail.com> --- New version: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2728r5.html Document #: P2728R5 Date: 2023-07-05 --- latest published version: https://wg21.link/p2728 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2729r0.html Unicode in the Library, Part 2: Normalization Document #: P2729R0 Date: 2022-11-20 Reply-to: Zach Laine <whatwasthataddress@gmail.com> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2773r0.pdf paper D2773R0 by Corentin Jabot https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1949r7.html C++ Identifier Syntax using Unicode Standard Annex 31 Document #: P1949R7 Date: 2021-04-12 --- Adopt Unicode Annex 31 as part of C++ 23. - That C++ identifiers match the pattern (XID_Start + _ ) + XID_Continue*. - That portable source is required to be normalized as NFC. 
- That using unassigned code points be ill-formed. This proposal also recommends adoption of Unicode normalization form C (NFC) for identifiers to ensure that when compared, identifiers intended to be the same will compare as equal. Legacy encodings are generally naturally in NFC when converted to Unicode. Most tools will, by default, produce NFC text. Some scripts require the use of characters as joiners that are not allowed by base UAX #31, these will no longer be available as identifiers in C++. As a side-effect of adopting the identifier characters from UAX #31, using emoji in or as identifiers becomes ill-formed. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2528r0.html C++ Identifier Security using Unicode Standard Annex 39 Document #: P2538R0 Date: 2022-01-22 14/06/2021 https://hsivonen.fi/non-unicode-in-cpp/ Same contents in sg16 mailing list + feedbacks https://lists.isocpp.org/sg16/2019/04/0309.php 03/07/2021 https://news.ycombinator.com/item?id=27695412 Any Encoding, Ever – ztd.text and Unicode for C++ 14/07/2021 https://hsivonen.fi/non-unicode-in-cpp/ It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++ The Microsoft Code Page 932 Issue https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428. What is the printf() formatting character for char8_t *? jlf: todo read it? not sure yet if it's useful to read. Referenced from https://corp.unicode.org/pipermail/unicode/2020-April/008579.html Basic Unicode character/string support absent even in modern C++ https://github.com/nemtrif/utfcpp/ referenced from https://corp.unicode.org/pipermail/unicode/2020-April/008582.html Basic Unicode character/string support absent even in modern C++ https://www.boost.org/doc/libs/1_80_0/libs/locale/doc/html/index.html Boost.Locale Boost.Locale uses the-state-of-the-art Unicode and Localization library: ICU - International Components for Unicode. https://github.com/uni-algo/uni-algo Unicode Algorithms Implementation for C/C++ https://www.reddit.com/r/cpp/comments/xspvn4/unialgo_v050_modern_unicode_library/ uni-algo v0.5.0: Modern Unicode Library https://www.reddit.com/r/cpp/comments/vtgckq/new_unicode_library/ Older post with more infos https://github.com/uni-algo/uni-algo-single-include Single include version for Unicode Algorithms Implementation This repository contains single include version of uni-algo library. https://www.reddit.com/r/cpp/comments/14t2lzm/unialgo_v100_modern_unicode_library/ uni-algo v1.0.0: Modern Unicode Library --- jlf: see the critics of Zach Laine's library... mg152 has good arguments. --- jlf: this library is referenced in the comments https://github.com/hikogui/hikogui/tree/main/src/hikogui/unicode https://github.com/hikogui/hikogui/tree/main/tools/ucd https://github.com/hikogui/hikogui/tree/main/src/hikogui/unicode https://github.com/hikogui/hikogui Modern accelerated GUI jlf: the point is not the GUI, but the tools to parse Unicode UCD. See https://github.com/hikogui/hikogui/tree/main/tools --- Comment of the author in https://www.reddit.com/r/cpp/comments/vtgckq/new_unicode_library/ I recently discovered a way to compress the unicode-data-set, while still being able to do quick lookups, with a single associative indirection. Basically you chunk the data in groups of 32 entries. Then you de-duplicate these chunks and make a index table (about 64kbyte) that points to the chunks. This works because a code-point is only 21 bits, which you can split in 16 bit msb and 5 bit lsb. 
This means that the index table has less than 64k uint16_t entries. My data is including the index around 700 KByte. With the following data:
general category: 5 bit
grapheme cluster break: 4
line break class: 6
word break property: 5
sentence break property: 4
east asian width: 3
bidi class: 5
bidi bracket type: 2
bidi mirroring glyph: 16
ccc: 8
script: 8
decomposition type: 5
decomposition index: 21 (decomposition table not included in the 700kbyte)
composition index: 14 (composition table not included in the 700kbyte)
Of the 128 bits per entry, 22 bits are currently unused. It is also possible to compress a single entry. For example ccc is always zero for non-composing code-points, so it could share those bits with properties that are only allowed for non-composing code-points. (See the lookup sketch at the end of this section.) https://news.ycombinator.com/item?id=38424689 Bjarne Stroustrup Quotes (stroustrup.com) --- Interesting discussion about strings (not limited to C++): search for "string". https://www.sandordargo.com/blog/2023/11/29/cpp23-unicode-support C++23: Growing unicode support --- The standardization committee has accepted (at least) four papers which clearly show a growing Unicode support in C++23.
- C++ Identifier Syntax using Unicode Standard Annex 31
- Remove non-encodable wide character literals and multicharacter wide character literals
- Delimited escape sequences
- Named universal character escapes
U'\N{LATIN CAPITAL LETTER A WITH MACRON}' // Equivalent to U'\u0100'
u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}" // Equivalent to u8"\u0100\u0300"
One of the concerns was the sheer size of the Unicode name database that contains the codes (e.g. U+0100) and the names (e.g. {LATIN CAPITAL LETTER A WITH MACRON}). It's around 1.5 MiB which can significantly impact the size of compiler distributions. The authors proved that a non-naive implementation can be around 300 KiB or even less. jlf: next point sounds debatable, no? Another open question was how to accept the Unicode-assigned names. Is {latin capital letter a with macron} just as good as {LATIN CAPITAL LETTER A WITH MACRON}? Or what about {LATIN_CAPITAL_LETTER_A_WITH_MACRON}? While the Unicode consortium standardized an algorithm called UAX44-LM2 for that purpose and it's quite permissive, language implementors barely follow it. C++ is going to require an exact match with the database therefore the answer to the previous question is no, {latin capital letter a with macron} is not the same as {LATIN CAPITAL LETTER A WITH MACRON}. On the other hand, if there will be a strong need, the requirements can be relaxed in a later version. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2071r2.html Named universal character escapes --- jlf: they don't want to support UAX44-LM2 jlf: todo, read the section "Design considerations"
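A minimal C++ sketch of the chunked two-stage lookup the hikogui author describes in the Reddit comment quoted above (Props and PropertyTable are made-up names; a real generator would pack the fields listed above into each entry, and the scheme assumes a dense input table of 0x110000 entries and fewer than 64k unique chunks): split the 21-bit code point into a chunk number and a 5-bit offset, deduplicate the 32-entry chunks, and keep one uint16_t per chunk in the index table.
#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Props = std::uint32_t;                       // placeholder for the packed per-code-point record
using Chunk = std::array<Props, 32>;

struct PropertyTable {
    std::vector<std::uint16_t> index;              // 0x110000 / 32 = 0x8800 entries
    std::vector<Chunk> chunks;                     // deduplicated 32-entry chunks

    // Build from a dense table with one entry per code point (0 .. 0x10FFFF).
    explicit PropertyTable(const std::vector<Props>& dense) {
        std::map<Chunk, std::uint16_t> seen;
        for (std::size_t cp = 0; cp < 0x110000; cp += 32) {
            Chunk c{};
            for (std::size_t i = 0; i < 32; ++i) c[i] = dense[cp + i];
            auto it = seen.find(c);
            if (it == seen.end()) {                // first time we see this chunk
                it = seen.emplace(c, (std::uint16_t)chunks.size()).first;
                chunks.push_back(c);
            }
            index.push_back(it->second);
        }
    }

    Props lookup(char32_t cp) const {
        return chunks[index[cp >> 5]][cp & 0x1F];  // one indirection, then the 5 low bits
    }
};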

cRexx lang


cRexx uses this library: https://github.com/sheredom/utf8.h --- Codepoint Case Various functions provided will do case insensitive compares, or transform utf8 strings from one case to another. Given the vastness of unicode, and the authors lack of understanding beyond latin codepoints on whether case means anything, the following categories are the only ones that will be checked in case insensitive code: ASCII Latin-1 Supplement Latin Extended-A Latin Extended-B Greek and Coptic Cyrillic

DotNet, CoreFx


28/07/2021 https://github.com/dotnet/corefxlab/issues/2368 Scenarios and Design Philosophy - UTF-8 string support https://gist.github.com/GrabYourPitchforks/901684d0aa1d2440eb378d847cfc8607 (jlf: replaced by the following URL) https://github.com/dotnet/corefx/issues/34094 (go directly to next URL) https://github.com/dotnet/runtime/issues/28204 Motivations and driving principles behind the Utf8Char proposal https://github.com/dotnet/runtime/issues/933 The NuGet package generally follows the proposal in dotnet/corefxlab#2350, which is where most of the discussion has taken place. It's a bit aggravating that the discussion is split across so many different forums, I know. :( ceztko I noticed dotnet/corefxlab#2350 just got closed. Did the discussion moved somewhere else about more UTF8 first citizen support efforts? @ceztko The corefxlab repo was archived, so open issues were closed to support that effort. That thread also got so large that it was difficult to follow. @krwq is working on restructuring the conversation so that we can continue the discussion in a better forum. jlf: Not clear where the discussion continues... This URL just shows some tags, one of them is "Future". https://github.com/orgs/dotnet/projects/7#card-33368432 https://github.com/dotnet/corefxlab/issues/2350 Utf8String design discussion - last edited 14-Sep-19 Tons of comments, with this conclusion: The discussion in this issue is too long and github has troubles rendering it. I think we should close this issue and start a new one in dotnet/runtime. https://github.com/dotnet/runtime/tree/main .Net runtime jlf: could be useful https://github.com/dotnet/runtime/blob/main/src/libraries/System.Console/src/System/Console.cs

Dafny lang


https://corp.unicode.org/pipermail/unicode/2021-May/009434.html Dafny natively supports expressing statements about sets and contract programming and a toy implementation turned out to be a fairly rote translation of the Unicode spec. Dafny is also transpilation focused, so the primary interface must be highly functional and encoding neutral.

Dart lang


Dart SDK uses ICU4X? jlf: to investigate... --- On Fuchsia, the Dart SDK uses createTimeZone() with metazone names obtained from the OS (usage site). ICU4X currently only supports this stuff with BCP-47 ids. We should have a way to go from metazone names to BCP-47 ids. I suspect this is already part of the plan but I'm not sure if there's a specific issue filed (@nordzilla?) --- In the link you posted, it shows "America/New_York", which is an IANA time zone name, not a metazone name. Did you mean to ask about IANA-to-BCP47 mapping? That would be #2909 https://github.com/dart-lang/sdk/blob/main/sdk/lib/core/string.dart https://github.com/dart-lang/sdk/blob/e995cb5f7cd67d39c1ee4bdbe95c8241db36725f/pkg/analyzer/lib/source/source_range.dart https://github.com/dart-lang/ https://github.com/dart-lang/language https://github.com/dart-lang/sdk https://dart.dev/guides/language/language-tour#strings A Dart string (String object) holds a sequence of UTF-16 code units. https://dart.dev/guides/language/language-tour#runes-and-grapheme-clusters In Dart, runes expose the Unicode code points of a string. You can use the characters package to view or manipulate user-perceived characters, also known as Unicode (extended) grapheme clusters. https://dart.dev/guides/libraries/library-tour#strings-and-regular-expressions https://pub.dev/packages/characters Characters are strings viewed as sequences of user-perceived characters, also known as Unicode (extended) grapheme clusters. The Characters class allows access to the individual characters of a string, and a way to navigate back and forth between them using a CharacterRange. https://medium.com/dartlang/dart-string-manipulation-done-right-5abd0668ba3e Like many other programming languages designed before emojis started to dominate our daily communications and the rise of multilingual support in commercial apps, Dart represents a string as a sequence of UTF-16 code units. --- jlf: they say that the Dart users are not aware of the Characters package. They try to improve the situation in the Flutter framework, but they are not very happy of the situation: Those mitigations can help, but they are limited to string manipulations performed in the context of a Flutter project. We need to carefully measure their effectiveness after they become available. A more complete solution at the Dart language level will likely require migration of at least some existing code, although a few options (for example, static extension types) might make breaking changes manageable. More technical investigation is needed to fully understand the trade-offs. https://github.com/robertbastian/icu4x/tree/dart/ffi/capi/dart/package jlf: A fork with DART FFI

Elixir lang


https://elixir-lang.org/ "Elixir" |> String.graphemes() |> Enum.frequencies() %{"E" => 1, "i" => 2, "l" => 1, "r" => 1, "x" => 1} --- "Elixir"~text~reduce(by: "characters", initial: .stem~new~~put(0)){accu[item] += 1}= a Stem (5 items) 'E' : 1 'i' : 2 'l' : 1 'r' : 1 'x' : 1 https://hexdocs.pm/elixir/String.html Strings in Elixir are UTF-8 encoded binaries. Works at grapheme level. The functions in this module rely on the Unicode Standard, but do not contain any of the locale specific behaviour. To act according to the Unicode Standard, many functions in this module run in linear time, as they need to traverse the whole string considering the proper Unicode code points. For example, String.length/1 will take longer as the input grows. On the other hand, Kernel.byte_size/1 always runs in constant time (i.e. regardless of the input size). --- Interesting: they manage correctly the upper/lower without using a locale. upcase(string, mode \\ :default) Converts all characters in the given string to uppercase according to mode. mode may be :default, :ascii, :greek or :turkic. The :default mode considers all non-conditional transformations outlined in the Unicode standard. :ascii uppercases only the letters a to z. :greek includes the context sensitive mappings found in Greek. :turkic properly handles the letter i with the dotless variant. https://hexdocs.pm/elixir/unicode-syntax.html Strings are UTF-8 encoded. Charlists are lists of Unicode code points. In such cases, the contents are kept as written by developers, without any transformation. Elixir allows Unicode characters in its variables, atoms, and calls. From now on, we will refer to those terms as identifiers. The characters allowed in identifiers are the ones specified by Unicode. Elixir normalizes all characters to be the in the NFC form. Mixed-script identifiers are not supported for security reasons. аdmin "аdmin"~text~unicodecharacters== an Array (shape [5], 5 items) 1 : ( "а" U+0430 Ll 1 "CYRILLIC SMALL LETTER A" ) 2 : ( "d" U+0064 Ll 1 "LATIN SMALL LETTER D" ) 3 : ( "m" U+006D Ll 1 "LATIN SMALL LETTER M" ) 4 : ( "i" U+0069 Ll 1 "LATIN SMALL LETTER I" ) 5 : ( "n" U+006E Ll 1 "LATIN SMALL LETTER N" ) The character must either be all in Cyrillic or all in Latin. The only mixed-scripts that Elixir allows, according to the Highly Restrictive Unicode recommendations, are: Latin and Han with Bopomofo Latin and Japanese Latin and Korean Elixir will also warn on confusable identifiers in the same file. For example, Elixir will emit a warning if you use both variables а (Cyrillic) and а (Latin) in your code. Elixir implements the requirements outlined in the Unicode Annex #31 (https://www.unicode.org/reports/tr31/) Elixir does not allow the use of ZWJ or ZWNJ in identifiers and therefore does not implement R1a. Bidirectional control characters are also not supported. R1b is guaranteed for backwards compatibility purposes. Elixir supports only code points \t (0009), \n (000A), \r (000D) and \s (0020) as whitespace and therefore does not follow requirement R3. R3 requires a wider variety of whitespace and syntax characters to be supported.

Factor lang


http://docs.factorcode.org/content/article-unicode.html http://useless-factor.blogspot.fr/2007/02/doing-unicode-right-part-1.html JLF : bof... http://useless-factor.blogspot.fr/2007/02/doing-unicode-right-part-2.html http://useless-factor.blogspot.fr/2007/08/unicode-implementers-guide-part-3.html http://useless-factor.blogspot.fr/2007/08/unicode-implementers-guide-part-4.html grapheme breaking http://useless-factor.blogspot.fr/2007/08/r-597-rs-unicode-library-is-broken.html http://useless-factor.blogspot.fr/2007/02/more-string-parsing.html UTF-8/16 encoder/decoder I used a design pattern known as a sentinel, which helps me cross-cut pointcutting concerns by instantiating objects which encapsulate the state of the parser. I never mutate these, and the program is purely functional except for the use of make (which could trivially be changed into a less efficient map [ ] subset, sacrificing efficiency and some terseness but making it functional). TUPLE: new ; TUPLE: double val ; TUPLE: quad2 val ; TUPLE: quad3 val ; : bad-char CHAR: ? ; GENERIC: (utf16le) ( char state -- state ) M: new (utf16le) drop <double> ; M: double (utf16le) over -3 shift BIN: 11011 = [ over BIN: 100 bitand 0 = [ double-val swap BIN: 11 bitand 8 shift bitor <quad2> ] [ 2drop bad-char , <new> ] if ] [ double-val swap 8 shift bitor , <new> ] if ; M: quad2 (utf16le) quad2-val 10 shift bitor <quad3> ; M: quad3 (utf16le) over -2 shift BIN: 110111 = [ swap BIN: 11 bitand 8 shift swap quad3-val bitor HEX: 10000 + , <new> ] [ 2drop bad-char , <new> ] if ; : utf16le ( state string -- state string ) [ [ swap (utf16le) ] each ] { } make ; https://re.factorcode.org/2023/05/unicode.html jlf: very basic, but may be useful to write little tests https://re.factorcode.org/2023/05/case-conversion.html snake_case camelCase kebab-case PascalCase Ada_Case Train-Case COBOL-CASE MACRO_CASE UPPER CASE lower case Title Case Sentence case dot.case

Fortran lang


https://fortran-lang.discourse.group/t/using-unicode-characters-in-fortran/2764 jlf: hmm... it's blind support of UTF-8, as we do with current Rexx. There is no support for Unicode. In the unicode_len.f90 example:
chars = 'Fortran is 💪, 😎, 🔥!'
if (len(chars) /= 28) error stop
28 is the length in bytes... In the unicode_index.f90 example:
chars = '📐: 4.0·tan⁻¹(1.0) = π'
i = index(chars, 'n')
if (i /= 14) error stop
i = index(chars, '¹')
if (i /= 18) error stop
14 and 18 are byte positions...

GO lang


https://go.dev/ https://go.dev/ref/spec#Conversions_to_and_from_a_string_type jlf: worth reading, they cover all the possible conversions between bytes, rune and string. https://go.dev/play/ The Go Playground https://github.com/traefik/yaegi Another Elegant Go Interpreter --- rlwrap yaegi https://yourbasic.org/golang/ Tutorial, a selection related to strings --- []byte("Noël") // [78 111 195 171 108] // 1. Using the string() constructor string([]byte{78, 111, 195, 171, 108}) // Noël // 2. Go provides a package called bytes with a function called NewBuffer(), which // creates a new Buffer and then uses the String() method to get the string output. bytes.NewBuffer([]byte{78, 111, 195, 171, 108}).String() // Noël // 3. Using fmt.Sprintf() function fmt.Sprintf("%s", []byte{78, 111, 195, 171, 108}) // Noël // String building fmt.Sprintf("Size: %d MB.", 85) // Size: 85 MB. // High-performance string concatenation var b strings.Builder b.Grow(32) // preallocate memory when the maximum size of the string is known for i, p := range []int{2, 3, 5, 7, 11, 13} { fmt.Fprintf(&b, "%d:%d, ", i+1, p) } s := b.String() // no copying s = s[:b.Len()-2] // no copying (removes trailing ", ") fmt.Println(s) // 1:2, 2:3, 3:5, 4:7, 5:11, 6:13 // Convert string to runes // For an invalid UTF-8 sequence, the rune value will be 0xFFFD for each invalid byte. []rune("Noël") // [78 111 235 108] // Convert runes to string // When you convert a slice of runes to a string, you get a new string that // is the concatenation of the runes converted to UTF-8 encoded strings. // Values outside the range of valid Unicode code points are converted to // \uFFFD, the Unicode replacement character �. string([]rune{'\u004E', '\u006F', '\u00EB', '\u006C'}) // Noël // String iteration by runes // the range loop iterates over Unicode code points. // The index is the first byte of a UTF-8-encoded code point; // the second value, of type rune, is the value of the code point. // For an invalid UTF-8 sequence, the second value will be 0xFFFD, // and the iteration will advance a single byte. for i, ch := range "日本語" { fmt.Printf("%#U starts at byte position %d\n", ch, i) } // Output: U+004E 'N' starts at byte position 0 U+006F 'o' starts at byte position 1 U+00EB 'ë' starts at byte position 2 U+006C 'l' starts at byte position 4 // String iteration by bytes const s = "Noël" for i := 0; i < len(s); i++ { fmt.Printf("%x ", s[i]) } // Output: 4e 6f c3 ab 6c https://pkg.go.dev/strings Package strings implements simple functions to manipulate UTF-8 encoded strings. jlf: BIFs https://go.dev/blog/slices Arrays, slices (and strings): The mechanics of 'append' Rob Pike 26 September 2013 --- jlf: prerequisite to understand how strings are managed Next blog also helps (no relation with Unicode, but...) https://teivah.medium.com/slice-length-vs-capacity-in-go-af71a754b7d8 https://go.dev/blog/strings Strings, bytes, runes and characters in Go Rob Pike 23 October 2013 --- In Go, a string is in effect a read-only slice of bytes. A string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98" Indexing a string accesses individual bytes, not characters. for i := 0; i < len(sample); i++ { fmt.Printf("%x ", sample[i]) # bd b2 3d bc 20 e2 8c 98 } A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte. 
fmt.Printf("%x\n", sample) # bdb23dbc20e28c98 fmt.Printf("% x\n", sample) # bd b2 3d bc 20 e2 8c 98 The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous. fmt.Printf("%q\n", sample) # "\xbd\xb2=\xbc ⌘" fmt.Printf("%+q\n", sample) # "\xbd\xb2=\xbc \u2318" The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. A for range loop decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. const nihongo = "日本語" for index, runeValue := range nihongo { fmt.Printf("%#U starts at byte position %d\n", runeValue, index) } The output shows how each code point occupies multiple bytes: U+65E5 '日' starts at byte position 0 U+672C '本' starts at byte position 3 U+8A9E '語' starts at byte position 6 https://go.dev/pkg/unicode/utf8/ Unicode/utf8 package https://go.dev/blog/normalization Text normalization in Go Marcel van Lohuizen 26 November 2013 --- To write your text as NFC, use the https://pkg.go.dev/golang.org/x/text/unicode/norm package to wrap your io.Writer of choice: wc := norm.NFC.Writer(w) defer wc.Close() // write as before... If you have a small string and want to do a quick conversion, you can use this simpler form: norm.NFC.Bytes(b) https://cs.opensource.google/go/x/text This repository holds supplementary Go libraries for text processing, many involving Unicode. https://pkg.go.dev/golang.org/x/text/collate The collate package, which can sort strings in a language-specific way, works correctly even with unnormalized strings https://pkg.go.dev/golang.org/x/text/encoding Package encoding defines an interface for character encodings, such as Shift JIS and Windows 1252, that can convert to and from UTF-8. Encoding implementations are provided in other packages, such as golang.org/x/text/encoding/charmap golang.org/x/text/encoding/japanese. A Decoder converts bytes to UTF-8. It implements transform.Transformer. Transforming source bytes that are not of that encoding will not result in an error per se. Each byte that cannot be transcoded will be represented in the output by the UTF-8 encoding of '\uFFFD', the replacement rune. --- jlf: strange... I was expecting a more conservative conversion, since the core language supports any bytes in a string. An Encoder converts bytes from UTF-8. It implements transform.Transformer. Each rune that cannot be transcoded will result in an error. In this case, the transform will consume all source byte up to, not including the offending rune. Transforming source bytes that are not valid UTF-8 will be replaced by `\uFFFD`. --- jlf: the previous description seems contradictory. "up to, not including the offending rune" "not valid UTF-8 will be replaced by `\uFFFD`" https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/charmap Package charmap provides simple character encodings such as IBM Code Page 437 and Windows 1252. CodePage037 is the IBM Code Page 037 encoding. CodePage1047 is the IBM Code Page 1047 encoding. CodePage1140 is the IBM Code Page 1140 encoding. CodePage437 is the IBM Code Page 437 encoding. CodePage850 is the IBM Code Page 850 encoding. CodePage852 is the IBM Code Page 852 encoding. CodePage855 is the IBM Code Page 855 encoding. CodePage858 is the Windows Code Page 858 encoding. CodePage860 is the IBM Code Page 860 encoding. CodePage862 is the IBM Code Page 862 encoding. 
CodePage863 is the IBM Code Page 863 encoding. CodePage865 is the IBM Code Page 865 encoding. CodePage866 is the IBM Code Page 866 encoding. ISO8859_1 is the ISO 8859-1 encoding. ISO8859_10 is the ISO 8859-10 encoding. ISO8859_13 is the ISO 8859-13 encoding. ISO8859_14 is the ISO 8859-14 encoding. ISO8859_15 is the ISO 8859-15 encoding. ISO8859_16 is the ISO 8859-16 encoding. ISO8859_2 is the ISO 8859-2 encoding. ISO8859_3 is the ISO 8859-3 encoding. ISO8859_4 is the ISO 8859-4 encoding. ISO8859_5 is the ISO 8859-5 encoding. ISO8859_6 is the ISO 8859-6 encoding. ISO8859_7 is the ISO 8859-7 encoding. ISO8859_8 is the ISO 8859-8 encoding. ISO8859_9 is the ISO 8859-9 encoding. KOI8R is the KOI8-R encoding. KOI8U is the KOI8-U encoding. Macintosh is the Macintosh encoding. MacintoshCyrillic is the Macintosh Cyrillic encoding. Windows1250 is the Windows 1250 encoding. Windows1251 is the Windows 1251 encoding. Windows1252 is the Windows 1252 encoding. Windows1253 is the Windows 1253 encoding. Windows1254 is the Windows 1254 encoding. Windows1255 is the Windows 1255 encoding. Windows1256 is the Windows 1256 encoding. Windows1257 is the Windows 1257 encoding. Windows1258 is the Windows 1258 encoding. Windows874 is the Windows 874 encoding. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/japanese Package japanese provides Japanese encodings such as EUC-JP and Shift JIS. EUCJP is the EUC-JP encoding. ISO2022JP is the ISO-2022-JP encoding. ShiftJIS is the Shift JIS encoding, also known as Code Page 932 and Windows-31J. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/korean Package korean provides Korean encodings such as EUC-KR. EUCKR is the EUC-KR encoding, also known as Code Page 949. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/simplifiedchinese Package simplifiedchinese provides Simplified Chinese encodings such as GBK. HZGB2312 is the HZ-GB2312 encoding. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/traditionalchinese Package traditionalchinese provides Traditional Chinese encodings such as Big5. Big5 is the Big5 encoding, also known as Code Page 950. https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/unicode Package unicode provides Unicode encodings such as UTF-16. UTF8 is the UTF-8 encoding. It neither removes nor adds byte order marks. UTF8BOM is an UTF-8 encoding where the decoder strips a leading byte order mark while the encoder adds one. UTF16 returns a UTF-16 Encoding for the given default endianness and byte order mark (BOM) policy. func UTF16(e Endianness, b BOMPolicy) encoding.Encoding https://pkg.go.dev/golang.org/x/text@v0.10.0/encoding/unicode/utf32 Package utf32 provides the UTF-32 Unicode encoding. UTF32 returns a UTF-32 Encoding for the given default endianness and byte order mark (BOM) policy. func UTF32(e Endianness, b BOMPolicy) encoding.Encoding https://go.dev/blog/matchlang Language and Locale Matching in Go The Go package https://golang.org/x/text/language implements the BCP 47 standard for language tags and adds support for deciding which language to use based on data published in the Unicode Common Locale Data Repository (CLDR). https://github.com/unicode-org/icu4x/issues/2882 https://cs.opensource.google/go/x/text The golang x-text library has re-implemented most of ICU from scratch, and some of their algorithms and data structures might be interesting for the icu4x project (afaik x-text was not just a port of the ICU codebase to another language, but an actual re-implementation). 
You might want to have a look at their code, or talk to @mpvl who wrote most of it. https://github.com/golang/go/blob/master/src/cmd/compile/internal/syntax/scanner.go Implementation of Golang’s lexer. An identifier is made up of letters and digits (the first character is always a letter), where a letter is any Unicode letter code point (the underscore also counts as a letter). package main import "fmt" func 隨機名稱() { fmt.Println("It works!") } func main() { 隨機名稱() źdźbło := 1 fmt.Println(źdźbło) } https://henvic.dev/posts/go-utf8/ UTF-8 strings with Go: len(s) isn't enough jlf: in his initial post, the guy was not aware of graphemes and it's only after feedback on Reddit that he added material about graphemes. https://github.com/rivo/uniseg Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go https://pkg.go.dev/github.com/rivo/uniseg#hdr-Monospace_Width Monospace Width Monospace width, as referred to in this package, is the width of a string in a monospace font. This package differs from wcswidth() in a number of ways, presumably to generate more visually pleasing results. Note that whether these widths appear correct depends on your application's render engine, to which extent it conforms to the Unicode Standard, and its choice of font. --- Rules implemented by uniseg: we assume that every code point has a width of 1, with the following exceptions: - Code points with grapheme cluster break properties Control, CR, LF, Extend, and ZWJ have a width of 0. - U+2E3A, Two-Em Dash, has a width of 3. - U+2E3B, Three-Em Dash, has a width of 4. - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide" (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both have a width of 1.) - Code points with grapheme cluster break property Regional Indicator have a width of 2. - Code points with grapheme cluster break property Extended Pictographic have a width of 2, unless their Emoji Presentation flag is "No", in which case the width is 1. - For Hangul grapheme clusters composed of conjoining Jamo and for Regional Indicators (flags), all code points except the first one have a width of 0. - For grapheme clusters starting with an Extended Pictographic, any additional code point will force a total width of 2, except if the Variation Selector-15 (U+FE0E) is included, in which case the total width is always 1. - Grapheme clusters ending with Variation Selector-16 (U+FE0F) have a width of 2. --- jlf: meh; in conclusion, there is no guarantee that the result will be good. --- uniseg.StringWidth("🇩🇪🏳️‍🌈!") -- uniseg returns 5 utf8proc: "🇩🇪🏳️‍🌈!"~text~unicodeCharacters~each("charWidth")= -- [ 1, 1, 1, 0, 0, 2, 1] "🇩🇪🏳️‍🌈!"~text~unicodeCharacters== an Array (shape [7], 7 items) 1 : ( "🇩" U+1F1E9 So 1 "REGIONAL INDICATOR SYMBOL LETTER D" ) 2 : ( "🇪" U+1F1EA So 1 "REGIONAL INDICATOR SYMBOL LETTER E" ) 3 : ( "🏳" U+1F3F3 So 1 "WAVING WHITE FLAG" ) 4 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) 5 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 6 : ( "🌈" U+1F308 So 2 "RAINBOW" ) 7 : ( "!" U+0021 Po 1 "EXCLAMATION MARK" )

jRuby lang


https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyString.java jlf: big file, more than 7000 lines. https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubyEncoding.java https://github.com/jruby/jruby/blob/master/lib/ruby/stdlib/unicode_normalize/normalize.rb https://github.com/jruby/jruby/blob/master/spec/ruby/core/string/unicode_normalize_spec.rb

Java lang


https://docs.oracle.com/en/java/javase/ https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/text/BreakIterator.html java.text.BreakIterator The default implementation of the character boundary analysis conforms to the Unicode Consortium's Extended Grapheme Cluster breaks. For more detail, refer to the Grapheme Cluster Boundaries section in the Unicode Standard Annex #29. https://docs.oracle.com/en/java/javase/20/intl/internationalization-overview.html Internationalization Overview https://codeahoy.com/2016/05/08/the-char-type-in-java-is-broken/ Java has supported Unicode since its first release and strings are internally represented using UTF-16 encoding. UTF-16 is a variable length encoding scheme. For characters that fit into 16 bits, it uses 2 bytes to represent them. For all other characters, it uses 4 bytes. For a character that requires more than 16 bits, like these emojis 👦👩, the char methods like someString.charAt(0) or someString.substring(0,1) will break and give you only half the code point. https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html When the specification for the Java language was created, the Unicode standard was accepted and the char primitive was defined as a 16-bit data type, with characters in the hexadecimal range from 0x0000 to 0xFFFF. Because 16-bit encoding supports 2^16 (65,536) characters, which is insufficient to define all characters in use throughout the world, the Unicode standard was extended to 0x10FFFF, which supports over one million characters. The definition of a character in the Java programming language could not be changed from 16 bits to 32 bits without causing millions of Java applications to no longer run properly. To correct the definition, a scheme was developed to handle characters that could not be encoded in 16 bits. The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values. https://openjdk.org/jeps/400 JEP 400: UTF-8 by Default A quick way to see the default charset of the current JDK is with the following command: java -XshowSettings:properties -version 2>&1 | grep file.encoding As envisaged by the specification of Charset.defaultCharset(), the JDK will allow the default charset to be configured to something other than UTF-8. java -Dfile.encoding=COMPAT the default charset will be the charset chosen by the algorithm in JDK 17 and earlier, based on the user's operating system, locale, and other factors. The value of file.encoding will be set to the name of that charset. java -Dfile.encoding=UTF-8 the default charset will be UTF-8. This no-op value is defined in order to preserve the behavior of existing command lines. The treatment of values other than "COMPAT" and "UTF-8" is not specified. They are not supported, but if such a value worked in JDK 17 then it will likely continue to work in JDK 18. https://www.baeldung.com/java-remove-accents-from-text Remove Accents and Diacritics From a String in Java - We will perform the compatibility decomposition represented as the Java enum NFKD, because it decomposes more ligatures than the canonical method (for example, the ligature “fi”). - We will remove all characters matching the Unicode Mark category using the \p{M} regex expression.
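A minimal sketch of the two steps just described (NFKD decomposition, then removal of the Mark category); the helper name stripDiacritics is hypothetical and not part of the Baeldung StringNormalizer class exercised by the tests below:

import java.text.Normalizer;

public class StripDiacritics {
    // Step 1: compatibility decomposition (NFKD) separates base letters from combining marks.
    // Step 2: remove every code point in the Unicode Mark category (\p{M}).
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("āïń"));  // prints: ain
    }
}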
Test: assertEquals("\\u0066 \\u0069", StringNormalizer.unicodeValueOfNormalizedString("fi")); assertEquals("\\u0061 \\u0304", StringNormalizer.unicodeValueOfNormalizedString("ā")); assertEquals("\\u0069 \\u0308", StringNormalizer.unicodeValueOfNormalizedString("ï")); assertEquals("\\u006e \\u0301", StringNormalizer.unicodeValueOfNormalizedString("ń")); Compare Strings Including Accents Using Collator. Java provides four strength values for a Collator: PRIMARY: comparison omitting case and accents SECONDARY: comparison omitting case but including accents and diacritics TERTIARY: comparison including case and accents IDENTICAL: all differences are significant https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/io/DataInput.html#modified-utf-8 Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. - Characters in the range '\u0001' to '\u007F' are represented by a single byte. - The null character '\u0000' and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes. - Characters in the range '\u0800' to '\uFFFF' are represented by three bytes. The differences between this format and the standard UTF-8 format are the following: - The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. - Only the 1-byte, 2-byte, and 3-byte formats are used. - Supplementary characters are represented in the form of surrogate pairs. Decomposition of ligature In Java, you'll need to use the Normalizer class and the NFKC form: --- String ff ="\uFB00"; String normalized = Normalizer.normalize(ff, Form.NFKC); System.out.println(ff + " = " + normalized); --- This will print ff = ff https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16 You can confirm the following by looking at the source code of the relevant version of the java.lang.String class in OpenJDK. Prior to Java 9, the standard in-memory representation for a Java String is UTF-16 code-units held in a char[]. private final char value[]; With Java 6 update 21 and later, there was a non-standard option (-XX:UseCompressedStrings) to enable compressed strings. This feature was removed in Java 7. For Java 9 and later, the implementation of String has been changed to use a compact representation by default. private final byte[] value; private final byte coder; // LATIN1 (0) or UTF16 (1) https://docs.oracle.com/en/java/javase/20/docs/specs/man/java.html#advanced-runtime-options-for-java -XX:-CompactStrings Disables the Compact Strings feature. By default, this option is enabled. When this option is enabled, Java Strings containing only single-byte characters are internally represented and stored as single-byte-per-character Strings using ISO-8859-1 / Latin-1 encoding. This reduces, by 50%, the amount of space required for Strings containing only single-byte characters. For Java Strings containing at least one multibyte character: these are represented and stored as 2 bytes per character using UTF-16 encoding. Disabling the Compact Strings feature forces the use of UTF-16 encoding as the internal representation for all Java Strings. 
As of 2023, see JEP 254: Compact Strings https://openjdk.org/jeps/254 https://howtodoinjava.com/java9/compact-strings/ https://stackoverflow.com/questions/44178432/difference-between-compact-strings-and-compressed-strings-in-java-9 In Java 9 on the other hand, compact strings are fully integrated into the JDK source. String is always backed by byte[], where characters use one byte if they are Latin-1 and otherwise two. Most operations do a check to see which is the case, e.g. charAt: public char charAt(int index) { if (isLatin1()) { return StringLatin1.charAt(value, index); } else { return StringUTF16.charAt(value, index); } } Compact strings are enabled by default and can be partially disabled - "partially" because they are still backed by a byte[] and operations returning chars must still put them together from two separate bytes public int length() { return value.length >> coder(); } If our String is Latin1 only, coder is going to be zero, so length of value (the byte array) is the size of chars. For non-Latin1 divide by two. https://www.baeldung.com/java-string-encode-utf-8 Encoding With Core Java // First, we get the String bytes, and then we create a new one using the retrieved bytes and the desired charset: String rawString = "Entwickeln Sie mit Vergnügen"; byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8); String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8); assertEquals(rawString, utf8EncodedString); Encoding With Java 7 StandardCharsets // First, we'll encode the String into bytes, and second, we'll decode it into a UTF-8 String: String rawString = "Entwickeln Sie mit Vergnügen"; ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString); String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString(); assertEquals(rawString, utf8EncodedString); https://www.baeldung.com/java-string-to-byte-array Convert String to Byte Array and Reverse in Java Converting a String to Byte Array A String is stored as an array of Unicode characters in Java. To convert it to a byte array, we translate the sequence of characters into a sequence of bytes. For this translation, we use an instance of Charset. This class specifies a mapping between a sequence of chars and a sequence of bytes. We refer to the above process as encoding. Using String.getBytes() The String class provides three overloaded getBytes methods to encode a String into a byte array: - getBytes() – encodes using platform's default charset --- String inputString = "Hello World!"; byte[] byteArrray = inputString.getBytes(); --- The above method is platform-dependent, as it uses the platform's default charset. We can get this charset by calling Charset.defaultCharset(). - getBytes (String charsetName) – encodes using the named charset - getBytes (Charset charset) – encodes using the provided charset Using Charset.encode() The Charset class provides encode(), a convenient method that encodes Unicode characters into bytes. This method always replaces invalid input and unmappable-characters using the charset's default replacement byte array. --- String inputString = "Hello ਸੰਸਾਰ!"; Charset charset = StandardCharsets.US_ASCII; byte[] byteArrray = charset.encode(inputString).array(); --- CharsetEncoder CharsetEncoder transforms Unicode characters into a sequence of bytes for a given charset. Moreover, it provides fine-grained control over the encoding process. 
--- String inputString = "Hello ਸੰਸਾਰ!"; CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder(); encoder.onMalformedInput(CodingErrorAction.IGNORE) .onUnmappableCharacter(CodingErrorAction.REPLACE) .replaceWith(new byte[] { 0 }); byte[] byteArrray = encoder.encode(CharBuffer.wrap(inputString)).array(); --- Converting a Byte Array to String We refer to the process of converting a byte array to a String as decoding. Similar to encoding, this process requires a Charset. However, we can't just use any charset for decoding a byte array. In particular, we should use the charset that encoded the String into the byte array. https://retrocomputing.stackexchange.com/questions/26535/why-do-java-classfiles-and-jni-use-a-frankensteins-monster-encoding-crossin Why do Java classfiles (and JNI) use a "Frankenstein's monster" encoding crossing UTF-8 and UTF-16? jlf: interesting for the history. If I understand correctly, Java uses the CESU-8 encoding to store strings in classfiles and JNI payloads. https://en.wikipedia.org/wiki/CESU-8 CESU-8 = Compatibility Encoding Scheme for UTF-16: 8-Bit - A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8 - A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four. CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange. Supporting CESU-8 in HTML documents is prohibited by the W3C and WHATWG HTML standards, as it would present a cross-site scripting vulnerability. What is the level of support of surrogates? Java.lang.Character.isSurrogatePair() Java.lang.Character.toCodePoint(char high, char low) : int String.codePointAt() Character.codePointAt() http://hauchee.blogspot.com/2015/05/surrogate-characters-mechanism.html Neither String or StringBuilder working properly. To avoid the issue above, use java.text.BreakIterator to determine the correct position. jlf: the code below show how to pass from logical position to real position. public static void main(String[] args) { String text = "a\uD834\uDD60s\uD834\uDD60\uD834\uDD60©₂"; // text: a텠s텠텠©₂ int startIndex = 2; int endIndex = 5; BreakIterator charIterator = BreakIterator.getCharacterInstance(); System.out.println( subString(charIterator, text, startIndex, endIndex)); // output: s텠텠 } private static String subString(BreakIterator charIterator, String target, int start, int end) { int realStart = 0; int realEnd = 0; charIterator.setText(target); int boundary = charIterator.first(); int i = 0; while (boundary != BreakIterator.DONE) { if (i == start) { realStart = boundary; } if (i == end) { realEnd = boundary; break; } boundary = charIterator.next(); i++; } return target.substring(realStart, realEnd); } https://github.com/s-u/rJava/issues/51 R to Java interface Error on UTF-16 surrogate pairs Java uses UTF-16 internally and encodes Unicode characters above U+FFFF with surrogate pairs. When strings containing such characters are converted to UTF-8 by rJava they are encoded as a pair of 3 byte sequences rather than as the correct 4 byte sequence. 
This is not valid UTF-8 and will result in "invalid multibyte string" errors. https://www.unicode.org/faq/utf_bom.html#utf8-4 https://bugs.openjdk.org/browse/JDK-8291660 https://youtrack.jetbrains.com/issue/IDEA-197555 \b{g} not supported in regex In the docs for java.util.regex.Pattern (https://docs.oracle.com/javase/10/docs/api/java/util/regex/Pattern.html): \b{g} is listed under the “Boundary matchers” section: “\b{g} A Unicode extended grapheme cluster boundary” https://www.reddit.com/r/LanguageTechnology/comments/af0ice/seeking_lightweight_java_graphemetophoneme_g2p/ Seeking lightweight Java grapheme-to-phoneme (G2P) model Succeeded at getting jg2p working. It's doing pretty well in terms of pronunciation quality but the model is very large for an Android app and takes forever to load. https://github.com/steveash/jg2p/ jg2p Java implementation of a general grapheme to phoneme toolkit using a pipeline of CRFs, a log-loss re-ranker, and a joint "graphone" language model. https://horstmann.com/unblog/2023-10-03/index.html Stop Using char in Java. And Code Points jlf: moderately interesting... jlf: idem for the related HN comments https://news.ycombinator.com/item?id=37822967 Since Java 20, there is a way of iterating over the grapheme clusters of a string, using the BreakIterator class from Java 1.1. String s = "Ciao 🇮🇹!"; BreakIterator iter = BreakIterator.getCharacterInstance(); iter.setText(s); int start = iter.first(); int end = iter.next(); while (end != BreakIterator.DONE) { String gc = s.substring(start, end); start = end; end = iter.next(); process(gc); } Here is a much simpler way, clearly not as efficient. I was stunned to find out that this worked since Java 9! s.split("\\b{g}"); // An array with elements "C", "i", "a", "o", " ", "🇮🇹", "!" Or, to get a stream: Pattern.compile("\\X").matcher(s).results().map(MatchResult::group)
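To round off the surrogate-related APIs listed earlier (Character.isSurrogatePair(), Character.toCodePoint(), String.codePointAt()), here is a small self-contained sketch; it also illustrates the codeahoy claim above that charAt() only yields half of a supplementary character:

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "👦";  // U+1F466, a supplementary character, stored as a surrogate pair

        System.out.println(s.length());                       // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (code point)

        char high = s.charAt(0);  // 0xD83D, only half of the character
        char low  = s.charAt(1);  // 0xDC66

        System.out.println(Character.isSurrogatePair(high, low));       // true
        System.out.printf("U+%X%n", Character.toCodePoint(high, low));  // U+1F466
        System.out.printf("U+%X%n", s.codePointAt(0));                  // U+1F466
    }
}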

JavaScript lang


https://certitude.consulting/blog/en/invisible-backdoor/ THE INVISIBLE JAVASCRIPT BACKDOOR https://www.npmjs.com/package/tty-strings A one stop shop for working with text displayed in the terminal. The goal of this project is to alleviate the headache of working with Javascript's internal representation of unicode characters, particularly within the context of displaying text in the terminal for command line applications. --- jlf tag: character width https://github.com/foliojs/linebreak A JS implementation of the Unicode Line Breaking Algorithm (UAX #14) It is used by PDFKit (https://github.com/foliojs/pdfkit) for line wrapping text in PDF documents. https://github.com/codebox/homoglyph A big list of homoglyphs and some code to detect them

Julia lang


Remember: search in issues with "utf8proc in:title,body" https://bkamins.github.io/julialang/2020/08/13/strings.html The String, or There and Back Again https://docs.julialang.org/en/v1/manual/strings/ You can input any Unicode character in single quotes using \u followed by up to four hexadecimal digits or \U followed by up to eight hexadecimal digits (the longest valid value only requires six): julia> '\u0' '\0': ASCII/Unicode U+0000 (category Cc: Other, control) julia> '\u78' 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase) julia> '\u2200' '∀': Unicode U+2200 (category Sm: Symbol, math) julia> '\U10ffff' '\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned) julia> s = "\u2200 x \u2203 y" "∀ x ∃ y" https://docs.julialang.org/en/v1/base/strings/ jlf: search for "ß" in this page with Chrome, you will see it matches with "ss" It doesn't match the β here: isless("β", "α") "β"~text~characters= -- ( "β" U+03B2 Ll 1 "GREEK SMALL LETTER BETA" ) https://juliapackages.com/p/strs jlf: the string implemention of Scott P Jones Seems quiet since last year... This uses Swift-style \ escape sequences, such as \u{xxxx} for Unicode constants, instead of \uXXXX and \UXXXXXXXX, which have the advantage of not having to worry about some digit or letter A-F or a-f occurring after the last hex digit of the Unicode constant. It also means that $, a very common character for LaTeX strings or output of currencies, does not need to be in a string quoted as '$' It uses \(expr) for interpolation like Swift, instead of $name or $(expr), which also has the advantage of not having to worry about the next character in the string someday being allowed in a name. It allows for embedding Unicode characters using a variety of easy to remember names, instead of hex codes: \:emojiname: \<latexname> \N{unicodename} \&htmlname; Examples of this are: f"\<dagger> \&yen; \N{ACCOUNT OF} \:snake:", which returns the string: "† ¥ ℀ 🐍 " https://discourse.julialang.org/t/stupid-question-on-unicode/27674/7 Discussion about escape sequence https://docs.julialang.org/en/v1/stdlib/Unicode/ Unicode.julia_chartransform(c::Union{Char,Integer}) Unicode.isassigned(c) -> Bool isequal_normalized(s1::AbstractString, s2::AbstractString; casefold=false, stripmark=false, chartransform=identity) Unicode.normalize(s::AbstractString; keywords...) boolean keywords options (which all default to false except for compose) - compose=false: do not perform canonical composition - decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present) - compat=true: compatibility equivalents are canonicalized - casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison - newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively - stripmark=true: strip diacritical marks (e.g. accents) - stripignore=true: strip Unicode's "default ignorable" characters (e.g. 
the soft hyphen or the left-to-right marker) - stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified - rejectna=true: throw an error if unassigned code points are found - stable=true: enforce Unicode versioning stability (never introduce characters missing from earlier Unicode versions) Unicode.normalize(s::AbstractString, normalform::Symbol) normalform can be :NFC, :NFD, :NFKC, or :NFKD. utf8proc doesn't support language-sensitive case-folding. Julia, which uses utf8proc, has decided to remain locale-independent. See https://github.com/JuliaLang/julia/issues/7848 https://github.com/JuliaLang/julia/pull/42493 This PR adds a function isequal_normalized to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks). https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/13 julia> '\ub5' 'µ': Unicode U+00b5 (category Ll: Letter, lowercase) julia> '\uff' 'ÿ': Unicode U+00ff (category Ll: Letter, lowercase) julia> Base.Unicode.uppercase("ÿ")[1] 'Ÿ': Unicode U+0178 (category Lu: Letter, uppercase) julia> Base.Unicode.uppercase("µ")[1] 'Μ': Unicode U+039c (category Lu: Letter, uppercase) jlf: I find the next thread interesting from a social point of view... https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/40 Yet another Stefan Karpinski against Scott P Jones... https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/42 jlf: helping Scott P Jones https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/46 Referencing https://github.com/JuliaLang/julia/pull/25021 https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/72 jlf: Stefan Karpinski not happy https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/79 jlf: Stefan Karpinski not happy https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/88 jlf: Scott P Jones not happy https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/130 Using a hybrid encoding like Python 3’s strings or @ScottPJones’s UniStr means that not only do you need to look at every byte of incoming data, but you also have to transcode it in general. This is a total performance nightmare for dealing with large text files. https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/133 jlf: Interesting points of Stefan Karpinski regarding the validation of strings. https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/138 jlf: not sure if Scott P Jones says that graphemes are not needed... https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/144 jlf: revolt! https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/148 jlf: "This is a plea for the thread to stop."
https://discourse.julialang.org/t/problems-with-deprecations-of-islower-lowercase-isupper-uppercase/7797/154 jlf: very upset guy https://github.com/JuliaLang/julia/pull/25021 Move Unicode-related functions to new Unicode stdlib package jlf: nothing interesting in the comments, but this is this PR that Scott P Jones describes as a bomb. https://github.com/JuliaLang/julia/pull/19469#issuecomment-264810748 AFAICT does the currently implemented lowercase also not follow the spec. I do not know anything about Turkish but the following behaviour in Greek julia> lowercase("OΔΥΣΣΕΥΣ") "oδυσσευσ" # wrong "oδυσσευς" # would be correct is wrong, i.e. the lowercase sigma at the end is the non-final form σ but should be the final form ς instead. https://github.com/JuliaStrings/utf8proc/issues/54 Feature request: Full Case Folding #54 opened in 2015, still opened in 2022 jlf: related to utf8proc --- https://github.com/JuliaStrings/utf8proc/issues/54#issuecomment-141545196 our case is to make a perfect search in MAPS.ME :) In general, we need to preprocess a lot of raw strings added by community of OpenStreetMap, and match these strings effectively on mobile device, for any language and any input. This includes stripping out all diacritics, full case folding, and even some special conversions which are not covered in Unicode standard but are important for users trying to find something. I've already mentioned ß=>ss conversion, there are also non-standard Ł=>L, й=>и, famous turkish İ and ı conversions, all very important if you don't have a Ł key on your keyboard, for example, and trying to enter it as L (and find some Polish street for example). Now we have our own highly-optimized implementation for NFKD and Case Folding. --- jlf: made a search in https://github.com/mapsme/omim, but could not find where they handle Ł=>L Found NormalizeAndSimplifyString, but it doesn't simplify Ł=>L. https://github.com/JuliaStrings/utf8proc/pull/102 Fixes allowing for “Full” folding and NFKC_CaseFold compliance. #102 --- jlf: this is the creation of the function NFKC_Casefold in utf8proc --- https://github.com/JuliaStrings/utf8proc/pull/133 Case folding fixes #133 Updated version of #102: Restores the original behavior of IGNORE so that this PR is non-breaking, adds new STRIPNA flag. Renames the new function to utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF Adds a minimal test. Updates the utf8proc_data.c file. jlf: this explains why the the options in utf8proc are like that. jlf: "NFKC_CF" seems the name to search to get useful infos about utf8proc_NFKC_Casefold. https://unicode-org.github.io/icu/userguide/transforms/normalization/ NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and removing ignorable characters which was introduced with Unicode 5.2. https://docs.tibco.com/pub/enterprise-runtime-for-R/5.0.0/doc/html/Language_Reference/terrUtils/normalizeUnicode.html normalizeUnicode(x, form = "NCF") form: a character string specifying the type of Unicode normalization to be used. Should be one of the strings "NFC", "NFD", "NFKC", "NFKD", "NFKC_CF" or "NFKC_Casefold". The forms "NFKC_CF" or "NFKC_Casefold" (which are equivalent) are described in https://www.unicode.org/reports/tr31/. https://www.lanqiao.cn/library/elasticsearch-definitive-guide-cn/220_Token_normalization/40_Case_folding Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons. 
jlf: they say "The default normalization form that the icu_normalizer token filter uses is nfkc_cf" https://github.com/JuliaLang/julia/issues/52408 isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false --- jlf: see the comments and new code --- This strings is really not well supported by bbedit! "בְּ"~text~unicodeCharacters== an Array (shape [3], 3 items) 1 : ( "ב" U+05D1 Lo 1 "HEBREW LETTER BET" ) 2 : ( "ּ" U+05BC Mn 0 "HEBREW POINT DAGESH OR MAPIQ" ) 3 : ( "ְ" U+05B0 Mn 0 "HEBREW POINT SHEVA" ) "בְּ"~text~c2x= -- D791 D6BC D6B0 "בְּ"~text~nfc~c2x= -- D791 D6B0 D6BC https://github.com/JuliaStrings/utf8proc/issues/257 normalization does not commute with case-folding? julia> using Unicode: normalize julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c" "J" julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true) false julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true) false Not sure if this is a bug or just a weird behavior of Unicode. --- I get something similar in Python 3: >>> import unicodedata >>> s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c" >>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", s).casefold() False >>> unicodedata.normalize("NFC", s.casefold()) == unicodedata.normalize("NFC", unicodedata.normalize("NFC", s).casefold()) False So I guess this is a weird quirk of Unicode? --- Executor idem: s = "J\u0f72\u0ec8\u0345\u0315\u05bf\u05bb\U0001d16d\u05b0\u0334\u035c"~text~unescape s~nfc(casefold:.true) == s~nfc~nfc(casefold:.true)= -- 0 s~nfc(casefold:.true)~c2x= -- 6A E0BB88 E0BDB2 CEB9 CCB4 D6B0 D6BB D6BF F09D85AD CC95 CD9C s~nfc~nfc(casefold:.true)~c2x= -- 6A CCB4 D6B0 D6BB D6BF E0BB88 E0BDB2 F09D85AD CC95 CD9C CEB9 s~nfc(casefold:.true)~nfc == s~nfc~nfc(casefold:.true)= -- 0 s~nfc(casefold:.true)~nfc~c2x= -- 6A E0BB88 E0BDB2 CEB9 CCB4 D6B0 D6BB D6BF F09D85AD CC95 CD9C s~nfc~nfc(casefold:.true)~c2x= -- 6A CCB4 D6B0 D6BB D6BF E0BB88 E0BDB2 F09D85AD CC95 CD9C CEB9 https://github.com/JuliaStrings/utf8proc/issues/101#issuecomment-1876151702 jlf: maybe this example of Julia code could be useful for Executor? function _isequal_normalized! > I agree that a fast case-folded/normalized comparison function that requires > no buffers seems possible to write and could be useful, even for Julia; Note that such a function was implemented in Julia, and could be ported to C: https://github.com/JuliaLang/julia/blob/0f6c72c71bc947282ae18715c09f93a22828aab7/stdlib/Unicode/src/Unicode.jl#L268-L340

Kotlin lang


https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/ https://github.com/JetBrains/kotlin/tree/master/libraries/stdlib/jvm/src/kotlin/text

Lisp lang


14/09/2021 https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html name Corresponds to the Name Unicode property. The value is a string consisting of upper-case Latin letters A to Z, digits, spaces, and hyphen ‘-’ characters. For unassigned codepoints, the value is nil. general-category Corresponds to the General_Category Unicode property. The value is a symbol whose name is a 2-letter abbreviation of the character’s classification. For unassigned codepoints, the value is Cn. canonical-combining-class Corresponds to the Canonical_Combining_Class Unicode property. The value is an integer. For unassigned codepoints, the value is zero. bidi-class Corresponds to the Unicode Bidi_Class property. The value is a symbol whose name is the Unicode directional type of the character. Emacs uses this property when it reorders bidirectional text for display (see Bidirectional Display). For unassigned codepoints, the value depends on the code blocks to which the codepoint belongs: most unassigned codepoints get the value of L (strong L), but some get values of AL (Arabic letter) or R (strong R). decomposition Corresponds to the Unicode properties Decomposition_Type and Decomposition_Value. The value is a list, whose first element may be a symbol representing a compatibility formatting tag, such as small18; the other elements are characters that give the compatibility decomposition sequence of this character. For characters that don’t have decomposition sequences, and for unassigned codepoints, the value is a list with a single member, the character itself. decimal-digit-value Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Decimal’. The value is an integer, or nil if the character has no decimal digit value. For unassigned codepoints, the value is nil, which means NaN, or “not a number”. digit-value Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Digit’. The value is an integer. Examples of such characters include compatibility subscript and superscript digits, for which the value is the corresponding number. For characters that don’t have any numeric value, and for unassigned codepoints, the value is nil, which means NaN. numeric-value Corresponds to the Unicode Numeric_Value property for characters whose Numeric_Type is ‘Numeric’. The value of this property is a number. Examples of characters that have this property include fractions, subscripts, superscripts, Roman numerals, currency numerators, and encircled numbers. For example, the value of this property for the character U+2155 VULGAR FRACTION ONE FIFTH is 0.2. For characters that don’t have any numeric value, and for unassigned codepoints, the value is nil, which means NaN. mirrored Corresponds to the Unicode Bidi_Mirrored property. The value of this property is a symbol, either Y or N. For unassigned codepoints, the value is N. mirroring Corresponds to the Unicode Bidi_Mirroring_Glyph property. The value of this property is a character whose glyph represents the mirror image of the character’s glyph, or nil if there’s no defined mirroring glyph. All the characters whose mirrored property is N have nil as their mirroring property; however, some characters whose mirrored property is Y also have nil for mirroring, because no appropriate characters exist with mirrored glyphs. Emacs uses this property to display mirror images of characters when appropriate (see Bidirectional Display). For unassigned codepoints, the value is nil. 
paired-bracket Corresponds to the Unicode Bidi_Paired_Bracket property. The value of this property is the codepoint of a character’s paired bracket, or nil if the character is not a bracket character. This establishes a mapping between characters that are treated as bracket pairs by the Unicode Bidirectional Algorithm; Emacs uses this property when it decides how to reorder for display parentheses, braces, and other similar characters (see Bidirectional Display). bracket-type Corresponds to the Unicode Bidi_Paired_Bracket_Type property. For characters whose paired-bracket property is non-nil, the value of this property is a symbol, either o (for opening bracket characters) or c (for closing bracket characters). For characters whose paired-bracket property is nil, the value is the symbol n (None). Like paired-bracket, this property is used for bidirectional display. old-name Corresponds to the Unicode Unicode_1_Name property. The value is a string. For unassigned codepoints, and characters that have no value for this property, the value is nil. iso-10646-comment Corresponds to the Unicode ISO_Comment property. The value is either a string or nil. For unassigned codepoints, the value is nil. uppercase Corresponds to the Unicode Simple_Uppercase_Mapping property. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself. lowercase Corresponds to the Unicode Simple_Lowercase_Mapping property. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself. titlecase Corresponds to the Unicode Simple_Titlecase_Mapping property. Title case is a special form of a character used when the first character of a word needs to be capitalized. The value of this property is a single character. For unassigned codepoints, the value is nil, which means the character itself. special-uppercase Corresponds to Unicode language- and context-independent special upper-casing rules. The value of this property is a string (which may be empty). For example mapping for U+00DF LATIN SMALL LETTER SHARP S is "SS". For characters with no special mapping, the value is nil which means uppercase property needs to be consulted instead. special-lowercase Corresponds to Unicode language- and context-independent special lower-casing rules. The value of this property is a string (which may be empty). For example mapping for U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE the value is "i\u0307" (i.e. 2-character string consisting of LATIN SMALL LETTER I followed by U+0307 COMBINING DOT ABOVE). For characters with no special mapping, the value is nil which means lowercase property needs to be consulted instead. special-titlecase Corresponds to Unicode unconditional special title-casing rules. The value of this property is a string (which may be empty). For example mapping for U+FB01 LATIN SMALL LIGATURE FI the value is "Fi". For characters with no special mapping, the value is nil which means titlecase property needs to be consulted instead.

Mathematica lang


https://www.youtube.com/watch?v=yiwLBvirm7A Live CEOing Ep 426: Language Design in Wolfram Language [Unicode Characters & WFR Suggestions] At the beginning, there are a few minutes about character properties. https://writings.stephenwolfram.com/2022/06/launching-version-13-1-of-wolfram-language-mathematica/#emojis-and-more-multilingual-support Launching Version 13.1 of Wolfram Language & Mathematica 🙀🤠🥳 Emojis! And More Multilingual Support Original 16-bit Unicode is “plane 0”. Now there are up to 16 additional planes. Not quite 32-bit characters, but given the way computers work, the approach now is to allow characters to be represented by 32-bit objects. It’s far from trivial to do that uniformly and efficiently. And for us it’s been a long process to upgrade everything in our system—from string manipulation to notebook rendering— to handle full 32-bit characters. And that’s finally been achieved in Version 13.1. --- You can have wolf and ram variables: In:= Expand[(🐺 + 🐏)^8] In:= Expand[(\|01f43a + \|01f40f)^8] Out= 🐏^8 + 8 🐏^7 🐺 + 28 🐏^6 🐺^2 + 56 🐏^5 🐺^3 + 70 🐏^4 🐺^4 + 56 🐏^3 🐺^5 + 28 🐏^2 🐺^6 + 8 🐏 🐺^7 + 🐺^8 The 🐏 sorts before the 🐺 because it happens to have a numerically smaller character code: In:= ToCharacterCode["🐺🐏"] In:= ToCharacterCode["\|01f43a\|01f40f"] Out= {128058, 128015} --- In:= Grid[Outer[StringJoin[#1, "\:200d", #2] & , {"👩", "👨"}, {"🔬", "🏫", "🎓", "🍳", "🚀", "🔧"}]] In:= Grid[Outer[StringJoin[#1, "\:200d", #2] & , {"\|01f469", "\|01f468"}, {"\|01f52c", "\|01f3eb", "\|01f393", "\|01f373", "\|01f680", "\|01f527"}]] Out= 👩‍🔬 👩‍🏫 👩‍🎓 👩‍🍳 👩‍🚀 👩‍🔧 👨‍🔬 👨‍🏫 👨‍🎓 👨‍🍳 👨‍🚀 👨‍🔧 --- No outer product in Executor, only element-wise operators ("👩", "👨")~each{(item || .Unicode["zero width joiner"]~text) || ("🔬", "🏫", "🎓", "🍳", "🚀", "🔧")}== an Array (shape [2], 2 items) 1 : [T'👩‍🔬',T'👩‍🏫',T'👩‍🎓',T'👩‍🍳',T'👩‍🚀',T'👩‍🔧'] 2 : [T'👨‍🔬',T'👨‍🏫',T'👨‍🎓',T'👨‍🍳',T'👨‍🚀',T'👨‍🔧'] --- In:= CharacterRange[74000, 74050] Out= {𒄐, 𒄑, 𒄒, 𒄓, 𒄔, 𒄕, 𒄖, 𒄗, 𒄘, 𒄙, 𒄚, 𒄛, 𒄜, 𒄝, 𒄞, 𒄟, 𒄠, 𒄡, 𒄢, 𒄣, > 𒄤, 𒄥, 𒄦, 𒄧, 𒄨, 𒄩, 𒄪, 𒄫, 𒄬, 𒄭, 𒄮, 𒄯, 𒄰, 𒄱, 𒄲, 𒄳, 𒄴, 𒄵, 𒄶, 𒄷, > 𒄸, 𒄹, 𒄺, 𒄻, 𒄼, 𒄽, 𒄾, 𒄿, 𒅀, 𒅁, 𒅂} In:= FromCharacterCode[{2361, 2367}] Out= हि In:= Characters["हि"] In:= Characters["\:0939\:093f"] Out= {ह, ि} In:= Characters["o\:0308"] Out= {o, ̈} In:= CharacterNormalize["o\:0308", "NFC"] Out= ö In:= ToCharacterCode[%] Out= {246}

netrexx lang


https://groups.io/g/netrexx/topic/93734685 Unicode Examples (this is not NetRexx, but this answer is useful for NetRexx) https://stackoverflow.com/questions/63410278/code-point-and-utf-16-code-units-are-the-same-thing code point and UTF-16 code units are the same thing? No, they are different. I know, MDN uses the rarely used "code units" term, which confuses people a lot. Code points are the number given to a Unicode element (character). This is independent of the encoding, and it can be as high as 0x10FFFF. UTF-32 code units are equivalent to Unicode code points (if you are using the correct endianness). Code units in UTF-16 are units of 16bit data. UTF-16 uses 1 or 2 code units to describe a code point, depending on its value. Code points below (or equal) to 0xFFFF (the old limit/expectation of Unicode that such numbers were enough to encode all characters) use just 1 code unit, and its value is the same as the code point. Unicode expanded the code point space, so now code points between 0x010000..0x10FFFF require 2 code units (and we use "surrogates" to encode such characters), 4 bytes total. So, code points are not the same as code units. For UTF-16, code units are 16bit long, and code points could be 1 or 2 code units. (this is JavaScript, but this answer is useful for NetRexx) https://exploringjs.com/impatient-js/ch_unicode.html#:~:text=Code%20units%20are%20numbers%20that,has%208%2Dbit%20code%20units. each UTF-16 code unit is always either a leading surrogate, a trailing surrogate, or encodes a BMP code point BMP = Basic Multilingual Plane (0x0000–0xFFFF) (this is JavaScript, but this answer is useful for NetRexx) https://www.w3schools.com/jsref/jsref_codepointat.asp#:~:text=Difference%20Between%20charCodeAt()%20and%20codePointAt()&text=charCodeAt()%20returns%20a%20number,value%20greather%200xFFFF%20(65535). Difference Between charCodeAt() and codePointAt() charCodeAt() is UTF-16, codePointAt() is Unicode. charCodeAt() returns a number between 0 and 65535. Both methods return an integer representing the UTF-16 code of a character, but only codePointAt() can return the full value of a Unicode value greater than 0xFFFF (65535). (this is Unicode, but this answer is useful for NetRexx) https://www.unicode.org/faq/utf_bom.html#:~:text=Surrogates%20are%20code%20points%20from,DC0016%20to%20DFFF16. What are surrogates? Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading surrogates, also called high surrogates, are encoded from D800 to DBFF, and trailing surrogates, or low surrogates, from DC00 to DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair. What is the difference between UCS-2 and UTF-16? UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters. Sometimes in the past an implementation has been labeled “UCS-2” to indicate that it does not support supplementary characters and doesn’t interpret pairs of surrogate code points as characters.
Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters, nor would it be able to support most emoji, for example. [AF] (this is Unicode, but this answer is useful for NetRexx) Unicode standard How the word "surrogate" is used surrogate pair surrogate code unit surrogate code point leading surrogate trailing surrogate high-surrogate code point high-surrogate code unit low-surrogate code point low-surrogate code unit Surrogates D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF. D72 High-surrogate code unit: A 16-bit code unit in the range D800 to DBFF, used in UTF-16 as the leading code unit of a surrogate pair. D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF. D74 Low-surrogate code unit: A 16-bit code unit in the range DC00 to DFFF, used in UTF-16 as the trailing code unit of a surrogate pair. UTF-16 In the UTF-16 encoding form, non-surrogate code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs. The values of the code units used for surrogate pairs are completely disjunct from the code units used for the single code unit representations, thus maintaining non-overlap for all code point representations in UTF-16. Code Points Unassigned to Abstract Characters C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. • The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character. (this is Java, but this answer is useful for NetRexx) https://stackoverflow.com/questions/39955169/which-encoding-does-java-uses-utf-8-or-utf-16/39957184#39957184 Which encoding does Java uses UTF-8 or UTF-16? Note that new String(bytes, StandardCharsets.UTF_16); does not "convert it to UTF-16 explicitly". This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding. You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion. Edit: in fact, Java 9 introduced just such a change in internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.
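(this is Java, but it may be useful for NetRexx) A minimal sketch of the point above: getBytes(Charset) and new String(byte[], Charset) convert to and from an external encoding, and say nothing about how the String is stored internally:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        String s = "€";  // U+20AC EURO SIGN

        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);     // E2 82 AC (3 bytes)
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);  // 20 AC (2 bytes, big-endian, no BOM)
        System.out.println(Arrays.toString(utf8));   // [-30, -126, -84]
        System.out.println(Arrays.toString(utf16));  // [32, -84]

        // The constructor decodes the given bytes *from* the given charset;
        // how the resulting String is stored internally is the JVM's business
        // (UTF-16, or Latin-1 with compact strings since Java 9).
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.equals(back));  // true
    }
}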
(this thread contains answers useful for NetRexx) https://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/ UCS vs UTF-8 as Internal String Encoding jlf: Very good introduction 100% applicable to NetRexx https://news.ycombinator.com/item?id=9618306 jlf: comments about the blog above. Interesting comments about the need or non-need to have direct access to code units or "characters" in constant time. --- Unicode provides 3 classes of grapheme clusters (legacy, extended and tailored) at least one of which (tailored) is locale-dependent (`ch` is a single tailored grapheme cluster under the Slovak locale, because it's the ch digraph). --- A text-editing control is thinking in terms of "grapheme clusters", not in terms of codepoints. jlf: not true. BBEdit works at codepoint level. 👩‍👨‍👩‍👧🎅 2 graphemes, 8 codepoints, 29 bytes c2x = 'F09F91A9 E2808D F09F91A8 E2808D F09F91A9 E2808D F09F91A7 F09F8E85' c2u = 'U+1F469 U+200D U+1F468 U+200D U+1F469 U+200D U+1F467 U+1F385' c2g = 'F09F91A9E2808DF09F91A8E2808DF09F91A9E2808DF09F91A7 F09F8E85' In BBEdit, I see 8 "characters" and can move the cursor between each "character". The ZERO WIDTH JOINER codepoints are visible. In VSCode, I see 2 "characters". --- What is the practicality of an unbounded number of possible graphemes? The standard itself doesn't mention any bounds but there is Unicode Standard Annex #15 - Unicode Normalization Forms which defines the Stream-Safe Text Format. UAX15-D3. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD. --- This sub-part of the thread is exactly what we are discussing for NetRexx https://news.ycombinator.com/item?id=9620112 --- jlf: next description is exactly what I do in the Executor prototype. What you really want is constant-time dereferencing of designators for semantically meaningful substrings. But no language AFAIK actually has that. The fundamental problem is that most languages have painted themselves into a corner by carving into stone the fact that strings can be dereferenced by integers. Once you've done that, you're pretty much screwed. It's not that you can't make it work, it's just that it requires an awful lot of machinery. You basically need to build an index for every string you construct, and that can get very expensive. Fixed-width representations are a less-than-perfect-but-still-not-entirely-unreasonable engineering solution to this problem. (this is Python, but this link could be useful for NetRexx c2x and x2c) https://docs.python.org/3/library/struct.html Interpret bytes as packed binary data
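(this is Java, but it may be useful for NetRexx) Going back to the 👩‍👨‍👩‍👧🎅 example above (2 graphemes, 8 codepoints, 29 bytes), a hedged sketch of how the three counts can be obtained; the grapheme count assumes a JDK whose BreakIterator follows extended grapheme clusters, which according to the Java section above is the case since JDK 20:

import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class GraphemeCounts {
    public static void main(String[] args) {
        String s = "👩‍👨‍👩‍👧🎅";

        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 29 bytes in UTF-8
        System.out.println(s.codePointCount(0, s.length()));            // 8 code points
        System.out.println(s.length());                                 // 13 UTF-16 code units

        // Grapheme clusters: should print 2 on a JDK with extended grapheme cluster support.
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) graphemes++;
        System.out.println(graphemes);
    }
}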

Oracle


https://docs.oracle.com/database/121/NLSPG/ch5lingsort.htm#NLSPG288 Database Globalization Support Guide 5 Linguistic Sorting and Matching Complex! Did not read it in detail; maybe I should... https://docs.oracle.com/database/121/NLSPG/ch6unicode.htm#NLSPG323 Database Globalization Support Guide 6 Supporting Multilingual Databases with Unicode https://docs.oracle.com/database/121/NLSPG/ch7progrunicode.htm#NLSPG346 Database Globalization Support Guide 7 Programming with Unicode

Perl lang (Perl 6 has been renamed to Raku)


https://swigunicode.wordpress.com/2021/10/18/example-post-3/ SWIG and Perl: Unicode C Library Part 1. Small Intro to SWIG https://swigunicode.wordpress.com/2021/10/22/part-2-c-header-file/ Part 2. C Header File https://swigunicode.wordpress.com/2021/10/24/part-3-c-source-file/ Part 3. C Source File https://swigunicode.wordpress.com/2021/10/25/part-4-perl-source-file/ Part 4. Perl Source File https://swigunicode.wordpress.com/2021/10/26/part-5-build-and-run-scripts/ Part 5. Build and Run Scripts https://swigunicode.wordpress.com/2021/10/27/part-6-swig-interface-file/ Part 6. SWIG Interface File https://lwn.net/Articles/667684/ An article about NFG. Unless one specifies otherwise, Perl 6 normalizes a text string to NFC; its default NFG representation (which gives each grapheme its own synthetic code point) is built on top of that.

PHP lang


https://github.com/nicolas-grekas/Patchwork-UTF8 Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP https://kunststube.net/encoding/ jlf: First, a general introduction to encoding. Then a focus on PHP. https://www.php.net/manual/en/function.iconv.php iconv — Convert a string from one character encoding to another iconv(string $from_encoding, string $to_encoding, string $string): string|false https://www.php.net/manual/en/book.mbstring.php Multibyte String replicates all important string functions in a multi-byte aware fashion. Because the mb_ functions now have to actually think about what they're doing, they need to know what encoding they're working on. Therefore every mb_ function accepts an $encoding parameter as well. Alternatively, this can be set globally for all mb_ functions using mb_internal_encoding. https://www.php.net/manual/en/mbstring.overload.php Warning This feature has been DEPRECATED as of PHP 7.2.0, and REMOVED as of PHP 8.0.0. Relying on this feature is highly discouraged. --- mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled. This feature makes it easy to port applications that only support single-byte encodings to a multibyte environment in many cases. --- jlf: the few user's comments are all negative. hum... this is one of the choices we foresee for Rexx. Bad idea? Example: "In short, only use mbstring.func_overload if you are 100% certain that nothing on your site relies on manipulating binary data in PHP." Search PHP souces: grapheme https://heap.space/search?project=PHP-8.2&full=grapheme&defs=&refs=&path=&hist=&type= https://news-web.php.net/group.php?group=php.i18n php.i18n Most recent: 08 Feb 2018 (?) https://news-web.php.net/php.i18n/1439 Unicode support with UString abstraction layer 21/03/2012 by Umberto Salsi jlf: no URL https://wiki.php.net/rfc/ustring UString is much quicker than mbstring thanks to the use of ICU. https://www.reddit.com/r/PHP/comments/2jvvol/rfc_ustring/ Low enthusiasm on reddit... https://github.com/krakjoe/ustring UnicodeString for PHP7 dead, last commit on Mar 17, 2016 https://github.com/nicolas-grekas/Patchwork-UTF8 Patchwork-UTF8 Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP Dead? last commit was on May 18, 2016 https://blog.internet-formation.fr/2022/08/nettoyer-et-remplacer-les-homographes-et-homoglyphes-dun-texte-en-php/ Nettoyer et remplacer les homographes (et homoglyphes) d’un texte en PHP

Python lang


https://github.com/dabeaz-course/practical-python/blob/master/Notes/01_Introduction/04_Strings.md Practical Python Programming. A course by David Beazley jlf:good introduction to Python strings. https://www.youtube.com/watch?v=Nfqh6lr3frQ The Guts of Unicode in Python Benjamin Peterson This talk will examine how Python's internal Unicode representation has changed from its introduction through the latest major changes in Python 3.3. jlf: not too long (28 min), good overview. 10/08/2021 List of Python PEPS related to string. https://www.python.org/dev/peps/ Other Informational PEPs I 257 Docstring Conventions Goodger, GvR I 287 reStructuredText Docstring Format Goodger Accepted PEPs (accepted; may not be implemented yet) SA 675 Arbitrary Literal String Type SA 686 Make UTF-8 mode default SA 701 Syntactic formalization of f-strings Open PEPs (under consideration) Finished PEPs (done, with a stable interface) SF 100 Python Unicode Integration Lemburg SF 260 Simplify xrange() GvR SF 261 Support for "wide" Unicode characters Prescod SF 263 Defining Python Source Code Encodings Lemburg, von Löwis SF 277 Unicode file name support for Windows NT Hodgson SF 278 Universal Newline Support Jansen SF 292 Simpler String Substitutions Warsaw SF 331 Locale-Independent Float/String Conversions Reis SF 383 Non-decodable Bytes in System Character Interfaces von Löwis SF 393 Flexible String Representation v. Löwis SF 414 Explicit Unicode Literal for Python 3.3 Ronacher, Coghlan SF 498 Literal String Interpolation Smith SF 515 Underscores in Numeric Literals Brandl, Storchaka SF 528 Change Windows console encoding to UTF-8 Dower SF 529 Change Windows filesystem encoding to UTF-8 Dower SF 538 Coercing the legacy C locale to a UTF-8 based locale Coghlan SF 540 Add a new UTF-8 Mode Stinner SF 597 Add optional EncodingWarning Naoki SF 616 String methods to remove prefixes and suffixes Sweeney SF 623 Remove wstr from Unicode Naoki SF 624 Remove Py_UNICODE encoder APIs Naoki SF 3101 Advanced String Formatting Talin SF 3112 Bytes literals in Python 3000 Orendorff SF 3120 Using UTF-8 as the default source encoding von Löwis SF 3127 Integer Literal Support and Syntax Maupin SF 3131 Supporting Non-ASCII Identifiers von Löwis SF 3137 Immutable Bytes and Mutable Buffer GvR SF 3138 String representation in Python 3000 Ishimoto Deferred PEPs (postponed pending further research or updates) SD 501 General purpose string interpolation Coghlan SD 536 Final Grammar for Literal String Interpolation Angerer SD 558 Defined semantics for locals() Coghlan Abandoned, Withdrawn, and Rejected PEPs SS 215 String Interpolation Yee IR 216 Docstring Format Zadka SR 224 Attribute Docstrings Lemburg SR 256 Docstring Processing System Framework Goodger SR 295 Interpretation of multiline string constants Koltsov SR 332 Byte vectors and String/Unicode Unification Montanaro SR 349 Allow str() to return unicode strings Schemenauer IR 502 String Interpolation - Extended Discussion Miller SR 3126 Remove Implicit String Concatenation Jewett, Hettinger 15/07/2021 review https://docs.python.org/3/howto/unicode.html Escape sequences in string literals "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name '\u0394' "\u0394" # Using a 16-bit hex value '\u0394' "\U00000394" # Using a 32-bit hex value '\u0394' One can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument. 
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), 'backslashreplace' (inserts a \xNN escape sequence). Examples: b'\x80abc'.decode("utf-8", "strict") # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0 b'\x80abc'.decode("utf-8", "replace") # '\ufffdabc' b'\x80abc'.decode("utf-8", "backslashreplace") # '\\x80abc' b'\x80abc'.decode("utf-8", "ignore") # 'abc' Encodings are specified as strings containing the encoding’s name. Python comes with roughly 100 different encodings: https://docs.python.org/3/library/codecs.html#standard-encodings One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point: chr(57344) # '\ue000' The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value: ord('\ue000') # 57344 The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding. The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. 'strict' (raise a UnicodeEncodeError exception), 'replace' inserts a question mark instead of the unencodable character, 'ignore' (just leave the character out of the encoded result), 'backslashreplace' (inserts a \uNNNN escape sequence), 'xmlcharrefreplace' (inserts an XML character reference), 'namereplace' (inserts a \N{...} escape sequence). Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four: >>> s = "a\xac\u1234\u20ac\U00008000" ... # ^^^^ two-digit hex escape ... # ^^^^^^ four-digit Unicode escape ... # ^^^^^^^^^^ eight-digit Unicode escape >>> [ord(c) for c in s] [97, 172, 4660, 8364, 32768] Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file: #!/usr/bin/env python # -*- coding: latin-1 -*- u = 'abcdé' https://www.python.org/dev/peps/pep-0263/ PEP 263 -- Defining Python Source Code Encodings Comparing Strings The casefold() string method converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter ‘ß’ (code point U+00DF), which becomes the pair of lowercase letters ‘ss’. >>> street = 'Gürzenichstraße' >>> street.casefold() 'gürzenichstrasse' The unicodedata module’s normalize() function converts strings to one of several normal forms: ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. def compare_strs(s1, s2): def NFD(s): return unicodedata.normalize('NFD', s) return NFD(s1) == NFD(s2) The Unicode Standard also specifies how to do caseless comparisons: def compare_caseless(s1, s2): def NFD(s): return unicodedata.normalize('NFD', s) return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold()) Why is NFD() invoked twice?
Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard https://docs.python.org/3/library/unicodedata.html unicodedata.lookup(name) Look up character by name. If a character with the given name is found, return the corresponding character. If not found, KeyError is raised. Changed in version 3.3: Support for name aliases 1 and named sequences 2 has been added. unicodedata.name(chr[, default]) Returns the name assigned to the character chr as a string. unicodedata.decimal(chr[, default]) Returns the decimal value assigned to the character chr as integer. unicodedata.digit(chr[, default]) Returns the digit value assigned to the character chr as integer. unicodedata.numeric(chr[, default]) Returns the numeric value assigned to the character chr as float. unicodedata.category(chr) Returns the general category assigned to the character chr as string. unicodedata.bidirectional(chr) Returns the bidirectional class assigned to the character chr as string. unicodedata.combining(chr) Returns the canonical combining class assigned to the character chr as integer. Returns 0 if no combining class is defined. unicodedata.east_asian_width(chr) Returns the east asian width assigned to the character chr as string. unicodedata.mirrored(chr) Returns the mirrored property assigned to the character chr as integer. Returns 1 if the character has been identified as a “mirrored” character in bidirectional text, 0 otherwise. unicodedata.decomposition(chr) Returns the character decomposition mapping assigned to the character chr as string. An empty string is returned in case no such mapping is defined. unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. unicodedata.is_normalized(form, unistr) Return whether the Unicode string unistr is in the normal form form. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. unicodedata.unidata_version The version of the Unicode database used in this module. unicodedata.ucd_3_2_0 This is an object that has the same methods as the entire module, but uses the Unicode database version 3.2 instead https://www.python.org/dev/peps/pep-0393/ PEP 393 -- Flexible String Representation When creating new strings, it was common in Python to start of with a heuristical buffer size, and then grow or shrink if the heuristics failed. With this PEP, this is now less practical, as you need not only a heuristics for the length of the string, but also for the maximum character. In order to avoid heuristics, you need to make two passes over the input: once to determine the output length, and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may sound expensive, it could actually be cheaper than having to copy the result again as in the following approach. If you take the heuristical route, avoid allocating a string meant to be resized, as resizing strings won't work for their canonical representation. Instead, allocate a separate buffer to collect the characters, and then construct a unicode object from that using PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer element, assuming for the worst case in character ordinals. This will allow for pointer arithmetics, but may require a lot of memory. 
Alternatively, start with a 1-byte buffer, and increase the element size as you encounter larger characters. In any case, PyUnicode_FromKindAndData will scan over the buffer to verify the maximum character. 15/07/2021 https://docs.python.org/3/library/codecs.html Codec registry and base classes Most standard codecs are text encodings, which encode text to bytes, but there are also codecs provided that encode text to text, and bytes to bytes. errors string argument: strict ignore replace xmlcharrefreplace backslashreplace namereplace surrogateescape surrogatepass 15/07/2021 https://discourse.julialang.org/t/a-python-rant-about-types/43294/22 A Python rant about types jlf: the main discussion is about invalid string data. Stefan Karpinski describes the Julia strings: 1. You can read and write any data, valid or not. 2. It is interpreted as UTF-8 where possible and as invalid characters otherwise. 3. You can simply check if strings or chars are valid UTF-8 or not. 4. You can work with individual characters easily, even invalid ones. 5. You can losslessly read and write any string data, valid or not, as strings or chars. 6. You only get an error when you try to ask for the code point of an invalid char. Most Julia code that works with strings is automatically robust with respect to invalid UTF-8 data. Only code that needs to look at the code points of individual characters will fail on invalid data; in order to do that robustly, you simply need to check if the character is valid before taking its code point and handle that appropriately. jlf: I think that all the Julia methods working at character level will raise an error, not just when looking at the code point. jlf: Stefan Karpinski explains why Python design is problematic. Python 3 has to be able to represent any input string in terms of code points. Needing to turn every string into a fixed-width sequence of code points puts them in a tough position with respect to invalid strings where there is simply no corresponding sequence of code points. 17/07/2021 https://groups.google.com/g/python-ideas/c/wStIS1_NVJQ Fix default encodings on Windows jlf: did not read in details, too long, too many feedbacks. Maybe some comments are interesting, so I save this URL. https://djangocas.dev/blog/python-unicode-string-lowercase-casefold-caseless-match/ Interesting infos about caseless matching https://gist.github.com/dpk/8325992 PyICU cheat sheet 10/05/2023 https://github.com/python/cpython/issues/56938 original URL before migration to github: https://bugs.python.org/issue12729 Python lib re cannot handle Unicode properly due to narrow/wide bug jlf: TODO not yet read, but seems interesting. I found this link thanks to https://news.ycombinator.com/item?id=9618306 (referenced in the NetRexx section) https://peps.python.org/pep-0414/ PEP 414 – Explicit Unicode Literal for Python 3.3 Specifically, the Python 3 definition for string literal prefixes will be expanded to allow: "u" | "U" in addition to the currently supported: "r" | "R" The following will all denote ordinary Python 3 strings: 'text' "text" '''text''' """text""" u'text' u"text" u'''text''' u"""text""" U'text' U"text" U'''text''' U"""text""" Types of string and their methods: string "H" "H"[0] # "H" unicode string u"H" u"H"[0] # "H" byte string b"H" b"H"[0] # 72 string of 8-bit bytes raw string r"H" r"H"[0] # "H" string literals with an uninterpreted backslash. f-string f"H" f"H"[0] # "H" string with formatted expression substitution. 
dir(""), dir(f""), dir(r"") dir(b"") ------------------------------------------------- __add__ __add__ __bytes__ __class__ __class__ __contains__ __contains__ __delattr__ __delattr__ __dir__ __dir__ __doc__ __doc__ __eq__ __eq__ __format__ __format__ __ge__ __ge__ __getattribute__ __getattribute__ __getitem__ __getitem__ __getnewargs__ __getnewargs__ __getstate__ __getstate__ __gt__ __gt__ __hash__ __hash__ __init__ __init__ __init_subclass__ __init_subclass__ __iter__ __iter__ __le__ __le__ __len__ __len__ __lt__ __lt__ __mod__ __mod__ __mul__ __mul__ __ne__ __ne__ __new__ __new__ __reduce__ __reduce__ __reduce_ex__ __reduce_ex__ __repr__ __repr__ __rmod__ __rmod__ __rmul__ __rmul__ __setattr__ __setattr__ __sizeof__ __sizeof__ __str__ __str__ __subclasshook__ __subclasshook__ capitalize capitalize casefold center center count count decode encode endswith endswith expandtabs expandtabs find find format format_map fromhex hex index index isalnum isalnum isalpha isalpha isascii isascii isdecimal isdigit isdigit isidentifier islower islower isnumeric isprintable isspace isspace istitle istitle isupper isupper join join ljust ljust lower lower lstrip lstrip maketrans maketrans partition partition removeprefix removeprefix removesuffix removesuffix replace replace rfind rfind rindex rindex rjust rjust rpartition rpartition rsplit rsplit rstrip rstrip split split splitlines splitlines startswith startswith strip strip swapcase swapcase title title translate translate upper upper zfill zfill https://stackoverflow.com/questions/72371202/unicodedecodeerror-utf-8-codec-cant-decode-byte-0x97-in-position-3118-inval UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3118: invalid start byte Simple text file [duplicate] (jlf: just keeping a note for the example) It seems like the file is not encoded in utf-8. Could you try open the file using io.open with latin-1 encoding instead? https://docs.python.org/3/library/functions.html#open --- (example) from textblob import TextBlob import io with io.open("positive.txt", encoding='latin-1') as f: for line in f.read().split('\n'): # do what you want with line --- https://github.com/life4/textdistance Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage. Reimplemented in Rust by the same author: https://github.com/life4/textdistance.rs Testing the JMB's example "ς".upper() # 'Σ' "σ".upper() # 'Σ' "ὈΔΥΣΣΕΎΣ".lower() # 'ὀδυσσεύς' last Σ becomes ς "ὈΔΥΣΣΕΎΣA".lower() # 'ὀδυσσεύσa' last Σ becomes σ # Humm... the concatenation doesn't change ς to σ "ὈΔΥΣΣΕΎΣ".lower() + "A" # 'ὀδυσσεύςA' ("ὈΔΥΣΣΕΎΣ".lower() + "A").upper() # 'ὈΔΥΣΣΕΎΣA' ("ὈΔΥΣΣΕΎΣ".lower() + "A").upper().lower() # 'ὀδυσσεύσa' https://news.ycombinator.com/item?id=33984308 The History and rationale of the Python 3 Unicode model for the operating system (vstinner.github.io) jlf: HN comments about this old blog https://vstinner.github.io/python30-listdir-undecodable-filenames.html https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h (search "Unicode Type") CPython source code of Unicode string This URL comes from https://blog.vito.nyc/posts/gil-balm/ Fast string construction for CPython extensions https://python.developpez.com/tutoriels/plonger-au-coeur-de-python/?page=chapitre-4-moins-strings jlf: todo read (french) Translation from english, could not find the original article.
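Runnable recap of the HOWTO's caseless comparison quoted above (compare_caseless with the double NFD); only the standard library is needed, the test strings are added here as examples.
import unicodedata
def compare_caseless(s1, s2):
    nfd = lambda s: unicodedata.normalize("NFD", s)
    # casefold() may return a non-normalized string (Unicode section 3.13), hence the outer NFD
    return nfd(nfd(s1).casefold()) == nfd(nfd(s2).casefold())
compare_caseless("Gürzenichstraße", "GÜRZENICHSTRASSE")   # True: ß casefolds to ss
compare_caseless("noël", "noe\u0308l")                    # True: composed vs decomposed ë
"ὈΔΥΣΣΕΎΣ".lower(), "ὈΔΥΣΣΕΎΣ".casefold()                 # ('ὀδυσσεύς', 'ὀδυσσεύσ') lower keeps final ς, casefold maps it to σ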

R lang


https://stringi.gagolewski.com/index.html stringi: Fast and Portable Character String Processing in R stringi (pronounced “stringy”, IPA [strinɡi]) is THE R package for very fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding. Thanks to ICU, stringi fully supports a wide range of Unicode standards. Paper (PDF): https://www.jstatsoft.org/index.php/jss/article/view/v103i02/4324 https://github.com/gagolews/stringi Fast and Portable Character String Processing in R (with the Unicode ICU)
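stringi gets its collation from ICU; as a cross-language reference point, the same ICU collator is reachable from Python through PyICU (see the PyICU cheat sheet in the Python section). A minimal sketch, assuming PyICU is installed; it reproduces the sort/collate contrast shown in the Raku section below.
from icu import Collator, Locale
words = ["a", "b", "c", "D", "E", "F"]
sorted(words)                                    # ['D', 'E', 'F', 'a', 'b', 'c'] code point order
coll = Collator.createInstance(Locale("en_US"))
sorted(words, key=coll.getSortKey)               # ['a', 'b', 'c', 'D', 'E', 'F'] UCA order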

RAKU lang Rakudo lang (Perl6, Perl 6, MOAR-VM)


https://raku-advent.blog/2022/12/23/sigils-2/ jlf: not related to unicode, but good for general culture. A sigil is any non-alphabetic character that’s used at the front of a word, and that conveys meta information about the word. For example, hashtags are a sigil: the  #  in  #nofilter  is a sigil that communicates that “nofilter” is a tag (not a regular word of text). The Raku programming language uses sigils to mark its variables; Raku has four sigils: @  (normally associated with arrays), can only be used for types that implement the  Positional  (“array-like”) role %  (normally associated with hashes), can only be used for types that implement the  Associative  (“hash-like”) role &  (normally associated with functions) can only be used for types that implement the  Callable  (“function-like”) role $  (for other variables, such as numbers and strings). https://dev.to/lizmat/series/24075 Migrating Perl to Raku Series' Articles jlf: not related to unicode, but good for general culture. http://docs.p6c.org/routine.html Raku Routines This is a list of all built-in routines that are documented here as part of the Raku language. jlf: not related to unicode, but good for general culture. https://www.learningraku.com/2016/11/26/quick-tip-11-number-strings-and-numberstring-allomorphs/ Quick Tip #11: Number, Strings, and NumberString Allomorphs jlf: maybe the same as ooRexx string numbers? https://docs.raku.org/type/Stringy String or object that can act as a string (role) https://rakudocs.github.io/type/Allomorph Dual value number and string (class) https://docs.raku.org/type/IntStr Dual value integer and string (class) https://docs.raku.org/type/RatStr Dual value rational number and string (class) https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc MoarVM string documentation. jlf: little intro, no detailled API. https://docs.raku.org/type/Str class Str Built-in class for strings. Objects of type Str are immutable. https://docs.raku.org/type/Uni class Uni A string of Unicode codepoints Unlike Str, which is made of Grapheme clusters, Uni is string strictly made of Unicode codepoints. That is, base characters and combining characters are separate elements of a Uni instance. Uni presents itself with a list-like interface of integer Codepoints. Typical usage of Uni is through one of its subclasses, NFC, NFD, NFKD and NFKC, which represent strings in one of the Unicode Normalization Forms of the same name. https://course.raku.org/essentials/strings/string-concatenation/ String concatenation jlf: strange... the concatenation is not described in the doc of Str. In Raku, you concatenate strings using concatenation operator. This operator is a tilde: ~. my $greeting = 'Hello, '; my $who = 'World!'; say $greeting ~ $who; Concatenation with assignment $str = $str ~ $another-str; $str ~= $another-str; https://www.codesections.com/blog/raku-unicode/ A deep dive into Raku's Unicode support Grepping for "Unicode Character Database" brings us to unicode_db.c. https://github.com/MoarVM/MoarVM/blob/master/src/strings/unicode_db.c 29/05/2021 http://moarvm.com/releases.html 2017.07 Greatly reduce the cases when string concatenation needs renormalization Use normalize_should_break to decide if concat needs normalization Rename should_break to MVM_unicode_normalize_should_break Fix memory leak in MVM_nfg_is_concat_stable If both last_a and first_b during concat are non-0 CCC, re-NFG --> maybe to review : the last sentence seems to be an optimization of concatenation. 
2017.02 Implement support for synthetic graphemes in MVM_unicode_string_compare Implement configurable collation_mode for MVM_unicode_string_compare 2017.01 Add a new unicmp_s op, which compares using the Unicode Collation Algorithm Add support for Grapheme_Cluster_Break=Prepend from Unicode 9.0 Add a script to download the latest version of all of the Unicode data --> should review this script 2015.11 NFG now uses Unicode Grapheme Cluster algorithm; "\r\n" is now one grapheme --> ??? [later] ah, I had a bug! Was not analyzing an UTF-8 ASCII string... Now fixed: "0A0D"x~text~description= -- UTF-8 ASCII ( 2 graphemes, 2 codepoints, 2 bytes ) "0D0A"x~text~description= -- UTF-8 ASCII ( 1 grapheme, 2 codepoints, 2 bytes ) 29/05/2021 https://news.ycombinator.com/item?id=26591373 String length functions for single emoji characters evaluate to greater than 1 --> to check : MOAR VM really concatenate a 8bit string with a 32bit string using a string concatenation object ? You could do it the way Raku does. It's implementation defined. (Rakudo on MoarVM) The way MoarVM does it is that it does NFG, which is sort of like NFC except that it stores grapheme clusters as if they were negative codepoints. If a string is ASCII it uses an 8bit storage format, otherwise it uses a 32bit one. It also creates a tree of immutable string objects. If you do a substring operation it creates a substring object that points at an existing string object. If you combine two strings it creates a string concatenation object. Which is useful for combining an 8bit string with a 32bit one. All of that is completely opaque at the Raku level of course. my $str = "\c[FACE PALM, EMOJI MODIFIER FITZPATRICK TYPE-3, ZWJ, MALE SIGN, VARIATION SELECTOR-16]"; say $str.chars; # 1 say $str.codes; # 5 say $str.encode('utf16').elems; # 7 say $str.encode('utf16').bytes; # 14 say $str.encode.elems; # 17 say $str.encode.bytes; # 17 say $str.codes * 4; # 20 #(utf32 encode/decode isn't implemented in MoarVM yet) say for $str.uninames; # FACE PALM # EMOJI MODIFIER FITZPATRICK TYPE-3 # ZERO WIDTH JOINER # MALE SIGN # VARIATION SELECTOR-16 The reason we have utf8-c8 encode/decode is because filenames, usernames, and passwords are not actually Unicode. (I have 4 files all named rèsumè in the same folder on my computer.) utf8-c8 uses the same synthetic codepoint system as grapheme clusters. https://andrewshitov.com/2018/10/31/unicode-in-perl-6/ Unicode in Raku https://docs.raku.org/language/unicode Raku applies normalization by default to all input and output except for file names, which are read and written as UTF8-C8 UTF-8 Clean-8 is an encoder/decoder that primarily works as the UTF-8 one. However, upon encountering a byte sequence that will either not decode as valid UTF-8, or that would not round-trip due to normalization, it will use NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 will be able to recreate the bytes as they originally existed. https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc Strings in MoarVM Strands Strands are a type of MVMString which instead of being a flat string with contiguous data, actually contains references to other strings. Strands are created during concatenation or substring operations. When two flat strings are concatenated together, a Strand with references to both string a and string b is created. 
If string a and string b were strands themselves, the references of string a and references of string b are copied one after another into the Strand. Synthetic’s Synthetics are graphemes which contain multiple codepoints. In MoarVM these are stored and accessed using a trie, while the actual data itself stores the base character seprately and then the combiners are stored in an array. Currently the maximum number of combiners in a synthetic is 1024. MoarVM will throw an exception if you attempt to create a grapheme with more than 1024 codepoints in it. Normalization MoarVM normalizes into NFG form all input text. NFG Normalization Form Grapheme. Similar to NFC except graphemes which contain multiple codepoints are stored in Synthetic graphemes. https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/ Types Str type: graphemes say "नि".codes; # returns 2 say "नि".chars; # returns 1 say "\r\n".chars; # returns 1 NFC, NFD, NFKC, NFKD: types (jlf: types? really?) Uni: work with codepoints, no normalization (keep text as-is) Blob: family of types to work at the binary level Unicode source code say 0 ∈ «42 -5 1».map(&log ∘ &abs); say 0.1e0 + 0.2e0 ≅ 0.3e0; say 「There is no \escape in here!」 "Texas" source code say 0 (elem) <<42 -5 1>>.map(&log o &abs); say 0.1e0 + 0.2e0 =~= 0.3e0; say Q[[[There is no \escape in here!]]] https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/#comment-14302 jlf: interesting critics about graphemes. See also the comment after, which provides answers to the critics. https://lwn.net/Articles/667036/ Unicode, Perl 6, and You jlf: interesting opinions. https://en.wikipedia.org/wiki/Devanagari#Conjunct_consonants jlf: this is executable code (what is this notation < षि > ?) < षि > .NFC .say # NFC:0x<0937 093f> < षि > .NFKC .say # NFD:0x<0937 093f> < षि > .NFD .say # NFKC:0x<0937 093f> < षि > .NFKD .say # NFKD:0x<0937 093f> Particularly interesting, this subthread: https://lwn.net/Articles/667669/ Is the current Unicode design impractical? jlf tests # Returns a list of Unicode codepoint numbers that describe the codepoints making up the string "aå«".ords # (97 229 171) # Returns the codepoint number of the base characters of the first grapheme in the string "å«".ord # 229 "Bundesstraße im Freiland".lc # bundesstraße im freiland "Bundesstraße im Freiland".uc # BUNDESSTRASSE IM FREILAND "Bundesstraße im Freiland".fc # bundesstrasse im freiland "Bundesstraße im Freiland".index("Freiland") # 16 (start at 0) (executor: 17) "Bundesstraße im Freiland".index("freiland", :ignorecase) # 16 # Bundesstraße sss sßs ss # 01234567890123456789012 # | | || || | "Bundesstraße sss sßs ss".indices("ss") # (5 13 21) "Bundesstraße sss sßs ss".indices("ss", :overlap) # (5 13 14 21) "Bundesstraße sss sßs ss".indices("ss", :ignorecase) # (5 10 13 18 21) "Bundesstraße sss sßs ss".indices("ss", :ignorecase, :overlap) # (5 10 13 14 18 21) not 17? 
"Bundesstraße sss sßs ss".indices("s", :ignorecase, :overlap) # (5 6 13 14 15 17 19 21 22) "Bundesstraße sss sßs ss".indices("sSs", :ignorecase, :overlap) # (13 17 18) "Bundesstraße sss sßs ss".indices("sSsS", :ignorecase, :overlap) # (17) "Bündesstraße sss sßs ss".fc # bundesstrasse sss ssss ss # 0123456789012345678901234 # | | || ||| | "Bündëssträßë sss sßs ss".fc.indices("ss") # (5 10 14 18 20 23) "Bündëssträßë sss sßs ss".fc.indices("ss", :overlap) # (5 10 14 15 18 19 20 23) # straßssßßssse # 0123456789012 # || |||| "straßssßßssse".indices("Ss", :ignorecase) # (4 7 9) "straßssßßssse".indices("Ss", :ignorecase, :overlap) # (4 5 7 8 9 10) "TÊt\c[TAG SPACE]e".chars # 4, "t" + "TAG SPACE" is one grapheme "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc # TÊt󠀠e sss ssss ss t󠀠êTE # 012345678901234567890 # ^ ^ || ||| | ^ ^ "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".indices("ss") # (5 13) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".indices("ss", :ignorecase) # (5 10 13) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("ss") # (5 9 11 14) 11? why not 10? because no overlap "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("ss", :overlap) # (5 6 9 10 11 14) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te") # () "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te", :ignorecase) # (19) "TÊt\c[TAG SPACE]e sss sßs ss t\c[TAG SPACE]êTE".fc.indices("te", :ignoremark) # (0 2 17 19) so TAG SPACE is ignored when :ignoremark # Matching inside a grapheme "noël👩‍👨‍👩‍👧🎅".indices("👧🎅") # () "noël👩‍👨‍👩‍👧🎅".indices("👨‍👩") # () # Matching a ligature # bâfflé # 012 3 "bâfflé".indices("é") # (3) "bâfflé".indices("ffl") # () "bâfflé".indices("ffl", :ignorecase) # (2) https://raku-advent.blog/2022/12/22/day-22-hes-making-a-list-part-1/ Unicode’s CLDR (Common Linguistic Data Repository) jlf: to read... https://www.nu42.com/2015/12/perl6-newline-translation-broken.html Newline translation in Perl6 is broken A. Sinan Unur December 11, 2015 --- jlf: Referenced from https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-perl-6-and-you/#comment-14382 I reference this URL in case \r\n versus \r is a problem for Rexx Unicodified. For Unicode, \r\n is one grapheme. Maybe no relation with the failed test cases. Was fixed like that: https://github.com/Raku/old-issue-tracker/issues/4849#issuecomment-570873506 * We do translation of \r\n graphemes to \n on all input read as text except sockets, independent of platform * We do translation of all \n graphemes to \r\n on text output to handles except sockets, on Windows only * \n is now, unless `use newline` is in force, always \x0A * We don't do any such translation when using .encode/.decode, and of course when reading/writing Bufs to files, providing an escape hatch from translation if needed https://6guts.wordpress.com/2015/11/21/what-one-christmas-elf-has-been-up-to/ jlf: referenced for the section NFG improvements. https://6guts.wordpress.com/2015/10/15/last-week-unicode-case-fixes-and-much-more/ jlf: referenced for the section A case of Unicode. Testing the JMB's example "ς".uc # Σ "σ".uc # Σ "ὈΔΥΣΣΕΎΣ".lc # ὀδυσσεύς last Σ becomes ς "ὈΔΥΣΣΕΎΣA".lc # ὀδυσσεύσa last Σ becomes σ # Humm... the concatenation doesn't change ς to σ "ὈΔΥΣΣΕΎΣ".lc ~ "A" # ὀδυσσεύςA ("ὈΔΥΣΣΕΎΣ".lc ~ "A").uc # ὈΔΥΣΣΕΎΣA ("ὈΔΥΣΣΕΎΣ".lc ~ "A").uc.lc # ὀδυσσεύσa https://stackoverflow.com/questions/39663846/how-can-i-make-perl-6-be-round-trip-safe-for-unicode-data How can I make Perl 6 be round-trip safe for Unicode data? 
Answer: UTF8-C8 isn't really a good solution (but is probably the only solution currently). jlf: asked in 2016-09-23, maybe the situation is better today. https://rosettacode.org/wiki/String_comparison#Raku String comparisons never do case folding because that's a very complicated subject in the modern world of Unicode. (You can explicitly apply an appropriate case-folding function to the arguments before doing the comparison, or for "equality" testing you can do matching with a case-insensitive regex, assuming Unicode's language-neutral case-folding rules are okay.) --- Be aware that Raku applies normalization (Unicode NFC form (Normalization Form Canonical)) by default to all input and output except for file names See docs. Raku follows the Unicode spec. Raku follows all of the Unicode spec, including parts that some people don't like. There are some graphemes for which the Unicode consortium has specified that the NFC form is a different (though usually visually identical) grapheme. Referred to in Unicode standard annex #15 as Canonical Equivalence. Raku adheres to that spec. https://docs.raku.org/language/traps#Traps_to_avoid Some problems that might arise when dealing with strings https://raku.guide/#_unicode Escape characters say "\x0061"; say "\c[LATIN SMALL LETTER A]"; Numbers say (٤,٥,٦,1,2,3).sort; # (1 2 3 4 5 6) say 1 + ٩; # 10 Raku has methods/operators that implement the Unicode Collation Algorithm. say 'a' unicmp 'B'; # Less Raku provides a collate method that implements the Unicode Collation Algorithm. say ('a','b','c','D','E','F').sort; # (D E F a b c) say ('a','b','c','D','E','F').collate; # (a b c D E F)
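For side-by-side reference with the Raku .chars/.codes/.encode example above, the same counts obtained from Python; grapheme clusters need the third-party regex module (recent versions follow the UAX #29 emoji rules, older ones may split the ZWJ sequence).
import regex   # third-party; stdlib re has no \X (grapheme cluster) support
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"     # FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VS-16
len(regex.findall(r"\X", s))                     # 1  grapheme cluster (Raku .chars)
len(s)                                           # 5  code points      (Raku .codes)
len(s.encode("utf-16-le")) // 2                  # 7  UTF-16 code units
len(s.encode("utf-8"))                           # 17 UTF-8 bytes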

Rexx lang


11/08/2021 http://nokix.sourceforge.net/help/learn_rexx/funcs5.htm#VALUEIN Reads in a numeric value from a binary (ie, non-text) file. value = VALUEIN(stream, position, length, options) Args stream is the name of the stream. It can include the full path to the stream (ie, any drive and directory names). If omitted, the default is to read from STDIN. position specifies at what character position (within the stream) to start reading from, where 1 means to start reading at the very first character in the stream. If omitted, the default is to resume reading at where a previous call to CHARIN() or VALUEIN() left off (ie, where you current read character position is). length is a 1 to read in the next binary byte (ie, 8-bit value), a 2 to read in the next binary short (ie, 16-bit value), or a 4 to read in the next binary long (ie, 32-bit value). If length is omitted, VALUEIN() defaults to reading a byte. options can be any of the following: M The value is stored (in the stream) in Motorola (big endian) byte order, rather than Intel (little endian) byte order. The effects only long and short values. H Read in the value as hexadecimal (rather than the default of base 10, or decimal, which is the base that REXX uses to express numbers). The value can later be converted with X2D(). B Read in the value as binary (base 2). - The value is signed (as opposed to unsigned). V stream is the actual data string from which to extract a value. You can now replace calls to SUBSTR and C2D with a single, faster call to VALUEIN. If omitted, options defaults to none of the above. Returns The value, if successful. If an error, an empty string is returned (unless the NOTREADY condition is trapped via CALL method. Then, a '0' is returned). http://nokix.sourceforge.net/help/learn_rexx/funcs5.htm#VALUEOUT Write out numeric values to a binary (ie, non-text) file (ie, in non-text format). result = VALUEOUT(stream, values, position, size, options) Args stream is the name of the stream. It can include the full path to the stream (ie, any drive and directory names). If omitted, the default is to write to STDOUT (typically, display the data in the console window). position specifies at what character position (within the stream) to start writing the data, where 1 means to start writing at the very first character in the stream. If omitted, the default is to resume writing at where a previous call to CHAROUT() or VALUEOUT() left off (or where the "write character pointer" was set via STREAM's SEEK). values are the numeric values (ie, data) to write out. Each value is separated by one space. size is a 1 if each value is to be written as a byte (ie, 8-bit value), 2 if each value is to be written as a short (16-bit value), or 4 if each value is to be written as a long (32-bit value). If omitted, size defaults to 1. options can be any of the following: M Write out the values in Motorola (big endian) byte order, rather than Intel (little endian) byte order. The effects only long and short values. H The values you supplied are specified in hexadecimal. B The values you supplied are specified in binary (base 2). V stream is the name of a variable, and the data will be overlaid onto that variable's value. You can now replace calls to D2C and OVERLAY with a single, faster call to VALUEOUT, especially when a variable has a large amount of non-text data. If omitted, options defaults to none of the above. Returns 0 if the string was written out successfully. If an error, VALUEOUT() returns non-zero. 
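Not Rexx, but as a reference point for the size (1/2/4 bytes) and byte-order ('M' = Motorola / big endian) options above, the same choices expressed with Python's struct module; 't.bin' is just a made-up example file.
import struct
with open("t.bin", "wb") as f:                      # write 0x0102 and 0x01020304, both big endian
    f.write(struct.pack(">H", 0x0102) + struct.pack(">I", 0x01020304))
with open("t.bin", "rb") as f:
    short_be = struct.unpack(">H", f.read(2))[0]    # like VALUEIN(stream, position, 2, 'M')
    long_le  = struct.unpack("<I", f.read(4))[0]    # default Intel order, like VALUEIN(stream, , 4)
hex(short_be), hex(long_le)                         # ('0x102', '0x4030201')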
http://www.dg77.net/tekno/manuel/rexxendian.htm Test de l’endianité (endianness test) /* Check endianness */ /* For processing information encoded in UTF-8 */ /* Adapt if another encoding is used */ CALL CONV8_16 ' ' IF c2x(sortie) = '2000' THEN DO endian = 'LE' /* little endian */ blanx = '2000' END ELSE DO endian = 'BE' /* big endian */ blanx = '0020' END return endian blanx /* ********************************************************************** */ /* Conversion UTF-8 -> UNICODE */ CONV8_16: parse arg entree sortie = '' ZONESORTIE.='NUL'; ZONESORTIE.0=0 err = systounicode(entree, 'UTF8', , ZONESORTIE.) if err == 0 then sortie = ZONESORTIE.!TEXT else say 'probleme car., code ' err return http://www.dg77.net/tekno/xhtml/codage.htm Le codage des caractères (character encoding) To read; some info about the code pages could be useful. Regina doc EXPORT(address, [string], [length] [,pad]) - (AREXX) Copies data from the (optional) string into a previously-allocated memory area, which must be specified as a 4-byte address. The length parameter specifies the maximum number of characters to be copied; the default is the length of the string. If the specified length is longer than the string, the remaining area is filled with the pad character or nulls('00'x). The returned value is the number of characters copied. Caution is advised in using this function. Any area of memory can be overwritten, possibly causing a system crash. See also STORAGE() and IMPORT(). Note that the address specified is subject to a machine's endianness. EXPORT('0004 0000'x,'The answer') '10' IMPORT(address [,length]) - (AREXX) Creates a string by copying data from the specified 4-byte address. If the length parameter is not supplied, the copy terminates when a null byte is found. See also EXPORT() Note that the address specified is subject to a machine's endianness. IMPORT('0004 0000'x,10) 'The answer' /* maybe */
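The CONV8_16 routine above infers endianness from the UTF-16 bytes produced for a space; roughly the same check in Python for comparison (the 'utf-16' codec emits a BOM in native byte order), plus the direct sys.byteorder answer.
import sys
encoded = " ".encode("utf-16")                  # native order, BOM first
endian = "LE" if encoded.startswith(b"\xff\xfe") else "BE"
endian, encoded.hex(), sys.byteorder            # e.g. ('LE', 'fffe2000', 'little')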

Ruby lang


jlf note: still searching articles/blogs comparing the Ruby's approach (multi-encodings) with languages that force the conversion to Unicode (be it utf-8 or Unicode scalars). https://docs.ruby-lang.org/en/3.2/String.html class String --- jlf: focus on comparison. I did not find the definition of "compatible". Methods for Comparing ==, ===: Returns true if a given other string has the same content as self. eql?: Returns true if the content is the same as the given other string. <=>: Returns -1, 0, or 1 as a given other string is smaller than, equal to, or larger than self. casecmp: Ignoring case, returns -1, 0, or 1 as a given other string is smaller than, equal to, or larger than self. casecmp?: Returns true if the string is equal to a given string after Unicode case folding; false otherwise. Returns false if the two strings’ encodings are not compatible: "\u{e4 f6 fc}" == ("\u{e4 f6 fc}") # => true "\u{e4 f6 fc}".encode("ISO-8859-1") == ("\u{e4 f6 fc}") # => false "\u{e4 f6 fc}".eql?("\u{e4 f6 fc}") # => true "\u{e4 f6 fc}".encode("ISO-8859-1").eql?("\u{e4 f6 fc}") # => false # "äöü" "ÄÖÜ" "\u{e4 f6 fc}".casecmp("\u{c4 d6 dc}") # => 1 "\u{e4 f6 fc}".encode("ISO-8859-1").casecmp("\u{c4 d6 dc}") # => nil https://yehudakatz.com/2010/05/17/encodings-unabridged/ Encodings, Unabridged jlf: this article explains why the Ruby team consider that Unicode is not a good solution for CJK. https://ruby-doc.org/current/Encoding.html https://github.com/ruby/ruby/blob/master/encoding.c jlf: search "compat" https://docs.ruby-lang.org/en/master/encodings_rdoc.html Encodings --- jlf: Executor has a similar support of encodings, with less defaults and less supported encodings. Otherwise the technical solution is the same: all encodings are equal, there is no forced internal encoding, no forced conversion. --- Default encodings: - Encoding.default_external: the default external encoding - Encoding.default_internal: the default internal encoding (may be nil) - locale: the default encoding for a string from the environment - filesystem: the default encoding for a string from the filesystem String encoding A Ruby String object has an encoding that is an instance of class Encoding. The encoding may be retrieved by method String#encoding. 's'.encoding # => #<Encoding:UTF-8> The default encoding for a string literal is the script encoding The encoding for a string may be changed: s = "R\xC3\xA9sum\xC3\xA9" # => "Résumé" s.encoding # => #<Encoding:UTF-8> s.force_encoding('ISO-8859-1') # => "R\xC3\xA9sum\xC3\xA9" s.encoding # => #<Encoding:ISO-8859-1> Stream Encodings Certain stream objects can have two encodings; these objects include instances of: IO. File. ARGF. StringIO. The two encodings are: - An external encoding, which identifies the encoding of the stream. The default external encoding is: - UTF-8 for a text stream. - ASCII-8BIT for a binary stream. - An internal encoding, which (if not nil) specifies the encoding to be used for the string constructed from the stream. The default internal encoding is nil (no conversion). Script Encoding The default script encoding is UTF-8; a Ruby source file may set its script encoding with a magic comment on the first line of the file (or second line, if there is a shebang on the first). 
The comment must contain the word coding or encoding, followed by a colon, space and the Encoding name or alias: # encoding: ISO-8859-1 __ENCODING__ #=> #<Encoding:ISO-8859-1> This example writes a string to a file, encoding it as ISO-8859-1, then reads the file into a new string, encoding it as UTF-8: s = "R\u00E9sum\u00E9" path = 't.tmp' ext_enc = 'ISO-8859-1' int_enc = 'UTF-8' File.write(path, s, external_encoding: ext_enc) raw_text = File.binread(path) # "R\xE9sum\xE9" transcoded_text = File.read(path, external_encoding: ext_enc, internal_encoding: int_enc) # "Résumé" https://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/ 3 Steps to Fix Encoding Problems in Ruby The major difference between encode and force_encoding is that encode might change bytes, and force_encoding won’t. In ASCII-8BIT, every character is represented by a single byte. That is, str.chars.length == str.bytes.length. https://www.cloudbees.com/blog/how-ruby-string-encoding-benefits-developers Familiarize Yourself with Ruby String Encoding written August 14, 2018 Ruby encoding methods - String#force_encoding is a way of saying that we know the bits for the characters are correct and we simply want to properly define how those bits are to be interpreted to characters. - String#encode will transcode the bits themselves that form the characters from whatever the string is currently encoded as to our target encoding. Example of the byte size being different from the character length: "łał".size # => 3 "łał".bytesize # => 5 Different operating systems have different default character encodings so programming languages need to support these. Encoding.default_external # => #<Encoding:UTF-8> Ruby defaults to UTF-8 as its encoding so if it is opening up files from the operating system and the default is different from UTF-8, it will transcode the input from that encoding to UTF-8. If this isn't desirable, you may change the default internal encoding in Ruby with Encoding.default_internal. Otherwise you can use specific IO encodings in your Ruby code. File.open(filename, 'r:UTF-8', &:read) # or File.open(filename, external_encoding: "ASCII-8BIT", internal_encoding: "ASCII-8BIT") do |f| f.read end Lately, I've been integrating Ruby's encoding support to Rust with the library Rutie. Rutie allows you to write Rust that works in Ruby and Ruby that works in Rust. jlf: see Rutie in Rust lang. https://ruby-core.ruby-lang.narkive.com/RDKAvdS7/20483-encoding-of-symbols#post2 [ruby-core:20483] encoding of symbols --- jlf: AT LAST! I found a discussion about the comparison of strings. LONG thread, to carefully read. --- This message 2008-12-14 is a good summary! Is it still correct today? https://ruby-core.ruby-lang.narkive.com/RDKAvdS7/20483-encoding-of-symbols#post12 - String operations are done using the bytes in the strings - they are not converted to codepoints internally - String equality comparisons seem to be simply done on a byte-by-byte basis, without regard to the encoding - *However* other operations are not simply byte-by-byte. They are done character-by-character, but without converting to codepoints - eg: a 3 byte character is kept as 3 bytes. For example this means that when operating on a variable-length encoding, simple operations like indexing can be inefficient, as Ruby may have to scan through the string from the start. However Ruby does try to optimize this where possible. - There is also a concept of "compatible encodings".
Given 2 encodings e1 & e2, e1 is compatible with e2 if the representation of every character in e1 is the same as in e2. This implies that e2 must be a "bigger" encoding than e1 - ie: e2 is a superset of e1. Typically we are mainly talking about US-ASCII here, which is compatible with most other character sets that are either all single-byte (eg: all the ISO-8859 sets) or are variable-length multi-byte (eg: UTF-8). - When operating on encodings e1 & e2, if e1 is compatible with e2, then Ruby treats both strings as being in encoding e2. - String#> and String#< are a bit wierd. Normally they are just done on a byte-by-byte basis, UNLESS the strings are the same and are incompatible encodings, then they always seem to return FALSE. (I have to check this - it may be more complicated than this). - When operating on incompatible encodings, *normally* non-comparison operations (including regexp matches) raise an "Encoding Compatibility Error". - However there appears to be an exception to this: if operating on 2 incompatible encodings AND US-ASCII is compatible with both, AND both strings are US-ASCII strings, then the operation appears to proceed, treating both as US-ASCII. For example "abc" as an ISO-8859-1 and "abc" as UTF-8. I guess this is Ruby being "forgiving". (Personally I am not sure if this is good or bad). The encoding of the result (for example of a string concatenation) seems to be one of the 2 original encodings - I haven't figured out the logic to this yet :) --- jlf: this one seems ugly... Actually I just checked this, and this is wrong, sorry. I ended up looking at the source code of rb_str_cmp() in string.c, and here is what I think it does: - it does a byte-by-byte comparison. Assuming the strings are different, Ruby returns what you would expect based on this. - if the strings are byte for byte identical, but they have incompatible encodings and at least one of the strings contains a non-ASCII character, then it seems that the result is determined by the ordering of the encodings, based on ruby's "encoding index" - an internal ordering of the available encodings. Maybe I have got this wrong - it doesn't make a lot of sense to me! --- I don't mean to shoot you down in flames, but a lot of thought and effort has gone into Ruby's encoding support. Ruby could have followed the Python route of converting everything to Unicode, but that was rejected for various good reasons. Also automatic transcoding to solve issues of incompatible encodings was also rejected because it causes a number of problems, in particular I believe that transcoding isn't necessarilly accurate, because for example there may be multiple or ambiguous representations of the same character. --- Yukihiro Matsumoto UTF-8 + ASCII-8BIT makes ASCII-8BIT. Binary wins. jlf: hum... I do the opposite with Executor jlf 2023.08.09: I checked today with Ruby 3.2, the result is UTF-8 http://graysoftinc.com/character-encodings jlf: 12 articles about character encoding in Ruby. From 2008-10-14 to 2009-06-18 Old, but maybe interesting? todo: read https://docs.ruby-lang.org/en/3.2/case_mapping_rdoc.html Case Mapping By default, all of these methods use full Unicode case mapping, which is suitable for most languages. Non-ASCII case mapping and folding are supported for UTF-8, UTF-16BE/LE, UTF-32BE/LE, and ISO-8859-1~16 Strings/Symbols. Context-dependent case mapping is currently not supported (Unicode standard: Context Specification for Casing). In most cases, case conversions of a string have the same number of characters. 
There are exceptions (see also :fold below): s = "\u00DF" # => "ß" s.upcase # => "SS" s = "\u0149" # => "ʼn" s.upcase # => "ʼN" Case mapping may also depend on locale (see also :turkic below) s = "\u0049" # => "I" s.downcase # => "i" # Dot above. s.downcase(:turkic) # => "ı" # No dot above. Case changing methods may not maintain Unicode normalization. Except for casecmp and casecmp?, each of the case-mapping methods listed above accepts optional arguments, *options. The arguments may be: :ascii only. :fold only. :turkic or :lithuanian or both. https://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/ composition in the form of ligatures isn’t handled at all "baffle".upcase == "BAFFLE" # => false jlf: Has been fixed in a later version: "baffle".upcase # => "BAFFLE" BUT other things are still not good in Ruby 3.2.2 (March 30, 2023): "noël".reverse # => "l̈eon" "noël"[0..2] # => "noe" --- "baffle"~text~upper= -- T'BAfflE' 30/05/2023 Executor not good because utf8proc upper is not good "baffle"~text~caselessEquals("baffle")= -- 1 30/05/2023 Executor is good because utf8proc casefold is good "noël"~text~reverse= -- T'lëon' "noël"~text[1,3]= -- T'noë' https://github.com/jmhodges/rchardet Character encoding auto-detection in Ruby. jlf: no doc :-( Returns a confidence rate? cd = CharDet.detect(some_data) encoding = cd['encoding'] confidence = cd['confidence'] # 0.0 <= confidence <= 1.0 https://bugs.ruby-lang.org/issues/18949 Deprecate and remove replicate and dummy encodings Rejected by Naruse: String is a container and an encoding is a label of it. While data whose encoding is an encoding categorized in dummy encodings in Ruby, we cannot avoid such encodings. <reopened, lot of discussions> This is all done now, only https://github.com/ruby/ruby/pull/7079. Overall: We deprecated and removed Encoding#replicate We removed get_actual_encoding() We limited to 256 encodings and kept rb_define_dummy_encoding() with that constraint. There is a single flat array to lookup encodings, rb_enc_from_index() is fast now. https://github.com/ruby/ruby/pull/3803 Add string encoding IBM720 alias CP720 The mapping table is generated from the ICU project: https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/ibm-720_P100-1997.ucm https://speakerdeck.com/ima1zumi/dive-into-encoding slide 23: Code Set Independent (CSI), Treat all encodings fair slide 24: Each instance of string has encoding information slide 26: Universal Coded Set (UCS) https://shopify.engineering/code-ranges-ruby-strings Code Ranges: A Deeper Look at Ruby Strings Code ranges are a way for the VM to avoid repeated work and optimize operations on a per-string basis, guiding away from slow paths when that functionality isn't needed. jlf: not sure this article is useful. https://idiosyncratic-ruby.com/66-ruby-has-character.html Ruby has Character video: https://www.youtube.com/watch?v=hlryzsdGtZo (jlf: too small, not very readable, but good for pronuntiation: "Louby") --- jlf: this page is interesting for the one-liners. Tools implemented by the author https://github.com/janlelis/unibits Visualize different Unicode encodings in the terminal https://github.com/janlelis/uniscribe Know your Unicode ✀ https://idiosyncratic-ruby.com/41-proper-unicoding.html Proper Unicoding Ruby's Regexp engine has a powerful feature built in: It can match for Unicode character properties. 
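For comparison with Ruby's \p{...} property matching mentioned just above: in Python this needs the third-party regex module (stdlib re has no \p escapes); a minimal sketch with made-up sample text.
import regex
s = "Ὀδυσσεύς straße 42"
regex.findall(r"\p{Greek}+", s)    # ['Ὀδυσσεύς']
regex.findall(r"\p{Latin}+", s)    # ['straße']
regex.findall(r"\p{Nd}+", s)       # ['42']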
https://idiosyncratic-ruby.com/26-file-encoding-magic.html default source encoding # coding: cp1252 p "".encoding #=> #<Encoding:Windows-1252> https://tomdebruijn.com/posts/rust-string-length-width-calculations/ The article is about Rust, but there is an appendix about Ruby. Seems a good summary, so copy-paste here... --- When calling Ruby's String#length, it returns the length of characters like Rust's Chars.count. If you want the length in bytes you need to call String#bytesize. "abc".length # => 3 characters "abc".bytesize # => 3 bytes "é".length # => 1 characters "é".bytesize # => 2 bytes Calling the length on emoji will return the individual characters as the length. The 👩‍🔬 emoji is three characters and eleven bytes in Ruby as well. "👩‍🔬".length # => 3 characters "👩‍🔬".bytesize # => 11 bytes Do you want grapheme clusters? it's built-in to Ruby with String#grapheme_clusters. "👩‍🔬".grapheme_clusters.length # => 1 cluster To calculate the display with, we can use the unicode-display_width gem. The same multiple counting of emoji in the grapheme cluster still applies here. require "unicode/display_width" Unicode::DisplayWidth.of("👩‍🔬") # => 4 Unicode::DisplayWidth.of("❤️") # => 1 https://ruby-doc.org/3.2.2/File.html class File A File object is a representation of a file in the underlying platform. --- Data mode To specify whether data is to be treated as text or as binary data, either of the following may be suffixed to any of the string read/write modes above: 't': Text data; sets the default external encoding to Encoding::UTF_8; on Windows, enables conversion between EOL and CRLF and enables interpreting 0x1A as an end-of-file marker. 'b': Binary data; sets the default external encoding to Encoding::ASCII_8BIT; on Windows, suppresses conversion between EOL and CRLF and disables interpreting 0x1A as an end-of-file marker. --- Encodings Any of the string modes above may specify encodings - either external encoding only or both external and internal encodings - by appending one or both encoding names, separated by colons: f = File.new('t.dat', 'rb') f.external_encoding # => #<Encoding:ASCII-8BIT> f.internal_encoding # => nil f = File.new('t.dat', 'rb:UTF-16') f.external_encoding # => #<Encoding:UTF-16 (dummy)> f.internal_encoding # => nil f = File.new('t.dat', 'rb:UTF-16:UTF-16') f.external_encoding # => #<Encoding:UTF-16 (dummy)> f.internal_encoding # => #<Encoding:UTF-16> f.close - When the external encoding is set, strings read are tagged by that encoding when reading, and strings written are converted to that encoding when writing. - When both external and internal encodings are set, strings read are converted from external to internal encoding, and strings written are converted from internal to external encoding. For further details about transcoding input and output, see Encodings. https://ruby-doc.org/3.2.2/encodings_rdoc.html#label-Encodings String comparison If the encodings are different then the strings are different. So it's not a comparison of Unicode codepoints. irb(main):026:0> s1 = "hello" => "hello" irb(main):027:0> s1 => "hello" irb(main):028:0> s2 = "hello" => "hello" irb(main):029:0> s1 == s2 => true irb(main):030:0> s2.force_encoding("utf-16") => "\x68\x65\x6C\x6C\x6F" irb(main):031:0> s2 => "\x68\x65\x6C\x6C\x6F" irb(main):032:0> s1 == s2 => false https://bugs.ruby-lang.org/issues/9111 Encoding-free String comparison 14/11/2013 --- Description Currently, strings with the same content but with different encodings count as different strings. 
This causes strange behaviour as below (noted in StackOverflow question http://stackoverflow.com/questions/19977788/strange-behavior-in-packed-ruby-strings#19978206): [128].pack("C") # => "\x80" [128].pack("C") == "\x80" # => false Since [128].pack("C") has the encoding ASCII-8BIT and "\x80" (by default) has the encoding UTF-8, the two strings are not equal. Also, comparison of strings with different encodings may end up with a messy, unintended result. I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison. --- nobu (Nobuyoshi Nakada) It's unacceptable to always convert all strings to UTF-8, should restrict to comparison with an ASCII-8BIT string. --- naruse (Yui NARUSE) The standard practice is NFD("â") == NFD("a" + "^"). To NFD, you can use some libraries. --- duerst (Martin Dürst) Lié à Feature #10084: Add Unicode String Normalization to String class ajouté https://bugs.ruby-lang.org/issues/10084 --- jlf 09/08/2023: ticket still opened... The test [128].pack("C") == "\x80" still returns false, so I assume they made no change. https://bugs.ruby-lang.org/issues/10084 Add Unicode String Normalization to String class 23/07/2014 --- nobu (Nobuyoshi Nakada) What will happen for a non-unicode string, raising an exception? --- duerst (Martin Dürst) This is a very good question. I'm okay with whatever Matz and the community think is best. There are many potential approaches. In general, these will be: 1. Make the operation a no-op. 2. Convert to UTF-8, normalize, then convert back. 3. Implement normalization directly in the encoding. 4. Raise an exception. There is also the question of what a "non-unicode" or "unicode" string is. UTF-8 is the preferred way to handle Unicode in Ruby, and is where normalization is really needed and will be used. For the other encodings, unless we go with 1) or 4), the following considerations apply. UTF8-Mac, UTF8-DoCoMo, UTF8-KDDI and UTF8-Softbank are essentially UTF-8 but with slightly different character conversions. For these encodings, the easiest thing to do is force_encoding to UTF-8, normalize, and force_encoding back. A C-level implementation may not actually need force_encoding, but a Ruby implementation does. There are some questions about what normalizing UTF8-Mac means, so that may have to be treated separately. The DoCoMo/KDDI/Softbank variants are mostly about emoji, which as far as I know are not affected by normalization. Then there are UTF-16LE/BE and UTF-32LE/BE. For these, it depends on the implementation. A Ruby-level implementation (unless very slow) may want to convert to UTF-8 and back. A C-level implementation may not need to do this. Then there is also GB18030. Conversion to UTF-8 and back seems to be the best solution. Doing normalization directly in GB18030 will need too much data. For other, truely non-unicode encodings, implementing noramlization directly in the encoding would mean the following: Analyze to what extent the normalization applies to the encoding in question, and apply this part. As an example, '①'.nfkc produces '1' in UTF-8, it could do the same in Windows-31J. The analysis might take some time (but can be automated), and the data needed for each encoding would mostly be just very small. --- matz (Yukihiro Matsumoto) First of all, I don't think normalize is the best name. I propose unicode_normalize instead, since this normalization is sort of unicode specific. 
It should raise an exception for non Unicode strings. It shouldn't convert to UTF-8 implicitly inside. https://www.honeybadger.io/blog/troubleshooting-encoding-errors-in-ruby/ Troubleshooting Encoding Errors in Ruby --- jlf: interesting for the one-liners --- "H".bytes # => [72] in decimal "H".bytes.map {|e| e.to_s 2} # => ["1001000"] convert in base 2 Encoding.name_list # => ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", "UTF8-MAC", "EUC-JP", "Windows-31J", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-KR", ...] "hellÔ!".encode("US-ASCII") # in `encode': U+00D4 from UTF-8 to US-ASCII (Encoding::UndefinedConversionError) "hellÔ!".force_encoding("US-ASCII"); # => "hell\xC3\x94!" "abc\xCF\x88\xCF\x88" # => "abcψψ" "abcψψ".force_encoding("US-ASCII").valid_encoding? # => false "abcψψ".encode("US-ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "") # => "abc" "abc\xA1z".encode("US-ASCII") # in `encode': "\xA1" on UTF-8 (Encoding::InvalidByteSequenceError) "abc\xA1z".force_encoding("US-ASCII").scrub("*") # => "abc*z" "abc\xA1z".force_encoding("US-ASCII").scrub("") # => "abcz" "abc\xA1z".force_encoding("US-ASCII").valid_encoding? # => false
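The issue #9111 comparison above ([128].pack("C") == "\x80" # => false) has no direct Python counterpart because Python keeps bytes and str apart; a small illustration (my own, not from the Ruby tracker) of how the same data compares once decoded.
raw = bytes([128])                      # the Ruby ASCII-8BIT "\x80"
raw == "\x80"                           # False: bytes never compare equal to str
raw.decode("latin-1") == "\x80"         # True: after decoding, only code points matter
"\x80".encode("utf-8")                  # b'\xc2\x80'  the UTF-8 bytes differ from raw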

Rust lang


Seen in a comment here: https://bugs.swift.org/browse/SR-7602 For reference, I think [Rust's model]( https://doc.rust-lang.org/std/string/struct.String.html ) is pretty good: `from_utf8` produces an error explaining why the code units were invalid `from_utf8_lossy` replaces encoding errors with U+FFFD `from_utf8_unchecked` which takes the bytes, but if there's an encoding error, then memory safety has been violated I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly. We may want to be very cautious about if/how we expose it. I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8. 17/07/2021 https://www.generacodice.com/en/articolo/120763/Unicode+Support+in+Various+Programming+Languages jlf: I learned something: OsStr/OsString Rust's strings (std::String and &str) are always valid UTF-8, and do not use null terminators, and as a result can not be indexed as an array, like they can be in C/C++, etc. They can be sliced somewhat like Go using .get since 1.20, with the caveat that it will fail if you try slicing the middle of a code point. Rust also has OsStr/OsString for interacting with the host OS. It's a byte array on Unix (containing any sequence of bytes). On Windows it's WTF-8 (a superset of UTF-8 that handles the improperly formed Unicode strings that are allowed in Windows and JavaScript). &str and String can be freely converted to OsStr or OsString, but require checks to convert the other way, either by failing on invalid Unicode or by replacing with the Unicode replacement char. (There is also Path/PathBuf, which are just wrappers around OsStr/OsString). There is also the CStr and CString types, which represent null-terminated C strings; like OsStr on Unix, they can contain arbitrary bytes. Rust doesn't directly support UTF-16, but can convert OsStr to UCS-2 on Windows. 22/07/2021 https://lib.rs/crates/ STFU-8: Sorta Text Format in UTF-8 STFU-8 is a hacky text encoding/decoding protocol for data that might be not quite UTF-8 but is still mostly UTF-8. Its primary purpose is to be able to allow a human to visualize and edit "data" that is mostly (or fully) visible UTF-8 text. It encodes all non-visible or non-UTF-8-compliant bytes as longform text (i.e. ESC becomes the full string r"\x1B"). It can also encode/decode ill-formed UTF-16. 28/07/2021 https://fasterthanli.me/articles/working-with-strings-in-rust 07/11/2021 https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html security concern affecting source code containing "bidirectional override" Unicode codepoints 10/03/2022 https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers. 10/09/2022 https://blog.burntsushi.net/bstr/ A byte string library for Rust Invalid UTF-8 doesn’t actually prevent one from applying Unicode-aware algorithms on the parts of the string that are valid UTF-8. The parts that are invalid UTF-8 are simply ignored. 15/10/2022 https://crates.io/crates/finl_unicode Library for handling Unicode functionality for finl (categories and grapheme segmentation) There are these comments in https://news.ycombinator.com/item?id=32700315 All with two-step tables instead of range- and binary search? Yes.
The two-step tables are really not that expensive and they enable features not possible with range and binary search, like identifying the category of a character cheaply. https://github.com/open-i18n/rust-unic UNIC: Unicode and Internationalization Crates for Rust jlf: seems stale since Oct 21, 2020. Killed by ICU4X? This fork is still alive: https://github.com/eyeplum/rust-unic https://github.com/logannc/fuzzywuzzy-rs port of https://github.com/seatgeek/fuzzywuzzy (Fuzzy String Matching in Python. This project has been renamed and moved to https://github.com/seatgeek/thefuzz) Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package. https://en.wikipedia.org/wiki/Levenshtein_distance https://hsivonen.fi/encoding_rs/ encoding_rs: a Web-Compatible Character Encoding Library in Rust encoding_rs is a high-decode-performance, low-legacy-encode-footprint and high-correctness implementation of the WHATWG Encoding Standard written in Rust. --- https://hsivonen.fi/modern-cpp-in-rust/ How I Wrote a Modern C++ Library in Rust Slides: https://hsivonen.fi/rustfest2018/ Video: https://media.ccc.de/v/rustfest18-5-a_rust_crate_that_also_quacks_like_a_modern_c_library https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2773r0.pdf (pdf...) Generally speaking, reducing the size of the tables has a direct impact on performance, if only because increasing cache locality is the most effective way to improve the performance of anything. I landed on a set of strategies developed by the rust team https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator/src https://www.youtube.com/watch?v=Mcuqzx3rBWc Strings in Rust FINALLY EXPLAINED! jlf: is there something to learn from 15:29 Indexing into a string? no. https://github.com/rust-lang/regex/blob/master/UNICODE.md regex Unicode conformance jlf: I found the URL above in this HN comment (related to awk support of Unicode) https://news.ycombinator.com/item?id=32538560 https://github.com/danielpclark/rutie Integrate Ruby with your Rust application. Or integrate Rust with your Ruby application. https://github.com/danielpclark/rutie/blob/master/src/class/string.rs https://tomdebruijn.com/posts/rust-string-length-width-calculations/ Calculating String length and width https://github.com/lintje/lintje/blob/501aab06e19008e787237438a69ac961f38bb4b7/src/utils.rs#L22-L71 // Return String display width as rendered in a monospace font according to the Unicode // specification. https://www.reddit.com/r/rust/comments/gpw2ra/how_is_the_rust_compiler_able_to_tell_the_visible/ How is the Rust compiler able to tell the visible width of unicode characters? --- jlf: some arbitrary excerpts - rustc uses the unicode-width crate (https://github.com/unicode-rs/unicode-width) - Now try it with the rainbow flag emoji. Unicode is hard :) - explanation: the rainbow flag emoji is actually just a white flag + zero width joiner + a rainbow, meaning it's technically three characters. - Sure but why doesn't the unicode-width crate handle that? - The unicode-width crate operates on scalar values. I don't believe Unicode has a way to determine whether a grapheme cluster is halfwidth/fullwidth. The most reasonable way to determine this would probably be the maximum width of any scalar value within a grapheme cluster, but this isn't part of any standard and probably isn't 100% accurate. - It is also dependent on the display platform.
A platform with support for displaying emojis but only in older unicode versions would indeed display multiple emojis on the screen. I don't believe there's a platform independent way to detect the visual length of any given series of unicode codepoints. For Rust this isn't a problem as we restrict the unicode identifiers only to things that are fairly homogeneous (namely, no emojis in your variable names!). - At the bottom of things is the unicode-width native Rust implementation, based off the Unicode 13.0 data tables. In C/POSIX land, we would use the function wcwidth(). Unfortunately, this isn't the whole story. The actual number of columns used is dependent upon your font and the font layout engine. See section 7.4 of my Free book, Hacking the Planet! with Notcurses, aka "Fixed-width Fonts Ain't So Fixed." https://nick-black.com/htp-notcurses.pdf#page=57 you want pages 47--49 (p49 has some good examples). https://github.com/unicode-rs/unicode-width Displayed width of Unicode characters and strings according to UAX#11 rules. NOTE: The computed width values may not match the actual rendered column width. For example, the woman scientist emoji comprises of a woman emoji, a zero-width joiner and a microscope emoji. extern crate unicode_width; use unicode_width::UnicodeWidthStr; fn main() { assert_eq!(UnicodeWidthStr::width("👩"), 2); // Woman assert_eq!(UnicodeWidthStr::width("🔬"), 2); // Microscope assert_eq!(UnicodeWidthStr::width("👩‍🔬"), 4); // Woman scientist } https://github.com/life4/textdistance.rs https://www.reddit.com/r/rust/comments/13lo6ne/textdistancers_rust_library_to_compare_strings_or/ textdistance.rs: Rust library to compare strings (or any sequences). 25+ algorithms, pure Rust, common interface, Unicode support. Based on popular and battle-tested textdistance Python library https://github.com/life4/textdistance https://github.com/dguo/strsim-rs Rust implementations of string similarity metrics: Hamming Levenshtein - distance & normalized Optimal string alignment Damerau-Levenshtein - distance & normalized Jaro and Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix length Sørensen-Dice https://docs.rs/xi-unicode/latest/xi_unicode/ Unicode utilities useful for text editing, including a line breaking iterator. https://github.com/BurntSushi/bstr A string type for Rust that is not required to be valid UTF-8. --- jlf: this crate is referenced by Stefan Karpinski in the section Filenames (search this URL). https://www.reddit.com/r/rust/comments/qr0rem/how_many_string_types_does_rust_have_maybe_its/ How many String types does Rust have? Maybe it's just 1 jlf: to read?

Saxon lang


https://www.saxonica.com/documentation12/#!localization/unicode-collation-algorithm Unicode Collation Algorithm https://www.saxonica.com/documentation12/index.html#!localization/sorting-and-collations Sorting and collations https://www.saxonica.com/documentation12/index.html#!changes/spi/10-11 Changes from 10 to 11 Strings Most uses of CharSequence have been replaced by a new class net.sf.saxon.str.UnicodeString (which also replaces the old class net.sf.saxon.regex.UnicodeString). The UnicodeString class has a number of implementations. All of them are designed to be codepoint-addressable: they expose an indexable array of 32-bit codepoint values, and never use surrogate pairs. The implementations of UnicodeString include: - Twine8: a string consisting entirely of codepoints in the range 1-255, held in an array with one byte per character. - Twine16: a string consisting entirely of codepoints in the range 1-65535, held in an array with two bytes per character. - Twine24: a string of arbitrary codepoints, held in an array with three bytes per character. - Slice8: a sub-range of an array using one byte per character. - Slice16: a sub-range of an array using two bytes per character. - Slice24: a sub-range of an array using three bytes per character. - BMPString: a wrapper around a Java/C# string known to contain no surrogate pairs. - ZenoString: a composite string held as a list of segments, each of which is itself a UnicodeString. The name derives from the algorithm used to combine segments, which results in segments having progressively decreasing lengths towards the end of the string. - StringView: a wrapper around an arbitrary Java/C# string. (This stores the string both in its native Java/C# form, and using a "real" codepoint-addressable implementation of UnicodeString, which is constructed lazily when it is first required.) Unicode normalization of strings (for example in the fn:normalize-unicode() function) now uses the JDK class java.text.Normalizer rather than code derived from the Unicode Consortium's implementation. This appears to be substantially faster. https://www.balisage.net/Proceedings/vol26/html/Kay01/BalisageVol26-Kay01.html ZenoString: A Data Structure for Processing XML Strings August 2 - 6, 2021 Compare with - Monolithic char arrays - Strings in Saxon - Ropes - Finger Trees https://www.cambridge.org/core/journals/journal-of-functional-programming/article/finger-trees-a-simple-generalpurpose-data-structure/BF419BCA07292DCAAF2A946E6BDF573B#article finger-trees-a-simple-general-purpose-data-structure.pdf

SQL lang


https://dev.mysql.com/doc/refman/8.0/en/charset-unicode.html Unicode Support BMP characters - can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes) - can be encoded in a fixed-length encoding using 16 bits (2 bytes). Supplementary characters take more space than BMP characters (up to 4 bytes per character). MySQL supports these Unicode character sets: - utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character. - utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated in MySQL 8.0, and you should use utf8mb4 instead. - utf8: An alias for utf8mb3. In MySQL 8.0, this alias is deprecated; use utf8mb4 instead. utf8 is expected in a future release to become an alias for utf8mb4. https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb4.html jlf: I take note of this URL for this concatenation rule: utf8mb4 is a superset of utf8mb3, so for an operation such as the following concatenation, the result has character set utf8mb4 and the collation of utf8mb4_col: SELECT CONCAT(utf8mb3_col, utf8mb4_col); Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col: SELECT * FROM utf8mb3_tbl, utf8mb4_tbl WHERE utf8mb3_tbl.utf8mb3_col = utf8mb4_tbl.utf8mb4_col; https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html#data-types-storage-reqs-strings String Type Storage Requirements https://dev.mysql.com/doc/refman/8.0/en/charset-introducer.html Character Set Introducers A character string literal, hexadecimal literal, or bit-value literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation: [_charset_name] literal [COLLATE collation_name] The _charset_name expression is formally called an introducer. It tells the parser, “the string that follows uses character set charset_name.” An introducer does not change the string to the introducer character set like CONVERT() would do. It does not change the string value, although padding may occur. The introducer is just a signal. --- Examples: SELECT 'abc'; SELECT _latin1'abc'; SELECT _binary'abc'; SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci; SELECT _latin1 X'4D7953514C'; SELECT _utf8mb4 0x4D7953514C COLLATE utf8mb4_danish_ci; SELECT _latin1 b'1000001'; SELECT _utf8mb4 0b1000001 COLLATE utf8mb4_danish_ci; --- Character string literals can be designated as binary strings by using the _binary introducer. mysql> SET @v1 = X'000D' | X'0BC0'; mysql> SET @v2 = _binary X'000D' | X'0BC0'; mysql> SELECT HEX(@v1), HEX(@v2); +----------+----------+ | HEX(@v1) | HEX(@v2) | +----------+----------+ | BCD | 0BCD | +----------+----------+ --- Followed by rules to determine the character set and collation of a character string literal, hexadecimal literal, or bit-value literal. See the page for the details. https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/ MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4? MySQL decided that UTF-8 can only hold 3 bytes per character (as it's defined as an alias of utf8mb3). Why? No good reason that I can find documented anywhere. Few years later, when MySQL 5.5.3 was released, they introduced a new encoding called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.
https://www.percona.com/blog/migrating-to-utf8mb4-things-to-consider/ Migrating to utf8mb4: Things to Consider The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades. Migration to utf8mb4 has many advantages including: - It can store more symbols, including emojis - It has new collations for Asian languages - It is faster than utf8mb3

Swift lang


https://github.com/apple/swift-evolution/blob/main/proposals/0363-unicode-for-string-processing.md Proposal: Unicode for String Processing This proposal describes Regex's rich Unicode support during regex matching, along with the character classes and options that define and modify that behavior. This proposal is one component of a larger regex-powered string processing initiative. https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/ Strings and Characters Every string is composed of encoding-independent Unicode characters, and provides support for accessing those characters in various Unicode representations. When a Unicode string is written to a text file or some other storage, the Unicode scalars in that string are encoded in one of several Unicode-defined encoding forms. Each form encodes the string in small chunks known as code units. These include the UTF-8 encoding form (which encodes a string as 8-bit code units), the UTF-16 encoding form (which encodes a string as 16-bit code units), and the UTF-32 encoding form (which encodes a string as 32-bit code units). 03/08/2021 https://swiftdoc.org/v5.1/type/string/ Auto-generated documentation for Swift. A Unicode string value that is a collection of characters. https://developer.apple.com/documentation/swift/string https://www.simpleswiftguide.com/get-character-from-string-using-its-index-in-swift/ jlf: no direct access to a character Doesn't work: let input = "Swift Tutorials" let char = input[3] Work: let input = "Swift Tutorials" let char = input[input.index(input.startIndex, offsetBy: 3)] A "workaround" to have direct access extension StringProtocol { subscript(offset: Int) -> Character { self[index(startIndex, offsetBy: offset)] } } Which can be used just like that: let input = "Swift Tutorials" let char = input[3] https://gist.github.com/paultopia/6609780e7b53676b7dfc55736221cd23 paultopia/monkey_patch_slicing_into_string.swift Another "workaround" to have direct access to the characters like that: var s = "here is a boring string" print(s.getCharList()) print(s[1]) print(s[-1]) print(s[0, 5]) print(s[5, 0]) print(s[3...6]) print(s[2..<10]) print(s[...15]) print(s[2...]) print(s[..<15]) https://developer.apple.com/documentation/swift/unicode/canonicalcombiningclass Unicode.CanonicalCombiningClass The classification of a scalar used in the Canonical Ordering Algorithm defined by the Unicode Standard. --- Canonical combining classes are used by the ordering algorithm to determine if two sequences of combining marks should be considered canonically equivalent (that is, identical in interpretation). Two sequences are canonically equivalent if they are equal when sorting the scalars in ascending order by their combining class. --- aboveBeforeBelow = "\u{0041}\u{0301}\u{0316}"~text~unescape belowBeforeAbove = "\u{0041}\u{0316}\u{0301}"~text~unescape aboveBeforeBelow~compareTo(belowBeforeAbove)= -- 0 (good, means equal) aboveBeforeBelow == belowBeforeAbove= -- .true 15/07/2017 String Processing For Swift 4 https://github.com/apple/swift/blob/master/docs/StringManifesto.md https://swift.org/blog/utf8-string/ Swift 5 switches the preferred encoding of strings from UTF-16 to UTF-8 while preserving efficient Objective-C-interoperability. jlf: Search "breadcrumb". Notice that the article is about Swift Objective-C interoperability. The language Swift itself is not allowing random access to characters. 
--- Swift 5, like Rust, performs encoding validation once on creation, when it is far more efficient to do so. NSStrings, which are lazily bridged (zero-copy) into Swift and use UTF-16, may contain invalid content (i.e. isolated surrogates). As in Swift 4.2, these are lazily validated when read from. https://bugs.swift.org/browse/SR-7602 (redirect to next URL) https://github.com/apple/swift/issues/50144 UTF8 should be (one of) the fastest String encoding(s) --- Requirements: being able to copy UTF-8 encoded bytes from a String into a pre-allocated raw buffer must be allocation-free and as fast as memcpy can copy them creating a String from UTF-8 encoded bytes should just validate the encoding and store the bytes as they are (jlf: "and store the bytes as they are" --> YES!) slightly softer but still very strong requirement: currently (even with ASCII) only the stdlib seems to be able to get a pointer to the contiguous ASCII representation (if at all in that form). That works fine if you just want to copy the bytes (UnsafeMutableBufferPointer(start: destinationStart, count: destinationLength).initialize(from: string.utf8) which will use memcpy if in ASCII representation) but doesn't allow you to implement your own algorithms that are only performant on a contiguously stored [UInt8] --- jlf: this comment in the thread is particularly interesting, because it reminds me what was said on the ARB mailing list about byte versus string. https://github.com/apple/swift/issues/50144#issuecomment-1108303710 May 9, 2018 @milseman Virtually all of it comes down to `String(data: myData, encoding: .utf8)` and `myString.data(encoding: .utf8)`. When parsing protocols such as HTTP, Redis, MySQL, PostgreSQL, etc we will read data from the OS into an `UnsafeBufferPointer<UInt8>`. This is almost always via NIO's [`ByteBuffer`](https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html) type. We sometimes grab `String` from that directly or grab `Data` if we want to iterate over the bytes for additional parsing. In other words, from `UnsafePointer<UInt8>` we commonly read `FixedWidthInteger`, `BinaryFloatingPoint`, `Data`, and `String`. All are very performant except String which is the concern since the vast majority of bytes ends up being `String`s. Considering the DB use case specifically, the data transfer is usually emails, names, bios, comments, etc. Very few bytes are actually dedicated to binary numbers or data blobs. Strings everywhere. To summarize, the faster we can get from `Swift.Unsafe...Pointer<UInt8>` or `Foundation.Data` to `String` the better. That will affect (for the better!) quite literally our entire framework. --- jlf: this comment from the same thread shows which questions we should answer for Rexx: https://github.com/apple/swift/issues/50144#issuecomment-1108303720 Along the lines of potentially separable issues, what is your validation story? If the stream of bytes contains invalid UTF-8, do you want: 1) The initializer to fail resulting in nil 2) The initializer to fail producing an error 3) The invalid bytes to be replaced with U+FFFD 4) The bytes verbatim, and experience the emergent behavior / unspecified results / security hazard from those bytes. 
For reference, I think [Rust's model](https://doc.rust-lang.org/std/string/struct.String.html) is pretty good: `from_utf8` produces an error explaining why the code units were invalid `from_utf8_lossy` replaces encoding errors with U+FFFD `from_utf8_unchecked` which takes the bytes, but if there's an encoding error, then memory safety has been violated I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly. We may want to be very cautious about if/how we expose it. I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8. (jlf: I don't understand this last sentence. By "read-time", does he mean "when working with the string"?) milseman Michael Ilseman added a comment - 5 Nov 2018 3:44 PM It's now the fastest encoding. https://forums.swift.org/t/string-s-abi-and-utf-8/17676/1 https://github.com/apple/swift/pull/20315 https://github.com/apple/swift/blob/7e68e8f4a3cb1173e909dc22a3490c05e43fa592/stdlib/public/core/StringObject.swift swift/stdlib/public/core/StringObject.swift jlf: the link above is a frozen link. To have an up-to-date view, go to https://github.com/apple/swift/tree/main/stdlib/public/core A lot of code to review! String.swift StringBreadcrumbs.swift StringBridge.swift StringCharacterView.swift StringComparable.swift StringComparison.swift StringCreate.swift StringGraphemeBreaking.swift jlf: Apparently, there are some difficulties when going backwards. // When walking backwards, it's impossible to know whether we were in an emoji // sequence without walking further backwards. This walks the string backwards // enough until we figure out whether or not to break our // (.zwj, .extendedPictographic) question. // When walking backwards, it's impossible to know whether we break when we // see our first (.regionalIndicator, .regionalIndicator) without walking // further backwards. This walks the string backwards enough until we figure // out whether or not to break these RIs. StringGuts.swift StringGutsRangeReplaceable.swift StringGutsSlice.swift StringHashable.swift StringIndex.swift StringIndexConversions.swift StringIndexValidation.swift StringInterpolation.swift StringLegacy.swift StringNormalization.swift StringObject.swift StringProtocol.swift StringRangeReplaceableCollection.swift StringStorage.swift StringStorageBridge.swift StringSwitch.swift StringTesting.swift StringUTF16View.swift StringUTF8Validation.swift StringUTF8View.swift StringUnicodeScalarView.swift StringWordBreaking.swift Substring.swift https://github.com/apple/swift/blob/main/stdlib/public/core/StringBreadcrumbs.swift Breadcrumb optimization The distance between successive breadcrumbs, measured in UTF-16 code units, is 64. internal static var breadcrumbStride: Int { 64 } jlf: nothing sophisticated here... They scan the whole string by iterating over the UTF-16 indexes, and when i % stride == 0 then self.crumbs.append(curIdx) When searching the offset for a String.Index, they do a binary search. https://github.com/apple/swift/pull/20315/commits/2e368a3f6a25b5e84c0f682861ea0a5c9b3b26af [String] Introduce StringBreadcrumbs Breadcrumbs provide us amortized O(1) access to the UTF-16 view, which is vital for efficient Cocoa interoperability. --- jlf: this is the commit where breadcrumbs are added to Swift (Nov 4, 2018).
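Relating the four-option validation question quoted above (fail with nil, fail with an error, replace with U+FFFD, take the bytes verbatim) to initializers that Swift already ships: a minimal sketch, byte values invented for illustration. String(decoding:as:) takes the "replace with U+FFFD" route and never fails; the Foundation initializer String(data:encoding:) takes the "fail by returning nil" route.
import Foundation
// 0xC3 opens a 2-byte UTF-8 sequence, but 0x28 ("(") is not a continuation byte,
// so the input is ill-formed UTF-8.
let bytes: [UInt8] = [0x61, 0xC3, 0x28, 0x62]            // "a", <invalid>, "(", "b"
// Repairing initializer: the ill-formed subsequence becomes U+FFFD.
let repaired = String(decoding: bytes, as: UTF8.self)
print(repaired.unicodeScalars.map { $0.value })          // [97, 65533, 40, 98]
// Validating initializer: returns nil instead of repairing.
let strict = String(data: Data(bytes), encoding: .utf8)
print(strict == nil)                                     // true
These correspond to options 3) and 1) of the list; option 4), taking the bytes verbatim, is exactly what the quoted comment warns against exposing carelessly.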
https://stackoverflow.com/questions/55389444/whats-does-extended-grapheme-clusters-are-canonically-equivalent-means-in-term Whats does “extended grapheme clusters are canonically equivalent” means in terms of Swift String? jlf: They don't answer the question :-( no explanation of "canonically equivalent", just ONE poor example, no general definition. https://forums.swift.org/t/pitch-unicode-equivalence-for-swift-source/21576/6 Pitch: Unicode Equivalence for Swift Source jlf: interesting Mar 13, 2019 In short, there is a thorough set of rules already laid out in UAX#31 on how to normalize identifiers in programming languages. Several of us have written several versions of a proposal to adopt it, but each time it has failed because of issues with emoji. Recent versions of Unicode now have more robust classifications for emoji, so the proposal can be resurrected with better luck now, probably. No need to start from scratch; feel free to build on the work that we’ve already done. All of this applies only to identifiers. Literals should never be messed with by the compiler. They are, after all, supposed to be literals. 13/06/2021 https://github.com/apple/swift-evolution/blob/main/proposals/0211-unicode-scalar-properties.md Add Unicode Properties to Unicode.Scalar Issues Linking with ICU The Swift standard library uses the system's ICU libraries to implement its Unicode support. A third-party developer may expect that they could also link their application directly to the system ICU to access the functionality that they need, but this proves problematic on both Apple and Linux platforms. Apple On Apple operating systems, libicucore.dylib is built with function renaming disabled (function names lack the _NN version number suffix). This makes it fairly straightforward to import the C APIs and call them from Swift without worrying about which version the operating system is using. Unfortunately, libicucore.dylib is considered to be private API for submissions to the App Store, so applications doing this will be rejected. Instead, users must build their own copy of ICU from source and link that into their applications. This is significant overhead. Linux On Linux, system ICU libraries are built with function renaming enabled (the default), so function names have the _NN version number suffix. Function renaming makes it more difficult to use these APIs from Swift; even though the C header files contain #defines that map function names like u_foo_59 to u_foo, these #defines are not imported into Swift—only the suffixed function names are available. This means that Swift bindings would be fixed to a specific version of the library without some other intermediary layer. Again, this is significant overhead.
extension Unicode.Scalar.Properties { public var isAlphabetic: Bool { get } // Alphabetic public var isASCIIHexDigit: Bool { get } // ASCII_Hex_Digit public var isBidiControl: Bool { get } // Bidi_Control public var isBidiMirrored: Bool { get } // Bidi_Mirrored public var isDash: Bool { get } // Dash public var isDefaultIgnorableCodePoint: Bool { get } // Default_Ignorable_Code_Point public var isDeprecated: Bool { get } // Deprecated public var isDiacritic: Bool { get } // Diacritic public var isExtender: Bool { get } // Extender public var isFullCompositionExclusion: Bool { get } // Full_Composition_Exclusion public var isGraphemeBase: Bool { get } // Grapheme_Base public var isGraphemeExtend: Bool { get } // Grapheme_Extend public var isHexDigit: Bool { get } // Hex_Digit public var isIDContinue: Bool { get } // ID_Continue public var isIDStart: Bool { get } // ID_Start public var isIdeographic: Bool { get } // Ideographic public var isIDSBinaryOperator: Bool { get } // IDS_Binary_Operator public var isIDSTrinaryOperator: Bool { get } // IDS_Trinary_Operator public var isJoinControl: Bool { get } // Join_Control public var isLogicalOrderException: Bool { get } // Logical_Order_Exception public var isLowercase: Bool { get } // Lowercase public var isMath: Bool { get } // Math public var isNoncharacterCodePoint: Bool { get } // Noncharacter_Code_Point public var isQuotationMark: Bool { get } // Quotation_Mark public var isRadical: Bool { get } // Radical public var isSoftDotted: Bool { get } // Soft_Dotted public var isTerminalPunctuation: Bool { get } // Terminal_Punctuation public var isUnifiedIdeograph: Bool { get } // Unified_Ideograph public var isUppercase: Bool { get } // Uppercase public var isWhitespace: Bool { get } // Whitespace public var isXIDContinue: Bool { get } // XID_Continue public var isXIDStart: Bool { get } // XID_Start public var isCaseSensitive: Bool { get } // Case_Sensitive public var isSentenceTerminal: Bool { get } // Sentence_Terminal (S_Term) public var isVariationSelector: Bool { get } // Variation_Selector public var isNFDInert: Bool { get } // NFD_Inert public var isNFKDInert: Bool { get } // NFKD_Inert public var isNFCInert: Bool { get } // NFC_Inert public var isNFKCInert: Bool { get } // NFKC_Inert public var isSegmentStarter: Bool { get } // Segment_Starter public var isPatternSyntax: Bool { get } // Pattern_Syntax public var isPatternWhitespace: Bool { get } // Pattern_White_Space public var isCased: Bool { get } // Cased public var isCaseIgnorable: Bool { get } // Case_Ignorable public var changesWhenLowercased: Bool { get } // Changes_When_Lowercased public var changesWhenUppercased: Bool { get } // Changes_When_Uppercased public var changesWhenTitlecased: Bool { get } // Changes_When_Titlecased public var changesWhenCaseFolded: Bool { get } // Changes_When_Casefolded public var changesWhenCaseMapped: Bool { get } // Changes_When_Casemapped public var changesWhenNFKCCaseFolded: Bool { get } // Changes_When_NFKC_Casefolded public var isEmoji: Bool { get } // Emoji public var isEmojiPresentation: Bool { get } // Emoji_Presentation public var isEmojiModifier: Bool { get } // Emoji_Modifier public var isEmojiModifierBase: Bool { get } // Emoji_Modifier_Base } extension Unicode.Scalar.Properties { // Implemented in terms of ICU's `u_isdefined`. public var isDefined: Bool { get } } Case Mappings The properties below provide full case mappings for scalars. 
Since a handful of mappings result in multiple scalars (e.g., "ß" uppercases to "SS"), these properties are String-valued, not Unicode.Scalar. extension Unicode.Scalar.Properties { public var lowercaseMapping: String { get } // u_strToLower public var titlecaseMapping: String { get } // u_strToTitle public var uppercaseMapping: String { get } // u_strToUpper } Identification and Classification extension Unicode.Scalar.Properties { /// Corresponds to the `Age` Unicode property, when a code point was first /// defined. public var age: Unicode.Version? { get } /// Corresponds to the `Name` Unicode property. public var name: String? { get } /// Corresponds to the `Name_Alias` Unicode property. public var nameAlias: String? { get } /// Corresponds to the `General_Category` Unicode property. public var generalCategory: Unicode.GeneralCategory { get } /// Corresponds to the `Canonical_Combining_Class` Unicode property. public var canonicalCombiningClass: Unicode.CanonicalCombiningClass { get } } extension Unicode { /// Represents the version of Unicode in which a scalar was introduced. public typealias Version = (major: Int, minor: Int) /// General categories returned by /// `Unicode.Scalar.Properties.generalCategory`. Listed along with their /// two-letter code. public enum GeneralCategory { case uppercaseLetter // Lu case lowercaseLetter // Ll case titlecaseLetter // Lt case modifierLetter // Lm case otherLetter // Lo case nonspacingMark // Mn case spacingMark // Mc case enclosingMark // Me case decimalNumber // Nd case letterlikeNumber // Nl case otherNumber // No case connectorPunctuation //Pc case dashPunctuation // Pd case openPunctuation // Ps case closePunctuation // Pe case initialPunctuation // Pi case finalPunctuation // Pf case otherPunctuation // Po case mathSymbol // Sm case currencySymbol // Sc case modifierSymbol // Sk case otherSymbol // So case spaceSeparator // Zs case lineSeparator // Zl case paragraphSeparator // Zp case control // Cc case format // Cf case surrogate // Cs case privateUse // Co case unassigned // Cn } public struct CanonicalCombiningClass: Comparable, Hashable, RawRepresentable { public static let notReordered = CanonicalCombiningClass(rawValue: 0) public static let overlay = CanonicalCombiningClass(rawValue: 1) public static let nukta = CanonicalCombiningClass(rawValue: 7) public static let kanaVoicing = CanonicalCombiningClass(rawValue: 8) public static let virama = CanonicalCombiningClass(rawValue: 9) public static let attachedBelowLeft = CanonicalCombiningClass(rawValue: 200) public static let attachedBelow = CanonicalCombiningClass(rawValue: 202) public static let attachedAbove = CanonicalCombiningClass(rawValue: 214) public static let attachedAboveRight = CanonicalCombiningClass(rawValue: 216) public static let belowLeft = CanonicalCombiningClass(rawValue: 218) public static let below = CanonicalCombiningClass(rawValue: 220) public static let belowRight = CanonicalCombiningClass(rawValue: 222) public static let left = CanonicalCombiningClass(rawValue: 224) public static let right = CanonicalCombiningClass(rawValue: 226) public static let aboveLeft = CanonicalCombiningClass(rawValue: 228) public static let above = CanonicalCombiningClass(rawValue: 230) public static let aboveRight = CanonicalCombiningClass(rawValue: 232) public static let doubleBelow = CanonicalCombiningClass(rawValue: 233) public static let doubleAbove = CanonicalCombiningClass(rawValue: 234) public static let iotaSubscript = CanonicalCombiningClass(rawValue: 240) public let rawValue: 
UInt8 public init(rawValue: UInt8) } } Numerics Many Unicode scalars have associated numeric values. These are not only the common digits zero through nine, but also vulgar fractions and various other linguistic characters and ideographs that have an innate numeric value. These properties are exposed below. They can be useful for determining whether segments of text contain numbers or non-numeric data, and can also help in the design of algorithms to determine the values of such numbers. extension Unicode.Scalar.Properties { /// Corresponds to the `Numeric_Type` Unicode property. public var numericType: Unicode.NumericType? /// Corresponds to the `Numeric_Value` Unicode property. public var numericValue: Double? } extension Unicode { public enum NumericType { case decimal case digit case numeric } } 14/06/2021 https://lists.isocpp.org/sg16/2018/08/0121.php Feedback from swift team Swift strings now sort with NFC (currently UTF-16 code unit order, but likely changed to Unicode scalar value order). We didn't find FCC significantly more compelling in practice. Since NFC is far more frequent in the wild (why waste space if you don't have to), strings are likely to already be in NFC. We have fast-paths to detect on-the-fly normal sections of strings (e.g. all ASCII, all < U+0300, NFC_QC=yes, etc.). We lazily normalize portions of string during comparison when needed. Q: Swift strings support comparison via normalization. Has use of canonical string equality been a performance issue? Or been a source of surprise to programmers? A: This was a big performance issue on Linux, where we used to do UCA+DUCET based comparisons. We switch to lexicographical order of NFC-normalized UTF-16 code units (future: scalar values), and saw a very significant speed up there. The remaining performance work revolves around checking and tracking whether a string is known to already be in a normal form, so we can just memcmp. Q: I'm curious why this was a larger performance issue for Linux than for (presumably) macOS and/or iOS. A: There were two main factors. The first is that on Darwin platforms, CFString had an implementation that we used instead of UCA+DUCET which was faster. The second is that Darwin platforms are typically up-to-date and have very recent versions of ICU. On Linux, we still support Ubuntu LTS 14.04 which has a version of ICU which predates Swift and didn't have any fast-paths for ASCII or mostly-ASCII text. Switching to our own implementation based on NFC gave us many X improvement over CFString, which in turn was many X faster than UCA+DUCET (especially on older versions of ICU). Q: How firmly is the Swift string implementation tied to ICU? If the C++ standard library were to add suitable Unicode support, what would motivate reimplementing Swift strings on top of it? A: Swift's tie to ICU is less firm than it used to be If the C++ standard library provided these operations, sufficiently up-to-date with Unicode version and comparable or better to ICU in performance, we would be willing to switch. A big pain in interacting with ICU is their limited support for UTF-8. Some users who would like to use a lighter-weight Swift and are unhappy at having to link against ICU, as it's fairly large, and it can complicate security audits. https://forums.swift.org/t/pitch-unicode-for-string-processing/56907/6 [Pitch] Unicode for String Processing https://github.com/apple/swift-evolution/blob/main/proposals/0350-regex-type-overview.md jlf: surprising intro! 
Swift strings provide an obsessively Unicode-forward model of programming with strings. String processing with Collection's algorithms is woefully inadequate for many day-to-day tasks compared to other popular programming and scripting languages. We propose addressing this basic shortcoming through an effort we are calling regex. https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/ProposalOverview.md Regex Proposals todo: read String processing algorithms https://forums.swift.org/t/pitch-regex-powered-string-processing-algorithms/55969 todo: read Unicode for String Processing https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/UnicodeForStringProcessing.md https://stackoverflow.com/questions/41059974/german-character-%C3%9F-uppercased-in-ss "ß" is converted to "SS" when using uppercased(). --- Use caseInsensitiveCompare() instead of converting the strings to upper or lowercase: let s1 = "gruß" let s2 = "GRUß" let eq = s1.caseInsensitiveCompare(s2) == .orderedSame print(eq) // true This compares the strings in a case-insensitive way according to the Unicode standard. There is also localizedCaseInsensitiveCompare() which does a comparison according to the current locale, and s1.compare(s2, options: .caseInsensitive, locale: ...) for a case-insensitive comparison according to an arbitrary given locale. https://www.kodeco.com/3418439-encoding-and-decoding-in-swift jlf: off topic, it's not related to strings. It's about serialization of data structures. https://github.com/apple/swift-evolution/blob/main/proposals/0241-string-index-explicit-encoding-offset.md Deprecate String Index Encoded Offsets Feb 23, 2019 jlf: I add this URL for this description, not for the topic covered by this proposal: String abstracts away details about the underlying encoding used in its storage. String.Index is opaque and represents a position within a String or Substring. This can make serializing a string alongside its indices difficult, and for that reason SE-0180 added a computed variable and initializer encodedOffset in Swift 4.0. String was always meant to be capable of handling multiple backing encodings for its contents, and this is realized in Swift 5. String now uses UTF-8 for its preferred “fast” native encoding, but has a resilient fallback for strings of different encodings. Currently, we only use this fall-back for lazily-bridged Cocoa strings, which are commonly encoded as UTF-16, though it can be extended in the future thanks to resilience. Unfortunately, SE-0180’s approach of a single notion of encodedOffset is flawed. A string can be serialized with a choice of encodings, and the offset is therefore encoding-dependent and requires access to the contents of the string to calculate.
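A minimal sketch of why a single encodedOffset was flawed: the same text has a different length in every view, so an offset is meaningless unless you also record which encoding it counts. The string literal is invented for illustration; the emoji is U+1F469 U+200D U+1F52C ("woman scientist").
let s = "a\u{E9}\u{1F469}\u{200D}\u{1F52C}"   // "a" + "é" + woman-scientist emoji
print(s.count)                 // 3   Characters (grapheme clusters)
print(s.unicodeScalars.count)  // 5   Unicode scalar values
print(s.utf16.count)           // 7   UTF-16 code units
print(s.utf8.count)            // 14  UTF-8 code units
// One opaque String.Index is shared by all views; to serialize the position
// of the emoji you have to choose a view and measure the offset in that view.
let i = s.index(s.startIndex, offsetBy: 2)                 // position of the emoji
print(s.utf16.distance(from: s.utf16.startIndex, to: i))   // 2
print(s.utf8.distance(from: s.utf8.startIndex, to: i))     // 3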
https://www.tutorialkart.com/swift-tutorial/swift-read-text-file/#gsc.tab=0 Read text file import Foundation let file = "sample.txt" var result = "" //if you get access to the directory if let dir = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first { //prepare file url let fileURL = dir.appendingPathComponent(file) do { result = try String(contentsOf: fileURL, encoding: .utf8) } catch {/* handle if there are any errors */} } print(result) https://www.appsdeveloperblog.com/read-and-write-string-into-a-text-file/ Read and Write String Into a Text File let fileName = "myFileName.txt" var filePath = "" // Find documents directory on device let dirs : [String] = NSSearchPathForDirectoriesInDomains(FileManager.SearchPathDirectory.documentDirectory, FileManager.SearchPathDomainMask.allDomainsMask, true) if dirs.count > 0 { let dir = dirs[0] //documents directory filePath = dir.appending("/" + fileName) print("Local path = \(filePath)") } else { print("Could not find local directory to store file") return } // Set the contents let fileContentToWrite = "Text to be recorded into file" do { // Write contents to file try fileContentToWrite.write(toFile: filePath, atomically: false, encoding: String.Encoding.utf8) } catch let error as NSError { print("An error took place: \(error)") } // Read file content. Example in Swift do { // Read file content let contentFromFile = try NSString(contentsOfFile: filePath, encoding: String.Encoding.utf8.rawValue) print(contentFromFile) } catch let error as NSError { print("An error took place: \(error)") } Testing JMB's example "ς".uppercased() // "Σ" "σ".uppercased() // "Σ" "ὈΔΥΣΣΕΎΣ".lowercased() // "ὀδυσσεύσ" NOT SUPPORTED: the final Σ should become ς "ὈΔΥΣΣΕΎΣA".lowercased() // "ὀδυσσεύσa" last Σ becomes σ (correct, it is not word-final) https://developer.apple.com/documentation/swift/character/isnewline isNewline A Boolean value indicating whether this character represents a newline. For example, the following characters all represent newlines: “\n” (U+000A): LINE FEED (LF) U+000B: LINE TABULATION (VT) U+000C: FORM FEED (FF) “\r” (U+000D): CARRIAGE RETURN (CR) “\r\n” (U+000D U+000A): CR-LF U+0085: NEXT LINE (NEL) U+2028: LINE SEPARATOR U+2029: PARAGRAPH SEPARATOR --- jlf: this is related to Unicode properties of a character. But what are the impacts on file I/O?
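On the file I/O question: reading a file as shown above just yields one String; nothing is split automatically, and Character.isNewline only comes into play when the program asks for it. A minimal sketch, string literal invented for illustration, assuming Swift 5.2+ for the key-path-as-function syntax:
let text = "one\r\ntwo\u{2028}three\nfour"
// "\r\n" is a single Character in Swift, so it counts as one newline.
print(text.filter(\.isNewline).count)            // 3
// Splitting on the property treats CR-LF, LS, LF, NEL, ... uniformly.
print(text.split(whereSeparator: \.isNewline))   // ["one", "two", "three", "four"]
So the property by itself has no effect on reading or writing files; it only classifies characters once the text is in memory.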

Typst lang


https://github.com/typst/typst A new markup-based typesetting system that is powerful and easy to learn. --- jlf: uses ICU4X https://github.com/unicode-org/icu4x/issues/3811

XPath lang


https://www.w3.org/TR/xpath-functions-31/#string-functions Functions on strings jlf: to read. No "grapheme" in this document. Written by Michael Kay (XSLT WG), Saxonica <http://www.saxonica.com/> https://www.w3.org/TR/xpath-functions-31/#string.match String functions that use regular expressions jlf: part of the doc "Functions on strings" above, explicitly referenced for direct access. https://www.w3.org/TR/xpath-functions-31/#func-collation-key Referenced in https://github.com/unicode-org/icu4x/issues/2689#issuecomment-1743127855 hsivonen: I'm quite skeptical of processes that use XPath having the kind of lifetimes and numbers of comparisons that computing a sort key is justified, but whether or not exposing sort keys in XPath is a good idea, it's good to know that XPath has this dependency. faassen: I think the XPath spec (the library portion) has been influenced by the capabilities of ICU4J. The motivation for this facility is described in the "notes" section: https://www.w3.org/TR/xpath-functions-31/#func-collation-key and is basically to use this as a collation-dependent hashmap key. I can't judge myself how useful that is, so I'll defer to your skepticism. I'll note however that this same specification also provides the function library available to XQuery, and with XQuery the lifetimes and numbers of comparisons are likely to be much bigger.

Zig lang, Ziglyph


04/07/2021 https://github.com/jecolon/ziglyph Unicode text processing for the Zig programming language. https://devlog.hexops.com/2021/unicode-data-file-compression/ achieving 40-70% reduction over gzip alone https://github.com/jecolon/ziglyph/issues/3 More size-optimal grapheme cluster sorting 08/02/2023 https://github.com/natecraddock/zf a commandline fuzzy finder that prioritizes matches on filenames To review: uses ziglyph https://github.com/jecolon/ziglyph/issues/20 Grapheme segmentation with ZWJ sequences 10/02/2023 https://github.com/jecolon/ziglyph/issues/20 Grapheme segmentation with ZWJ sequences --- jlf: Executor is ok with utf8proc t = "🐻‍❄️🐻‍❄️"~text t~description= -- 'UTF-8 not-ASCII (2 graphemes, 8 codepoints, 26 bytes, 0 error)' t~characters== an Array (shape [8], 8 items) 1 : ( "🐻" U+1F43B So 2 "BEAR FACE" ) 2 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 3 : ( "❄" U+2744 So 1 "SNOWFLAKE" ) 4 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) 5 : ( "🐻" U+1F43B So 2 "BEAR FACE" ) 6 : ( "‍" U+200D Cf 0 "ZERO WIDTH JOINER", "ZWJ" ) 7 : ( "❄" U+2744 So 1 "SNOWFLAKE" ) 8 : ( "️" U+FE0F Mn 0 "VARIATION SELECTOR-16", "VS16" ) https://devlog.hexops.com/2021/unicode-sorting-why-browsers-added-special-emoji-matching/ Whether your application is in Go and has its own Unicode Collation Algorithm (UCA) implementation, or Rust and uses bindings to the popular ICU4C library - one thing is going to remain true: it requires large data files to work. The UCA algorithm depends on two quite large data table files to work: - UnicodeData.txt for normalization, a step required before sorting can take place. - allkeys.txt for weighting certain text above others. - And more, if you want truly locale-aware sorting and not just "the default" the UCA algorithm gives you. Together, these files can add up to over a half a megabyte. While WASM languages could shell out to JavaScript browser APIs for collation, I suspect they won't due to the lack of guarantees around those APIs. A more likely scenario is languages continuing to leave locale-aware sorting as an optional, opt-in feature - that also makes your application larger. I think this is a worthwhile problem to solve, so I am working on compression algorithms for these files specifically in Zig to reduce them to only a few tens of kilobytes. https://github.com/jecolon/ziglyph/issues/3

Knock, knock.


Knock, knock. Who’s there? You. You who? Yoo-hoo! It's You Nicode. Knock, knock. Who’s there? Sue. Sue who? It's Sue Nicode.