Category talk:Wren-upc

User-perceived characters

In Unicode, a user-perceived character (or grapheme cluster) can comprise one or more codepoints. The process of splitting a string into such grapheme clusters is described in Unicode Standard Annex #29.
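
To illustrate the gap between codepoints and user-perceived characters, here is a minimal sketch in Go (the language of the library this module is based on), using only the standard library. The helper name codepointCount is just for illustration and isn't part of any library. Each sample string below is displayed as a single character, yet contains two codepoints.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codepointCount is a hypothetical helper for this sketch: it returns the
// number of Unicode codepoints in s, which is NOT the number of
// user-perceived characters.
func codepointCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"e\u0301",              // 'e' followed by a combining acute accent
		"\U0001F1EB\U0001F1F7", // two regional indicators forming one flag
	}
	for _, s := range samples {
		// len(s) is the UTF-8 byte length; neither count tells us how many
		// grapheme clusters the string contains.
		fmt.Printf("%d bytes, %d codepoints\n", len(s), codepointCount(s))
	}
}
```

Neither the byte count nor the codepoint count tells you that each string is a single grapheme cluster. Working that out needs the boundary rules of Annex #29, together with a table of character properties.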

Given the complexity of this process, Wren doesn't have built-in support for it, and this module aims to remedy that. It is based on Oliver Kuederle's Unicode Text Segmentation for Go library, which is subject to the MIT License and is currently based on Unicode version 12.0.

Although the source code file is large by Wren library standards (over 1900 lines), approximately 1600 of those lines are needed for the property table, which provides the raw material for text segmentation. In the interests of brevity, I have omitted the comments that accompanied the original table; the original should be consulted if any explanation is needed.

(Currently, I am unable to upload the source code for this module, or even to preview it, as I keep getting a 502 'bad gateway' error. I wondered at first if this was due to the size of the file (circa 75K bytes) but I get the same error if I try to upload it in chunks. Will keep trying but may have to upload to an external site and then link to that if the problem persists.)