PennMUSH Community

Ticket #7287 (new unicode)

Opened 1 year ago

Last modified 1 year ago

Unicode. And: Regarding "Pennstr" vs. Codepoint-style markup.

Reported by: Sketch
Assigned to:
Priority: major
Milestone:
Keywords: ansi, mxp, pueblo
Cc:
Visibility: Public

Description

I plan to implement Unicode support in PennMUSH by having all functions index strings by character rather than by byte. This means converting the *bp/buff buffer handling to a new style. Don't worry about the work; I'll shoulder all of it.
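
To make "indexing by character, not by byte" concrete, here is a minimal sketch in C, assuming strings are stored as well-formed UTF-8. The function names (utf8_length, utf8_at) are hypothetical and are not existing PennMUSH functions. The sketch counts code points, not user-perceived characters, so a base letter followed by a combining mark still counts as two.

```c
#include <stddef.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Count the code points in a NUL-terminated, well-formed UTF-8 string.
 * Hypothetical helper, not part of PennMUSH. */
size_t utf8_length(const char *s)
{
  size_t n = 0;
  for (; *s; s++)
    if (!utf8_is_continuation((unsigned char) *s))
      n++;
  return n;
}

/* Return a pointer to the first byte of the code point at 0-based index
 * idx, or NULL if the string has fewer than idx + 1 code points.
 * Hypothetical helper, not part of PennMUSH. */
const char *utf8_at(const char *s, size_t idx)
{
  for (; *s; s++) {
    if (utf8_is_continuation((unsigned char) *s))
      continue;
    if (idx == 0)
      return s;
    idx--;
  }
  return NULL;
}
```

With the six-byte string "h\xC3\xA9llo" (five characters, since the second one takes two bytes), utf8_at(s, 2) points at byte offset 3, which is why byte-based indexing gives wrong answers once non-ASCII text is allowed.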

However, my main worry is that once this is implemented, we will be drawn to copy the TinyMUX approach of storing color codes as unprintable Unicode codepoints, and my "Pennstr" model will not be given consideration. I'd like to discuss and decide which path to take before I begin coding.

The conceptual model of Pennstr: http://sketch.pennmush.org/draft4.txt
The Unicode Codepoint-style markup is in TinyMUX, as far as I'm aware...

Pennstr style:
1) It's *very* fast. Most operations are bitwise operations and simple for-loops.
2) There are no special workarounds for ANY markup (ANSI/MXP) in sorts, arrays, inserts, deletes, shuffles, scrambles, and so on.
3) Simple, perfect handling of any markup.
Cons:
4) Memory use: comparatively high, but bounded by a limit that depends mostly on the parser's recursion limit. The in-memory 'ansi_string's already take up a lot of memory as well; in theory, the Pennstr style would just hold on to that memory for a longer time.
5) Testing: we'll need to try to break lots of things. Pennstrs are a departure from the simple in-string storage of markup.
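
The points above can be illustrated with a rough sketch of out-of-band markup. This is NOT the design from the draft linked above; the struct layout, the fixed capacity, the attribute bits, and the function names are all assumptions made up for illustration. The idea shown is only that when markup lives in a parallel array rather than inside the text, editing operations become plain loops that never have to parse or skip markup.

```c
#include <stddef.h>
#include <stdint.h>

#define PS_MAX 64            /* hypothetical fixed capacity for this sketch */

/* Hypothetical attribute bits; not taken from the Pennstr draft. */
#define PS_BOLD      0x01u
#define PS_UNDERLINE 0x02u

typedef struct {
  uint32_t cp[PS_MAX];       /* one code point per visible character */
  uint8_t  attr[PS_MAX];     /* markup bits for the character at the same index */
  size_t   len;              /* number of characters in use */
} pstr;

/* Turn on the given attribute bits for count characters starting at start. */
void pstr_mark(pstr *p, size_t start, size_t count, uint8_t bits)
{
  size_t i;
  for (i = start; i < p->len && i - start < count; i++)
    p->attr[i] |= bits;
}

/* Delete count characters starting at start. Each character's markup
 * moves with it, so no markup-specific handling is needed. */
void pstr_delete(pstr *p, size_t start, size_t count)
{
  size_t i;
  if (start >= p->len)
    return;
  if (count > p->len - start)
    count = p->len - start;
  for (i = start; i + count < p->len; i++) {
    p->cp[i] = p->cp[i + count];
    p->attr[i] = p->attr[i + count];
  }
  p->len -= count;
}
```

The memory cost named in the cons is visible here too: every character carries its own attribute slot whether or not it has any markup.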

Codepoints style:
1) We'd be identical in implementation to TinyMUX.
2) This could keep Pueblo very easily while still adding MXP.
3) Testing and concepts could be shared cross-platform with TinyMUX.
Cons:
4) This would really just be backtracking to all the 1.7.7 ANSI-handling code, except with Unicode in place of escape codes! That isn't an improvement.
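
For contrast, a minimal sketch of what in-band codepoint markup demands from string functions. The reserved range below is an assumption chosen from the BMP Private Use Area purely for illustration; it is not claimed to match what TinyMUX actually uses, and the function names are hypothetical. The point is that any function working in visible-character positions has to test for and skip markup codepoints.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range of codepoints reserved for markup (illustration only). */
#define MARKUP_FIRST 0xE000u
#define MARKUP_LAST  0xE0FFu

static int is_markup(uint32_t cp)
{
  return cp >= MARKUP_FIRST && cp <= MARKUP_LAST;
}

/* Number of visible characters among the n codepoints in s. */
size_t visible_length(const uint32_t *s, size_t n)
{
  size_t i, count = 0;
  for (i = 0; i < n; i++)
    if (!is_markup(s[i]))
      count++;
  return count;
}

/* Map a visible-character index to its position in the raw array.
 * Returns n if there are not that many visible characters. */
size_t visible_to_raw(const uint32_t *s, size_t n, size_t idx)
{
  size_t i;
  for (i = 0; i < n; i++) {
    if (is_markup(s[i]))
      continue;
    if (idx == 0)
      return i;
    idx--;
  }
  return n;
}
```

Every sort, insert, delete, or scramble would need this kind of translation step, which is the same burden the escape-code handling carried.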


Suggestions, comments, and questions welcome. I'm sure I didn't think of everything.

Change History

04/16/07 13:53:10 changed by Sketch

Regarding the storage of codepoints:
The characters will be normalized as much as possible before they're stored at all. All Unicode will be stored as UTF-8.

Regarding the PCRE functions:
It seems as though PCRE returns offsets within a buffer. I'm not sure if that's character-based or byte-based (probably adjustable?) but either should be easy to deal with.
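
As far as I know, the classic PCRE API reports match offsets in bytes even when UTF-8 mode is enabled, so the byte-based case is the one most likely to need handling. A minimal sketch of the conversion, assuming well-formed UTF-8; the function name utf8_char_index is hypothetical, and no PCRE calls are made here:

```c
#include <stddef.h>

/* Convert a byte offset into a character (code point) index by counting
 * the code points that start before byte_offset. Assumes well-formed
 * UTF-8 and that byte_offset falls on a character boundary.
 * Hypothetical helper, not part of PennMUSH or PCRE. */
size_t utf8_char_index(const char *s, size_t byte_offset)
{
  size_t i, chars = 0;
  for (i = 0; i < byte_offset && s[i] != '\0'; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)  /* not a continuation byte */
      chars++;
  return chars;
}
```

Applied to both offsets of a match, this turns a byte range into the character range that softcode functions would expect.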

05/19/07 12:42:23 changed by raevnos

  • type changed from general discussion to unicode.