Architecture

Each writing system that uses the Mongolian script, as well as the unified writing system formed by integrating them, shares the same architecture of text representation and text shaping. The text representation and text shaping rules of the individual writing systems and of the integrated writing system are concrete realizations of this common architecture. This chapter describes the common architecture, and subsequent chapters describe the specific rules of each writing system in detail.

Character set

The characters included in each writing system of the Mongolian script can be categorized into Mongolian-specific characters and characters shared with other scripts.

Script	Type of characters	Examples	Note
General	Space	space
	Punctuation	middle dot, …
	Format controls	ZWJ, ZWNJ, …	participate in shaping
	Digits	digit one, …
Mongolian	Punctuation	birga, …	less used now
	Format controls	FVS, MVS, …	participate in shaping
	Digits	Mongolian digit one, …	less used now
	Phonetic letters	Mongolian letter a, …	participate in shaping
CJK	Punctuation	question mark, …

Phonetic letters and written units

Because matres lectionis were introduced, both the vowels and the consonants of the written language are represented in the actual text in the writing systems that use the Mongolian script; the two behave similarly and have equal status. Therefore, although the original writing system was an abjad, each writing system using the Mongolian script has the characteristics of an alphabet.

Because these writing systems have the characteristics of both an abjad and an alphabet, there are two ways of analyzing and encoding the Mongolian script. One is to determine the graphemes used by each writing system by chronologically comparing the Mongolian script with the Old Uyghur script and the Old Turkic script, and to identify each grapheme as a character; the model formed by this method is called the graphetic model, and the basic unit in this model is called the written unit. The other is to analyze the phonemes of the written language recorded by the writing system, to group together the glyphs that record the same phoneme, and to identify each phoneme as a character; the model formed by this method is called the phonetic model, and the basic unit in this model is called the phonetic letter.

The specification used in this manual is based on the phonetic model, in which the characters correspond to the phonemes of the written language as recorded in the writing systems, with some compromises to maintain relative stability with the existing system. For example, the final Iy in Manchu records the phoneme /ɹ̩/, which is the same phoneme recorded by the phonetic letter ii, but because it corresponds to ⟨i⟩ rather than ⟨y⟩ in the Möllendorff transcription, it is treated as corresponding to the phonetic letter i rather than ii. This analysis is inherited in this specification.

Many-to-one and one-to-many confusion. Since written units and phonemes do not correspond one to one, the same phoneme may correspond to more than one written unit, and the same written unit may correspond to more than one phoneme. For example, in Hudum, the phonetic letter n is usually written as N when it appears as an onset and as A when it appears as a coda (except in loanwords such as S W K2 O I N D); and the grapheme sequence A O R D U may correspond to o r d o “palace”, u r t u “long”, or u r d u “south”.
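The two directions of ambiguity can be pictured as two lookups. The following is a minimal sketch that uses only the Hudum examples cited above; the dictionary names, the function name, and the space-separated transcription strings are illustrative assumptions, not part of this specification.

```python
# Illustrative data only, taken from the Hudum examples in the text.
# One phonetic letter -> more than one written unit (usual case;
# loanwords may behave differently).
WRITTEN_UNITS_FOR_N = {"onset": "N", "coda": "A"}

# One written-unit sequence -> more than one phonetic reading.
READINGS = {
    ("A", "O", "R", "D", "U"): [
        ("o r d o", "palace"),
        ("u r t u", "long"),
        ("u r d u", "south"),
    ],
}


def readings_for(written_units):
    """Return every phonetic reading recorded for a written-unit sequence."""
    return READINGS.get(tuple(written_units), [])


candidates = readings_for("A O R D U".split())
for phonetic, gloss in candidates:
    print(phonetic, "=", gloss)
```

The lookup returns all three readings because the written form alone does not disambiguate them; choosing among them needs lexical or contextual knowledge.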

Format controls

Zero Width Non-Joiner (ZWNJ), Zero Width Joiner (ZWJ), and Nirugu. U+200C and U+200D are Unicode’s standard cursive joining controls. Note that ZWJ also breaks interaction (such as ligation) between two consecutive characters since it is treated as an invisible character. U+180A is a Mongolian-specific character that behaves exactly like ZWJ but is visible as a piece of stem stroke. ZWNJ and ZWJ should not be accessible to the average user on common keyboard layouts, as everyday text does not require these characters.

The visible character nirugu should be used to cause joining in everyday text. A common use case is to end a patronymic abbreviation that is the initial syllable body (i.e., an optional onset plus the first vowel) or just the initial consonant letter of the father’s name.
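As a minimal sketch of the patronymic use case, the helper below appends a nirugu to an initial. The function name `abbreviate_patronymic` is hypothetical, and the sample input (MONGOLIAN LETTER BA, U+182A) is chosen only for illustration.

```python
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER
NIRUGU = "\u180A"  # MONGOLIAN NIRUGU: joins like ZWJ but is visible


def abbreviate_patronymic(initial: str) -> str:
    """Hypothetical helper: end a patronymic initial with a visible nirugu."""
    if not initial:
        raise ValueError("initial must not be empty")
    if initial.endswith(NIRUGU):
        return initial  # already terminated, do not add a second nirugu
    return initial + NIRUGU


# Illustrative input: a single initial consonant letter.
abbrev = abbreviate_patronymic("\u182A")
print([f"U+{ord(ch):04X}" for ch in abbrev])
```

The sketch uses the nirugu rather than ZWJ because the text above recommends the visible character for everyday text.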

Vowel Separator (MVS) and Narrow No-Break Space (NNBSP). MVS is a Mongolian-specific format control for requesting the chachlag variation. It is transcribed as ‘·’ (a middle dot). NNBSP is a whitespace and format control used to represent and present particles. It is transcribed as ‘–’ (an en-dash). Use of NNBSP is discouraged in favor of MVS, because NNBSP produces anomalous shaping in some contexts.

Free Variation Selector (FVS). FVSs are Mongolian-specific format controls. They are placed after certain characters to request forms that are not captured by the predictive shaping rules.

Standardized Variation Selector (VS). VSs are Unicode’s standard format controls for requesting glyph variants. As of Unicode 17.0, VS2 is used to specify the Sibe form of quotation marks.
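For debugging, it can help to make these invisible characters visible. The sketch below lists the code points of the controls discussed in this section and replaces them with readable tags, using the transcriptions given above for MVS and NNBSP. The short tag names and the `describe` function are assumptions made for this illustration.

```python
# Code points of the controls discussed in this section.
CONTROLS = {
    "\u200C": "ZWNJ",
    "\u200D": "ZWJ",
    "\u180A": "NIRUGU",
    "\u180B": "FVS1",
    "\u180C": "FVS2",
    "\u180D": "FVS3",
    "\u180F": "FVS4",
    "\u180E": "MVS",
    "\u202F": "NNBSP",
    "\uFE01": "VS2",
}

# Transcriptions stated in the text: middle dot for MVS, en dash for NNBSP.
TRANSCRIPTION = {"MVS": "\u00B7", "NNBSP": "\u2013"}


def describe(text: str) -> str:
    """Replace each control with its transcription or a <TAG> placeholder."""
    out = []
    for ch in text:
        name = CONTROLS.get(ch)
        if name is None:
            out.append(ch)  # ordinary character, keep as is
        else:
            out.append(TRANSCRIPTION.get(name, f"<{name}>"))
    return "".join(out)


print(describe("\u1820\u180B"))
```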

Numbers and Punctuation

The document does not contain specifications for numbers, punctuation, damaru, ubadama, and so on; these will be added in subsequent versions.

Preliminary investigation:

Numerals. The Mongolian digits zero to nine are not used in modern China, but they were used in ancient China. They appear in ancient Hudum and Todo literature. Their shapes differ slightly between Hudum and Todo, but they belong to the same system.

The Mongolian digits are employed in modern Mongolia. They appear on Mongolian banknotes.

The Mongolian digits behave identically to European digits. In various publications, these digits may be rotated (arranged vertically) or upright (arranged vertically or horizontally).
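At the character level, the Mongolian digits U+1810 to U+1819 are decimal digits, so software can treat them positionally like European digits. A minimal sketch follows; the function name `to_mongolian_digits` is an assumption, and the sketch does not address rotation or layout.

```python
import unicodedata

MONGOLIAN_DIGIT_ZERO = 0x1810  # MONGOLIAN DIGIT ZERO; digits run to U+1819


def to_mongolian_digits(n: int) -> str:
    """Write a non-negative integer with Mongolian digits."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    return "".join(chr(MONGOLIAN_DIGIT_ZERO + int(d)) for d in str(n))


text = to_mongolian_digits(2024)

# Python's int() accepts any Unicode decimal digits, including these.
value = int(text)

# Each character also carries its decimal value as a Unicode property.
digit_values = [unicodedata.decimal(ch) for ch in text]
```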

Punctuation.

Hudum punctuation.

᠀: rotated.

᠁, ᠄, ᠅: rotated.

᠂, ᠃: rotated.

%, +, -, =, ~: Latin punctuation, used in “50%, 3+5=8, 3~5”.

·: middle dot, positioned in the middle.

⁈, ⁉, ；, ！, ？: upright.

——: rotated.

（）《》〈〉［］: rotated.

Todo punctuation.

U+11660 — U+1166C: rotated.

「」『』｛｝: rotated.

，, 、, 。: upright, positioned in the middle.

……: rotated.

Sibe punctuation.

：: upright, positioned in the middle.

‘, ’, “, ”: upright; ‘ and “ are positioned on the left, ’ and ” are positioned on the right. Instructions for selecting the VS need to be added and referenced.

᠈, ᠉: rotated.

Note that, although there are gaps on both sides of punctuation marks, certain punctuation marks are conventionally followed by a space character. In such cases, it appears that the gap following the punctuation mark should be reduced.

Shaping process

The Mongolian text shaping process builds on the widely implemented technology for general scripts and cursive scripts, with an additional phase of Mongolian-specific shaping steps inserted into the ordinary shaping process required by cursive scripts. The minimal shaping process consists of the steps shown below.

Shaping phase		Shaping step
Ia. General		Basic character-to-glyph mapping
IIa. Cursive script		Initiation of cursive positions
III. Mongolian-specific: reduction of phonetic letters to written units	Phonetic	Chachlag
		Syllabic
		Particle
	Graphemic	Devsger
	Graphemic	Post-bowed
	Uncaptured	FVS-selected
IIb. Cursive script (continued): sub-written-unit variations		Variation involving bowed written units
		Cleanup of format controls
		Optional treatments
Ib. General (continued): typography		Vertical forms of punctuation marks
		Optional treatments
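The ordering in the table can be sketched as an ordered list of (phase, step) pairs that a shaper walks through. This is only an illustration: the handler mechanism and the function names are assumptions, and assigning "Cleanup of format controls" and the first "Optional treatments" to phase IIb is one reading of the table.

```python
# Steps in table order. Phase III sits between the two cursive phases,
# which in turn sit between the two general phases.
SHAPING_STEPS = [
    ("Ia", "Basic character-to-glyph mapping"),
    ("IIa", "Initiation of cursive positions"),
    ("III", "Chachlag"),
    ("III", "Syllabic"),
    ("III", "Particle"),
    ("III", "Devsger"),
    ("III", "Post-bowed"),
    ("III", "FVS-selected"),
    ("IIb", "Variation involving bowed written units"),
    ("IIb", "Cleanup of format controls"),  # phase assignment assumed
    ("IIb", "Optional treatments"),         # phase assignment assumed
    ("Ib", "Vertical forms of punctuation marks"),
    ("Ib", "Optional treatments"),
]


def run_pipeline(glyphs, handlers):
    """Apply handlers in table order; steps without a handler are skipped."""
    for key in SHAPING_STEPS:
        handler = handlers.get(key)
        if handler is not None:
            glyphs = handler(glyphs)
    return glyphs


def phase_order():
    """Return the sequence of phases with consecutive repeats collapsed."""
    order = []
    for phase, _step in SHAPING_STEPS:
        if not order or order[-1] != phase:
            order.append(phase)
    return order
```

Keying the handlers by the (phase, step) pair rather than by step name alone keeps the two "Optional treatments" steps distinct.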

General shaping phases

These are the basic mechanisms in fonts that apply to all scripts.

The basic character-to-glyph mapping (phase Ia) is typically controlled by the TrueType/OpenType cmap table. The Unicode representative glyphs can be used here as the default glyph mappings for phonetic letters, but these representative glyphs are essentially irrelevant to the final rendering.

Vertical forms of punctuation marks (phase Ib) are critical to the proper setting of Mongolian text, but are not part of the complex shaping between letters and format controls.

Cursive script shaping phases

On top of the general shaping mechanisms, complex scripts require additional shaping phases to be inserted after the basic character-to-glyph mapping and before typographical treatments. In particular, cursive scripts all undergo the cursive joining mechanism.

Cursive joining. Written forms exhibit the cursive joining mechanism (phase IIa). Each side of a written form is either joined to an adjacent written form or not, which gives four possible states. More abstractly, each written form is in one of four cursive positions:

Isolated, abbreviated as ‘isol’: not joined forward (above, in Mongolian), not joined backward (below, in Mongolian);
Initial, abbreviated as ‘init’: not joined forward, joined backward;
Medial, abbreviated as ‘medi’: joined forward, joined backward;
Final, abbreviated as ‘fina’: joined forward, not joined backward.
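The four positions follow directly from the two joining states. A minimal sketch, in which the parameter names `joined_above` and `joined_below` stand for "joined forward" and "joined backward" as defined in the list above (the function name is an assumption):

```python
def cursive_position(joined_above: bool, joined_below: bool) -> str:
    """Map the two joining states of a written form to its cursive position.

    joined_above: joined forward (above, in Mongolian).
    joined_below: joined backward (below, in Mongolian).
    """
    if joined_above and joined_below:
        return "medi"
    if joined_above:
        return "fina"
    if joined_below:
        return "init"
    return "isol"


positions = {
    (above, below): cursive_position(above, below)
    for above in (False, True)
    for below in (False, True)
}
```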

Cursive positions are independent of word boundaries, although they usually coincide with word-wise positions in Mongolian, because the writing system allows breaks in cursive joining within a word only in limited cases.

Implementation. The nominal glyph of each phonetic letter will be mapped to the default glyph of that letter at a given cursive position.
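A rough sketch of this step is given below under strong simplifying assumptions: every code point in U+1820 to U+1878 is treated as a dual-joining letter, ZWJ and nirugu are treated as join-causing, everything else breaks joining, and FVS and MVS are not handled at all. The glyph naming scheme `uXXXX.pos` and the function names are hypothetical.

```python
JOIN_CAUSING = {"\u200D", "\u180A"}  # ZWJ and nirugu


def is_letter(ch: str) -> bool:
    # Simplification: treat this whole range as dual-joining letters.
    return 0x1820 <= ord(ch) <= 0x1878


def joins(ch: str) -> bool:
    return is_letter(ch) or ch in JOIN_CAUSING


def position(joined_above: bool, joined_below: bool) -> str:
    if joined_above and joined_below:
        return "medi"
    if joined_above:
        return "fina"
    if joined_below:
        return "init"
    return "isol"


def positioned_glyphs(text: str) -> list:
    """Return a hypothetical default glyph name for each letter in text."""
    result = []
    for i, ch in enumerate(text):
        if not is_letter(ch):
            continue  # controls and other characters get no letter glyph here
        above = i > 0 and joins(text[i - 1])
        below = i + 1 < len(text) and joins(text[i + 1])
        result.append(f"u{ord(ch):04X}.{position(above, below)}")
    return result


print(positioned_glyphs("\u1820\u1821\u1822"))
```

In a font, the same effect is usually obtained with substitution lookups keyed on cursive position; the string suffixes here only stand in for that mechanism.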

Graphemic variation after bowed written units. Before the sub-written-unit variation, bowed written units may first cause a vowel to change its form.

Mongolian-specific shaping phases

Phase III consists of a series of steps for Mongolian-specific shaping requirements, and within each step there may be more than one set of non-overlapping rules, each for a different group of letters. Forms not captured by the predictive conditions are requested with an FVS.
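The non-overlap requirement within a step can be checked mechanically. The sketch below assumes that each rule set is described by the group of letters it covers; the rule-set names and letter groups are placeholders invented for the example and do not reflect the actual rules of any writing system.

```python
def check_non_overlapping(rule_sets):
    """Verify that no letter is covered by two rule sets of the same step.

    rule_sets maps a rule-set name to the set of letters it applies to.
    Returns a letter -> rule-set name mapping, or raises ValueError.
    """
    owner = {}
    for name, letters in rule_sets.items():
        for letter in sorted(letters):
            if letter in owner:
                raise ValueError(
                    f"letter {letter!r} is covered by both "
                    f"{owner[letter]!r} and {name!r}"
                )
            owner[letter] = name
    return owner


# Placeholder rule sets for a single hypothetical step.
step_rule_sets = {
    "rule_set_1": {"a", "e"},
    "rule_set_2": {"n", "g"},
}
owner = check_non_overlapping(step_rule_sets)
```

Because the rule sets within a step are disjoint, the order in which they are applied inside that step does not affect the result; only the order of the steps themselves matters.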