Unicode™ : Java Glossary

©1996-2009 2008-08-20 Roedy Green, Canadian Mind Products
Unicode
Unicode Glyphs
What Is Unicode?
Symbols
Arrows
Viewing Glyphs
Creating Unicode Documents
BOMs : Byte Order Marks
What’s Missing From Unicode?
Unicode Editors
Books
Links

Unicode Glyphs

Unicode 16 and Unicode 32 Glyphs in Downloadable Acrobat PDF Format
code  Description († = 32-bit)
0000  Basic Latin
0080  Latin-1 Supplement: accented letters, basic symbols
0100  Latin Extended-A: Esperanto accented letters
0180  Latin Extended-B: African
0250  IPA (International Phonetic Alphabet) Extensions
02B0  Spacing Modifier Letters
0300  Combining Diacritical Marks
0370  Greek
0400  Cyrillic
0500  Cyrillic Supplement
0530  Armenian
0590  Hebrew
0600  Arabic
0700  Syriac
0780  Thaana: Maldives
0900  Devanagari: Hindi
0980  Bengali
0A00  Gurmukhi: Punjabi
0A80  Gujarati: Gujarat
0B00  Oriya: Odiya Orissa
0B80  Tamil
0C00  Telugu: Andhra Pradesh
0C80  Kannada: Karnataka
0D00  Malayalam: Kerala
0D80  Sinhala: Sri Lanka
0E00  Thai
0E80  Lao
0F00  Tibetan
1000  Myanmar
10A0  Georgian
1100  Hangul Jamo: Korean
1200  Ethiopic
13A0  Cherokee
1400  Canadian Aboriginal Syllabic
1680  Ogham: Old Irish
16A0  Runic
1700  Tagalog: Filipino
1720  Hanunoo: Mindoro in the Philippines
1740  Buhid: Mindoro in the Philippines
1760  Tagbanwa: Philippines
1780  Khmer: Cambodian
1800  Mongolian
1900  Limbu: Tibet/Burma
1950  Tai Le: China
19E0  Khmer Symbols: Cambodian
1D00  Phonetic Extensions
1E00  Latin Extended Additional: dotted letters, letters with two accents.
1F00  Greek Extended
2000  General Punctuation
2070  Superscripts and Subscripts
20A0  Currency Symbols
20D0  Combining Marks for Symbols
2100  Letterlike Symbols
2150  Number Forms: Roman Numerals and fractions
2190  Arrows
2200  Mathematical Operators: del, grad, element, there exists, for all, union, intersection, contains, dot product, cross product, therefore, square root, logical and, logical or, summation, product.
2300  Miscellaneous Technical: APL operators.
2400  Control Pictures: for displaying unprintable ASCII control characters.
2440  Optical Character Recognition
2460  Enclosed Alphanumerics
2500  Box Drawing, also triangles
2580  Block Elements
25A0  Geometric Shapes
2600  Miscellaneous Symbols: chess, astrology, I-ching, telephones, hazards, religious symbols, hammer and sickle.
2700  Dingbats: asterisks, ornaments, hands, right-pointing arrows, pencils, scissors, pens.
27C0  Miscellaneous Mathematical Symbols-A: including SQL left, right and full joins.
27F0  Supplemental Arrows-A
2800  Braille Patterns
2900  Supplemental Arrows-B
2980  Miscellaneous Mathematical Symbols-B
2A00  Supplemental Mathematical Operators: including variants of + - × ÷
2B00  Miscellaneous Symbols and Arrows
2C00  Glagolitic: pre Cyrillic Bulgarian
2E80  CJK Radicals Supplement: Chinese Japanese Korean
2F00  Kangxi Radicals: fragments combined to write Chinese
2FF0  Ideographic Description Characters
3000  CJK Symbols and Punctuation: Chinese Japanese Korean
3040  Hiragana: (Japanese) Used when no Kanji character exists.
30A0  Katakana: (Japanese) mainly for foreign names
3100  Bopomofo: phonetic script for Mandarin
3130  Hangul Compatibility Jamo: Korean
3190  Kanbun: used by Japanese to annotate classic Chinese
31A0  Bopomofo Extended: phonetic script for Mandarin
31F0  Katakana Phonetic Extensions: Japanese
3200  Enclosed CJK Letters and Months: Chinese Japanese Korean
3300  CJK Compatibility: Chinese Japanese Korean
3400  CJK Unified Ideographs Extension A: Chinese Japanese Korean
4DC0  Yijing Hexagram Symbols: I Ching
4E00  CJK Unified Ideographs: Chinese Japanese Korean, huge download, including Kanji digits 零 一 二 三 四 五 六 七 八 九
A000  Yi Syllables
A490  Yi Radicals
AC00  Hangul Syllables: Korean
D800  High Surrogates
DC00  Low Surrogates
E000  Private Use Area
F900  CJK Compatibility Ideographs: Chinese Japanese Korean
FB00  Alphabetic Presentation Forms: ligatures including Hebrew
FB50  Arabic Presentation Forms-A
FE00  Variation Selectors: non-printing control characters
FE20  Combining Half Marks
FE30  CJK Compatibility Forms: Chinese Japanese Korean
FE50  Small Form Variants: small punctuation
FE70  Arabic Presentation Forms-B
FF00  Halfwidth and Fullwidth Forms
FFF0  Specials: byte order marks.
†0001 0000  Linear B Syllabary (32-bit)
†0001 0080  Linear B Ideograms (32-bit)
†0001 0100  Aegean Numbers: (32-bit)
†0001 0300  Old Italic: (32-bit)
†0001 0330  Gothic: (32-bit)
†0001 0380  Ugaritic Cuneiform (32-bit)
†0001 0400  Deseret: Mormon: (32-bit)
†0001 0450  Shavian: (32-bit)
†0001 0480  Osmanya: Somalian (32-bit)
†0001 0800  Cypriot Syllabary (32-bit)
†0001 D000  Byzantine Musical Symbols: (32-bit)
†0001 D100  Musical Symbols: (32-bit)
†0001 D300  Tai Xuan Jing Symbols: (32-bit) look like I-Ching hexagrams truncated to four lines.
†0001 D400  Mathematical Alphanumeric Symbols: (32-bit)
†0002 0000  CJK Unified Ideographs Extension B: (32-bit) Chinese Japanese Korean, huge download
†0002 F800  CJK Compatibility Ideographs Supp.: (32-bit) Chinese Japanese Korean
†000E 0000  Tags: control characters. (32-bit)
†000E 0100  Variation Selectors Supp.: non-printing control characters (32-bit)
†000F 0000  Supplementary Private Use Area-A: (32-bit)
†0010 0000  Supplementary Private Use Area-B: (32-bit)

What Is Unicode?

A character encoding used in Java. A Java char is 16 bits; the codes marked † in the table above need more than 16 bits, which Java stores as a pair of 16-bit surrogates. See the example glyphs, in PDF format. Requires Adobe Acrobat to view. Also available as an ASCII text file describing the glyphs with cross references to similar glyphs. Unicode does not standardise the precise shapes of the letters, i.e. the glyphs. It does, however, provide example glyphs. This distinction is most important for the unified Han characters, which encode Chinese, Japanese and Korean ideographs. The three languages use the same Unicode encodings, but quite different-looking renderings of the characters. These differences are handled by the font designer, who uses Chinese, Japanese or Korean style.

Sometimes called UCS or ISO 10646. Unicode allows Java to handle international characters for most of the world’s living languages, including Arabic, Armenian, Bengali, Bopomofo, Chinese (via unified Han), Cyrillic, English, Georgian, Greek, Gujarati, Gurmukhi, Hebrew, Hindi (Devanagari), Japanese (Hiragana, Katakana, and Kanji via unified Han), Kannada, Korean (Hangul, plus Hanja via unified Han), Lao, Malayalam, Oriya, Tai, Tamil, Telugu, Tibetan… Unicode will make it much easier for non-English speaking programmers to write programs for English speaking users and vice versa.

In Java, you get at the exotic characters by encoding them in hex in your strings like this: "\u00f7\u2713" to produce ÷ ✓. See String literals for more details.
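Here is a minimal sketch of the escape mechanism. The class name UnicodeEscapes and the helper fromCodePoints are made up for illustration; the only library call is the String(int[], int, int) constructor. It builds the same two-character string once from the \u escapes and once from the numeric code points, and shows they are equal.

```java
public class UnicodeEscapes {
    /** Hypothetical helper: builds a String from numeric code points. */
    public static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // Division sign and check mark, written as Unicode escapes.
        String literal = "\u00f7\u2713";
        // The same two characters, built from their hex values.
        String built = fromCodePoints(0x00f7, 0x2713);

        System.out.println(literal.equals(built)); // true
        System.out.println(literal.length());      // 2
        // Whether the glyphs show up correctly here depends on your
        // console encoding and fonts, not on Java.
        System.out.println(literal);
    }
}
```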

In HTML, you get at the exotic characters by encoding them as entities such as &divide;&#x2713; to produce ÷ ✓.
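If you generate HTML from Java, you can produce the numeric form of such entities mechanically. The following is only a sketch: the class HtmlEntities and the method toNumericEntities are hypothetical names. It leaves ASCII alone and does not escape &, < or >, which a real HTML writer would also have to handle.

```java
public class HtmlEntities {
    /**
     * Hypothetical helper: replaces every non-ASCII code point with a
     * hexadecimal numeric character reference such as &#x2713;
     */
    public static String toNumericEntities(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.appendCodePoint(cp); // plain ASCII passes through unchanged
            } else {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("6 \u00f7 2 = 3 \u2713"));
        // prints: 6 &#xf7; 2 = 3 &#x2713;
    }
}
```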

Unicode Symbols

There are even codes for:
apple '\uf000' unofficial, private use area
British pound sign £ '\u00a3'
checkmark ✓ '\u2713'
copyright © '\u00a9'
degree ° '\u00b0'
dharma wheel ☸ '\u2638'
division ÷ '\u00f7'
bullet • '\u2022'
euro € '\u20ac'
female ♀ '\u2640'
funeral urn ⚱ '\u26b1'
heart ♥ '\u2665'
bullet (as mathematical operator) ∙ '\u2219'
infinity ∞ '\u221e'
integral ∫ '\u222b'
male ♂ '\u2642'
pi π '\u03c0'
PI Π '\u03a0'
registered trade mark ® '\u00ae'
sun ☀ '\u2600'
telephone ☎ '\u260e'
trademark ™ '\u2122'
This does not mean your fonts will support all these wonders, of course.

In addition there are all kinds of interesting special characters such as: Alphabetic Presentation Forms, APL, Arrows, Bengali, Block Elements, Box Drawing, Braille Patterns, Byzantine Musical Symbols, Combining Diacritical Marks, Combining Half Marks, Combining Marks for Symbols, Control Pictures (icons for control chars), Currency Symbols, Dingbats, Enclosed Alphanumerics, General Punctuation, Geometric Shapes, Halfwidth and Fullwidth Forms, High Surrogates, Ideographic Description Characters, IPA Extensions, Letterlike Symbols, Low Surrogates, Mathematical Alphanumeric Symbols (32 bit Unicode), Mathematical Operators, Mathematical Symbols, Miscellaneous Symbols (astrology, chess, playing cards), Miscellaneous Technical (del, grad, integral), Musical Symbols, Number Forms (e.g. Roman numerals), OCR (Optical Character Recognition: the OCR-A MICR characters used in magnetic ink cheque encoding), Old Italic, Runic, Small Form Variants, Spacing Modifier Letters, Specials, Superscripts and Subscripts, Tags (letters with price tags), Unified Canadian Aboriginal Syllabic and Variation Selectors.

Unicode Arrows

There are also arrows:
← \u2190
↑ \u2191
→ \u2192
↓ \u2193
↔ \u2194
↕ \u2195
↢ \u21a2
↬ \u21ac
↭ \u21ad
↰ \u21b0
↶ \u21b6
⇅ \u21c5
⇎ \u21ce
⇐ \u21d0
⇑ \u21d1
⇒ \u21d2
⇓ \u21d3
⇔ \u21d4
⇕ \u21d5
⇜ \u21dc
There are even more arrows defined in Unicode: 2190-21ff. To use these characters in HTML, you need to code them as &… entities.
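To see the whole arrow block without typing each code by hand, you could walk the range in a loop. This is a sketch only; ArrowBlock and describe are hypothetical names. It prints the Java escape, a numeric HTML reference (one possible entity form, there are named ones too) and the official character name from Character.getName, which needs Java 7 or later.

```java
public class ArrowBlock {
    /** Hypothetical helper: one line of Java escape, HTML reference and name. */
    public static String describe(int codePoint) {
        // "\\u" is a doubled backslash, so the compiler does not treat it as an escape.
        return String.format("\\u%04x &#x%04x; %s",
                codePoint, codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // The Arrows block runs from 2190 to 21ff inclusive.
        for (int cp = 0x2190; cp <= 0x21ff; cp++) {
            System.out.println(describe(cp));
        }
    }
}
```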

Viewing Unicode Glyphs

Nic Fulton of Reuters has written a Java test applet that can display all 64 thousand 16-bit Unicode characters including the Chinese/Korean Han. How many of them actually display on your screen depends on the font handling ability of your browser and operating system, and which fonts you have installed. In Java programs, intractable Unicode characters are represented in the form '\uffff', with four hex digits. Ordinary characters like 'A' are actually 16-bit Unicode too.
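A short sketch of what 16-bit means in practice. The class name CharWidth and the helper charsNeeded are made up for illustration; the Character methods are standard. It shows that 'A' is just the number 65 in a 16-bit char, and that a code point beyond ffff, here the musical treble clef at 1D11E, takes two chars, one high surrogate and one low surrogate.

```java
public class CharWidth {
    /** Hypothetical helper: how many 16-bit chars a code point occupies. */
    public static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);      // 16 bits per char
        System.out.println((int) 'A');           // 65
        System.out.println(charsNeeded('A'));    // 1
        System.out.println(charsNeeded(0x1D11E)); // 2

        // Split the treble clef into its surrogate pair.
        char[] pair = Character.toChars(0x1D11E);
        System.out.println(Integer.toHexString(pair[0])); // d834
        System.out.println(Integer.toHexString(pair[1])); // dd1e
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
    }
}
```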

Creating Unicode Documents

How do you create and edit the various flavours of Unicode documents? You can create them in some specific encoding, then convert them. To write a little utility to do that, read up on encoding and ask the File I/O Amanuensis for sample code. You can use lowly Notepad in Windows NT/W2K/XP, but not earlier Windows versions, to edit existing documents. It is even clever enough to deal with byte order (endian) marks. To get started with a new document, you would have to acquire an almost empty Unicode document. Recent versions of MS Word in Windows NT/W2K/XP/W2K3 also work.
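A conversion utility of the kind mentioned above might look roughly like this. It is a sketch, not the File I/O Amanuensis output: the class Transcode, the method transcode and the command-line argument order are all assumptions. It reads the whole file into memory, so it only suits small files, and bytes that are malformed in the source encoding are silently replaced rather than reported.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Transcode {
    /** Hypothetical helper: decodes bytes with one charset, re-encodes with another. */
    public static byte[] transcode(byte[] data, Charset from, Charset to) {
        return new String(data, from).getBytes(to);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("usage: java Transcode in.txt ISO-8859-1 out.txt UTF-16");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[2]);
        // Charset.forName throws if the encoding name is not supported.
        byte[] converted = transcode(Files.readAllBytes(in),
                Charset.forName(args[1]), Charset.forName(args[3]));
        Files.write(out, converted);
    }
}
```

For large files you would stream through a Reader and Writer instead of holding everything in a byte array.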

Byte Order Marks

There are two different standards: Unicode, which assigns numbers to characters, and UTF, which describes how you encode these numbers in a file. Byte order marks are part of the UTF standard, not the Unicode standard. See more on BOMs (Byte Order Marks).
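You can watch the split between the number and its encoding from Java. In this sketch (BomDemo and hex are hypothetical names) the same letter A is encoded four ways. Java's plain UTF-16 encoder writes a big-endian byte order mark, FE FF, in front of the data; the explicit UTF-16BE and UTF-16LE encoders and the UTF-8 encoder write none.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    /** Hypothetical helper: formats bytes as space-separated upper-case hex. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));    // 41
    }
}
```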

What’s Missing From Unicode?

There are no Unicode glyphs for the following:

Unicode is not concerned with typesetting, just with raw text. In other words, it is about characters (logical letters), not glyphs (how letters are precisely shaped). Unicode has various flavours of digits that look much the same, but they are intended to be used in different contexts.

To typeset, you need separate fonts to handle such variants, with the letters encoded with the same Unicode character. The word processor automatically selects the appropriate variant. I don’t know the mechanism by which a word processor can tell which fonts are related, and which styles and font-weights each supports. Presumably it is encoded somehow in the font files.

To a large extent ligatures are handled outside Unicode by automatically combining Unicode characters, though there are a few ligatures that rate a special Unicode character.
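One of the few ligatures with its own code is fi at fb01, in the Alphabetic Presentation Forms block. As a sketch of how to turn such a ligature back into ordinary letters, you can use compatibility normalization from java.text.Normalizer. The class LigatureDemo and the method expandCompatibility are hypothetical names; note that NFKC also folds other compatibility characters, not just ligatures.

```java
import java.text.Normalizer;

public class LigatureDemo {
    /** Hypothetical helper: replaces compatibility characters, ligatures included, with plain equivalents. */
    public static String expandCompatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String withLigature = "\ufb01nd"; // fi ligature followed by "nd"
        String plain = expandCompatibility(withLigature);

        System.out.println(withLigature.length()); // 3
        System.out.println(plain.length());        // 4
        System.out.println(plain);                 // find
    }
}
```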

Unicode Editors

Where do Unicode files come from? You can create or edit UTF-8 or UTF-16 files with Windows Notepad.

Books

The Unicode 5.0 Standard
hardcover
ISBN13: 978-0-321-48091-0
publisher: Addison-Wesley
published: 2006-11-19
by: The Unicode Consortium
Unicode 5.0 adds the following:
  • Security mechanisms.
  • A standard collation algorithm for various national orderings.
  • A common locale data repository.
  • Improvements to the encoding model for UTF-8.
  • Rigorous stability of case folding.
  • A systematic framework covering combining characters, Unicode strings, line breaking, and segmentation.

You can get the freshest copy of this page from: http://mindprod.com/jgloss/unicode.html