char.txt -- representation of characters (including alternatives) uses: uncbasic.txt (==) higher.txt (mapped_pred) atom.txt (atom_defn) bit.txt (bit_atom_seq) 1. The (original) LOPSTR 2000 Paper Definition of Characters The original LOPSTR 2000 paper gives the following syntax extensions involving characters: "The set of atoms is extended to include symbols consisting of single printable characters in single quotes: 'a', '3'. Note that these are abstract symbols and their association to actual characters is only by interpretation. A character string in single quotes within a sequence is equivalent to the character atoms written separately: (... 'abc' ...) = (... 'a' 'b' 'c' ...) A character string in double quotes is equivalent to a sequence of character atoms: "abc" = ('abc') = ('a' 'b' 'c') A symbol not beginning with ` ' " % or $ is syntactic shorthand for an expression consisting of the atom ` and the symbol as a character sequence: abc = (` "abc") This is useful for higher-order definitions and for defining inequality between symbols, which is not built-in." These notes give a revised representation for characters as well as discuss alternatives. 2. New Representation for Characters In these notes we propose that character symbols, such as 'c' and '+', not be atoms but instead be syntactic short-hand for expressions of the form (`char ( ... )), where is a bit atom, `bit0 or `bit1, and the sequence of 8-bits is the byte code for thep character. Thus we have, for example: 'A' == (`char (`bit0 `bit1 `bit0 `bit0 `bit0 `bit0 `bit0 `bit1)) ! hex 41 'm' == (`char (`bit0 `bit1 `bit1 `bit0 `bit1 `bit1 `bit0 `bit1)) ! hex 6D For a two-byte character code, the bit sequence would have 16 bits. File has already "declared" the two bit atoms. We now declare the character atom as follows (also see ): (atom_defn `char "char" char_atom). It is useful to define the length of the character bit sequence as a dummy sequence of the desired length: (char_bit_seq_=length (%0 %1 %2 %3 %4 %5 %6 %6)). ! bit seqs have length 8 ! Note that an integer value for the length will be defined later, ! once numbers and arithmetic is defined. We can now define the set of (8-bit) characters by the following axioms: (char (%char_atom %8_bit_seq))< (char_atom %char_atom), (bit_atom_seq %8_bit_seq), (char_bit_seq_=length %8_bit_seq). ! old: (== %8_bit_seq (%0 %1 %2 %3 %4 %5 %6 %7)). The last condition forces the bit sequence to be of length 8. (Note that an =length or a numeric length predicate could be used instead, but those predicates are defined in later files: ! example: ! (=length %8_bit_seq (* * * * * * * *)) ! or: ! (eval 8 %8), ! (length %8_bit_seq %8). It is probably best to not have the character set definition depend on a natural number definition.) We also define the set of character strings: (mapped_pred char_str char). (Note that char_str is equivalent to (map char) and subsequent axioms will define other mapped forms such as char* or *:char.) The syntax extensions for character strings 'abc' and "abc" still apply. Recall that a single or double quote character is repeated when it occurs within a string with the same type of quotes: "'""" == ('''"') == ('''' '"') -- sequence of two chars: ' and " The syntax extension for symbols is postponed until file . Normally one would use the character symbols such as 'a', but having these expressions available provides access to the complete character set, including non-printable characters. Presumably, axioms would be defined to encapsulate the system-dependent properties due to character set variations. 3. Alternative Representations for Characters Deciding on a representation for characters has been difficult and it is possible that the proposal in section 2 will be changed. In this section we review other possibilities and comment on their advantages and disadvantages. a. leave printable characters as atoms This has a nice simplicity, but the problem is defining ordering and indexing on the complete set of characters. One could define ordering and indexing for the ASCII printable characters using the following finite_atom_set expression, ! 0---4--- 8---C---0---4---8---C--- (finite_atom_set char (' !"#$%&''()*+,-./0123456789:;<=>?' '@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_' '`abcdefghijklmnopqrstuvwxyz{|}~')). ! 0---4---8---C---0---4---8---C-- ! characters for ASCII codes x20 to x7E From this expression the set of characters (char ) and their ordering (< ) and indexing would be generated. The problem is that non-printable characters are not defined here. Also, many ASCII codes outside this sequence have printable representations on some devices. (Consider European variations of the ASCII character set, for example.) Clearly we need a uniform way to handle the entire character set, including non-printable characters. Another problem would be 16-bit character sets. It is obviously impractical to have distinct atoms for each of the nearly 64K characters, with them having to be defined in a single finite_atom_set expression! b. escape codes for non-printable characters A variation on alternative (a) that provides for support of non-printable characters would be to represent them with various escape codes. A C-like escape code convention such as '\n' for x0A, '\0' for x00, '\xF3' for xF3, and '\\' for backslash (\) could be used. Note that characters would have more than one representation: '\n' = '\x0A', '3' = '\x33', etc. A complete finite_atom_set definition for the complete set of characters could be as follows: (finite_atom_set ('\x00\x01\x02...\x0F' '\x10\x11\x12...\x1F' ... '\xF0\F01\xF2...\xFF')). Escape codes can be useful in character strings. However, my problem with this is that there is something somewhat arbitrary about the various escape codes that one might want and that it is better that this be a defined feature rather than built-in. Thus one would "evaluate" a character string with escape codes in order to determine the actual string of characters. c. have hex digit atom synonyms `'xx for characters An alternative proposal that handles the complete (8-bit) character set is to let the 256 characters be represented by atom symbols of the form `'xx, where xx are hex digits (with letters A-F uppercase). These atom symbols would be synonyms for the corresponding printable character symbols: 'A' = `'41 '+' = `'2B "123" = ('`31 `'32 `'33) The complete set of character atoms could be specified in a single finite_atom_set expression as follows: (finite_atom_set char (`'00 `'01 `'02 ... `'0F `'10 ... `'1F ... ... ... `'F0 ... `'FF)). From this expression ordering and indexing would be defined for the entire character set. A disadvantage with this approach is the added "built-in" complexity of having these 256 distinct atoms. Another problem would be how to handle 16-bit character codes -- it is again impractical to have to define 64K atoms of the form `'xxxx. (An alternative approach might be to let 16-bit characters be represented by pairs of byte atoms, but in that case one might as well use bit sequences as done in our main proposal.) Yet another issue is whether to have uppercase letters for hex digits (which I think look better) or to have lowercase letters (which are easier to type since no shifting needed). Perhaps both cases could be allowed -- thus the following symbols are synonyms: '+' = `'2B = `'2b. But what about mixing cases? Should `'Fd or `'bC be allowed? If this c alternative were adopted, I think I would favor allowing both uppercase and lowercase hex digits, but not mixing case in the same symbol. Overall, however, I don't think this approach is as simple and general purpose (such as for 16-bit characters) as the main proposal where characters are represented by expressions containing the bit sequence.