Idiomatically determine all the characters that can be used for symbols

From Rosetta Code
Idiomatically determine all the characters that can be used for symbols is a draft programming task. It is not yet considered ready to be promoted as a complete task, for reasons that should be found in its talk page.

Idiomatically determine all the characters that can be used for symbols. Here, symbols means things like the names of variables, procedures (i.e., named fragments of programs such as functions, subroutines, and routines), statement labels, events or conditions, and in general anything a computer programmer can choose to name; the list is not exhaustive. Identifiers might be another name for symbols.

The method should find the characters regardless of the hardware architecture that is being used (ASCII, EBCDIC, or other).
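To illustrate why hard-coded code ranges are not portable, here is a minimal Python sketch. Python is chosen only for illustration, cp037 is assumed as a representative EBCDIC code page, and the helper names are hypothetical. It shows that the lowercase letters form one contiguous run in ASCII but not in EBCDIC, so a loop "from a to z by code" would pick up characters outside a-z on an EBCDIC machine.

```python
import string

def code_points(chars, encoding):
    """Byte value of each character under a single-byte encoding."""
    return [ch.encode(encoding)[0] for ch in chars]

def is_contiguous(codes):
    """True if the codes form one unbroken ascending run."""
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))

ascii_codes = code_points(string.ascii_lowercase, "ascii")
# Assumption: cp037 stands in for "EBCDIC"; other EBCDIC code pages exist.
ebcdic_codes = code_points(string.ascii_lowercase, "cp037")

print("ASCII  a-z contiguous:", is_contiguous(ascii_codes))
print("EBCDIC a-z contiguous:", is_contiguous(ebcdic_codes))
```

This is why the entries below ask the language itself which characters it accepts instead of enumerating ranges.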

Task requirements

Display the set of all characters that the language allows in symbols. You may want to mention what hardware architecture is being used and, if applicable, the operating system.

Note that most languages have additional restrictions on which characters can be used for the first character of a variable or statement label, for instance. These types of restrictions needn't be addressed here (but can be mentioned).

AWK

 
# syntax: GAWK -f IDIOMATICALLY_DETERMINE_ALL_THE_CHARACTERS_THAT_CAN_BE_USED_FOR_SYMBOLS.AWK
BEGIN {
    fn = "TEMP.AWK"
    cmd = sprintf("GAWK -f %s 2>NUL",fn)
    for (i=0; i<=255; i++) {
      c = sprintf("%c",i)
      if (c ~ /\x09|\x0D|\x0A|\x20/) { ng++; continue } # tab,CR,LF,space
      (run(c) == 0) ? (ok1 = ok1 c) : (ng1 = ng1 c) # 1st character
      (run("_" c) == 0) ? (ok2 = ok2 c) : (ng2 = ng2 c) # 2nd..nth character
    }
    printf("1st character: %d NG, %d OK %s\n",length(ng1)+ng,length(ok1),ok1)
    printf("2nd..nth char: %d NG, %d OK %s\n",length(ng2)+ng,length(ok2),ok2)
    exit(0)
}
function run(c,  rc) {
    printf("BEGIN{%s+=0}\n",c) >fn
    close(fn)
    rc = system(cmd)
    return(rc)
}
 

output:

1st character: 203 NG, 53 OK ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
2nd..nth char: 193 NG, 63 OK 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

Go

Go allows the underscore, letters, and digits, with "letters" and "digits" defined by Unicode. The first character must be the underscore or a letter. To be exported, the first character must be an upper case letter, again as defined by Unicode.

package main

import (
    "fmt"
    "unicode"
)

func main() {
    fmt.Println("Unicode version: ", unicode.Version)
    fmt.Println()
    fmt.Println("Underscore: _")
    fmt.Println("ASCII digits: 0123456789")
    fmt.Println("ASCII letters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
    showRange("Unicode digits: ", unicode.Digit)
    showRange("Unicode letters: ", unicode.Letter)
}

const Ω = 52

var n int
var ряд, 广度一六 int

func showRange(hdr string, rt *unicode.RangeTable) {
    fmt.Print(hdr)
    n = 0
    r16 := rt.R16
    for r16[0].Hi < 128 {
        r16 = r16[1:]
    }
    for _, rng := range r16 {
        for r := rng.Lo; r <= rng.Hi; r += rng.Stride {
            fmt.Print(string(r))
            n++
            if n == Ω {
                fmt.Println("...")
                return
            }
        }
    }
    fmt.Println()
    for _, rng := range rt.R32 {
        for r := rng.Lo; r <= rng.Hi; r += rng.Stride {
            fmt.Print(string(r))
            n++
            if n == Ω {
                fmt.Println("...")
                return
            }
        }
    }
    fmt.Println()
}
Output:
Unicode version:  7.0.0

Underscore:      _
ASCII digits:    0123456789
ASCII letters:   ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Unicode digits:  ٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧...
Unicode letters: ªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñ...

J

J is defined in terms of ASCII, but that would not prevent it from being ported to other environments. In any case, we can use J's parser to determine whether a specific character combination is a single, legal word:

   a.#~1=#@;: ::0:"1 'b',.a.,.'c'
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

Here, a. is the set of characters we are testing. We prefix each of these with an arbitrary letter (b) and suffix each with another arbitrary letter (c), then count how many tokens the parser forms from the result. If the token count is 1, then that character is a legal word-forming character.

Of course, we also only need to do this once. Once we have a set of these characters, it's faster and easier to use a set membership test on the characters themselves than on the expression which generates them.

Java

Works with: Java version 8
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

public class Test {
    public static void main(String[] args) throws Exception {
        print("Java Identifier start: ", 0, 0x10FFFF, 72,
                Character::isJavaIdentifierStart, "%c");

        print("Java Identifier part: ", 0, 0x10FFFF, 25,
                Character::isJavaIdentifierPart, "[%d]");

        print("Identifier ignorable: ", 0, 0x10FFFF, 25,
                Character::isIdentifierIgnorable, "[%d]");

        print("Unicode Identifier start: ", 0, 0x10FFFF, 72,
                Character::isUnicodeIdentifierStart, "%c");

        print("Unicode Identifier part : ", 0, 0x10FFFF, 25,
                Character::isUnicodeIdentifierPart, "[%d]");
    }

    static void print(String msg, int start, int end, int limit,
            IntPredicate p, String fmt) {

        System.out.print(msg);
        IntStream.rangeClosed(start, end)
                .filter(p)
                .limit(limit)
                .forEach(cp -> System.out.printf(fmt, cp));
        System.out.println("...");
    }
}
Output:
Java Identifier start: $ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz¢£¤¥ªµºÀÁÂÃÄÅÆÇÈÉÊ...
Java Identifier part: [0][1][2][3][4][5][6][7][8][14][15][16][17][18][19][20][21][22][23][24][25][26][27][36][48]...
Java Identifier ignorable: [0][1][2][3][4][5][6][7][8][14][15][16][17][18][19][20][21][22][23][24][25][26][27][127][128]...
Unicode Identifier start: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐ...
Unicode Identifier part: [0][1][2][3][4][5][6][7][8][14][15][16][17][18][19][20][21][22][23][24][25][26][27][48][49]...

jq

jq identifiers

Excluding key names from consideration, in jq 1.4 the set of characters that can be used in jq identifiers corresponds to the regex: [A-Za-z0-9$_]. Thus, assuming the availability of test/1 as a builtin, the test in jq for a valid identifier character is: test("[A-Za-z0-9$_]").

To generate a string of such characters idiomatically:

[range(0;128) | [.] | implode | select(test("[A-Za-z0-9$_]"))] | add

jq 1.5 also allows ":" as a joining character in the form "module::name".


JSON key names

Any JSON string can be used as a key. Accordingly, some characters must be entered as escaped character sequences, e.g. \u0000 for NUL, \\ for backslash, etc. Thus any Unicode character except for the control characters can appear in a jq key. Therefore, assuming the availability in jq of the test/1 builtin, the test in jq for whether a character can appear literally in a jq identifier or key is:

test("[^\u0000-\u0007F]")

Symbols

The following function screens for characters by "\p" class:

def is_character(class):
  test( "\\p{" + class + "}" );

For example, to test whether a character is a Unicode letter, symbol or numeric character:

is_character("L") or is_character("S") or is_character("N")

An efficient way to count the number of Unicode characters within a character class is to use the technique illustrated by the following function:

def count(class; m; n):
  reduce (range(m;n) | [.] | implode | select( test( "\\p{" + class + "}" ))) as $i
    (0; . + 1);

For example, the number of Unicode "symbol" characters can be obtained by evaluating:

count("S"; 0; 1114112)

The result is 3958.
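For comparison, the same counting technique can be sketched in Python with the standard unicodedata module. This is an illustration rather than part of the jq solution, and the function name count_category is hypothetical. The total depends on the Unicode version bundled with the interpreter, so it need not agree with the figure above.

```python
import sys
import unicodedata

def count_category(prefix, start=0, stop=sys.maxunicode + 1):
    """Count code points in [start, stop) whose general category starts with `prefix`."""
    return sum(
        1
        for cp in range(start, stop)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )

if __name__ == "__main__":
    # Report the Unicode version, since the count varies between versions.
    print("Unicode version:", unicodedata.unidata_version)
    print("Symbol characters:", count_category("S"))
```

Restricting the range keeps the check cheap; for instance, count_category("S", 0, 128) counts only the ASCII symbol characters.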

Kotlin

Translation of: Java


According to the Kotlin grammar, the rules regarding which characters can appear in symbols (or identifiers as we usually call them) are the same as in Java, namely:

1. An identifier is a sequence of any number of Unicode letters or digits, other than a reserved word.

2. Identifiers are case sensitive.

3. The first character must be a letter, an underscore or a $ sign. Subsequent characters can also include digits and certain control characters, though the latter are ignored for identifier matching purposes.

However, in practice, identifiers which include a $ symbol or control characters don't compile unless (in the case of $) the entire identifier is enclosed in back-ticks. This device also allows a reserved word, or many otherwise prohibited Unicode characters including spaces and dashes, to be used in an identifier.

A Kotlin label name is a valid identifier followed by an @ symbol and an annotation name is an identifier preceded by an @ symbol.

// version 1.1.4-3

typealias CharPredicate = (Char) -> Boolean

fun printChars(msg: String, start: Int, end: Int, limit: Int, p: CharPredicate, asInt: Boolean) {
    print(msg)
    (start until end).map { it.toChar() }
                     .filter { p(it) }
                     .take(limit)
                     .forEach { print(if (asInt) "[${it.toInt()}]" else it) }
    println("...")
}

fun main(args: Array<String>) {
    printChars("Kotlin Identifier start: ", 0, 0x10FFFF, 72,
                Char::isJavaIdentifierStart, false)

    printChars("Kotlin Identifier part: ", 0, 0x10FFFF, 25,
                Character::isJavaIdentifierPart, true)

    printChars("Kotlin Identifier ignorable: ", 0, 0x10FFFF, 25,
                Character::isIdentifierIgnorable, true)
}
Output:
Kotlin Identifier start:     $ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz¢£¤¥ªµºÀÁÂÃÄÅÆÇÈÉÊ...
Kotlin Identifier part:      [0][1][2][3][4][5][6][7][8][14][15][16][17][18][19][20][21][22][23][24][25][26][27][36][48]...
Kotlin Identifier ignorable: [0][1][2][3][4][5][6][7][8][14][15][16][17][18][19][20][21][22][23][24][25][26][27][127][128]...

ooRexx

Although this program does not use any feature that is not in Classic Rexx, it is included here to show what characters are valid for symbols in ooRexx.

/*REXX program determines what characters are valid for REXX symbols.*/
/* copied from REXX version 2 */
Parse Version v
Say v
symbol_characters=''                          /* start with no chars */
do j=0 To 255                                 /* loop through all the chars.*/
  c=d2c(j)                                    /* convert number to character*/
  if datatype(c,'S') then                     /* Symbol char */
    symbol_characters=symbol_characters || c  /* add to list. */
  end
say 'symbol characters:' symbol_characters    /*display all */
Output:
REXX-ooRexx_4.2.0(MT)_32-bit 6.04 22 Feb 2014
symbol characters: !.0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz 

PARI/GP

The only symbols that can be used in variable names (including function names as a special case) are a-z, A-Z, 0-9, and the underscore. Additionally, the first character must be a letter. (That is, they must match this regex: [a-zA-Z][a-zA-Z0-9_]*.)

v=concat(concat([48..57],[65..90]),concat([97..122],95));
apply(Strchr,v)
Output:
%1 = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_"]
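Instead of listing code points by hand, the character set can also be derived from the regex quoted above. The following Python sketch (not GP code) does that; it only exercises the regex, not the GP interpreter itself, and the names NAME, allowed_first and allowed_later are hypothetical.

```python
import re

# The regex quoted above for PARI/GP variable names.
NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def allowed_first(limit=256):
    """Characters that the regex accepts as a complete one-character name."""
    return "".join(chr(c) for c in range(limit) if NAME.fullmatch(chr(c)))

def allowed_later(limit=256):
    """Characters accepted after the first position.

    A known-good first character is prefixed so that only the
    continuation part of the regex is being tested.
    """
    return "".join(chr(c) for c in range(limit) if NAME.fullmatch("a" + chr(c)))

if __name__ == "__main__":
    print("first:", allowed_first())
    print("later:", allowed_later())
```

The two helpers separate the first-character rule from the rule for later characters, which the hand-written list above merges into a single set.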

Perl 6

Any Unicode character or combination of characters can be used for symbols in Perl 6. Here are some counting rods and some cuneiform:

sub postfix:<𒋦>($n) { say "$n trilobites" }
 
sub term:<𝍧> { unival('𝍧') }
 
𝍧𒋦
Output:
8 trilobites

Of course, as in other languages, most of the characters you'll typically see in names are going to be alphanumerics from ASCII (or maybe Unicode), but that's a convention, not a limitation, due to the syntactic category notation demonstrated above, which can introduce any sequence of characters as a term or operator.

Actually, the above is a slight prevarication. The syntactic category notation does not allow you to use whitespace in the definition of a new symbol. But that leaves many more characters allowed than not allowed. Hence, it is much easier to enumerate the characters that cannot be used in symbols:

say .fmt("%4x"),"\t", uniname($_)
    if uniprop($_,'Z')
        for 0..0x1ffff;
Output:
  20	SPACE
  a0	NO-BREAK SPACE
1680	OGHAM SPACE MARK
2000	EN QUAD
2001	EM QUAD
2002	EN SPACE
2003	EM SPACE
2004	THREE-PER-EM SPACE
2005	FOUR-PER-EM SPACE
2006	SIX-PER-EM SPACE
2007	FIGURE SPACE
2008	PUNCTUATION SPACE
2009	THIN SPACE
200a	HAIR SPACE
2028	LINE SEPARATOR
2029	PARAGRAPH SEPARATOR
202f	NARROW NO-BREAK SPACE
205f	MEDIUM MATHEMATICAL SPACE
3000	IDEOGRAPHIC SPACE

We enforce the whitespace restriction to prevent insanity in the readers of programs. That being said, even the whitespace restriction is arbitrary, and can be bypassed by deriving a new grammar and switching to it. We view all other languages as dialects of Perl 6, even the insane ones. :-)

Python

See String class isidentifier.
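A minimal sketch of that approach follows. It is not part of the original entry, and the helper names are illustrative. It probes every code point with str.isidentifier(); testing a later character by prefixing an underscore is an assumption borrowed from the AWK entry above.

```python
import sys

def identifier_start_chars(limit=sys.maxunicode + 1):
    """Characters that are valid as the first character of an identifier."""
    return [chr(cp) for cp in range(limit) if chr(cp).isidentifier()]

def identifier_part_chars(limit=sys.maxunicode + 1):
    """Characters that are valid after the first character.

    Prefixing "_" (itself a valid start) isolates the continuation rule.
    """
    return [chr(cp) for cp in range(limit) if ("_" + chr(cp)).isidentifier()]

if __name__ == "__main__":
    start = identifier_start_chars()
    part = identifier_part_chars()
    # Only the first 72 characters of each set are shown.
    print("start:", len(start), "".join(start[:72]) + "...")
    print("part: ", len(part), "".join(part[:72]) + "...")
```

Note that str.isidentifier() does not reject reserved words; keyword.iskeyword() can be used for that check.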

Racket

Symbols in the Racket Guide states that:

Any string (i.e., any character sequence) can be supplied to string->symbol to obtain the corresponding symbol.

Reading Symbols defines what symbols can be "read" without needing quoting.

The documentation for integer->char says that a character must lie in the ranges: 0 to 55295, and 57344 to 1114111.

That's too much to be printing out here... call (main) yourself, at home.

#lang racket
;; Symbols that don't need to be specially quoted:
(printf "~s~%" '(a a-z 3rd ...---... .hidden-files-look-like-this))

;; Symbols that do need to be specially quoted:
(define bar-sym-list
  `(|3|
    |i have a space|
    |i've got a quote in me|
    |i'm not a "dot on my own", but my neighbour is!|
    |.|
    ,(string->symbol "\u03bb")
    ,(string->symbol "my characters aren't even mapped in unicode \U10e443")))
(printf "~s~%" bar-sym-list)
(printf "~a~%" bar-sym-list)

(define (main)
  (for
      ((c (sequence-map
           integer->char
           (in-sequences (in-range 0 (add1 55295))
                         (in-range 57344 (add1 1114111)))))
       (i (in-naturals 1)))
    (when (zero? (modulo i 80)) (newline))
    (display (list->string (list c)))))
 
Output:
(a a-z 3rd ...---... .hidden-files-look-like-this)
(|3| |i have a space| |i've got a quote in me| |i'm not a "dot on my own", but my neighbour is!| |.| λ |my characters aren't even mapped in unicode 􎑃|)
(3 i have a space i've got a quote in me i'm not a "dot on my own", but my neighbour is! . λ my characters aren't even mapped in unicode 􎑃)

The output of (main) is massive, and probably not dissimilar to Tcl's (anyone want to compare?)

REXX

version 1

/*REXX program determines what  characters  are valid for REXX symbols. */
@=                                    /*set symbol characters " " */
   do j=0 for 2**8                    /*traipse through all the chars. */
   _=d2c(j)                           /*convert decimal number to char.*/
   if datatype(_,'S') then @=@ || _   /*Symbol char? Then add to list.*/
   end /*j*/                          /* [↑] put some chars into a list*/

say ' symbol characters: ' @          /*display all symbol characters.*/
                                      /*stick a fork in it, we're done.*/

Programming note:   REXX allows any symbol to begin a (statement) label, but variables can't begin with a period (.) or a numeric digit.

All examples below were executed on an (ASCII) PC using Windows/XP and Windows/7 with code page 437 in a DOS window.

Using PC/REXX
Using Personal REXX
Using Regina (versions 3.2 ───► 3.82)
output

     symbol characters:  !#$.0123456789?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

Using R4
output

     symbol characters:  !#$.0123456789?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£áíóúñÑ╡╢╖─╞╟╨╤╥╙╘╒╓╫╪▐αßΓπΣσµτΦΘΩδ∞φ

Using ROO
output

     symbol characters:  !#$.0123456789?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£áíóúñÑ╡╢╖╞╟╨╤╥╙╘╒╓╫╪▐αßΓπΣσµτΦΘΩδ∞φ

version 2 ooRexx compatible

Because version 1 does not work correctly with ooRexx - showing this error message:

     2 *-* @
Error 13 running D:\v1.rex line 2:  Invalid character in program
Error 13.1:  Incorrect character in program "@" ('40'X)

I've added version 2, which should work correctly for all Rexx interpreters and compilers.

/*REXX program determines what characters are valid for REXX symbols.*/
/* version 1 adapted for general acceptance */
Parse Version v
Say v
symbol_characters=''                          /* start with no chars */
do j=0 To 255                                 /* loop through all the chars.*/
  c=d2c(j)                                    /* convert number to character*/
  if datatype(c,'S') then                     /* Symbol char */
    symbol_characters=symbol_characters || c  /* add to list. */
  end
say 'symbol characters:' symbol_characters    /*display all */
 
Output:
for some interpreters

Note that $#@ are not valid symbol characters for ooRexx.

REXX-ooRexx_4.2.0(MT)_32-bit 6.04 22 Feb 2014
symbol characters: !.0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

REXX-Regina_3.8.2(MT) 5.00 22 Jun 2014
symbol characters: !#$.0123456789?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz

Tcl

Tcl permits any character to be used in a variable or command name (subject to the restriction that :: is a namespace separator and, for variables only, a (…) sequence is an array reference). The set of characters that can be used after $ is more restricted, excluding many non-letter-like symbols, but still large. It is recommended practice to only use ASCII characters for variable names as this makes scripts more resistant to the majority of encoding problems when transporting them between systems, but the language does not itself impose such a restriction.

for {set c 0;set printed 0;set special {}} {$c <= 0xffff} {incr c} {
    set ch [format "%c" $c]
    set v "_${ch}_"
    #puts "testing variable named $v"
    if {[catch {set $v $c; set $v} msg] || $msg ne $c} {
        puts [format "\\u%04x illegal in names" $c]
        incr printed
    } elseif {[catch {subst $$v} msg] == 0 && $msg eq $c} {
        lappend special $ch
    }
}
if {$printed == 0} {
    puts "All Unicode characters legal in names"
}
puts "Characters legal after \$: $special"
Output:

Only the first 256 characters are displayed:

All Unicode characters legal in names
Characters legal after $: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ İ ı IJ ij Ĵ ĵ Ķ ķ ĸ Ĺ Ł ł Ń ń Ņ ņ Ň ň ʼn Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ş š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź ƪ Ƶ ƺ ǀ ǁ ǂ ǃ DŽ Dž dž LJ Lj lj NJ Nj nj Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ ǰ DZ Dz dz Ǵ ǵ Ƕ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ ...

zkl

zkl only supports ASCII, although other character sets might be finessed.

[0..255].filter(fcn(n){
   try{ Compiler.Compiler.compileText("var "+n.text) }
   catch{ False }
}).apply("text").concat()
Output:
<compiler noise>
;ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

This code works by compiling "var <char>". Since "var ;" is valid syntax (dead code), ";" is a false positive. We could also use "fcn <char>{}" but "fcn {}" is lambda syntax, so space would be a false positive. "_" is excluded because it is not a valid variable name on its own, although it can appear anywhere in a multi-character name.
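The same compile-and-see technique, false positives included, can be sketched in Python (shown for comparison only; the helper names compiles and probe and the two templates are assumptions of this sketch, not zkl features). Each candidate character is spliced into a tiny assignment statement and handed to the built-in compile().

```python
def compiles(source):
    """Return True if Python accepts `source` as a complete program."""
    try:
        compile(source, "<probe>", "exec")
        return True
    except (SyntaxError, ValueError):
        # ValueError covers sources that some Python versions reject
        # outright, e.g. those containing a NUL character.
        return False

def probe(template, limit=128):
    """Characters c for which template.format(c) compiles."""
    return "".join(
        chr(cp) for cp in range(limit) if compiles(template.format(chr(cp)))
    )

if __name__ == "__main__":
    print("first character:", repr(probe("{} = 0")))
    print("later character:", repr(probe("_{} = 0")))
```

As with ";" in zkl, the probe reports characters that merely happen to compile: "#" passes because "# = 0" is just a comment, and "," passes as a later character because "_, = 0" parses as tuple unpacking. Any such probe needs its results screened for cases like these.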