<syntaxhighlight lang="awk"># usage: gawk -f Idiomatically_determine_all_the_characters_that_can_be_used_for_symbols.awk
function is_valid_identifier(id, rc) {
length(bad2), length(good2), good2)
If the purpose of this task is to determine what can be used as an identifier, then in F# the answer is anything, so long as you enclose it in double backticks:
<syntaxhighlight lang="fsharp">
let ``+`` = 5
printfn "%d" ``+``
<syntaxhighlight lang="factor">USING: parser see ;
\ scan-word-name see</syntaxhighlight>
or of length 2 starting with the underscore.
<syntaxhighlight lang="go">package main
import (
_, _ = fmt.Println("Valid follow:", validFollow.String())
_, _ = fmt.Println("Only follow:", validOnlyFollow.String())
According to the specification we may give predicates for valid symbols and identifiers in Haskell:
<syntaxhighlight lang="haskell">import Data.Char
-- predicate for valid symbol
, "else", "foreign", "if", "import", "in", "infix"
, "infixl", "infixr", "instance", "let", "module"
, "newtype", "of", "then", "type", "where", "_"</syntaxhighlight>
J is defined in terms of ASCII, but that would not prevent it from being ported to other environments. But we can still use J's parser to determine if a specific character combination is a single, legal word:
<syntaxhighlight lang="j"> a.#~1=#@;: ::0:"1 'b',.a.,.'c'
Here, <code>a.</code> is the set of characters we are testing. We prefix each of these with an arbitrary letter, and suffix each with an arbitrary character and then try counting how many parsed tokens are formed by the result. If the token count is 1, then that character was a legal word-forming character.
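The probing idea described above (wrap each candidate character between two known-good characters and ask the language whether the result is still a single word) can be sketched outside J as well. The following is only an illustration in Python, which is not one of the languages discussed here; it assumes Python's built-in <code>str.isidentifier</code> as a stand-in for J's parser, and the helper name <code>word_forming_chars</code> is hypothetical.

```python
def word_forming_chars(candidates):
    """Keep each candidate c for which 'b' + c + 'c' is still one identifier.

    Assumption: str.isidentifier plays the role that J's word parser
    plays in the J expression; the rules tested are Python's, not J's.
    """
    return [c for c in candidates if ("b" + c + "c").isidentifier()]


# Probe the 128 ASCII characters, in code point order.
ascii_chars = [chr(i) for i in range(128)]
legal = "".join(word_forming_chars(ascii_chars))
print(legal)
```

Because the candidate sits in the middle of the probe string, this reports characters that may continue a word, not those that may start one.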
{{works with|Java|8}}
<syntaxhighlight lang="java">import java.util.function.IntPredicate;
<pre>Java Identifier start: $ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz¢£¤¥ªµºÀÁÂÃÄÅÆÇÈÉÊ...
To generate a string of such characters idiomatically:
<syntaxhighlight lang="jq">[range(0;128) | [.] | implode | select(test("[A-Za-z0-9$_]"))] | add</syntaxhighlight>
jq 1.5 also allows ":" as a joining character in the form "module::name".
Therefore, assuming the availability in jq of the test/1 builtin, the test
in jq for whether a character can appear literally in a jq identifier or key is:
<syntaxhighlight lang="jq">test("[^\u0000-\u007F]")</syntaxhighlight>
The following function screens for characters by "\p" class:
<syntaxhighlight lang="jq">def is_character(class):
test( "\\p{" + class + "}" );</syntaxhighlight>
For example, to test whether a character is a Unicode letter, symbol or numeric character:
<syntaxhighlight lang="jq">is_character("L") or is_character("S") or is_character("N")</syntaxhighlight>
An efficient way to count the number of Unicode characters within a character class is
to use the technique illustrated by the following function:
<syntaxhighlight lang="jq">def count(class; m; n):
reduce (range(m;n) | [.] | implode | select( test( "\\p{" + class + "}" ))) as $i
(0; . + 1);</syntaxhighlight>
For example the number of Unicode "symbol" characters can be obtained by evaluating:
<syntaxhighlight lang="jq">count("S"; 0; 1114112)</syntaxhighlight>
The result is 3958.
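For comparison, the same counting technique can be sketched in Python using the standard <code>unicodedata</code> module. This is only an illustration: the function name <code>count_in_class</code> is hypothetical, and the total for the full range depends on the Unicode version bundled with the interpreter, so it need not equal the figure reported for jq.

```python
import unicodedata


def count_in_class(major, start, stop):
    """Count code points in range(start, stop) whose Unicode general
    category begins with `major` (for example "S" for symbols)."""
    return sum(
        1
        for cp in range(start, stop)
        if unicodedata.category(chr(cp)).startswith(major)
    )


# Symbols within ASCII only: $ + < = > ^ ` | ~
ascii_symbols = count_in_class("S", 0, 128)
print(ascii_symbols)

# Symbols across the whole code space; the value varies with the
# Unicode database version, so no expected figure is given here.
print(count_in_class("S", 0, 0x110000))
```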
For example, x2 is a valid identifier, but 2x is not: it is interpreted as 2 times the identifier x. In Julia, the Symbol() function turns a string into a symbolic token. So, for example:
<syntaxhighlight lang="julia">
for i in 1:0x200000 - 1
Symbol("x" * Char(i))
When run, this loop completes without error for every code point below 0x200000; an error occurs only at the character numbered 0x200000.
A Kotlin label name is a valid identifier followed by an @ symbol and an annotation name is an identifier preceded by an @ symbol.
<syntaxhighlight lang="scala">// version 1.1.4-3
typealias CharPredicate = (Char) -> Boolean
printChars("Kotlin Identifier ignorable: ", 0, 0x10FFFF, 25,
Character::isIdentifierIgnorable, true)
From the 5.4 reference manual: "Names (also called identifiers) in Lua can be any string of Latin letters, Arabic-Indic digits, and underscores, not beginning with a digit and not being a reserved word."
<syntaxhighlight lang="lua">function isValidIdentifier(id)
local reserved = {
["and"]=true, ["break"]=true, ["do"]=true, ["end"]=true, ["else"]=true, ["elseif"]=true, ["end"]=true,
print("Valid First Characters: " .. table.concat(vfc))
print("Valid Subsequent Characters: " .. table.concat(vsc))</syntaxhighlight>
<pre>Valid First Characters: ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">chars = Characters[FromCharacterCode[Range[0, 1114111]]];
out = Reap[Do[
If[Quiet[Length[Symbol[c]] == 0],
{c, chars}
]][[2, 1]];
Print["Possible 2nd-nth characters: ", out // Length]</syntaxhighlight>
In Wolfram Language almost all of the 1114112 defined characters can be used in variable and function names. Because over a million characters are allowed, the program prints the length of the list <code>out</code> rather than the characters themselves:
As regards identifiers, there exists a general rule which describes how they can be formed. For this rule, the following program prints the allowed starting characters and the allowed characters:
<syntaxhighlight lang="nim">import sequtils, strutils
echo "Allowed starting characters for identifiers:"
echo ""
echo "Allowed characters in identifiers:"
echo toSeq(IdentChars).join()</syntaxhighlight>
But Nim is a lot more flexible and allows using Unicode symbols in identifiers provided these are letters and digits. Thus, the following program is valid:
<syntaxhighlight lang="nim">var à⁷ = 3
echo à⁷</syntaxhighlight>
Using escape character <code>`</code>, it is possible to override the rules and to include any character in an identifier and even to use a keyword as identifier. Here is an example of the possibilities:
<syntaxhighlight lang="nim">var `const`= 3
echo `const`
var `1` = 2
echo `1`
Although this program does not use any feature that is not in Classic Rexx,
it is included here to show what characters are valid for symbols in ooRexx.
<syntaxhighlight lang="oorexx">/*REXX program determines what characters are valid for REXX symbols.*/
/* copied from REXX version 2 */
Parse Version v
symbol_characters=symbol_characters || c /* add to list. */
say 'symbol characters:' symbol_characters /*display all */</syntaxhighlight>
<pre>REXX-ooRexx_4.2.0(MT)_32-bit 6.04 22 Feb 2014
The only symbols that can be used in variable names (including function names as a special case) are a-z, A-Z, 0-9, and the underscore. Additionally, the first character must be a letter. (That is, they must match this regex: <code>[a-zA-Z][a-zA-Z0-9_]*</code>.)
<syntaxhighlight lang="parigp">v=concat(concat([48..57],[65..90]),concat([97..122],95));
<pre>%1 = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "_"]</pre>
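The regular expression quoted above for PARI/GP names can be checked mechanically. The following Python sketch applies that pattern to a few sample strings; the helper <code>is_valid_name</code> and the sample list are hypothetical, and the sketch tests only the character rule, not whether a name collides with a built-in GP function.

```python
import re

# The pattern quoted in the text: a letter, then letters, digits or underscores.
NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_valid_name(name):
    """Return True when the whole string matches the quoted pattern."""
    return NAME_RE.fullmatch(name) is not None


samples = ["x", "x_1", "Foo9", "_x", "9lives", "a-b", ""]
results = {s: is_valid_name(s) for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
```

Using <code>fullmatch</code> rather than <code>match</code> matters here: <code>match</code> would accept "a-b" because the prefix "a" satisfies the pattern.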
<syntaxhighlight lang="perl"># When not using the 'use utf8' pragma, any word character in the ASCII range is allowed.
# the loop below returns: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
for $i (0..0x7f) {
$c = chr($_);
print $c if $c =~ /\p{Word}/;
Translation of AWK, extended with separate handling of ANSI and UTF-8.
<!--<syntaxhighlight lang="phix">(notonline)-->
<span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- file i/o, system_exec, \t and \r chars</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">run</span><span style="color: #0000FF;">(</span><span style="color: #004080;">string</span> <span style="color: #000000;">ident</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"utf8 characters: \n===============\n"</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"bad:%,d, good:%,d\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">ng8</span><span style="color: #0000FF;">,</span><span style="color: #000000;">ok8</span><span style="color: #0000FF;">})</span>
<syntaxhighlight lang="quackery">[ $ "0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrS"
$ QsTtUuVvWwXxYyZz()[]{}<>~=+-*/^\|_.,:;?!'"`%@&#$Q
join ] constant is tokenchars ( --> $ )
[ i^ validtoken if [ i^ emit ] ] ] is alltokens ( --> )
That's too much to be printing out here... call <code>(main)</code> yourself, at home.
<syntaxhighlight lang="racket">#lang racket
;; Symbols that don't need to be specially quoted:
(printf "~s~%" '(a a-z 3rd ...---... .hidden-files-look-like-this))
(when (zero? (modulo i 80)) (newline))
(display (list->string (list c)))))
(formerly Perl 6)
Any Unicode character or combination of characters can be used for symbols in Raku. Here's some counting rods and some cuneiform:
<syntaxhighlight lang="raku" line>sub postfix:<𒋦>($n) { say "$n trilobites" }
sub term:<𝍧> { unival('𝍧') }
<pre>8 trilobites</pre>
And here is a Zalgo-text symbol:
<syntaxhighlight lang="raku" line>sub Z̧̔ͩ͌͑̉̎A̢̲̙̮̹̮͍̎L̔ͧ́͆G̰̬͎͔̱̅ͣͫO͙̔ͣ̈́̈̽̎ͣ ($n) { say "$n COMES" }
Z̧̔ͩ͌͑̉̎A̢̲̙̮̹̮͍̎L̔ͧ́͆G̰̬͎͔̱̅ͣͫO͙̔ͣ̈́̈̽̎ͣ 'HE'</syntaxhighlight>
<pre>HE COMES</pre>
Actually, the above is a slight prevarication. The syntactic category notation does not allow you to use whitespace in the definition of a new symbol. But that leaves many more characters allowed than not allowed. Hence, it is much easier to enumerate the characters that <em>cannot</em> be used in symbols:
<syntaxhighlight lang="raku" line>say .fmt("%4x"),"\t", uniname($_)
if uniprop($_,'Z')
for 0..0x1ffff;</syntaxhighlight>
===version 1===
<syntaxhighlight lang="rexx">/*REXX program determines what characters are valid for REXX symbols. */
@= /*set symbol characters " " */
do j=0 for 2**8 /*traipse through all the chars. */
say ' symbol characters: ' @ /*display all symbol characters.*/
/*stick a fork in it, we're done.*/</syntaxhighlight>
Programming note: &nbsp; REXX allows any symbol to begin a (statement) label, but variables can't begin with a period ('''.''') or a numeric digit.
<br><br>All examples below were executed on an (ASCII) PC using Windows/XP and Windows/7 with code page 437 in a DOS window.
I've added version 2, which should work correctly for all Rexx interpreters and compilers.
<syntaxhighlight lang="rexx">/*REXX program determines what characters are valid for REXX symbols.*/
/* version 1 adapted for general acceptance */
Parse Version v
say 'symbol characters:' symbol_characters /*display all */
{{out}} for some interpreters
{{Out}}Best seen running in your browser, either via ScalaFiddle (ES aka JavaScript, non-JVM) or Scastie (remote JVM).
<syntaxhighlight lang="scala">object IdiomaticallyDetermineSymbols extends App {
private def print(msg: String, limit: Int, p: Int => Boolean, fmt: String) =
print("Unicode Identifier part : ", 25, cp => Character.isUnicodeIdentifierPart(cp), "[%d]")
Tcl permits ''any'' character to be used in a variable or command name (subject to the restriction that <code>::</code> is a namespace separator and, for variables only, a <code>(…)</code> sequence is an array reference). The set of characters that can be used after <code>$</code> is more restricted, excluding many non-letter-like symbols, but still large. It is ''recommended practice'' to only use ASCII characters for variable names as this makes scripts more resistant to the majority of encoding problems when transporting them between systems, but the language does not itself impose such a restriction.
<syntaxhighlight lang="tcl">for {set c 0;set printed 0;set special {}} {$c <= 0xffff} {incr c} {
set ch [format "%c" $c]
set v "_${ch}_"
puts "All Unicode characters legal in names"
puts "Characters legal after \$: $special"</syntaxhighlight>
Only the first 256 characters are displayed:
Identifiers which begin with underscores can only be used as instance field names (one underscore) or static field names (two or more underscores).
<syntaxhighlight lang="ecmascript">for (i in 97..122) System.write(String.fromByte(i))
for (i in 65..90) System.write(String.fromByte(i))
Paraphrasing code from the compiler's parser:
<syntaxhighlight lang="xpl0">char C, C1;
[Text(0, "First character set: ");
for C:= 0 to 255 do
zkl only supports ASCII, although other character sets might be finessed.
<syntaxhighlight lang="zkl">[0..255].filter(fcn(n){
try{ Compiler.Compiler.compileText("var "+n.text) }
catch{ False }
