Special characters: Difference between revisions

From Rosetta Code
(Ada solution added)
(added ocaml escape sequences)

Revision as of 09:35, 11 August 2009

Task
Special characters
You are encouraged to solve this task according to the task description, using any language you may know.

List the special characters and escape sequences in the language.

See also: Quotes

Ada

There are no escape sequences in character literals. Any character supported by the source encoding is allowed. The only escape sequence in string literals is "" (a doubled double quotation mark), which denotes ". When characters need to be specified by their code positions (in Unicode), this is done using the 'Val attribute:
<lang Ada>
with Ada.Text_IO;  use Ada.Text_IO;

procedure Test is
begin
   Put ("Quote """ & ''' & """" & Character'Val (10));
end Test;
</lang>
Sample output:

Quote "'"

Note that character and string literals serve all character and string types. For example, with Wide_Wide characters (32-bit) and strings:
<lang Ada>
with Ada.Wide_Wide_Text_IO;  use Ada.Wide_Wide_Text_IO;

procedure Test is
begin
   Put ("Unicode """ & ''' & """" & Wide_Wide_Character'Val (10));
end Test;
</lang>

ALGOL 68

ALGOL 68 has several built-in character constants. The following characters are (respectively) the representations of TRUE and FALSE, the blank character ".", the character displayed when a number cannot be printed in the width provided, and the null character indicating the end of characters in a BYTES array.

printf(($"flip:"g"!"l$,flip));
printf(($"flop:"g"!"l$,flop));
printf(($"blank:"g"!"l$,blank));
printf(($"error char:"g"!"l$,error char));
printf(($"null character:"g"!"l$,null character))

Output:

flip:T!
flop:F!
blank: !
error char:*!
null character:

To handle the output movement to (and input movement from) a device ALGOL 68 has the following four positioning procedures:

print(("new page:",new page));
print(("new line:",new line));
print(("space:",space));
print(("backspace:",backspace))

These procedures may not all be supported on a particular device.

If a particular device (CHANNEL) is set possible, then there are three built-in procedures that allow movement about this device.

  • set char number - set the position in the current line.
  • reset - move to the first character of the first line of the first page. For example a home or tape rewind.
  • set - allows the movement to selected page, line and character.

ALGOL 68 pre-dates the current ASCII standard, and hence supports many non-ASCII characters. Moreover, ALGOL 68 had to work on hardware with 6-bit bytes, so it had to be possible to write the same ALGOL 68 code strictly in upper case. Here are the special characters together with their upper-case alternatives (referred to as "worthy characters").

Character  ASCII   Worthy
"₁₀"       \       E
"≥"        >=      GE
"≤"        <=      LE
"≠"        /= ~=   NE
"¢"        #       CO
"⌊"                LWB
"⌈"                UPB
"⎕"                ELEM
"¬"        ~       NOT
"÷"        %       OVER
"×"        *       TIMES
"⊥"                I
"°"                NIL
"↑"        **      UP
"↓"                DOWN
"∨"                OR
"∧"        &       AND
"←"                OF
"╰"                LWS
"╭"                UPS

Most of these characters made their way into European standard character sets (e.g. ALCOR and GOST). Ironically, the ¢ character was dropped from later versions of America's own ASCII character set.

The character "₁₀" is one ALGOL 68 byte (versus 2 in Unicode).

AutoHotkey

The escape character defaults to accent/backtick (`).

  • `, = , (literal comma). Note: Commas that appear within the last parameter of a command do not need to be escaped because the program knows to treat them literally. The same is true for all parameters of MsgBox because it has smart comma handling.
  • `% = % (literal percent)
  • `` = ` (literal accent; i.e. two consecutive escape characters result in a single literal character)
  • `; = ; (literal semicolon). Note: This is necessary only if a semicolon has a space or tab to its left. If it does not, it will be recognized correctly without being escaped.
  • `n = newline (linefeed/LF)
  • `r = carriage return (CR)
  • `b = backspace
  • `t = tab (the more typical horizontal variety)
  • `v = vertical tab -- corresponds to ASCII value 11. In some applications it can also be produced by typing Control+K.
  • `a = alert (bell) -- corresponds to ASCII value 7. In some applications it can also be produced by typing Control+G.
  • `f = formfeed -- corresponds to ASCII value 12. In some applications it can also be produced by typing Control+L.
  • Send = When the Send command or Hotstrings are used in their default (non-raw) mode, characters such as {}^!+# have special meaning. Therefore, to use them literally in these cases, enclose them in braces. For example: Send {^}{!}{{}
  • "" = Within an expression, two consecutive quotes enclosed inside a literal string resolve to a single literal quote. For example: Var := "The color ""red"" was found."

AWK

AWK uses the following special characters:

  • {...} body of code ("action", or body of if/for/while block)
  • (...) condition in for/while loops; arguments to a function
  • ; statement separator
  • /.../ regular expression
  • "..." string constant
  • a[b] element b of array a
  • # comment marker
  • \ begin escape sequence, as in C; e.g. \n, \t, \x1B, \\

In addition, regular expressions and (s)printf have their own "little languages".

Brainf***

The only characters that mean anything in BF are its commands:

> move the pointer one to the right

< move the pointer one to the left

+ increment the value at the pointer

- decrement the value at the pointer

, input one byte to memory at the pointer

. output one byte from memory at the pointer

[ begin loop if the value at the pointer is not 0

] end loop

All other characters are comments.
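
Since every non-command character is a comment, stripping comments amounts to a simple filter. The following is a minimal Python sketch of that idea; the names `BF_COMMANDS` and `strip_comments` are made up for this illustration and are not part of any BF implementation.

```python
# The eight command characters listed above; everything else is a comment.
BF_COMMANDS = set("><+-,.[]")


def strip_comments(source: str) -> str:
    """Return only the command characters of a BF program."""
    return "".join(ch for ch in source if ch in BF_COMMANDS)


print(strip_comments("add two: ++ then print ."))  # prints ++.
```

Note that punctuation in a comment (a comma, a full stop, a hyphen) is still a command, which is why comments in BF programs usually avoid it.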

C

See C++.

As in C++, ?, #, \, ' and " have special meaning (together with { and }). Trigraphs also work (they are an "old" way to avoid the "old" difficulties of finding characters like { } etc. on some keyboards).

The C99 standard (but not earlier standards) also recognizes universal character names, as C++ does.

String and character literals are like C++ (or rather the other way around!), and even the meaning and usage of the # character is the same.

C++

C++ has several types of escape sequences, which are interpreted in various contexts. The main characters with special properties are the question mark (?), the pound sign (#), the backslash (\), the single quote (') and the double quote (").

Trigraphs

Trigraphs are certain character sequences starting with two question marks, which can be used instead of certain characters, and which are always and in all contexts interpreted as the replacement character. They can be used anywhere in the source, including, but not limited to string constants. The complete list is:

Trigraph  Replacement letter
  ??(       [
  ??)       ]
  ??<       {
  ??>       }
  ??/       \
  ??=       #
  ??'       ^
  ??!       |
  ??-       ~

Note that interpretation of those trigraphs is the very first step in C++ compilation, therefore the trigraphs can be used instead of their replacement letters everywhere, including in all of the following escape sequences (e.g. instead of \u00FC (see next section) you can also write ??/u00FC, and it will be interpreted the same way).

Also note that some compilers don't interpret trigraphs by default, since today's character sets all contain the replacement characters, and therefore trigraphs are practically not used. However, accidentally using them (e.g. in a string constant) may change the code semantics on some compilers, so one should still be aware of them.
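
The replacement step described above can be sketched in a few lines of Python. This is only an illustration of a left-to-right scan over the table; it is not how any particular compiler implements it, and `TRIGRAPHS` and `replace_trigraphs` are names invented here.

```python
# Trigraph table copied from the list above.
TRIGRAPHS = {
    "??(": "[", "??)": "]", "??<": "{", "??>": "}",
    "??/": "\\", "??=": "#", "??'": "^", "??!": "|", "??-": "~",
}


def replace_trigraphs(source: str) -> str:
    """Scan left to right, replacing each trigraph with its single character."""
    out = []
    i = 0
    while i < len(source):
        chunk = source[i:i + 3]
        if chunk in TRIGRAPHS:
            out.append(TRIGRAPHS[chunk])
            i += 3
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


print(replace_trigraphs("??=include <iostream>"))  # prints #include <iostream>
print(replace_trigraphs('"Really??!"'))            # prints "Really|"
```

The second call shows the accidental case mentioned above: a string constant that merely ends in two question marks and an exclamation mark changes its content.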

Universal character names and escaping newlines

Moreover, C++ allows arbitrary Unicode characters to be written using only the basic source character set (which is a subset of ASCII), by means of a so-called universal character name. These have one of the forms

\uXXXX
\UXXXXXXXX

where each X is to be replaced by a hex digit. For example, the German umlaut letter ü (U+00FC) can be written as

\u00FC

or

\U000000FC

However, characters in the basic source character set may not be written in this form (but since all those characters are in standard ASCII, writing them as universal character names would only obfuscate anyway). If the compiler accepts direct use of non-ASCII characters somewhere in the code, the result must be the same as with the corresponding universal character name. For example, the following two lines, if accepted by the compiler, should have the same effect:
<lang cpp>
std::cout << "Tür\n";
std::cout << "T\u00FCr\n";
</lang>
Note that in principle, C++ also allows such letters in identifiers, e.g.
<lang cpp>
extern int Tür;      // if the compiler allows literal ü
extern int T\u00FCr; // should in theory work everywhere
</lang>
but that's not generally supported by existing compilers (e.g. g++ 4.1.2 doesn't support it).

Another escape sequence that works everywhere is the escaped newline: if a backslash is at the end of a line, the next line is pasted to it without any space in between. For example:
<lang cpp>
int const\
ant; // defines a variable of type int named constant, not a variable of type int const named ant
</lang>

String and character literal

A string literal is surrounded by double quotes ("). A character literal is surrounded by single quotes ('). Example:
<lang cpp>
char const* str = "a string literal";
char c = 'x'; // a character literal
</lang>

The following escape sequences are only allowed inside string constants and character constants:

escape seq.  meaning          ASCII character/codepoint
 \a           alert             BEL ^G/7
 \b           backspace         BS  ^H/8
 \f           form feed         FF  ^L/12
 \n           newline           LF  ^J/10
 \r           carriage return   CR  ^M/13
 \t           tab               TAB ^I/9
 \v           vertical tab      VT  ^K/11
 \'           single quote      '           (unescaped ' would end character constant)
 \"           double quote      "           (unescaped " would end string constant)
 \\           backslash         \           (unescaped \ would introduce escape sequence)
 \?           question mark     ?           (useful to break trigraphs in strings)
 \0           string end marker NUL ^@/0    (special case of octal char value)
 \nnn         (octal char value)            (each n must be an octal digit)
 \xnn         (hex char value)              (each n must be a hexadecimal digit)

Note that C++ doesn't guarantee ASCII. On non-ASCII platforms (e.g. EBCDIC), the rightmost column of course doesn't apply. However, \0 unconditionally has the value 0.
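
As a quick cross-check of the rightmost column on an ASCII platform, the sketch below uses Python, whose string literals accept the same single-letter escapes and the octal \0. This relies on Python's escapes, not on a C++ compiler, and the names `C_STYLE_ESCAPES` and `code_points` are invented for the example.

```python
# Assumption: Python's single-letter escapes denote the same ASCII code
# points as the C++ escapes in the table above.
C_STYLE_ESCAPES = {
    r"\a": "\a", r"\b": "\b", r"\f": "\f", r"\n": "\n",
    r"\r": "\r", r"\t": "\t", r"\v": "\v", r"\0": "\0",
}


def code_points(escapes):
    """Map each escape spelling to the code point of the character it denotes."""
    return {name: ord(ch) for name, ch in escapes.items()}


for name, code in code_points(C_STYLE_ESCAPES).items():
    print(f"{name} -> {code}")
```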

Also note that some compilers add the non-standard escape sequence \e for Escape (that is, the ASCII escape character).

The # character

The # character in C++ is special as it is interpreted only in the preprocessing phase, and shouldn't occur (outside of character/string constants) after preprocessing.

  • If # appears as first non-whitespace character in the line, it introduces a preprocessor directive. For example

<lang cpp>
#include <iostream>
</lang>

  • Inside macro definitions, a single # is the stringification operator, which turns its argument into a string. For example:

<lang cpp>
#include <iostream>

#define STR(x) #x

int main() {
  std::cout << STR(Hello world) << std::endl; // STR(Hello world) expands to "Hello world"
}
</lang>

  • Also inside macro definitions, ## is the token pasting operator. For example:

<lang cpp>
#define THE(x) the_ ## x

int THE(answer) = 42; // THE(answer) expands to the_answer
</lang>

Note that the # character is not interpreted specially inside character or string literals.

E

E uses typical C-style backslash escapes within literals. The defined escapes are:

Sequence Unicode Meaning
\b U+0008 (Backspace)
\t U+0009 (Tab)
\n U+000A (Line feed)
\f U+000C (Form feed)
\r U+000D (Carriage return)
\" U+0022 "
\' U+0027 '
\\ U+005C \
\<newline> None (Line continuation -- stands for no characters)
\uXXXX U+XXXX (BMP Unicode character, 4 hex digits)

Consensus has not been reached on handling non-BMP characters. All other backslash-followed-by-character sequences are syntax errors.

Within E quasiliterals, backslash is not special and $\ plays the same role:

<lang e>
? println(`1 + 1$\n= ${1 + 1}`)
1 + 1
= 2
</lang>

Haskell

Comments

-- comment here until end of line
{- comment here -}

Operator symbols (nearly any sequence can be used)

! # $ % & * + . / < = > ? @ \ ^ | - ~ :
: as first character denotes constructor

Reserved symbol sequences

.. : :: = \ | <- -> @ ~ => _ 

Infix quotes

`identifier` (to use as infix operator)

Characters

'.'
\ escapes

Strings

"..."
\ escapes

Special escapes

\a alert
\b backspace
\f form feed
\n new line
\r carriage return
\t horizontal tab
\v vertical tab

Other

( )   (grouping)
( , ) (tuple type/tuple constructor)
{ ; } (grouping inside let, where, do, case without layout)
[ , ] (list type/list constructor)
[ | ] (list comprehension)

Unicode characters, according to category:

Upper case (identifiers)
Lower case (identifiers)
Digits (numbers)
Symbol/punctuation (operators)

Java

Math:

& | ^ ~ (bitwise AND, OR, XOR, and NOT)
>> << (bitwise arithmetic shift)
>>> (bitwise logical shift)
+ - * / = % (+ can be used for String concatenation)
any of the previous math operators can be placed in front of an equals sign to make a self-operation replacement:
x = x + 2 is the same as x += 2
++ -- (increment and decrement--before a variable for pre (++x), after for post(x++))
== < > != <= >= (comparison)

Boolean:

! (NOT)
&& || (short-circuit AND, OR)
^ & | (long-circuit XOR, AND, OR)

Other:

{ } (scope)
( ) (for functions)
; (statement terminator)
[ ] (array index)
" (string literal)
' (character literal)
? : (ternary operator)

Escape characters:

 \b     (Backspace)
 \n     (Line Feed)
 \r     (Carriage Return)
 \f     (Form Feed)
 \t     (Tab)
 \0     (Null) Note: this is actually an octal escape, but handy nonetheless
 \'     (Single Quote)
 \"     (Double Quote)
 \\     (Backslash)
 \DDD   (Octal Escape Sequence, D is a number between 0 and 7; can only express characters from 0 to 255 (i.e. \0 to \377))

Unicode escapes:

 \uHHHH (Unicode Escape Sequence, H is a hexadecimal digit: 0 to 9 or A to F)

Be extremely careful with Unicode escapes. Unicode escapes are special and are substituted with the specified character before the source code is parsed. In other words, they apply anywhere in the code, not just inside character and string literals. Variable names can contain foreign characters. It also means that you can use Unicode escapes to write any character in the source code, and it would work. For example, you can say \u002b instead of saying + for addition; you can say String\u0020foo and it would be interpreted as two identifiers: String foo; you can even write the entire Java source file with Unicode escapes, as a poor form of obfuscation.

However, this leads to many problems:

  • \u000A will become a line break in the code, which will terminate line-end comments:
 // hello \u000A this looks like a comment
is a syntax error, because the part after \u000A is on the next line and no longer in the comment
  • \u0022 will become a double-quote in the code, which ends / begins a string literal:
 "hello \u0022 is this a string?"
is a syntax error, because the part after \u0022 is outside the string literal
  • An invalid sequence of \u, even in comments that usually are ignored, will cause a parsing error:
 /*
  * c:\unix\home\
  */
is a syntax error, because \unix is not a valid Unicode escape, even though it appears inside a comment
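
The problems above all follow from the escapes being translated before anything else happens. The Python sketch below imitates that early pass in a deliberately simplified way: it does not reproduce the Java rule about counting preceding backslashes, and it silently leaves malformed sequences alone instead of reporting an error. The name `translate_unicode_escapes` is invented for this example.

```python
import re

# Simplified pattern: a backslash, one or more 'u', then exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u+([0-9A-Fa-f]{4})")


def translate_unicode_escapes(source: str) -> str:
    """Replace each \\uHHHH with the character it names, as an early pass would."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


print(translate_unicode_escapes(r"x \u002b y"))  # prints x + y

# The "comment" is split in two once \u000A has become a real line break.
comment = translate_unicode_escapes(r"// hello \u000A this looks like a comment")
print(comment.splitlines())
```

After this pass, the tokenizer sees a comment on the first line and ordinary (invalid) code on the second, which is exactly the first problem listed above.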

LaTeX

LaTeX has ten special characters: # $ % & ~ _ ^ \ { }

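To make most of these characters (# $ % & _ { }) appear literally in output, prefix the character with a \. The remaining three need commands instead, because \\, \~ and \^ already have other meanings: use \textbackslash, \textasciitilde and \textasciicircum. For example, to typeset 5% of $10 you would type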

5\% of \$10
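
When text is generated by a program, the escaping can be done mechanically. The following Python sketch is one way to do it under the default special characters only; it assumes the commands \textbackslash, \textasciitilde and \textasciicircum for the three characters where a plain backslash prefix means something else (line break and accents), and the names `LATEX_ESCAPES` and `escape_latex` are made up for this example.

```python
# Replacement for each of the ten default special characters.
LATEX_ESCAPES = {
    "#": r"\#", "$": r"\$", "%": r"\%", "&": r"\&", "_": r"\_",
    "{": r"\{", "}": r"\}",
    # A backslash prefix does not give the literal character for these three.
    "\\": r"\textbackslash{}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text: str) -> str:
    """Escape one character at a time, so replacements are never re-escaped."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


print(escape_latex("5% of $10"))  # prints 5\% of \$10
```

Working character by character avoids the classic mistake of chained string replacements, where the backslashes introduced by one replacement get escaped again by the next.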

Note that the set of special characters in LaTeX isn't really fixed, but can be changed by LaTeX code. For example, the package ngerman (providing German-specific definitions, including easier access to umlaut letters) re-defines the double quote character (") as special character, so you can more easily write German words like "hören" (as h"oren instead of h{\"o}ren).

OCaml

Character escape sequences

\\     backslash
\"     double quote
\'     single quote
\n     line feed
\r     carriage return
\t     tab
\b     backspace
\  (backslash followed by a space) space
\DDD   where D is a decimal digit; the character with code DDD in decimal
\xHH   where H is a hex digit; the character with code HH in hex

Plain TeX

TeX attaches to each character a category code that determines its "meaning" for TeX. Macro packages can redefine the category code of any character. Ignoring category codes 10 (blank), 11 (letters) and 12 (a category embracing all characters that are neither letters nor "special" characters according to TeX), and a few more that are not of interest here, the only characters that have a "special" category code when TeX starts are

  • \ %

plainTeX then assigns a few more (some non-printable characters that also get a "special" category code are not listed here):

  • { } $ & # ^ _ ~

All of these are "inherited" by many other macro packages (among them LaTeX).


PowerShell

PowerShell is unusual in that it retains many of the escape sequences of languages descended from C, except that unlike these languages it uses a backtick ` as the escape character rather than a backslash \. For example `n is a new line and `t is a tab.

Python

(From the Python Documentation):

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

Escape Sequence Meaning Notes
\newline Ignored  
\\ Backslash (\)  
\' Single quote (')  
\" Double quote (")  
\a ASCII Bell (BEL)  
\b ASCII Backspace (BS)  
\f ASCII Formfeed (FF)  
\n ASCII Linefeed (LF)  
\N{name} Character named name in the Unicode database (Unicode only)  
\r ASCII Carriage Return (CR)  
\t ASCII Horizontal Tab (TAB)  
\uxxxx Character with 16-bit hex value xxxx (Unicode only) (1)
\Uxxxxxxxx Character with 32-bit hex value xxxxxxxx (Unicode only) (2)
\v ASCII Vertical Tab (VT)  
\ooo Character with octal value ooo (3,5)
\xhh Character with hex value hh (4,5)

Notes:

  1. Individual code units which form parts of a surrogate pair can be encoded using this escape sequence.
  2. Any Unicode character can be encoded this way, but characters outside the Basic Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is compiled to use 16-bit code units (the default). Individual code units which form parts of a surrogate pair can be encoded using this escape sequence.
  3. As in Standard C, up to three octal digits are accepted.
  4. Unlike in Standard C, exactly two hex digits are required.
  5. In a string literal, hexadecimal and octal escapes denote the byte with the given value; it is not necessary that the byte encodes a character in the source character set. In a Unicode literal, these escapes denote a Unicode character with the given value.
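
A few of the rules above can be checked directly. This short sketch assumes Python 3, where every `str` literal is a Unicode string (so the "Unicode only" escapes are always available) and byte values are written with a `b` prefix; the variable names are only for illustration.

```python
import unicodedata

newline = "\n"      # one character: line feed
raw = r"\n"         # two characters: backslash and 'n'
print(len(newline), len(raw))  # prints 1 2

# Hex and octal escapes naming the same character.
letter_a = "\x41"
same_a = "\101"

# Three spellings of the same character, U+00FC.
u_umlaut = "\u00fc"
long_form = "\U000000fc"
named = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
print(unicodedata.name(u_umlaut))  # prints LATIN SMALL LETTER U WITH DIAERESIS

# Backslash-newline is ignored, joining the two source lines.
joined = "ab\
cd"

# In a bytes literal \xfc is the single byte 0xFC, not the encoded character.
one_byte = b"\xfc"
encoded = "\xfc".encode("utf-8")
```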


Tcl

As documented in man Tcl, the following special characters are defined:
<lang Tcl>
{...}      ;# group in one word, without substituting content; nests
"..."      ;# group in one word, with substituting content
[...]      ;# evaluate content as script, then substitute with its result; nests
$foo       ;# substitute with content of variable foo
$bar(foo)  ;# substitute with content of element 'foo' of array 'bar'
\a         ;# audible alert (bell)
\b         ;# backspace
\f         ;# form feed
\n         ;# newline
\r         ;# carriage return
\t         ;# Tab
\v         ;# vertical tab
\\         ;# backslash
\ooo       ;# the Unicode with octal value 'ooo'
\xhh       ;# the character with hexadecimal value 'hh'
\uhhhh     ;# the Unicode with hexadecimal value 'hhhh'
#          ;# if first character of a word expected to be a command, begin comment
           ;# (extends till end of line)
{*}        ;# if first characters of a word, interpret as list of words to substitute,
           ;# not single word (introduced with Tcl 8.5)
</lang>

XSLT

XSLT is based on XML, and so has the same special characters which must be escaped using character entities:

  • & - &amp;
  • < - &lt;
  • > - &gt;
  • " - &quot;
  • ' - &apos;

Any Unicode character may also be represented via its decimal code point (&#nnnn;) or hexadecimal code point (&#xdddd;).
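
Programs that generate XSLT or XML text usually apply these replacements with a library call instead of doing it by hand. Below is a small Python sketch using the standard library's `xml.sax.saxutils.escape`, which handles &, < and > by default and takes extra replacements as a dictionary; the helper names `escape_xml` and `char_refs` are invented here.

```python
from html import unescape
from xml.sax.saxutils import escape


def escape_xml(text: str) -> str:
    """Escape the five special characters listed above."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def char_refs(ch: str) -> tuple:
    """Return the decimal and hexadecimal character references for ch."""
    code = ord(ch)
    return f"&#{code};", f"&#x{code:X};"


print(escape_xml('a < b & "c"'))  # prints a &lt; b &amp; &quot;c&quot;
print(char_refs("ü"))             # prints ('&#252;', '&#xFC;')
print(unescape("&#252;") == unescape("&#xFC;") == "ü")  # prints True
```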