dataelemental

Coding to data formats and querying

Encoding Markup in .NET

There are at least four ways to escape HTML or XML characters to entities in .NET, but they work quite differently.

TL;DR: In #PowerShell, the best options to encode to HTML or XML are [Security.SecurityElement]::Escape() (minimal) or [Text.Encodings.Web.HtmlEncoder]::Default.Encode() (comprehensive).

Escape

System.Security.SecurityElement.Escape() is the simplest encoder, only escaping & < > " and ' and passing through all other characters unchanged.

This is fine if you want something lightweight (though not as lightweight as doing the five search-and-replace operations yourself), and you don't want or need any special encoding for any other characters.
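As a rough illustration of the point above, the following sketch calls Escape() on a sample string and compares the result with a hand-rolled version that performs the five replacements directly. The variable names and the sample input are made up for this example.

```powershell
# Sample input containing all five characters that Escape() handles.
$text = '<a title="Tom & Jerry''s">link</a>'

# Built-in minimal escaper.
$escaped = [Security.SecurityElement]::Escape($text)

# Hand-rolled equivalent: five replacements, ampersand first so that the
# ampersands introduced by the other entities are not escaped a second time.
$manual = $text.Replace('&', '&amp;').Replace('<', '&lt;').Replace('>', '&gt;').Replace('"', '&quot;').Replace("'", '&apos;')

$escaped
# &lt;a title=&quot;Tom &amp; Jerry&apos;s&quot;&gt;link&lt;/a&gt;

$escaped -ceq $manual
# True
```

The order of the replacements matters in the manual version: escaping the ampersand last would turn `&lt;` into `&amp;lt;`.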

Escape Effect

codepoint(s)	name	encoded?	format
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&apos;`
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
all others	Basic Multilingual Plane (remaining)	❌	(unescaped)

Encode

System.Text.Encodings.Web.HtmlEncoder.Default.Encode() is a more comprehensive encoder. Besides the bare minimum characters, it also encodes everything outside 7-bit ASCII for compatibility. For those characters it emits hex codepoint entities rather than HTML named entities, so the output works in both HTML and XML.
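A minimal sketch of that behavior, assuming PowerShell 7 or later (where the System.Text.Encodings.Web types are available without extra setup). The sample string is invented for this example, and the non-ASCII letter is built from its codepoint so the script does not depend on the file's encoding.

```powershell
# Sample input: a plus sign, two markup characters, and U+00E9 (e with acute).
$text = "1 + 1 < 3 & caf$([char]0xE9)"

$encoded = [Text.Encodings.Web.HtmlEncoder]::Default.Encode($text)

$encoded
# 1 &#x2B; 1 &lt; 3 &amp; caf&#xE9;
```

Note that the plus sign is encoded even though it is printable ASCII, while spaces and digits pass through unchanged.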

Encode Effect

codepoint(s)	name	encoded?	format
U+0000–U+001F	C0 Controls	✔️	`&#x0;` – `&#x1F;`
U+0020	SPACE	❌
U+0021	EXCLAMATION MARK	❌	!
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0023	NUMBER SIGN	❌	#
U+0024	DOLLAR SIGN	❌	$
U+0025	PERCENT SIGN	❌	%
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&#x27;`
U+0028	LEFT PARENTHESIS	❌	(
U+0029	RIGHT PARENTHESIS	❌	)
U+002A	ASTERISK	❌	*
U+002B	PLUS SIGN	✔️	`&#x2B;`
U+002C–U+003B	Basic Latin (partial)	❌	, – ;
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003D	EQUALS SIGN	❌	=
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
U+003F–U+007E	Basic Latin (remaining printable)	❌	? – ~
U+007F–U+FFFF	Basic Multilingual Plane (remaining)	✔️	`&#x7F;` – `&#xFFFF;`

HtmlEncode

System.Web.HttpUtility.HtmlEncode() encodes only the minimal symbols as named entities (except the apostrophe, which becomes a decimal entity for maximum compatibility with extremely old browsers and HTML parsers) and the Latin-1 Supplement as decimal entities. It doesn't encode any control characters or any characters beyond the Latin-1 Supplement.
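The sketch below shows those three cases side by side: an apostrophe, a Latin-1 Supplement letter, and a letter from Latin Extended-A. The sample string is made up for this example. The Add-Type line is there because Windows PowerShell does not load System.Web by default; it is assumed here to be harmless in PowerShell 7.

```powershell
# Load System.Web (needed in Windows PowerShell; assumed harmless elsewhere).
Add-Type -AssemblyName System.Web

# U+0027 (markup), U+00E9 (Latin-1 Supplement), U+0100 (Latin Extended-A).
$text = "it's caf$([char]0xE9) & $([char]0x100)"

$encoded = [Web.HttpUtility]::HtmlEncode($text)

$encoded
# Expected: it&#39;s caf&#233; &amp; followed by U+0100 left unchanged
```

The last character comes back as-is, so the output is only safe to store or send in an encoding that can represent it.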

HtmlEncode Effect

codepoint(s)	name	encoded?	format
U+0000–U+001F	C0 Controls	❌	␀ – ␟
U+0020	SPACE	❌
U+0021	EXCLAMATION MARK	❌	!
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0023	NUMBER SIGN	❌	#
U+0024	DOLLAR SIGN	❌	$
U+0025	PERCENT SIGN	❌	%
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&#39;`
U+0028–U+003B	Basic Latin (partial)	❌	( – ;
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003D	EQUALS SIGN	❌	=
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
U+003F–U+007E	Basic Latin (remaining)	❌	? – ␡
U+0080–U+009F	C1 Controls	❌	PAD – APC
U+00A0–U+00FF	Latin-1 Supplement	✔️	`&#160;` – `&#255;`
U+0100–U+FFFF	Basic Multilingual Plane (remaining)	❌	Ā – U+FFFF†

† U+FFFF is a noncharacter, so it has no printable form

🏚️ AntiXssEncoder

System.Web.Security.AntiXss.AntiXssEncoder offered a variety of encoding choices, but was discontinued after .NET Framework 4.8.1.

Encoding is accomplished with any of three methods: HtmlEncode(), XmlAttributeEncode(), or XmlEncode().

These methods only encode codepoints up through U+00A0, and then U+0370 and above, as decimal entities, and they mangle a number of the characters in the ranges they don't encode, so they're a bad choice for several reasons.

AntiXssEncoder Effect

codepoint(s)	name	encoded?	format	notes
U+0000–U+001F	C0 Controls	✔️	`&#0;` – `&#31;`
U+0020	SPACE	✔️	`&#32;`	`XmlAttributeEncode()`
U+0020	SPACE	❌		`HtmlEncode()` & `XmlEncode()`
U+0021	EXCLAMATION MARK	❌	!
U+0022	QUOTATION MARK	✔️	`&quot;`
U+0023	NUMBER SIGN	❌	#
U+0024	DOLLAR SIGN	❌	$
U+0025	PERCENT SIGN	❌	%
U+0026	AMPERSAND	✔️	`&amp;`
U+0027	APOSTROPHE	✔️	`&apos;`	`XmlAttributeEncode()` & `XmlEncode()`
U+0027	APOSTROPHE	✔️	`&#39;`	`HtmlEncode()`
U+0028–U+003B	Basic Latin (partial)	❌	( – ;	some printable 7-bit ASCII
U+003C	LESS-THAN SIGN	✔️	`&lt;`
U+003D	EQUALS SIGN	❌	=
U+003E	GREATER-THAN SIGN	✔️	`&gt;`
U+003F–U+007E	Basic Latin (remaining printable)	❌	? – ~
U+007F	DELETE	✔️	`&#127;`
U+0080–U+009F	C1 Controls	✔️	`&#128;` – `&#159;`
U+00A0	NO-BREAK SPACE	✔️	`&#160;`
U+00A1–U+00AC	Latin-1 Supplement (partial)	❌	¡ – ¬
U+00AD	SOFT HYPHEN	✔️	`&#173;`
U+00AE–U+036F	Latin (remaining), various extensions	❌	® – ͯ	see †
U+0370–U+FFFF	Basic Multilingual Plane (remaining)	✔️	`&#880;` – `&#65535;`

† these blocks:

Latin-1 Supplement (remaining)
Latin Extended-A
Latin Extended-B
IPA Extensions
Spacing Modifier Letters
Combining Diacritical Marks