Encoding Markup in .NET
There are at least four ways to escape HTML or XML characters to entities in .NET, but they work quite differently.
TL;DR: In #PowerShell, the best options to encode to HTML or XML are
[Security.SecurityElement]::Escape() (minimal) or [Text.Encodings.Web.HtmlEncoder]::Default.Encode() (comprehensive).
Escape
System.Security.SecurityElement.Escape() is the simplest encoder, only escaping & < > " and ' and
passing through all other characters unchanged.
This is fine if you want something lightweight (though not as lightweight as doing the five search-and-replace operations yourself), and you don't want or need any special encoding for any other characters.
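For comparison, here is a minimal sketch of what doing the five search-and-replace operations yourself could look like. The function name ConvertTo-MinimalEntity is hypothetical, made up for this illustration; it is not part of .NET or PowerShell.

```powershell
# Hypothetical helper (not a built-in): escapes the same five characters
# that SecurityElement.Escape() handles, using plain string replacement.
function ConvertTo-MinimalEntity {
    param([string] $Text)

    # The ampersand must be replaced first; otherwise the ampersands
    # introduced by the later replacements would be escaped a second time.
    $Text = $Text.Replace('&', '&amp;')
    $Text = $Text.Replace('<', '&lt;')
    $Text = $Text.Replace('>', '&gt;')
    $Text = $Text.Replace('"', '&quot;')
    $Text = $Text.Replace("'", '&apos;')
    return $Text
}

ConvertTo-MinimalEntity 'Tom & "Jerry" <b>'
# Tom &amp; &quot;Jerry&quot; &lt;b&gt;
```

The order of the replacements is the only subtle part: escaping the ampersand last would turn `&lt;` into `&amp;lt;`.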
Escape Effect
| codepoint(s) | name | encoded? | format |
|---|---|---|---|
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` |
| U+0026 | AMPERSAND | ✔️ | `&amp;` |
| U+0027 | APOSTROPHE | ✔️ | `&apos;` |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` |
| all others | Basic Multilingual Plane (remaining) | ❌ | (unescaped) |
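As a quick illustration of the table above, this sketch calls Escape() from PowerShell. The sample string and variable names are made up for the example.

```powershell
# é is built from [char]0xE9 so the script does not depend on file encoding.
$raw = 'Tom & Jerry''s <caf' + [char]0xE9 + '>'

# Only the five markup characters are escaped; é passes through unchanged.
$escaped = [Security.SecurityElement]::Escape($raw)
$escaped
# Tom &amp; Jerry&apos;s &lt;café&gt;
```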
Encode
System.Text.Encodings.Web.HtmlEncoder.Default.Encode() is a more comprehensive encoder: it encodes not only the bare
minimum characters, but also anything outside 7-bit ASCII, for compatibility. For those characters it uses hex codepoint
entities instead of named entities, so the output works for both HTML and XML.
Encode Effect
| codepoint(s) | name | encoded? | format |
|---|---|---|---|
| U+0000–U+001F | C0 Controls | ✔️ | `&#x0;` – `&#x1F;` |
| U+0020 | SPACE | ❌ | |
| U+0021 | EXCLAMATION MARK | ❌ | ! |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` |
| U+0023 | NUMBER SIGN | ❌ | # |
| U+0024 | DOLLAR SIGN | ❌ | $ |
| U+0025 | PERCENT SIGN | ❌ | % |
| U+0026 | AMPERSAND | ✔️ | `&amp;` |
| U+0027 | APOSTROPHE | ✔️ | `&#x27;` |
| U+0028 | LEFT PARENTHESIS | ❌ | ( |
| U+0029 | RIGHT PARENTHESIS | ❌ | ) |
| U+002A | ASTERISK | ❌ | * |
| U+002B | PLUS SIGN | ✔️ | `&#x2B;` |
| U+002C–U+003B | Basic Latin (partial) | ❌ | , – ; |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` |
| U+003D | EQUALS SIGN | ❌ | = |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` |
| U+003F–U+007E | Basic Latin (remaining printable) | ❌ | ? – ~ |
| U+007F–U+FFFF | Basic Multilingual Plane (remaining) | ✔️ | `&#x7F;` – `&#xFFFF;` |
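A short sketch of the same behavior from PowerShell follows. It assumes PowerShell 7 or later, where System.Text.Encodings.Web is available without extra setup; Windows PowerShell 5.1 does not ship this type. The sample string is made up for the example.

```powershell
# é is built from [char]0xE9 so the script does not depend on file encoding.
$raw = 'Tom & Jerry''s <caf' + [char]0xE9 + '> 1+1'

# Markup characters become named entities where the encoder has one;
# the apostrophe, the plus sign, and é become hex codepoint entities.
$encoded = [Text.Encodings.Web.HtmlEncoder]::Default.Encode($raw)
$encoded
# Tom &amp; Jerry&#x27;s &lt;caf&#xE9;&gt; 1&#x2B;1
```

Because the result is pure ASCII, it survives being written out in almost any encoding.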
HtmlEncode
System.Web.HttpUtility.HtmlEncode() encodes only the minimal symbols as named entities (except the apostrophe, which
becomes a decimal entity for maximum HTML compatibility with extremely old browsers and HTML parsers), and the Latin-1
Supplement as decimal entities. It doesn't encode any control characters or any characters beyond the Latin-1 Supplement.
HtmlEncode Effect
| codepoint(s) | name | encoded? | format |
|---|---|---|---|
| U+0000–U+001F | C0 Controls | ❌ | ␀ – ␟ |
| U+0020 | SPACE | ❌ | |
| U+0021 | EXCLAMATION MARK | ❌ | ! |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` |
| U+0023 | NUMBER SIGN | ❌ | # |
| U+0024 | DOLLAR SIGN | ❌ | $ |
| U+0025 | PERCENT SIGN | ❌ | % |
| U+0026 | AMPERSAND | ✔️ | `&amp;` |
| U+0027 | APOSTROPHE | ✔️ | `&#39;` |
| U+0028–U+003B | Basic Latin (partial) | ❌ | ( – ; |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` |
| U+003D | EQUALS SIGN | ❌ | = |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` |
| U+003F–U+007F | Basic Latin (remaining) | ❌ | ? – ␡ |
| U+0080–U+009F | C1 Controls | ❌ | PAD – APC |
| U+00A0–U+00FF | Latin-1 Supplement | ✔️ | `&#160;` – `&#255;` |
| U+0100–U+FFFF | Basic Multilingual Plane (remaining) | ❌ | Ā – U+FFFF† |
† not a valid codepoint
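The following sketch shows HtmlEncode() on a small sample string (made up for the example). The Add-Type line is an assumption about the environment: Windows PowerShell needs it to load System.Web, while in PowerShell 7 the type is normally available already and the line should be harmless.

```powershell
# Load System.Web; required in Windows PowerShell 5.1.
Add-Type -AssemblyName System.Web

# é is built from [char]0xE9 so the script does not depend on file encoding.
$raw = 'Tom & Jerry''s <caf' + [char]0xE9 + '>'

# The apostrophe and é become decimal entities; the rest are named entities.
$encoded = [System.Web.HttpUtility]::HtmlEncode($raw)
$encoded
# Tom &amp; Jerry&#39;s &lt;caf&#233;&gt;
```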
🏚️ AntiXssEncoder
System.Web.Security.AntiXss.AntiXssEncoder offered a variety of encoding choices, but was discontinued after .NET Framework 4.8.1.
Encoding is accomplished with any of three methods: HtmlEncode(), XmlAttributeEncode(), or XmlEncode().
These methods encode only codepoints up through U+00A0 (leaving most printable ASCII alone), then U+0370 and above, as decimal entities. They also mangle a number of characters in the ranges they don't encode, so this class is a bad choice for several reasons.
AntiXssEncoder Effect
| codepoint(s) | name | encoded? | format | notes |
|---|---|---|---|---|
| U+0000–U+001F | C0 Controls | ✔️ | `&#0;` – `&#31;` | |
| U+0020 | SPACE | ✔️ | `&#32;` | XmlAttributeEncode() |
| U+0020 | SPACE | ❌ | | HtmlEncode() & XmlEncode() |
| U+0021 | EXCLAMATION MARK | ❌ | ! | |
| U+0022 | QUOTATION MARK | ✔️ | `&quot;` | |
| U+0023 | NUMBER SIGN | ❌ | # | |
| U+0024 | DOLLAR SIGN | ❌ | $ | |
| U+0025 | PERCENT SIGN | ❌ | % | |
| U+0026 | AMPERSAND | ✔️ | `&amp;` | |
| U+0027 | APOSTROPHE | ✔️ | `&apos;` | XmlAttributeEncode() & XmlEncode() |
| U+0027 | APOSTROPHE | ✔️ | `&#39;` | HtmlEncode() |
| U+0028–U+003B | Basic Latin (partial) | ❌ | ( – ; | some printable 7-bit ASCII |
| U+003C | LESS-THAN SIGN | ✔️ | `&lt;` | |
| U+003D | EQUALS SIGN | ❌ | = | |
| U+003E | GREATER-THAN SIGN | ✔️ | `&gt;` | |
| U+003F–U+007E | Basic Latin (remaining printable) | ❌ | ? – ~ | |
| U+007F | DELETE | ✔️ | `&#127;` | |
| U+0080–U+009F | C1 Controls | ✔️ | `&#128;` – `&#159;` | |
| U+00A0 | NO-BREAK SPACE | ✔️ | `&#160;` | |
| U+00A1–U+00AC | Latin-1 Supplement (partial) | ❌ | ¡ – ¬ | |
| U+00AD | SOFT HYPHEN | ✔️ | `&#173;` | |
| U+00AE–U+036F | Latin (remaining), various extensions | ❌ | ® – ͯ | see † |
| U+0370–U+FFFF | Basic Multilingual Plane (remaining) | ✔️ | `&#880;` – `&#65535;` | |
† these blocks:
- Latin-1 Supplement (remaining)
- Latin Extended-A
- Latin Extended-B
- IPA Extensions
- Spacing Modifier Letters
- Combining Diacritical Marks