Monday, 3 May 2004

Unicode to Html Entity and back again

WorldPay with C#/.NET [2] also discusses encoding "double byte" (Unicode, really) values as HTML entities. The 'final' code is:
private string HtmlEntityEncode (string unicodeText) {
    int unicodeVal;
    string encoded = "";
    foreach (char c in unicodeText) {
        unicodeVal = c;
        if ((c >= 48) && (c <= 122)) {
            // in 'ascii' range 0x30 to 0x7A, which is 0-9, A-Z, a-z plus the
            // punctuation in between (including < and >)
            encoded += c; // leave as-is
        } else { // outside that range - encode as a decimal character reference
            encoded += string.Concat("&#",
                unicodeVal.ToString(System.Globalization.NumberFormatInfo.InvariantInfo), ";");
        }
    }
    return encoded;
}
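
One limitation of the loop above: a C# char is a UTF-16 code unit, so a character outside the Basic Multilingual Plane (an emoji, say) is visited as two surrogate halves and comes out as two entities. The following is a minimal sketch of one way around that, assuming a .NET version that has char.ConvertToUtf32 and StringBuilder available; the class name EntityEncoderSketch is made up for illustration and is not from the WorldPay article. It keeps the same 0x30 to 0x7A pass-through range.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class EntityEncoderSketch
{
    public static string HtmlEntityEncode(string unicodeText)
    {
        StringBuilder encoded = new StringBuilder(unicodeText.Length);
        for (int i = 0; i < unicodeText.Length; i++)
        {
            char c = unicodeText[i];
            if (c >= 0x30 && c <= 0x7A)
            {
                encoded.Append(c); // same pass-through range as the original
                continue;
            }

            int codePoint = c;
            // Combine a valid surrogate pair into a single code point
            if (char.IsHighSurrogate(c)
                && i + 1 < unicodeText.Length
                && char.IsLowSurrogate(unicodeText[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, unicodeText[i + 1]);
                i++; // skip the low surrogate that was just consumed
            }

            encoded.Append("&#")
                   .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                   .Append(';');
        }
        return encoded.ToString();
    }
}
```

StringBuilder also avoids allocating a new string on every += when the input is long.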


But it's also fairly easy to get your 'original' string back... this code can go anywhere:
System.Text.RegularExpressions.Regex entityResolver =
    new System.Text.RegularExpressions.Regex (@"([&][#](?'unicode'\d+);)|([&](?'html'\w+);)");
string outputString = entityResolver.Replace(inputString,
    new System.Text.RegularExpressions.MatchEvaluator (ResolveEntity) );

as long as this method is available:
private string ResolveEntity (System.Text.RegularExpressions.Match matchToProcess) {
    string x = "X"; // default 'char placeholder' if cannot be resolved
    if (matchToProcess.Groups["unicode"].Success) {
        // Convert.ToChar throws for values above 65535, so check the range first
        int code;
        if (int.TryParse(matchToProcess.Groups["unicode"].Value, out code) && code <= char.MaxValue) {
            x = Convert.ToChar(code).ToString();
        }
    } else if (matchToProcess.Groups["html"].Success) {
        switch (matchToProcess.Groups["html"].Value.ToLower()) {
            // this could be expanded to as many as you like, or (maybe)
            // System.Web.HttpUtility.HtmlDecode will work on
            // the whole 'entity' string... ?
            case "nbsp": x = " "; break;
            case "copy": x = "(c)"; break;
            case "lt": x = "<"; break;
            case "gt": x = ">"; break;
            case "amp": x = "&"; break;
        }
    }
    return x;
}
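
On the question raised in the comment above: a framework decoder can indeed take the whole string in one call. The sketch below assumes a .NET version that ships System.Net.WebUtility (System.Web.HttpUtility.HtmlDecode is the older equivalent); the wrapper class EntityDecoderSketch is a made-up name for illustration. Note that the results differ from the hand-written method: &copy; becomes the real copyright sign rather than "(c)", &nbsp; becomes a non-breaking space (U+00A0) rather than a plain space, and an unrecognised entity is left untouched instead of being replaced by "X".

```csharp
using System;
using System.Net;

// Hypothetical wrapper; only WebUtility.HtmlDecode is a framework call.
public static class EntityDecoderSketch
{
    public static string Decode(string encodedText)
    {
        // Handles numeric references and the named HTML entities it knows about
        return WebUtility.HtmlDecode(encodedText);
    }
}
```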


UPDATE: the list of HTML 4 entities could be used to write a robust ResolveEntity() method using the 'pattern' above.
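
As a rough illustration of that idea, the switch can be swapped for a lookup table. This is only a sketch under a few assumptions: the class name EntityResolverSketch and the handful of table entries are made up for the example (a full table would carry every HTML 4 entity), it assumes a .NET version with generics and char.ConvertFromUtf32, and it leaves anything it cannot resolve unchanged rather than substituting "X". The table is case-sensitive because HTML 4 entity names are (&Eacute; and &eacute; are different characters).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical class; names and table contents are illustrative only.
public static class EntityResolverSketch
{
    // A small sample of the HTML 4 entity list, keyed by case-sensitive name
    private static readonly Dictionary<string, string> NamedEntities =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            { "nbsp", "\u00A0" },
            { "copy", "\u00A9" },
            { "lt", "<" },
            { "gt", ">" },
            { "amp", "&" },
            { "quot", "\"" },
            { "eacute", "\u00E9" },
            { "Eacute", "\u00C9" },
        };

    private static readonly Regex EntityPattern =
        new Regex(@"&#(?<unicode>\d+);|&(?<html>\w+);");

    public static string Resolve(string input)
    {
        return EntityPattern.Replace(input, ResolveEntity);
    }

    private static string ResolveEntity(Match match)
    {
        if (match.Groups["unicode"].Success)
        {
            int codePoint;
            bool parsed = int.TryParse(match.Groups["unicode"].Value,
                NumberStyles.None, CultureInfo.InvariantCulture, out codePoint);
            // Accept only valid Unicode scalar values (no lone surrogates)
            if (parsed && codePoint <= 0x10FFFF
                && (codePoint < 0xD800 || codePoint > 0xDFFF))
            {
                return char.ConvertFromUtf32(codePoint);
            }
            return match.Value; // leave unparseable references as-is
        }

        string replacement;
        return NamedEntities.TryGetValue(match.Groups["html"].Value, out replacement)
            ? replacement
            : match.Value; // unknown name: leave as-is
    }
}
```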