ConceptDev (Craig Dunn's blog): Unicode to Html Entity and back again

Monday, 3 May 2004

Unicode to Html Entity and back again

WorldPay with C#/.NET [2] also discusses encoding "double byte" (Unicode, really) values as HTML entities. The 'final' code is:

private string HtmlEntityEncode (string unicodeText) {
	int unicodeVal;
	string encoded = "";
	foreach (char c in unicodeText) {
		unicodeVal = c;
		if ((c >= 0x30) && (c <= 0x7a)) {
			// in 'ascii' range x30 to x7a which is 0-9A-Za-z plus some punctuation
			// (note that this range includes '<' and '>', which pass through unescaped)
			encoded += c;	// leave as-is
		} else { // outside that range (including space and '&') - encode as a decimal entity
			encoded += string.Concat("&#",
				unicodeVal.ToString(System.Globalization.NumberFormatInfo.InvariantInfo), ";");
		}
	}
	return encoded;
}
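One thing the loop above does not cover is characters outside the Basic Multilingual Plane: foreach over a string yields UTF-16 code units, so such a character comes out as two surrogate entities instead of one. Here is a minimal sketch of one way around that, assuming StringBuilder and char.ConvertToUtf32 are acceptable. The class name EntityEncoder and the method name Encode are hypothetical, not from the original post.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class EntityEncoder
{
    // Hypothetical variant of HtmlEntityEncode that works on code points
    // rather than individual UTF-16 code units.
    public static string Encode(string unicodeText)
    {
        var encoded = new StringBuilder(unicodeText.Length);
        for (int i = 0; i < unicodeText.Length; i++)
        {
            char c = unicodeText[i];
            if (c >= 0x30 && c <= 0x7A)
            {
                encoded.Append(c); // same pass-through range as the post
                continue;
            }

            int codePoint = c;
            // Combine a valid surrogate pair into a single code point.
            if (char.IsHighSurrogate(c) && i + 1 < unicodeText.Length
                && char.IsLowSurrogate(unicodeText[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, unicodeText[i + 1]);
                i++; // skip the low surrogate that was just consumed
            }

            encoded.Append("&#")
                   .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                   .Append(';');
        }
        return encoded.ToString();
    }

    public static void Main()
    {
        // U+00E9 is 233, space is 32, U+1F600 is 128512
        Console.WriteLine(Encode("caf\u00e9 \U0001F600")); // prints caf&#233;&#32;&#128512;
    }
}
```

The StringBuilder is only there to avoid repeated string concatenation on long inputs; the pass-through range is unchanged.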

But it's also fairly easy to get your 'original' string back... this code can go anywhere

System.Text.RegularExpressions.Regex entityResolver =
	new System.Text.RegularExpressions.Regex (@"([&][#](?'unicode'\d+);)|([&](?'html'\w+);)");
string outputString = entityResolver.Replace(inputString,
	new System.Text.RegularExpressions.MatchEvaluator (ResolveEntity) );

as long as this method is available

private string ResolveEntity (System.Text.RegularExpressions.Match matchToProcess) {
	string x = "X"; // default 'char placeholder' if cannot be resolved
	if (matchToProcess.Groups["unicode"].Success) {
		// note: Convert.ToChar throws an OverflowException for values above 65535
		x = Convert.ToChar(Convert.ToInt32(matchToProcess.Groups["unicode"].Value) ).ToString();
	} else if (matchToProcess.Groups["html"].Success) {
		switch (matchToProcess.Groups["html"].Value.ToLower()) {
			// this could be expanded to as many as you like, or (maybe)
			// System.Web.HttpUtility.HtmlDecode will work on
			// the whole 'entity' string... ?
			case "nbsp": x = " "; break;
			case "copy": x = "(c)"; break;
			case "lt": x = "<"; break;
			case "gt": x = ">"; break;
			case "amp": x = "&"; break;
		}
	}
	return x;
}
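The numeric branch of ResolveEntity can be made a little more forgiving. The sketch below is one possible approach, not the post's method: it adds a hypothetical 'hex' group for &#x...; references, uses int.TryParse so an over-long digit run falls back to the placeholder instead of throwing, and uses char.ConvertFromUtf32 so code points above 65535 come back as a surrogate pair. The class name EntityDecoder and its members are assumptions made for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EntityDecoder
{
    // Decimal (&#233;) and hexadecimal (&#xE9;) numeric references only;
    // named entities are left to the switch shown in the post.
    private static readonly Regex EntityResolver = new Regex(
        @"&#(?<unicode>[0-9]+);|&#[xX](?<hex>[0-9a-fA-F]+);");

    public static string Decode(string input)
    {
        return EntityResolver.Replace(input, ResolveNumeric);
    }

    private static string ResolveNumeric(Match m)
    {
        bool isHex = m.Groups["hex"].Success;
        string digits = isHex ? m.Groups["hex"].Value : m.Groups["unicode"].Value;

        int codePoint;
        bool parsed = int.TryParse(
            digits,
            isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None,
            CultureInfo.InvariantCulture,
            out codePoint);

        // Reject anything that is not a valid Unicode scalar value
        // (hex parsing can also yield negative numbers).
        bool valid = parsed
            && codePoint >= 0 && codePoint <= 0x10FFFF
            && (codePoint < 0xD800 || codePoint > 0xDFFF);

        // "X" is the same placeholder the post uses for unresolved entities.
        return valid ? char.ConvertFromUtf32(codePoint) : "X";
    }

    public static void Main()
    {
        Console.WriteLine(Decode("caf&#233;") == "caf\u00e9");            // prints True
        Console.WriteLine(Decode("&#x1F600;") == "\U0001F600");           // prints True
        Console.WriteLine(Decode("&#99999999999;") == "X");               // prints True
    }
}
```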

UPDATE: the list of HTML 4 entities could be used to write a robust ResolveEntity() method using the 'pattern' above.
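The comment inside ResolveEntity wonders whether a framework HtmlDecode call would handle the whole string. On later framework versions (.NET 4.0 onwards) System.Net.WebUtility.HtmlDecode does resolve both numeric references and the named HTML entities, so a hand-written table may not be needed at all. A small sketch, assuming that newer runtime is available; the class name BuiltInDecodeDemo is hypothetical.

```csharp
using System;
using System.Net;

public static class BuiltInDecodeDemo
{
    public static void Main()
    {
        // Mixes named entities (&lt; &gt; &amp; &egrave;) with a numeric one (&#233;).
        string decoded = WebUtility.HtmlDecode(
            "&lt;b&gt;caf&#233; &amp; cr&egrave;me&lt;/b&gt;");

        Console.WriteLine(decoded == "<b>caf\u00e9 & cr\u00e8me</b>"); // prints True
    }
}
```

Unlike the switch in the post, this returns the real characters (for example U+00A9 for &copy; rather than "(c)"), so it only fits where the output can carry non-ASCII text.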