Sunday, February 8, 2009

Creating valid C# identifiers

In code generation, it’s very common to need to create a valid identifier. For instance, you may need to create a class name, field, property, method, etc. based on the name of a table, a file name, or a registry key. The problem with this type of generation is that the rules that make a name valid in one place may not apply in the other. For instance, “1st File.txt” is a perfectly valid file name, but it’s not a valid C# identifier.

Valid identifiers in C# are defined in the C# Language Specification, item 2.4.2. The rules are very simple:

- An identifier must start with a letter or an underscore
- After the first character, it may contain letters, digits, connector characters (such as the underscore), and combining or formatting characters
- If the identifier is a keyword, it must be prefixed with “@”
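The first two rules can be sketched as a character-level check over Unicode categories. This is only an illustration: the class and method names (IdentifierRules, FollowsFirstTwoRules) are hypothetical, the category list follows the same categories used in the regular expression later in this post, and keywords (the third rule) are deliberately not handled here.

```csharp
using System.Globalization;

static class IdentifierRules
{
    // Letter categories: Lu, Ll, Lt, Lm, Lo and Nl.
    static bool IsLetterCharacter(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
                return true;
            default:
                return false;
        }
    }

    // Rule 1: the first character is a letter or an underscore.
    public static bool IsIdentifierStart(char c)
    {
        return c == '_' || IsLetterCharacter(c);
    }

    // Rule 2: later characters may also be decimal digits (Nd), connectors (Pc),
    // combining marks (Mn, Mc) or formatting characters (Cf).
    public static bool IsIdentifierPart(char c)
    {
        if (IsLetterCharacter(c)) return true;
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.DecimalDigitNumber:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.Format:
                return true;
            default:
                return false;
        }
    }

    // Checks rules 1 and 2 only; a keyword such as "class" still passes.
    public static bool FollowsFirstTwoRules(string s)
    {
        if (string.IsNullOrEmpty(s) || !IsIdentifierStart(s[0])) return false;
        for (int i = 1; i < s.Length; i++)
            if (!IsIdentifierPart(s[i])) return false;
        return true;
    }
}
```

With this sketch, “1st File.txt” fails the check (it starts with a digit and contains a space and a period), while “_1st_File_txt” passes.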

Applying these rules is pretty straightforward. The following code applies the first two rules:

private string CleanName(string name)
{
    //Compliant with item 2.4.2 of the C# specification
    System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Nl}\p{Mn}\p{Mc}\p{Cf}\p{Pc}\p{Lm}]");
    //Replace every character that is not allowed in an identifier
    string ret = regex.Replace(name, "_");
    //Prefix with "_" when the result is empty or its first character is not a letter
    if (ret.Length == 0 || !char.IsLetter(ret, 0))
        ret = string.Concat("_", ret);
    return ret;
}


To handle the third rule, you can use the C# provider as follows:

ret = Microsoft.CSharp.CSharpCodeProvider.CreateProvider("C#").CreateEscapedIdentifier(ret); 


The CleanName method above generates an underscore for each invalid character in the name, not only for spaces. For instance, “c:\1st file.txt” will be generated as “c__1st_file_txt”. If you want to prevent that, replace regex.Replace(name, "_") with regex.Replace(name, ""). You may also consider capitalizing the first letter after each “_” and then eliminating the “_”.
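The capitalization idea could look something like the sketch below. The class and method names (IdentifierCasing, ToPascalCase) are hypothetical, and the input is assumed to be a name in which invalid characters have already been replaced with “_”.

```csharp
using System.Text.RegularExpressions;

static class IdentifierCasing
{
    // Uppercases the first character and the character that follows each run
    // of underscores, then drops any underscores that remain.
    public static string ToPascalCase(string cleaned)
    {
        string capitalized = Regex.Replace(
            cleaned,
            "(?:^|_+)([^_])",
            m => m.Groups[1].Value.ToUpperInvariant());
        return capitalized.Replace("_", "");
    }
}
```

For “c__1st_file_txt” this produces “C1stFileTxt”. Removing the underscores can expose a leading digit (“_1st” becomes “1st”), so in this sketch the first-character check would have to run after the casing step, not before it.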


Finally, you may prefer to mark “keyword” identifiers with a prefix other than “@”. If that’s the case, use IsValidIdentifier in the CodeDomProvider, which returns false for keywords, to know which identifiers need the prefix. The full code snippet is below:

private static string CleanName(string name)
{
    //Compliant with item 2.4.2 of the C# specification
    System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Nl}\p{Mn}\p{Mc}\p{Cf}\p{Pc}\p{Lm}]");
    string ret = regex.Replace(name, "");
    //Prefix with "_" when the result is empty, does not start with a letter, or is a keyword
    if (ret.Length == 0 || !char.IsLetter(ret, 0) || !Microsoft.CSharp.CSharpCodeProvider.CreateProvider("C#").IsValidIdentifier(ret))
        ret = string.Concat("_", ret);
    return ret;
}

The only problem you may still run into is duplicate identifiers. “c:x” will generate the same identifier as “c.x”, which may be a problem depending on your particular code generation needs. If you run into this, keep a list of the identifiers already used. When you find that an identifier has been used, append a number to it and check again.
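A minimal sketch of that bookkeeping follows. The UniqueNameRegistry class and its MakeUnique method are hypothetical names, and a HashSet is used in place of a plain list so that the "already used" lookup stays cheap.

```csharp
using System;
using System.Collections.Generic;

sealed class UniqueNameRegistry
{
    // Identifiers handed out so far (case-sensitive, like C# itself).
    private readonly HashSet<string> used = new HashSet<string>(StringComparer.Ordinal);

    public string MakeUnique(string identifier)
    {
        string candidate = identifier;
        int suffix = 1;
        // HashSet.Add returns false when the name is already taken,
        // so keep trying identifier1, identifier2, ... until one is free.
        while (!used.Add(candidate))
        {
            candidate = identifier + suffix;
            suffix++;
        }
        return candidate;
    }
}
```

Calling MakeUnique("c_x") three times on the same registry returns “c_x”, “c_x1” and “c_x2”. Appending digits keeps the result a valid identifier, since digits are allowed after the first character.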

9 comments:

  1. Thanks! Very useful.

    Mauricio.
  2. Nice and concise. Thanks for the regex! Saved a bunch of time and made me learn the \p{} stuff I didnt know about regex.
  3. thanks;;;coz ilearn;;;;;;
  4. In the last snippet "&&" is missing in the if condition. Currently it is:

    if (!char.IsLetter(ret, 0) !Microsoft.CSharp.CSharpCodeProvider.CreateProvider("C#").IsValidIdentifier(ret))

    but it should be

    if (!char.IsLetter(ret, 0) && !Microsoft.CSharp.CSharpCodeProvider.CreateProvider("C#").IsValidIdentifier(ret))
  5. Just have a question. What if I use _@tr@es as an identifier. Is it valid?
  6. anything that's a valid c# identifier should be valid, but that's not a valid c# identifier, so it will be converted.