Chunks of the following text are taken more or less verbatim from the original CARO naming scheme document, ‘naming.txt’, although the order of presentation differs. Where current practice has extended the naming scheme, appropriate new material has been interjected. Except when discussing why we should now consider deviating from any formerly suggested ‘rules’, the original ‘naming.txt’ material will not be specifically indicated or acknowledged further.
In a sense, the naming scheme describes an alphabet and a grammar. That is, it comprises sets of symbols from which various name components can be constructed and a set of rules defining how name components are constructed and legally combined, and what name components must be constructed for various kinds of malware entities. Perhaps the easiest part of the standard to define is the set of allowable characters, and as that forms the base for everything else, it seems the logical place to start.
Identifiers
Each name part or component is an identifier. Apart from the crucial ‘family name’ component, each of these identifiers is separated from those that come before and/or after it with a prescribed delimiter string.
Identifiers are constructed from the characters constrained by this regular expression:
[A-Za-z0-9_\-]
This means that the space character has been removed from the list of valid characters. This will force some products to rename a lot of malware to stay conformant with the standard, but it is felt that this is worthwhile. Existing names containing spaces should either have the spaces replaced with underscores or have the spaces removed because, for the purposes of name comparisons, the underscore is considered logically equivalent to ‘no character’.
It should also be noted that ‘!’ and ‘#’ have been moved from this set to the ‘valid only in delimiters’ set (discussed in a few paragraphs), and both single quote characters, ‘’’ and ‘’’, and the dollar sign, ‘$’, have been removed from the valid characters set. All five of these characters should simply be removed from any existing names that include them. The percent symbol, ‘%’, and the ampersand, ‘&’, have also been removed, and where they have been used in existing names they should be replaced with the string ‘_Pct_’ or ‘_And_’ respectively (or ‘_Pct’ if ‘%’ was the final character in a name). As with replacing spaces with underscore characters, such altered names may have the underscores dropped, according to the practice of the developer concerned.
The underscore and hyphen must be avoided at the beginning of all identifiers, apart from one special case described later. Identifiers are case-insensitive, but it is recommended that name components be composed of mixed-case characters, if for no other reason than that this usually improves readability. Use of the underscore (‘_’) where a space might normally appear (or where one appeared in names conformant with the previous version of this specification) is permitted.
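The character rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the standard: the function names are hypothetical, the ‘one special case’ for a leading underscore or hyphen is not modelled, and the two removed single quote characters are assumed to be the apostrophe and the backtick. The replacements for ‘%’ and ‘&’ are written without underscores, which the text leaves to the developer's practice.

```python
import re

# Character class quoted from the text; requiring one or more
# characters is an assumption of this sketch.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9_\-]+")


def is_valid_identifier(identifier: str) -> bool:
    """Check the allowed character set and the leading-character rule."""
    if not IDENTIFIER_RE.fullmatch(identifier):
        return False
    # The special case that permits a leading '_' or '-' is not modelled.
    return identifier[0] not in "_-"


def migrate_legacy_name(name: str) -> str:
    """Rewrite an old-style name component (hypothetical helper)."""
    # '%' and '&' become words; the underscore-free form is used here.
    name = name.replace("%", "Pct").replace("&", "And")
    # Spaces become underscores (removing them outright is also allowed).
    name = name.replace(" ", "_")
    # '!', '#', '$' and the two quote characters are simply dropped.
    # ASSUMPTION: the quote characters are the apostrophe and backtick.
    for ch in "!#$'`":
        name = name.replace(ch, "")
    return name
```

For instance, `migrate_legacy_name("Green Caterpillar")` gives `Green_Caterpillar`, which then passes `is_valid_identifier`.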
Use of numbers
Numeric characters are best avoided entirely, except where they are required in the purely numeric ‘infective length’ and ‘devolution’ identifiers described later, and in other rare instances where it is desirable to identify version numbers in a name part as distinct from using sub-variant identification (see the discussion of naming viruses derived from kits for an example of a situation where the ‘no numerics in family names’ directive may be judiciously broken). Further, because names are often relayed orally, ‘number names’ (those that relate a number, even if they are not purely composed of numeric characters) are best avoided entirely. Under this guideline, ‘Eight941’ is a poor (although now established) family name (see below): because names such as ‘1704’ were allowed in the past, homophonic conflicts could lead to further confusion.
‘naming.txt’ allowed quite a number of non-alphanumeric (punctuation and other symbol) characters in identifiers, but aesthetic considerations and the general practice of most developers of avoiding most of these formerly allowed non-alphanumerics have resulted in their removal. Further, some of these characters can cause problems in some products because of their existing special meanings, be they format operators to the C print routines, logical operators and so on.
Some of these characters are well avoided for another reason: to allow flexibility in the future extension of the standard set of special modifiers. Although the need to add further special modifiers should be a rare event, if it becomes necessary it may also require the addition of a new delimiter. One way to do this is to select a character outside both of the currently allowed character sets. The other way is to take a character that is currently legal in the identifier set and reserve it for use as a delimiter. Although the first has been the preferred approach until now, it was felt that the number of ‘useful’ unused characters was dwindling. However, the second approach has the obvious downside of requiring malware with names already using that character to be renamed. Fortunately, most developers have largely avoided the non-alphanumeric characters, and they have been very rarely used in ‘official CARO malware names’. Thus it was decided that, before more names using the non-alphanumerics were assigned, the largely unused non-alphanumerics should be removed from popular use. This will require a modicum of renaming but should obviate the need for future renamings should further classes of special modifiers need to be assigned their own delimiters. Finally, although new delimiters could be created by combining existing delimiter characters into delimiter strings, as was done when the malware type identifier was introduced, in general this approach is not preferred.
As this new standard effectively removes all ‘useful’ non-alphanumeric characters from the ‘valid in identifiers’ set, future new delimiters will have to be added from those characters currently not in either set (developers implementing some form of sanity checking on names may wish to keep this in mind when writing or revising code to perform such checks).
Length of identifiers
In general, name components may be up to twenty characters long, allowing for such family name monstrosities as ‘Green_Caterpillar’ and (long-form) infective PlatformNames such as PowerPoint97Macro and VisualBasicScript. However, shorter names should be used whenever possible, and in some cases specific lengths are mandated or strongly preferred. For example, the identifier for a single locale modifier must be the standard two-character abbreviation for the relevant locale (apart from the special exception that indicates ‘generic Unicode’ as the locale), and although PlatformNames may be up to twenty characters long, it is expected that a ShortFormPlatformName of no more than five characters will also be agreed for each platform (and is likely to be the name used in all but the most formal of situations). Use of ShortFormPlatformNames is encouraged (FullySpecifiedMalwareNames quickly become very bloated otherwise), although use of the long form is always acceptable in technically demanding naming circumstances.
With the family name component, if a shorter name is just an abbreviation of a long name, it is generally better to use the long name. Also, although brevity in each name component is desirable, it should not be slavishly sought. Two- and three-letter abbreviations that are natural and entirely obvious should not be avoided (thus JS, WM and VBS are quite acceptable ShortFormPlatformNames). However, artificially short abbreviations may unnecessarily deplete the namespace (e.g. JV is not acceptable as a short form of the Java platform name; at four characters in their long form, PlatformNames such as Java need not be abbreviated, and must not be if the only options are as artificial as this).
Also, likely future namespace confusion should always be considered when looking for a suitable ShortFormPlatformName. For example, ‘AS’ as a short form for a scripting platform whose name starts with ‘A’ would be inappropriate, as there are already two ‘A-script’ platforms and consideration of possibly virusable platforms should quickly turn up a couple more, and that’s just the potential platforms we know of today! Other specific identifier length limits are mentioned in the relevant sections below.
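Developers who sanity-check names might encode the two general length limits along the following lines. This is a minimal sketch: the constant and function names are hypothetical, underscores are assumed to count towards the length, the five-character short-form limit is an expectation rather than a hard rule in the text, and the two-character locale rule (with its ‘generic Unicode’ exception) is not modelled.

```python
# Limits taken from the text: twenty characters for any name component,
# and an expected five for a ShortFormPlatformName.
MAX_COMPONENT_LENGTH = 20
MAX_SHORT_PLATFORM_LENGTH = 5


def length_ok(identifier: str, short_platform: bool = False) -> bool:
    """Return True if the identifier fits the relevant length limit."""
    limit = MAX_SHORT_PLATFORM_LENGTH if short_platform else MAX_COMPONENT_LENGTH
    # ASSUMPTION: every character, including '_', counts towards the length.
    return 1 <= len(identifier) <= limit
```

Under these assumptions ‘Green_Caterpillar’ and VisualBasicScript both pass the general check, while only abbreviations such as VBS pass when `short_platform=True`.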
What does all the above mean in practice? As a malware family name, ‘MyParty’ is preferable to ‘Myparty’ but technically equivalent to it (remember, identifiers are case-insensitive); both are preferable to ‘myparty’ and ‘MYPARTY’. ‘MyParty’ and ‘My_Party’ are both acceptable names and, although not ideographically equivalent, are considered the same name under this scheme (underscore is equivalent to ‘no character’). However, ‘My__Party’ (two consecutive underscores) is strongly discouraged on aesthetic and ambiguity grounds.
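The equivalence rules in this example (case-insensitivity, and underscore treated as ‘no character’) suggest a simple comparison key. The following is a sketch only; the function names are hypothetical and nothing here is prescribed by the scheme itself.

```python
def comparison_key(identifier: str) -> str:
    """Reduce an identifier to the form used for name comparison."""
    # Underscores count as 'no character'; case is ignored.
    return identifier.replace("_", "").lower()


def same_name(a: str, b: str) -> bool:
    """True if two identifiers are the same name under the scheme."""
    return comparison_key(a) == comparison_key(b)
```

With this key, ‘MyParty’, ‘My_Party’, ‘myparty’ and ‘MYPARTY’ all compare equal. The discouraged ‘My__Party’ does too, which is one reason the preference among these forms is a matter of readability rather than identity.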
Delimiters
Having mentioned delimiters a few paragraphs back, we should now define them. Delimiters are valid characters in FullySpecifiedMalwareNames (FSMNs), but invalid in all identifiers (except for ‘!’ being allowed recursively ‘inside’ a vendor-specific comment; more below). The current set of characters reserved for use in or as delimiters is defined by the regular expression:
[!#./:@]
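Because the identifier and delimiter character sets do not overlap, a full name can be split into alternating runs of identifier and delimiter characters before any grammar is applied. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the sample string ‘Platform/Family.A’ is made up to exercise the character sets and is not claimed to be a well-formed name, and the exception allowing ‘!’ inside a vendor-specific comment is not modelled.

```python
import re

# Both character classes are quoted from the text.
IDENTIFIER_CHARS = r"[A-Za-z0-9_\-]"
DELIMITER_CHARS = r"[!#./:@]"

# A token is a run of identifier characters or a run of delimiter characters.
TOKEN_RE = re.compile(f"{IDENTIFIER_CHARS}+|{DELIMITER_CHARS}+")
ANY_VALID_RE = re.compile(f"{IDENTIFIER_CHARS}|{DELIMITER_CHARS}")


def invalid_characters(name: str) -> list[str]:
    """List the characters that belong to neither allowed set."""
    return [ch for ch in name if not ANY_VALID_RE.fullmatch(ch)]


def split_runs(name: str) -> list[str]:
    """Split a name into alternating identifier and delimiter runs.

    Characters outside both sets are skipped; call invalid_characters()
    first to detect them. No grammar checking is attempted.
    """
    return TOKEN_RE.findall(name)
```

For the made-up string above, `split_runs` yields the runs `Platform`, `/`, `Family`, `.` and `A`, and `invalid_characters` returns an empty list.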
In the sections below detailing the use of each name component, the delimiter for that component is included as a literal character or literal string, followed by, or following (as appropriate), the identifier name. In general, the use of the delimiters should be obvious, with the discussion of each name component focussing on the details of properly choosing, using or specifying that component and/or its variable parts.