Introduction

If you want to use the Universal Analyzer to analyze a language, or you want to use the Universal Importer tool to import data generated by an external analyzer, in addition to defining the objects that can exist in a given language (see Defining a new language ), you also need to describe how to actually analyze the language. To define how to analyze a language, you need to define the following elements in the xxxLanguagePattern.xml file:

  • Where the language starts/stops in a file; useful when several languages are mixed together in a file
  • How comments are defined
  • How strings are defined
  • All the reserved keywords
  • All the types of objects and the patterns used to find these objects in the source code
  • Embedded SQL (if it exists) start and end tags

The main points are:

  • Rules for pattern matching are regular expressions
  • The file containing these rules is only used when the source code is analyzed. It describes the syntax of the objects and their properties in a simple and straightforward manner. Executables like CAST Enlighten do not need this file. This file is not stored in the CAST Analysis Service. A copy of this file must be saved at a safe location.
  • The content of this file is related to the content of the xxxMetaModel.xml file. Use only objects, categories and properties already defined in this file.

If you want to retrieve comments, you must define the pattern for matching comments.

xxxLanguagePattern.xml

Location

For each language supported by the Universal Analyzer, you must add a file that describes how to analyze the language. The file name and location must conform to the following points:

  • The name of the file is composed of the name of the language (name of the category as defined in the metamodel) followed by "LanguagePattern.xml" (such as "PHPLanguagePattern.xml").
  • The language pattern file must be located in the language package directory.
Please note that if this is not respected, the files will not be analyzed.

Structure

When the name is followed by the character '*', '+' or '?', this means that the element can occur respectively 0 to n times, 1 to n times, and 0 or 1 times. Otherwise the element is mandatory.

When the value is a regular expression (like for matching patterns) any character that is a special character in a regular expression must be preceded by an escape character, e.g.:

'?'

becomes

'\?'

If the regular expression contains characters that are not accepted in XML or will make the document incorrect, the value of the XML element must be contained in a CDATA tag.

Main elements

Element

Child Elements

 

Value

Description

Sample

languagePattern

begin, end, escape, comment, string, keyword, "object-type"

 

-

The root of the document.

 

 

begin?

 

RegExp

Contains the starting language characters.

<?
<![CDATA[<\?]]>

 

end?

 

RegExp

Contains the ending language characters.

?>
<![CDATA[\?>]]>

 

escape?

 

char

The escape character for all the elements of the language.

<![CDATA[\]]>

 

comment*

begin, end, nested, multiline

-

Description comments

 

 

 

Begin

RegExp

Beginning of the comment

<![CDATA[#]]>

 

 

End

RegExp

End of the comment

<![CDATA[\r\n]]>

 

 

nested?

boolean

Can the comment be nested? For example, '/*' C++ comments can't be nested.
Default value is false.

 

 

 

multiline?

boolean

Can the comment spread over several lines or not.
Default value is false.

 

 

string*

begin, end, escape

-

Describe a string - used (amongst other things) to handle word based metrics, such as Copy/Paste and Halstead metrics

 

 

 

begin

RegExp

Beginning of the string

<![CDATA["]]>

 

 

end

RegExp

End of the string

<![CDATA["]]>

 

 

escape

char

Escape character for strings.

 

 

keyword*

 

String - used (amongst other things) to handle word based metrics, such as Copy/Paste and Halstead metrics

A reserved keyword of the language.

protected, interface

Note that using Regular Expressions is not recommended in the keyword section.

Note that the following characters must be escaped using a backslash:

( )  [ ] ? * - +

For example, if you need to add list() it should be entered as follows (note the escaping on the brackets):

<keyword>list\(\)</keyword>  

 

operator*

 

RegExp – used to handle word based metrics, such as Copy/Paste and Halstead metrics

Allows a token to be specified as an arithmetical and logical operator.

For example, for C++:

+, -, *, /, %
<<, >>, <<=, >>=
++, --
<, >, <=, >=, !=, ==
&, |, ~, ^
&&, ||,  !
<operator>
<![CDATA[[!-/:-@\[-\_]]]>
</operator> 

 

numeric

 

RegExp - used to handle word based metrics, such as Copy/Paste and Halstead metrics

Allows a token to be specified as numeric literal.

<numeric>
<![CDATA[[[0-9]+]]>
</numeric>

 

identifier

 

RegExp – used to handle word based metrics, such as Halstead metrics

Allows a token to be specified as an identifier.

<identifier>
<![CDATA[([A-Za-z]|"_")([0-9]|[A-Za-z]
|"_"|"@")*]]>
</identifier>

 

Types?

"object-type"

See table below in Object type elements

 

 

 

Links?

"link-type"

 

 

Links ?

 

"link-type"

pattern,
callee

-

Patterns for link recognition.

<inheritLink>
<pattern>
<callee>
</inheritLink>

 

 

pattern

RegExp

Pattern used to find the anchor of the link.

<pattern>
extends([_]|[\r\n]|[\t])+
</pattern>

 

callee

backward

RegExp

Pattern used to find the callee of the link.

<callee>
<![CDATA[[:word:]]
[[:word:]]*]]>
</callee>

 

 

backward?

empty

Specifies when a callee pattern should be 'backward' matched.

Empty element.

Object type elements

Element

Child Elements

 

Value

 

Description

Sample (PHP)

header

pattern, begin, end

 

-

 

Patterns for object header recognition.

<header><pattern>…
<begin>…
<end>…
</header>

 

pattern

 

RegExp

 

Pattern used to find the anchor of the header.

<pattern>
[fF][uU][nN][cC][tT][iI]
[oO][nN]([ ]|[\r\n]|[\t])+]
</pattern>

 

begin?

 

RegExp

 

Beginning of the header

<begin>
public:([_]|[\r\n]|[\t])+]
</begin>

 

end?

 

RegExp

 

End of the header

<end>
\(.*\)([_]|[\r\n]|[\t])+]
</end> 

category.property

pattern, value, backward

 

 

 

Pattern used for object property recognition.

<identification.name><pattern>…
<value>…
<backward>
</identification.name>

 

pattern

 

RegExp

 

Pattern used to find the anchor of the property.

<pattern>
[fF][uU][nN][cC][tT][iI][oO]
[nN]([ ]|[\r\n]|[\t])+]
</pattern>

 

value

backward

 

RegExp

The Value of the property

<value>
<![CDATA[[[:word:]]
[[:word:]]*]]>
</value>

 

backward

empty

 

 

Specifies when a property pattern should be 'backward' matched.

Empty element:
<backward/>

endwithoutbody

-

 

RegExp

 

Used to determine the end of an object which doesn't have a body

<endwithoutbody>;
</endwithoutbody>

body

begin, end, nested, noendpattern

 

-

 

Pattern for object body recognition

<body>
<begin>…
<end>…<nested>…
</noendpattern>
</body>

 

begin

 

RegExp

 

Pattern for the beginning of the body of the object.

<begin>
<![CDATA[{]]>
</begin>

 

end

 

RegExp

 

Pattern for the end of the body of the object.

<end>
<![CDATA[}]]>
</end>

 

nested?

 

boolean

 

Can the character for the beginning and end of the body be found inside the body?
Default value is 'true'.

<nested>false</nested>

 

noendpattern

empty

-

 

Use this element to indicate to the Universal Analyzer that an object does not have an end tag. You can also use this at type level.

Empty element:
<noendpattern/>

noendpattern

empty

 

-

 

Use this element to indicate to the Universal Analyzer that an object does not have an end tag. You can also use this at body level

Empty element:
<noendpattern/>

Note about the <backward/> element

Property values or link callees can be bi-directionally matched. The default match is 'forward', but when values are located before the pattern, you may use the "backward" element to reverse the search direction.

(Backward Properties) << "Object Type Pattern" >> (Forward Properties)

Example:

There may be a function definition in a language 'lng', whereas your code contains the following statement:

lng_my_function function
\{
\}

When configuring your 'lng' language, specify the search direction of the property 'identification.name' by adding the 'backward' element so that the function name will be searched before the matched object type pattern "function":

<lngFunction>
<header> ...</header>
<identification.name>
<pattern>function</pattern>
<value>
<![CDATA[([a-z])+]]>
<backward/>
</value>
</identification.name>
<body> ...</body>
</lngFunction>

Objects with no end pattern and siblings that are recognized using <backward/>

Please see Appendix - Special Cases for more information about this.

Embedded SQL support

If the language you want to analyze includes embedded SQL (i.e. calls to server-side database objects), then you need to tell the Universal Analyzer how to identify the embedded calls. This can be done using a specific tag as follows:

<esql>
<begin><![CDATA[BEGIN_ESQL]]></begin>
<end><![CDATA[END_ESQL]]></end>
</esql>

Modify the [BEGIN_ESQL] and [END_ESQL] fields to reflect the start and end tags of your embedded SQL code. You can add as many <begin> and <end> tags as necessary, for example:

<esql>
<begin><![CDATA[BEGIN_ESQL_1]]></begin>
<end><![CDATA[END_ESQL_1]]></end>
</esql>
<esql>
<begin><![CDATA[BEGIN_ESQL_2]]></begin>
<end><![CDATA[END_ESQL_2]]></end>
</esql>

Please make sure you also modify the corresponding xxxMetaModel.xml file in your language package so that each object that needs to be searched for embedded SQL inherits from the ESQLSearchable category. You can find out more about this in the section Embedded SQL support in the page Defining a new language.