What you need to know about how a C/C++ project is built
Understanding how a C or C++ project is compiled and linked is the best way to configure analysis settings properly.
Building a C/C++ project produces one of three kinds of final artifact (also called assemblies):
- Static library (.lib, .a)
- Dynamic library (.dll, .so)
- Executable program (.exe)
Please note that the type of the final artifact does not impact the analysis settings, but it can be used to define the Analysis Unit. Nevertheless, the C/C++ languages define rules about duplicate names, which is why it is recommended to associate only one C/C++ project with each Analysis Unit.
The project build process is based on the following steps:
- Source code preprocessing: Each source code file (.c, .cpp, .cxx, .cc…) is processed to include additional lines of code coming from other files (typically header files: .h, .hh, .hpp, .tpp, .inl) and to generate the complete source code, aka the preprocessed source file. Note that a header file can be included into other header files.
- Compiling: Each preprocessed source file is compiled into an object file (.o, .obj).
- Link-editing: All relevant object files are linked together to produce the final artifact.
The next figure presents the different elements that contribute to the final artifact through the build process.
The critical step in C/C++ source code analysis is source code preprocessing, for the following reasons:
- This step of the build process is driven by the content of the source files and by various compiler and system options that must be set to the same values in the source code analysis settings. These options are often difficult to obtain, because developers build software projects through dedicated tools such as Make, MSBuild, and many others.
- A small issue in option management can lead to a failure to include a header, which will cause missing symbol declarations and meaningless analysis results.
Even if the C/C++ Analyzer tries to be robust against this type of error, the more complete and correct the preprocessing, the more relevant the analysis results.
Files that are included into others are generally specified through relative paths instead of absolute paths. This makes the source code project movable from one root folder to another. There are specific rules that must be respected regarding relative paths, especially about which root folder to take into account.
The complete rules are complex and can vary between different compilers. Each compiler has an ordered list of include paths that will be used as possible roots. If the source code contains preprocessor directives like “#include "a/b.h"” and the include path list contains “c:\folder1” and “c:\folder2”, then the compiler will first search for a file named “c:\folder1\a\b.h”, and if it does not find it, it will search for “c:\folder2\a\b.h”.
Environment profiles are proposed for some compilers to define include paths that enable the inclusion of the system headers provided with the compiler. However, if no environment profile is associated with a given compiler, it is possible to create a new custom one.
Macros can be used for different purposes in C or C++ programs. They can define constants or function-like structures, and they can be used to activate or deactivate pieces of code depending on a specific condition. This last behavior is particularly used to parameterize header file inclusion and, as such, macro management is important.
A macro can be checked as defined or not, and when it has been defined, it is possible to test the value associated with it.
Two situations must be considered regarding macro definition:
- They can be defined directly in the source code through preprocessor directives; these are taken into account automatically during source code preprocessing.
- Each compiler defines a list of built-in macros that can be complemented by additional macros defined on the command line when calling the compiler.
In the second situation, it is necessary to reflect the compiler's built-in macros and the command-line macros in the analysis settings. Note that this should be done by using environment profiles as much as possible.
Assembly organization and sources
By identifying the assemblies, their properties, and their dependencies, you will obtain the organization of the source code, which is both the basis for organizing the Analysis Units and the main input for their configuration.
An assembly is a binary file that corresponds either to a library or to an executable. We don't analyze assemblies. In order to do a proper analysis, we need to know how the source code is organized into assemblies.
When the customer provides the source code, they often provide a whole system that includes several assemblies; in some cases, the system can even have several hundred. Each individual source file typically belongs to one assembly, though it can also belong to none or to several. Included files are usually shared across assemblies.
Because the C/C++ Analyzer works like a compiler, and a compiler processes one assembly at a time, you must keep in mind that:
- You will need to create one Analysis Unit per assembly. You should not put the source code of several assemblies in a single Analysis Unit; otherwise, you may get ambiguities when assemblies implement artifacts of the same type and name (especially functions). This rule can be ignored when CAST is implemented exclusively for the CAST Engineering Dashboard AND the risk of ambiguity is very low.
- You will have to determine which source files should be used. Sometimes, based on the Operating System (see above), we may have to purposely ignore files. Although this type of situation is less common, it has to be addressed with customers who use CAST to generate precise blueprints.
You should ask the customer how assemblies are built (i.e. which source code to use, and their dependencies). In many cases, each assembly is built from files contained in dedicated folders. In situations where information regarding the build process is missing, you can refer to the build files for guidance. C/C++ code is often built using a generic mechanism instrumented by such files: usually makefiles on UNIX systems and vcproj files on Windows. When applications are built using simple makefiles, it is generally possible to figure out how the system is compiled. When makefiles are too complex, you should ask an expert in this field for help.
Database access API
By identifying if and through which API the C/C++ code accesses databases, we can collect additional information that will be used to configure the analyzers and that will affect resolution of links to database artifacts.
For assemblies containing references to database artifacts, you need to primarily identify:
- Which databases the C/C++ code relies on, and which releases or versions of those databases are used: this information should be provided by the customer. If you cannot get it from the customer, take the conservative position that any C/C++ code can access any database item in all the database schemas provided along with the source code.
- By what means the client code interacts with the database:
- Through embedded SQL (like PRO*C): in this case, the code contains statements beginning with the words EXEC SQL and ending with a semi-colon, with SQL code in between.
- Using the database's standard API (such as OCI or OCCI for Oracle, ODBC for SQL Server, CT-LIB for Sybase…).
- For OCI (Oracle8 and later), you will find many functions prefixed with OCI.
- For SQL Server and DB2 (which use the same ODBC API), you will find functions prefixed with SQL (such as SQLConnect).
- For Sybase, functions are prefixed with db (such as dbopen), bcp, cs, and ct_.
Note that the database standard API can be referenced directly in the assembly code, or through a library used by the assembly. In fact, when a library contains calls to DB APIs, you should consider that any other assembly depending on it may also use the database standard API (i.e. indirectly).
Note that some applications can use both embedded SQL and database APIs.
Dynamic code assessment
By detecting dynamic code, you will know whether you need to create Reference Pattern jobs to add links and augment known dependencies between assemblies. Sometimes C/C++ code dynamically loads libraries, gets pointers to functions from those libraries, and then executes them. This type of situation requires additional settings, as we will see later.
To recognize dynamic code use, you need to search the source code for the system function calls that are used to load or locate DLLs and to find functions within them:
- On Windows, the system functions to load/get a DLL are LoadLibrary, LoadLibraryEx, GetModuleHandle, and GetModuleHandleEx. The system function to get a pointer to a function (i.e. the exported-function address retriever) is GetProcAddress.
- On (most) UNIX systems, the load-library system function is called dlopen, and the exported-function address retriever is dlsym. The loaded code can be either a third-party library or another assembly to analyze. You can figure this out by looking at the names of the libraries dynamically loaded using LoadLibrary, LoadLibraryEx, or dlopen. If it is more complicated than just names, ask the customer which libraries may be loaded. If you cannot get this information, assume that analyzed code may be loaded.
When the loaded code belongs to the code you are analyzing, you must consider that the loading assembly depends (see Assembly organization and sources above) on any assembly it may load.
How to define analysis settings
The best solution is to involve a person who knows how the project is built. A C/C++ project is often based on different libraries that are delivered with their own header files and that must be used with specific compilation options. Dealing with libraries can be a complex operation.
Development teams often forget to deliver the system headers required to compile the C/C++ projects they send to the AIP administrator. This means the machine where the static code analysis is performed must contain a copy of those headers. This can be done either by installing the compiler, by copying the headers, or by using a fake version of the standard headers delivered with the CAST AIP distribution. Note that in the last case, the relevance of results may be slightly decreased.
To ease analysis settings tasks, CAST AIP 7.1 introduces the Test Analysis feature, which preprocesses and analyzes the source code without saving any information and without calculating any metrics. It is therefore faster than a full analysis and generates a report showing preprocessing and parsing problems so that the analysis settings can be refined. It is recommended to test analysis settings before performing a full source code analysis.
Analyzing Visual C++ projects
CAST AIP supports Visual C++ projects and can automatically extract the information required to define analysis settings. However, Visual C++ project files can depend on environment variables; in this case, these variables can be specified in the corresponding DMT package. Please refer to the DMT documentation for more details.
To analyze Visual C++ projects, create the Analysis Units through the DMT tool and, in CMS, start the operation by testing the analysis settings.
The most frequent issue is missing headers. Since all required include paths are specified in the "vc(x)proj" file, unresolved files are really missing and should be delivered.
Analyzing "cmake" and "makefile" projects
“cmake” project files (CMakeLists.txt) are not supported by CAST AIP. Nevertheless, it is possible to generate equivalent “vc(x)proj” files with CMake by selecting one of its Visual Studio generators (see the “cmake -G” option).
Please refer to the CMake documentation for more details. Translating “cmake” projects into “vc(x)proj” projects is a solution that must be considered carefully. It is strongly recommended to test the analysis settings before performing the source code analysis.
If the above solution does not allow you to perform the source code analysis and generate relevant results, it is possible to define the analysis settings manually. The configuration process is iterative:
- Get all required files in the source code delivery.
- Create the proper Analysis Units (usually one per root folder or one per project file).
- Add macros to analysis settings (possibly by reading project files or compilation log).
- Add include paths to analysis settings (possibly by reading project files, or compilation log).
- Test the analysis settings and tune them if necessary.
- Repeat the process until the analysis settings have been defined correctly.
A way to get the information required by C/C++ analysis settings is either to compile the application or to look at a log generated by the compiler. Macros and include paths are generally defined using the “-D” and “-I” options respectively. If these options do not appear, it is possible to force the compiler to be more verbose, or to run the “makefile” in dry-run mode using make's “-n” option, which prints the commands without executing them.
In that situation, it is possible to define include paths and macros at the same time. Otherwise, if this information is not available, it is possible to proceed iteratively. Since header files are often used to define macros (which can in turn configure the preprocessing), it is better to start with the include path settings. Then, once there are no more missing headers, macros used for configuration purposes can be defined, for instance WIN32, UNIX, GCC_VERSION, or USE_SSL. Keep in mind that the number of macros that are not defined in header files is generally low.
Questions to ask the customer
The explicit compiler options
By determining the explicit compiler options, you will determine important parameters for the analyzers.
When applications are built, the compiler can be called with specific options that you won't be able to identify directly from the source code, but that will affect the analysis results when CAST parses the code. The most important options are include paths and predefined macros (along with their possible associated values). These options need to be entered during the configuration of the Analysis Units in the CAST Management Studio.
You should ask a SME (Subject Matter Expert), or find out this information within the build scripts, makefiles, vcproj (on Windows when using MS Visual C++) or build logs.
CAST Management Studio allows you to test Analysis Units by performing an analysis without saving any results. The objective is to help users check whether all macros have been defined and whether all include files are available to the analyzer. This action is available in the Execute tab of the Application editor (Test Analysis) and in the Execute tab of the Analysis Unit editor (Test analysis on the current Analysis Unit). This action replaces the previous Macro Analyzer tool.
Identify additional requirements
This step will help you proactively identify system or external library headers that will require specific treatment when implementing the Analysis Units. This step is important if you set up your analysis for use with the CAST Engineering Dashboard using custom quality rules based on links to external functions. Detecting these needs sooner will save time by reducing the number of Analysis Units that need to be run to obtain the results you need.
Your customer may have specific needs in terms of dashboards and blueprints, which require ensuring that specific artifacts appear in the Analysis Service. For instance, if the code uses the ODBC API, you may need to build a Quality Rule that ensures each function calling SQLConnect also calls SQLDisconnect.
The problem with the C language is that, since declaring functions is not mandatory, code can use functions without including the headers that declare them. So we cannot count on this mechanism to get the used functions into the database.
Moreover, objects included this way are counted as part of the code, and thus taken into account in the CAST Engineering Dashboard (which you don't want). As such, you will need to identify these functions. They will be used to write a couple of C files that allow these functions to be loaded and used as link destinations. You will also have to create an extra Analysis Unit for this.