Summary: this page describes how to configure the automatic UTF-8 conversion feature available in Console ≥ 2.9.
Introduction
When source code delivered in CAST Console contains characters whose encoding or character set is not UTF-8, problems can occur during the analysis phase. CAST Console (when using the Workflow - Application onboarding with Fast Scan) warns you that the source code contains non-UTF-8 encoded characters and lists the impacted files - see 2.9 - Onboarding with Fast Scan - redesigned job progress screen:
Alert in the Job Progress panel
CAST Console includes a feature to convert these files automatically to UTF-8 during the Content Discovery step of an onboarding with Fast Scan to reduce the number of warnings in the analysis log and improve the analysis results. This feature is disabled "out-of-the-box" and requires a modification to the Node configuration file to enable it. See instructions below.
Step 1 - Edit the properties file
The configuration file is present on each Node. If you have more than one Node, you must modify the configuration file on every Node where you want to enable the feature:
Mode | Release | Configuration file |
---|---|---|
Enterprise mode | ≥ 2.x | %PROGRAMDATA%\CAST\AIP-Node\application-default.yml |
Standalone mode | ≥ 2.x | %PROGRAMDATA%\CAST\AIP-Console-Standalone\application-standalone.yml |
Note that the standard configuration is present by default in the %PROGRAMDATA%\CAST\AIP-Node\application.yml or %PROGRAMDATA%\CAST\AIP-Console-Standalone\application.yml file. However, you should always make your customizations in application-default.yml / application-standalone.yml, because this file overrides the content of application.yml and is never overwritten during an upgrade.
Step 2 - Make the changes
Copy the following into application-default.yml / application-standalone.yml under the existing "application" entry:

    application:
      sourceCodeFiles:
        assumedEncoding: JVM
        conversionToUtf8:
          enabled: true
          appNameFilter:
          backupFiles: true
          pathsInReports: absolute
          fileExtensionsAdded:
          fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip
          preservedEncodings: ".*UTF-(16|32).*"
          logRetainedFiles: false
          logRejectedFiles: false
          logConvertedFiles: true

Item | Description |
---|---|
application.sourceCodeFiles.assumedEncoding | Character set that will be assumed for source files whose character set could not be found. Can be "JVM" (default setting), standing for "the default Charset of this Java virtual machine", or the name of any character set supported by the current JVM. |
application.sourceCodeFiles.conversionToUtf8.enabled | Whether automatic conversion of non UTF-8 source code files to UTF-8 is enabled during Onboarding with Fast Scan - the conversion is applied during the Content Discovery step. Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.appNameFilter | Regular expression defining, if any, the name(s) of the only applications whose non UTF-8 source files must be converted. No filtering occurs when this field is empty. Ensure that the regular expression is surrounded with double quotes (") to avoid issues because of characters having special meaning in a YAML file, such as '+' or '#' for instance. |
application.sourceCodeFiles.conversionToUtf8.backupFiles | Whether source code files that will be converted to UTF-8 should be backed up before conversion. Set to true or false. If true, each file is backed up in the same directory as the original file, under the same name with the suffix ".genuine@<date>-<time>(<encoding>)" appended, where <date>-<time> is the moment at which the Fast Scan / Content Discovery started (the same date and time for all files) and <encoding> is the encoding or character set found (or assumed) for the file. For instance, the backup of the non-UTF-8 file "main.java" found to be encoded in "Shift_JIS" is named "main.java.genuine@20230525-084129(Shift_JIS)" if the Fast Scan / Content Discovery started on May 25th, 2023, at 08:41:29 AM. If several versions of the same files are delivered over time, their repeated conversion to UTF-8 creates repeated backup files, since the backup filenames are timestamped. To avoid accumulating outdated backup files, once the conversion of all application files to UTF-8 has completed, only the most recent backup of each file that has again been converted to UTF-8 is kept. Once this process has ended, the following line is logged: [INFO] Old backup files (count = N) of '<application-name>' could be deleted in folder '<sources-root-path>' |
application.sourceCodeFiles.conversionToUtf8.pathsInReports | Configures the type of paths logged in reports about files of an application that have been considered for conversion to UTF-8, or that have not been considered because of filtering by file extension. Must be set to one of a fixed set of values; the configurations on this page use absolute. If an error occurs while detecting the encoding or character set of a file, or while converting it to UTF-8, the error report always contains the absolute file path. |
application.sourceCodeFiles.conversionToUtf8.logRetainedFiles | Whether to log the paths of files retained as candidates for conversion to UTF-8 (due to their filename extension). Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.logRejectedFiles | Whether to log the paths of files rejected as candidates for conversion to UTF-8 (because of their filename extension). Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.logConvertedFiles | Whether to log the paths of files that have actually been converted to UTF-8. Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.fileExtensionsAdded | Optional comma-separated list of file extensions, written without a leading dot. Files with these extensions are always considered for conversion to UTF-8 (in addition to the file extensions resulting from the application scan), because such files are known to contain source code. By default, this property is empty. For example: fileExtensionsAdded: txt, java |
application.sourceCodeFiles.conversionToUtf8.fileExtensionsRemoved | Optional comma-separated list of file extensions, written without a leading dot. Files with these extensions are never considered for conversion to UTF-8, either because they are known not to be source code files or because they should not be converted to UTF-8. By default, this property contains a set list of file extensions, which can be edited (i.e. existing entries can be removed and new ones added). For example: fileExtensionsRemoved: xls, xlsx |
application.sourceCodeFiles.conversionToUtf8.preservedEncodings | Optional regular expression defining the names of the character sets and encodings for which source code files (found with a matching character set or encoding) will not be converted to UTF-8. For convenience, matching is done regardless of letter case. If this property is empty or disabled, no file is excluded because of the character set or encoding that was found or assumed for it. By default, this property contains the following regular expression, which ensures that source code files found with either UTF-16 or UTF-32 are not converted to UTF-8 (additional expressions can be added): preservedEncodings: ".*UTF-(16|32).*" The property value should be surrounded with double quotes (") to avoid issues because of characters having special meaning in YAML, such as '+' or '#'. |
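The backup naming and clean-up rules in the backupFiles row of the table can be sketched as follows. This is a minimal Python illustration of the naming scheme only, not CAST Console's own implementation. The helper names `backup_name` and `latest_backups` and the second timestamp are hypothetical. The `YYYYMMDD-HHMMSS` layout is inferred from the example "main.java.genuine@20230525-084129(Shift_JIS)".

```python
import re
from datetime import datetime

# Suffix format taken from the backupFiles description:
#   <name>.genuine@<date>-<time>(<encoding>)
# The YYYYMMDD-HHMMSS layout is inferred from the documented example.
BACKUP_RE = re.compile(
    r"^(?P<original>.+)\.genuine@(?P<stamp>\d{8}-\d{6})\((?P<encoding>[^()]+)\)$"
)


def backup_name(file_name, scan_started, encoding):
    """Build a backup file name for file_name (hypothetical helper)."""
    stamp = scan_started.strftime("%Y%m%d-%H%M%S")
    return f"{file_name}.genuine@{stamp}({encoding})"


def latest_backups(file_names):
    """Map each original file name to its most recent backup name."""
    latest = {}
    for name in file_names:
        match = BACKUP_RE.match(name)
        if match is None:
            continue  # not a backup file, ignore it
        original, stamp = match.group("original"), match.group("stamp")
        # YYYYMMDD-HHMMSS stamps sort chronologically as plain strings.
        if original not in latest or stamp > latest[original][0]:
            latest[original] = (stamp, name)
    return {original: name for original, (_, name) in latest.items()}


first = backup_name("main.java", datetime(2023, 5, 25, 8, 41, 29), "Shift_JIS")
second = backup_name("main.java", datetime(2023, 6, 1, 9, 0, 0), "Shift_JIS")

print(first)  # main.java.genuine@20230525-084129(Shift_JIS)
print(latest_backups([first, second, "main.java"]))
```

A script along these lines can help identify which backup files in a source folder are the ones expected to survive the clean-up. Check it against the log line quoted in the table before relying on it.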
Below is an example configuration that enables automatic conversion to UTF-8 of all non-UTF-8 source code files (except those whose existing character set or encoding matches either UTF-16 or UTF-32), provided the name of the application they belong to contains "foo" (case-sensitively) or "bar" (case-insensitively). Note that the ".+" on either side of the expression requires at least one character before and after the matched text, so a name that starts or ends with "foo" or "bar" is not selected. Backup of the source code files is enabled. In addition, the character set assumed for source code files whose character set could neither be found (for instance thanks to a BOM) nor guessed (by sampling followed by a validation of the guess against the entire file) is set to ISO-8859-1, corresponding to the Western European codepage.

    application:
      sourceCodeFiles:
        assumedEncoding: ISO-8859-1
        conversionToUtf8:
          enabled: true
          appNameFilter: ".+(foo|[bB][aA][rR]).+"
          backupFiles: true
          pathsInReports: absolute
          fileExtensionsAdded:
          fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip
          preservedEncodings: ".*UTF-(16|32).*"
          logRetainedFiles: false
          logRejectedFiles: false
          logConvertedFiles: true

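Before restarting a Node, it can be useful to try the two regular expressions from this example against sample values. The sketch below uses Python's `re` module purely for illustration. CAST Console runs on a JVM, and this page does not state whether it matches the whole string or a substring. For these two patterns the result is the same either way, because each begins and ends with a wildcard. The helper names `app_is_selected` and `encoding_is_preserved` and the sample values are hypothetical.

```python
import re

# Patterns copied from the example configuration.
APP_NAME_FILTER = r".+(foo|[bB][aA][rR]).+"
PRESERVED_ENCODINGS = r".*UTF-(16|32).*"


def app_is_selected(app_name: str) -> bool:
    """Return True if the application name matches the appNameFilter pattern."""
    return re.fullmatch(APP_NAME_FILTER, app_name) is not None


def encoding_is_preserved(encoding: str) -> bool:
    """Return True if the encoding matches preservedEncodings, ignoring case."""
    return (
        re.fullmatch(PRESERVED_ENCODINGS, encoding, flags=re.IGNORECASE)
        is not None
    )


# Hypothetical application names.
for name in ["my-foo-app", "xBaRx", "xFOOx", "foo", "foobar"]:
    print(f"{name!r}: selected for conversion = {app_is_selected(name)}")

# Sample character set names.
for enc in ["UTF-16LE", "utf-32", "UTF-8", "Shift_JIS"]:
    print(f"{enc!r}: preserved (not converted) = {encoding_is_preserved(enc)}")
```

With these inputs, "my-foo-app" and "xBaRx" are selected. "xFOOx" is not, because "foo" is matched case-sensitively. "foo" and "foobar" are not, because ".+" needs at least one character on each side of the match. "UTF-16LE" and "utf-32" are preserved, while "UTF-8" and "Shift_JIS" are not.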
Save the file when you have completed the changes.
Step 3 - Apply configuration changes
Restart the Node / Standalone release to ensure all changes are taken into account.