Summary: this page describes how to configure the automatic UTF-8 conversion feature available in Console ≥ 2.9.

Introduction

When source code delivered in CAST Console contains characters whose encoding or character set is not UTF-8, problems can occur during the analysis phase. CAST Console (when using the Workflow - Application onboarding with Fast Scan) warns you that the source code contains non-UTF-8 encoded characters and lists the impacted files - see 2.9 - Onboarding with Fast Scan - redesigned job progress screen:

Alert in the Job Progress panel

CAST Console includes a feature to convert these files automatically to UTF-8 during the Content Discovery step of an onboarding with Fast Scan to reduce the number of warnings in the analysis log and improve the analysis results. This feature is disabled "out-of-the-box" and requires a modification to the Node configuration file to enable it. See instructions below.
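
As a rough illustration of what converting a single file involves, the sketch below checks whether raw bytes are already valid UTF-8 and, if not, re-encodes them from an assumed character set. It is not the Console implementation: the function name to_utf8 and the ISO-8859-1 default are assumptions made for this example, and the real feature also tries to detect the character set before falling back to an assumed one.

```python
from typing import Tuple


def to_utf8(data: bytes, assumed_encoding: str = "iso-8859-1") -> Tuple[bytes, bool]:
    """Return (utf8_bytes, was_converted) for the raw content of one file.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    try:
        data.decode("utf-8")  # already valid UTF-8: leave the content untouched
        return data, False
    except UnicodeDecodeError:
        # Not valid UTF-8: decode with the assumed character set, then re-encode.
        text = data.decode(assumed_encoding)
        return text.encode("utf-8"), True


legacy = "caf\u00e9".encode("iso-8859-1")  # a lone 0xE9 byte is not valid UTF-8
converted, changed = to_utf8(legacy)
```

A file that is already valid UTF-8 (plain ASCII included) is returned unchanged, which is why only non-UTF-8 files are affected.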

Step 1 - Edit the properties file

The configuration file exists on each Node. If you have more than one Node, modify the configuration file on every Node where you want to enable the feature:

Enterprise mode ≥ 2.x
%PROGRAMDATA%\CAST\AIP-Node\application-default.yml

Standalone mode ≥ 2.x
%PROGRAMDATA%\CAST\AIP-Console-Standalone\application-standalone.yml


Note that the standard configuration is present by default in the %PROGRAMDATA%\CAST\AIP-Node\application.yml or %PROGRAMDATA%\CAST\AIP-Console-Standalone\application.yml file. However, always make your customizations in application-default.yml / application-standalone.yml: this file overrides the content of application.yml and is never overwritten during an upgrade.

Step 2 - Make the changes

Copy the following into the application-default.yml / application-standalone.yml under the existing "application" entry:

application:
  sourceCodeFiles:
    assumedEncoding: JVM
    conversionToUtf8:
      enabled: true
      appNameFilter:
      backupFiles: true
      pathsInReports: absolute
      fileExtensionsAdded:
      fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip
      preservedEncodings: ".*UTF-(16|32).*"
      logRetainedFiles: false
      logRejectedFiles: false
      logConvertedFiles: true

Each item is described below (item name, followed by its description):

application.sourceCodeFiles.assumedEncoding

Character set assumed for source files whose character set could not be determined. Can be "JVM" (default setting), meaning "the default Charset of this Java virtual machine", or the name of any character set supported by the current JVM.

  • Character set names are case-insensitive.
  • If the character set name you enter is not recognised (i.e. incorrectly spelt or just not a valid character set) then a warning will be recorded as follows:
    • In the Node log file, while lines of code (LoC) are counted during the Fast Scan.
    • In the Console log file during the Content Discovery step.
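
The lookup rules above can be mimicked in Python as a rough analogy. This is only a sketch: Python's codec registry stands in for the JVM's character sets, the platform's preferred encoding stands in for the JVM default, and falling back to that default after the warning is an assumption (the text above only states that a warning is recorded). The name resolve_assumed_encoding is hypothetical.

```python
import codecs
import locale
import logging

log = logging.getLogger("encoding-sketch")


def resolve_assumed_encoding(setting: str) -> str:
    """Map an assumedEncoding-style value to an encoding name (hypothetical helper)."""
    platform_default = locale.getpreferredencoding(False)
    name = setting.strip()
    if name.upper() == "JVM":
        # "JVM" means "use the runtime default"; here, the platform's preferred encoding.
        return platform_default
    try:
        # codecs.lookup is case-insensitive, like the character set names described above.
        return codecs.lookup(name).name
    except LookupError:
        # Assumption: warn, then fall back to the default.
        log.warning("Unknown character set %r, using %s", setting, platform_default)
        return platform_default
```

For instance, "ISO-8859-1" and "iso-8859-1" resolve to the same codec, while a misspelt name only triggers the warning path.
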
application.sourceCodeFiles.conversionToUtf8.enabled

Whether automatic conversion of non-UTF-8 source code files to UTF-8 is enabled during Onboarding with Fast Scan. The conversion is applied during the Content Discovery step. Set to true or false.
application.sourceCodeFiles.conversionToUtf8.appNameFilter

Regular expression that, if set, restricts conversion to the applications whose name matches it: only the non-UTF-8 source files of those applications are converted. No filtering occurs when this field is empty. Surround the regular expression with double quotes (") to avoid issues with characters that have a special meaning in a YAML file, such as '+' or '#'.
application.sourceCodeFiles.conversionToUtf8.backupFiles

Whether source code files that will be converted to UTF-8 should be backed up before conversion. Set to true or false.

If true, each file is backed up in the same directory as the original file, under the same name with the suffix ".genuine@<date>-<time>(<encoding>)" appended to its extension. Here <date>-<time> is the moment when the Fast Scan / Content discovery started (so it is the same for all files), and <encoding> is the encoding or character set found (or assumed) for the file. For instance, the backup of the non-UTF-8 file "main.java" found to be encoded in "Shift_JIS" will be named "main.java.genuine@20230525-084129(Shift_JIS)" if Fast Scan / Content discovery started on May 25th, 2023, at 08:41:29 AM.

If several versions of the same files are delivered over time, their repeated conversion to UTF-8 creates a new backup file each time, since backup filenames are timestamped. To avoid accumulating outdated backup files, once the conversion of all application files to UTF-8 has completed, only the most recent backup of each file that has again been converted to UTF-8 is kept. Once this process has ended, the following line is logged:

[INFO] Old backup files (count = N) of '<application-name>' could be deleted in folder '<sources-root-path>'
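
The naming scheme and the clean-up rule described above can be sketched as follows. This is a minimal illustration, not the Console's code: the helper names backup_name and outdated_backups are hypothetical, and the clean-up is simplified to "keep the newest backup per original file".

```python
from collections import defaultdict
from datetime import datetime

MARKER = ".genuine@"


def backup_name(filename: str, started: datetime, encoding: str) -> str:
    """Build '<file>.genuine@<date>-<time>(<encoding>)' as described above."""
    stamp = started.strftime("%Y%m%d-%H%M%S")
    return f"{filename}{MARKER}{stamp}({encoding})"


def outdated_backups(names):
    """Return every backup that is older than the newest backup of the same file."""
    by_original = defaultdict(list)
    for name in names:
        original, sep, suffix = name.rpartition(MARKER)
        if sep:  # ignore files that are not backups
            by_original[original].append((suffix[:15], name))  # 15 chars = YYYYMMDD-HHMMSS
    stale = []
    for backups in by_original.values():
        backups.sort()  # fixed-width timestamps sort chronologically
        stale.extend(name for _, name in backups[:-1])
    return stale


name = backup_name("main.java", datetime(2023, 5, 25, 8, 41, 29), "Shift_JIS")
```

Here name evaluates to "main.java.genuine@20230525-084129(Shift_JIS)", matching the example given above.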

application.sourceCodeFiles.conversionToUtf8.pathsInReports

Configures the type of paths that will be logged in reports about files of an application that have been considered for conversion to UTF-8, or that have not been considered because of filtering by file extension. Can be set to one of the following:

  • relative
  • absolute
  • filename

In case of error while either detecting the encoding or the character set of a file, or while converting it to UTF-8, the error report will always contain the absolute file path.
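
The three values can be pictured with the sketch below. It assumes that "relative" means relative to the application's source root, which the text above does not state explicitly. The function name report_path and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def report_path(file_path: str, sources_root: str, mode: str) -> str:
    """Format a file path the way a pathsInReports-style setting might (sketch)."""
    path = PurePosixPath(file_path)
    if mode == "absolute":
        return str(path)
    if mode == "relative":
        # Assumption: paths are reported relative to the application's source root.
        return str(path.relative_to(sources_root))
    if mode == "filename":
        return path.name
    raise ValueError(f"unsupported pathsInReports value: {mode!r}")


sample = "/data/src/app/pkg/Main.java"
formatted = {mode: report_path(sample, "/data/src/app", mode)
             for mode in ("relative", "absolute", "filename")}
```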

application.sourceCodeFiles.conversionToUtf8.logRetainedFiles

Whether to log the paths of files retained as candidates for conversion to UTF-8 (because of their filename extension). Set to true or false.
application.sourceCodeFiles.conversionToUtf8.logRejectedFiles

Whether to log the paths of files rejected as candidates for conversion to UTF-8 (because of their filename extension). Set to true or false.
application.sourceCodeFiles.conversionToUtf8.logConvertedFiles

Whether to log the paths of files that have actually been converted to UTF-8. Set to true or false.
application.sourceCodeFiles.conversionToUtf8.fileExtensionsAdded

Optional comma-separated list of file extensions, with no leading dot. Files with these extensions will always be considered for conversion to UTF-8 (in addition to the file extensions resulting from the application scan), because they are known to contain source code. By default, this property is empty. For example:

fileExtensionsAdded: txt, java
  • Extensions are case-insensitive when the Node is installed on Microsoft Windows, case-sensitive when installed on Linux
  • Space(s) around commas, if any, are not significant.
application.sourceCodeFiles.conversionToUtf8.fileExtensionsRemoved

Optional comma-separated list of file extensions, with no leading dot. Files with these extensions will never be considered for conversion to UTF-8, either because they are known not to be source code files or because they should not be converted. By default, this property contains a predefined list of file extensions, which can be edited (i.e. existing entries can be removed and new ones added). For example:

fileExtensionsRemoved: xls, xlsx
  • These file extensions are removed not only from those resulting from the Application Scan, but also from those listed in the "fileExtensionsAdded" property (see above): extensions defined here take precedence over both.
  • Extensions are case-insensitive when the Node is installed on Microsoft Windows, case-sensitive when installed on Linux
  • Space(s) around commas, if any, are not significant.
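
The precedence between the scan result, "fileExtensionsAdded" and "fileExtensionsRemoved" can be sketched as simple set arithmetic. This is only an illustration of the rules stated above: the function names are hypothetical, and the case_insensitive flag stands in for the Windows / Linux difference.

```python
def parse_extensions(value, case_insensitive):
    """Split a comma-separated list; spaces around commas are not significant."""
    items = (item.strip() for item in (value or "").split(","))
    return {item.lower() if case_insensitive else item for item in items if item}


def candidate_extensions(scanned, added, removed, case_insensitive=True):
    """(scan result + added) - removed: removed extensions always win."""
    scan_set = {ext.lower() if case_insensitive else ext for ext in scanned}
    keep = scan_set | parse_extensions(added, case_insensitive)
    return keep - parse_extensions(removed, case_insensitive)


# Windows-like behaviour: "TXT" in the removed list cancels "txt" in the added list.
windows_like = candidate_extensions(["java", "XML", "xls"], "txt , java", "xls,TXT")
# Linux-like behaviour: "TXT" and "txt" are different extensions.
linux_like = candidate_extensions(["java", "XML"], "txt", "TXT", case_insensitive=False)
```
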
application.sourceCodeFiles.conversionToUtf8.preservedEncodings

Optional regular expression matching the names of the character sets and encodings that must be preserved: source code files found with a matching character set or encoding will not be converted to UTF-8 (for convenience, matching ignores letter case). If this property is empty or disabled, no file is excluded because of the character set or encoding found or assumed for it. By default, this property contains the following regular expression, which ensures that source code files found with either UTF-16 or UTF-32 are not converted to UTF-8 (additional expressions can be added):

preservedEncodings: ".*UTF-(16|32).*"
The property value should be surrounded with double quotes (") to avoid issues because of characters having special meaning in YAML, such as '+' or '#'.
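
To see which encoding names the default expression preserves, it can be tried out as in the sketch below. Two assumptions apply: the expression is matched against the whole encoding name, and Python's regular expression engine behaves like the Java one for this simple pattern. The function name is_preserved is hypothetical.

```python
import re

DEFAULT_PRESERVED = r".*UTF-(16|32).*"


def is_preserved(encoding_name: str, preserved_pattern: str = DEFAULT_PRESERVED) -> bool:
    """True when a file with this encoding should be left unconverted (sketch)."""
    if not preserved_pattern:
        return False  # empty property: no filtering by encoding
    # Assumption: whole-name match, ignoring letter case as described above.
    return re.fullmatch(preserved_pattern, encoding_name, flags=re.IGNORECASE) is not None


results = {name: is_preserved(name)
           for name in ("UTF-16LE", "utf-32be", "Shift_JIS", "ISO-8859-1")}
```

With the default expression, the UTF-16 and UTF-32 variants are preserved, while "Shift_JIS" and "ISO-8859-1" remain candidates for conversion.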

Below is an example configuration that enables automatic conversion to UTF-8 of all non-UTF-8 source code files (except those whose existing character set or encoding matches either UTF-16 or UTF-32), provided the name of the application they belong to contains "foo" (case-sensitively) or "bar" (case-insensitively), with backup of the source code files enabled. Note that, as written, the regular expression requires at least one character before and after "foo" or "bar". In addition, the character set assumed for source code files whose character set could be neither found (for instance thanks to a BOM) nor guessed (by sampling, followed by validation of the guess against the entire file) is set to ISO-8859-1, which corresponds to the Western European codepage.

application:
  sourceCodeFiles:
    assumedEncoding: ISO-8859-1
    conversionToUtf8:
      enabled: true
      appNameFilter: ".+(foo|[bB][aA][rR]).+"
      backupFiles: true
      pathsInReports: absolute
      fileExtensionsAdded:
      fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip
      preservedEncodings: ".*UTF-(16|32).*"
      logRetainedFiles: false
      logRejectedFiles: false
      logConvertedFiles: true
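
Before saving, it can help to try the "appNameFilter" expression against a few application names. The sketch below does this in Python, assuming the expression is matched against the whole application name and that the Python and Java regular expression engines agree on this pattern. The function name app_is_selected and the sample names are hypothetical.

```python
import re

APP_NAME_FILTER = r".+(foo|[bB][aA][rR]).+"  # value of appNameFilter in the example above


def app_is_selected(app_name: str, pattern: str = APP_NAME_FILTER) -> bool:
    """True when the application's files would be considered for conversion (sketch)."""
    if not pattern:
        return True  # empty filter: no filtering occurs
    return re.fullmatch(pattern, app_name) is not None


checks = {name: app_is_selected(name)
          for name in ("my-foo-app", "fooapp", "xBaRy", "xFOOy")}
```

"my-foo-app" and "xBaRy" are selected. "fooapp" is not, because ".+" needs at least one character before "foo", and "xFOOy" is not, because "foo" is matched case-sensitively.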

Save the file when you have completed the changes.

Step 3 - Apply configuration changes

Restart the Node / Standalone release to ensure all changes are taken into account.