Summary: this page describes how to configure the automatic UTF-8 conversion feature available in Console ≥ 2.9.

Introduction

When source code delivered in CAST Console contains characters whose encoding or character set is not UTF-8, problems can occur during the analysis phase. When you use the Application onboarding with Fast Scan workflow, CAST Console warns you that the source code contains non-UTF-8 encoded characters and lists the impacted files - see 2.9 - Onboarding with Fast Scan - redesigned job progress screen:

Alert in the Job Progress panel

CAST Console includes a feature that automatically converts these files to UTF-8 during the Content Discovery step of an onboarding with Fast Scan, which reduces the number of warnings in the analysis log and improves the analysis results. This feature is enabled "out-of-the-box". If you need to disable it or modify any of its options, follow the instructions below.

Step 1 - Edit the properties file

The configuration file exists on each Node. If you have more than one Node, modify the file on every Node where you want to configure the feature:

≥ 3.x
%PROGRAMDATA%\CAST\Imaging\CAST-Imaging-Analysis-Node\application.yml

Enterprise mode ≥ 2.x
%PROGRAMDATA%\CAST\AIP-Node\application.yml

Standalone mode ≥ 2.x
%PROGRAMDATA%\CAST\AIP-Console-Standalone\application.yml

You will find the relevant parameters in the following section:

application:
  ...
  sourceCodeFiles:
    # Character set that will be assumed for source files whose character set could
    # not be found: can be "JVM", standing for "the default Charset of this Java
    # virtual machine", or the name of any Charset supported by the current JVM.
    # Note: Character set names are case-insensitive.
    assumedEncoding: JVM
    conversionToUtf8:
      # Whether automatic conversion of non UTF-8 source code files
      # to UTF-8 is enabled during Onboarding with Fast Scan.
      enabled: true
      # Regular expression to match application names to perform UTF-8
      # conversion on (will match all applications if left empty).
      appNameFilter:
      # Option to enable backup before conversion.
      backupFiles: true
      # Types of paths to display in the conversion logs and reports.
      # Valid values: absolute, relative, or filename
      pathsInReports: absolute
      # Comma-separated list of file extensions, with no leading dot, to add to
      # those issued from the Application Scan, that will be converted to UTF-8
      # unless they are part of removed extension (set in fileExtensionsRemoved).
      fileExtensionsAdded:
      # Comma-separated list of file extensions, with no leading
      # dot, to ignore during the conversion of files to UTF-8.
      fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip, xml, axml, ccxml, clixml, cproject, dita, ditamap, ditaval, glade, grxml, jelly, kml, mxml, plist, pluginspec, ps1xml, psc1, pt, rdf, rss, scxml, svg, tmCommand, tmLanguage, tmPreferences, tmSnippet, tmTheme, tml, ui, vxml, wxi, wxl, wxs, x3d, xaml, xlf, xliff, xmi, xul, zcml
      # Regular expression defining, if any, the names of character sets and
      # encodings for which source code files will not be converted to UTF-8.
      preservedEncodings: ".*UTF-(16|32).*"
      # Whether to log the paths of files retained as candidates for conversion to UTF-8.
      logRetainedFiles: false
      # Whether to log the paths of files rejected as candidates for conversion to UTF-8.
      logRejectedFiles: false
      # Whether to log the paths of files that have actually been converted to UTF-8.
      logConvertedFiles: true

Step 2 - Make the changes

You can modify any of the parameters described below:

Item Description
application.sourceCodeFiles.assumedEncoding

Character set that will be assumed for source files whose character set could not be found: can be "JVM" (default setting), standing for "the default Charset of this Java virtual machine", or the name of any character set supported by the current JVM.

  • Character set names are case-insensitive.
  • If the character set name you enter is not recognised (for example, it is misspelt or is not a valid character set), a warning is recorded in the following places:
    • In the Node log file, while lines of code (LoC) are counted during the Fast Scan.
    • In the Console log file, during the Content Discovery step.
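
As an illustration, the fragment below sets the fallback character set to windows-1252 instead of the JVM default. The choice of windows-1252 is only an assumption made for this example; use whichever character set matches your source code, and check that the Node's JVM supports it.

```yaml
application:
  sourceCodeFiles:
    # Illustrative value only: any Charset name supported by the Node's JVM is valid.
    # Names are case-insensitive, so "WINDOWS-1252" would be equivalent.
    assumedEncoding: windows-1252
```
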
application.sourceCodeFiles.conversionToUtf8.enabled

Whether automatic conversion of non-UTF-8 source code files to UTF-8 is enabled during Onboarding with Fast Scan. The conversion is applied during the Content Discovery step. Set to true or false.
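
For example, a minimal sketch of the change needed to switch the feature off on a Node. Only the enabled flag is shown; the other properties under conversionToUtf8 are left untouched here.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Turns off automatic conversion to UTF-8 during Onboarding with Fast Scan.
      enabled: false
```
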
application.sourceCodeFiles.conversionToUtf8.appNameFilter

Regular expression that restricts conversion to the applications whose names match it: only the non-UTF-8 source files of those applications are converted. No filtering occurs when this field is empty. Surround the regular expression with double quotes (") to avoid issues with characters that have a special meaning in a YAML file, such as '+' or '#'.
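
As a sketch, the fragment below limits conversion to applications whose names start with "Billing" or "Payroll". Both application names are hypothetical. The pattern is anchored at the start and ends with ".*" so that it behaves the same whether the expression is matched against the whole name or only part of it, since this page does not state which applies.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Hypothetical application names, quoted to keep YAML from
      # interpreting any special characters in the expression.
      appNameFilter: "^(Billing|Payroll).*"
```
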
application.sourceCodeFiles.conversionToUtf8.backupFiles

Whether source code files that will be converted to UTF-8 should be backed up before conversion. Set to true or false.

If true, each file is backed up in the same directory as the original file, under the same name with the suffix ".genuine@<date>-<time>(<encoding>)" appended, where <date>-<time> is the moment when the Fast Scan / Content Discovery started (the same date and time for all files) and <encoding> is the encoding or character set found (or assumed) for the file. For instance, the backup of the non-UTF-8 file "main.java", found to be encoded in "Shift_JIS", will be named "main.java.genuine@20230525-084129(Shift_JIS)" if the Fast Scan / Content Discovery started on May 25th, 2023, at 08:41:29 AM.

If several versions of the same files are delivered over time, their repeated conversion to UTF-8 creates a new backup file each time, because backup filenames are timestamped. To avoid the accumulation of outdated backup files, once the conversion of all application files to UTF-8 has completed, only the most recent backup of each file that has again been converted to UTF-8 is kept. Once this process has ended, the line below is logged:

[INFO] Old backup files (count = N) of '<application-name>' could be deleted in folder '<sources-root-path>'

application.sourceCodeFiles.conversionToUtf8.pathsInReports

Configures the type of paths that will be logged in reports about files of an application that have been considered for conversion to UTF-8, or that have not been considered because of filtering by file extension. Can be set to one of the following:

  • relative
  • absolute
  • filename

In case of error while either detecting the encoding or the character set of a file, or while converting it to UTF-8, the error report will always contain the absolute file path.

application.sourceCodeFiles.conversionToUtf8.logRetainedFiles

Whether to log the paths of files retained as candidates for conversion to UTF-8 (due to their filename extension). Set to true or false.
application.sourceCodeFiles.conversionToUtf8.logRejectedFiles

Whether to log the paths of files rejected as candidates for conversion to UTF-8 (because of their filename extension). Set to true or false.
application.sourceCodeFiles.conversionToUtf8.logConvertedFiles

Whether to log the paths of files that have actually been converted to UTF-8. Set to true or false.
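
For troubleshooting, it can help to log every decision taken about each file. The fragment below is one possible combination, using only the values documented above; with all three flags on, the logs will likely grow noticeably for large applications.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Shorter paths in reports; error reports still contain absolute paths.
      pathsInReports: relative
      # Log files kept as candidates, files rejected by extension,
      # and files actually converted.
      logRetainedFiles: true
      logRejectedFiles: true
      logConvertedFiles: true
```
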
application.sourceCodeFiles.conversionToUtf8.fileExtensionsAdded

Optional comma-separated list of file extensions, with no leading dot. Files with these extensions are always considered for conversion to UTF-8 (in addition to the file extensions resulting from the application scan) because they are known to contain source code, unless the same extension is also listed in fileExtensionsRemoved. By default, this property is empty. For example:

fileExtensionsAdded: txt, java
  • Extensions are case-insensitive when the Node is installed on Microsoft Windows, case-sensitive when installed on Linux.
  • Space(s) around commas, if any, are not significant.
application.sourceCodeFiles.conversionToUtf8.fileExtensionsRemoved

Optional comma-separated list of file extensions, with no leading dot. Files with these extensions are never considered for conversion to UTF-8, either because they are known not to be source code files or because they should not be converted to UTF-8. By default, this property contains a predefined list of file extensions, which you can edit (existing entries can be removed and new ones added). For example:

fileExtensionsRemoved: xls, xlsx
  • These file extensions are removed not only from those resulting from the Application Scan, but also from those listed in the "fileExtensionsAdded" property (see above): extensions defined here take precedence over both.
  • Extensions are case-insensitive when the Node is installed on Microsoft Windows, case-sensitive when installed on Linux.
  • Space(s) around commas, if any, are not significant.
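
Because fileExtensionsRemoved takes precedence, adding an extension to fileExtensionsAdded has no effect while the same extension is still listed in fileExtensionsRemoved. The sketch below shows how .txt files could be made eligible for conversion: txt is added on one side and taken out of the other. The removed list is deliberately shortened for readability and is not a recommended default.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # txt is added here...
      fileExtensionsAdded: txt
      # ...and must no longer appear here, otherwise it would still be excluded.
      # (Shortened list, for illustration only.)
      fileExtensionsRemoved: bin, dll, exe, jar, pdf, png, zip
```
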
application.sourceCodeFiles.conversionToUtf8.preservedEncodings

Optional regular expression defining the names of character sets and encodings for which source code files will not be converted to UTF-8: a file is skipped when the character set or encoding found for it matches the expression (for convenience, matching ignores letter case). If this property is empty or disabled, no file is excluded on the basis of the character set or encoding that was found or assumed for it. By default, this property contains the following regular expression, which ensures that source code files found to be encoded in UTF-16 or UTF-32 are not converted to UTF-8 (additional expressions can be added):

preservedEncodings: ".*UTF-(16|32).*"
The property value should be surrounded with double quotes (") to avoid issues because of characters having special meaning in YAML, such as '+' or '#'.
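
If other encodings should also be left untouched, the default expression can be widened. The sketch below additionally preserves files detected as Shift_JIS. This encoding is chosen only as an example, and the exact name reported for a given file is an assumption to check against the conversion logs.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Default UTF-16/UTF-32 rule extended with one extra alternative.
      preservedEncodings: ".*(UTF-(16|32)|Shift_JIS).*"
```
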

Below is an example configuration that enables automatic conversion to UTF-8 of all non-UTF-8 source code files (except those whose existing character set or encoding matches UTF-16 or UTF-32), provided that the name of the application they belong to contains "foo" (case-sensitive) or "bar" (case-insensitive). Note that, as written, the regular expression also requires at least one character before and after the matched text. Backup of the source code files is enabled. In addition, the character set assumed for source code files whose character set could neither be found (for instance thanks to a BOM) nor guessed (by sampling, followed by validation of the guess against the entire file) is set to ISO-8859-1, which corresponds to the Western European code page.

application:
  sourceCodeFiles:
    assumedEncoding: ISO-8859-1
    conversionToUtf8:
      enabled: true
      appNameFilter: ".+(foo|[bB][aA][rR]).+"
      backupFiles: true
      pathsInReports: absolute
      fileExtensionsAdded:
      fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip
      preservedEncodings: ".*UTF-(16|32).*"
      logRetainedFiles: false
      logRejectedFiles: false
      logConvertedFiles: true

Save the file when you have completed the changes.

Step 3 - Apply configuration changes

Restart the Node / Standalone release to ensure all changes are taken into account.