Summary: this page describes how to configure the automatic UTF-8 conversion feature available in Console ≥ 2.9.
Introduction
When source code delivered in CAST Console contains characters whose encoding or character set is not UTF-8, this can cause problems during the analysis phase. When using the Workflow - Application onboarding with Fast Scan, CAST Console will warn you that the source code contains non UTF-8 encoded characters and will list the impacted files (see 2.9 - Onboarding with Fast Scan - redesigned job progress screen):
Alert in the Job Progress panel
CAST Console includes a feature to convert these files automatically to UTF-8 during the Content Discovery step of an onboarding with Fast Scan, to reduce the number of warnings in the analysis log and improve the analysis results. This feature is enabled "out-of-the-box". If you specifically need to disable this feature or modify any of the options, see the instructions below.
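As a rough illustration of the simplest change, the sketch below shows the single setting that turns the feature off. It assumes the key layout documented for the application.yml file on this page. Only the keys on the path to "enabled" are shown, and every other setting in the file is left as it is.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Turn off automatic conversion of non UTF-8 source code files
      # during Onboarding with Fast Scan (the default value is true).
      enabled: false
```

As with any change to this file, the setting needs to be made on each Node concerned, and the Node restarted afterwards.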
Step 1 - Edit the properties file
The configuration file is available on each Node. If you have more than one Node, you will need to modify the configuration file on every Node where you want to configure the feature:
≥ 3.x
%PROGRAMDATA%\CAST\Imaging\CAST-Imaging-Analysis-Node\application.yml
Enterprise mode ≥ 2.x
%PROGRAMDATA%\CAST\AIP-Node\application.yml
Standalone mode ≥ 2.x
%PROGRAMDATA%\CAST\AIP-Console-Standalone\application.yml
You will find the relevant parameters in the following section:
application:
  ...
  sourceCodeFiles:
    # Character set that will be assumed for source files whose character set could
    # not be found: can be "JVM", standing for "the default Charset of this Java
    # virtual machine", or the name of any Charset supported by the current JVM.
    # Note: Character set names are case-insensitive.
    assumedEncoding: JVM
    conversionToUtf8:
      # Whether automatic conversion of non UTF-8 source code files
      # to UTF-8 is enabled during Onboarding with Fast Scan.
      enabled: true
      # Regular expression to match application names to perform UTF-8
      # conversion on (will match all applications if left empty).
      appNameFilter:
      # Option to enable backup before conversion.
      backupFiles: true
      # Types of paths to display in the conversion logs and reports.
      # Valid values: absolute, relative, or filename
      pathsInReports: absolute
      # Comma-separated list of file extensions, with no leading dot, to add to
      # those issued from the Application Scan, that will be converted to UTF-8
      # unless they are part of removed extension (set in fileExtensionsRemoved).
      fileExtensionsAdded:
      # Comma-separated list of file extensions, with no leading
      # dot, to ignore during the conversion of files to UTF-8.
      fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip, xml, axml, ccxml, clixml, cproject, dita, ditamap, ditaval, glade, grxml, jelly, kml, mxml, plist, pluginspec, ps1xml, psc1, pt, rdf, rss, scxml, svg, tmCommand, tmLanguage, tmPreferences, tmSnippet, tmTheme, tml, ui, vxml, wxi, wxl, wxs, x3d, xaml, xlf, xliff, xmi, xul, zcml
      # Regular expression defining, if any, the names of character sets and
      # encodings for which source code files will not be converted to UTF-8.
      preservedEncodings: ".*UTF-(16|32).*"
      # Whether to log the paths of files retained as candidates for conversion to UTF-8.
      logRetainedFiles: false
      # Whether to log the paths of files rejected as candidates for conversion to UTF-8.
      logRejectedFiles: false
      # Whether to log the paths of files that have actually been converted to UTF-8.
      logConvertedFiles: true
Step 2 - Make the changes
You can modify any of the parameters as described below:
Item | Description |
---|---|
application.sourceCodeFiles.assumedEncoding | Character set that will be assumed for source files whose character set could not be found: can be "JVM" (default setting), standing for "the default Charset of this Java virtual machine", or the name of any character set supported by the current JVM. |
application.sourceCodeFiles.conversionToUtf8.enabled | Whether automatic conversion of non UTF-8 source code files to UTF-8 is enabled during Onboarding with Fast Scan - the conversion is applied during the Content Discovery step. Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.appNameFilter | Regular expression defining, if any, the name(s) of the only applications whose non UTF-8 source files must be converted. No filtering occurs when this field is empty. Ensure that the regular expression is surrounded with double quotes (") to avoid issues because of characters having special meaning in a YAML file, such as '+' or '#' for instance. |
application.sourceCodeFiles.conversionToUtf8.backupFiles | Whether source code files that will be converted to UTF-8 should be backed up before conversion. Set to true or false. If true, each file is backed up in the same directory as the original file, under the same name with the suffix ".genuine@<date>-<time>(<encoding>)" appended, where <date>-<time> corresponds to the moment at which the Fast Scan / Content Discovery started (this date and time will be the same for all files), and <encoding> is the encoding or character set found (or assumed) for the file. For instance, the backup of the non-UTF-8 file "main.java" found to have been encoded in "Shift_JIS" will be named "main.java.genuine@20230525-084129(Shift_JIS)" if the Fast Scan / Content Discovery started on May 25th, 2023, at 08:41:29 AM. If several versions of the same files are delivered over time, their repeated conversion to UTF-8 will create a new backup file each time, since the backup filenames are timestamped. To avoid the accumulation of outdated backup files, once the conversion of all application files to UTF-8 has completed, only the most recent backup of each file that has again been converted to UTF-8 is kept. Once this process has ended, the following line is logged: [INFO] Old backup files (count = N) of '<application-name>' could be deleted in folder '<sources-root-path>' |
application.sourceCodeFiles.conversionToUtf8.pathsInReports | Configures the type of paths that will be logged in reports about files of an application that have been considered for conversion to UTF-8, or that have not been considered because of filtering by file extension. Can be set to one of the following: absolute, relative, or filename. If an error occurs while detecting the encoding or character set of a file, or while converting it to UTF-8, the error report will always contain the absolute file path. |
application.sourceCodeFiles.conversionToUtf8.logRetainedFiles | Whether to log the paths of files retained as candidates for conversion to UTF-8 (due to their filename extension). Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.logRejectedFiles | Whether to log the paths of files rejected as candidates for conversion to UTF-8 (because of their filename extension). Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.logConvertedFiles | Whether to log the paths of files that have actually been converted to UTF-8. Set to true or false. |
application.sourceCodeFiles.conversionToUtf8.fileExtensionsAdded | Optional comma-separated list of file extensions, with no leading dot. Files with these extensions are known to contain source code and will be considered for conversion to UTF-8 in addition to the file extensions resulting from the application scan, unless the extension is also listed in fileExtensionsRemoved. By default, this property is empty. For example: fileExtensionsAdded: txt, java |
application.sourceCodeFiles.conversionToUtf8.fileExtensionsRemoved | Optional comma-separated list of file extensions, with no leading dot. Files with these extensions will never be considered for conversion to UTF-8, either because they are known not to be source code files or because they should not be converted to UTF-8. By default, this property contains a set list of file extensions, which can be edited (i.e. existing entries can be removed and new ones added). For example: fileExtensionsRemoved: xls, xlsx |
application.sourceCodeFiles.conversionToUtf8.preservedEncodings | Optional regular expression defining the names of character sets and encodings for which source code files (found with a matching character set or encoding) will not be converted to UTF-8. For convenience, matching is done regardless of letter case. If this property is empty or disabled, no file is filtered out because of the character set or encoding that was found or assumed for it. By default, this property contains the following regular expression, which ensures that any source code files found with either UTF-16 or UTF-32 will not be converted to UTF-8 (additional expressions can be added): preservedEncodings: ".*UTF-(16|32).*" The property value should be surrounded with double quotes (") to avoid issues because of characters having special meaning in YAML, such as '+' or '#'. |
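The two regular expression properties can be sketched as follows. The application name prefix "Billing" and the extra Shift_JIS alternative are hypothetical values chosen for illustration, not defaults. Both values are surrounded with double quotes as recommended above, and only the keys being changed are shown.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Hypothetical filter: restrict conversion to applications whose
      # name starts with "Billing".
      appNameFilter: "^Billing.*"
      # Default expression extended with one hypothetical alternative, so that
      # files found (or assumed) to be Shift_JIS are also left unconverted.
      preservedEncodings: ".*(UTF-(16|32)|Shift_JIS).*"
```

Note that inside a double-quoted YAML string a backslash starts an escape sequence, so an expression that contains a backslash (for example \d) needs the backslash doubled.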
Below is an example configuration which enables automatic conversion to UTF-8 of all non-UTF-8 source code files (except those whose existing character set or encoding matches either UTF-16 or UTF-32), provided the name of the application they belong to contains "foo" (case-sensitively) or "bar" (case-insensitively). Backup of the source code files is enabled. In addition, the character set assumed for source code files whose character set could neither be found (for instance thanks to a BOM) nor guessed (by sampling, followed by a validation of the guess against the entire file) is set to ISO-8859-1, corresponding to the Western European codepage.
application:
  sourceCodeFiles:
    assumedEncoding: ISO-8859-1
    conversionToUtf8:
      enabled: true
      appNameFilter: ".+(foo|[bB][aA][rR]).+"
      backupFiles: true
      pathsInReports: absolute
      fileExtensionsAdded:
      fileExtensionsRemoved: ani, avi, bin, bmp, bz2, chi, chm, class, com, csv, dib, dll, doc, docx, dump, exe, exp, frx, gif, gz, ico, idb, ilk, iml, ini, jar, jfif, jpe, jpeg, jpg, lib, log, mp3, mp4, msi, pbd, pdb, pch, pdf, png, ppt, pptx, rtf, sys, tar, tif, tiff, tgz, txt, vhdx, war, wav, webp, xls, xlsx, zip
      preservedEncodings: ".*UTF-(16|32).*"
      logRetainedFiles: false
      logRejectedFiles: false
      logConvertedFiles: true
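When tuning the extension lists, it can help to log temporarily which files are retained or rejected. The following is a minimal sketch, assuming that files with the hypothetical extensions "cpy" and "inc" should be treated as source code. Only the keys being changed are shown, and fileExtensionsRemoved keeps its existing value.

```yaml
application:
  sourceCodeFiles:
    conversionToUtf8:
      # Hypothetical additions: consider .cpy and .inc files for conversion
      # even if the application scan does not report these extensions.
      fileExtensionsAdded: cpy, inc
      # Log the files retained and rejected because of their extension while
      # tuning the lists (both options are false by default).
      logRetainedFiles: true
      logRejectedFiles: true
      # Log file names only, rather than absolute paths.
      pathsInReports: filename
```

An extension that appears in both fileExtensionsAdded and fileExtensionsRemoved is not converted, since the removed extensions take precedence.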
Save the file when you have completed the changes.
Step 3 - Apply configuration changes
Restart the Node / Standalone release to ensure all changes are taken into account.