Purpose

 This metric measures the ratio between the number of duplicated, copy/pasted artifacts and the total number of artifacts.

Applicable in CAST Version
Release    Yes/No
8.3.x      Yes
8.2.x      Yes
Applicable RDBMS
RDBMS                   Yes/No
Oracle Server           Yes
Microsoft SQL Server    Yes
CSS                     Yes
Details


Definition

This metric measures the ratio between the number of duplicated, copy/pasted artifacts and the total number of artifacts.
Copy/paste detection relies on statistical methods that compute a similarity score between all artifacts. Artifacts are reported as copy/pasted when their similarity is higher than 90% (see the metric parameter SIMILARITY).
Like any statistical method, the detection algorithms need a sufficiently large sample to produce significant results. Testing them on a couple of classes is not enough; the source code of a real-life application is required to yield usable results. The minimum size required is around 5000 lines of code.
Below that size, the algorithms still detect all exact copies, but slightly modified copy/pasted code will not always be detected.
Also, for optimal efficiency, the copy/pasted code detection is enabled only for artifacts larger than 10 lines of code (methods, functions, procedures, triggers, programs...).
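
As a minimal sketch of the definition above, the following Python snippet computes the ratio from a precomputed table of pairwise similarity scores. The names `Artifact`, `copy_paste_ratio`, `similarity_threshold` and `min_lines` are hypothetical and only mirror the 90% and 10-line defaults described here. The sketch assumes that short artifacts are never flagged but still count in the denominator; the product may handle this differently.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Artifact:
    name: str
    lines_of_code: int


def copy_paste_ratio(artifacts, similarity, similarity_threshold=0.90, min_lines=10):
    """Return (ratio, flagged names) for a list of artifacts.

    `similarity` maps frozenset({name_a, name_b}) to a score in [0, 1].
    Assumption: artifacts with `min_lines` lines or fewer are never flagged
    but still count in the denominator.
    """
    eligible = [a for a in artifacts if a.lines_of_code > min_lines]
    flagged = set()
    for a, b in combinations(eligible, 2):
        score = similarity.get(frozenset((a.name, b.name)), 0.0)
        if score > similarity_threshold:
            flagged.update((a.name, b.name))
    ratio = len(flagged) / len(artifacts) if artifacts else 0.0
    return ratio, flagged


# Hypothetical artifacts and similarity scores, for illustration only.
artifacts = [
    Artifact("computeTax", 40),
    Artifact("computeTaxV2", 42),
    Artifact("getName", 3),
    Artifact("getLabel", 3),
    Artifact("parseOrder", 25),
]
similarity = {
    frozenset(("computeTax", "computeTaxV2")): 0.95,
    frozenset(("getName", "getLabel")): 1.00,  # identical, but too short to be considered
    frozenset(("computeTax", "parseOrder")): 0.30,
}

ratio, flagged = copy_paste_ratio(artifacts, similarity)
print(sorted(flagged), ratio)
```

In this example only the two `computeTax` variants are flagged, so the ratio is 2 flagged artifacts out of 5.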

Scope computation

The metric "Avoid Too Many Copy Pasted Artifacts" covers the content of the entire KB, not only the module selected in the Portfolio tree.

The total number of objects used to compute the metric grade is the total number of artifacts in the KB involved in the snapshot computation, not the total number of artifacts belonging to the system/application/module selected in the Portfolio tree.

The metric grade can change even if the source code of the application (involved in the snapshot computation) has not been modified. Such a change results from a modification of the KB content.
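
The effect of the KB-wide scope can be illustrated with a small calculation. The figures and the helper `duplicated_ratio` below are hypothetical; the sketch only shows how the ratio moves when the KB grows while the analyzed application stays the same, and it does not reproduce how the grade itself is derived from the ratio.

```python
def duplicated_ratio(duplicated_artifacts, total_artifacts_in_kb):
    """Ratio of copy/pasted artifacts to all artifacts in the KB."""
    return duplicated_artifacts / total_artifacts_in_kb


# Hypothetical figures: the analyzed application is identical in both snapshots.
before = duplicated_ratio(duplicated_artifacts=100, total_artifacts_in_kb=2000)
# Another application has been added to the same KB, bringing no new duplicates.
after = duplicated_ratio(duplicated_artifacts=100, total_artifacts_in_kb=2500)

print(f"before: {before:.1%}, after: {after:.1%}")  # before: 5.0%, after: 4.0%
```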

List of Very High Risk Objects

The Value column gives the name of the object whose code is more than 90% similar to the object listed in the Object name column.
If several similar objects have been detected for the same object name, all of them are reported in the Value column.

Similarity concept

The code identified as copy/pasted is detected by the algorithm behind the CAST SIMILARITY concept.
By default, the SIMILARITY parameter is set to 90, which means that sources reported as copy/pasted are more than 90 percent similar. To detect copy/pasted code, we use NLP statistical classification methods (Rocchio, Bayes, SVM, etc.). These statistical methods are used to detect both copy/pasted code and commented-out code. The algorithm used to detect similarity is described below:

  1. Vectorize each document (code and comments)
    1. For each token, compute its term frequency (TF) and its document frequency (DF)
  2. Compute weights: Weight(wi, dj) = TF(wi, dj) * Log(D / DF(wi))
    1. wi is the i-th word of the vocabulary
    2. dj is the current document
    3. D is the total number of documents
    4. TF(wi, dj) is the number of occurrences of word wi in document dj (Term Frequency)
    5. DF(wi) is the number of documents in which wi appears (Document Frequency)
  3. Compute models (simple vector average)
    1. Code model
    2. Comment model
  4. Compute similarities
    1. A comment vector is classified as code if the code model is closer than the comment model
    2. The similarity is computed with a dot product
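
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CAST implementation: the tokenization is reduced to pre-split token lists, the function names (`tfidf_vectors`, `normalize`, `average`, `dot`) are hypothetical, and vectors are normalized to unit length before the dot product so that scores fall between 0 and 1, which the description above does not specify.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Steps 1-2: one {token: TF * log(D / DF)} dict per document (list of tokens)."""
    total_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)
        vectors.append({
            tok: tf * math.log(total_docs / doc_freq[tok])
            for tok, tf in term_freq.items()
        })
    return vectors


def normalize(vec):
    """Scale to unit length (an assumption of this sketch)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def average(vectors):
    """Step 3: simple vector average used as a class model."""
    model = Counter()
    for vec in vectors:
        for tok, w in vec.items():
            model[tok] += w / len(vectors)
    return dict(model)


def dot(u, v):
    """Step 4: dot product of two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


# Hypothetical, already tokenized documents.
code_docs = [["if", "x", ">", "0", "return", "x"],
             ["for", "i", "in", "items", "return", "i"]]
comment_docs = [["this", "method", "returns", "the", "value"],
                ["loop", "over", "the", "items"]]
candidate = ["return", "x", "if", "x"]  # a comment that may be commented-out code

vectors = [normalize(v) for v in tfidf_vectors(code_docs + comment_docs + [candidate])]
code_model = normalize(average(vectors[0:2]))
comment_model = normalize(average(vectors[2:4]))
candidate_vec = vectors[4]

code_score = dot(candidate_vec, code_model)
comment_score = dot(candidate_vec, comment_model)
print("classified as code:", code_score > comment_score)
```

Note that with this weighting a token present in every document gets a weight of zero, since Log(D / DF) is then Log(1); this is one reason a small sample gives poor results.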

Reference on statistical classification methods:
Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997
http://www.cs.cornell.edu/People/tj/publications/joachims_97a.ps.gz
http://www.cs.cornell.edu/People/tj/publications/joachims_97a.pdf

Parameters configuration

The parameter values can be changed in the metric tree Configuration page of AD Administration. This documentation is based on the default values; for example, the 10 lines of code mentioned above correspond to the default value of the CODELINE parameter.

Is the diagnostic computed for all technologies, and in particular for UA technologies?

The diagnostic is computed for every technology, including Universal technologies analyzed by the Universal Analyzer.
For Mainframe source code, the metric used to be computed on expanded source code (after expanding the code of referenced copybooks). Since CAST 7.0, the metric is computed on the actual source code, without considering the expanded code of referenced copybooks.
