Performance Evaluation Issues in Image Forensics
 Tampering Localization, Performance Evaluation and Human Perception
 Generation of Hypothetical Predicted Mask
 Observations
We observe that, in recent literature, the performance of tampering localization techniques is evaluated by rather universal metrics, such as accuracy, the ROC curve, AUC, F-score and so on. These metrics are designed for binary classification problems where individual samples are independent. However, in the case of image tampering localization, the image patches (pixels) fed into the classifier do not satisfy this assumption. As will be shown in this article, these metrics are not very descriptive when applied to image tampering detection in practice.
Tampering Localization, Performance Evaluation and Human Perception
Due to the complexity of the image tampering localization problem, it is very unlikely that the user of an automatic image tampering localization system will treat its results as a final decision. That is to say, such systems usually serve as an aid that allows human inspectors to make better judgments. Usually the localization is done at patch level, where a patch is a small region (say, 10 × 10 pixels) extracted from a rectangular grid over the image.
It is also possible to achieve localization at pixel level, which has a smaller granularity and therefore finer resolution, but essentially there is little difference between the two approaches. The patch level approach is less precise, but it is more commonly used because a patch contains richer statistical features and reduces the dimensionality of the image data (there are fewer patches than pixels). For simplicity, in the subsequent discussion we assume that the localization is done at patch level.
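As an illustration, the patch grid described above can be extracted with a few lines of NumPy. This is a minimal sketch; the function name and the policy of cropping leftover border pixels are our own choices:

```python
import numpy as np

def extract_patches(img, p=10):
    """Split a grayscale image into non-overlapping p x p patches taken
    from a rectangular grid. Border rows/columns that do not fill a
    whole patch are cropped off."""
    H, W = img.shape
    img = img[:H - H % p, :W - W % p]          # crop to multiples of p
    h, w = img.shape
    # (rows, p, cols, p) -> (rows, cols, p, p): a grid of patches
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
```

Each grid cell `(i, j)` then receives one binary label, which together form the patch-level mask.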
In recent literature, authors usually use general purpose machine learning performance evaluation metrics to assess the effectiveness of image tampering localization methods. The commonly used metrics include accuracy, AUC and F-score. However, the image tampering localization scenario differs from ordinary binary classification problems for the following reasons:
 A single image contains an enormous number of inputs, whose outputs need to be inspected as a whole
 The inputs to the tampering localization classifier are not independent
 The goal of tampering localization is to generate human perceivable results
These factors render these metrics less informative in the tampering localization scenario, or at least their values should be interpreted in a different manner. For reference, we briefly introduce the relevant performance evaluation concepts below:
(Concepts from the confusion matrix: $TP$: true positive, $TN$: true negative, $FP$: false positive, $FN$: false negative, $P$: all positive, $N$: all negative.)
 The ROC curve is created by plotting the recall ($TP/P$) against the fallout ($FP/N$) at various threshold settings.
 The Area Under the Curve (AUC) is the area under the ROC curve.
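These definitions can be written out directly. The following NumPy sketch computes the confusion-matrix counts and the derived metrics for a pair of binary masks (the function names are our own):

```python
import numpy as np

def confusion(G, P):
    """Confusion-matrix counts for binary masks G (truth), P (prediction)."""
    TP = np.sum((P == 1) & (G == 1))
    TN = np.sum((P == 0) & (G == 0))
    FP = np.sum((P == 1) & (G == 0))
    FN = np.sum((P == 0) & (G == 1))
    return TP, TN, FP, FN

def accuracy(G, P):
    TP, TN, FP, FN = confusion(G, P)
    return (TP + TN) / (TP + TN + FP + FN)

def recall(G, P):    # true positive rate, the y-axis of the ROC curve
    TP, _, _, FN = confusion(G, P)
    return TP / (TP + FN)

def fallout(G, P):   # false positive rate, the x-axis of the ROC curve
    _, TN, FP, _ = confusion(G, P)
    return FP / (FP + TN)

def f_score(G, P):
    TP, _, FP, FN = confusion(G, P)
    precision = TP / (TP + FP)
    rec = TP / (TP + FN)
    return 2 * precision * rec / (precision + rec)
```

Note that every metric here flattens the mask into a bag of independent samples; the spatial arrangement of the patches plays no role, which is exactly the limitation discussed in this article.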
Generation of Hypothetical Predicted Mask
We would like to show that the decline of metric effectiveness is not bound to any specific classifier. In fact, it is due to the limitations of the performance metrics themselves, as they are not designed for this scenario. To demonstrate how ubiquitous the problem is, we devise a way to randomly generate predicted masks based on a given performance metric, which can yield the *bad cases* we want with high probability. In all the following examples, it is assumed that the ground truth mask is already known. To show the different visual perception effects of distinctive shapes in masks, we apply the evaluation on several different predefined pixel level masks, which are illustrated in the figure below. Their corresponding patch level masks can be acquired easily and are therefore not attached.
For the subsequent discussion, the ground truth mask will be denoted by $G$, where the value on the $i$th row, $j$th column is given by $G_{i,j} = 1$ if the patch is tampered and $G_{i,j} = 0$ otherwise.
In our illustrations, $G_{i,j} = 1$ is shown by white patches, while $G_{i,j} = 0$ is shown by black patches. The hypothetical predicted mask will be denoted by $P$, which can take on either binary values or real values on $[0, 1]$ depending on the context. It can be clearly seen that $P$ has the same dimensionality as $G$.
Generate hypothetical predicted mask given accuracy
It is relatively simple to generate a hypothetical predicted mask given an accuracy value $a$. In this scenario, $P$ only needs to take on binary values. To generate $P$, we can sample an indicator $X_{i,j}$ from a Bernoulli distribution $\mathrm{Bernoulli}(a)$, where $X_{i,j} = 1$ means $P_{i,j} = G_{i,j}$ and $X_{i,j} = 0$ means $P_{i,j} = 1 - G_{i,j}$. The algorithm is shown as below.
It is easy to see why $P$ will have accuracy approximately $a$: the event of $P_{i,j}$ being assigned the correct value has a probability of $a$.
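The sampling procedure just described can be sketched in a few lines of NumPy (a minimal version; the function name is our own):

```python
import numpy as np

def mask_given_accuracy(G, a, rng=None):
    """Generate a hypothetical binary predicted mask whose accuracy
    against the ground truth G is approximately a: each patch
    independently keeps its ground-truth value with probability a
    and is flipped otherwise."""
    rng = np.random.default_rng(rng)
    correct = rng.random(G.shape) < a     # Bernoulli(a) indicator per patch
    return np.where(correct, G, 1 - G)
```

On a mask with many patches, the empirical accuracy of the output concentrates tightly around `a` by the law of large numbers.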
Generate hypothetical predicted mask given AUC
Because the AUC implies a huge degree of freedom, given an AUC value $u$, it is difficult to enumerate all possible ROC curves, which may lead to different visual effects. As our purpose is to create an illustration that loosely represents the given AUC value, we make the following assumptions to render the problem more tractable:
 When the AUC is high, the shapes of ROC curves tend to look alike and therefore can be approximated by a family of curves.
 Suppose an ROC curve is uniformly discretized into a set of points $\{p_1, p_2, \ldots, p_n\}$, which is well sorted by spatial occurrence from the top right corner to the bottom left corner. We assume that the threshold values are also uniformly distributed (by the length of the ROC curve) from 0 to 1 on these points.
The family of curves chosen for ROC curve approximation is a simple three-line-segment scheme. The equation of the first line segment is given by $y = kx$, where $k > 1$ is the gradient. Because the ROC curve is constrained to the unit square, it can be seen that this segment intersects the anti-diagonal $y = 1 - x$ at $\left(\frac{1}{1+k}, \frac{k}{1+k}\right)$. We scale the line segment between the origin and this intersection by a factor of $\lambda \in (0, 1)$. The second line segment is the reflection of the first about the same anti-diagonal, and the third line segment connects the previous two. The curve can be expressed as the piecewise function below:
Because the shape of the curve is completely determined by the choice of $k$, for simplicity we shall call it a $k$-ROC curve. Different $k$-ROC curves and their corresponding AUC values are shown below.
When $k$ is large, because $\frac{1}{1+k} \to 0$ and $\frac{k}{1+k} \to 1$, the endpoint of the first line segment and its symmetric counterpart will both be close to $(0, 1)$. Therefore, it is easy to see that as $k \to \infty$, $u \to 1$.
The relationship between $u$ and $k$ is given by
Because it is nontrivial to solve for $k$ given $u$, a table is attached below to help one select the nearest $k$ value for a given $u$.
We also need to parameterize the $k$-ROC curve by its length into $t \in [0, 1]$, where $t = 0$ indicates the bottom left corner and $t = 1$ indicates the top right corner. More specifically, we need to determine a function that returns the Cartesian coordinate of the parameter $t$ on the curve. Since the first line segment has gradient $k$, it is easy to compute that its length, denoted by $l_1$, is
The length of the second line segment equals that of the first, and the length of the third line segment, denoted by $l_3$, is
Denoting the total length of a $k$-ROC curve by $L$, it can be seen that $L = 2 l_1 + l_3$.
By interpolating the segments linearly, we can see that the relationship between the $x$ component of the Cartesian coordinate and $t$ is as follows:

If $0 \le t \le \frac{l_1}{L}$, then $x(t) = \frac{tL}{\sqrt{1 + k^2}}$.

If $\frac{l_1}{L} < t \le \frac{l_1 + l_3}{L}$, then $x(t) = \frac{l_1}{\sqrt{1 + k^2}} + \frac{tL - l_1}{\sqrt{2}}$.

If $\frac{l_1 + l_3}{L} < t \le 1$, then $x(t) = 1 - \frac{(1 - t)\,L\,k}{\sqrt{1 + k^2}}$.
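The parameterization above can be sketched in code. This is only an illustration under our reading of the construction: the scale factor `lam` applied to the first two segments is an assumed parameter of this sketch, and the per-segment formulas follow the three-segment geometry described in this section:

```python
import numpy as np

def roc_x(t, k, lam=0.9):
    """x coordinate of the parameter t in [0, 1] on the k-ROC curve,
    assuming a scale factor lam in (0, 1) for the first two segments."""
    d = np.sqrt(1 + k * k)
    l1 = lam * d / (1 + k)        # length of the first (and second) segment
    l3 = np.sqrt(2) * (1 - lam)   # length of the connecting segment
    L = 2 * l1 + l3               # total curve length
    s = t * L                     # arc length travelled from the origin
    if s <= l1:                   # first segment, along y = k x
        return s / d
    if s <= l1 + l3:              # third segment, direction (1, 1)/sqrt(2)
        return l1 / d + (s - l1) / np.sqrt(2)
    return 1 - (L - s) * k / d    # second segment, ending at (1, 1)

def roc_y(t, k, lam=0.9):
    """The curve is symmetric about x + y = 1, so y(t) = 1 - x(1 - t)."""
    return 1 - roc_x(1 - t, k, lam)

def auc(k, lam=0.9, n=2001):
    """Numerical AUC of a k-ROC curve via the trapezoid rule."""
    ts = np.linspace(0, 1, n)
    xs = np.array([roc_x(t, k, lam) for t in ts])
    ys = np.array([roc_y(t, k, lam) for t in ts])
    return np.trapz(ys, xs)
```

With `lam` fixed, `auc(k)` increases with $k$, matching the observation that the curve approaches the top left corner as $k$ grows.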
It can be seen that, by the symmetry of the curve about the anti-diagonal, $y(t) = 1 - x(1 - t)$. Now we can generate $P$ given $G$ and $k$ with the algorithm below.
This algorithm can indeed generate a $P$ that has a $k$-ROC curve: as the fallout and recall change between two points of the $k$-ROC curve, the algorithm assigns patches values such that the fallout and recall of $P$ change in the same way. Once the ROC curve of $P$ follows the $k$-ROC curve, $P$ will also have the AUC value defined by the $k$-ROC curve.
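One way to realize such a generator is inverse-CDF sampling: each tampered patch receives a score matched to the curve's recall, and each authentic patch a score matched to its fallout. The sketch below is our own construction under the assumptions of this section: the curve is supplied as discretized, monotone arrays, and the threshold attached to parameter $t$ is taken to be $1 - t$:

```python
import numpy as np

def mask_given_roc(G, ts, xs, ys, rng=None):
    """Hypothetical real-valued predicted mask whose ROC curve follows
    the discretized curve (xs, ys) sampled at parameters ts (all three
    increasing, running from (0, 0) to (1, 1)). Thresholding the
    scores at 1 - t yields recall ~ ys and fallout ~ xs at parameter t."""
    rng = np.random.default_rng(rng)
    flat = np.empty(G.size, dtype=float)
    pos = np.flatnonzero(G.ravel() == 1)
    neg = np.flatnonzero(G.ravel() == 0)
    rng.shuffle(pos)            # scatter the scores spatially at random
    rng.shuffle(neg)
    # inverse-CDF sampling: the j-th tampered patch gets the score at
    # which the curve's recall first reaches its quantile; authentic
    # patches are handled the same way through the fallout
    q_pos = (np.arange(pos.size) + 0.5) / pos.size
    q_neg = (np.arange(neg.size) + 0.5) / neg.size
    flat[pos] = 1 - np.interp(q_pos, ys, ts)
    flat[neg] = 1 - np.interp(q_neg, xs, ts)
    return flat.reshape(G.shape)
```

Sweeping the threshold over the resulting score map then traces out (approximately) the prescribed ROC curve, and hence the prescribed AUC.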
Observations
Big gap in metrics may lead to similar performance in practice
This image shows the hypothetical output maps given particular accuracy scores.
The image shows the hypothetical output maps given particular $k$ values, where each value of $k$ corresponds to an AUC score.
It can be seen that the tampered objects remain observable even in some output maps with lower metric scores.
Methods with lower metrics may work better
It is pointed out by Zhou et al.^{1} that the edges of a tampered region are easier to detect because the statistical patterns change greatly there. What if there is a classifier that is very sensitive to the edges of the tampered region but not to its interior? For example, assume that the tampered region is given by the medium circle in the mask templates. The outputs of the edge-sensitive classifier and an ordinary classifier are shown below. If we compute the AUC of the edge-sensitive classifier's output, the value is only around 0.62, which is lower than that of the ordinary classifier's output, which has an AUC of approximately 0.70. However, the shape of the tampered region is much clearer in the output of the edge-sensitive classifier. That is to say, output maps with lower AUC values may look better to humans.
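Comparing two output maps in this way only requires computing the AUC of each against the same ground truth. A minimal rank-based (Mann-Whitney) sketch, suitable for small masks (it compares all positive/negative pairs, so it is quadratic in the patch count):

```python
import numpy as np

def mask_auc(G, P):
    """Rank-based (Mann-Whitney) AUC of a real-valued output map P
    against a binary ground-truth mask G: the fraction of
    (tampered, authentic) patch pairs ranked correctly, counting
    ties as half."""
    pos = P[G == 1].ravel()
    neg = P[G == 0].ravel()
    greater = np.mean(pos[:, None] > neg[None, :])
    ties = np.mean(pos[:, None] == neg[None, :])
    return greater + 0.5 * ties
```

Applying `mask_auc` to each classifier's output and then comparing the numbers against the maps' visual quality makes the mismatch discussed here directly observable.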

Zhou, Peng, et al. “Learning rich features for image manipulation detection.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. ↩