Image Manipulation Dataset

The Image Manipulation Dataset is a ground truth database for benchmarking the detection of image tampering artifacts.

It includes 48 base images, separate snippets from these images, and a software framework for creating ground truth data. The idea is to "replay" copy-move forgeries by copying, scaling and rotating semantically meaningful image regions. Additionally, Gaussian noise and JPEG compression artifacts can be added, both on the snippets and on the final tampered images. As a consequence, the dataset is not limited to copy-move forgery detection: it can also be used to evaluate algorithms for the detection of resampling and double-JPEG compression.

Feature Description by Example

To illustrate this, have a look at the walk-through example to the right.

In the first row, the original image (left) shows a person in front of stones. The tampered image (middle) does not show the person, as it is covered by copies of the environment. The right image is a visualization of the image regions that were involved in hiding the person.

In the second row, the copied regions are separately shown (left, middle). At the right, a ground-truth image for the application in copy-move forgery detection is shown. Here, white regions denote copy-moved pixels.

With the provided software, the original image and the provided snippets can be combined in many different ways. For instance, the snippets can be slightly scaled, and JPEG compression artifacts can be added before insertion. The software subsequently computes the ground truth for the respective transformation of the snippets.
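As a toy illustration of what "replaying" a forgery together with its ground truth means, the following Python sketch copies a rectangular block inside a small grayscale grid and produces a matching mask. It is not the framework's code: the function name `copy_move`, the rectangular snippet shape, and the choice to mark both source and destination pixels as white are assumptions made for this example. The actual framework additionally handles scaling, rotation, noise and JPEG artifacts.

```python
def copy_move(image, top, left, height, width, dst_top, dst_left):
    """Copy a rectangular snippet within `image`; return (forged, mask).

    `image` is a list of rows of grayscale values. The mask holds 255
    (white) for pixels involved in the copy and 0 elsewhere. Marking both
    the source and the destination is an assumption of this sketch.
    """
    forged = [row[:] for row in image]          # leave the input untouched
    mask = [[0] * len(row) for row in image]
    for dy in range(height):
        for dx in range(width):
            forged[dst_top + dy][dst_left + dx] = image[top + dy][left + dx]
            mask[top + dy][left + dx] = 255
            mask[dst_top + dy][dst_left + dx] = 255
    return forged, mask


# A 4x6 test image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
forged, mask = copy_move(image, top=0, left=0, height=2, width=2,
                         dst_top=2, dst_left=3)
```

Because the mask is derived from the same parameters as the forgery, it stays exact no matter how the splice is configured, which is the property the framework provides for the real images.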

Thus, by changing the splicing parameters, the 48 image+snippet combinations serve as a rich toolbox for extensive evaluation of image forensics algorithm performance.
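One common way to score a detector against such ground truth is pixel-level precision, recall and F1. The Python sketch below is a minimal example under that assumption; the function name `pixel_scores` and the convention that 255 marks a copy-moved pixel in both masks are choices made here, not part of the published framework.

```python
def pixel_scores(detected, ground_truth, white=255):
    """Return (precision, recall, f1) for two masks of equal size.

    Both masks are lists of rows; a pixel counts as "copy-moved" when it
    equals `white`.
    """
    tp = fp = fn = 0
    for det_row, gt_row in zip(detected, ground_truth):
        for det, gt in zip(det_row, gt_row):
            is_det = det == white
            is_gt = gt == white
            if is_det and is_gt:
                tp += 1          # correctly detected pixel
            elif is_det:
                fp += 1          # detected, but not in the ground truth
            elif is_gt:
                fn += 1          # missed ground-truth pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


ground_truth = [[255, 255, 0], [0, 0, 0]]
detected = [[255, 0, 255], [0, 0, 0]]
scores = pixel_scores(detected, ground_truth)
```

Here one pixel is detected correctly, one is a false alarm and one is missed, so all three scores come out as 0.5.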

The transformation and splicing process is completely script-guided, such that setups for the evaluation in a particular article can be directly reproduced. Additionally, the source code of the framework is published under an open source license, and we are happy to receive your comments for extensions, or even your patches to extend its usefulness for the community.

Download the Dataset

The image data is available here, and the software framework for ground truth generation here.
UPDATE! It is now possible to download the benchmark images that we used for our TIFS paper directly (~30GB data). Scroll to the end of the section to access the data.

If you generate the data yourself, you need both the image data and the software framework to create a ground-truth benchmark. To build the software framework, please see the build instructions below.

If you intend to use the benchmark for copy-move forgery detection (CMFD), you may be interested in our CMFD project page for some more resources.

If you use the code or the data, please cite our paper: V. Christlein, C. Riess, J. Jordan, C. Riess, E. Angelopoulou: "An Evaluation of Popular Copy-Move Forgery Detection Approaches", IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1841-1854, 2012.

Precomputed Dataset:

If you intend to use the benchmark data that was also used for the IEEE TIFS paper, then you can use the precomputed data from below. The file names contain the parameters of the respective operation. For instance, "gj80" means "addition of global (i.e. full-image) JPEG artifacts of quality 80". Similarly, "r4" means "rotation of the snippet by 4 degrees".

Note that file names containing the substring "_gt" are the ground truth for the respective file name without that string.
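The naming convention can be read back programmatically. The Python sketch below only relies on the three facts stated above ("gj80", "r4", and the "_gt" substring). Everything else is an assumption made for illustration: the underscore-separated token layout, the `.png` extension, and the file name `example_copy_r4_gj80_gt.png` are hypothetical and may not match the actual archive.

```python
import re

# Only the two prefixes documented above are known; others are not guessed.
KNOWN_TOKENS = {
    "gj": "global JPEG quality",
    "r": "snippet rotation (degrees)",
}


def image_name_for_ground_truth(file_name):
    """Return the file name without the "_gt" marker, or None if absent."""
    if "_gt" not in file_name:
        return None
    return file_name.replace("_gt", "", 1)


def parse_parameters(file_name):
    """Extract known parameter tokens such as "r4" or "gj80".

    Assumes tokens are separated by underscores (hypothetical layout).
    """
    stem = file_name.rsplit(".", 1)[0]
    params = {}
    for token in stem.split("_"):
        match = re.fullmatch(r"([a-z]+)(\d+)", token)
        if match and match.group(1) in KNOWN_TOKENS:
            params[KNOWN_TOKENS[match.group(1)]] = int(match.group(2))
    return params


name = "example_copy_r4_gj80_gt.png"  # hypothetical file name
params = parse_parameters(name)
image_name = image_name_for_ground_truth(name)
```

If the real file names use a different separator, only the `split("_")` call needs to change.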

Unmodified/original images ("orig")
Unmodified/original images with JPEG compression ("orig_jpeg")
1-to-1 splices (i.e. direct copy of snippet into image, "nul")
Splices with added Gaussian noise ("lnoise")
Splices with added JPEG artifacts ("jpeg")
Rotated copies ("rot", "rotExtra", "rotExtra2")
Scaled copies ("scale", "scaleExtra", "scaleExtra2")
Combined effects ("cmb_easy1", "cmbExtra", "cmbExtra2")
Copies that were pasted multiple times ("multi_paste")

Most of the test cases also exist in downscaled versions, for evaluating how CMFD performance deteriorates on downscaled forgeries:

Unmodified/original downscaled ("orig_sd")
1-to-1 splices downscaled ("nul_sd")
Splices with added JPEG artifacts, downscaled ("jpeg_sd")
Rotated copies, downscaled ("rot_sd")
Scaled copies, downscaled ("scale_sd")

Gradually downscaled 1-to-1 copies ("scale_down")

Build and Setup the Framework

The framework consists of C++ code and Perl scripts. The C++ code is the 'workhorse', while Perl is used to glue together the appropriate command calls for the C++ code.

Extraction of the Dataset

The images are packed in a bzip2-compressed archive. On a Linux machine, extract it with the command

tar xjvf benchmark_data.tar.bz2

The data requires approximately 1GB of hard drive space.

The code is bzip2-compressed as well; extract it with the command

tar xjvf gt_cmfd.tar.bz2

Library dependencies of the code

To build the code, the system requires a C++ compiler, cmake (minimum version 2.6), boost (tested on boost 1.35 and boost 1.42) and OpenCV version 2.2.

Make sure that these libraries are accessible on your system. Earlier versions of the code have been successfully built with Visual Studio; we expect this to work with the current version as well, although this has not been explicitly tested.
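Before configuring the build, it can help to check that the required command-line tools are on the PATH. The Python sketch below uses the standard `shutil.which` for this. The list of executable names is an assumption (the compiler in particular may be named differently on your system), and the check does not cover the boost and OpenCV libraries, which cmake locates itself.

```python
import shutil


def find_tools(names):
    """Map each tool name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


# Executable names are assumptions; adjust them to your system.
tools = find_tools(["cmake", "ccmake", "make", "perl", "c++"])
for name, path in tools.items():
    print(f"{name}: {path or 'not found'}")
```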

Build instructions

The build system cmake creates a Makefile (or Visual Studio solution, if executed on Microsoft Windows). cmake requires a small amount of configuration, in order to build smoothly. In detail, enter the directory of the extracted code, create a build directory and execute cmake:

cd gt_cmfd

mkdir build

cd build

ccmake ../

(under Microsoft Windows, execute cmake, enter gt_cmfd as source directory, and another directory as build directory.)

When the cmake interface shows up, press 'c' to configure the build. Once this is done, you will most likely be required to enter a configuration value for OpenCV_DIR. Enter the directory within your OpenCV 2.2 installation where the configuration file OpenCVConfig.cmake is located (typically ${opencv_build_dir}/share/opencv). If boost has not been found, you also have to set its paths manually.

Once all components have been found (type 'c' in between to re-configure with your added paths), type 'g' to generate the Makefile. Then, type

make

The binary 'vole' is stored in the newly created subfolder 'bin/', and can be executed by typing

./bin/vole --help

This shows the available subcommands. To get more information about the command line parameters for a particular subcommand, append the subcommand name as first parameter to the program, e.g.

./bin/vole splice --help

Configuration

Using the C++ binaries directly is a bit tedious, as a lot of parameters have to be passed for splicing an image. Thus, we recommend using the provided Perl scripts in the directory 'gt_cmfd/ground_truth_db/scripts/' to generate the command calls.

To create spliced data, the script 'splice_calls.pl' is most important. Before it can be used, edit its configuration file

gt_cmfd/ground_truth_db/scripts/db_setup.pl

and set $vole to the path of the C++ binary and $db_root to your image directory. Consider using absolute paths, so that the commands can be run from anywhere.

In order to create spliced images, the splicing configuration from

gt_cmfd/ground_truth_db/scripts/splice_config.pl

is used. Four setups are already provided:

'nul' for only splicing the images without additional transformations or noise,
'rot' to create rotated forgeries
'rot_noise' to create forgeries with varying rotation and varying noise
'too_much_dont_try' to demonstrate the full bandwidth of parameters that can be adjusted

To create the splices, call the splicing generator with an output directory, the selected configuration and an image number (between 1 and 48), e.g.

./gt_cmfd/ground_truth_db/scripts/splice_calls.pl /var/tmp/ nul 10

This prints the command calls to create the spliced images and the associated ground truth. If you want to post-process the ground truth, consider calling ./bin/vole gt_postproc.
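To cover all 48 images with one configuration, the call above can be repeated in a loop. The Python sketch below only builds the argument lists, following the argument order of the example call (output directory, configuration, image number). The helper name `splice_call_args` is hypothetical, and the script path must match your own checkout.

```python
def splice_call_args(script, out_dir, config, image_numbers=range(1, 49)):
    """Build one argument list per image for splice_calls.pl.

    The argument order follows the example call in the text; image
    numbers must lie between 1 and 48.
    """
    calls = []
    for number in image_numbers:
        if not 1 <= number <= 48:
            raise ValueError(f"image number out of range: {number}")
        calls.append([script, out_dir, config, str(number)])
    return calls


calls = splice_call_args(
    "./gt_cmfd/ground_truth_db/scripts/splice_calls.pl", "/var/tmp/", "nul"
)
# Each entry could then be run, for example with
#   subprocess.run(args, capture_output=True, text=True)
# to collect the printed splicing commands for later execution.
```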

Note that iterating over several parameters at once can quickly blow up the number of output images. For instance, 'too_much_dont_try' creates 7600 calls for every input image, which is clearly too much for a reasonable evaluation.
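The growth is multiplicative: every additional parameter multiplies the number of calls by the number of values it takes. The Python sketch below illustrates this with a purely hypothetical parameter grid. The parameter names and values are invented for the example and do not reproduce any of the provided configurations.

```python
from itertools import product
from math import prod

# Hypothetical parameter grid, for illustration only.
grid = {
    "rotation_deg": [0, 2, 4, 6, 8, 10],
    "scale_percent": [90, 95, 100, 105, 110],
    "noise_level": [0, 1, 2, 3],
    "jpeg_quality": [100, 90, 80, 70],
}

# One output image (plus ground truth) per combination and input image.
combinations = list(product(*grid.values()))
calls_per_image = prod(len(values) for values in grid.values())
print(calls_per_image, "calls per image,", 48 * calls_per_image, "in total")
```

Restricting each setup to the one or two parameters under study keeps the evaluation manageable.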

Contact

For questions or comments please feel free to contact Christian Riess or Johannes Jordan.