The Image Manipulation Dataset is a ground truth database for benchmarking the detection of image tampering artifacts.
It includes 48 base images, separate snippets from these images, and a software framework for creating ground truth data. The idea is to "replay" copy-move forgeries by copying, scaling and rotating semantically meaningful image regions. Additionally, Gaussian noise and JPEG compression artifacts can be added, both to the snippets and to the final tampered images. As a consequence, the dataset is not limited to copy-move forgery detection: it can also be used to evaluate algorithms for the detection of resampling and double-JPEG compression.
To illustrate this, have a look at the walk-through example to the right.
In the first row, the original image (left) shows a person in front of stones. The tampered image (middle) does not show the person, who has been covered by copies of the surroundings. The right image visualizes the image regions that were involved in hiding the person.
In the second row, the copied regions are shown separately (left, middle). On the right, a ground-truth image for copy-move forgery detection is shown. Here, white regions denote copy-moved pixels.
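Given such a ground-truth image, one common way to score a detector is to compare its output map pixel by pixel and report precision, recall and F1. The following is a minimal sketch and not part of the framework. It assumes that both images are available as nested lists of 8-bit gray values and that exactly the value 255 marks a copy-moved pixel; the names to_mask and pixel_scores are hypothetical.

```python
WHITE = 255  # assumption: white (copy-moved) pixels are stored as 255


def to_mask(image):
    """Turn a nested list of gray values into a nested list of booleans."""
    return [[value == WHITE for value in row] for row in image]


def pixel_scores(detected, ground_truth):
    """Return (precision, recall, f1) of a detection map against ground truth."""
    det_mask, gt_mask = to_mask(detected), to_mask(ground_truth)
    # Both maps must have the same number of rows and the same row lengths.
    if [len(row) for row in det_mask] != [len(row) for row in gt_mask]:
        raise ValueError("detection map and ground truth differ in size")

    tp = fp = fn = 0
    for det_row, gt_row in zip(det_mask, gt_mask):
        for det, gt in zip(det_row, gt_row):
            if det and gt:
                tp += 1  # correctly detected copy-moved pixel
            elif det:
                fp += 1  # marked as copy-moved, but it is not
            elif gt:
                fn += 1  # copy-moved pixel that was missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Empty denominators are mapped to 0.0 here, so a detector that marks nothing scores zero instead of raising an error; other conventions are possible.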
With the provided software, the original image and the provided snippets can be combined in many different ways. For instance, the snippets can be scaled slightly, and JPEG compression artifacts can be added before insertion. The software subsequently computes the ground truth for the respective transformation of the snippets.
Thus, by changing the splicing parameters, the 48 image+snippet combinations serve as a rich toolbox for extensive evaluation of image forensics algorithm performance.
The transformation and splicing process is completely script-guided, so that the setup for the evaluation in a particular article can be reproduced directly. Additionally, the source code of the framework is published under an open source license, and we are happy to receive your suggestions for extensions, or even your patches to extend its usefulness for the community.
The image data is available here, and the software framework for ground truth generation here.
UPDATE! It is now possible to download the benchmark images that we used for our TIFS paper directly (~30GB data). Scroll to the end of the section to access the data.
If you generate the data yourself, you need both the image data and the software framework to create a ground-truth benchmark. To build the software framework, please see the build instructions below.
If you intend to use the benchmark for copy-move forgery detection (CMFD), you may be interested in our CMFD project page for some more resources.
If you use the code or the data, please cite our paper: V. Christlein, C. Riess, J. Jordan, C. Riess, E. Angelopoulou: "An Evaluation of Popular Copy-Move Forgery Detection Approaches", IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1841-1854, 2012.
Precomputed Dataset:
If you intend to use the benchmark data that was also used for the IEEE TIFS paper, then you can use the precomputed data from below. The file names contain the parameters of the respective operation. For instance, "gj80" means "addition of global (i.e. full-image) JPEG artifacts of quality 80". Similarly, "r4" means "rotation of the snippet by 4 degrees".
Note that file names containing the substring "_gt" are the ground truth for the respective file name without that string.
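The naming rules above can be applied mechanically. The following is a minimal sketch and not part of the framework. It decodes only the two codes named above ("gj" and "r") and assumes that the parameter codes are separated by underscores; the file name in the usage example and the function name parse_name are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Only the two codes explained in the text are decoded; others are ignored.
KNOWN_CODES = {
    "gj": "global_jpeg_quality",  # "gj80": global JPEG artifacts, quality 80
    "r": "rotation_degrees",      # "r4": snippet rotated by 4 degrees
}


def parse_name(file_name):
    """Decode the parameters in a file name and locate the matching image."""
    stem = PurePosixPath(file_name).stem
    is_ground_truth = "_gt" in stem
    params = {}
    # Assumption: parameter codes are separated by underscores.
    for token in stem.replace("_gt", "").split("_"):
        match = re.fullmatch(r"([a-z]+)(\d+)", token)
        if match and match.group(1) in KNOWN_CODES:
            params[KNOWN_CODES[match.group(1)]] = int(match.group(2))
    return {
        "is_ground_truth": is_ground_truth,
        # The tampered image has the same name without the "_gt" substring.
        "image_name": file_name.replace("_gt", ""),
        "params": params,
    }
```

For a hypothetical file "example_r4_gj80_gt.png", parse_name reports a ground-truth file that belongs to "example_r4_gj80.png", with a rotation of 4 degrees and global JPEG quality 80.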
Most of the test cases also exist in downscaled versions, for evaluating the performance deterioration of CMFD on downscaled forgeries:
The framework consists of C++ code and Perl scripts. The C++ code is the 'workhorse', while Perl is used to glue together the appropriate command calls for the C++ code.
The images are packed in a bzip2-compressed archive. On a Linux machine, extract it with the command
tar xjvf benchmark_data.tar.bz2
The data requires approximately 1GB of hard drive space.
The code is bzipped as well. Extract it with the command
tar xjvf gt_cmfd.tar.bz2
To build the code, the system requires a C++ compiler, cmake (minimum version 2.6), boost (tested on boost 1.35 and boost 1.42) and OpenCV version 2.2.
Make sure that these libraries are accessible on your system. Earlier versions of the code have been successfully built on Visual Studio; we expect this to work with the current version as well, although this has not been explicitly tested.
The build system cmake creates a Makefile (or Visual Studio solution, if executed on Microsoft Windows). cmake requires a small amount of configuration, in order to build smoothly. In detail, enter the directory of the extracted code, create a build directory and execute cmake:
cd gt_cmfd
mkdir build
cd build
ccmake ../
(Under Microsoft Windows, execute cmake, enter gt_cmfd as source directory, and another directory as build directory.)
The cmake interface shows up; press 'c' to configure the build. Once this is done, you will most likely be required to enter a configuration value for OpenCV_DIR. Enter here the directory within your OpenCV 2.2 installation where the configuration file OpenCVConfig.cmake is located (typically ${opencv_build_dir}/share/opencv). If boost has not been found, you also have to fix its paths manually.
Once all components have been found (type 'c' in between to re-configure with your added paths), type 'g' to generate the Makefile. Then, type
make
The binary 'vole' is stored in the newly created subfolder 'bin/', and can be executed by typing
./bin/vole --help
This shows the available subcommands. To get more information about the command line parameters for a particular subcommand, append the subcommand name as first parameter to the program, e.g.
./bin/vole splice --help
Using the C++ binaries directly is a bit tedious, as a lot of parameters have to be passed for splicing an image. Thus, we recommend using the provided Perl scripts in the directory 'gt_cmfd/ground_truth_db/scripts/' to generate the command calls.
To create spliced data, the script 'splice_calls.pl' is the most important one. Before it can be used, edit its configuration file
gt_cmfd/ground_truth_db/scripts/db_setup.pl
and set $vole to the path of the C++ binary, and $db_root to your image directory. Consider using absolute paths, so that the commands can be run from anywhere.
In order to create spliced images, the splicing configurations from
gt_cmfd/ground_truth_db/scripts/splice_config.pl
are used. Four setups are already provided:
To create the splices, call the splicing generator with an output directory, the selected configuration and an image number (between 1 and 48), e.g.
./gt_cmfd/ground_truth_db/scripts/splice_calls.pl /var/tmp/ nul 10
This prints the command calls to create the spliced images and the associated ground truth. If you want to post-process the ground truth, consider calling ./bin/vole gt_postproc.
Note that iterating over several parameters can quickly inflate the number of output images. For instance, 'too_much_dont_try' creates 7600 calls for every input image, which is clearly too much for a reasonable evaluation.
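To cover the whole dataset, the splicing generator has to be called once per image number. The following is a minimal sketch of one way to prepare and run these invocations. It is not part of the framework: the function names are hypothetical, and it only assumes the argument order from the example call above (output directory, configuration, image number).

```python
import subprocess

NUM_IMAGES = 48  # the dataset contains 48 base images


def splice_call_commands(script, output_dir, config, image_numbers=None):
    """Build one splice_calls.pl invocation per image number."""
    if image_numbers is None:
        image_numbers = range(1, NUM_IMAGES + 1)
    commands = []
    for number in image_numbers:
        if not 1 <= number <= NUM_IMAGES:
            raise ValueError(f"image number {number} is outside 1..{NUM_IMAGES}")
        # Argument order as in the example call: output dir, config, image.
        commands.append([script, output_dir, config, str(number)])
    return commands


def collect_calls(commands):
    """Run each invocation and return all command calls it prints."""
    lines = []
    for command in commands:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True
        )
        lines.extend(result.stdout.splitlines())
    return lines
```

Since the script only prints the calls, the collected lines can be written to a file and inspected before anything is executed. Checking their number first is worthwhile: at 7600 calls per image, a configuration like 'too_much_dont_try' would produce 364,800 calls for all 48 images.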
For questions or comments please feel free to contact Christian Riess or Johannes Jordan.