Istanbul Technical University

Signal Processing for Computational Intelligence Group

Uterine Cervical Cancer Dataset

Cervical cancer caused by human papillomavirus (HPV) is one of the cancers that can be prevented through periodic screening. With early diagnosis, many patients can survive and regain their health. Evaluating the Hematoxylin and Eosin (H&E) digital histopathology images used in these screenings is therefore very important, because the accuracy of the diagnosis can change the course of treatment. As a result, the development of AI systems that can assist pathologists in the diagnostic process has gained great importance. With this in mind, we are releasing a dataset of large-scale H&E images to researchers worldwide. We hope that the methods developed with this dataset will lead to important milestones in computer-aided diagnosis.


For this dataset, 128 whole slides obtained from 54 patients in the pathology laboratory of IMU Hospital were scanned at high resolution. The slides were stained with H&E and with the Ki67 and p16 immunohistochemical stains. Digital images acquired at x20 and x40 optical magnification were saved in uncompressed TIFF format. The whole slide images range in size from 7,500x7,700 to 55,700x165,000 pixels. Whole slide images are difficult to process, and a single slide may contain lesions of more than one class. For this reason, lesions were cropped from the whole slide images (a). Each lesion (b) was then divided into smaller pieces to create Small Epithelial Pieces (SEP) (c). In total, the dataset contains 128 whole slides, 350 lesions and 957 SEP images.
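To give a rough sense of why the whole slide images are hard to process, the sketch below estimates the uncompressed in-memory size of the smallest and largest images from the dimensions quoted above. It assumes 8-bit RGB pixels (3 bytes per pixel), which is an assumption and is not stated in the dataset description; the function name is illustrative only.

```python
def uncompressed_bytes(width: int, height: int,
                       channels: int = 3, bytes_per_sample: int = 1) -> int:
    """Raw size of an image in bytes.

    channels=3 and bytes_per_sample=1 assume 8-bit RGB, which is an
    assumption here, not a documented property of the dataset.
    """
    return width * height * channels * bytes_per_sample


# Dimensions quoted in the dataset description. The order of width and
# height does not matter for the product.
smallest = uncompressed_bytes(7_500, 7_700)
largest = uncompressed_bytes(55_700, 165_000)

print(f"smallest: {smallest / 1e9:.2f} GB")  # smallest: 0.17 GB
print(f"largest:  {largest / 1e9:.2f} GB")   # largest:  27.57 GB
```

Under that assumption, the largest slide would not fit in the memory of a typical workstation as a single array, which is one reason to work with the cropped lesion and SEP images instead.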

The diagnosis for each lesion and SEP is given according to the cervical intraepithelial neoplasia (CIN) system. Each image carries independent diagnoses from two pathologists. Where the two pathologists disagreed, a third diagnosis based on Ki67 and p16 was provided for consensus. The dataset therefore captures inter-observer variability.
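One common way to quantify agreement between two annotators is Cohen's kappa. The sketch below is a minimal pure-Python version, offered only as an illustration: the label lists are made up, and the dataset description does not say which agreement measure, if any, the authors used.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal frequencies,
    # summed over all classes.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    classes = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in classes) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical class throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical diagnoses for four images (not taken from the dataset).
pathologist_1 = ["CIN1", "CIN2", "CIN3", "CIN1"]
pathologist_2 = ["CIN1", "CIN2", "CIN2", "CIN1"]
kappa = cohen_kappa(pathologist_1, pathologist_2)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance.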

Digital images are indexed using the following convention: Whole Slide Name _ Lesion Number _ SEP Number (WSN_LS_#_PC_#). All lesion and SEP images come with a mask for segmentation tasks. Papilla coordinates are provided in addition to the upper and lower bounds. Class distributions for the CIN and squamous intraepithelial lesion (SIL) systems are presented in Table-1 and Table-2, which show that the dataset is imbalanced.
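The indexing convention can be parsed with a short regular expression. The sketch below assumes that `#` stands for a decimal number, that lesion images omit the `_PC_#` part, and that a file extension may follow the index. These are assumptions, as are the example file names; the actual files may differ.

```python
import re
from pathlib import PurePath
from typing import NamedTuple, Optional

# Assumed pattern: <slide name>_LS_<lesion number>[_PC_<SEP number>]
_INDEX_RE = re.compile(
    r"^(?P<slide>.+?)_LS_(?P<lesion>\d+)(?:_PC_(?P<sep>\d+))?$"
)


class ImageIndex(NamedTuple):
    slide: str          # whole slide name
    lesion: int         # lesion number within the slide
    sep: Optional[int]  # SEP number within the lesion, None for lesion images


def parse_image_name(name: str) -> ImageIndex:
    """Split an image name into slide name, lesion number and SEP number."""
    stem = PurePath(name).stem  # drop directory and extension, if any
    match = _INDEX_RE.match(stem)
    if match is None:
        raise ValueError(f"name does not follow WSN_LS_#_PC_#: {name!r}")
    sep = match.group("sep")
    return ImageIndex(
        slide=match.group("slide"),
        lesion=int(match.group("lesion")),
        sep=int(sep) if sep is not None else None,
    )


# Hypothetical file names, for illustration only.
print(parse_image_name("WS01_LS_2_PC_5.tif"))
print(parse_image_name("WS01_LS_2.tif"))
```

Grouping SEP images by the `(slide, lesion)` pair recovered this way keeps pieces of the same lesion together, for example when building train and test splits.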

If you would like to use this dataset, please fill out this form and we will reply with instructions.

The Uterine Cervical Cancer Dataset was proposed in the following paper: Albayrak, Abdulkadir, Asli Unlu Akhan, Nurullah Calik, Abdulkerim Capar, Gokhan Bilgin, Behcet Ugur Toreyin, Bahar Muezzinoglu, Ilknur Turkmen, and Lutfiye Durak-Ata. "A whole-slide image grading benchmark and tissue classification for cervical cancer precursor lesions with inter-observer variability." Medical & Biological Engineering & Computing (2021): 1-17.

If you use this dataset, please cite this paper.