ElectroCom61 Dataset
Computer Vision, Dataset, Electronics, Multi-Class

ElectroCom61: A Multiclass Dataset for Detection of Electronic Components


Project Overview: ElectroCom61 is a curated dataset comprising 2,121 annotated images of 61 different electronic components collected from the Electronic Lab Support Room at United International University (UIU). It is designed to train and evaluate machine learning models for real-time component detection. Images were captured from multiple angles under diverse lighting and background conditions to reflect real-world variability. Each image was standardized through auto-orientation and resized to 640×640 pixels. The dataset is split into training (70%), validation (20%), and test (10%) sets to support robust evaluation.
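A 70/20/10 split like the one described can be sketched in a few lines. This is a minimal illustration, not the procedure actually used for ElectroCom61: the `split_dataset` helper, the fixed seed, and the placeholder file names are all assumptions made for the example.

```python
import random

def split_dataset(items, train_frac=0.70, val_frac=0.20, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, roughly 10%
    return train, val, test

# Hypothetical file names standing in for the 2,121 images.
image_names = [f"img_{i:04d}.jpg" for i in range(2121)]
train, val, test = split_dataset(image_names)
```

Because the fractions are truncated to whole images, the per-split counts this sketch produces may differ by an image or two from the published split.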


MosquitoFusion Dataset
Computer Vision, Object Detection, Mosquito Dataset, ICLR 2024

MosquitoFusion: A Multiclass Dataset for Real-Time Mosquito Detection


Project Overview: The MosquitoFusion dataset contains 1,204 curated and annotated images for real-time mosquito detection. The data is split into training (87%), validation (8%), and test (5%) subsets. Preprocessing includes auto-orientation and resizing to 640×640 pixels, and images without annotations are filtered out. The dataset is further augmented with flipping, cropping, rotation, and grayscale conversion to improve robustness and generalizability. The dataset was accepted at ICLR 2024 (Tiny Papers Track) as an invited oral presentation in the “Notable” category.
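Two of the steps mentioned here, dropping unannotated samples and flipping, can be sketched on the label side alone. The example assumes boxes are stored as normalized `(class_id, x_center, y_center, width, height)` tuples; that layout, the sample dictionaries, and both helper names are assumptions for illustration and are not stated for MosquitoFusion.

```python
def drop_unannotated(samples):
    """Keep only samples that carry at least one bounding box."""
    return [s for s in samples if s["boxes"]]

def hflip_boxes(boxes):
    """Mirror normalized (class_id, x_center, y_center, width, height) boxes.

    A left-right image flip only changes x_center; sizes and y stay the same.
    """
    return [(c, 1.0 - x, y, w, h) for (c, x, y, w, h) in boxes]

# Hypothetical samples: one annotated image and one without labels.
samples = [
    {"image": "a.jpg", "boxes": [(0, 0.25, 0.5, 0.2, 0.3)]},
    {"image": "b.jpg", "boxes": []},  # unannotated, filtered out below
]
kept = drop_unannotated(samples)
flipped = hflip_boxes(kept[0]["boxes"])
```

The point of the sketch is that geometric augmentations have to be applied to the labels as well as the pixels, otherwise the boxes no longer match the objects.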


JailbreakTracer Corpus
LLM Safety, Prompt Classification, Synthetic Data, Toxicity Detection

JailbreakTracer Corpus: Datasets for Toxic Prompt and Forbidden Question Classification


Project Overview: The JailbreakTracer Corpus contains two curated datasets for analyzing and classifying prompts that aim to bypass the safety mechanisms of large language models (LLMs).

Toxic Prompt Classification Dataset: Initially consists of 16,029 non-toxic and 1,952 toxic prompts. After synthetic augmentation using GPT, the dataset expanded to 37,333 toxic and 16,053 non-toxic prompts, which addresses the original scarcity of toxic examples and supports the training of LLM safety classifiers.

Forbidden Question Reasoning Dataset: Designed for reasoning-based classification of unethical prompts across 13 distinct categories, including Hate Speech, Malware, Economic Harm, Pornography, Legal Opinion, and more. Each class includes 8,250 samples for balanced multi-class modeling.
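The counts quoted for the two datasets can be checked with a little arithmetic. The sketch below uses only the figures given above; the `toxic_share` helper and the variable names are introduced for illustration.

```python
def toxic_share(toxic, non_toxic):
    """Fraction of prompts labeled toxic."""
    return toxic / (toxic + non_toxic)

# Toxic Prompt Classification Dataset, before and after augmentation.
before = toxic_share(toxic=1_952, non_toxic=16_029)
after = toxic_share(toxic=37_333, non_toxic=16_053)
print(f"toxic share before: {before:.1%}, after: {after:.1%}")

# Forbidden Question Reasoning Dataset: 13 categories, 8,250 samples each.
forbidden_total = 13 * 8_250
print(f"forbidden question samples: {forbidden_total:,}")
```

The toxic share moves from roughly 11% to roughly 70%, so after augmentation the toxic class is the larger one; a classifier trained on these counts may still benefit from class weighting or a stratified evaluation split.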


Let's Connect