CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity

Anonymous Author(s)

Cryptic species are groups of two or more species that are nearly indistinguishable based on visual characteristics alone. State-of-the-art AI-ready biodiversity datasets (TreeOfLife-10M, BioTrove, TaxaBind-8K) focus on taxon identification holistically. Benchmark datasets that directly address the morphological confusion within such groups are significantly smaller and target only a single taxon. Thus, the challenge of accurately identifying such subtle differences across a wide range of taxa remains to be investigated.

Abstract

We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species groups, specifically curated to support the development of AI models in the context of biodiversity identification applications. Visually confusing or cryptic species are groups of two or more taxa that are nearly indistinguishable based on visual characteristics alone. While much existing work addresses taxonomic identification in a broad sense, datasets that directly address the morphological confusion of cryptic species are small, manually curated, and target only a single taxon. Thus, the challenge of identifying such subtle differences in a wide range of taxa remains unaddressed. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K cryptic groups spanning 67K species, represented in 166 million images. Rich research-graded metadata annotations including scientific, multicultural, and multilingual species terminology, hierarchical taxonomy, spatiotemporal context, and cryptic group species, pose unique challenges for multimodal AI in biodiversity research, and are readily accessible via the CrypticBio-Curate pipeline. To highlight the importance of the dataset, we benchmark a suite of state-of-the-art models across CrypticBio curated subsets of common, unseen, endangered, and invasive species, and demonstrate the substantial impact of spatiotemporal context on zero-shot learning for cryptic species. By introducing CrypticBio, we aim to catalyze progress toward real-world-ready biodiversity AI models capable of handling the nuanced challenges of species ambiguity.

Dataset Description

The dataset is curated from research-grade observations provided by the Global Biodiversity Information Facility (GBIF), including validated data from iNaturalist and Observation.org. Images in CrypticBio are annotated with rich metadata including detailed taxonomic descriptions and observation context, enabling extensive filtering and analysis. Cryptic species groups for each species are organically derived from iNaturalist's record of historical misidentifications. We include species scientific names with multicultural and multilingual species vernacular naming practices from the iNaturalist Taxonomy Archive, to preserve ecological knowledge and increase cultural reach.

Spatiotemporal context is included as an additional modality that can be aligned with species image-text embeddings, as shown in TaxaBind.

Cryptic species have historically emerged as a consequence of biogeographic isolation (natural barriers, such as rivers, mountains, or deserts; deforestation; agricultural expansion; or man-made structures) which disrupted gene flow between populations and ultimately promoted allopatric divergence over evolutionary timescales, as shown below.

We hypothesize that the integration of spatiotemporal context will provide complementary cues beyond visual appearance alone and ultimately enhance the identification accuracy of cryptic species.

Dataset Curation Pipeline

Our GitHub repository includes the data preparation pipeline. The metadata can be downloaded from the Hugging Face dataset cards: CrypticBio (main) and CrypticBio-Benchmarks. The procedure below generates machine learning-ready data from the downloaded metadata in four steps.


# The helper functions used here (load_config, load_metadata, filter_metadata,
# image_download, save_dataset) are assumed to come from the pipeline repository.

# Load configuration
config = load_config('config.yaml')

# Step 1: Load and process metadata
params = config.get('dataset', {})
dataset = load_metadata(**params)

# Step 2: Filter metadata
params = config.get('filters', {})
filtered = filter_metadata(dataset, **params)

# Step 3: Download images
params = config.get('download', {})
filtered = image_download(filtered, **params)

# Step 4: Save the machine learning-ready dataset
params = config.get('generate_pairs', {})
save_dataset(filtered, **params)

Models and Benchmarks

CrypticBio consists of several benchmark datasets. We evaluate state-of-the-art CLIP-style models trained on biodiversity data using the scientific and vernacular terminology of species. We use BioCLIP; BioTrove-CLIP's BioCLIP and OpenAI ViT-B-16 variants; and TaxaBind as image-only baseline models. For multimodal learning, we add embeddings obtained from the image encoders to those obtained from TaxaBind location and environmental features encoders, which are then used for zero-shot classification.
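The fusion step described above can be illustrated with a minimal sketch. It assumes that each embedding is unit-normalized before the element-wise addition and that classification picks the label with the highest cosine similarity; the function names (`l2_normalize`, `fuse`, `zero_shot_predict`), the labels, and the 3-dimensional toy vectors are hypothetical and are not taken from the benchmark code or from any model's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse(image_emb, context_emb):
    """Add unit-normalized image and context embeddings, then renormalize.

    Normalizing before the addition is an assumption of this sketch.
    """
    img = l2_normalize(image_emb)
    ctx = l2_normalize(context_emb)
    return l2_normalize([a + b for a, b in zip(img, ctx)])


def zero_shot_predict(query_emb, label_embs):
    """Return the label whose text embedding is most similar to the query."""
    query = l2_normalize(query_emb)
    scores = {
        label: sum(q * t for q, t in zip(query, l2_normalize(text_emb)))
        for label, text_emb in label_embs.items()
    }
    return max(scores, key=scores.get)


# Toy embeddings, made up for illustration only.
label_embs = {"species_a": [1.0, 0.0, 0.0], "species_b": [0.0, 1.0, 0.0]}
image_emb = [0.71, 0.70, 0.0]   # visually ambiguous between the two species
location_emb = [0.0, 1.0, 0.0]  # location context favors species_b

image_only = zero_shot_predict(image_emb, label_embs)
fused = zero_shot_predict(fuse(image_emb, location_emb), label_embs)
print(image_only, fused)
```

In this toy case the image alone narrowly favors `species_a`, while adding the location embedding shifts the prediction to `species_b`, which is the kind of complementary cue the multimodal setup is meant to exploit.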

Existing benchmarks are handpicked and target only a single taxon; our new benchmarks are described below.

New Benchmarks

From CrypticBio, we created four new benchmark datasets for fine-grained image classification: CrypticBio-Common, CrypticBio-CommonUnseen, CrypticBio-Endangered, and CrypticBio-Invasive.

CrypticBio-Common We curate a cryptic species subset of common species from each of the taxa Arachnida, Aves, Insecta, Plantae, Fungi, Mollusca, and Reptilia, spanning n=158 species. We randomly select 100 samples from each species in a cryptic group where there are more than 150 observations per species.

CrypticBio-CommonUnseen To assess zero-shot performance on common species from CrypticBio-Common not encountered during training of state-of-the-art models, we specifically curate a subset spanning data from 01-09-2024 to 01-04-2025. We randomly select 100 samples from each species in a cryptic group where there are more than 150 observations per species, spanning n=133 species.

CrypticBio-Endangered We propose a cryptic species subset of endangered species according to the global IUCN Red List. We randomly select 30 samples from the Arachnida, Fungi, Insecta, Mollusca, and Reptilia taxa and their corresponding cryptic groups, spanning n=37 species, filtering out taxa with fewer than 150 observations.

CrypticBio-Invasive We also propose a cryptic species subset of invasive alien species (IAS) according to the Global Invasive Species Database (GISD). IAS are a significant concern for biodiversity, as their records appear to be rising exponentially across the Earth. We randomly select 100 samples from each invasive species cryptic group, spanning n=72 species, filtering out taxa with fewer than 150 observations.
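The per-species sampling rule shared by these benchmarks can be sketched as follows. This is an illustration under stated assumptions, not the CrypticBio-Curate implementation: the function name `sample_benchmark`, the record schema (dicts with `species` and `image_id` keys), and the fixed seed are hypothetical. The threshold follows the "more than 150 observations" wording used for CrypticBio-Common.

```python
import random
from collections import defaultdict


def sample_benchmark(observations, per_species=100, min_observations=150, seed=0):
    """Randomly pick `per_species` records for every sufficiently observed species.

    `observations` is a list of dicts with a "species" key (hypothetical schema).
    Species with `min_observations` or fewer records are skipped.
    """
    by_species = defaultdict(list)
    for obs in observations:
        by_species[obs["species"]].append(obs)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for species in sorted(by_species):
        records = by_species[species]
        if len(records) > min_observations:
            # Sampling without replacement, so no record appears twice.
            subset.extend(rng.sample(records, per_species))
    return subset


# Toy input: only species "a" has more than 150 observations.
observations = (
    [{"species": "a", "image_id": i} for i in range(200)]
    + [{"species": "b", "image_id": i} for i in range(150)]
    + [{"species": "c", "image_id": i} for i in range(10)]
)
subset = sample_benchmark(observations)
print(len(subset))
```

Setting `per_species=30` would correspond to the sample size stated for CrypticBio-Endangered.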

Example images from CrypticBio benchmarks: (left) CrypticBio-Endangered Calidris pygmaea cryptic species group; (right) CrypticBio-Invasive Acacia mearnsii cryptic species group.

Empirical Analysis

The table below reports top-1 zero-shot accuracy on various benchmarks. We include 95% confidence intervals for all reported metrics, calculated using the binomial proportion confidence interval method (denoted as ±). Generally, mixing scientific and common terminology yields the best performance. We find that fusing image and location embeddings improves zero-shot image classification performance for cryptic species groups.
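For readers who want to reproduce the ± values, the sketch below shows one common way to compute a 95% binomial proportion confidence interval. The text does not state which variant is used, so this assumes the normal-approximation (Wald) interval with z = 1.96; the function name `binomial_ci` and the example counts are hypothetical.

```python
import math


def binomial_ci(correct, total, z=1.96):
    """Return (accuracy, half_width) of a normal-approximation binomial interval.

    The Wald interval is an assumption of this sketch; z = 1.96 corresponds
    to a 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width


# Made-up counts: 80 correct predictions out of 100 samples.
accuracy, half_width = binomial_ci(80, 100)
print(f"{accuracy:.3f} ± {half_width:.3f}")  # 0.800 ± 0.078
```

The half-width shrinks with the square root of the sample count, so smaller subsets yield wider intervals than larger ones at the same accuracy.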

Zero-shot learning on various models and benchmarks. I / L / E refer to image / location / environmental features embeddings; AP refers to Amazon Parrots; SLP refers to Squamata Lacertidae Podarcis; CRR refers to Chiroptera Rhinolophidae Rhinolophus; CB-C refers to CrypticBio-Common; CB-CU refers to CrypticBio-CommonUnseen; CB-E refers to CrypticBio-Endangered; CB-I refers to CrypticBio-Invasive; WA refers to weighted average; BC refers to BioCLIP; BT-B refers to BioTrove-CLIP-BioCLIP; BT-O refers to BioTrove-CLIP-OpenAI; TB refers to TaxaBind. Location (L) and environmental features (E) are TaxaBind embeddings. All new benchmarks report zero-shot accuracy with mixed scientific and vernacular terminology.