News
Singapore's Duplicate Image Problem: The Numbers Driving a Digital Storage Crisis
Government agencies and businesses are drowning in redundant visual data — and the scale of the waste is only now becoming clear.
4 min read
Updated 5 h ago
Singapore's public and private sector databases collectively hold tens of millions of duplicate digital images, a problem that costs organisations measurable money every year and is quietly undermining the city-state's ambition to run lean, AI-ready data infrastructure. The issue sits at the intersection of two converging pressures: explosive growth in image-heavy digital records and a push by agencies to clean their datasets before feeding them into machine learning pipelines.
The timing matters because Singapore's Smart Nation and Digital Government Office has been accelerating AI adoption across government ministries since 2024. Dirty data — and duplicate images are among the most common forms of it — degrades model accuracy and inflates cloud storage bills simultaneously. One industry benchmark widely cited in data engineering circles puts the share of duplicate or near-duplicate images in large unstructured enterprise archives at between 20 and 35 percent. Apply that range to any organisation storing tens of thousands of product photos, identity documents, or property images and the redundancy problem becomes a line item.
The Housing and Development Board manages image records for more than one million residential flats across towns from Woodlands to Tampines. Property listings, resale flat inspections, renovation permits and estate maintenance requests all generate photographs. When the same flat is listed, inspected and re-listed over several years, near-identical images accumulate across different case files without automated deduplication in place. The HDB did not provide a specific figure for duplicate image volumes when contacted for this story, but the structural conditions — high transaction turnover, multi-department workflows, legacy document management systems — are textbook generators of image redundancy.
The Inland Revenue Authority of Singapore and the Ministry of Manpower face analogous pressures on the identity document side. Passport photos, work pass headshots, and supporting application images are submitted repeatedly across different agencies, often as separate uploads rather than through a centralised pull from a single verified source. The National Digital Identity framework, anchored by Singpass, was partly designed to reduce exactly this kind of re-submission friction, but full integration across all document workflows remains incomplete as of mid-2026.
In the private sector, e-commerce platforms operating out of one-north and the Mapletree Business City cluster near Alexandra Road deal with duplicate product imagery at scale. A single SKU, whether a bottle of shampoo or a kitchen appliance, can have dozens of near-identical images uploaded by different sellers or at different resolutions. Platforms that have run internal deduplication audits typically report storage reductions of 15 to 28 percent after a first pass, according to data engineering practitioners familiar with local deployments. At current AWS Singapore regional pricing of roughly USD 0.025 per gigabyte per month for standard S3 storage, a company sitting on 100 terabytes of images pays about USD 2,500 a month for that storage and could save roughly USD 375 to USD 700 of it simply by removing confirmed duplicates, before any compression or tiering strategy is applied.
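The storage arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the figures quoted above (100 terabytes, USD 0.025 per gigabyte per month, a 15 to 28 percent reduction). The function name is hypothetical, 1 TB is assumed to equal 1,000 GB for simplicity, and results stay in US dollars because no exchange rate is given.

```python
def monthly_storage_saving(total_tb, price_per_gb_month, duplicate_share):
    """Return the monthly saving from deleting duplicates, in the price's currency."""
    total_gb = total_tb * 1000  # assumption: 1 TB = 1,000 GB
    return total_gb * price_per_gb_month * duplicate_share


# Figures quoted in the article: 100 TB, USD 0.025/GB/month, 15-28% duplicates.
low = monthly_storage_saving(100, 0.025, 0.15)
high = monthly_storage_saving(100, 0.025, 0.28)
print(f"USD {low:,.0f} to USD {high:,.0f} per month")
```

Because the saving scales linearly with archive size, an organisation holding ten times as much data would see ten times the figure at the same price per gigabyte.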
Perceptual hashing — a technique that generates a compact fingerprint from an image's visual content rather than its file metadata — has become the standard first-line tool for duplicate detection. Libraries such as ImageHash and open-source pipelines built on Python can process several thousand images per minute on modest hardware. The more demanding problem is near-duplicate detection: images of the same subject taken seconds apart, or the same document scanned at slightly different angles. That requires embedding-based similarity search, the kind of vector database work that companies at the Jurong Innovation District and AI Singapore's 100E Pasir Panjang Road campus have been actively building capability around since 2023.
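To make the fingerprinting idea concrete, here is a minimal sketch of one simple perceptual hash, the average hash, written in plain Python. It is an illustration of the technique rather than the ImageHash API: it assumes the image has already been decoded, converted to grayscale and shrunk to a small grid (a real pipeline would use an image library for that step), and the function names and the distance threshold of 5 bits are assumptions chosen for the example.

```python
def average_hash(pixels):
    """Fingerprint a small grayscale grid (rows of 0-255 values) as an integer.

    Each pixel contributes one bit: 1 if it is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")


def is_probable_duplicate(pixels_a, pixels_b, max_distance=5):
    """Flag a pair for review when the fingerprints differ by only a few bits."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# An 8x8 gradient, a slightly brightened copy, and a transposed version.
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brighter = [[min(255, v + 3) for v in row] for row in original]
transposed = [list(col) for col in zip(*original)]

print(is_probable_duplicate(original, brighter))    # True
print(is_probable_duplicate(original, transposed))  # False
```

The brightened copy produces the same fingerprint because every pixel shifts together with the mean, which is why this kind of hash survives re-encoding and small exposure changes. It does not survive rotation or a change of viewpoint, which is where the embedding-based search described above takes over.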
Organisations that have not yet audited their image archives should start with a storage inventory — identifying which systems hold unstructured image data, what formats are in use, and when files were last accessed. Files untouched for more than 24 months and smaller than a defined resolution threshold are the lowest-risk candidates for automated deduplication review. Singapore's Personal Data Protection Commission guidelines require that images containing identifiable individuals be handled under data minimisation principles, which adds both a legal incentive and a compliance framework for organisations to act. The commission's advisory on data protection by design, updated in January 2025, specifically references storage reduction as a concrete implementation of that principle. The cost savings are real. So is the regulatory nudge. The question for most organisations is no longer whether to run deduplication — it is how long they can afford to wait.
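For teams starting the inventory step, the sketch below shows one way a first pass might look in Python, using only the standard library. It counts image files by format and lists those not accessed in roughly 24 months. The extension list, the 730-day cutoff and the function name are assumptions for illustration. The resolution check is left out because it needs an image decoder, and access times can be unreliable on volumes mounted with access-time updates disabled.

```python
import os
import time
from collections import Counter
from pathlib import Path

# Assumed set of image formats; adjust to what the archive actually holds.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".bmp", ".webp"}
STALE_AFTER_DAYS = 730  # roughly 24 months


def inventory_images(root, now=None):
    """Return (count of images per extension, sorted list of stale image paths)."""
    now = time.time() if now is None else now
    cutoff = now - STALE_AFTER_DAYS * 86400
    formats = Counter()
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            ext = path.suffix.lower()
            if ext not in IMAGE_EXTENSIONS:
                continue  # skip non-image files
            formats[ext] += 1
            # st_atime is the last access time reported by the file system.
            if path.stat().st_atime < cutoff:
                stale.append(path)
    return formats, sorted(stale)
```

Calling `inventory_images("/path/to/archive")` returns the two results without modifying or deleting anything, so the output can be reviewed before any deduplication job is scheduled.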

About this article
Published by The Daily Singapore