News
Singapore's Duplicate Image Problem: The Numbers Hiding in Plain Sight
New data reveals the scale of redundant digital imagery clogging government portals, commercial platforms and public archives across the island.
4 min read
Updated 4 h ago

Singapore's public and commercial digital repositories collectively hold tens of millions of images — and a significant share of them are exact or near-exact duplicates. That is the central finding emerging from audit work carried out by technology teams across several statutory boards and private platforms in the first half of 2026, as the city-state's push to rationalise its data infrastructure enters a more aggressive phase.
The timing matters. Singapore's Infocomm Media Development Authority has been driving a broader data hygiene initiative under the Smart Nation 2.0 framework, which places cloud cost efficiency and AI-readiness at the centre of government IT spending. Duplicate images are not merely an aesthetic inconvenience — they consume storage, slow down machine-learning pipelines, and distort the training datasets that underpin the AI tools Singapore is betting heavily on.
Internal audits at several government-linked platforms suggest that duplicate or near-duplicate images account for between 15 and 30 percent of total image libraries, depending on how strictly an organisation has historically enforced upload governance. For large consumer platforms, the ratio climbs higher. One analysis of a mid-sized e-commerce portal operating out of one-north, the research and business park in Buona Vista, found that roughly one in four product images was a functional duplicate of another file already stored in the same database.
The cost is not trivial. The major providers used by Singapore enterprises all have local infrastructure: Amazon Web Services operates its Asia Pacific (Singapore) region, and Google Cloud and Microsoft Azure both run regions with local availability zones. Their object storage is typically priced at a few US cents or less per gigabyte per month, depending on the storage tier. That sounds negligible, but at scale duplicated image libraries translate into measurable waste. A library of 10 million images averaging 2 megabytes each consumes roughly 20 terabytes. If 20 percent of those are duplicates, that is 4 terabytes of redundant data, generating recurring costs every billing cycle with no operational benefit.
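The arithmetic above can be reproduced in a few lines of Python. This is a rough sketch rather than a costing tool: the function name is made up for illustration, it uses decimal units (1 TB = 1,000,000 MB), and the price of US$0.02 per gigabyte per month is a placeholder assumption, not a quoted rate from any provider.

```python
def redundant_storage(num_images, avg_size_mb, duplicate_share, price_per_gb_month):
    """Estimate library size, redundant volume and the monthly cost of the redundancy.

    Uses decimal units: 1 TB = 1,000 GB = 1,000,000 MB.
    """
    total_tb = num_images * avg_size_mb / 1_000_000
    redundant_tb = total_tb * duplicate_share
    # Convert redundant terabytes to gigabytes before applying the per-GB price.
    monthly_cost = redundant_tb * 1_000 * price_per_gb_month
    return total_tb, redundant_tb, monthly_cost


# Figures from the article: 10 million images, 2 MB each, 20 percent duplicates.
# The 0.02 price is a hypothetical placeholder; substitute the actual rate.
total_tb, redundant_tb, monthly_cost = redundant_storage(10_000_000, 2, 0.20, 0.02)
print(f"Library size: {total_tb:.0f} TB")
print(f"Redundant data: {redundant_tb:.0f} TB")
print(f"Monthly cost of duplicates at the assumed rate: US${monthly_cost:.2f}")
```

The monthly figure scales linearly with each input, so a larger library, a higher duplicate share or a pricier storage tier pushes it up in direct proportion.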
The National Library Board, which oversees the National Archives of Singapore on Canning Rise and its digital archive programmes, has been among the more public-facing institutions grappling with this issue. Digitisation drives over the past decade have pulled in material from multiple sources, and without centralised deduplication checkpoints at the point of ingest, parallel copies accumulate across collections. The Archives' digital holdings now run into the hundreds of terabytes across all media types.
The technical approaches to deduplication have matured considerably. Perceptual hashing algorithms, which generate a compact fingerprint of an image based on its visual content rather than its file bytes or metadata, can identify near-duplicates that differ only in compression, resizing, slight cropping, or minor colour adjustment. Such tools are now being evaluated by teams at GovTech, the government's central technology agency, based at the Sandcrawler building in the Fusionopolis area of one-north, as part of a broader data quality toolkit.
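To make the idea of a visual fingerprint concrete, here is a minimal sketch of one of the simplest perceptual hashes, usually called an average hash. It is not the method any agency named here is known to use. It assumes the image has already been decoded to a grid of grayscale values (0 to 255) whose width and height are multiples of the hash size, and the function names are illustrative. Production systems would normally rely on an imaging library to decode and resize files.

```python
def average_hash(pixels, hash_size=8):
    """Return a (hash_size * hash_size)-bit average hash as an integer.

    pixels: 2D list of grayscale values; both dimensions are assumed
    to be multiples of hash_size.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // hash_size, width // hash_size

    # Shrink the image to hash_size x hash_size by averaging each block.
    cells = []
    for by in range(hash_size):
        for bx in range(hash_size):
            total = 0
            for y in range(by * block_h, (by + 1) * block_h):
                for x in range(bx * block_w, (bx + 1) * block_w):
                    total += pixels[y][x]
            cells.append(total / (block_h * block_w))

    # Each bit records whether a cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


# A synthetic 16x16 gradient and a uniformly brightened copy of it.
original = [[x * 8 + y * 4 for x in range(16)] for y in range(16)]
brighter = [[value + 10 for value in row] for row in original]
print(hamming_distance(average_hash(original), average_hash(brighter)))
```

Because every bit is measured against the image's own mean brightness, a uniform brightness shift leaves the fingerprint unchanged, while a visually different image flips many bits. In practice two images are treated as near-duplicates when the distance falls below a chosen threshold, which has to be tuned per collection.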
For commercial operators, the Singapore Business Federation has flagged data rationalisation as a cost-reduction lever for retail and logistics members navigating tighter margins. Product catalogue management — where duplicate imagery is endemic — is an area where automated deduplication can reduce both storage bills and the manual effort required to maintain clean listings.
The practical advice for organisations reviewing their own image libraries is straightforward: establish a deduplication check at the point of upload rather than attempting retrospective cleanup. Retrospective audits on libraries exceeding one million files typically require dedicated compute time measured in days, not hours, even with optimised hashing. A hash gate at ingest costs almost nothing operationally and prevents the problem from compounding: a cryptographic hash such as SHA-256 catches byte-for-byte copies, while a perceptual hash catches near-duplicates that have been re-encoded or resized.
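An upload-time gate of the kind described above can be sketched with Python's standard hashlib module. This is an illustration under simplifying assumptions: the class and method names are hypothetical, the index is an in-memory dictionary where a real system would use a database, and SHA-256 only catches exact byte-for-byte copies.

```python
import hashlib


class IngestGate:
    """Reject uploads whose bytes match a file that was already accepted."""

    def __init__(self):
        # Maps SHA-256 hex digest -> name of the first file seen with it.
        self._seen = {}

    def check(self, file_bytes, name):
        """Return the existing file's name if this is a duplicate, else None.

        New files are recorded so that later copies are caught.
        """
        digest = hashlib.sha256(file_bytes).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = name
        return None


gate = IngestGate()
print(gate.check(b"example image bytes", "product-001.jpg"))  # first upload is accepted
print(gate.check(b"example image bytes", "product-001-copy.jpg"))  # exact copy is flagged
```

Each upload is hashed once and looked up once, which is why the gate adds so little overhead compared with rescanning an entire library after the fact.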
Singapore's ambition to position itself as a regional AI hub depends partly on the quality of the data assets its institutions and companies bring to the table. Redundant imagery is a solvable problem — the numbers just need to be taken seriously first.

About this article
Published by The Daily Singapore