Singapore's Infocomm Media Development Authority flagged duplicate image contamination as a live operational problem this week, as multiple public-sector agencies moved to audit their digital repositories ahead of a third-quarter deadline tied to the Smart Nation 2.0 infrastructure refresh. The issue is more prosaic than it sounds — and more expensive.
Duplicate images clog storage, distort AI training datasets, and inflate licensing costs. For a city-state positioning itself as a regional hub for artificial intelligence and data services, messy back-end repositories are a reputational and commercial liability. The urgency increased after the Government Technology Agency, known as GovTech, circulated an internal guidance note — confirmed by people familiar with the matter but not publicly released as of Saturday — encouraging statutory boards to conduct duplicate-detection sweeps before migrating legacy data to the new Whole-of-Government data platform.
What Triggered the Week's Activity
The immediate catalyst was a routine infrastructure audit at the National Library Board's Lee Kong Chian Reference Library on Victoria Street, where archivists discovered significant duplication among digitised historical photographs uploaded to the NLB's online portal, BookSG, and its successor platforms. Some image batches, particularly those sourced from heritage collections scanned between 2018 and 2022, had been ingested multiple times during platform migrations, according to a person familiar with the audit who was not authorised to speak on record. The NLB declined to provide specific figures before publication.
That finding reverberated because it is not isolated. The Housing and Development Board, which maintains hundreds of thousands of images tied to flat listings, renovation permits, and estate documentation, has been running its own deduplication exercise since May. HDB's digital estate covers more than one million residential units across towns from Tampines to Bukit Batok, and image duplication at that scale creates non-trivial storage overhead. Object storage costs on commercial cloud platforms have hovered around US$0.023 per gigabyte per month for standard tiers, a figure that compounds fast when duplicated assets are stored again in every backup cycle.
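The arithmetic behind that storage overhead can be sketched in a few lines of Python. This is a rough illustration only: the repository size, duplicate share, and number of backup copies below are hypothetical inputs chosen for the example, not figures from HDB or any other agency, and the function name is invented here. Only the US$0.023 rate comes from the reporting above.

```python
# Illustrative only: the repository size, duplicate share, and backup count
# used below are hypothetical assumptions, not figures reported by any agency.

PRICE_PER_GB_MONTH = 0.023  # US$ per GB per month, the standard-tier rate cited above


def duplicate_storage_cost(total_gb, duplicate_fraction, backup_copies=0,
                           price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return the monthly cost (US$) attributable to duplicated data.

    total_gb           -- size of the primary repository in gigabytes
    duplicate_fraction -- share of that data that is redundant (0.0 to 1.0)
    backup_copies      -- number of full backup copies kept alongside the primary
    """
    if not 0 <= duplicate_fraction <= 1:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    if backup_copies < 0:
        raise ValueError("backup_copies cannot be negative")
    duplicated_gb = total_gb * duplicate_fraction
    # The duplicates are paid for once in the primary store and once more
    # in every backup copy that includes them.
    stored_copies = 1 + backup_copies
    return duplicated_gb * stored_copies * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical scenario: 10 TB repository, 20% duplicated, two backup copies.
    monthly = duplicate_storage_cost(total_gb=10_000, duplicate_fraction=0.2,
                                     backup_copies=2)
    print(f"US${monthly:.2f} per month, US${monthly * 12:.2f} per year")
```

Under those assumed inputs, the duplicates alone would cost about US$138 a month. The sum is small for one repository, but it scales linearly with both data volume and the number of backup copies retained.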
GovTech has been testing perceptual hashing tools (software that generates a fingerprint from each image's visual content and flags near-identical copies even when file names, dimensions, or compression settings differ) as part of its data-quality toolkit. The technology is not new. What is new is the institutional pressure to actually deploy it systematically, driven partly by the AI governance frameworks Singapore published in 2024 and updated earlier this year, which require agencies to document the provenance and integrity of data used in automated decision-making.
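For readers unfamiliar with the technique, the sketch below shows one of the simplest perceptual-hash variants, commonly called an average hash. It is not a description of the tools GovTech is testing, which have not been named. It assumes each image has already been decoded into a grid of grayscale values from 0 to 255, and the function names and the 8-by-8 hash size are choices made for this example.

```python
def average_hash(pixels, hash_size=8):
    """Compute an average-hash fingerprint for a grayscale image.

    pixels    -- 2D list of grayscale values (0-255), at least
                 hash_size rows by hash_size columns
    hash_size -- the image is reduced to hash_size x hash_size cells,
                 giving a fingerprint of hash_size**2 bits
    """
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")

    # Step 1: shrink the image by averaging the pixels in each grid cell.
    cells = []
    for r in range(hash_size):
        r0, r1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            c0, c1 = c * width // hash_size, (c + 1) * width // hash_size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))

    # Step 2: set one bit per cell, 1 if the cell is brighter than the mean.
    mean = sum(cells) / len(cells)
    fingerprint = 0
    for value in cells:
        fingerprint = (fingerprint << 1) | (1 if value > mean else 0)
    return fingerprint


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Synthetic 16x16 test image: a left-to-right brightness gradient.
    original = [[x * 16 for x in range(16)] for _ in range(16)]
    brighter = [[v + 5 for v in row] for row in original]    # slight exposure shift
    inverted = [[255 - v for v in row] for row in original]  # a different picture

    base = average_hash(original)
    print(hamming_distance(base, average_hash(brighter)))  # prints 0
    print(hamming_distance(base, average_hash(inverted)))  # prints 64
```

Two files are treated as likely duplicates when the distance between their fingerprints falls below a chosen threshold. That threshold is a tuning decision: set too loosely, it flags distinct photographs of similar scenes; set too tightly, it misses copies that were cropped or heavily recompressed.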
Why This Matters Beyond Housekeeping
The stakes are higher than they appear on a storage invoice. Singapore's AI strategy depends on clean, well-labelled local datasets. Duplicated images in government training sets can produce models that overfit to repeated examples, skewing outputs in subtle ways that are hard to detect downstream. The Smart Nation and Digital Government Office has been explicit, in published policy documents, that data quality is a prerequisite for responsible AI deployment in public services.
For the private sector, the government's deduplication drive carries practical signals. Companies in the Marina Bay and one-north tech clusters that supply data-management services to public agencies say procurement conversations have shifted in recent weeks, with buyers asking more detailed questions about deduplication capabilities and audit trails. None of those conversations have produced announced contracts as of this weekend.
Citizens are unlikely to notice any immediate change in the services they use. MyInfo, the government's personal-data platform, and Singpass-linked services already operate with tighter data-quality controls. The clean-up work is happening in the layers that support back-office functions — archiving, planning, estate management — rather than in consumer-facing applications.
The practical advice for anyone dealing with public-sector digital submissions right now is straightforward: label image files with unique identifiers before uploading, avoid resubmitting documents from earlier applications, and check agency portals for updated file-naming conventions, several of which were quietly revised this month. GovTech has indicated it will publish updated developer guidelines covering image submission standards before the end of July.
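One way to follow the first two pieces of advice is to derive the identifier from the file's contents, so that the same image always receives the same label and a repeat upload is easy to spot. The sketch below is a minimal illustration in Python. The naming pattern and function names are hypothetical and do not reflect any agency's published convention, so submitters should defer to whatever each portal specifies.

```python
import hashlib
from pathlib import Path


def content_id(data: bytes, length: int = 16) -> str:
    """Return a short identifier derived from the file's bytes (SHA-256 prefix)."""
    return hashlib.sha256(data).hexdigest()[:length]


def labelled_name(original_name: str, data: bytes) -> str:
    """Build a file name of the form <stem>_<content id><extension>.

    The pattern is a hypothetical example, not an official convention.
    """
    path = Path(original_name)
    return f"{path.stem}_{content_id(data)}{path.suffix.lower()}"


def is_resubmission(data: bytes, already_submitted: set) -> bool:
    """Check whether byte-identical content has been submitted before.

    already_submitted -- a set of content ids kept by the submitter
    """
    return content_id(data) in already_submitted


if __name__ == "__main__":
    image_bytes = b"example image bytes"  # stand-in for a real file's contents
    seen = set()

    print(labelled_name("Kitchen_Reno.JPG", image_bytes))
    print(is_resubmission(image_bytes, seen))  # prints False
    seen.add(content_id(image_bytes))
    print(is_resubmission(image_bytes, seen))  # prints True
```

A content-derived identifier only catches byte-identical copies. An image that has been resized or re-saved produces a different identifier, which is the gap perceptual hashing is meant to cover.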