Singapore's public and private digital archives are sitting on a problem that has quietly compounded for years. Duplicate images — the same photograph stored multiple times under different file names, metadata tags, or database entries — now clog the storage systems of organisations ranging from the National Library Board to private real estate portals listing HDB resale flats in Tampines and Jurong West. The immediate question is not whether to fix it, but how, and who pays.
The issue has sharpened into a genuine policy concern because Singapore is spending heavily to position itself as a regional data and AI hub. The Infocomm Media Development Authority's Digital Connectivity Blueprint, released in 2023, set ambitious targets for data infrastructure efficiency. Running databases bloated with duplicate images directly contradicts those targets, and it wastes server capacity at a moment when data centre electricity costs in Singapore have climbed alongside broader energy price pressures.
Why the Moment of Decision Has Arrived
Two converging pressures are forcing the issue now. First, the Government Technology Agency, known as GovTech, has been expanding its centralised cloud migration push across public-sector systems. That migration creates a natural audit point: agencies must inventory what they are moving, and duplicate images surface immediately when proper deduplication checks are applied. Second, commercial property and housing platforms operating out of offices along Cecil Street and Shenton Way have started competing on data quality, not just listing volume. A portal that can guarantee a clean, single canonical image per property listing gains a measurable edge in a market where buyers scroll through hundreds of entries on their phones.
The National Heritage Board faces a related but distinct version of the challenge. Its digitisation programmes — including the ongoing work to archive photographs from the former National Library building on Stamford Road — have produced high-resolution scans that sometimes exist in three or four near-identical versions across different project folders. Rationalising those records requires human curatorial judgment, not just automated deduplication software, because two images that look identical to an algorithm may carry different provenance metadata that historians need.
The costs are real. Cloud storage in Singapore's enterprise market runs at roughly S$0.023 per gigabyte per month on standard tiers, according to publicly available pricing from major hyperscalers with regional nodes here. For a mid-sized government agency storing tens of millions of image files, even a 20 percent reduction in duplicates translates to a meaningful annual saving — and that figure scales sharply for larger repositories.
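A rough worked example shows how the arithmetic runs. The sketch below assumes that a 20 percent reduction means one-fifth of stored volume is reclaimed. The file count and average file size are hypothetical placeholders, not figures from any agency; only the S$0.023 rate comes from the pricing cited above.

```python
# Storage rate quoted in the article (S$ per gigabyte per month).
PRICE_PER_GB_MONTH_SGD = 0.023


def annual_saving_sgd(num_files: int, avg_file_mb: float, reclaimed_share: float) -> float:
    """Estimate yearly storage savings from removing duplicate images.

    num_files, avg_file_mb and reclaimed_share are illustrative inputs;
    decimal units are used (1 GB = 1000 MB).
    """
    total_gb = num_files * avg_file_mb / 1000
    reclaimed_gb = total_gb * reclaimed_share
    return reclaimed_gb * PRICE_PER_GB_MONTH_SGD * 12


if __name__ == "__main__":
    # Hypothetical repository: 30 million files averaging 5 MB each.
    saving = annual_saving_sgd(num_files=30_000_000, avg_file_mb=5.0, reclaimed_share=0.20)
    print(f"Estimated annual saving: S${saving:,.0f}")
```

The estimate covers raw storage at a single tier only. Backup copies, replication, data transfer and the staff time spent searching through redundant files are outside the calculation, and the result scales linearly with file count and file size.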
The Decisions That Cannot Be Deferred
Three choices now sit on the table for both public agencies and commercial operators. The first is whether to pursue automated deduplication using perceptual hashing, a technique that identifies visually similar images even when file metadata differs, or to build manual review workflows. Automated tools are faster and cheaper upfront, but they produce a non-trivial rate of false matches, and acting on those matches without review could delete images with archival value.
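For readers unfamiliar with perceptual hashing, the idea can be sketched in a few lines. The example below uses average hashing, one simple member of the family, on tiny grayscale grids that stand in for downscaled images. The function names and the threshold of 5 bits are assumptions chosen for illustration, not settings used by any agency or product.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Return a bit fingerprint: 1 where a pixel is brighter than the grid's mean.

    `pixels` is a small grayscale grid, standing in for an image that has
    already been downscaled.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def likely_duplicates(a: int, b: int, threshold: int = 5) -> bool:
    """Flag a pair for review when fingerprints differ by at most `threshold` bits."""
    return hamming_distance(a, b) <= threshold


# Hypothetical 4x4 grids: a scan, a uniformly brighter copy, and an inverted image.
original = [[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]]
brighter = [[p + 10 for p in row] for row in original]
inverted = [[170 - p for p in row] for row in original]

if __name__ == "__main__":
    h = average_hash(original)
    print(likely_duplicates(h, average_hash(brighter)))  # brightness shift keeps the same fingerprint
    print(likely_duplicates(h, average_hash(inverted)))  # inverted image differs in every bit
```

The threshold is where the error rate lives. A looser threshold catches more re-encoded or resized copies but also flags more images that merely resemble each other, which is why the sketch labels pairs as likely duplicates for review and deletes nothing.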
The second decision concerns governance. Singapore has no single statutory body responsible for image data quality across sectors. GovTech governs public-sector digital infrastructure, but its remit does not extend to private platforms. The Personal Data Protection Commission at Robinson Road handles personal data rules, but duplicate images of, say, a Bishan street scene raise no personal data issue — they simply waste space and degrade search quality.
The third and most consequential decision is about timing. Agencies that delay the cleanup past the end of 2026 risk compounding the problem as AI training pipelines increasingly draw on local image repositories. Duplicate images fed into training sets over-represent the scenes they depict and skew model outputs, a technical problem that becomes harder to correct once models are deployed.
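One low-cost safeguard is to filter byte-identical copies out of a training manifest before any model sees it. The sketch below is a minimal illustration that assumes image contents are already loaded as bytes; the function name and the file names are hypothetical.

```python
import hashlib


def dedupe_by_content(items):
    """Keep the first name seen for each distinct byte content.

    `items` is an iterable of (name, raw_bytes) pairs. Returns the list of
    names kept and a list of (dropped_name, kept_name) pairs.
    """
    first_seen = {}  # SHA-256 digest -> name of the first file with that content
    dropped = []
    for name, data in items:
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            dropped.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
    return list(first_seen.values()), dropped


if __name__ == "__main__":
    # Hypothetical manifest: two entries hold the same bytes under different names.
    manifest = [
        ("listing_001.jpg", b"image-bytes-A"),
        ("listing_001_copy.jpg", b"image-bytes-A"),
        ("listing_002.jpg", b"image-bytes-B"),
    ]
    kept, dropped = dedupe_by_content(manifest)
    print(kept)
    print(dropped)
```

A content digest catches only exact copies, such as the same file saved under different names. A photograph that has been resized or re-encoded produces a different digest and would need a similarity-based check instead.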
For organisations managing image libraries right now, the practical path forward involves three steps: commission a deduplication audit before the end of the third quarter of 2026, establish clear rules for which version of a duplicate is canonical and why, and assign a named data steward — not just a team — to sign off on every deletion. The technology to solve this problem exists. The institutional will to act on it, before the next round of digital infrastructure expansion locks in today's inefficiencies, is the only variable still in question.
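The second and third steps can be written down as explicit rules rather than left to habit. The sketch below shows one possible canonical-version rule: prefer the copy that carries provenance metadata, then the highest resolution, then the earliest ingestion date. It refuses to produce a deletion plan without a named steward. The rule ordering, field names and file paths are all assumptions for illustration; each organisation would set its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    path: str
    width: int
    height: int
    has_provenance: bool  # whether curatorial provenance metadata is attached
    ingested: str         # ISO date string, e.g. "2019-03-14"


def canonical_key(record: ImageRecord):
    """Sort key: provenance first, then more pixels, then earlier ingestion."""
    return (not record.has_provenance, -(record.width * record.height), record.ingested)


def choose_canonical(group: list[ImageRecord]) -> ImageRecord:
    if not group:
        raise ValueError("duplicate group is empty")
    return min(group, key=canonical_key)


def plan_deletions(group: list[ImageRecord], steward: str) -> list[dict]:
    """List proposed deletions, each attributed to a named data steward."""
    if not steward.strip():
        raise ValueError("a named data steward must sign off on deletions")
    keep = choose_canonical(group)
    return [
        {"delete": r.path, "keep": keep.path, "approved_by": steward}
        for r in group
        if r is not keep
    ]


if __name__ == "__main__":
    # Hypothetical group of near-identical scans from different project folders.
    group = [
        ImageRecord("project_a/scan.tif", 6000, 4000, False, "2018-05-02"),
        ImageRecord("project_b/scan.tif", 3000, 2000, True, "2019-03-14"),
        ImageRecord("project_c/scan.tif", 6000, 4000, False, "2020-11-30"),
    ]
    for entry in plan_deletions(group, steward="Named Steward"):
        print(entry)
```

Putting provenance ahead of resolution reflects the curatorial concern raised earlier: a lower-resolution copy with intact records may matter more to historians than a sharper scan without them. The function returns a plan and deletes nothing, so the steward's sign-off stays a separate, auditable step.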