Singapore's push to digitise everything, from land titles and HDB flat records to medical histories and court documents, has produced an unexpected problem: nobody agreed on a single standard, and now the nation's public databases are riddled with duplicate image files running into the tens of millions. The Infocomm Media Development Authority (IMDA) confirmed earlier this year that a whole-of-government deduplication exercise is underway, touching agencies from the Housing & Development Board to the National Archives of Singapore on Canning Rise.
The timing matters. Singapore is spending heavily to position itself as a regional artificial intelligence hub, with the National AI Strategy 2.0, launched in late 2023, calling for high-quality, machine-readable datasets across the public sector. Duplicate images are not a cosmetic irritant. They skew model training, inflate storage costs, and compromise the integrity of search results in systems that citizens use daily, from the HDB's flat application portal to the Singpass document vault.
How the Mess Accumulated
The duplication problem has roots in the 1990s, when individual ministries began scanning paper records independently. The National Archives ran its own programme. The Land Transport Authority digitised vehicle and licensing documents on a separate track. Hospitals under the National University Health System and Singapore Health Services scanned patient records using different resolution standards and naming conventions. When these systems were later connected — first through eCitizen, then through the Singpass app — no one wrote a master deduplication rule into the integration layer.
By 2018, when GovTech took over the Smart Nation infrastructure role in full, internal audits reportedly flagged the redundancy problem. But fixing it required every agency to temporarily freeze updates to shared repositories, a disruption that kept getting deferred. The National Library Board's digitised newspaper archive — accessible from its reference library at Victoria Street — is one of the cleaner datasets, because it was built under a single contractor with strict file-naming rules from the outset. Most other collections were not so lucky.
A 2024 benchmark study by the Singapore Management University's School of Computing and Information Systems, drawing on figures from the school's published research, found that government-held image repositories in comparable city-states carry duplication rates of between 18 and 34 percent once cross-agency indexing is applied. Singapore has not officially published its own figure, but the IMDA's current deduplication tender, awarded in the first quarter of 2026, covers more than 200 terabytes of scanned government documents.
What the Fix Actually Involves
Deduplication at this scale is not a single software patch. The approach being rolled out uses perceptual hashing, a technique that generates a short fingerprint from an image's visual content rather than from its file name or raw bytes, to cluster near-identical scans. The same page scanned at 200 dpi and again at 300 dpi produces two entirely different checksums in a traditional tool, but two nearly identical perceptual hashes. GovTech's data engineering team at Mapletree Business City in Pasir Panjang is running the matching engine, with results piped back to each originating agency for human review before any file is deleted.
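To make the checksum-versus-fingerprint distinction concrete, here is a minimal sketch in Python. It is not the matching engine described above, whose internals have not been published. It assumes a difference hash (one common perceptual-hash variant), uses synthetic grayscale "scans" in place of real documents, and the names render, form_a, form_b and MATCH_THRESHOLD are hypothetical, chosen only for illustration.

```python
import hashlib

MATCH_THRESHOLD = 5  # hypothetical cut-off: max differing bits to call two scans near-identical


def form_a(u, v):
    """Illustrative page layout: dark header band plus a dark central text block."""
    return v < 0.15 or (0.3 < u < 0.7 and 0.4 < v < 0.8)


def form_b(u, v):
    """A different illustrative layout: dark strip down the right-hand side."""
    return u > 0.8


def render(width, height, is_dark):
    """Render a synthetic grayscale scan; the same layout can be drawn at any resolution."""
    return [[30 if is_dark(x / width, y / height) else 230 for x in range(width)]
            for y in range(height)]


def shrink(pixels, out_w, out_h):
    """Downsample by averaging the block of source pixels behind each output cell."""
    in_h, in_w = len(pixels), len(pixels[0])
    out = []
    for j in range(out_h):
        y0, y1 = j * in_h // out_h, (j + 1) * in_h // out_h
        row = []
        for i in range(out_w):
            x0, x1 = i * in_w // out_w, (i + 1) * in_w // out_w
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def dhash(pixels, size=8):
    """Difference hash: one bit per horizontally adjacent pair in a (size+1) x size thumbnail."""
    bits = 0
    for row in shrink(pixels, size + 1, size):
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def sha256_of(pixels):
    """Traditional checksum over the raw pixel bytes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


# The same page "scanned" at two resolutions, plus an unrelated page.
scan_200 = render(200, 280, form_a)
scan_300 = render(300, 420, form_a)
other_doc = render(200, 280, form_b)

print("checksums equal:", sha256_of(scan_200) == sha256_of(scan_300))
print("bits differing, same page:", hamming(dhash(scan_200), dhash(scan_300)))
print("bits differing, other page:", hamming(dhash(scan_200), dhash(other_doc)))
```

In this sketch, pairs whose hashes differ by no more than MATCH_THRESHOLD bits would be queued as candidate duplicates for review rather than deleted outright, which mirrors the human-review step described above. Real scans carry noise, skew and cropping that synthetic pages do not, so a production threshold would need tuning against labelled samples.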
The practical stakes are significant for ordinary residents. The HDB's resale portal processes tens of thousands of flat transactions each year — resale prices averaged around S$570,000 island-wide in early 2026 — and each transaction requires verified document images. A wrongly flagged duplicate can stall a sale. Lawyers at firms along Cecil Street handling conveyancing work say document retrieval delays have been a recurring friction point, though the agencies themselves have not publicly attributed those delays to the duplication backlog.
The current deduplication contract runs through the third quarter of 2027. Agencies are expected to migrate cleaned image libraries into a new central object store before the end of that year. Residents who use Singpass to store personal documents such as identity cards, birth certificates and tenancy agreements are not directly affected during the transition, but the back-end clean-up should eventually make document verification faster. For now, anyone dealing with time-sensitive transactions involving government-held scanned records should build in extra lead time and confirm document status directly with the relevant agency rather than assuming the portal is fully current.