Singapore's National Library Board quietly rolled out a new duplicate-image detection pipeline this week, deploying it across the NLB's digital repository at Victoria Street — a system that archivists have been pressure-testing since May 2026. The tool, built on perceptual hashing combined with machine-learning similarity scoring, is designed to flag near-identical scans and photographs before they are formally ingested into the national digital collection.
The timing is not accidental. NLB has set a target of processing its two-millionth digitised asset by the end of the third quarter of 2026. Duplicate records inflate that count artificially. More practically, they consume server storage at the Board's data centre in Punggol Digital District, which is already under growing demand.
Why Duplicates Became a Pressing Problem
The issue crept up gradually. When government agencies accelerated digitisation drives during and after the COVID-19 disruptions, multiple teams — often working in parallel — scanned the same physical collections from the National Archives of Singapore on Canning Rise. Newspapers, civic photographs, maps of the old Kampong Glam district, and building records from the Urban Redevelopment Authority all went through scanners at different resolutions and colour profiles. The result was thousands of near-identical image files with marginally different metadata, making manual review impractical at scale.
Duplicate images cause problems beyond storage bloat. Search results return redundant hits, researchers waste time triaging results, and automated cataloguing systems assign conflicting tags to files that represent the same original object. For a city-state that has positioned itself as a regional leader in AI-augmented public services — and that spent S$1 billion on its Smart Nation and Digital Government initiatives across the 2021–2025 fiscal cycle — messy back-end data is an awkward gap in the narrative.
The new screening tool processes image batches overnight. According to documentation published on the NLB's developer portal on June 30, the system uses a two-stage check: a fast perceptual hash pass that catches obvious duplicates in milliseconds, followed by a convolutional neural network comparison for near-duplicates that share composition but differ in crop, brightness, or compression artefacts. Files flagged as probable duplicates are quarantined for human review rather than deleted automatically, a deliberate safeguard against false positives stripping unique records from the archive.
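For readers curious how a two-stage check of this kind might be wired together, the sketch below is one minimal illustration in Python. It is not the NLB's code. It assumes an 8x8 average hash for the first pass and, in place of the convolutional network, substitutes a plain Pearson correlation of downsampled pixels for the second. The function names, the "quarantine"/"ingest" labels and both cutoffs (`hash_cutoff`, `sim_cutoff`) are hypothetical, and images are assumed to be grayscale grids of 0-255 values at least 16 pixels on each side.

```python
from math import sqrt
from typing import List

Image = List[List[int]]  # grayscale pixels (0-255), row-major


def downsample(img: Image, size: int) -> List[List[float]]:
    """Box-average an image down to size x size (image must be >= size per side)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0, r1 = r * h // size, (r + 1) * h // size
        row = []
        for c in range(size):
            c0, c1 = c * w // size, (c + 1) * w // size
            block = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(img: Image, size: int = 8) -> int:
    """Perceptual hash: one bit per cell, set when the cell is brighter than the mean."""
    flat = [p for row in downsample(img, size) for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def similarity(a: Image, b: Image, size: int = 16) -> float:
    """Stand-in for a learned similarity model: Pearson correlation of pixels."""
    xa = [p for row in downsample(a, size) for p in row]
    xb = [p for row in downsample(b, size) for p in row]
    ma, mb = sum(xa) / len(xa), sum(xb) / len(xb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(xa, xb))
    va = sum((x - ma) ** 2 for x in xa)
    vb = sum((y - mb) ** 2 for y in xb)
    if va == 0 or vb == 0:
        return 0.0  # flat image: correlation is undefined, so report no evidence
    return cov / sqrt(va * vb)


def screen(candidate: Image, existing: List[Image],
           hash_cutoff: int = 2, sim_cutoff: float = 0.95) -> str:
    """Return a label for the candidate; nothing is ever deleted here."""
    h = average_hash(candidate)
    # Stage 1: cheap hash comparison for obvious duplicates.
    for other in existing:
        if hamming(h, average_hash(other)) <= hash_cutoff:
            return "quarantine:hash"
    # Stage 2: slower pixel-level comparison for near-duplicates.
    for other in existing:
        if similarity(candidate, other) >= sim_cutoff:
            return "quarantine:similarity"
    return "ingest"
```

In this sketch a uniformly brightened copy of a scan is caught by the hash pass, because every cell and the mean shift together. A copy with an altered tone curve can flip enough hash bits to slip past the first stage and would be caught only by the second. Either way the function returns a label and leaves the decision to a reviewer, mirroring the quarantine step described above.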
Agencies Align on a Common Standard
The NLB is not working in isolation. The Infocomm Media Development Authority, headquartered at South Beach Avenue, has been coordinating with at least four statutory boards since January 2026 to establish a shared image-quality and deduplication standard for government digital assets. The goal is a single metadata framework that would allow agencies to cross-check holdings without duplicating effort across departmental silos.
For Singapore's creative and research communities based around institutions like Lasalle College of the Arts on McNally Street and the National University of Singapore's Centre for Digital Humanities, cleaner archives mean more reliable datasets for computational research. Researchers using the NLB's NewspaperSG portal have long flagged duplicate pages as one of the more tedious friction points in their workflow.
The practical implications extend to the private sector too. Digital asset management is a growing service line among Singapore-based firms that handle marketing libraries, legal document repositories, and property listing photographs, all categories where duplicates carry real financial costs in wasted storage and licensing confusion.
NLB's deduplication project is expected to complete its first full system scan of the existing collection by September 2026. After that, the detection pipeline will run as a standing pre-ingest check. For researchers and members of the public using the NLB's digital services, the most visible change will be cleaner, faster search results with fewer redundant thumbnails cluttering catalogue pages. For the archivists on Victoria Street, it means fewer late nights triaging files by hand — and a two-millionth asset count that will actually mean something.