ARMA Magazine

QC-ing the QC

This article is my advice for a holistic approach to QA/QC for document capture projects. It is also a recommendation to audit the effectiveness of the QC methodologies used (or to be used) by a service provider. I have tried my best to harness the lessons learned and expertise gained during more than half a century in the computer industry, half of it focused on digital document capture. I have always tried to defy conventional wisdom with creative new methodologies in a very conservative industry segment once dominated by micrographic entrepreneurs. After a few years of assisting several micrographic companies in their transition to digital, I started my own digitization service bureau. Thirty years, 2 billion images, hundreds of projects, hundreds of thousands of lines of programming code, and millions of dollars later, one might think that current projects would be error-free, but no such luck. As in the story of the scorpion and the frog, error is simply in the nature of the work. Perfection is an asymptotic curve.

Under the premise that profits are proportional to productivity and client satisfaction, I stubbornly continued to innovate and experiment with creative approaches to problems commonly perceived as simple, reinventing QA/QC for each project while remaining wary of the substandard QC procedures touted by other vendors. In hindsight, although I failed more often than I succeeded, the aggregate benefits of the successes greatly offset the aggregate cost of the failures, which, by the way, is a price worth paying for valuable lessons learned.

Errors and Omissions (EOs)

Out of all the challenges involved in production, QC is the most underappreciated and abused task, despite being the most important protection against errors and omissions (EOs), the digital counterpart of misfiled, damaged, or lost documents. Conventional methodologies (some inherited from the micrographic era) offer substandard levels of success in preventing, monitoring, detecting, and remediating EOs in a document digitization project. To prevent EOs from occurring (or to minimize their occurrence), we need a robust production workflow running on production-grade hardware and software tools. Monitoring EOs requires planting tracking seeds and gathering profuse, multidimensional* production metrics along the entire production workflow. Detection success depends on applying timely analytics to the gathered metrics. Because EOs tend to creep in from the very early stages of a project, a late discovery may have devastating, often irreparable consequences. Remediation requires methodologies to bridge the gap between the habitats of the original and the digital documents. There is enough here to fill a whole book, but for this article I will try to condense it into a few thousand words.
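One concrete form a tracking seed can take is a uniquely numbered target sheet inserted at known points in the paper stream during prepping and then searched for in the digital output. The sketch below illustrates only the reconciliation idea; the seed ID format and the detection mechanism are my assumptions for the example, not a description of any particular production system.

```python
# Minimal sketch of "tracking seed" reconciliation (illustrative assumptions only):
# seed sheets carry IDs like "SEED-000123"; the capture software records every
# seed ID it recognizes in the scanned output, and we compare the two sets.

def reconcile_seeds(planted_ids, detected_ids):
    """Compare seed IDs planted in the paper stream against those detected
    in the digital output and report what is missing or unexpected."""
    planted = set(planted_ids)
    detected = set(detected_ids)
    return {
        "missing": sorted(planted - detected),     # planted but never seen: possible lost or skipped material
        "unexpected": sorted(detected - planted),  # seen but never planted: mislabeled or duplicated work
        "confirmed": len(planted & detected),
    }

if __name__ == "__main__":
    planted = [f"SEED-{i:06d}" for i in range(1, 51)]               # 50 seeds inserted during prepping
    detected = [f"SEED-{i:06d}" for i in range(1, 51) if i != 17]   # one seed never surfaced downstream
    print(reconcile_seeds(planted, detected))
    # A missing seed points QC at the physical span between its neighbors.
```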

What are the Causes of EOs?

Human errors, equipment malfunction, inadequate technology, software bugs/defects, poor project management, poor originals, inadequate reconciliation of parallel production lines, and the like. Challenges posed by the media themselves also contribute to the proliferation of EOs.

Some industries tolerate errors and omissions in their document digitization projects more than others. In fact, some clients irresponsibly assume that their Digitization Service Provider (DSP) will reasonably comply with an explicit or implicit error tolerance declared in a Service Level Agreement (SLA). Clients should always ask DSPs to explain their QC methodologies and demand solid answers, then mistrust and verify. If clients tolerate weak answers, they will end up with egg on their faces. Answers from DSPs may include a combination of:

  1. Random sampling: Although useful in most projects, random sampling overpromises and underdelivers. It helps, but it is often not enough (see the sampling arithmetic sketched after this list).
  2. 100% QC: An abstract notion sometimes attempted at a very high cost. However, with ingenuity, we may get close enough.
  3. Counting pages: Error-prone and unreliable. Matching counts is frequently misleading due to undercounts offsetting overcounts. Reconciling human counts against computer counts often yields false positives and false negatives.
  4. Dual blind verification: Useful mainly in indexing, and often imperfect.
  5. Control totals: Crosscheck of input and output metrics at every workflow step. Crucial and effective, but usually one-dimensional*.
  6. Lookup tables: Crosscheck against user-supplied data. Although most useful in indexing, it can extend much further.
  7. Page-by-page video QC: A video is taken showing each and every page before scanning. The system then presents a graphical interface that allows an operator to compare still video frames against scanned images, looking for errors and omissions. This is an effective but expensive solution to a select set of extreme QC circumstances.
  8. Good documentation: Statement of Work (SOW), Key Performance Indicators (KPI), Service Level Agreement (SLA), Acceptance Test Criteria, and Production Instructions are crucial in containing EOs, if properly adhered to.
  9. Capture methodology: Batch-oriented methodologies (the entire workflow is applied to a batch of documents) are used in large-volume backfile conversion projects, while transactional methodologies (the entire workflow is applied to each document) are used in low-volume day-forward capture. Any other pairing signals trouble.
  10. Bill of health: A comprehensive multidimensional* post-mortem report of all raw metrics and analytical results. My favorite and highly recommended. On a side note, I am currently experimenting with Machine Learning (ML) to enhance the analytic component.
[Figure: An example of page-by-page video QC]
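To see why random sampling (item 1) tends to overpromise, consider the zero-defect acceptance-sampling arithmetic: if a random sample of n pages shows no errors, the probability of that outcome when the true error rate is p is (1 - p)^n, so the sample needed to confidently rule out a small error rate grows very quickly. The confidence level and error rates below are illustrative assumptions, not figures from any specific SLA.

```python
import math

def zero_defect_sample_size(error_rate, confidence=0.95):
    """Smallest sample size n such that, if the true error rate is at least
    `error_rate`, a random sample of n items with zero observed errors would
    occur with probability no greater than (1 - confidence).
    Derived from (1 - p)^n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - error_rate))

if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001, 0.0001):
        n = zero_defect_sample_size(p, confidence=0.95)
        print(f"error rate {p:>7.2%}: inspect at least {n:,} items")
    # Ruling out a 0.01% error rate at 95% confidence takes roughly 30,000
    # inspected items, which is why sampling alone rarely satisfies a tight SLA.
```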

*Multi-dimensional analysis: A concept that enables valuable checks and balances based on four distinct workflow perspectives (dimensions):

  1. At Origination: Raw data captured when the service provider accepts custody of a collection, and a manifest/inventory is produced.
  2. During Production: Contextual data captured at the beginning or the end of each task throughout the production workflow.
  3. At Publishing: Contextual data captured after documents are classified and indexed.
  4. At Submittal: Final data captured when deliverables are subject to preliminary or final acceptance.

When aggregate numbers drawn from these four coordinate systems are cross-referenced, compared, and analyzed, EOs become more conspicuous; a simple count-reconciliation sketch follows the list below. What could (and will!) possibly go wrong?

  1. At Origination: Some documents, even entire batches, may not have made it entirely through the production workflow. They may be stuck in the middle of it, mishandled, or believed present when they were not. If the manifest was unsuitably created, these M.I.A. documents may linger in the dark for a long time or forever.
  2. During Production: Pages can be missed, mutilated, obstructed, out of sequence, out of scale, illegible, wrongly split or merged, or overlapped; files can be corrupted or non-compliant with the SOW; blank page detection can misfire. Page groups can be wrongly classified/indexed, not indexed, duplicated, “buried”, or “made up”.
  3. At Publishing: Incorrect structures (missing or extraneous sections), truncated documents, missing documents, “made up” documents.
  4. At Submittal: Similar to publishing above, plus any EOs introduced afterward. Deliverables may be incomplete, redundant, or not fully compliant with the SOW. Originals may be returned or disposed of without a valid digital counterpart.
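As a rough illustration of how the four dimensions expose EOs, the sketch below compares per-batch page counts captured at each checkpoint and flags any break between consecutive checkpoints. The batch IDs, counts, and field names are hypothetical.

```python
# Hypothetical per-batch page counts captured at the four checkpoints described above.
CHECKPOINTS = ("origination", "production", "publishing", "submittal")

batches = {
    "BOX-0012": {"origination": 1480, "production": 1480, "publishing": 1480, "submittal": 1480},
    "BOX-0013": {"origination": 2210, "production": 2195, "publishing": 2195, "submittal": 2195},  # pages vanished during production
    "BOX-0014": {"origination": 940,  "production": 940,  "publishing": 952,  "submittal": 952},   # pages appeared at publishing
}

def flag_discrepancies(batches):
    """Return (batch, earlier checkpoint, later checkpoint, delta) wherever
    counts disagree between consecutive checkpoints."""
    findings = []
    for batch_id, counts in batches.items():
        for earlier, later in zip(CHECKPOINTS, CHECKPOINTS[1:]):
            delta = counts[later] - counts[earlier]
            if delta != 0:
                findings.append((batch_id, earlier, later, delta))
    return findings

for finding in flag_discrepancies(batches):
    print("EO candidate:", finding)
```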

A simplified production workflow includes tasks such as boxing and inventory, chain of custody, logistics and transportation, manifest creation, prepping, batching, coordination of parallel production lines, scanning, image processing, QC/repair, classification, coding, indexing, lookups, formatting, publishing, de-prepping, testing, handling of on-demand work-in-progress (WIP) requests, deployment, reporting, submittal, and final acceptance.
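A minimal way to make such a workflow auditable is to log, for every task, who handled the batch, when, and the page count passed downstream. The sketch below is a simplified illustration under those assumptions (task and field names are mine, not a standard); real production systems capture far richer metrics.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class TaskEvent:
    task: str            # e.g. "prepping", "scanning", "indexing"
    operator: str
    pages_out: int
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class BatchLog:
    batch_id: str
    events: List[TaskEvent] = field(default_factory=list)

    def record(self, task, operator, pages_out):
        self.events.append(TaskEvent(task, operator, pages_out))

    def count_breaks(self):
        """Tasks whose output page count differs from the preceding task's."""
        pairs = zip(self.events, self.events[1:])
        return [(a.task, b.task, b.pages_out - a.pages_out)
                for a, b in pairs if a.pages_out != b.pages_out]

log = BatchLog("BOX-0013")
log.record("prepping", "op-07", 2210)
log.record("scanning", "op-03", 2195)   # pages lost somewhere between prep and scan
log.record("indexing", "op-11", 2195)
print(log.count_breaks())               # [('prepping', 'scanning', -15)]
```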

The following is a partial list of things to watch when assessing answers from a DSP:

A creative inventory method we use consists of a person wearing smart glasses exposing each folder tab or binder title in a box. This allows a video capture of each and every (identified) document in each box by just one person in a couple of minutes. The captured video is later turned into still frames that a special software program uses to search for and find any document label in a handful of clicks, without the cost of data entry. If you wonder why smart glasses, the answer is that they allowed us to cut labor costs in half: we no longer needed one person handling a camera and a second person fingering the folder tabs. A more traditional inventory process would simply enter the pertinent label data on computer or paper forms.
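The software described above is proprietary, but purely as a sketch of the general approach, off-the-shelf tools such as OpenCV and Tesseract (assumptions for this example, not the tools actually used) can sample frames from an inventory video, OCR them, and build a searchable label index:

```python
import cv2                      # pip install opencv-python
import pytesseract              # pip install pytesseract (plus the Tesseract binary)

def index_inventory_video(video_path, frames_per_second=1):
    """Sample roughly `frames_per_second` frames from the inventory video,
    OCR each one, and return {frame_number: recognized_text} for later searching."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(native_fps // frames_per_second), 1)
    index, frame_number = {}, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_number % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray)
            if text.strip():
                index[frame_number] = text
        frame_number += 1
    capture.release()
    return index

def search_labels(index, query):
    """Return frame numbers whose OCR text contains the query (case-insensitive)."""
    return [n for n, text in index.items() if query.lower() in text.lower()]

# Usage sketch (hypothetical file and label):
# index = index_inventory_video("box_0042_inventory.mp4")
# print(search_labels(index, "Smith v. Jones"))
```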

In Conclusion

The quality of results from a large document digitization project depends on the robustness of the production infrastructure and on the effectiveness of the QC methodologies used. It is the end user's responsibility to assess the former thoroughly and in a timely manner before the project starts, and the latter (QC-ing the QC) before, during, and after the project.


About the Author

Manuel Bulwa
A document imaging pioneer, Manuel has produced innovative products and strategies in the computer industry for over 50 years. He has developed imaging products for industries such as government, financial services, nuclear, and healthcare. He has trained resellers, published articles, and delivered conference presentations in six countries. Since 1992, he has focused on the development of technology and strategies for the document imaging service bureau industry. Manuel has devoted years to university teaching, as well as to software development, production, and sales. A naturalized US citizen, Manuel moved to California in 1979 after studying computer science at the University of Buenos Aires, Argentina. Manuel is currently President of Isausa, Inc. and a Past President of the Central Coast Chapter of ARMA International.