Project:Digitization

From HBMa

Layout segmentation & handwritten text recognition

The first step is layout segmentation. The registry books are mostly laid out as tables. The segmentation model developed in the pilot project detects the individual fields in the header and the individual records (rows). The next step is transcription, for which models are already available. The existing HTR models work reliably on neatly written text but not on registry records, so these models need to be improved continuously.

For both steps, Kraken and eScriptorium are used.

Text to database

The next step is to transform the transcribed text into a format suitable for uploading to the database. Each record is again processed with machine learning, in this case text-to-text transformers.

For example, take the text from the child's father field:

Moses Raudnitz Sinagogendiener S. d. Isak Raudnitz u. d. Rachel geb. Boskowitz

The father's name, his occupation, and the names of his parents are detected, and database prefixes are added (P23 for father, P27 for grandfather, etc.):

[[P23:Moses Raudnitz]] [[P23/P26:Sinagogendiener]] [[P27:Isak Raudnitz]] [[P28:Rachel Raudnitz geb. Boskowitz]]
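The tagged output above is regular enough to be split into prefix and value pairs with a small parser. The following is a minimal sketch in Python, not the project's actual code: the function name `parse_tagged` and the regular expression are assumptions made for illustration, and the pattern only covers the `[[prefix:value]]` shape visible in the example.

```python
import re

# Matches one tagged span such as [[P23:Moses Raudnitz]] or
# [[P23/P26:Sinagogendiener]]. The pattern is an assumption based on the
# single example shown above.
TAG_RE = re.compile(r"\[\[(P\d+(?:/P\d+)*):([^\]]+)\]\]")


def parse_tagged(text):
    """Return (prefix, value) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]


tagged = (
    "[[P23:Moses Raudnitz]] [[P23/P26:Sinagogendiener]] "
    "[[P27:Isak Raudnitz]] [[P28:Rachel Raudnitz geb. Boskowitz]]"
)
pairs = parse_tagged(tagged)
for prefix, value in pairs:
    print(prefix, "->", value)
```

Keeping the prefix as an opaque string (including compound forms such as `P23/P26`) leaves the interpretation of each prefix to a later step.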

After further formatting, the data can be uploaded using QuickStatements.
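As a rough illustration of what this formatting could look like, the sketch below turns prefix and value pairs into tab-separated lines in the QuickStatements V1 style (item, property, value). Several things here are assumptions rather than the project's method: the helper name `to_quickstatements`, the placeholder item `Q999`, and the choice to write every value as a quoted string (in the real database, properties such as father may well point to person entries instead). Compound prefixes such as `P23/P26` are skipped because their exact meaning is not defined in this text.

```python
def to_quickstatements(item_id, pairs):
    """Build tab-separated statement lines with quoted string values.

    Assumes values contain no double quotes, since they are wrapped in
    double quotes without any escaping.
    """
    lines = []
    for prop, value in pairs:
        if "/" in prop:
            # Compound prefixes (e.g. P23/P26) are left out of this sketch.
            continue
        lines.append(f'{item_id}\t{prop}\t"{value}"')
    return "\n".join(lines)


# Hypothetical input: Q999 is a placeholder item, not a real entry.
pairs = [
    ("P23", "Moses Raudnitz"),
    ("P23/P26", "Sinagogendiener"),
    ("P27", "Isak Raudnitz"),
]
batch = to_quickstatements("Q999", pairs)
print(batch)
```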

Data quality

The last phase of digitization is the creation of entries for individual persons (child, parents, midwives, fiancé, spouse, deceased, etc.) and the identification of the same person across all records. In other words, database entries (e.g. Q117) are assigned to text strings (e.g. "Josefa Taubeles").
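The simplest form of this assignment is a lookup of normalized name strings against already known entries. The sketch below shows only that baseline and is not the project's procedure: the function names, the normalization rules, and the pairing of Q117 with "Josefa Taubeles" are illustrative assumptions. In practice, identical names can belong to different persons, so dates, places, and family relations would also have to be compared.

```python
import unicodedata


def normalize(name):
    """Lowercase, strip diacritics, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.lower().split())


def link_persons(mentions, known_persons):
    """Map each mention to the ID of a known entry, or None if unmatched."""
    index = {normalize(label): qid for qid, label in known_persons.items()}
    return {mention: index.get(normalize(mention)) for mention in mentions}


# Hypothetical data: the ID-to-name pairing is for illustration only.
known = {"Q117": "Josefa Taubeles"}
links = link_persons(["josefa  Taubeles", "Moses Raudnitz"], known)
print(links)
```

Mentions that map to `None` are candidates for a new entry or for manual review.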