Tuesday, February 17, 2009

A UCLA Catalogue and Further Details

During the course of our research, Dr. Lang discovered a nascent catalogue being developed by UCLA (UCLA Catalogue of Digitalized Medieval Manuscripts) that digitizes medieval (and sometimes older) manuscripts, stores them in its database, and posts them online to be freely viewed and even read. Because these documents predate the printing press, all of them are handwritten. They are collected from around the world and from a variety of medieval cultures, so they are written in a variety of languages, including Latin, Old English, Middle English, German, and French, all of which conveniently share the Latin alphabet. Digitization essentially consists of photographing every page of a document against a black background and uploading the photographs to be viewed one at a time. One could thereby, inconveniently, read through a whole manuscript by clicking on each image individually and zooming in to make it legible. The pictures have a high enough resolution that they do not distort when enlarged; however, reading handwriting through a series of separate pictures is less than desirable. This mode of viewing seems to be at odds with UCLA’s overarching intentions for the catalogue: though they certainly aim to exhibit the documents, the notion of online galleries has yet to come into vogue. UCLA prides itself on being avant-garde in nearly all of its developments, and this catalogue seems no exception. Consequently, Dr. Lang and I are hopeful that they will take an interest in the technology we are developing for the transcription of handwritten manuscripts via crowdsourcing. The preliminary research in this vein has revealed both potential and complications.

Since the languages used in the catalogue’s manuscripts all employ the same lettering, it is feasible that we could have nearly all of them transcribed by crowdsourced workers. The concern is the variation in language. Though it is possible that we will find workers familiar enough with French, German, English, or even Latin to discern a word from context when the handwriting is obscure, it is unlikely that many crowdsourcers know the intricacies of Old and Middle English well enough to do the same. We will likely have to provide additional compensation for the transcription of messier texts or texts in dead or rare languages.

Beyond the language complications of the manuscripts, there is also the issue of the images themselves. The images as UCLA has them can be used “as is” and distributed through Mturk, or we can use Snapter to obtain a cropped and flattened PDF file. Snapter is preferable because of the quality of the image it produces; however, Snapter (or at least the trial version we now have available) is not automated. If the program is used in its current form, either we or the libraries using our service will have to select how each individual image is cropped and flattened, because the algorithm used to produce a flat PDF varies depending on whether the user has uploaded an image of an open book (two pages), a single document, a card, or a board. We will have to find a way to manage large volumes of uploaded documents efficiently, which means finding some method of formatting images without manually manipulating each one.
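
One way the manual selection step might be avoided is to guess the upload type from the shape of each image and then process the images in groups. The sketch below is only an illustration of that idea in Python: the mode names, the `PageImage` record, and the 1.2 aspect-ratio threshold are all assumptions of ours, and nothing here calls or imitates Snapter itself. It distinguishes only open books from single documents; cards and boards would need some further rule.

```python
from dataclasses import dataclass

# Hypothetical mode labels for two of the upload types described above.
# These are our own names, not Snapter identifiers.
BOOK = "book"
DOCUMENT = "document"


@dataclass
class PageImage:
    """Minimal record of an uploaded photograph (assumed structure)."""
    name: str
    width: int   # pixels
    height: int  # pixels


def guess_mode(image: PageImage, book_ratio: float = 1.2) -> str:
    """Guess the flattening mode from aspect ratio alone.

    An open book photographed flat is usually wider than it is tall,
    so anything at or above the (assumed) threshold is treated as a
    two-page spread; everything else is treated as a single document.
    """
    ratio = image.width / image.height
    return BOOK if ratio >= book_ratio else DOCUMENT


def plan_batch(images):
    """Group image names by guessed mode so each group can be handled together."""
    plan = {}
    for image in images:
        plan.setdefault(guess_mode(image), []).append(image.name)
    return plan


if __name__ == "__main__":
    uploads = [
        PageImage("spread_001.jpg", 3000, 2000),
        PageImage("leaf_002.jpg", 2000, 3000),
    ]
    print(plan_batch(uploads))
```

A heuristic this crude would certainly misjudge some images, so in practice the guesses would probably still need a quick human check, but checking a sorted batch is far less work than configuring every image by hand.
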
The images below show an image from UCLA’s Catalogue of Digital Medieval Manuscripts (right), the image uploaded into Snapter with formatting selected (far left), and the output (center).
