Monday, September 7, 2009

Manuscript Complications

An earlier post designated UCLA's Catalogue of Digitized Medieval Manuscripts as our primary affiliate project for transcription. We reasoned that, since the collection is still young, its curators might be more eager to join our project to develop a program that promotes the distribution of classical writings and lets viewers of their collection actually comprehend the contents of its manuscripts. That very fit, however, presented a sort of catch-22. The cryptic nature of most of the manuscripts in the collection validates our endeavor to some degree, but it also complicates the process of transcription: if MTurk workers are unable to read the manuscripts, then they are unable to transcribe them for anyone else. It is then commensurately difficult to "proofread" the submissions we would receive from those workers.


Having found that, at least for immediate purposes, UCLA's collection was unsuitable, I browsed the internet for other digital collections of handwritten manuscripts. The best candidate is an online collection presented by the Library of Congress called "The Thomas Jefferson Papers." It is purportedly the largest collection of Thomas Jefferson's handwritten manuscripts in the world (over 27,000 documents), including letters and speeches. These have several advantages. Because they were written to be read, they are predominantly legible. Also, though the diction is dated in some respects, the letters and speeches are written in English, giving workers context to aid in transcription. Furthermore, the collection has a cultural relevance that may further substantiate our project. Our pilot run will likely use one of these manuscripts.

Below is a link to an image of one of Jefferson's letters:

Logic, Logistics, and Formats

The last entry briefly discussed my attempts to develop an adequate template for crowdsourcing the manuscripts to be transcribed. Since then, I have encountered a few new challenges--and occasional "frustrations"--and learned a bit more about what is required of the "Requesters" using MTurk. The next few entries will address some of the logistics of "piloting" our concept through Mechanical Turk.



It should first be noted that, while we originally intended to have MTurk workers e-mail their transcriptions to WrittenRummage@gmail.com, it has now shown itself prudent to operate primarily through MTurk itself. In designing a template for the general form of all transcription tasks we will be requesting, I found that MTurk supplies the Requester with a variety of answer tools to serve whatever function might prove necessary, including multiple-choice bubbles for surveys and text boxes for purposes like ours. Since a Requester must include at least one of these supplied tools in the template, the inference is that Mechanical Turk wants most--if not all--information transfer to pass through its own program. This is not entirely inconvenient or impractical: we can use the provided text box as the area in which workers will type their transcriptions, saving several delaying steps of log-ins and file formatting. I have therefore changed the instructions in the template.
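For concreteness, the transcription task reduces to an MTurk "QuestionForm" containing a single free-text answer. Below is a minimal sketch of such a form, held in a Python string; the question identifier, prompt wording, and text-box size are my own placeholders rather than the finished template, and the manuscript image itself would be presented alongside it.

    # A minimal sketch of the MTurk QuestionForm for one transcription task.
    # The identifier, prompt, and box size below are placeholders, not
    # our finished template.
    TRANSCRIPTION_QUESTION = """
    <QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
      <Question>
        <QuestionIdentifier>transcription</QuestionIdentifier>
        <IsRequired>true</IsRequired>
        <QuestionContent>
          <Text>Transcribe the handwritten text in the image as exactly as you can.</Text>
        </QuestionContent>
        <AnswerSpecification>
          <FreeTextAnswer>
            <Constraints>
              <Length minLength="1"/>
            </Constraints>
            <NumberOfLinesForTextBox>20</NumberOfLinesForTextBox>
          </FreeTextAnswer>
        </AnswerSpecification>
      </Question>
    </QuestionForm>
    """

The worker's typed answer then comes back through Mechanical Turk's own results download, which is precisely the in-program information transfer described above.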

Images will be provided once the first MTurk request is actually made.


Sunday, August 23, 2009

Task After the Drought

It has been almost unforgivably long since my last entry; the project nonetheless trudges forward, regaining speed after a swift deceleration. The impending task is developing a proof of concept for WrittenRummage, the name we have settled on for our small non-profit. This entails a sort of pilot run: using MTurk to crowdsource the task of transcribing a viewable--and decipherable--image of handwritten text, and establishing a template to repeat the task an indeterminate number of times. The actual process of defining a template and making our expectations of the worker explicit is more intricate than originally anticipated. Deadlines, compensation, the pre-determined quality rating of the worker, modes of communication, our project's credibility, and the potential for repeat workers must all be adequately attended to in establishing the task proper. This I will attempt to complete in the ensuing day or two.
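To make those knobs concrete, here is a rough sketch of how the decisions above map onto a single task's parameters, using Mechanical Turk's usual vocabulary of rewards, durations, lifetimes, and worker approval-rate qualifications. Every value is a placeholder, not a settled choice.

    # Rough sketch: the decisions above expressed as task parameters.
    # All values are placeholders, not settled choices.
    pilot_hit = {
        "title": "Transcribe a page of handwritten text",
        "description": "Type out the handwriting shown in the linked image.",
        "reward": 0.25,                  # compensation, in USD
        "assignment_duration_hours": 2,  # deadline for a worker who accepts
        "lifetime_days": 7,              # how long the task stays posted
        "min_approval_rate": 95,         # pre-determined quality rating of the worker
        "max_assignments": 2,            # two workers, so submissions can be cross-checked
    }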



It is part of our intent to have the workers' compensation increase automatically each day the task remains available. That is, if the initial wage offered for completing the task is some quantity $C, then on day two of the task's posting, should it not yet have been accepted by a worker, the wage would automatically rise to $C + $0.01 or the like. This process would continue until a predetermined cap. I am contacting MTurk to see whether this process is possible.
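Should MTurk offer no built-in escalation, the same effect could likely be approximated with a daily script that expires an untaken task and reposts it at a slightly higher reward. Below is a sketch of that logic only; post_hit, hit_was_accepted, and expire_hit are hypothetical hooks standing in for whatever calls or manual steps prove possible.

    import time

    BASE_REWARD = 0.25   # $C, the initial wage
    STEP = 0.01          # daily increment
    CAP = 0.50           # the predetermined cap

    def escalate_daily(post_hit, hit_was_accepted, expire_hit):
        """Repost an untaken task each day at a higher reward, up to CAP.

        The three arguments are hypothetical hooks, not real MTurk calls.
        """
        reward = BASE_REWARD
        hit_id = post_hit(reward)
        while reward < CAP:
            time.sleep(24 * 60 * 60)        # wait one day
            if hit_was_accepted(hit_id):
                break
            expire_hit(hit_id)              # take down the stale posting
            reward = min(round(reward + STEP, 2), CAP)
            hit_id = post_hit(reward)       # repost at the higher wage
        return hit_id, reward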

Tuesday, February 17, 2009

A UCLA Catalogue and Further Details

During the course of our research Dr. Lang discovered a nascent catalogue being developed by UCLA (the UCLA Catalogue of Digitized Medieval Manuscripts) that digitizes collected medieval manuscripts (and sometimes older ones), stores them in its database, and posts them online to be freely viewed and even read. Because these documents predate the printing press, all of them are handwritten. They are collected from around the world and from a variety of medieval cultures, meaning they are written in a variety of languages, including Latin, Old English, Middle English, German, and French, all of which conveniently share the same alphabet. The digitization of the documents essentially consists of photographing every page against a black background and uploading the photographs to be viewed one at a time. One could thereby, inconveniently, read through a whole manuscript by clicking on each image individually and zooming in to make it legible. The pictures are of high enough resolution that they do not distort when enlarged; however, reading penmanship through a series of pictures is less than desirable. This mode of viewing seems to persist despite UCLA's overarching intentions for the catalogue. Though they certainly aim to exhibit the documents, the notion of online galleries has yet to come into vogue. UCLA prides itself on being avant-garde in nearly all of its developments, and this catalogue seems no exception. Consequently, Dr. Lang and I have hope that they would take interest in the technology we are developing for the transcription of handwritten manuscripts via crowdsourcing. Preliminary research in this vein has led to seeing both potential and complications.

Since the languages used in the catalogue's manuscripts all employ the same lettering, it is feasible that we could have nearly all of them transcribed by crowdsource workers. The concern is the language variation. Though it is possible that we will find users familiar enough with French, German, English, or even Latin to discern a word from context when the handwriting is obscure, it is unlikely that many crowdsourcers know the intricacies of Old and Middle English well enough to do the same. We will likely have to provide additional compensation for the transcription of messier texts or of texts in dead or rare languages, perhaps along the lines sketched below.
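One simple form such compensation could take is a reward schedule keyed to difficulty. The tiers and amounts below are invented purely for illustration.

    # Illustrative reward schedule for transcription tasks.
    # The tiers and bonus amounts are assumptions, not settled policy.
    DIFFICULTY_BONUS = {
        "clean_modern": 0.00,        # legible hand, modern language
        "messy_modern": 0.05,        # obscure handwriting, modern language
        "latin": 0.10,               # rarer language knowledge required
        "old_middle_english": 0.15,  # dead-language intricacies
    }

    def reward_for(base_reward, difficulty):
        """Base wage plus a bonus for messier texts or rarer languages."""
        return round(base_reward + DIFFICULTY_BONUS[difficulty], 2)

    # e.g. reward_for(0.25, "latin") -> 0.35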

Beyond the language complications of the manuscripts, there is also the issue of the images themselves. The images as UCLA presents them can be used "as is" and distributed through MTurk, or we can use Snapter to obtain a cropped and flattened PDF file. Snapter is preferable because of the quality of the image it produces; however, Snapter (or at least the trial version of Snapter we now have available) is not automated. If the program is used in its current form, either we or the libraries using our service will have to select how each individual image is cropped and flattened: the algorithm used to produce a flat PDF picture varies based on whether the user has uploaded an image of an open book (two pages), a single document, a card, or a board. We will have to find a way to efficiently manage mass amounts of uploaded documents, which means finding some method of formatting images without manually manipulating each one; a rough sketch of such a batch step follows the figure below.
The images below show an image from UCLA's Catalogue of Digitized Medieval Manuscripts (right), the image uploaded into Snapter with formatting selected (far left), and the output (center).
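At scale, the batch step we need looks roughly like the following: walk a directory of page images, guess a flattening mode for each, and hand it off to Snapter. Since the trial version exposes no scriptable interface, get_size and flatten_with_snapter below are hypothetical hooks, and the mode heuristic is only an assumption.

    import os

    # Snapter's algorithm choice depends on what was photographed.
    MODES = ("book", "document", "card", "board")

    def guess_mode(width, height):
        """Crude placeholder heuristic: wide images are probably
        two-page book spreads; everything else a single document."""
        return "book" if width > height else "document"

    def batch_flatten(image_dir, get_size, flatten_with_snapter):
        """Process a directory of page images without manual clicks.

        get_size and flatten_with_snapter are hypothetical hooks standing
        in for whatever automation of Snapter proves possible.
        """
        for name in sorted(os.listdir(image_dir)):
            path = os.path.join(image_dir, name)
            width, height = get_size(path)
            mode = guess_mode(width, height)
            flatten_with_snapter(path, mode)  # -> cropped, flattened PDF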

Sunday, January 25, 2009

A Meeting of Interests

Despite our expectations, the Holy Spirit Research Center has few handwritten manuscripts in its possession. Much of its material consists of old articles, magazines, and typeset research publications. Though these do not need to be transcribed from handwriting into searchable digital text (which was our original intent in developing this technology), the center nonetheless expressed an interest in the idea of converting the texts it does possess into a form that is both digital and searchable. The head of the research center expressed a desire to begin digitizing those texts en masse via a device released by Snapter. This device uses two digital cameras to photograph the pages of an open, resting book (so that the binding will not be broken) and then uses the same Snapter technology we intend to use to produce organized PDF files of the book. This, however, does not make the digitized texts searchable, which has become an information-age necessity to facilitate and hasten research. Here Dr. Lang's and my interests intersected with those of the Holy Spirit Research Center: we can use crowdsourcing to take digitized texts (and, where necessary, still-handwritten texts) and convert them into searchable text. We will begin with some texts that are largely unreadable by standard OCR technology: a compilation of some of the original Azusa Street movement articles. This will be the first step in our progress toward offering a nonprofit transcription service to local and eventually national libraries.

Castingwords.com and Crowdsourcing (in Brief)

Castingwords.com is an online audio transcription service that receives audio of various qualities via either mail or the internet. It takes the audio it receives, breaks it into segments of time, and uses the Amazon-based program "Mechanical Turk" to transcribe the audio via crowdsourcing. Crowdsourcing is the process by which a company outsources a function to an undefined network of people rather than hiring one or several professionals to accomplish the same function. The company chooses the "winning" solution, pays the successful user a predetermined (generally cheap) reward, and keeps the rights to the work and method. This process can occur with users cooperating, operating individually, or with many individuals completing small tasks that amount to a cohesive whole. It provides the company a solution without having to pay higher wages, while the users benefit from quick jobs and compensation, which is especially attractive for users in countries where the American dollar is worth more than the local currency. Furthermore, the company draws on a broader pool of amateur and potentially expert talent and pays strictly when it is satisfied with the product.

Negatives, Drawbacks, Counterarguments
Crowdsourcing by nature entails a lack of accountability. In not hiring a professional or working with a specific enclave of people, the company is less able to hold the employed accountable for smooth progress or for finishing by a certain deadline. There are weaker forms of accountability, such as predetermined dates for completion and predetermined expectations for quality, but not the level of commitment that comes standard with contracted professional employment. The worker is liable either to begin a task and not complete it or to fall short of the company's standard. Thus there is no assurance that anyone who undertakes the crowdsourced task will produce the quality of work the company prefers. Because of this risk, the time (and money) taken to inspect the quality and accuracy of the work, especially on large-scale projects, could make crowdsourcing less economical than a more controlled business model. We will attempt to circumvent this problem by crowdsourcing not only the original transcription but also the proofreading process. If independently submitted transcriptions match, the transcription will be accepted; if not, we will crowdsource the page again until the transcription and the proofread text agree. A sketch of that acceptance check follows.
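A minimal sketch of that acceptance check, assuming two independent submissions are compared against a similarity threshold; the 0.95 figure is an assumption that would need tuning against real submissions.

    import difflib

    AGREEMENT_THRESHOLD = 0.95  # assumed; would need tuning on real submissions

    def normalize(text):
        """Ignore differences in case and whitespace when comparing."""
        return " ".join(text.lower().split())

    def transcriptions_agree(first, second):
        """Accept a page when two independent submissions nearly match."""
        ratio = difflib.SequenceMatcher(
            None, normalize(first), normalize(second)
        ).ratio()
        return ratio >= AGREEMENT_THRESHOLD

    # If the pair disagrees, the page would simply be crowdsourced again.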

Snapter in Brief

Snapter is recently developed software that converts digital images of paper into PDF format. It uses "complicated algorithms" to crop, straighten, and flatten the pictured page; the flattening is especially useful when one uploads images of books with the notorious "roll" in the page. One has then successfully scanned a book without wear and tear on it, which is ideal when working with old books that have accrued both value and dust. Snapter should let us scan the books libraries want transcribed without any concern of damaging the original manuscript in the process. With the images converted into PDF format by Snapter, we can then easily distribute the files to crowdsourcing workers (probably through MTurk) in a format they can access, read, and transcribe.