Friday, February 25, 2011

Data Collection Logistics

In an attempt to expedite the transcription process, I have found it necessary to raise the compensation per transcribed page. Each page is at present valued at a dime, which, while still considerably cheaper than the $1.00 per page suggested by one chagrined Turker, is not as economical as the $0.03 and $0.05 we were paying last month. Unfortunately, while there does not yet seem to be a proportional relationship between compensation and quality, there is a clear relationship between compensation and the rate at which tasks are completed. Tasks posted at low rates do get accepted, but, understandably, they are not a priority for Turkers. Being at the "bottom of the barrel" is never desirable when working with deadlines--senior paper deadlines now, but also library deadlines later if this project develops. My concern right now is finding a tenable rate that keeps our product both expedient and affordable. It may take some calculus.
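
As a first pass at that calculus, here is a minimal sketch of the kind of back-of-the-envelope comparison involved. The batch size, per-day completion rates, and commission rate are all hypothetical placeholders, not measured figures:

```python
# Rough cost and turnaround comparison for a hypothetical batch of pages.
# Completion rates and the commission rate are placeholder assumptions,
# not measurements or MTurk's actual fee schedule.
PAGES = 500          # hypothetical batch size
COMMISSION = 0.10    # assumed requester fee on top of each reward

# reward per page -> assumed pages transcribed per day at that reward
estimated_rate = {0.03: 10, 0.05: 25, 0.10: 80}

for reward, pages_per_day in sorted(estimated_rate.items()):
    total_cost = PAGES * reward * (1 + COMMISSION)
    days = PAGES / pages_per_day
    print(f"${reward:.2f}/page -> ~${total_cost:.2f} total, ~{days:.0f} days")
```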

Requesting tasks in large batches is preferable when it is manageable. One present hurdle is identifying which document a worker has attempted to transcribe. As of now, I have to check manually by surveying the manuscript images. This would have to be replaced by some method of automation if larger quantities of manuscripts had to be transcribed at a time--for instance, if multiple libraries needed to have work done. Also, sometimes certain tasks within a given cluster of manuscripts get picked up while others are left untouched. This means that new CSV files have to be made containing only those manuscripts that were left unfinished from prior requests. The same has to be done for dud or inadequate submissions: sometimes workers simply cannot read the manuscript; sometimes I receive an advertisement as a submission (and I am usually not interested enough in the product to bear the inconvenience of having the task request wasted). Again, automation to alleviate some of these issues will be difficult but probably necessary.
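
As a starting point, the re-batching step could probably be scripted. The sketch below assumes hypothetical file names and column names (image_url, Answer.transcription, AssignmentStatus) standing in for however our batch and results files end up being laid out:

```python
# Sketch: rebuild an input CSV containing only the pages that still need work.
# File and column names here are assumptions about our own batch layout.
import csv

def unfinished_pages(batch_file, results_file, output_file):
    # Pages requested in the original batch
    with open(batch_file, newline="") as f:
        requested = list(csv.DictReader(f))

    # Pages that came back with an approved, non-empty transcription
    done = set()
    with open(results_file, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("AssignmentStatus") == "Approved" and row.get("Answer.transcription", "").strip():
                done.add(row["image_url"])

    # Write a fresh batch CSV for everything still missing, dud, or rejected
    remaining = [row for row in requested if row["image_url"] not in done]
    with open(output_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=requested[0].keys())
        writer.writeheader()
        writer.writerows(remaining)

unfinished_pages("batch_original.csv", "batch_results.csv", "batch_retry.csv")
```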

Monday, September 7, 2009

Manuscript Complications

An earlier post designated UCLA's Catalogue of Digitized Medieval Manuscripts as our primary affiliate project for transcription. We thought that because the collection is so new, its curators might be more eager to join with our project to develop a program that promotes the distribution of classical writings and enables viewers of their collection to actually comprehend the contents of their manuscripts. However, the collection also presented a sort of catch-22. Though the cryptic nature of most of the manuscripts in their collection validates to some degree our endeavors, it also complicates the process of transcription: if MTurk workers are unable to read the manuscripts, then they are unable to transcribe them for anyone else. It is also then commensurately difficult to "proofread" the submissions we would receive from those workers.


After discovering that, at least for immediate purposes, UCLA's collection proved undesirable, I decided to browse the internet for other digital collections of handwritten manuscripts. The best candidate is an online collection presented by the Library of Congress called "The Thomas Jefferson Papers." It is purportedly the largest collection of Thomas Jefferson's handwritten manuscripts in the world (over 27,000 documents), including letters and speeches. These have several advantages. Because they were written to be read, they are predominantly legible. Also, though the language is dated in some respects, the letters and speeches are written in English, giving workers context to aid in transcription. Furthermore, the collection has a cultural relevance that may further substantiate our project. Our pilot run will likely use one of these manuscripts.

Below is a link to an image of one of Jefferson's letters of correspondence:

Logic, Logistics, and Formats

The last entry discussed in brief my attempts to develop an adequate template for crowdsourcing the manuscripts to be transcribed. Since then, I have encountered a few new challenges--and occasional "frustrations"--and learned a bit more about what is required of the "Requesters" using MTurk. The next few entries will address some of the logistics of "piloting" our concept through Mechanical Turk.



It should first be noted that, while we originally intended to have workers e-mail their image transcriptions to WrittenRummage@gmail.com, it now seems prudent to operate primarily through MTurk. In designing a template for the general form of all transcription tasks we will be requesting, I found that MTurk supplies the requester with a variety of tools to serve whatever function might prove necessary, including multiple-choice bubbles for surveys and text boxes for purposes similar to ours. Given that as a Requester we have to use one of these supplied tools in our template, the inference is that Mechanical Turk wants most--if not all--information transfer to be done through their program. This is not entirely inconvenient or impractical: we can use the provided text box as the area in which workers type their transcriptions, saving several delaying steps of log-ins and file formatting. I have therefore changed the instructions in the template.
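
For the record, the template now amounts to little more than instructions, the manuscript image, and a text box. The sketch below generates a plausible version of that layout; the exact markup, and the ${image_url} placeholder (MTurk's column-substitution token, filled from each row of the batch CSV), should be checked against the Requester interface rather than taken as given:

```python
# Sketch of the HIT layout: show the manuscript image and collect the
# transcription in a plain text box. Markup details are approximate.
TEMPLATE = """\
<p>Please transcribe the handwritten page shown below as accurately as you can.
If a word is illegible, mark it with [?].</p>
<p><img src="${image_url}" alt="manuscript page" width="800" /></p>
<p><textarea name="transcription" rows="20" cols="80"></textarea></p>
"""

with open("transcription_template.html", "w") as f:
    f.write(TEMPLATE)
```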

Images will be provided once the first MTurk request is actually made.


Sunday, August 23, 2009

Task After the Drought

It has been almost unforgivably long since my last entry; however, the project trudges forward and is regaining speed after a swift deceleration. The impending task is developing a proof of concept for WrittenRummage, the name we have settled on for our small non-profit. This entails a sort of pilot run: using MTurk to crowdsource the task of transcribing a viewable--and decipherable--image of handwritten text, and establishing a template so the task can be repeated an indeterminate number of times. The actual process of defining a template and making our expectations of the worker explicit is more intricate than originally anticipated. Deadlines, compensation, the pre-determined quality rating of the worker, modes of communication, our project's credibility, and the potential for repeat workers must all be adequately attended to in establishing the task proper. This I will attempt to complete in the ensuing day or two.
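
To keep those decisions in one place, here is a sketch of the settings the task definition will have to pin down. Every value shown is a placeholder for discussion rather than a final choice:

```python
# Placeholder settings for the pilot HIT; nothing here is final.
HIT_SETTINGS = {
    "title": "Transcribe one page of a handwritten manuscript",
    "description": "Type out the text shown in the image for WrittenRummage.",
    "reward_usd": 0.05,               # compensation per page
    "assignment_duration_hours": 2,   # time a worker has once the task is accepted
    "lifetime_days": 7,               # how long the task stays posted
    "min_approval_rate": 95,          # pre-determined quality rating of the worker
    "contact": "WrittenRummage@gmail.com",
}
```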



It is part of our intent to have the compensation increase automatically and incrementally for each day the task remains available. That is, if the initial wage offered for completing the task is some quantity $C, then on day two of the task's posting, should it not yet have been accepted by a worker, the wage would automatically increase to $C + $0.01 or the like. This process would continue until a predetermined cap is reached. I am contacting MTurk to see if this process is possible.
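
Whether MTurk can apply such a schedule automatically is exactly the open question; the schedule itself, though, is simple arithmetic. A sketch, with a hypothetical starting wage, increment, and cap:

```python
# Escalating-wage schedule: start at an initial reward, add a fixed increment
# for each day the task sits unaccepted, and stop at a predetermined cap.
def wage_schedule(initial, increment=0.01, cap=0.25):
    schedule, wage, day = [], initial, 1
    while wage <= cap + 1e-9:          # small epsilon guards against float drift
        schedule.append((day, round(wage, 2)))
        wage += increment
        day += 1
    return schedule

for day, wage in wage_schedule(0.05):
    print(f"Day {day}: ${wage:.2f}")
```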

Tuesday, February 17, 2009

A UCLA Catalogue and Further Details

During the course of our research Dr. Lang discovered a nascent catalogue being developed by UCLA (the UCLA Catalogue of Digitized Medieval Manuscripts) that digitizes collected medieval--and sometimes older--manuscripts, stores them in a database, and posts them online to be freely viewed and even read. Because these documents predate the printing press, all of them are handwritten. They are collected from around the world and from a variety of medieval cultures, meaning they are written in a variety of languages, including Latin, Old English, Middle English, German, and French, all of which conveniently share the same alphabet. Digitization essentially consists of photographing every page of a document against a black background and uploading the images to be viewed one at a time. One could thereby--inconveniently--read through a whole manuscript by clicking on each image individually and zooming in to make it legible. The images are of high enough resolution that they do not distort when enlarged; however, reading penmanship through a series of pictures is less than desirable. This mode of viewing seems to persist despite UCLA's overarching intentions for the catalogue. Though they certainly aim to exhibit the documents, the notion of online galleries has yet to come into vogue. UCLA prides itself on being avant-garde in nearly all of its developments, and this catalogue seems little exception. Consequently, Dr. Lang and I are hopeful that they would take interest in the technology we are developing for the transcription of handwritten manuscripts via crowdsourcing. Preliminary research in this vein has revealed both potential and complications.

Since the languages used in the catalogue's manuscripts all employ the same lettering, it is feasible that we could have nearly all of them transcribed by crowdsource workers. The concern is language variation. Though it is possible that we will find workers familiar enough with French, German, English, or even Latin to discern a word by context when the handwriting is obscure, it is unlikely that many crowdsourcers know the intricacies of Old and Middle English well enough to do the same. We will likely have to provide additional compensation for the transcription of messier texts or texts in dead or rare languages.
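
One simple way to express that premium would be to scale the base per-page reward by an assumed difficulty multiplier for each language. The multipliers below are placeholders, not tested values:

```python
# Placeholder pricing rule: scale the base reward by how hard the language
# (and, optionally, the handwriting) is expected to be to transcribe.
BASE_REWARD = 0.05

DIFFICULTY_MULTIPLIER = {
    "english": 1.0,
    "french": 1.5,
    "german": 1.5,
    "latin": 2.0,
    "middle_english": 2.5,
    "old_english": 3.0,
}

def page_reward(language, messy=False):
    reward = BASE_REWARD * DIFFICULTY_MULTIPLIER.get(language, 2.0)
    if messy:
        reward *= 1.5              # extra premium for an especially messy hand
    return round(reward, 2)

print(page_reward("latin", messy=True))   # 0.15 with these placeholder values
```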

Beyond the language complications of the manuscripts, there is also the issue of the images themselves. The images as UCLA has them can be used "as is" and distributed through MTurk, or we can use Snapter to obtain a cropped and flattened PDF file. Snapter is preferable because of the quality of the image it produces; however, Snapter--or at least the trial version we now have available--is not automated. If the program is used in its current form, either we or the libraries using our service will have to select how each individual image is cropped and flattened--the algorithm used to produce a flat PDF varies depending on whether the user has uploaded an image of an open book (two pages), a single document, a card, or a board. We will have to find a way to efficiently manage large quantities of uploaded documents, which means finding some method of formatting images without having to manually manipulate each one.
The images below show an image from UCLA's Catalogue of Digitized Medieval Manuscripts (right), the image uploaded into Snapter with formatting selected (far left), and the output (center).
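
Until the formatting itself can be automated, one interim measure would be to record each scan's layout once in a small manifest and sort the images into per-layout folders, so that the same Snapter setting can be applied to a whole folder at a time. The manifest format, column names, and folder scheme below are all assumptions for the sake of the sketch:

```python
# Sketch: sort scans into folders by layout (open_book, single_page, card, board)
# using a hand-maintained manifest CSV with columns: filename, layout.
import csv
import shutil
from pathlib import Path

def sort_by_layout(manifest_csv="scan_manifest.csv", source_dir="scans", out_dir="sorted"):
    with open(manifest_csv, newline="") as f:
        for row in csv.DictReader(f):
            target = Path(out_dir) / row["layout"]
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy(Path(source_dir) / row["filename"], target / row["filename"])

sort_by_layout()
```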

Sunday, January 25, 2009

A Meeting of Interests

Despite our expectations, the Holy Spirit Research Center has few handwritten manuscripts in its possession. Much of its material consists of old articles, magazines, and typeset research and publications. Though these do not need to be transcribed from handwriting into searchable digital text--which was our original intent in developing this technology--the center nonetheless expressed an interest in converting the texts it does possess into a form that is both digital and searchable. The head of the research center expressed a desire to begin digitizing their texts en masse via a device released by Snapter. This device uses two digital cameras to photograph the pages of an open, resting book--so that the binding will not be broken--and then uses the same Snapter technology we intend to use to produce organized PDF files of the book. This, however, does not make the digitized texts searchable, which has become an information-age necessity to facilitate and hasten research. Here Dr. Lang's and my interests intersected with those of the Holy Spirit Research Center: we can use crowdsourcing to take digitized texts--and, where necessary, handwritten texts--and convert them into searchable text. This will begin with some texts that are largely unreadable by standard OCR technology: a compilation of some of the original Azusa Street movement articles. This will be the first step in our progress toward offering a nonprofit transcription service to local and eventually national libraries.

Castingwords.com and Crowdsourcing (in Brief)

Castingwords.com is an online audio transcription service that receives audio of various qualities via either mail or the internet. It takes the audio that was sent, breaks it into segments of time, and uses the Amazon-based program "Mechanical Turk" to transcribe the audio via crowdsourcing. Crowdsourcing is the process by which a company outsources a function to an undefined network of people rather than hiring one or several professionals to accomplish the same function*. The company chooses the "winning" solution, compensates the successful user with the predetermined (generally cheap) reward, and keeps the rights to the work and method. This process can occur with users cooperating, working individually, or with many individuals completing small tasks that amount to a cohesive whole. It provides the company a solution without having to pay higher wages, while the users benefit from a quick job and compensation, which is especially attractive for users in countries where the American dollar is worth more relative to their own currency. Furthermore, the company has a broader pool of amateur and potentially expert talent to select from, and it pays only when it is satisfied with the product.

Negatives, Drawbacks, Counterarguments
Crowdsourcing by nature entails a lack of accountability. By not hiring a professional or working with a specific enclave of people, the company is less able to hold workers accountable for steady progress or for finishing by a certain deadline. There are weaker forms of accountability, such as predetermined completion dates, just as there are predetermined expectations for quality, but there is not the level of commitment that comes standard with contracted professional employment. A worker is liable either to begin a task and not complete it or to fall short of the company's standard. Thus there is no assurance that anyone who undertakes a crowdsourced task will produce the quality of work the company prefers. Because of this risk, the time--and money--spent inspecting the quality and accuracy of the work, especially on large-scale projects, could make crowdsourcing less profitable than a more controlled business model. We will attempt to circumvent this problem by crowdsourcing not only the original transcription but also the proofreading process. If the submitted files match in size, the transcription will be accepted; if not, we will crowdsource again until the transcription and the proofread text match.
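
A minimal sketch of that acceptance rule, taking "match in size" to mean the two submissions' lengths agree within a small tolerance (the tolerance value is an assumption, and a stricter comparison such as edit distance could be substituted later):

```python
# Accept a page only when the transcription pass and the proofreading pass
# agree in length within a small tolerance; otherwise it goes back out.
def submissions_match(transcription, proofread, tolerance=0.02):
    a, b = transcription.strip(), proofread.strip()
    if not a or not b:
        return False
    longer = max(len(a), len(b))
    return abs(len(a) - len(b)) / longer <= tolerance

# Hypothetical example: two workers' versions of the same page
first_pass = "Dear Sir, I have received your letter of the 12th ..."
second_pass = "Dear Sir, I have recieved your letter of the 12th ..."
print(submissions_match(first_pass, second_pass))   # True: lengths nearly agree
```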