Ingest & Cataloging


Upon receiving digitized assets from Media Preserve, the digital archivist immediately transferred all digital files to a 24 TB Network Attached Storage (NAS) drive operating at RAID 5 (18 TB available for storage). Preservation files were then uploaded to Amazon Web Services' Glacier storage, while Access files were migrated to a 6 TB G-Technology hard drive. Preservation and Access copies were then deleted from the NAS drive in order to efficiently manage current and expected media ingest and storage needs. Princeton University received a duplicate set of all digital assets, which remains in cold storage at their facility.
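The routing decision above (Preservation masters to cold storage, Access copies to a local drive) can be sketched as a simple rule table. The role suffixes and destination paths below are illustrative assumptions, not the project's actual naming convention:

```python
from pathlib import Path

# Hypothetical routing rules: file-role suffixes and destinations are
# illustrative, not PEN America's actual naming convention.
DESTINATIONS = {
    "pres": "glacier://pen-america-preservation",  # AWS Glacier cold storage
    "access": "/Volumes/GTech6TB/access",          # G-Technology working drive
}

def route_asset(filename: str) -> str:
    """Return the long-term destination for a digitized file based on its role suffix."""
    stem = Path(filename).stem
    for role, dest in DESTINATIONS.items():
        if stem.endswith("_" + role):
            return dest
    raise ValueError("unknown role for " + filename)

print(route_asset("C0760_c2200_pres.mov"))
```

Once both copies are confirmed at their destinations, the NAS originals can be deleted to free space for the next ingest batch.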

Cataloging & Descriptive Metadata

After assets were ingested, PEN America catalogers (a mix of interns, freelance consultants, and a digital archivist) began cataloging the material. Due to the scale of the collection, catalogers were able to assess and catalog each asset as a unique digital object. In this way, our cataloging method introduced bibliographic-level descriptions for individual objects while honoring traditional archival practices for grouping, producing a flexible, user-oriented collection.

Initial subject headings were evaluated against the UNESCO Thesaurus, which is published as a Simple Knowledge Organization System (SKOS) vocabulary. The UNESCO thesaurus is a three-tiered hierarchy that allows efficient alphabetical and hierarchical browsing. Its seven major subject fields provide a logical conceptual model that would later inform PEN America's taxonomy. The URL/URI of each subject heading used was recorded so that the controlled vocabulary could be displayed for future projects and employees of the archive. Where the UNESCO vocabulary did not adequately fulfill our cataloging needs, the Library of Congress vocabularies were consulted, and if further information was still needed, catalogers conducted a Wikipedia query.

Ultimately, a PEN America-specific taxonomy was created using the three-tiered hierarchy of the UNESCO SKOS. This structure helps ensure that the deliberately developed subject headings are used consistently and that, when new headings are needed, some intellectual deliberation goes into creating them. This is an intentional safeguard against a flat tagging system that could easily become idiosyncratic.
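A three-tiered taxonomy entry of this kind can be modeled as a nested mapping, with the source URI recorded at the leaf. The headings and placeholder URI below are illustrative, not the actual PEN America vocabulary:

```python
# A minimal sketch of one branch of a three-tiered taxonomy modeled on the
# UNESCO SKOS hierarchy; headings and the URI placeholder are illustrative.
taxonomy = {
    "Culture": {                        # tier 1: major subject field
        "Literature": {                 # tier 2: narrower concept
            "Literary criticism": {     # tier 3: specific heading
                "uri": "http://vocabularies.unesco.org/thesaurus/..."
            }
        }
    }
}

def lookup(term, tree=taxonomy, path=()):
    """Depth-first search for a heading, returning its full hierarchy path."""
    for key, sub in tree.items():
        if key == "uri":       # leaf metadata, not a heading
            continue
        if key == term:
            return path + (key,)
        found = lookup(term, sub, path + (key,))
        if found:
            return found
    return None
```

Requiring a new heading to be placed somewhere in this hierarchy before it can be used is what keeps the vocabulary from degenerating into flat, ad hoc tags.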

Each participant has a corresponding URL/URI harvested from Wikipedia and, eventually, DBpedia. Wikipedia provided consistent authority control for personal names by referencing authority files such as WorldCat, VIAF, LCCN, ULAN, and ISNI.
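DBpedia resource URIs follow directly from English Wikipedia article titles, so harvesting a participant's URI is largely a string transformation. A minimal sketch, assuming the participant's name matches their Wikipedia article title:

```python
def dbpedia_uri(name: str) -> str:
    """Map a participant name to its DBpedia resource URI.

    Assumes the name matches the English Wikipedia article title;
    ambiguous or redirected names would need manual reconciliation.
    """
    return "http://dbpedia.org/resource/" + name.strip().replace(" ", "_")

print(dbpedia_uri("Salman Rushdie"))  # → http://dbpedia.org/resource/Salman_Rushdie
```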

The PBCore metadata schema for sound and moving images was used for technical and descriptive metadata. Media Preserve supplied technical metadata for the analog assets, bundled with the digital files; PEN America then used the schema to apply descriptive, intellectual content.
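A minimal PBCore descriptive record can be assembled with the standard library's XML tools. The element names below are standard PBCore elements, but the title and values are illustrative, not an actual catalog record:

```python
import xml.etree.ElementTree as ET

# Sketch of a minimal PBCore descriptive record; values are illustrative.
NS = "http://pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", NS)

doc = ET.Element("{%s}pbcoreDescriptionDocument" % NS)
ET.SubElement(doc, "{%s}pbcoreIdentifier" % NS,
              source="Princeton University").text = "C0760_c2200"
ET.SubElement(doc, "{%s}pbcoreTitle" % NS).text = "Sample event recording"
contributor = ET.SubElement(doc, "{%s}pbcoreContributor" % NS)
ET.SubElement(contributor, "{%s}contributor" % NS).text = "Salman Rushdie"
ET.SubElement(contributor, "{%s}contributorRole" % NS).text = "Author"

print(ET.tostring(doc, encoding="unicode"))
```

In practice the technical metadata supplied by the vendor and the descriptive fields added by catalogers live side by side in the same PBCore document.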

Data Import

As catalogers applied metadata to assets, the data was cleaned and formatted for batch import into the database component of the archive. Each metadata category had to be associated directly with a unique asset identifier (in this case, Princeton University's ID). In the case of ContributorRole, we found that we had to create multiple relationships per Contributor to the unique ID. For instance:

C0760_c2200 → Salman Rushdie (Author)
C0760_c2200 → Salman Rushdie (Essayist)
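For batch import, each (asset, contributor, role) relationship becomes its own row, so a contributor with several roles on the same asset yields several rows. A sketch of that flattening, using the example above:

```python
from collections import defaultdict

# One row per (asset, contributor, role) relationship, mirroring the
# Salman Rushdie example above.
relationships = [
    ("C0760_c2200", "Salman Rushdie", "Author"),
    ("C0760_c2200", "Salman Rushdie", "Essayist"),
]

# Regroup rows by (asset, contributor) to recover each person's role list.
roles_by_contributor = defaultdict(list)
for asset_id, contributor, role in relationships:
    roles_by_contributor[(asset_id, contributor)].append(role)
```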

Post-Production and Upload

Content was evaluated for post-production editing. Due to the scale of the project, PEN America was able to parse out duplicate events when necessary and to discover previously unknown events. Furthermore, our post-production editor was able to splice events that spanned separate media into continuous streaming files.

Final Cut Pro and Logic Pro were used for their batch-edit and batch-export capabilities, which drastically reduced the time needed to upload content to third-party hosts (SoundCloud and YouTube).

Design and Development: Flyleaf Creative, Inc. & Medium Rare Interactive, Inc.

Use Cases

Research and discussion about current practices, design, and human-computer interaction took place between Flyleaf Creative (design firm), Medium Rare Interactive (developers), and PEN America. In particular, considerable thought went into how to create a positive and immersive user experience. We determined that primary importance would be placed on comprehensive and diverse search features. These features would allow various types of users (personas) to easily search, retrieve, and access material while also allowing for a sense of serendipitous discovery. To achieve this experience, we sought to blend traditional advanced Boolean keyword searches with interactive elements that give users a sense of involvement and action.


A major focal point of design was how to expose the breadth of the archive and prompt users toward discovery without burying the material under multiple clicks or complicated search terms and mechanisms. The decision to place search directly on the home page for desktop users became a core principle of the project. Three search scenarios were developed: Explore/Browse, Related Content, and Advanced/Integrated/Faceted search. Early in the process, emphasis was placed on integrating advanced search operability with the notion of serendipitous discovery.


In conjunction with PEN America, Flyleaf Creative developed personas: profiles of the types of individuals we hypothesized would use the site. We determined that these personas would consist of the following:

  • The Academic
  • The Professional Writer
  • Multimedia Producers
  • The LIT-phile Explorer/Novice
  • Supporters/Donors/Foundations
  • Activists/Advocates


The wireframe process involved in-depth discussions about user experience, information retrieval behavior, and expectations, as well as technological and budget allowances. Flyleaf Creative developed five static, simple mock-ups that focused on presenting content relationships through contextual storytelling, primarily on the home page. Feedback in the early stages provided insight into how users want to engage with, explore, and experience the content.

Design and Typography

The site follows PEN America's branding guidelines. Many design considerations went into the visual hierarchies applied to each archive asset to promote easy identification and readability for researchers.

Interactive Prototype

The static wireframes provided the basis for our initial interactive prototype, which contained enough content to simulate search returns and basic functionality. The interactive prototype was an essential component in the design and development of the archive site, as it allowed for a comprehensive user experience (UX) evaluation by graduate students at Pratt Institute's School of Information. The students, under the supervision of Dr. Craig MacDonald, provided qualitative feedback on design enhancements that would address findings from the UX evaluation. Please see the complete report, PEN America Archive-Pratt Final Report.

Development & Programming

From the start, the archive was imagined as an API-driven application first and a website second. Early discussions were dominated by the data model required to accurately map asset metadata. The API remains accessible (with an API key available from PEN America) to individuals and institutions that wish to access the data directly as Comma-Separated Values (CSV) or JavaScript Object Notation (JSON).
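Because the API serves both JSON and CSV, consumers can convert between the two with standard tools. A sketch of turning a JSON response into CSV with the Python standard library; the payload shape and field names are hypothetical, not the documented API schema:

```python
import csv
import io
import json

# Hypothetical JSON payload in roughly the shape an archive API might
# return; field names are illustrative, not the documented schema.
payload = json.loads("""
[{"id": "C0760_c2200", "title": "Sample event", "year": 1986},
 {"id": "C0760_c2201", "title": "Another event", "year": 1987}]
""")

# Flatten the records into CSV for spreadsheet-oriented users.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "title", "year"])
writer.writeheader()
writer.writerows(payload)
print(buf.getvalue())
```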

Ruby on Rails was selected as the application development framework for its flexibility, ease of development, and large developer pool to ensure the ability to maintain the archive into the future.

The public-facing front end was built using the React JavaScript library for a modern-feeling experience. Initially, this was built directly on top of the Rails application, but the need to work with batches of curated content did not fit well with the goals of the archive API and data model. WordPress was a much better fit for this type of content management, so a new content management layer for the front end was added. The WordPress layer accesses the API both to build sets of curated content from the archive, managed as custom posts, and to display general web pages (About, FAQ, etc.).

The MediaPreserve-Preservation Technologies

Ingest and Transfer of Audio and Video Assets

Upon receipt of materials, the intake technician entered all assets into MediaKeeper. Any notable defects were recorded and reported to the engineers and the project managers. Asset inspections were fully compliant with ISO 18933:2012. Once all items were entered, any discrepancies between the material received and the packing inventory were reported to the client.

Following check-in, video engineers retrieved batches of tapes for digitization. These tapes were then moved to the appropriate studio. Each barcode was scanned and entered into the MediaKeeper record to create a reference point for the deck on which the tape was played and the engineer who performed the work.

The video engineer removed the asset from its container and scanned the unique barcode to retrieve the inspection data and the client profile. Any serious issues were reported to the project manager so the client could be notified. All splices were checked for integrity or replaced.

Analog playback from one of our VTR models passed through a Digital DPS-290 TBC/synchronizer, then through an analog-to-SDI converter/frame synchronizer, out an SDI output, and into the SDI input of a capture card. This signal path provided the highest signal-to-noise ratio in the analog domain while also providing the best-quality conversion from the analog domain into the SDI digital domain of the capture cards. We used high-quality codecs that produced a wide range of output formats to suit clients' needs and provide user flexibility. Our various capture cards were installed on both Apple and PC platforms, and both platforms connected to the SAN network over fiber optics.

The engineer then inserted media into the VTR and carefully adjusted all video and audio signal levels for optimum performance. The engineer set up the appropriate video capture software to capture and digitally encode the original analog video content to a digital video file that was simultaneously written to the SAN network.

The design of the multiple ingest studios was the key difference between our system and those of other vendors: each tape was loaded and monitored by an engineer. We did not leave ingest and quality control to a robot, but provided the same ingest and QC with direct engineer involvement. During the ingest/reformatting process, the engineer monitored and documented any defects found in the recordings. This information was then extracted from our system and written to the XML metadata report. Following ingest, the tape was returned to its container, the barcode scanned, and the tape checked out of the studio and returned to the storage area. All media was fully monitored during ingest, enabling our engineers to parse individual assets into separate instantiations.

At the end of this process, files were moved to the client’s hard drives. Checksums were generated every time files were moved.
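Checksum-on-move can be sketched as hashing both the source and the copy and comparing digests before the source is removed. The source does not specify the algorithm; SHA-256 below is an assumption for illustration:

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large video masters never sit in memory.

    SHA-256 is an assumed algorithm; the vendor's actual checksum
    format is not specified in this report.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(src: Path, dst: Path) -> bool:
    """A copy is trusted only when both checksums match."""
    return sha256sum(src) == sha256sum(dst)
```

Regenerating and comparing checksums at every move is what catches silent corruption introduced by a failing drive or an interrupted transfer.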

Video Digitization Specifications:
Archival Master: 10-bit uncompressed MOV
Access Copy: DV50
Streaming File: H.264/MPEG-4

Audio Digitization Specifications:
Archival Master: 24-bit/96 kHz BWF
Access Copy: 16-bit/44.1 kHz BWF
Streaming File: MP3 at 192 kbps