Contextualizing AI-Generated Transcripts for Researchers

Overview

This project was completed during my time as the Senior Data Science Fellow at the University of California, San Francisco’s Industry Documents Library. The Industry Documents Library is a digital archive containing millions of documents publicly disclosed in litigation against industries that impact public health, including the drug, chemical, food, and fossil fuel industries. The archive’s mission is to preserve open access to this information and to support research on the commercial determinants of public health.

This vast collection encompasses documents, images, videos, and recordings. These materials can be studied individually, but researchers using the archive are increasingly interested in examining trends across whole collections or subsets of them. In this way, the Industry Documents Library is also a trove of data that can be used to uncover trends and patterns in the history of industries impacting public health. I helped the Industry Documents Library investigate what information is lost or changed when its collections are transformed into data: specifically, textual data in the form of AI-generated video transcripts.

We focused on a combination of collections metadata and computer-generated transcripts of video files. Like all information, data is not objective but constructed. Metadata is usually entered manually and is subject to human error. Video transcripts generated by computer programs are never 100% accurate. If accuracy varies with factors such as the age of the video or the type of event being recorded, how might this affect the conclusions of researchers who treat all video transcriptions as equally accurate? What guidance can the library provide to prevent researchers from drawing inaccurate conclusions from computer-generated text?
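One way to make the question of uneven accuracy concrete is to compare a sample of machine transcripts against hand-corrected versions and see whether the error rate shifts with the age of the video. The sketch below is illustrative only and is not the method used in this project: it assumes word error rate (WER) as the accuracy measure, assumes a hand-corrected reference transcript exists for each sampled video, and uses hypothetical function names (`word_error_rate`, `mean_wer_by_decade`) and a hypothetical `(year, reference, hypothesis)` record layout.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_decade(samples):
    """Average WER per decade.

    `samples` is an iterable of (year, reference, hypothesis) tuples,
    a hypothetical layout chosen for this sketch.
    """
    by_decade = defaultdict(list)
    for year, reference, hypothesis in samples:
        decade = (year // 10) * 10
        by_decade[decade].append(word_error_rate(reference, hypothesis))
    return {d: sum(v) / len(v) for d, v in sorted(by_decade.items())}
```

If the per-decade averages differ noticeably, that would be a signal that transcripts from some periods should be treated with more caution than others, which is the kind of guidance the library could pass on to researchers.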

So far, this project has produced a blog post, multiple presentations, and a research poster that was accepted for presentation at TextXD 2022.

lubov mckone
mlis candidate