DocuScope

Case Study #2

Project Title: DocuScope, http://www.cmu.edu/hss/english/research/docuscope.html

Principals: David Kaufer & Suguru Ishizaki – DocuScope Inventors

What it is: The project creators provided the following description.
DocuScope is a corpus-based text analysis tool that supports both quantitative and qualitative analyses of rhetorical strategies found in a broad range of textual artifacts, using a standard home-grown dictionary consisting of more than 40 million unique patterns of English that are classified into over 100 rhetorical functions. DocuScope also provides an authoring environment allowing investigators to build their own customized dictionaries according to their own language theories. *Peer-Reviewed research published with both the standard and customized dictionaries is discussed, as well as tradeoffs, limitations, and directions for the future.

What the Bold Text Means and Why it is Important.

Corpus-based text analysis tool. Answers the crucial “What is it?” question. Every digital technology description should clearly and succinctly explain what the object is. The tool may be novel, but the reasons for its existence should be familiar and easy to understand. The above description indicates that DocuScope is a corpus tool, in the same family of applications as a search function, a concordance program, a dictionary pattern matcher, or a mathematically complex machine learning algorithm.

Quantitative and qualitative analyses. Answers “What does it do?” Every digital technology description should explain how it analyzes information to generate new knowledge. The particular description of DocuScope indicates that researchers can use DocuScope to collect numerical information over corpora, but also to arrive at richer qualitative understandings. Like most digital tools in the humanities, DocuScope seeks to supplement rather than replace human reading and interpretive processes.

Rhetorical strategies. Answers “What is its special focus?” Every digital technology description should explain the focus of its analysis. DocuScope’s default dictionaries attempt to capture rhetorical (or reader experience) patterns across a corpus. It is not designed to capture morphemes, syntactic structure, or formal semantics.

Using a standard home-grown dictionary. Answers “How does it work?” Every digital technology description should explain the analytic engine behind its analyses. DocuScope relies on a large hand-crafted database built empirically by human knowledge workers over a decade. An answer to this question also helps evaluators refine their answer to the first question, “What is it?” DocuScope relies on hand-made dictionaries rather than proximity relations (like a concordance program) or machine learning.
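For readers unfamiliar with dictionary-based analysis, the general idea can be sketched in a few lines of Python. This is only an illustration of the technique, not DocuScope's actual engine or dictionary: the patterns, the category names, and the function names (tag_text, category_counts) are all hypothetical, and the greedy longest-match rule is an assumption made for the sake of the example.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: word patterns mapped to rhetorical categories.
# The entries and category names are invented for illustration only;
# they are not taken from DocuScope's dictionaries.
TOY_DICTIONARY = {
    ("i", "think"): "FirstPersonReasoning",
    ("i",): "FirstPerson",
    ("as", "a", "result"): "Reasoning",
    ("in", "the", "future"): "FutureProjection",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag_text(text, dictionary=TOY_DICTIONARY):
    """Return (pattern, category) pairs found by greedy longest-match lookup."""
    tokens = tokenize(text)
    max_len = max(len(pattern) for pattern in dictionary)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest possible pattern first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in dictionary:
                matches.append((" ".join(candidate), dictionary[candidate]))
                i += n  # skip past the matched pattern
                break
        else:
            i += 1  # no pattern starts here; move to the next token
    return matches


def category_counts(text, dictionary=TOY_DICTIONARY):
    """Count how often each category is matched in the text."""
    return Counter(category for _, category in tag_text(text, dictionary))


if __name__ == "__main__":
    sample = "I think the tool will improve in the future. As a result, I use it."
    for pattern, category in tag_text(sample):
        print(f"{pattern!r} -> {category}")
    print(dict(category_counts(sample)))
```

The list of matched patterns supports close, qualitative reading of where a category occurs in a text, while the category counts support quantitative comparison across texts. Swapping in a different dictionary changes what is counted without changing the matching code.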

Authoring environment . . . customized dictionaries . . . own language theories. Answers “Can it be customized by individual users?” Every digital technology description should explain whether researchers can customize it to their own particular interests and how.

*Peer-Reviewed research. Answers the all-important question “Can researchers use the tool and generate new knowledge as established by a community of peers?” We star this because this is likely to be the single most important datum for review committees.

Tradeoffs, limitations, and directions. Answers the “loose ends” questions. What’s wrong with it and how will it get better?

References to DocuScope

Books

Al-Malki, A. M., Kaufer, D., Ishizaki, S., & Dreher, K. (2012). Arab Women in Arab News: Old Stereotypes and New Media. Bloomsbury Academic.

Kaufer, D. & Butler, B. (2000). Designing Interactive Worlds with Words: Principles of Writing as Representational Composition. Routledge.

Kaufer, D. & Butler, B. (1996). Rhetoric and the Arts of Design. Routledge.

Kaufer, D., Ishizaki, S., Butler, B., & Collins, J. (2004). The Power of Words: Unveiling the Speaker and Writer’s Hidden Craft. Routledge.

Journals

Collins, J., Kaufer, D., Vlachos, P., Butler, B., & Ishizaki, S. (2004). Detecting collaborations in text comparing the authors’ rhetorical language choices in the Federalist Papers. Computers and the Humanities, 38(1), 15-36.

Geisler, C., Kaufer, D. & Itext Working Group. (2001). Future directions for research on the relationship between information technology and writing. Journal of Business and Technical Communication, Part I, 270-308.

Kaufer, D. (2006). Genre variation and minority ethnic identity: exploring the personal profile in Indian American community publications. Discourse & Society, 17(6), 761-784.

Kaufer, D. & Al-Malki, A. M. (2009). A “first” for women in the kingdom: Arab/West representations of female trendsetters in Saudi Arabia. Journal of Arab and Muslim Media Research, 2(2), 113-133.

Kaufer, D. & Al-Malki, A. M. (2009). The War on Terror through Arab-American eyes: the Arab-American press as a rhetorical counterpublic. Rhetoric Review, 28(1), 47-65.

Kaufer, D. & Hariman, R. (2008). A corpus analysis evaluating Hariman’s theory of political style. Text & Talk, 28(4), 475-500.

Kaufer, D. & Ishizaki, S. (2006). A corpus study of canned letters: mining the latent rhetorical proficiencies marketed to writers in a hurry and non-writers. IEEE Transactions on Professional Communication, 49(3), 254-266.

Kaufer, D., Ishizaki, S., Collins, J., & Vlachos, P. (2004). Teaching language awareness in rhetorical choice using Itext and visualization in classroom genre assignments. Journal of Business and Technical Communication, 18(3), 361-402.

Kaufer, D., Parry-Giles, S., & Klebanov, B. B. (forthcoming). Tracking “image bites” across the public/private divide: NBC News coverage of Hillary Clinton from scorned wife to senate candidate. Journal of Language and Politics.

Klebanov, B. B., Kaufer, D., & Franklin, H. (forthcoming). A figure in a field: semantic field-based analysis of antithesis. Journal of Cognitive Semiotics.

Parry-Giles, S. & Kaufer, D. (forthcoming). Lincoln reminiscences and nineteenth-century portraiture: the private virtues of presidential character. Rhetoric and Public Affairs.

Chapters in Edited Volumes

Hu, Y., Kaufer, D., & Ishizaki, S. (2010). Genre and Instinct. Computing with Instinct, Lecture Notes in Artificial Intelligence, LNAI 5897, ed. Cai, Y. Springer.

Ishizaki, S. & Kaufer, D. (2011). The DocuScope Text Analysis and Visualization Environment. Invited chapter for Applied Natural Language Processing and Content Analysis: Identification, Investigation, and Resolution, ed. McCarthy, P. & Boonthum, C.

Kaufer, D. (2004). Public vs. Private Rhetoric: An Analysis of the NY Times Writers on Writing Series. The Public in Rhetorical Theory, ed. Kent, T. & Couture, B. Utah State Press, 163-185.

Kaufer, D., Geisler, C., Ishizaki, S., & Vlachos, P. (2005). Computer-Support for Genre Analysis and Discovery. Ambient Intelligence for Scientific Discovery, ed. Cai, Y. Springer, 129-151.

Kaufer, D., Geisler, C., Vlachos, P., & Ishizaki, S. (2006). Mining Textual Knowledge for Writing Research and Education. Writing & Digital Media, ed. Waes, L. V., Leijten, M., & Neuwirth, C. Amsterdam: Elsevier, 115-129.

Kaufer, D., Ishizaki, S., & Al-Malki, A. M. (2007). A Framework for Training Writing Teachers in the Discourse Patterns Underlying Cross-institutional Writing Assignments. Sustaining Excellence in Communicating Across the Curriculum: Cross-institutional Experiences and Best Practices. Cambridge Scholars Press, UK.

Oakley, T. & Kaufer, D. (2007). Designing Clinical Experiences with Words: The Three Layers of Analysis in Clinical Reports; A Dilemma for Mental Spaces and Genre Theory. Mental Spaces in Discourse and Interaction, ed. Hougaard, A. & Oakley, T. John Benjamins Publishing Company.

Inventor’s Note: Even now, I would not likely list DocuScope on my vita as a separate entity. I think that is because in my environment, DocuScope would be listed as a “tool” more than an intellectual innovation in its own right. The line between tool and innovation in my environment is determined by patentability. There are no patentable algorithms in DocuScope. It simply put together several existing technologies that hadn’t been put together before. The dictionaries are an innovative feature because they took so long to develop. But because they were produced over time through theory and empirical observation, and not by algorithm, they are considered more “art” than patentable method. Ironically, had Suguru and I taken a machine learning approach, and done something even slightly novel with that, the chances for patentability would have been much higher and usability for humanistic research much lower. As a result, in my environment, the only way to demonstrate the value of the DocuScope environment was to show the kind of peer-reviewed research it can help support.