Text and Image Mining: From Mass Observation to Stage Magic

Wolfram R&D
6 Sept 2024 · 12:48

TLDR: The speaker discusses the application of Mathematica in text and image mining for humanities research, highlighting the differences between humanities and STEM fields. Projects include analyzing Old Bailey trial records, the Mass Observation Project, and stage magic history. Techniques such as distributional concept analysis, machine vision for circuit diagrams, and entity recognition are employed to uncover patterns and insights in historical data.

Takeaways

  • 📚 The speaker discusses using Mathematica for text and image mining in humanities research, highlighting the differences in approach between the humanities and scientific fields.
  • 👨‍💻 The Old Bailey Online project is mentioned as a significant example, showcasing a searchable database of criminal trials from 1674 to 1913, with visual data representation revealing patterns like the rise of plea bargaining.
  • 🔍 The Mass Observation Project collected diaries and observations to understand the daily lives of ordinary Britons; the speaker analyzes its archive using distributional concept analysis (DCA).
  • 🎩 A project on stage magic in the late 19th and early 20th century utilized text and image mining to develop an experimental approach to history, including the extraction and analysis of images related to magic tricks.
  • 🌐 The speaker has worked on creating a large archive of electronic texts related to the histories of electronics, computation, and scientific instrumentation, focusing on understanding circuit diagrams.
  • 🌉 Another project involves building a database of historical bridge images with metadata, combining Machine Vision with linked open data to extract features of interest in civil engineering history.
  • 🔗 The importance of crawling and indexing texts is emphasized, with examples of using Mathematica's tools to crawl the WorldCat Identities API and index texts for semantic search.
  • 🔍 The RAKE algorithm is highlighted for rapid automatic keyword extraction, useful for discovering and assessing the relevance of texts to research queries.
  • 🤖 Machine Vision techniques are being developed to analyze and label components in circuit diagrams, aiming to identify meaningful design patterns beyond individual components.
  • 📊 A compression-based clustering algorithm, grounded in Kolmogorov complexity, is used for discovery and to find relationships between historical figures or concepts in text collections.
  • 📈 The random indexing method is introduced for quickly mining terminology and understanding semantic characteristics of texts, showing how it can reveal author preferences and semantic evolution.

Q & A

  • What is the main focus of humanities research as described in the transcript?

    -The main focus of humanities research is on close reading and interpretation of sources such as texts, images, artifacts, and media.

  • What is the Old Bailey Online project mentioned in the transcript?

    -Old Bailey Online is a fully searchable database of all the criminal trials held at London's Central Criminal Court between 1674 and 1913.

  • How does the graph from the Old Bailey Online project illustrate the change in trial length over time?

    -The graph shows a bifurcation in trial length starting in the 1800s, with some trials being less than 100 words and others more than a couple of hundred words, reflecting the rise of plea bargaining and guilty pleas.

  • What is the Mass Observation Project discussed in the transcript?

    -The Mass Observation Project began in 1937 and involved recruiting volunteers to write diaries and answer questionnaires to better understand the lives of ordinary Britons.

  • What is distributional concept analysis (DCA) and how is it used in the research?

    -DCA is a text analysis method that relies on small n-gram windows positioned before and after a word of interest, rather than centered on it, to understand the context of the word.

  • What techniques were applied to the study of stage magic in the late 19th and early 20th century?

    -The study of stage magic involved text and image mining, desktop fabrication, physical computing, and automatic extraction and identification of images from periodicals and magic magazines.

  • How does the speaker use Mathematica to compile an archive of texts from the open web?

    -The speaker used Mathematica to write crawlers that compile an archive of millions of pages of text related to the histories of electronics, computation, and scientific instrumentation.

  • What is the goal of developing tools that understand circuit diagrams?

    -The goal is to develop a system that can identify meaningful assemblies of components or design idioms in schematics for semantic search in the history of electronics.

  • What is the purpose of the historical bridge images database project mentioned?

    -The purpose is to create a large database of historical bridge images with metadata and use machine vision to extract features of interest in the history of civil engineering.

  • How does the speaker use entity recognition and keyword extraction to aid research?

    -The speaker uses entity recognition and keyword extraction to link keywords to entities, discover relevant sources, and find answers to research questions within a curated collection of sources.

  • What is the random indexing method and how is it used in text mining?

    -The random indexing method creates vectors of co-occurrence events in a context window around a word, helping to find words that pattern together and measure the semantic density of words.
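
The random indexing idea described in this answer can be sketched in a few lines. The talk used Mathematica; the following is a plain Python illustration, not the speaker's code. The dimension, the number of nonzero entries, the window size, and the function names `index_vector`, `train`, and `cosine` are all assumptions chosen for the example.

```python
import math
import random
from collections import defaultdict

DIM = 300      # size of the reduced vector space (assumed value)
NONZERO = 10   # number of +1/-1 entries per index vector (assumed value)
WINDOW = 2     # words of context on each side (assumed value)


def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse random +1/-1 vector, seeded by the word so it is repeatable."""
    rng = random.Random(word)
    positions = rng.sample(range(dim), nonzero)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def train(tokens, window=WINDOW, dim=DIM):
    """Add each neighbour's index vector into the word's context vector."""
    context = defaultdict(lambda: [0] * dim)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            for pos, sign in index_vector(tokens[j], dim).items():
                context[word][pos] += sign
    return context


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy sentence, only to show the calls.
tokens = ("the magician shuffled the cards and "
          "the conjurer shuffled the cards again").split()
vectors = train(tokens)
print(cosine(vectors["magician"], vectors["conjurer"]))
```

Words that occur in similar contexts accumulate similar context vectors, so a high cosine suggests that two words pattern together. Because the vectors are built in a single pass with fixed dimensionality, the approach scales to large collections without building a full co-occurrence matrix.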

Outlines

00:00

📚 Text and Image Mining in Humanities Research

The speaker discusses their experiences using Mathematica for text and image mining in humanities research, highlighting the differences between humanities and STEM fields. They emphasize the importance of close reading and interpretation in humanities, and how computational tools are often developed by researchers themselves. The Old Bailey Online project is introduced, showcasing a searchable database of criminal trials from 1674 to 1913. A graph is presented, illustrating the length of trials over time, revealing a bifurcation in trial length during the 1800s, which is attributed to the rise of plea bargaining. Another project mentioned is a collaboration with Amy Bell, focusing on the Mass Observation Project from 1937, which collected diaries and observations to understand the lives of ordinary Britons. The project uses distributional concept analysis (DCA) to study text data. The speaker also touches on a project about stage magic, where they extracted images and text from historical sources to develop an experimental approach to history.

05:02

🔍 Advanced Text and Image Mining Techniques

The speaker elaborates on various text and image mining projects they have been involved in. They discuss the creation of a database of historical bridge images with metadata, using Mathematica's capabilities for web crawling and linked open data. They aim to develop a machine vision system to extract features of interest in civil engineering history. Another project involves developing tools to understand circuit diagrams, with the ultimate goal of identifying meaningful design idioms in schematics. The speaker also mentions using clustering algorithms for discovery while writing, and the use of the RAKE algorithm for keyword extraction. They demonstrate how these tools can be used to find relevant sources and make connections between texts. The speaker concludes with an example of using random indexing to analyze the semantic characteristics of text, showing how it can reveal authorial preferences and semantic density.

10:04

🧠 Semantic Analysis and Data Mining in Historical Research

In this section, the speaker delves into the application of semantic analysis and data mining in historical research. They discuss the use of random indexing, a method for creating vectors of co-occurrence events, to study the semantic characteristics of text. The speaker provides an example of how this method can reveal an author's stylistic preferences by comparing the contexts in which certain phrases are used. Additionally, they explain how random indexing can be used to identify words that pattern together, providing insights into the semantic evolution of terms over time. The speaker also touches on the use of clustering algorithms to discover relationships between historical figures by analyzing biographical texts. They emphasize the utility of these techniques in uncovering hidden connections and patterns within large collections of historical data.
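
Compression-based clustering of the kind mentioned above is often approximated with the normalized compression distance (NCD), which uses a real compressor as a stand-in for Kolmogorov complexity. The sketch below is a Python illustration under that assumption, not the speaker's Mathematica code; `zlib` as the compressor, the function names, and the placeholder snippets are all choices made for the example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_neighbours(docs: dict) -> dict:
    """Map each document name to the name of its closest other document."""
    nearest = {}
    for name, text in docs.items():
        distances = [(ncd(text, other), other_name)
                     for other_name, other in docs.items()
                     if other_name != name]
        nearest[name] = min(distances)[1]
    return nearest


# Invented placeholder snippets, not real biographies.
docs = {
    "trader_a": "fur trader and explorer who mapped rivers for a trading company",
    "trader_b": "explorer and fur trader who surveyed rivers for a trading company",
    "bishop_c": "bishop who founded a seminary and wrote pastoral letters",
}
print(nearest_neighbours(docs))
```

A full clustering would feed the pairwise distances into a hierarchical clustering routine. Very short texts compress poorly, so the distances become more meaningful on longer documents such as full biographical entries.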

Keywords

💡Text and Image Mining

Text and image mining refers to the process of extracting useful information from textual and visual data using computational methods. In the context of the video, the speaker discusses how these techniques are applied in the humanities, particularly in historical research. For instance, they mention using Mathematica for text and image mining in projects like Old Bailey Online, where they analyzed the length of criminal trials over time.

💡Humanities

The humanities are academic disciplines that study the human condition, using methods that are primarily analytical, critical, or speculative, as opposed to the mainly empirical approaches of the natural sciences. The video highlights how computational tools like text and image mining are being tailored for use in the humanities, which traditionally focus on close reading and interpretation of sources.

💡Close Reading

Close reading is a method of textual analysis that involves a careful, detailed examination of a text, often focusing on specific words, phrases, or structures. The speaker notes that in the humanities, there is a focus on close reading and interpretation of sources, which contrasts with the more quantitative approaches common in the sciences.

💡Old Bailey Online

The Old Bailey Online is a digital archive mentioned in the video that contains fully searchable records of all criminal trials held at London's Central Criminal Court between 1674 and 1913. The speaker discusses a project where they plotted the length of trials over time, revealing patterns such as the rise of plea bargaining in the 19th century.

💡Mass Observation Project

The Mass Observation Project, initiated in 1937, is highlighted as an example of a research collaboration in the video. It involved recruiting volunteers to write diaries and answer questionnaires to understand the lives of ordinary Britons. The speaker and their collaborator are using text analysis methods like distributional concept analysis to study this archive.

💡Distributional Concept Analysis (DCA)

Distributional Concept Analysis is a text analysis method developed by Peter de Bolla and his colleagues. Rather than centering a context window on a word of interest, DCA uses small n-gram windows positioned before and after it. The speaker mentions using DCA in their research on the Mass Observation Project to understand cultural history.
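
The idea of separate windows before and after a word of interest can be illustrated with a small Python sketch. This shows only the windowing step and is not de Bolla's method or the speaker's code; the function name `side_windows`, the window size, and the toy sentence are assumptions made for the example.

```python
from collections import Counter


def side_windows(tokens, target, size=3):
    """Count words in the `size` tokens before and after each `target`."""
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        before.update(tokens[max(0, i - size):i])   # left-hand window
        after.update(tokens[i + 1:i + 1 + size])    # right-hand window
    return before, after


# Invented toy sentence, only to show the calls.
tokens = ("we queued for bread before the shop opened and "
          "queued again for bread after work").split()
before, after = side_windows(tokens, "bread")
print(before.most_common(3))
print(after.most_common(3))
```

Keeping the two sides apart preserves the difference between what tends to lead up to a word and what tends to follow it, which a single centered window would merge.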

💡Stage Magic

The subject of stage magic in the late 19th and early 20th century is discussed as part of a research project where the speaker and a former student applied text and image mining techniques. They aimed to develop an experimental approach to history, including the extraction of images and identification of elements like people, apparatus, and card tricks from historical magic magazines.

💡Machine Vision

Machine vision, a subset of computer vision, is mentioned in the context of developing tools to understand circuit diagrams. The speaker describes how they used machine vision techniques to automatically label and identify components in circuit images extracted from texts, aiming to develop a system that can identify meaningful design idioms in schematics.

💡Linked Open Data

Linked Open Data refers to a set of practices for publishing structured data so that it can be interlinked and become more useful through semantic queries. The speaker discusses combining their database of historical bridge images with linked open data using Mathematica's support for SPARQL queries, to enrich the metadata and facilitate more meaningful searches.

💡Rapid Automatic Keyword Extraction (RAKE)

RAKE is a method for extracting keywords from text, which the speaker uses on a Wikipedia article as an example. It's particularly useful for shorter texts and can be combined with other methods like TF-IDF to assess the relevance of a text to a query or to measure the similarity between texts. The speaker finds RAKE helpful for discovery within their curated collections of sources.
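
A minimal Python sketch of the RAKE scoring scheme follows, as an illustration rather than the speaker's Mathematica implementation. The stopword list is a tiny assumed sample, and the names `candidate_phrases` and `rake` are hypothetical. Candidate phrases are the runs of words between stopwords and punctuation; each word is scored by degree divided by frequency, and a phrase scores the sum of its words.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is",
             "are", "was", "were", "with", "by", "that", "this", "it", "as",
             "at", "from", "or", "be"}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, top=5):
    """Return the `top` highest-scoring candidate phrases with their scores."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


sample = ("Rapid automatic keyword extraction is a method for extracting "
          "keywords from the text of individual documents.")
for phrase, score in rake(sample):
    print(f"{score:5.1f}  {phrase}")
```

Because the score rewards words that appear in long phrases, multi-word technical terms tend to rise to the top, which suits the discovery use described above.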

Highlights

Experiences using Mathematica for text and image mining in humanities research.

Humanities research focuses on close reading and interpretation of sources like texts, images, artifacts, and media.

The Old Bailey Online project: a searchable database of criminal trials from 1674 to 1913.

Visualization of criminal trial length over time revealing a bifurcation in the 1800s.

The rise of plea bargaining and guilty pleas reflected in trial length changes.

Collaboration with Amy Bell on the Mass Observation Project, analyzing daily life and culture in mid-20th century Britain.

Distributional Concept Analysis (DCA) as a method for text analysis.

Stage magic research combining text and image mining with experimental history techniques.

Automatic extraction of images and identification of magic-related items from historical texts.

Creating a wobble image from stereo pairs of early 20th-century seance photography.

Crawling the open web to compile an archive of texts related to the history of electronics and computation.

Developing tools to understand and label circuit diagrams automatically.

Creating a database of historical bridge images with metadata for civil engineering research.

Using Machine Vision to extract features of interest from historical bridge images.

Crawling the WorldCat Identities API to gather metadata for persons and institutions.

Using RAKE and TF-IDF for keyword extraction and text relevance assessment.

Linking extracted keywords to entities for discovery and research question answering.

Compression clustering algorithm to reveal relationships between early Canadian historical figures.

Random indexing method for fast semantic analysis of large text collections.