Text and Image Mining: From Mass Observation to Stage Magic
TLDR
The speaker discusses the application of Mathematica in text and image mining for humanities research, highlighting the differences between humanities and STEM fields. Projects include analyzing Old Bailey trial records, the Mass Observation Project, and stage magic history. Techniques such as distributional concept analysis, machine vision for circuit diagrams, and entity recognition are employed to uncover patterns and insights in historical data.
Takeaways
- The speaker discusses using Mathematica for text and image mining in humanities research, highlighting the differences in approach between the humanities and scientific fields.
- The Old Bailey Online project is mentioned as a significant example: a searchable database of criminal trials from 1674 to 1913, whose visualizations reveal patterns such as the rise of plea bargaining.
- The Mass Observation Project collected diaries and observations to understand the daily lives of ordinary Britons; the speaker analyzes its texts with distributional concept analysis (DCA).
- A project on stage magic in the late 19th and early 20th centuries used text and image mining to develop an experimental approach to history, including the extraction and analysis of images related to magic tricks.
- The speaker has built a large archive of electronic texts on the histories of electronics, computation, and scientific instrumentation, with a focus on understanding circuit diagrams.
- Another project involves building a database of historical bridge images with metadata, combining machine vision with linked open data to extract features of interest in civil engineering history.
- The importance of crawling and indexing texts is emphasized, with examples of using Mathematica's tools to crawl the WorldCat Identities API and index texts for semantic search.
- The RAKE algorithm is highlighted for rapid automatic keyword extraction, useful for discovering texts and assessing their relevance to research queries.
- Machine vision techniques are being developed to analyze and label components in circuit diagrams, aiming to identify meaningful design patterns beyond individual components.
- Clustering algorithms, such as one based on Kolmogorov complexity, are used for discovery and to find relationships between historical figures or concepts in text collections.
- The random indexing method is introduced for quickly mining terminology and understanding semantic characteristics of texts, showing how it can reveal author preferences and semantic evolution.
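The compression-based clustering mentioned in the takeaways can be illustrated with the normalized compression distance (NCD), a common practical stand-in for Kolmogorov complexity. The following is a minimal Python sketch, not the speaker's Mathematica code; the choice of zlib as the compressor and the function names are assumptions made for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share a lot of structure;
    values near 1 suggest they share very little.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def distance_matrix(texts):
    """Pairwise NCD values, suitable as input to a clustering routine."""
    return [[ncd(a, b) for b in texts] for a in texts]
```

A matrix produced by `distance_matrix` could be passed to any hierarchical clustering routine to group, for example, biographies that share vocabulary and phrasing.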
Q & A
What is the main focus of humanities research as described in the transcript?
- The main focus of humanities research is on close reading and interpretation of sources such as texts, images, artifacts, and media.
What is the Old Bailey Online project mentioned in the transcript?
- The Old Bailey Online is a fully searchable database of all the criminal trials held at London's Central Criminal Court between 1674 and 1913.
How does the graph in the Old Bailey Online project illustrate the change in trial length over time?
- The graph shows a bifurcation in trial length starting in the 1800s, with some trials being less than 100 words and others more than a couple of hundred words, reflecting the rise of plea bargaining and guilty pleas.
What is the Mass Observation Project discussed in the transcript?
- The Mass Observation Project began in 1937 and involved recruiting volunteers to write diaries and answer questionnaires to better understand the lives of ordinary Britons.
What is distributional concept analysis (DCA) and how is it used in the research?
- DCA is a text analysis method that relies on small n-gram windows positioned before and after a word of interest, rather than centered on it, to understand the context of the word.
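One way to read that description of DCA is to count the n-grams immediately before and immediately after each occurrence of a target word, keeping the two sides separate. The Python sketch below follows that reading only; it is an assumption based on the one-sentence summary above, not the published DCA procedure, and `side_windows` is a hypothetical helper name.

```python
from collections import Counter


def side_windows(tokens, target, n=2):
    """Count the n-grams directly before and directly after each target hit.

    The two windows are kept in separate counters rather than merged
    into one window centered on the target word.
    """
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        if i - n >= 0:  # a full window exists to the left
            before[tuple(tokens[i - n:i])] += 1
        if i + n < len(tokens):  # a full window exists to the right
            after[tuple(tokens[i + 1:i + 1 + n])] += 1
    return before, after


# Toy example; the sentence is invented for illustration.
tokens = "we heard the war news on the wireless and the war news was bad".split()
before, after = side_windows(tokens, "news", n=1)
```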
What techniques were applied to the study of stage magic in the late 19th and early 20th centuries?
- The study of stage magic involved text and image mining, desktop fabrication, physical computing, and automatic extraction and identification of images from periodicals and magic magazines.
How does the speaker use Mathematica to compile an archive of texts from the open web?
- The speaker used Mathematica to write crawlers that compile an archive of millions of pages of text related to the histories of electronics, computation, and scientific instrumentation.
What is the goal of developing tools that understand circuit diagrams?
- The goal is to develop a system that can identify meaningful assemblies of components or design idioms in schematics for semantic search in the history of electronics.
What is the purpose of the historical bridge images database project mentioned?
- The purpose is to create a large database of historical bridge images with metadata and use machine vision to extract features of interest in the history of civil engineering.
How does the speaker use entity recognition and keyword extraction to aid research?
- The speaker uses entity recognition and keyword extraction to link keywords to entities, discover relevant sources, and find answers to research questions within a curated collection of sources.
What is the random indexing method and how is it used in text mining?
- The random indexing method creates vectors of co-occurrence events in a context window around a word, helping to find words that pattern together and measure the semantic density of words.
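The random indexing idea can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the speaker's Mathematica implementation: each word gets a sparse random index vector, and a word's context vector is the sum of the index vectors of its neighbors. The dimension, sparsity, window size, and function names are all assumed values chosen for the example.

```python
import math
import random


def make_index_vector(dim, nonzero, rng):
    """Sparse random vector stored as (position, sign) pairs."""
    positions = rng.sample(range(dim), nonzero)
    return [(p, rng.choice((-1, 1))) for p in positions]


def random_indexing(tokens, dim=300, nonzero=6, window=2, seed=0):
    """Build a context vector for every word type in tokens."""
    rng = random.Random(seed)
    index = {w: make_index_vector(dim, nonzero, rng) for w in sorted(set(tokens))}
    context = {w: [0] * dim for w in index}
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Add the neighbor's index vector to this word's context vector.
            for pos, sign in index[tokens[j]]:
                context[word][pos] += sign
    return context


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Words that occur in similar contexts end up with context vectors that have a high cosine similarity, which is one way to find words that pattern together.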
Outlines
Text and Image Mining in Humanities Research
The speaker discusses their experiences using Mathematica for text and image mining in humanities research, highlighting the differences between humanities and STEM fields. They emphasize the importance of close reading and interpretation in humanities, and how computational tools are often developed by researchers themselves. The Old Bailey Online project is introduced, showcasing a searchable database of criminal trials from 1674 to 1913. A graph is presented, illustrating the length of trials over time, revealing a bifurcation in trial length during the 1800s, which is attributed to the rise of plea bargaining. Another project mentioned is a collaboration with Amy Bell, focusing on the Mass Observation Project from 1937, which collected diaries and observations to understand the lives of ordinary Britons. The project uses distributional concept analysis (DCA) to study text data. The speaker also touches on a project about stage magic, where they extracted images and text from historical sources to develop an experimental approach to history.
Advanced Text and Image Mining Techniques
The speaker elaborates on various text and image mining projects they have been involved in. They discuss the creation of a database of historical bridge images with metadata, using Mathematica's capabilities for web crawling and linked open data. They aim to develop a machine vision system to extract features of interest in civil engineering history. Another project involves developing tools to understand circuit diagrams, with the ultimate goal of identifying meaningful design idioms in schematics. The speaker also mentions using clustering algorithms for discovery while writing, and the use of the RAKE algorithm for keyword extraction. They demonstrate how these tools can be used to find relevant sources and make connections between texts. The speaker concludes with an example of using random indexing to analyze the semantic characteristics of text, showing how it can reveal authorial preferences and semantic density.
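For readers unfamiliar with RAKE, the Python sketch below shows the core of the algorithm: candidate phrases are split at stopwords and punctuation, each word is scored by degree divided by frequency, and each phrase is scored by the sum of its word scores. It is a simplified illustration rather than the speaker's code; the short stopword list is an assumption, and real uses would rely on a much fuller list.

```python
import re
from collections import defaultdict

# Deliberately tiny stopword list, assumed for this example only.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "in", "is",
    "it", "of", "on", "or", "that", "the", "to", "was", "were", "with",
}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake(text, top=5):
    """Return the top-scoring candidate phrases as (phrase, score) pairs."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Because longer phrases accumulate higher degree scores, RAKE tends to favor multi-word technical terms, which suits keyword discovery in specialized historical sources.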
Semantic Analysis and Data Mining in Historical Research
In this section, the speaker delves into the application of semantic analysis and data mining in historical research. They discuss the use of random indexing, a method for creating vectors of co-occurrence events, to study the semantic characteristics of text. The speaker provides an example of how this method can reveal an author's stylistic preferences by comparing the contexts in which certain phrases are used. Additionally, they explain how random indexing can be used to identify words that pattern together, providing insights into the semantic evolution of terms over time. The speaker also touches on the use of clustering algorithms to discover relationships between historical figures by analyzing biographical texts. They emphasize the utility of these techniques in uncovering hidden connections and patterns within large collections of historical data.
Keywords
Text and Image Mining
Humanities
Close Reading
Old Bailey Online
Mass Observation Project
Distributional Concept Analysis (DCA)
Stage Magic
Machine Vision
Linked Open Data
Rapid Automatic Keyword Extraction (RAKE)
Highlights
Experiences using Mathematica for text and image mining in humanities research.
Humanities research focuses on close reading and interpretation of sources like texts, images, artifacts, and media.
The Old Bailey Online project: a searchable database of criminal trials from 1674 to 1913.
Visualization of criminal trial length over time revealing a bifurcation in the 1800s.
The rise of plea bargaining and guilty pleas reflected in trial length changes.
Collaboration with Amy Bell on the Mass Observation Project, analyzing daily life and culture in mid-20th-century Britain.
Distributional Concept Analysis (DCA) as a method for text analysis.
Stage magic research combining text and image mining with experimental history techniques.
Automatic extraction of images and identification of magic-related items from historical texts.
Creating a wobble image from stereo pairs of early 20th-century seance photography.
Crawling the open web to compile an archive of texts related to the history of electronics and computation.
Developing tools to understand and label circuit diagrams automatically.
Creating a database of historical bridge images with metadata for civil engineering research.
Using machine vision to extract features of interest from historical bridge images.
Crawling the WorldCat Identities API to gather metadata for persons and institutions.
Using RAKE and TF-IDF for keyword extraction and text relevance assessment.
Linking extracted keywords to entities for discovery and research question answering.
Compression clustering algorithm to reveal relationships between early Canadian historical figures.
Random indexing method for fast semantic analysis of large text collections.