Joanna Guldi, "The Dangerous Art of Text Mining: A Methodology for Digital History" (Cambridge UP, 2022) - podcast episode cover

Joanna Guldi, "The Dangerous Art of Text Mining: A Methodology for Digital History" (Cambridge UP, 2022)

May 28, 2024 · 1 hr 8 min · Ep. 25

Summary

Joanna Guldi discusses her book on text mining and digital history, emphasizing the importance of collaboration between data scientists and humanists. She highlights the potential pitfalls of data-driven research without critical thinking, advocating for hybrid teams and iterative analysis. The conversation covers case studies, temporal experience, and the future of data-intensive historical research.

Episode description

The Dangerous Art of Text Mining: A Methodology for Digital History (Cambridge UP, 2022) celebrates the bold new research now possible because of text mining: the art of counting words over time. However, this book also presents a warning: without help from the humanities, data science can distort the past and lead to perilous errors. The book opens with a rogue's gallery of errors, then tours the ground-breaking analyses that have resulted from collaborations between humanists and data scientists. Jo Guldi explores how text mining can give a glimpse of the changing history of the past - for example, how quickly Americans forgot the history of slavery. Textual data can even prove who was responsible in Congress for silencing environmentalism over recent decades. The book ends with an impassioned vision of what text mining in defence of democracy would look like, and why humanists need to be involved.

Transcript

Welcome to the New Books Network. Hello everyone, welcome back to the New Books Network. This is your host, Shu Wan. Today, I feel very happy to invite Dr. Jo Guldi to join us to talk about her newest and very insightful book, The Dangerous Art of Text Mining.

So for the first question, I want to invite Dr. Guldi to introduce herself to us. Hi, this is Jo Guldi. I am professor of quantitative theory and methods at Emory University. I'm also a former professor of history, although right now I've moved entirely into the work of data science.

Thanks so much for your self-introduction. For the next question, I'm wondering why, you know, I know you're a historian, why are you interested in studying digital and computational humanities, critically? So, I think, thanks Shu. It's great to be talking to you. And thanks for that question. You know, I did a PhD at the University of California, Berkeley in the early 2000s, at a time when most historians were not working on computational things.

And I didn't have a very specific computational research interest when I started graduate school. I wanted to look at patterns of historical change over time. I was interested in the cultural turn and social history. I was interested in questions about landscape, the phenomenological experience of space and cities, and how strangers interacted. But I was also in graduate school

in Silicon Valley, geographically. Berkeley is in the middle of a lot of tech happenings, and they started to feel very big, and everyone else who I knew in college who had moved to San Francisco at the time that I moved to Berkeley had moved there because of tech jobs. And so I found myself having a lot of conversations over the six years of graduate school

with people in the tech sector who would ask me questions like, what's the biggest change that any invention in IT has meant for historians? Normally, my reply, at least at first, was... Nothing. You know, there aren't big changes. We all use Microsoft Word. But when Google Books launched its service in 2006, it launched without a lot of fanfare. And my IT friends had persuaded me that maybe it would be a good idea to have a blog. I mostly blogged about urban history and experiences of space.

On this blog, I had started to write about Google Books because there was a day in which I had been Googling obscure characters from my dissertation and nothing was there on the internet, like maybe some genealogy sites. And then there was a day when suddenly I Googled these obscure figures and there were 300 hits on the internet. So that was what happened when Google started to digitize the books of the Harvard, Yale, and New York public libraries.

Suddenly there was all of this historical information that was available via basic keyword search. It wasn't high tech. You didn't need to code.

But I said to myself and I said on my blog, I think this is a big change. I think this could really matter for how we do history. Because just being able to keyword search something like the name of this obscure figure or... some of the characteristics of 19th century urban space means that we have access to large-scale changes in a way that we've never been able to manipulate sources in this way before. So I blogged about that, and to my great shock,

A lot more people were interested in my blog than were interested in my dissertation. And in fact, scholars were more interested in my blog. I got nice feedback on my dissertation. People liked it. People still assign the first book that came out of it. But all of that attention was dwarfed by the attention to this very simple observation that I had written about in a very simple way, which was just, we've made all of this text keyword searchable.

And so I started to think about this as a conversation that the culture at large really wanted to have with people in the university. At Chicago, where I held a postdoc in digital history, and then at the Harvard Society of Fellows, I had a series of opportunities where I just had the time and I had the money, and I was in a position to reach out and do new research

collaborations. And I was in a place, at Chicago, where there were already a number of scholars who had been working for 20 years with computational tools trying to understand changes in concept history. That's Robert Morrissey and his group, who were already publishing in the 1990s about what computers could mean for interventions in intellectual history. So suddenly I had access to a lot more robust tools. Eventually I learned to code.

And by way of signaling to my fellow historians that this was a direction where I was going to go, that it wasn't a traditional direction for a social historian of space, I wrote a memo to the historical community, and that memo was called The History Manifesto. It was co-authored with my friend, the Harvard historian David Armitage.

It argued what I had been saying all along, which was that computational tools were opening up longer and longer time spans, that this was not the direction in which the profession of history had headed so far, and that it required us to think about what the purposes of the longue durée and microhistory were, the purposes to which we could put these tools, and how we would be consumers. So I became a digital historian because it was something asked of me

by the community. And I've continued to work on those tools, even as I worked on my traditional British history, even as I worked on questions about landscape and property usage. And so, voilà, the book that we're talking about today: The Dangerous Art of Text Mining is the product of 15 years of collaborating with information technology specialists, slowly at first

and eventually in a more intense way. Okay, thanks so much for your answer. Well, the first thing I want to say is that I am a big fan of the famous book you co-authored, The History Manifesto, which is one of the best books of critical thinking about history published in the last decade or two.

Then now let's turn to your book. So my first question is about one argument in your book: that there have been lots of high-stakes disasters in data science, demonstrated by the retraction of publications in this field, especially among data-driven studies marred by error. So I start off the book by saying this is a book that's written with two very different audiences in mind. It's written for data scientists

who may be unaware or unappreciative of the perspectives that humanists and social scientists could bring to their work. But I believe that putting these two perspectives together can create a smarter data science. Equally, the book is written for my fellow historians, and maybe other humanists and social scientists more generally, who may be curious about what advances in text mining mean for their disciplines, but I expect them to be skeptical. I expect them to hold the bias that we're trapped

in a kind of meta-narrative of the West just by virtue of having access to resources like Google Books, which are a perspective of literate elites in the West. That's true, but those questions have been considered. They may also have concerns about the black-box nature of algorithms.

And that was true at first before there was a dialogue between history and data science, but it's much less true now. Now we have... skills of critical thinking that open up those perspectives in a really important new way, and that we are ready for collaboration. So those two points, it can be a smarter data science if we collaborate across disciplinary boundaries.

And we are ready to have a conversation about bias and critical thinking with algorithms that produces surprising new results from textual databases of a kind that historians could not find in any other way. So in the first chapter, I go looking for examples of what it looks like when data scientists do the work of long-term history on their own, without collaborations from the humanities. And my simple answer is that it can often be a disaster.

Not a disaster from the point of view of IT, where the purpose is to discover and make new tools that can facilitate the creation of new knowledge. But from the point of view of accurate, critical thinking about the past, there have been several high-stakes, high-profile disasters. So I give some examples. One of them was a retracted paper about incest in the West

versus the rest of the world, which purported to give the conclusion that Western societies, WEIRD societies, Western, individualistic, educated, et cetera, had been so much more successful than the rest of the world because, by avoiding the problem of incest and intermarriage, they created these new patterns of collaboration. So that's a really odd suggestion from the perspective of history, because we can all name

famous dynasties in the West, like the Habsburgs, which were notorious for incest. And also there were prohibitions against incest in other parts of the world. But what's really curious about it as a data-driven paper is that there were already databases of incest in Europe in the Middle Ages that had been prepared by historians and archivists over hundreds of years. Those databases were not touched for the study, and instead much more limited

global databases were used. So garbage in, garbage out. It's an example of not asking the archivists, not talking to the medievalists in the course of designing a project, resulting in an incredibly biased portrait of what might be driving progress in the world. So I give a couple of these examples. Very often there's a minor error in thinking that leads to a bigger error in analysis. So one of the examples is an animated history of the world that circulated on YouTube.

It was attached to a very interesting article authored by Maximilian Schich, who is a digital art historian, very accomplished. His paper I have no qualms with. He's done magnificent work on enormous databases from the world of art history: of artists, collectors, art historians, and their travels, their birth dates, their death dates, their place of birth, their place of death. He worked with a team of data scientists to animate and visualize these travels.

And the results were very interesting from the point of view of how that community of art historians has changed over time. But some member of his team also prepared a video which was circulated in scientific publications and went viral and was featured on YouTube. And this visualization was labeled. a history of the world a history of the world so it pretends ports to show a history of the world in which

You start at Rome with large scale travels and then the world contracts during the Middle Ages and white people are born and die. And then Europeans start to travel over the rest of the world and they are the first human beings to arrive in Japan. And they are the first human beings to arrive in Australia and North America. And there is no transatlantic slave trade.

So obviously this is a biased history of the world if it is a video history of the world. It's a history of the world in which only... Only white European settlers matter. Only their stories are visualized. And it's problematic because it's a narrative about conquest and settlement and progress defined through conquest and settlement. So the perspective is wrong. And the problem is really just the labeling. If this video had been labeled...

a history of the world through white conquest, a history of white conquest, what can we learn about that? And if the narration had taken that into account, we could have used it as a critical document and used it to think about certain perspectives, or the perspective of certain kinds of data on world history. These are examples of important work that's been done with data science,

but discoveries that have been tempered because the data scientists in question had not adequately thought through the context of the kinds of debates in which their data was attempting to make an intervention or where the correct data was. In fact, they were problematic because there wasn't enough interdisciplinarity. They weren't what I call a hybrid team. And I use the word hybrid rather than interdisciplinary on purpose.

Because I think of, you know, you can be an interdisciplinary team if I'm a historian and I have one conversation with an art historian, or I read one book on art history, and then I write an article where I import those methods. Similarly... Computer scientists can claim to be interdisciplinary if they have one conversation with a historian that sets them on a problem, but they don't continue to work together. But that's actually where most of the trouble comes.

A hybrid team, and I give some examples of really actively hybrid teams that exist in reality. Hybrid teams happen when you have historians that are embedded in a data science team, or archivists or literary scholars who are working all the time with software engineers, and they're continuing to ask questions and look at the data and interpret the work together and then iterate and come up with new solutions. So hybrid knowledge. So hybrid teams, and I give several examples in the

book of what hybrid teams look like. Hybrid teams happen when you have a team like, say, Ruth Ahnert and Sebastian Ahnert and their collaborators. Ruth is trained as a Tudor literary scholar. Sebastian Ahnert is a physicist by training. And they got together and started looking at the Tudor state papers. This is actually not an example that I use in the book, because the Ahnerts' book came out more recently.

But they're a perfect example of a hybrid team in that they've been talking together. They're married. They talk to each other, probably on a daily basis. But they've been talking about their research together over years, with their graduate students, with their collaborators, with their colleagues. And so it's not a matter of just taking one tool from Sebastian's wheelhouse and applying it to one question from Ruth's understanding of the Tudor state.

They've been going back and forth on which approach suited which question over years. And that's one of the reasons why Tudor Networks of Power is such a terrific book, why they've been able to really, really open up that space. It's a long-term collaboration. They're growing together. They're reimagining the problem of data together. So it's a hybrid team in that sense, like hybrid corn. It's a living thing. It grows in a certain direction. It develops in a certain direction.

That kind of hybrid work is, I think, something that's missing from most firms in Silicon Valley. It's missing from most firms' engagement in the world of data science, from most collaborations in which computer scientists take part. And I think all work in data is the more impoverished for not having those kinds of hybrid relationships where you could continue to press questions,

fed in part from the robust definitions of knowledge and validation in the humanities and social sciences, where we have asked what institutions are, what culture is, what the history of human civilization is. Over hundreds of years, we have developed very robust ways of asking these questions. It is not a one-go solution. So hybrid teams are part of the answer that I provide.

Okay, thanks so much for your answer. I appreciate your suggestion that collaboration between data scientists and historians is necessary in doing data-driven research, in history especially. So for the next question, I'm wondering about a theory put forward in your book. It's a theory of a research process capable of working between qualitative questions and quantitative tools. So could you briefly introduce that theory?

Yes. So in chapter four, I forward the term critical search. So critical search is a theory about how to bring critical thinking into the research process. And it's a theory that's designed to speak to the concerns of historians who may just be using the keyword search bar on Google Books or in JSTOR or in HathiTrust to do their research, or they may be deeply invested in digital history research or digital humanities research with existing tools.

And critical search is a way of modeling the kinds of critical thinking and use of multiple sources to drive new questions that are part of the research process in the world of history. So in the world of history, when we teach our students how to get started on a research project, we talk about primary sources, which are the actual documents in the archive. We talk about secondary sources, which are the published monographs in the library written by scholars, typically published

from university presses with lots of footnotes telling you what sources they looked at. And then we have books of theory, which are also university press published, often driven more from general humanities or social science perspectives, that may forward a new way of looking at a problem about gender, race or class, or about landscape, or about sensory experience or disability.

These works of theory, the primary sources, and the secondary sources work together in any project, whether it's a term paper or a dissertation or a major monograph. The scholar goes into the archive, they find some documents that nobody has ever looked at before, and they think, oh, what do these documents really mean? And they start to tell themselves a story. But they don't stop there. They go back.

And they read some more secondary sources and they say, oh, you know, I just found this character. I wonder if anybody else knows about this character. Or it seems like this event is really important, this way of talking about famine, or this series of events. It's really important. I need to read up on the secondary sources. And so they return to the scholarly matrix.

So they pan out, they say, what have other scholars asked? And then they say, what do my sources tell me that's new? And then perhaps they turn to theory. And so they read even more widely. I'm a scholar of property, so I'm interested in what sociologists, anthropologists, and legal scholars have said about the nature of property in different kinds of cultural systems. So maybe I go back and I read some of them, and they

teach me to ask the question in a deeper way. What really is the nature of property? Can we rest with Locke? How have we been thinking about this in the academy, across all cultures?

So that's a kind of generic model of what historical research looks like. So when I wrote about critical search, I wanted to say, look, anything we do with digital history shouldn't take us away from this model; it should be reinvested in this model. So if I have, to take a very simple exercise, an n-gram, a Google Ngram, and I've counted the number of mentions of eviction

over the 19th and 20th centuries. And I have a squiggle that tells me that there are certain spikes, there are certain times when we talked about eviction more and more. And I'm thinking about that. That's not a history paper. That's not research. And it doesn't allow me to do critical thinking. It does allow me an interesting new window, another source, if you will. Another source. It points. Google Books also helps me to index what time I should be looking at. If the Google Ngram

spikes in 1881, then I've got a kind of index: okay, the discourse in 1881 is very interesting. I should go and read those sources because something happened. What was it that happened? And I should go back and I should read secondary sources about what it was that happened in 1881. And then I should go back to the theory. And then, with more robust questions, I can ask some more.
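As an illustration of that counting step, here is a minimal sketch in Python, not drawn from the book itself: it assumes a hypothetical folder of plain-text files named by year (1881.txt, and so on) and simply counts how often a term appears in each file, so that the spike years can serve as an index back into the primary sources.

import re
from collections import Counter
from pathlib import Path

def count_term_by_year(corpus_dir, term):
    # Count whole-word occurrences of `term` in each year's file.
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        year = int(path.stem)  # the file name doubles as the year, e.g. 1881.txt
        text = path.read_text(encoding="utf-8").lower()
        counts[year] = len(re.findall(rf"\b{re.escape(term)}\b", text))
    return counts

if __name__ == "__main__":
    counts = count_term_by_year("speeches", "eviction")
    # The spike years become an index pointing back into the primary sources.
    for year, n in counts.most_common(5):
        print(year, n)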

I can ask about tenant and landlord relationships, or other terms for property. And I can use Google Ngrams again, or I can use some other tools, some other algorithmic tools. And my preference in critical search is that we treat algorithms with the same kind of critical thinking that we treat these other sources with. I don't just read one secondary source and take that

as the speech of God, the definitive answer of what scholars think about this period or this event. And similarly, I shouldn't just take Google Ngrams. Maybe I should start with Google Ngrams. But then I use a topic model, or then I use a word embedding. I think about other technological approaches that can help me to open up whatever that question is, whether it's a question about a particular event or a question about change over time over the longue durée.

Every aspect of the research project, from which algorithm I choose to how I implement each individual algorithm, which questions I ask it, how I tweak the parameters, should be subject to this critical search. It should be iterative. I should see if the answers are any different if I zoom into time or pan out of time, if I set the number of topics at 30 or at 300. Does it help me to understand the data differently?
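To make that parameter-tweaking concrete, here is a minimal sketch of refitting the same topic model at different numbers of topics; scikit-learn's LDA is used as an assumption for illustration, since the interview does not prescribe any particular library, and the tiny document list stands in for the real corpus.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_words_per_topic(docs, n_topics, n_words=8):
    # Fit a bag-of-words LDA model and return the top words for each topic.
    vec = CountVectorizer(stop_words="english", max_features=5000)
    dtm = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(dtm)
    vocab = vec.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]]
            for topic in lda.components_]

# docs stands in for the corpus of speeches (one string per document).
docs = ["speech about eviction and landlords", "speech about famine relief"]
for k in (30, 300):  # rerun at different topic counts, compare, then return to the sources
    topics = top_words_per_topic(docs, n_topics=k)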

And then I should return to primary sources, I should return to secondary sources, I should engage with theory, and I should do this iteratively. So that, in short, is the model of critical search. And it forms the organization of the book as a whole. The book moves from...

in part one, these generic problems of the blind spots of data science, which is a new field, which doesn't adequately understand what it means to ask whether your sources are biased about history or to engage a historical problem, to the second part, where I look at a series of individual algorithms that can produce new knowledge in history, informed by different theories of what temporal experience consists in.

So in a sense, the book as a whole is a model of critical search. There's no one chapter where I take critical search all the way through, but each individual chapter unpacks algorithms and shows what it's like to ask about a theoretical problem like memory with the help of three different algorithms, or what it's like to take one algorithm and then tweak it so that it will show me different aspects of temporal experience.

So we can say more about that later in the interview, if you like. But it's very important to understand that critical thinking is opportune at every step in the research process. I think the biggest fear of most humanists, when confronted with digital research, is that there will be some moment when data scientists come to us with a big automatic button that says, do history now, and then the computer will feed out one answer

and expect us to take that as an unbiased answer. And I'm saying, no, that would be impossible, and that's actually not the right approach to data science. It's also sometimes what data scientists' instinct is, that I will crunch this data set and then I will have one answer, rather than...

There are a number of ways of describing any episode of the past or talking about what mattered in any episode of the past. And we can use data to engage that and to drive the questions deeper and deeper and have more and more sophisticated answers. But no one answer holds the total truth. So critical thinking in the research process is what I call critical search. And I think it's a very important...

I don't give a final answer on what it will be, because other scholars will deal with these issues in their own way. But my contribution, I hope, is to say it should be iterative and we should be critical. And we should unpack each aspect of this research process in a new way in this era of big data. Okay, thanks so much for your answer. I totally agree with you, especially with your argument about the usage of quantitative approaches and tools.

They are very important for us in answering some traditional qualitative questions. And from my perspective, I mean, I'm still a PhD student, but I have some colleagues and friends who are students in other social science programs, like political science. So when we review the difference between the training in the humanities and the social sciences, especially in the United States, we notice there's a big difference in terms of the methodology coursework.

For example, at least in my department, we don't have any training or any class regarding quantitative methodology. In a department of political science, by contrast, the methodology courses usually cover quantitative tools like social network analysis and, sometimes, data mining, this kind of stuff. So that's a big difference. But from my personal perspective, I think it's necessary.

Maybe in the future, in history programs in the United States, for example, college and department administrators may consider adding quantitative methodology coursework to the training for PhD programs, because it may become a requirement for history, maybe not in the near future, maybe sooner. As you mentioned, you have many case studies about the use of data-driven research, in history especially.

So one thing I'm strongly interested in is your discussion of the categories of temporal experience appropriate as subjects for data-driven modeling. So could you please briefly talk about these categories? Yes, thanks for that question. So temporal experience, there's been a kind of turn in history to engaging with questions of what an event is, what it means for many nations to experience global protests in 1968, for example.

And are there some places that have a faster or slower temporal experience of modernity, whether from newspapers or railways or the seeming acceleration of political events?

temporal experience has been something that we've been talking about for a long time. One of the great joys of my path through the profession of history... and I talk about this in the preface, has been that in the aftermath of the History Manifesto, the profession of history had a number of debates about what forms of data science... would rise to the high standards of excellence expected from the history discipline at large.

when could we really trust data-driven research? And the gloves came off. People were very fierce in their defense of primary source research, of subaltern perspectives, of the voices hidden in the archives, of the importance of individual agency, all of these questions. One of the debates that I found myself in, one of the most powerful ones, I mean, all of these debates were so useful to me. They were just a gift

from my colleagues across the profession. And I'm so grateful to have been in the middle of those debates. One of the most powerful interventions was when I was asked to join the intellectual historians at their annual global conference. So intellectual history is not one of the forms of history that I was trained in. It's not my home set of practices. But the intellectual historians and the philosophers of history have been particularly interested in this question of temporal experience.

So they were asking questions about cyclicality, about the event, about periodization. And I had the opportunity to just hear these papers and also engage with authors from intellectual history who had demands, let's just say demands for digital history, for what from their point of view would be a respectable form of data-driven history, something that they could trust, something that they would find merit in.

And many of their questions were about whether algorithms could in fact model temporal experience. Could they show us something about the experience of progress, or the concept of progress, in some societies and the experience or concept of cyclicality in other societies? Could they help us to identify events that we didn't know about or to understand when periods of time began and ended? So those temporal experiences started to shape

how I understood the debates of the digital humanities. And I started gathering algorithms that could be used to investigate temporal experience, such as events, periodization, memory, and causality. I don't think that's a complete set. I think there are many, many, many other forms of temporal experience that I left hanging on the tree, and I wish others well of them. But these are the four experiences that I went after.

And one thing that the lens of temporal experience helps us to really understand is the fantasy that data-driven analysis can give us a definitive answer on the past, rather than many different, many interesting long-term stories that intersect with each other in powerful ways. That fantasy that there's one definitive meta-analysis of the past stems from, I think it's reinforced by, the kinds of visuals that data scientists have used

when they're doing temporal research. So, for example, the Google Ngram is just one long squiggle over time. But what happens when we unpack temporal experience and we say, well, the squiggle for science may have gone up and up over time, but what about memory? What about the memory of science? How did the memory of science change? Who did we think the greatest scientist was 100 years ago or 70 years ago or 50 years ago? How has that memory changed?

What do we remember their accomplishments as being? That gives us a different avenue into the questions of the... So the same thing for event or for periodization. We realize that temporal experience is not unified, it's multiple, but there are multiple aspects of temporal experience that we can model separately. And then for any one of these temporal experiences, there's not just one algorithm that gets us there, but there are several, in fact.

So that's the work that I do in part two of the book. There are three parts of the book. Part two actually goes into the process of trying to map these questions about temporal experience from the philosophy of history onto algorithmic approaches. Okay, thanks so much for your answer. I really appreciate your points about the use of data-driven research

for the study of memory. And in your book you provide a fantastic case study of using text mining methods to study how the process of memory changed in 19th-century Britain. Could you please give us some more details about this case study? Yes. So on the question of memory, it was important... The problem of memory is something that historians have taken very seriously over decades, if not centuries.

So we tend to think, for example, of Hobsbawm talking about invented traditions, like the invented tradition of the kilt, which is not actually as ancient as other ancient Celtic traditions or modes of representation, but actually had to be invented in a moment of nostalgia. Or the coronation ceremony, which is a kind of performance that is invented by Queen Victoria to announce her desire to be seen as the Empress

of India, as a new kind of British monarch. And so elaborate rituals are invented at that point in time. We could think about graduation ceremonies at American universities as another instance of invented tradition. They look, they feel very ancient. Actually, they're quite modern. The regalia, all of the ceremony, is actually quite modern.

So we talked about invented traditions in terms of memory. We talked about Tudor funerary monuments as another way that other generations of people have created memory, have suggested that they be understood as representing a force in history. We could talk about the monuments to the dead soldiers of the First World War as another instance of creating memory in the past. So I wanted to see if text mining could get us any closer to these problems of memory.

And so the chapter takes on three different algorithmic approaches, three very different ways of counting change over time. As with all text mining, we're essentially counting words and phrases. It's very simple, but then we apply different transformations to get us to what we really want to know. So in the memory chapter, I start off with an utterly simplistic approach, which is counting dates. I credit this approach to the wonderful Michelle Moravec,

a distinguished digital historian who used this approach in her study of feminist zines in the 1970s to show that the dates and the pamphlets and the authors that feminist zine writers thought should be celebrated in the 1970s are different than the ones that we celebrate today. So she was looking at these questions of memory. So I took Michelle Moravec's question and asked, well, what years were mentioned in Parliament the most?

If we look at the hundred years of Parliament, so the debates of the House of Commons and House of Lords, every speech given in the UK Parliament from 1800 to 1900, what are the years that they're mentioning the most? I plot the results in what I call a double timeline. Time is on the x-axis. Time is also on the y-axis. On the x-axis is the year of the speech, the year of the person doing the mentioning; on the y-axis is the year

that's being mentioned. And so if we plot these times against each other, one of the things that we see, first of all, is that there's a strong upward diagonal. If you're giving a speech in the year 1848, it's highly likely that you're referencing legislation that happened in 1847 or 1846, or earlier in 1848; you might be referencing another speech that somebody gave, or you might reference a deadline that comes due in 1851.
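A minimal sketch of how such a double timeline could be built, assuming, hypothetically, that the speeches are available as (year, text) pairs; this is an illustration rather than the book's actual pipeline.

import re
import matplotlib.pyplot as plt

YEAR = re.compile(r"\b(1[5-9]\d{2})\b")  # any year from 1500 to 1999

def year_pairs(speeches):
    # Yield one (speech_year, mentioned_year) pair for every year named in a speech.
    for speech_year, text in speeches:
        for mentioned in YEAR.findall(text):
            yield speech_year, int(mentioned)

speeches = [(1838, "The statute of 1571 governed church vestments ..."),
            (1848, "The act of 1847 set a deadline for 1851 ...")]
xs, ys = zip(*year_pairs(speeches))
plt.scatter(xs, ys, s=4)
plt.xlabel("Year of speech")
plt.ylabel("Year mentioned")
plt.show()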

This is pretty consistent through the whole of the 19th century. So there's this strong upward diagonal. That's validating, but not all that interesting. What's really interesting is that in this double timeline, we start to get strange, unexpected vertical lines when all of a sudden... For one year, speakers in Parliament became very, very interested in the Tudor and Stuart past.

So if we look at that timeline, one of the examples happens in 1838. There's suddenly a vertical line of dots of references to... years in the 1600s, 1500s. And so you ask, what's going on? Well, you can use this visualization as an index and go back to primary sources. So that's what I do. And if you start reading the speeches that are being referenced, you realize that what's happening is that... Members of the Conservative Party have started referencing all sorts of random things.

that happened in the Tudor and Stuart period. They're talking about debates over the earldom of Mar. They're talking about Tudor church vestments, so the garments that priests are supposed to wear. And you think, why were they doing that in 1838? And then, of course, the answer becomes clear.

1838 is the date of the Tamworth Manifesto. This is when Benjamin Disraeli and his cronies reconstitute the new conservative party out of what had been the Tory party. And they are looking to... to represent to the nation by one means or another that they are people who are really attached.

to England's Tudor and Stuart past, that they have a deep understanding of all sorts of events that happened in England's distant past. So there's a kind of performance going on of references to memory, references to memory made by conservative speakers in Parliament. So there are other vertical lines. Those are very interesting. There are also some horizontal lines, which are moments in the past that are regularly referenced.

Those become actually much more intense if we take a different algorithmic approach. So for the second example in the memory chapter, I use what's called parts-of-speech analysis, or named entity recognition. Named entity recognition is a kind of algorithm that goes looking at the grammatical structure of every sentence. The algorithm has been instructed to look for phrases that are used as if they are the names of places or the names of people, the names of corporations or the names of events.
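As an illustration of that kind of tagging, here is a minimal sketch using spaCy's pretrained English model; the choice of library is an assumption, since the interview does not specify which implementation was used.

import spacy
from collections import Counter

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def count_event_mentions(texts):
    # Tally every span the model tags as an EVENT entity (named revolutions, wars, and so on).
    counts = Counter()
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            if ent.label_ == "EVENT":
                counts[ent.text.lower()] += 1
    return counts

print(count_event_mentions(["God forbid we should repeat the French Revolution here."]))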

And there are many others. So I asked one of these named entity recognition algorithms to go looking for the names of events mentioned in Parliament. And then I threw up another enormous timeline of 100 years of references to the past. We're now not looking for 1789. We're looking for the French Revolution by name. And it turns out that just as many scholars have observed, the French Revolution is constantly referenced in every speech.

for the whole of the 19th century. And I shouldn't say every speech, but virtually not a week goes by without somebody saying, God forbid, we should have a French revolution here in Great Britain. So that's absolutely consistent. They also talk about the Magna Carta nonstop. At certain times, they talk about the game acts, and this is something that we would expect as readers of Emma Griffin's social history, at times of more social conflict.

The game acts come up because people are saying in Parliament, should it really be the law of the land that poachers should be punished for trapping animals on public lands or crown lands? Should the punishment really be so severe? But we also see other patterns that we didn't anticipate. For example, it turns out that Parliament has very little time for the American Constitution until 1832, when the British middle class gets the vote.

And then through the 19th century, as the working class gets more and more voting rights, People in parliament are more and more willing to talk about the American constitution as a historical event that's worthy of some memory that may have something to tell us.

about what modern nations look like. And so the American Constitution is debated more and more. You also see lapses of memory, like the Indian famines are talked about a lot when they happen, tens of millions dead, right? But there's no memory. Months or weeks later, the Orissa famine, for example, almost instantly disappears, whereas the Irish famine is talked about continuously for decades after it happened.

So what's the difference there? Well, the Indians have no representation in Parliament, and the Irish do. There's an explanation, but we did not know these patterns of memory. We couldn't find them unless we went looking for them, not just with data in general, but with data that was actually specifically tailored to look at problems of memory. I'll mention the third: there are variations of those exercises that I use to look into specific aspects

of memory in the context of Britain. So I look at social history, I look at famines and riots in particular, and I have findings about that. But I move on to a third algorithm, which is based on parts-of-speech analysis. It actually uses the grammar of the sentence to collect patterns of nouns and verbs that tell us what the average statement was like. What are the

phrases that are spoken the most frequently in Parliament? Who does what to whom? So this is a tactic that's been used very skillfully by some digital folklorists like Tim Tangherlini. I use it to go looking for sentences that reference the past. What are the sentences that reference the past that are most frequently spoken in the 19th century, decade by decade? And the answer is:

Increasingly, over the course of the 19th century, people in Parliament talk about how in the past our ancestors had certain rights. And of course, that's usually an argument that a certain right should be restored. So if you go looking into this, they're talking about...

In the past, our ancestors had rights to the commons, to use common land, and this becomes the instigation for new forms of legislation that create public parks in Great Britain or create movements for public housing. And they talk about how in the past our ancestors had certain rights and protections, and this becomes the instigation for the working class vote, eventually for votes for women. So they're restoring new rights on the basis of talking more about the past.
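A rough sketch of the "who does what to whom" idea, pulling subject-verb-object patterns out of sentences with a dependency parse; spaCy is again a stand-in here, and this only approximates the approach described in the chapter.

import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(text):
    # Collect rough (subject, verb, object) patterns from each sentence.
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
                if subjects and objects:
                    triples.append((subjects[0], token.lemma_, objects[0]))
    return triples

print(svo_triples("In the past our ancestors had rights to the commons."))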

So we get a very clear representation in the data of how the work of memory, using memory, talking about history on the floor of Parliament, helped people to think about political rights and to advocate for the possibility, the reality, the realism of political rights

in a new way. And again, I'm not the first historian to think that there's a relationship between thinking about the past and thinking about the present, but we understand it in a new way when we look at the data, when we can actually measure how little the past, or a past of rights, was referenced at the beginning of the century and how frequently it's referenced by the end of the century.

And, of course, you could use the data in more ways, to go into the individuals who are looking at this discussion of rights. But this is just for the purpose of my methodological book. This is an exercise to show that we can use the historical concept of memory to inspire better data science, more sophisticated questions that get into more intricate relationships between change and time. And they represent us back to ourselves in new ways,

to reach a better understanding of memory, and a better understanding of how memory affects political debate, affects people's understanding of the past, even in the past, the understanding of the past in the past. Well, I know after reading your book, I noticed that there are so many great case studies, but because of time we are not

going to go through each of them. Even so, I want to recommend that my listeners read those case studies carefully. They are very insightful, very inspiring. Especially for historians, for my peers: you can learn a lot about how to use data-driven research in your own work. So, but now...

Yeah, I was just at the University of Chicago, meeting with graduate students, as I often do when I'm invited. I taught a master's course, and every single graduate student told me that they had been told by the chair that these text mining methods were going to change everything in history in the next handful of years, and they were so eager to learn.

So I think there is an understanding that these methods need to be taught. And it's not that every historian has to become a data scientist, not at all, but we need at least enough training to use them and engage them. Thanks so much for your supplemental explanation. And as I mentioned earlier, in political science departments the methodology courses are always about quantitative methodology. But at this moment, at least,

most history departments don't have such quantitative methodology in their coursework for PhD students or for master's students, but it will become necessary, especially after the publication of this book. But again, because of the time limit, I have to jump to the conclusion, the last chapter of this book, and invite you to talk about what the future of history on data-intensive grounds would be like, what the future of historical research would look like.

Yes. So the third part of the book turns loosely to the future. And of course, I'm a historian. I don't have a crystal ball. I don't actually know what's going to happen. In general, text mining methods and digital history are 10 years ahead of where we are

in North America in the universities of Europe, where these methods were embraced a decade ago, and there's now really sophisticated research coming out; practically every university campus has at least one practicing digital methodologist who's using these tools and meeting with other digital historians. So I don't think it's putting too much pressure on

prediction to say the same thing will probably happen in the United States. And I should say the United States rather than North America, because the turn towards digital methods is already well expressed in Canada and the Canadian university system. Most departments already have a fairly robust digital history portfolio. So I would expect to see, as you say, a methodology course

in most US departments over the next several years. And that's probably going to require hires, probably hires at a fairly senior level, because much of the work that facilitates this kind of intervention depends on being able to mobilize data, to work across interdisciplinary collaborations, to think, to work, and to talk with archives about which data is getting digitized.

And then to speak to provosts and deans about the kinds of computational resources that classrooms need in order to effectively teach these methods. I've taught R and Python for historical analysis to both historians and computer science students and data science students in classrooms, upwards of 30 seats.

My colleagues in quantitative theory and methods are political scientists who teach 300 seats with these methods in a single semester. Doing that requires a certain kind of historian who can have a conversation with deans, provosts, and officers of the university about high-performance computing, about how students can effectively load these tools on their laptops, and over the course of a semester really learn to do robust work with the tools.

That's not work that can be done by an adjunct. It's not work that can be done by hiring an assistant professor in a field, for example, as an assistant professor of Britain or of China studies, who's then expected to create a monograph in British studies or China studies. That hire may do important work, and they may contribute new methods, but they cannot be expected to create institutional alignments of the kind that the field really needs

and that our classrooms need. So I see a real need for senior hires in digital humanities, and in digital history in particular, to facilitate this new work. I also return to the theme of hybrid labs, I think both inside the university and outside the university, to create new forms of research, to do new things with our data, whether that data is

from Victorian Britain or from the contemporary sphere. And where I see these tools as driving us: I did most of my work on Victorian Parliament, in fact, because I didn't want the computer to surprise us too much. We have so much robust historiography of Victorian Britain. It's fairly easy to try out algorithms and say, does this algorithm tell me something that looks mostly familiar, 80 percent familiar, and maybe 20 percent something that I didn't know?

And one really has to work with the algorithms to produce that kind of method that's capable of both validating, yes, the computer can see what other historians have seen, and... And we can find something new and surprising. But I think after 10 years for me of doing this kind of work, the work that went into this methodological book, we now have robust pipelines, robust methods.

We've validated them on Congress and on Parliament. So yes, the computer tells you things that make sense if you follow these methods. I give lots of examples of how you get the computer to tell you nonsense, or things that aren't interesting or surprising. But we do have methods now that are capable of producing reliable history. But the advantage, the comparative advantage, of using those methods is not going to be, in the future,

on working on Victorian Britain. I mean, God bless all digital historians working on Victorian Britain. Please continue to do so. But I think where the methods are most crucial...

is on the new archives, post-1970, that we don't have access to in any other way. And Matthew Connelly has been writing about this in such a compelling way. He says the National Archives in the United States essentially gave up after the invention of email, with the result that historians of foreign relations who were working on the 1990s have one third as much material to work with in the archives as historians of the 1970s. And for more recent timescales, there is even less material.

On the other hand, if you go looking for World Bank reports, World Bank documents from the last 20 years, United Nations documents, even documents related to many grassroots movements or state or local government for the last 10 to 20 years, you can download today a fairly complete archive just by going to the correct repository, by spidering the websites. So then imagine that you have

a million documents, too many to read reliably in an efficient way. If you are able to use the methods outlined in The Dangerous Art of Text Mining, to apply topic modeling and word embeddings, to look at memory, events, periodization, you can very rapidly move to a chronology of an institution that we didn't really understand in the abstract. So you can ask, what are the major turning points that change everything? What are the major themes?

How is one year different than another year? Who was the most influential actor in this local unit of government? You can ask those questions and get answers in a more efficient way than historians have ever been able to do in the past. And so I think that's particularly compelling for issues like

climate change, which is where my own research is going next. Can we give ourselves a chronology of climate change and, continent by continent, figure out a chronology of the disasters that have caused so much disruption? Can we figure out a chronology of the technologies that have been undertaken, or the policies? Can we put together a chronology of the unsung heroes, the Greta Thunbergs? Who are the Greta Thunbergs of Africa or New Zealand,

of the rest of Asia? And so if we can assemble those stories, not just one at a time, but in aggregate, then it becomes possible to put together, on demand, a 10-year, 20-year, or 40-year portrait of a global problem, like the planetary crisis or crises of underdevelopment, all of the aspects of the ongoing polycrisis. So for me, that's one possible future. Doing so requires certain kinds of institutional investments, and I don't want to minimize them.

They require history departments to be forward-looking. They require universities to be engaged in a certain way. So I speak to all of those concerns in the closing chapters. Thanks so much for your answer. Again, I totally agree with you. And by the way, here's my own example. I especially appreciate your perspective on the future of history and the historical profession, and how the next generation of historians will work. They will be working much more based on data.

They must treat texts as data. Here's an example from my classroom. Well, I'm a PhD student, you know, but I taught a class called China and the World, a Chinese foreign relations class, last summer. And in my class, I'm crazy, I would say, because in this class I teach my students to use Python, basically to use code.

The reason I want my students to use Python is to use the popular algorithm called word2vec in Python to analyze the foreign relations of the United States, especially documents regarding how the American government treated China during the Chinese Civil War.
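A minimal sketch of the word2vec exercise the host describes, using gensim as an assumed implementation and a hypothetical folder of plain-text FRUS documents; the course may well have used a different setup.

from pathlib import Path
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Tokenize each non-empty line of each document into a list of lowercase words.
sentences = [simple_preprocess(line)
             for path in Path("frus_docs").glob("*.txt")
             for line in path.read_text(encoding="utf-8").splitlines()
             if line.strip()]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Which terms does the diplomatic corpus place closest to "china"?
print(model.wv.most_similar("china", topn=10))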

So most of my students, and most of my colleagues, say I'm crazy. I mean, why? It's not a computer science class, so why teach Python, the basic use of Python, in your class? Because I told my students: okay, maybe some of you will become the next generation of historians, so you must know how to deal with a huge, a titanic quantity of data, of texts. For example, FRUS, the

Foreign Relations of the United States: the big document collections are now all available on GitHub. So I tell students, look, there are like 100,000 pages of documents. You cannot read all of them page by page, line by line. You must use text mining and use algorithms to read them. So in a sense, I appreciate you. Thank you. Yes, your students are so lucky to have you.

That's absolutely right. So there is a crisis looming in history, which is the crisis of too much information to read. And we all have approaches for navigating it, whether it's through sampling in the archives or through using data-driven methods. But the data-driven methods are particularly important when there isn't a foregoing historiography, when we just don't know, other than through received journalism

or memory, public memory, what the major turning points are. But we can fold the data into a process of critical thinking. We can get multiple accounts. So your students are so lucky to have you teaching them these methods and looking at new archives, looking at new raw material. So congratulations on your class, on your work, on teaching with Python. Formidable.

Thank you so much for the encouragement. So, at the end of our talk today, I want to talk to my audience directly. Thanks so much for listening to Dr. Guldi talk about her fantastic book, The Dangerous Art of Text Mining. I want to repeat the title. Please take notes if you want to take notes: The Dangerous Art of Text Mining. It's a fantastic and timely discussion of both the use of text mining in history and the dangers,

the danger of the misuse of text mining in historical research. So whether you are a data scientist who wants to know how to assess data from the perspective of history, or a historian who wants to know how to do data-driven research, or even just a general audience member who wants to know what the future and the next generation of historical research will look like,

I personally highly recommend you buy a copy of this fantastic book, and you can read those fantastic discussions, especially the fantastic case studies. After reading the book, I found some of the cases so familiar; for example, our American audience will learn American history from a very different perspective, one made possible by data-driven research.

So after reading the book, you may think about history, about historical research, entirely differently. So at the end of our episode today, again, I want to repeat the title for the third time: The Dangerous Art of Text Mining, Dr. Jo Guldi's book. Please consider buying a copy of this book. It's so brilliant, so insightful. So thanks so much for listening to our episode today. Thank you. Thank you so much, Shu.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.