Well, what I'm saying is that there are known knowns, and that there are known unknowns, but there's also unknown unknowns, things we don't know that we don't know.
What?
Say what again. Say what again. I dare you.
[MUSIC PLAYING]
You are listening to WREK Atlanta, and this is Lost in the Stacks, the research library rock and roll radio show. I'm Charlie Bennett, in the studio with Cody, Marlee, Fred, and myself. Each week on Lost in the Stacks, we pick a theme, and then use it to create a mix of music and library talk. Whichever you are here for, we hope you dig it. Our show today is called "Whither Google Scholar." Whither-- why are you--
Whither.
Is that just a fancier way of saying--
It's "hw-ither."
--a WTF, man, about Google Scholar?
Well, Google Scholar is 20 years old this year, and so we thought it would be an ideal time to take a look at this tool used by a lot of academics.
Did you say 20 years?
With our academic use of whither.
"Hw-ither."
Yes. Yes. 20 years.
Hwuh, hwuh.
In fact, 20 years this past week. November 17, I think, was the anniversary date.
Happy birthday, Google Scholar. Of course, like so many things that come out of Silicon Valley, trying to look inside this platform is like trying to look inside of a box which is shut inside a crate which is itself locked inside a vault.
[LAUGHTER]
So there are things we know about Google Scholar and things that we don't know and things that we don't know that we don't know.
Gosh. It's like this recurring, Rumsfeldian nightmare.
Our songs today are about the mysterious and unknown, being in thrall to unseen powers, and all-encompassing entities. Our reliance on black box platforms grows yearly--
You sound like an eldritch horror writer.
This is pretty bleak. And the online tools that we use on a daily basis are built on algorithms that maximize usage and profit, not our well-being. It's enough to give you the blues.
Oh, I like what you did there.
So let's start with a song called "Algorithm and Blues" by KMES right here on Lost in the Stacks.
[KMES, "ALGORITHM AND BLUES"]
That was "Algorithm and Blues" by KMES. And our show today here on Lost in the Stacks is called "Whither Google--" sorry. "Hw-ither Google Scholar--"
OK, Fred, just-- come on, man. Tell-- explain to me what is going on here.
So I wanted to discuss Google Scholar, and actually, I wanted to have a guest, but it's really hard to get someone from Google Scholar to--
I seem to remember--
--to talk about Google Scholar.
Every time we've wanted to talk about Google, we've had to get someone who used to work there or just do without a guest at all. So we're doing without. But before we dive into how inscrutable it is, let's just talk about what it is, so-- for people who don't know what Google Scholar is.
Ask me. I'll get this started.
OK. Charlie, what is Google Scholar?
I really don't know. I think it's a segment of Google that's just been set aside. But I don't know. I thought I did, before I did the research for this, but now I don't know.
Yeah. Marlee?
Now, I think it is separate because when I type in a Google Search for an article title, it never shows me Google Scholar results. I actually have to go to the separate Google Scholar website to search Google Scholar.
Right, yeah. And Cody, we got you on mic today. When I say what is Google Scholar, what comes to mind?
I had always thought of it as, like, that's where I go read science.
[LAUGHTER]
Like, I go to Google Scholar. I see a science article that says they found perpetual motion. They link an article. And I'm like, let me see what this actually says. I'll use Google Scholar to try and find it.
That's such a better answer. I mean, that's about outcome and process and information need.
It's actually a really good use case for it. If you find in the popular media someone references, oh, the research shows da, da, da, you search for that, and you'll probably find it there because they've got everything.
So do you think, Marlee, that if you searched for an article title, and-- in regular Google Search, all those results would come up. Do you think that the results that you would then find in Google Scholar are not anywhere in that big list of Google results?
Oh, no, they can be.
Yeah.
They absolutely can, because they'll be in a repository, or they'll-- there might be some index of ResearchGate, for example, or--
Yeah, I think you called it a segment. That's what it strikes me as, like a segment.
No, I said segment. You said segment. OK.
And Marlee kind of corrected that a little bit. But the place that I'm coming from is that it does seem like it's a reduced search. It's a Google Search that is only particular to some sources, except when I was reading about how this all started, they were talking about how they were getting files and physically transporting them to a location.
That's how it started in 2004.
Yeah, which makes it seem like it started as an idea for a repository itself--
It sounds like SciHub.
--of, I believe, stolen-- yeah, stolen research, I think, is what-- I mean, I don't even want to get into the ethics or morality of copying research articles and putting them somewhere else. But I think a lot of people think of that as stolen.
But how it is now, how it can be defined now is, like, I guess sort of the same way that Google News is a thing. You can search Google for news, but you can go to the little application for Google News, and it tells-- and it will give you results that Google tells you is news. And in the same way Google Scholar will give you results in a way that Google Scholar tells you are scholarly results.
I think to be very pedantic and specific about it--
Yeah.
--results that are pulled from what Google has decided are news sources or results that are pulled from what Google decides are scholarly sources--
That's exactly--
Exactly.
--they don't make the distinction on the results that arrive. They're saying, we're only searching what we think of as news. We're only searching what we think of as scholarship.
Right. And that's how it was developed 20 years ago at Google. Alex Verstak, if I'm pronouncing that correctly, and Anurag Acharya-- and I'm probably not pronouncing that correctly, either. But those are the two employees of Google at the time that came up with the idea for Google Scholar. And so how it exists now is a scholarly search engine that a lot of folks use, both laypeople and academics.
A lot of the faculty that we work with here at Georgia Tech are really interested in using Google Scholar and how their scholarship appears on Google Scholar. But it was this idea, 20 years ago, where these two guys, Alex and Anurag, decided that they wanted Google searches that only brought back scholarly results.
Yeah, the line here is-- the idea wasn't to produce Google Scholar. It was to improve our ranking-- "our" being Google's-- ranking of scholarly documents in web search, which is so naive sounding in this moment in time, but probably made perfect sense-- do you all remember--
There wasn't a whole lot on the web back then. Yeah.
Remember when a Google search came back, and you were like, just this?
[LAUGHTER]
So it is-- and I mentioned this, that it's not just the scholarly work that faculty all over rely on for their research and are using Google Scholar to find. It is that ranking of scholarly research, like how many times it was cited. And Google is providing their own metrics for counting those kinds of citations, which is really important to scholars.
So now, it's-- the product as it is now, whatever it was envisioned as 20 years ago, now it's where a lot of people go first to find scientific information. And it's a lot of where faculty go to point to. Hey, look how great my papers are because they've been cited so many times.
I'm laughing because Fred's eyes got a little bit wider and a smile started to form as he started to list off those things, because there's that kind of ripe-for-corruption feel whenever an entity says, hey, we're going to start telling people how good this is. We're going to start telling people how important this particular research is.
A lot of products that-- we all work at the Georgia Tech Library, and so a lot of the products--
Whoa, whoa, whoa, be nice.
Oh. Yeah, Cody.
Just a very big fan of the Georgia Tech Library.
Yeah.
Yeah. We're all workers of or fans of the Georgia Tech Library. And we have databases that work in a similar way to Google Scholar in that we want our faculty to search them.
But the difference between these databases and names you might have heard of, JSTOR, EBSCO, ProQuest-- if you're a faculty researcher out there listening to this, you probably heard those names-- and Google Scholar is not necessarily one of those, but we try to get our users to use those and not use Google Scholar exclusively. But the difference is that even though those are also corporate products, we can put our finger on a list of what's contained in them. Like, in Google, we cannot.
Google is crawling.
Yeah.
We know not what.
I can vouch for some of the junk that Google Scholar is crawling. Yeah. Now--
When you say some of, meaning--
Well, I mean--
--some of it you can't?
Because, like, I-- well, it's a long story of how it came to be. But I've found-- well, I mean, there are some things in there that I think would not be considered traditional scholarship. That's one. So, like, I have a blog post. It's been cited over eight times, I think, by now. And that's in my Google Scholar profile. And I'm proud of it, but it's not scholarly.
This show is actually a perfect example. Not this episode, but this show, because if you search for Lost in the Stacks in Google Scholar, a bunch of stuff comes back because it's in our digital repository at Tech, which is one of the things that Google Scholar scrapes and searches. Doesn't have it in there. It doesn't say whether it's scholarly or not. But it's like, oh, it's in a digital repository at a research institute. Must be scholarly.
Yeah, yeah.
And the only reason that we know it scrapes our institutional repository is because we see those results come up in searches. There's nothing in Google's documentation that says, hey, we include the Georgia Tech repository.
We don't have red lights that go off every time Google crawls the repository?
I'm begging for that. That would be wonderful.
[LAUGHTER]
But would they always be going off, or would it just be every once in a while?
We can talk about that in the next segment because they're not going off very much at all.
OK. Well, I think if we're talking about red lights, maybe we should just end the segment here. This is Lost in the Stacks. We'll be back to talk about the mysteries of Google Scholar, and possibly its discontents, after a music set.
File this set under ZA 3075.A46.
[THE DANCE PARTY, "TODAY'S MYSTERY IS..."]
[NAT KING COLE, "YOU'LL NEVER KNOW"]
(SINGING) No, no.
That was "You'll Never Know" by Nat King Cole. Before that, "Today's Mystery Is..." by The Dance Party. Those were songs about being helplessly drawn in by the mysterious and unknown.
[MUSIC PLAYING]
This is Lost in the Stacks, and today's show is all about Google Scholar. Why do we care that it's a black box?
[LAUGHS]
Why do we want to know more?
And can a show be about something that we know nothing about?
[LAUGHTER]
Well, you made a comment in the first segment that you saw my eyes get really wide. Like, there was something that triggered this episode. And there was an actual occurrence, I guess, in my daily working life here at the Georgia Tech Library that inspired this. And I work very closely with the Georgia Tech repository. The repository runs on a platform that's called DSpace, which isn't important, except that we recently upgraded from one version of DSpace to another.
And in the course of that, the repository URL changed. All of the content in the repository has an identical handle-- Uniform Resource Locator-- that didn't change, but the policy, the top-level domain, changed.
Yeah. And I just want to say what you said again in a different way.
Yeah.
There was an established way that these entries all had the same URL-- I know it's not a URL, but the same way to get to it on the web no matter what. But you have to decide to use that. And if you're a scraper you don't.
Which, Google Scholar is a scraper. And they totally stopped scraping, and they lost a whole lot of connections to our previous content in the repository. So we had over 70,000 items, which Google Scholar had deemed as scholarly, because it's an institutional repository, I guess. We didn't make that distinction for them. They found it on their own. But now, since we upgraded in summer of 2023, we've lost about 85% of that.
And even though our technical folks go to the Google website, the Google Scholar website, look at the documentation, follow all the instructions to say, hey, how does my site get crawled by Google Scholar? We did that. We did that. We did that. All checked. It has been crawling the new repository very, very slowly. And even a year out, we still have less than 15% coverage of the repository. And that sparked this idea.
And I thought, I've got to find somebody at Google Scholar that can help us, and I couldn't. And I thought, is there anybody that can talk about this on the radio? And there's not. I managed to find one former Google employee that still has some connections at Google, and she worked at Google Scholar. I won't use her name, but she worked at Google Scholar on the Google Scholar project and did outreach to universities, but has since left. And I found this person on LinkedIn.
And I said, can you please put me in touch with someone that works at Google Scholar? And she said, let me see if they are willing to speak on that. And I got back a message a day later. No, they are not. It's a black box.
Yeah.
And it's frustrating.
So I should speak to something you said earlier.
Yeah.
"Decided it was scholarly," was something you said. Google Scholar scrapes the repository and says, this is scholarship. And I know for a fact, because I put it there, that there is a half-hour discussion of statins and antidepressants, depression medication, by two people who are not qualified to have that conversation in any professional way because it's a podcast that I did. And that's in there. And so--
So the thing that you did as part of your Georgia Tech librarianship duties ended up in the repository, and thus ended up as a potential Google Scholar result.
And could be scraped. And its subject is medication, blood pressure-- these are joke keywords that we put in because we were just talking about how, here we are, and now we have to think about drugs in middle age as things that keep you from dying, not things that make you happy, right?
So there's now a semi-validated, nonsense resource about very important subjects because it happens to be in the repository because it's part of a long-running-- now-ended, long-running podcast done by Georgia Tech faculty.
That's a concern in Google Scholar. A lot of junk can end up in Google Scholar. That's not to say that junk can't end up in JSTOR or--
Yeah, I was going to say because the thing that I did not get to cite before we went to the music set was another example of junk is something that is scholarly, but it's still junk. So I have found in someone's Google Scholar profile a publication that when I clicked on it, it's like, oh, that is their name in the index to the conference proceedings. So it's a scholarly publication--
Like, they were an attendee. They didn't contribute.
Oh, they contributed a paper.
Oh, OK.
So the paper itself is indexed. It's probably indexed in something like JSTOR or EBSCO or ProQuest or Elsevier or whatever. But that page with their name in the index--
Oh, it's like an additional--
--to the proceedings--
--citation.
--is an additional citation that's made it onto their Google Scholar profile and is part of their metrics.
Yeah.
Not that it's being cited, but it's there, part of their metrics.
Oh, man. So they can say they've got two citations from this conference, even though it's really just one.
Yeah, yeah. And I bet they even-- they don't say that because they don't know it's there. But you could probably find that in anything that indexes that conference proceedings.
And this is the problem with Google Scholar being a black box, because if that happened at ProQuest, EBSCO, JSTOR, or whatever, all the usual databases where we know what's in it, we can talk to a human. For all the deficiencies of those companies, which is a different show--
[LAUGHTER]
--you can actually connect with a human to get errors corrected. When it comes to Google Scholar, that is impossible. It's there. And maybe you can delete it from your own profile, because you can control your own profile, but you can't delete it from Google Scholar entirely.
Because Google Scholar is not a repository. It is not a storage place. It is a constantly running process that you can dip into and say, hey, what do you think of this particular thing?
And it changes.
Yeah.
And we don't know what's added and what's subtracted.
So we only have a little time left in this part of the show. And we've just said, hey, here's all these things that are wrong. Can we sum this up? What are we trying to say about Google Scholar in this moment?
I think I'm trying to say that it's-- counterintuitively, it's a really valuable resource, I think.
Plot twist.
I know I use it. I know pretty much every academic, probably, at Georgia Tech, in whatever field, probably uses it in some capacity. But it's just remiss not to acknowledge these real, genuine problems that it has.
Those problems being we don't know what it does or how, and we're not even sure why anymore because there are no people who will explain it.
That's pretty much it.
That's it, yeah.
Get us out of here, man.
OK. You're listening to Lost in the Stacks. And we'll talk more about what we don't know about Google Scholar on the left side of the hour.
So we'll talk about everything.
[LAUGHS]
[MUSIC PLAYING]
[MUSIC PLAYING]
Hey, y'all. This is Dr. Lisa Hoopes from the Georgia Aquarium. I am the director of research and conservation, also in charge of feeding for all the animals. And you are listening to Lost in the Stacks, WREK Atlanta.
[MUSIC PLAYING]
Today's show is called "Whither Google--" ah. "Hw-ither Google Scholar."
[LAUGHS]
Today's show is called "What is Up with Google Scholar." One of the black box ambiguities of Google Scholar is how it indexes what it indexes. What does it consider scholarly, and why? Who at Google says that a work is scholarly? These are all reasonable questions that anyone thinking critically about what we're talking about-- I can't say Google Scholar anymore-- might ask, but it's important to realize that from the beginning, these questions were answered with ambiguity.
Here are a couple of quotes from Anurag Acharya, coinventor of Google Scholar. First from 2012. And Cody, I want you to be my quote there. I ambushed you.
You did ambush me with a script, and I still messed it up. OK. "We have built the largest scholarly search. At this point, it includes every source that I can reasonably think of. And some sources may be borderline scholarly, but that is the nature of trying to do everything."
And then there's another from 2014.
So, update. "Scholarly is what everybody else in the scholarly field considers scholarly. It sounds like a recursive definition, but it does settle down. We crawl the whole web, and for a new blog, for example, you see what the connections are to the rest of scholarship. If many people cite it or if it cites many people, it's probably, probably, scholarly."
My blood pressure just went up.
[LAUGHS]
So let's take note of these phrases. "Is probably scholarly," "a recursive definition," "maybe borderline," "every source I can reasonably think of." Google Scholar was built from ambiguity. We scholars are captive to an unseen, ambiguous force. So many unknown unknowns. It's like Cthulhu up in here, man. File this set under HD 9696.8.U64, if that's real.
[GARNET MIMMS, "AS LONG AS I HAVE YOU"]
(SINGING) Born in darkness But I fought my way up to the sun
[ALAN POUNDS GET RICH, "SEARCHING IN THE WILDERNESS"]
That was "Searching in the Wilderness" by Alan Pounds Get Rich. Before that, "Invisible Forces" by the Fresh and Onlys. And we started our set with "As Long As I Have You" by Garnet Mimms, songs about being in total thrall to an unseen power and looking for its source.
[MUSIC PLAYING]
This is Lost in the Stacks, and our show is called "Whither Google Scholar." The definition of "whither" is "to what place or to what state." So I ask again, is that just a fancy way of saying WTF? What's the future of Google Scholar?
Yeah, I think so.
Yeah.
It's time to talk about the future.
I love "Whither Google Scholar."
It's great whither. "Hw-ither."
The future-- so here's the funny thing about doing the show. We can only guess at the future, and we have imperfect knowledge of the present and anecdotal knowledge of the past. So "Whither Google Scholar" really is just us extrapolating from our experience of it.
Sure, but it sounds like fun, so let's do it.
Yeah, let's do it. So I'm guessing that everything's going to crash and burn very soon. Well, it depends on whether you think Google Scholar is already crashing and burning, I guess. If it seems like it's-- everything that we've talked about today, if that seems to you like it's a huge problem now, I don't have a lot of optimism that that's going to improve.
But if you're of the perspective of, yeah, I acknowledge those problems, but it's still useful for what I need it for--
[LAUGHS]
--then you're probably going to keep using it.
So the reason I say crash and burn is because when I look at a little bit of the lit survey that we did before this, most of these things, like criticisms of how Google Scholar is bad for search, evidence hacking, promoting disinformation, AI junk crowding Google Scholar-- most of the user experience is, yeah, I get a bunch of stuff when I put a search in there, and it's got some metrics that are really easy for me to parse because they're not "overloaded," quote-unquote, "overloaded" like Web of Science or ProQuest dissertations. But really, it's just a mess.
So I want to talk about some of those things that you just mentioned, like evidence hacking is something that is happening at Google Scholar. A couple of researchers at UNC Chapel Hill and Virginia wrote an article about this evidence hacking, which is-- it's basically-- it's white supremacist publications and organizations publishing things at-- because they're affiliated with a university. And it becomes publicly accessible in Google Scholar. And so people cite this thing as being found in Google Scholar to normalize their racist, white supremacist work. So that's one huge problem. And again, can we talk to a human at Google Scholar to say, what are you doing about this? No, we cannot. AI junk--
And even if we did, what do we really have to say? Hey, this person who is affiliated with a university and put their preprint in a university repository is bad. Don't let people cite it. There's no action there. There's no action item, and there's no-- it's just a moral question.
I would still prefer there to be a human to say like, ah, gee, I don't know what to do, rather than just talk to a--
[LAUGHTER]
--wall of electrons, which is currently what--
Well, that's just your melancholy humanism, man.
Yeah.
There's also AI junk that's crowding Google Scholar now.
This all just sounds like-- I just imagine someone at Google saying, this all will wash out. This all just comes out in the wash, right? I mean, yeah, there's some great stuff in here. Yeah, there's some crappy stuff in here. But in the grand scheme of things, it's overly-- it's positive.
And isn't the frame of that really, we're improving Search all the time, because that's-- like, we're improving the process, but the improvement of the process is also opaque.
And it could go away at any time. Google is notorious for just suddenly flipping the switch and turning things off, even things that have been going on for years and years. People thought that this would happen when it was taken off the Google toolbar. Like, there's a little dropdown menu where you can pick either, like, Google News or other things like that. And Google Scholar used to be there, and it was taken off of that.
So that was a few years ago, and people thought, oh, Google Scholar is not long for this world, but it's hung around, but it could go at any minute.
I mean, they did do some new development to try to introduce journal metrics, so that's fairly recent. But yeah. I mean, I doubt it's making them any money.
Yeah. And what you just described is adding a feature. It's not improving the results or crafting some kind of a monetizable service or anything like that.
Google has all the power in this, which is often the case in the applications that Google develops.
The wizard only has the power you grant him, unless you look behind the curtain.
So do we like Google Scholar?
I use it all the time.
I have to admit that I do, too.
Yeah, yeah. I mean, it does some things, like, super well.
[LAUGHS]
If you are-- I mean, yeah. I mean, it's just-- it's like one-stop shopping for one thing. And then the citation tracing. I mean, it's going to find citations that Web of Science is only going to find a portion of.
And that's my use of Google Scholar is unofficial. Just like when people say, oh, you can start with Wikipedia to get an overview of the topic. Just don't cite it. With Google Scholar, that's usually what I do. I'm like, what's happening? Hey, is anyone citing Lost in the Stacks?
And a very long story short, yes, people are citing Lost in the Stacks, but also, there's a bunch of citations that Google Scholar thinks are Lost in the Stacks that are not because names are similar, because titles are similar. It's a mess. It's very fun to use. It's a mess. And it's on fire.
It's a black box full of unknown unknowns.
Sinking in the ocean, below the wreckage.
This is Lost in the Stacks. And today, we've been talking about Google Scholar. And as usual, we've only scratched the surface.
We'll come back. I mean, there's AI in it, so Fred wants to talk about it over and over and over again.
[LAUGHTER]
I'm sure we'll talk about it again sometime.
File this set under BT 131.P53.
[CHRIS BELL, "I AM THE COSMOS"]
(SINGING) Every night I tell myself I am the cosmos I am the cosmos--
[THE CLIQUE, "SUPERMAN"]
"Superman" by The Clique, and before that, "I Am the Cosmos" by Chris Bell. Those are songs about entities that believe they see all, know all, and encompass everything.
[MUSIC PLAYING]
I swear, Fred, this whole show has a Lovecraftian, elder gods, cosmic horror feel to it.
We're locking ourselves inside a black box of horror.
And we'll drop in the ocean. Today's show was called "Whither Google Scholar." Fred, how are you going to wrap this up? I assume you didn't do the cosmic horror thing.
Well, there are known knowns.
Oh, gosh.
There are known unknowns.
Fred.
And unknown unknowns. And the wrap up for this show is what I would call a known unknown.
Well, I don't know what else there is to say besides-- just roll the credits, dude.
[MUSIC PLAYING]
Lost in the Stacks is a collaboration between WREK Atlanta and the Georgia Tech Library. Written and produced by Alex McGee, Charlie Bennett, Fred Rascoe, and Marlee Givens.
Legal counsel and a box which is shoved inside a crate which itself is locked inside a vault were provided by the Burrus Intellectual Property Law Group in Atlanta, Georgia.
How did they get it over here?
Special thanks to anyone who still has those old print copies of the engineering index from the '70s and '80s. And thanks, as always, to each and every one of you for listening.
Our web page is library.gatech.edu/lostinthestacks, where you'll find our most recent episode, a link to our podcast feed, and a web form if you want to get in touch with us.
Next week is a rerun because the holiday season is upon us. And the week after that, we'll have another edition of the Georgia Tech Library Guidebook, where we peek behind the curtain of technical services.
Is the wizard back there?
It's time for our last song today. Despite everything, I know I'm probably going to keep on using Google Scholar.
You don't need that "probably," Fred.
I'll keep using Google. I'll keep using streaming services. I'll keep using the internet. I'll keep using automobiles.
[LAUGHS]
I'll keep using air conditioning. I'll keep using things made out of plastic.
[SIGHS]
I'll do all these things because if I'm being honest with myself, I know deep down I worship at the temple of convenience.
This show has taken a hard right turn into the dark.
Put me in the black box and drop me in the ocean. So let's close with "Temple of Convenience" by Yeah Yeah Noh. Have a weekend, everybody.
[YEAH YEAH NOH, "TEMPLE OF CONVENIENCE"]