Mastering spaCy: Build structured NLP solutions with custom components and models powered by spacy-l

Speaker 1

00:00

Welcome to the deep dive, where we slice through the information clutter to bring you the clearest, most important insights. Today, we're taking a bit of a shortcut to becoming well informed about a really powerful tool in natural language processing, or NLP.

Speaker 2

00:14

It's called spaCy, that's right, and it's an interesting one. If you think of those huge language models, you know, like ChatGPT, as maybe a big, powerful food processor, then spaCy is more like your practical, really well-optimized kitchen knife. It's a library that's specifically designed to help you get actual work done.

Speaker 1

00:34

So moving beyond just theory.

Speaker 2

00:36

Exactly, beyond just academic concepts and into efficient, practical application. And we're going to uncover some surprising depth today, I think, from, you know, basic text processing right up to integrating with the latest AI stuff.

Speaker 1

00:48

Sounds good. And our mission for this deep dive is basically to give you a comprehensive but still really accessible understanding of what spaCy can do. We're drawing from quite a few sources, including the excellent book Mastering spaCy. Okay, let's unpack this. So to kick us off, what's the absolute core thing our listeners should get about spaCy?

Speaker 2

01:12

Well, at its heart, spaCy is this incredibly fast, open-source Python library, and it's really built for production-ready NLP applications.

Speaker 1

01:22

Production ready. That sounds important.

Speaker 2

01:24

It is. A lot of its speed comes from using Cython for the really performance-critical bits, so it's highly optimized but still easy to use within Python.

Speaker 1

01:33

Aha. So it's not just another, like, academic tool set. It's built for real-world stuff from the get-go.

Speaker 2

01:37

Precisely. That's a key difference compared to maybe something like NLTK, the Natural Language Toolkit, which historically, at least, was often more focused on students and researchers. With spaCy, you're hitting the ground running for deployment.

Speaker 1

01:48

You mentioned it's built to get work done. Is that, like, the official philosophy?

Speaker 2

01:53

Pretty much. Ines Montani, one of the core creators, often talks about this. The goal is genuinely to help people do their work efficiently. They're not trying to build some massive do-everything system. It's more about providing these sharp, reliable tools, like that knife, that fit nicely into whatever you're already doing.

Speaker 1

02:10

Got it. And getting started, is it complex?

Speaker 2

02:15

Not really. It works with modern Python and runs on, you know, the usual operating systems: Windows, Mac, Linux.

Speaker 1

02:21

And best practice is probably virtual environments, right? Keep things clean.

Speaker 2

02:24

Oh, absolutely, always a good idea for any Python project. Keeps your dependencies sorted.

Speaker 1

02:29

Now, you mentioned something important. The language models aren't built in, correct?

Speaker 2

02:33

That's a key point. spaCy itself is the framework, the tools, but for the statistical smarts, things like tagging parts of speech or finding named entities, you need to download a language model separately.

Speaker 1

02:45

Like en_core_web_sm, that kind of thing?

Speaker 2

02:49

Exactly, like en_core_web_sm for English. There's this quick command-line thing, python -m spacy download en_core_web_sm, that downloads the small English model and gets you the core pipeline components.

Speaker 1

02:58

Okay, and once you've got that, how do you sort of see what it's doing?

Speaker 2

03:01

Oh, well, that's where displaCy comes in. It's spaCy's built-in visualization tool, and it's fantastic. It just makes really complex linguistic concepts much easier to grasp visually. You can see dependency parses, how words connect, or see named entities highlighted right in the text. It helps you spot patterns almost instantly.

Speaker 1

03:22

So you can actually see the analysis.

Speaker 2

03:24

Yeah, you can try it online, there's a demo, or you can run it locally from your code, even in Jupyter notebooks. It's super helpful for understanding what's going on under the hood.

Speaker 1

03:32

Okay, so setup done, model downloaded, visualization ready. Let's talk about the core processing. You mentioned a pipeline.

Speaker 2

03:39

Yeah, think of it like an NLP assembly line. When you load a model, say using spacy.load, you get back this nlp object. And when you feed text into that object, like doc = nlp("This is some text"), it runs the text through a sequence of processing steps.

Speaker 1

03:55

The pipeline components.

Speaker 2

03:57

Exactly. The default pipeline usually includes a tokenizer, a tagger for part of speech, a dependency parser for sentence structure, and an entity recognizer, or NER component. Each does its specific job.

Speaker 1

04:10

And the output is this Doc object.

Speaker 2

04:13

Right. The doc object holds the result. It's not just the text, it's the text broken down into tokens, and each token is enriched with all the linguistic features found by the pipeline.
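The load-and-process flow described here can be sketched roughly as follows. This is a minimal sketch assuming spaCy v3 is installed; the fallback to a blank, tokenizer-only English pipeline is an assumption added so the snippet still runs when en_core_web_sm has not been downloaded.

```python
import spacy

try:
    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback (assumption for this sketch): tokenizer-only pipeline.
    nlp = spacy.blank("en")

# Names of the components that will run, in order.
print(nlp.pipe_names)

doc = nlp("This is some text.")
for token in doc:
    # pos_ and dep_ stay empty when no tagger or parser is in the pipeline.
    print(token.text, token.pos_, token.dep_)
```

With the full model loaded, the printed component names would typically include the tagger, parser, and NER steps mentioned above.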

Speaker 1

04:23

Let's break down that pipeline. First up, tokenization and sentence segmentation. Sounds simple, just splitting words.

Speaker 2

04:29

Ah, well, it's a bit more nuanced than just splitting on spaces. Tokenization is breaking the text into its smallest meaningful parts, the tokens. Words, numbers, punctuation, they all become tokens. But here's a surprising detail: unlike most other pipeline components, the default tokenizer doesn't rely on a statistical model.

Speaker 1

04:47

Oh, what does it use?

Speaker 2

04:49

It uses really carefully crafted language specific rules, which makes it very fast and predictable. And you can even customize it. You can add special cases like telling it how to handle slang or specific abbreviations.

Speaker 1

04:59

Like teaching it that "lemme" should be split into "lem" and "me"?

Speaker 2

05:02

Exactly, that kind of thing. It gives you fine-grained control.
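A tokenizer special case like the one just mentioned might look like this. It is a small sketch on a blank English pipeline (no model download needed); the example sentence is made up.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# The ORTH pieces must concatenate back to the original string ("lem" + "me").
nlp.tokenizer.add_special_case("lemme", [{ORTH: "lem"}, {ORTH: "me"}])

doc = nlp("lemme that")
tokens = [token.text for token in doc]
print(tokens)
```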

Speaker 1

05:04

And sentence segmentation, finding sentence boundaries?

Speaker 2

05:08

That's actually often more complex than tokenization. Think about abbreviations like "Mr." or complex punctuation. spaCy has a distinctive approach here: it often uses the dependency parser, which understands sentence structure, to help figure out sentence boundaries really accurately. It's quite a sophisticated design choice.

Speaker 1

05:28

Interesting. Okay, next step: lemmatization, getting the root word.

Speaker 2

05:32

Yep. The lemma is the base or dictionary form. So, like you said, eating, eats, eat, ate, they all boil down to the lemma eat.

Speaker 1

05:39

How useful is that in practice?

Speaker 2

05:41

Oh, incredibly useful. Think about a chatbot for booking flights. A user might say I want to fly, or show me flights or I flew yesterday.

Speaker 1

05:49

Right, different forms of the same core idea.

Speaker 2

05:52

Exactly. Lemmatization reduces fly and flew down to fly, and flights down to flight, so your system only needs to look for those base forms to understand the core intent. It simplifies things massively.

Speaker 1

06:02

Makes sense, and you could use it for other things too, like place names.

Speaker 2

06:05

Definitely. Maybe someone says Angeltown when they mean Los Angeles. You can actually add custom rules using something called an AttributeRuler to map Angeltown to the canonical Los Angeles lemma during processing. It ensures consistency.
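The Angeltown mapping could be sketched like this. It assumes a blank English pipeline with an attribute_ruler added by hand; in a trained pipeline such as en_core_web_sm you would more likely fetch the existing one with nlp.get_pipe("attribute_ruler").

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Whenever a token's text is exactly "Angeltown", set its lemma.
ruler.add(
    patterns=[[{"TEXT": "Angeltown"}]],
    attrs={"LEMMA": "Los Angeles"},
)

doc = nlp("I am flying to Angeltown")
print([(token.text, token.lemma_) for token in doc])
```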

Speaker 1

06:18

So spaCy processes the text, applies these steps, and stores the results. You mentioned container objects: Doc, Token, Span.

Speaker 2

06:26

Right, these are your main ways of accessing the processed information. The Doc object represents the whole processed text. If you loop over a Doc, like for token in doc, you get individual Token objects.

Speaker 1

06:37

And each token knows things about itself.

Speaker 2

06:39

Loads of things. A Token object holds the original word, its lemma, its part-of-speech tag, its dependency relation. It also has Boolean flags like token.is_punct, token.is_currency, token.like_url, token.like_num.

Speaker 1

06:52

Wow. Okay, so you can check if a token looks like a URL or a number easily?

Speaker 2

06:56

Yep. And it knows its entity type if it's part of one, like token.ent_type_ might be PERSON or ORG. It even has a token.shape_ attribute that gives you a kind of abstract representation of the word's orthography, like is it capitalized, is it all digits, et cetera. Really useful for rule-based matching.
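A few of these token attributes can be inspected without any trained model, since they are computed from the token text itself. The sentence below is a made-up example; ent_type_ is left out because it needs an NER component.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Pay $20 at https://spacy.io !")

for token in doc:
    print(
        token.text,
        token.is_punct,
        token.is_currency,
        token.like_url,
        token.like_num,
        token.shape_,
    )

# Index tokens by their text for easy lookup.
by_text = {token.text: token for token in doc}
```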

Speaker 1

07:13

And Span, where does that fit in?

Speaker 2

07:15

A Span is just a slice of the Doc representing multiple tokens. Sentences are Span objects; you can get them via doc.sents. Named entities are also Span objects, accessible via doc.ents. So Doc, Token, and Span are how you navigate and use the processed text.

Speaker 1

07:31

Got it. Let's move into some of those linguistic features. Part-of-speech tagging, POS tagging: that's identifying nouns, verbs, adjectives?

Speaker 2

07:39

Exactly, categorizing words by their grammatical role in the sentence.

Speaker 1

07:43

And how does spaCy figure that out? Is it just a dictionary lookup?

Speaker 2

07:46

Oh no, it's much smarter than that. It looks at the word in context. The surrounding words heavily influence the tag. It uses sequential statistical models trained on large amounts of text.

Speaker 1

07:56

So the same word could get different tags.

Speaker 2

07:58

Absolutely. Think of the word book: "I read a book" (noun) versus "I want to book a flight" (verb). The context tells the tagger which role it's playing.

Speaker 1

08:08

And why is this useful beyond just grammar?

Speaker 2

08:11

Well, it's really important for understanding meaning, especially for word sense disambiguation, figuring out which meaning of a word is intended.

Speaker 1

08:18

Can you give an example?

Speaker 2

08:19

Sure, take the word beat. It can mean many things. But if the POS tagger confidently tags it as an adjective, ADJ, as in "I'm totally beat," you know it almost certainly means exhausted.

Speaker 1

08:31

Ah, I see. The tag helps narrow down the meaning.

Speaker 2

08:33

Precisely, even if the verb or noun tags might still be ambiguous, "beat the drum" versus "follow the beat." The adjective tag is often quite specific. It adds a layer of understanding, even if lemmatization kind of flattens out things like verb tense.

Speaker 1

08:47

Okay, that makes sense. Next up, dependency parsing. This sounds a bit more complex. Mapping sentence relationships.

Speaker 2

08:53

It is complex but incredibly powerful. Dependency parsing represents the grammatical structure of a sentence not just as a flat sequence, but as a tree of relationships. It shows how words depend on each other.

Speaker 1

09:04

Head and dependent.

Speaker 2

09:06

Exactly. Each word, except usually the main verb, the root, has a head word it modifies or relates to, and a specific dependency label describes that relationship, like nsubj for nominal subject, or dobj for direct object. Why go to all this trouble? Well, what's fascinating here is that sentences aren't just sequences of tokens. They have this deep, inherent structure, and understanding that structure is absolutely crucial for many real-world NLP tasks.

09:32

Think about chatbots or machine translation. You need to know who did what to whom. Consider "I forwarded you the email" versus "You forwarded me the email."
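To make the "who did what to whom" idea concrete, here is a small sketch over those two sentences. As an assumption to keep it runnable without a trained parser, the heads and dependency labels are written by hand; in real use a pipeline like en_core_web_sm would predict them. The helper name subject_of_root is hypothetical.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

def subject_of_root(words, heads, deps):
    """Build a Doc with hand-written annotations and return the root's subject."""
    doc = Doc(vocab, words=words, heads=heads, deps=deps)
    root = next(token for token in doc if token.dep_ == "ROOT")
    return next(child.text for child in root.children if child.dep_ == "nsubj")

# Hand-written parse: heads are absolute token indices (spaCy v3 convention).
heads = [1, 1, 1, 4, 1]
deps = ["nsubj", "ROOT", "dative", "det", "dobj"]

first = subject_of_root(["I", "forwarded", "you", "the", "email"], heads, deps)
second = subject_of_root(["You", "forwarded", "me", "the", "email"], heads, deps)
print(first, second)
```

The words are nearly the same in both sentences; only the nsubj relation tells you who did the forwarding.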

Speaker 1

09:42

Same words, totally different meaning.

Speaker 2

09:44

Exactly. Dependency parsing helps the system figure out that I is the subject, the one doing the forwarding, in the first sentence, and you is the subject in the second. It disambiguates the roles based on the grammatical structure, the nsubj, dobj, iobj relationships. Without that, understanding user intent would be much, much harder.

Speaker 1

10:02

Right, that makes the importance clear. Okay, what about named entity recognition, NER, spotting real-world objects?

Speaker 2

10:09

Yep. A named entity is basically anything that can be referred to with a proper name or a quantity. So people's names, company names, locations, dates, monetary values, percentages.

Speaker 1

10:21

The categories seem pretty standard: PERSON, ORG, GPE (geopolitical entity).

Speaker 2

10:27

Those are common ones, yes, but the specific set of entity types is actually quite flexible and often depends on the data the model was trained on or the specific task you have in mind. If you're analyzing financial news, entities like MONEY and PERCENT might be way more important and frequent than, say, WORK_OF_ART. The model needs to be tailored or chosen based on the domain.

Speaker 1

10:50

And how good is NER these days?

Speaker 2

10:51

It's gotten incredibly good. The state of the art methods often use those transformer architectures we mentioned earlier. They're very effective at understanding context to identify entities accurately.

Speaker 1

11:01

Okay, And sometimes the default tokenization or entity spans might not be quite right. Can you fix them?

Speaker 2

11:08

Yes, absolutely. spaCy provides a really neat mechanism called doc.retokenize. It lets you merge multiple tokens into one, or split a single token into several.

Speaker 1

11:16

Why would you need to do that?

Speaker 2

11:17

Well, maybe an entity like New York City got split into three tokens, but you want to treat it as a single unit for analysis, so you can merge them. Or maybe a typo resulted in "SanFrancisco" being one token and you want to split it.

Speaker 1

11:29

Ah okay, So for cleanup and normalization.

Speaker 2

11:33

Exactly. Merging is usually simpler. Splitting can be a bit more involved because you then need to tell spaCy how the new tokens you've created attach to the rest of the sentence, their heads and attributes. But it's a very powerful tool for practical adjustments.
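Both directions can be sketched on a blank English pipeline. The sentences are made up, and the heads passed to split are an illustrative choice ("San" attaches to "Francisco", which attaches to "in").

```python
import spacy

nlp = spacy.blank("en")

# Merge: treat "New York City" as one token.
doc = nlp("I live in New York City now")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6])
merged = [token.text for token in doc]

# Split: "SanFrancisco" becomes two tokens; heads must be supplied explicitly.
doc2 = nlp("I live in SanFrancisco")
with doc2.retokenize() as retokenizer:
    heads = [(doc2[3], 1), doc2[2]]
    retokenizer.split(doc2[3], ["San", "Francisco"], heads=heads)
split = [token.text for token in doc2]

print(merged)
print(split)
```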

Speaker 1

11:46

Let's shift gears slightly to rule-based matching. You mentioned regular expressions can be tricky. What's spaCy's alternative?

Speaker 2

11:54

spaCy offers the Matcher class, and it's designed to be, well, a much cleaner, more readable, and definitely more maintainable alternative to regex for finding patterns in text.

Speaker 1

12:05

Why is regex problematic?

Speaker 2

12:06

Regular expressions can just become incredibly dense and hard to read, especially for complex patterns. They're also easy to get subtly wrong, which can lead to bugs that are hard to track down, and they operate purely on strings.

Speaker 1

12:17

And the Matcher is different how?

Speaker 2

12:19

The Matcher works with Token objects and their attributes. You define patterns not as strings, but as lists of dictionaries, where each dictionary specifies the attributes a token must have.

Speaker 1

12:29

Like LOWER to match the word hello regardless of case?

Speaker 2

12:32

Precisely, or IS_PUNCT: True to match any punctuation mark, or LIKE_NUM: True for number-like tokens. You're matching based on linguistic features, not just character sequences.
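A pattern of that kind, a list of per-token dictionaries, might look like this. It is a minimal sketch on a blank English pipeline; the rule name GREETING and the sample text are made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello" in any casing, followed by any punctuation mark.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}]
matcher.add("GREETING", [pattern])

doc = nlp("Hello, world! HELLO, again.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```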

Speaker 1

12:44

That sounds much more robust.

Speaker 2

12:45

It is, and you can use extended syntax too. You can match based on token length with LENGTH, check if a token is in a list with IN, or use Boolean flags like IS_DIGIT, IS_ALPHA, IS_UPPER, great for finding, say, emphasized words in all caps.

Speaker 1

13:00

Does it have regex-like operators, like optional parts?

Speaker 2

13:03

Yes, you can use operators like ? to make a token pattern optional. Think about matching names like Barack Obama but also Barack Hussein Obama. The middle name token can be marked as optional, and you have operators like + (one or more) and * (zero or more) for specifying occurrences, similar to regex. There's even a really useful online demo on the spaCy website where you can build and test Matcher patterns interactively.
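The optional middle name can be expressed with the "?" operator. A short sketch, again on a blank English pipeline with a made-up sentence:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": "barack"},
    {"LOWER": "hussein", "OP": "?"},  # zero or one occurrence
    {"LOWER": "obama"},
]
matcher.add("OBAMA", [pattern])

doc = nlp("Barack Obama was also known as Barack Hussein Obama")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```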

Speaker 1

13:29

Okay, that covers matching specific patterns. What if you have, like, a huge list of things to find, say thousands of product names?

Speaker 2

13:37

Right, creating individual Matcher patterns for thousands of specific phrases would be, well, not very efficient or practical.

Speaker 1

13:44

So what's the solution for that?

Speaker 2

13:46

spaCy provides the PhraseMatcher. It's optimized specifically for efficiently scanning text against large lists of multi-word phrases or dictionaries.

Speaker 1

13:53

How does that work?

Speaker 2

13:54

You give it a list of Doc objects representing the phrases you want to find, like Angela Merkel, Donald Trump, Alexis Tsipras. It then uses a really efficient algorithm to find all occurrences of those exact phrases in your target text, much faster than running thousands of individual rules.
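A minimal PhraseMatcher sketch with those names as the term list. The label POLITICIAN and the sentence are made up; nlp.make_doc is used so that only the tokenizer runs on each term.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

terms = ["Angela Merkel", "Donald Trump", "Alexis Tsipras"]
# Patterns are Doc objects, not strings.
matcher.add("POLITICIAN", [nlp.make_doc(term) for term in terms])

doc = nlp("Angela Merkel met Donald Trump in Brussels.")
found = {doc[start:end].text for _, start, end in matcher(doc)}
print(found)
```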

Speaker 1

14:10

Very useful for terminology lists or gazetteers.

Speaker 2

14:14

Exactly. And it can even match based on token attributes, not just the exact words. For instance, you could match based on the shape attribute, which is handy for finding structured data like IP addresses or specific code patterns in log files, even if the exact digits change.

Speaker 3

14:28

So you have the Matcher for flexible patterns and the PhraseMatcher for large lists. How do you integrate these findings back into the main spaCy Doc? That's where the SpanRuler comes in. It's a pipeline component that lets you use rules, defined very similarly to Matcher patterns, to directly add Span objects to your Doc. Add them where? You can configure it to add them to doc.ents, so effectively adding rule-based named entities.

14:53

Or you can have it add them to a custom span group, like doc.spans["my_custom_patterns"].

Speaker 1

14:57

So you add it to the pipeline like other components?

Speaker 2

14:58

Yep: nlp.add_pipe("span_ruler"). Then you provide it with your patterns. For example, you could define a pattern to find every instance of the word Chime and label it as an ORG entity.

Speaker 1

15:09

What if the regular NER model also finds entities? Do they clash?

Speaker 2

15:14

Good question. You can configure the SpanRuler. You can tell it whether your rule-based entities should overwrite entities found by the statistical NER model (overwrite=True) or not (overwrite=False). You can also set it up so that statistical entities don't overwrite your rule-based ones. That gives you control over which source of entities takes precedence.
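The Chime example might be wired up like this. It is a sketch assuming a spaCy version that ships the span_ruler component (3.3.1 or later, as far as I know) and a blank pipeline with no statistical NER, so the overwrite setting has nothing to compete with here.

```python
import spacy

nlp = spacy.blank("en")

# annotate_ents=True writes matches to doc.ents;
# overwrite=False keeps entities that are already set on the Doc.
ruler = nlp.add_pipe(
    "span_ruler",
    config={"annotate_ents": True, "overwrite": False},
)
ruler.add_patterns([{"label": "ORG", "pattern": [{"LOWER": "chime"}]}])

doc = nlp("I opened an account with Chime last year")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

In a pipeline that also contains a trained ner component, where you place the ruler (before or after ner) and how you set overwrite together decide which source wins.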

Speaker 1

15:37

Okay, this rule based stuff seems really practical. Can we talk about some specific recipes like real world extraction examples?

Speaker 2

15:45

Absolutely, here's where it gets really interesting, showing spaCy's power. You can easily build patterns to extract things like IBANs (International Bank Account Numbers) or phone numbers, these highly structured numeric things.

Speaker 1

15:57

Okay, what else?

Speaker 2

15:58

Think about social media. You could create patterns to find mentions expressing opinions, like matching the sequence: business name, plus is/was/be, plus maybe an adverb, plus an adjective.

Speaker 1

16:08

Like finding cafe X was really great.

Speaker 2

16:10

Exactly. That pattern structure, business name, a form of be, an optional adverb, an adjective, could pick up "Cafe X is good," "Cafe Y was very slow," "Restaurant Z will be amazing." It helps you gauge sentiment.

Speaker 1

16:20

Clever. Other examples?

Speaker 2

16:21

Hashtags are easy. You can match the hashtag symbol followed by tokens that meet certain criteria, like IS_ASCII or IS_ALPHA, to reliably pull out things like #deeplearning or #weekendfun.
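A hashtag pattern along those lines, as a small sketch. It relies on the English tokenizer splitting "#" off as its own token; the sample text is made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# The "#" symbol, then one ASCII token.
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

doc = nlp("Loving #deeplearning and #weekendfun")
hashtags = [doc[start:end].text for _, start, end in matcher(doc)]
print(hashtags)
```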

Speaker 1

16:33

And what about slightly more complex entities?

Speaker 2

16:36

You can even use patterns to refine entities. For example, maybe the NER just picks up Smith as a person. You could use a Matcher pattern to look for a preceding title like Mr., Ms., or Dr., and then retokenize to merge the title and the name into a single, more complete entity span: Ms. Smith.

Speaker 1

16:54

Wow. Okay, that's quite granular control.

Speaker 2

16:56

It really is. These rule-based tools, combined with the linguistic features, give you a lot of power for precise information extraction.

Speaker 1

17:04

Let's push deeper now into understanding meaning and intent. How does spaCy help with semantic parsing, figuring out what a user actually wants?

Speaker 2

17:11

A great way to explore this is with datasets like ATIS, the Airline Travel Information System. It contains thousands of real user requests about flights.

Speaker 1

17:19

Like "show me flights from Boston to Denver"?

Speaker 2

17:22

Exactly. Or "What's the cheapest flight?" or "What meals are served on flight X?" Analyzing these requires understanding not just the words, but the underlying goal.

Speaker 1

17:33

Where do you even start with something like that?

Speaker 2

17:35

Well, a really crucial first step, honestly, is just looking at the data yourself. Read through a sample of the utterances and get a feel for the common patterns, the types of entities involved, the grammar people use.

Speaker 1

17:47

What kind of things would you look for in the ATIS data?

Speaker 2

17:51

You'd quickly notice people specifying origins and destinations. But it's not enough just to spot Boston and Denver. You need to capture the relationship: from Boston, to Denver. You'd see the importance of prepositions like from, to, in. Those little words carry a lot of semantic weight.

Speaker 1

18:06

So you need more than just finding keywords.

Speaker 2

18:09

Definitely, you need to understand the relationships between the words. And that's where spaCy's DependencyMatcher comes in.

Speaker 1

18:15

Another matcher. How's this one different?

Speaker 2

18:17

Well, the Matcher looks for sequences of tokens based on their attributes. The DependencyMatcher looks for patterns based on the syntactic dependency relationships between tokens.

Speaker 1

18:26

Ah, using that dependency parse tree we talked about earlier.

Speaker 2

18:29

Precisely. It lets you find patterns like a verb connected to a noun with a direct object relationship (dobj). This is key for identifying intent.

Speaker 1

18:40

Can you give a quick linguistic primer on objects?

Speaker 2

18:43

Sure. So, very basically, you have transitive verbs, which need an object to act upon, like "I bought flowers," where flowers is a direct object, and intransitive verbs, which don't, like "I slept." And sometimes there's an indirect object too, like "I gave him the book": book is the direct object, him is the indirect object. The DependencyMatcher lets you specify these relationships in your patterns.

Speaker 1

19:04

How does that help find intent in the flight examples.

Speaker 2

19:08

Well, you could define a pattern looking for a verb like show or find that has a direct object (dobj) like flights. That pattern, defined using dependency relations, would match "show me flights," "find flights," "I need you to show flights," etc., capturing the core intent regardless of the exact phrasing.
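A verb-plus-direct-object pattern of that kind could be sketched as below. To keep it runnable without a trained parser, the parse of the example utterance is written by hand (an assumption for illustration); with en_core_web_sm you would simply call nlp(text). The rule name SHOW_OBJECT is made up.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

pattern = [
    # Anchor token: the verb "show" or "find".
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LOWER": {"IN": ["show", "find"]}}},
    # A direct child of the verb whose dependency label is dobj.
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("SHOW_OBJECT", [pattern])

# Hand-written parse; heads are absolute token indices.
doc = Doc(
    nlp.vocab,
    words=["show", "me", "flights", "from", "Boston"],
    heads=[0, 0, 0, 2, 3],
    deps=["ROOT", "dative", "dobj", "prep", "pobj"],
)

# Each match is (match_id, [token indices in pattern order]).
pairs = [(doc[ids[0]].text, doc[ids[1]].text) for _, ids in matcher(doc)]
print(pairs)
```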

Speaker 1

19:25

That seems much more robust than just keyword spotting.

Speaker 2

19:28

It is, and you can build more complex patterns. What if someone says, "show all flights and fares"? The DependencyMatcher can use the conj dependency link between flights and fares to recognize that the user has two related intents connected by and.

Speaker 1

19:44

Okay, that's powerful, But this raises a question. Once you've used these matchers to figure out the intent, say book flight, how do you store that information with the doc?

Speaker 2

19:54

Great question. You don't want that information just floating around. spaCy has a mechanism for this: extension attributes. You can define your own custom attributes on Doc, Token, or Span objects. So you could create an attribute called, say, doc._.intent. The underscore indicates it's a custom extension.

Speaker 1

20:11

And how do you set that attribute?

Speaker 2

20:13

You typically create a custom spaCy pipeline component, using a special decorator, @Language.factory, to define it. Inside this component's __call__ method, which processes the Doc, you'd run your Matcher or DependencyMatcher, figure out the intent, and then set doc._.intent to, say, book_flight.

Speaker 1

20:29

So you can tailor the pipeline to extract and store exactly what you need.

Speaker 2

20:34

Exactly. It makes spaCy incredibly flexible and extensible for specific tasks.
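Putting the extension attribute and the custom component together might look like this. It is a sketch, not the book's implementation: the class name IntentDetector, the factory name intent_detector, and the token-based pattern are all hypothetical, and a plain Matcher is used instead of a DependencyMatcher so it runs on a blank pipeline.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Register doc._.intent once; guard against re-registration.
if not Doc.has_extension("intent"):
    Doc.set_extension("intent", default=None)

class IntentDetector:
    """Hypothetical component that stores a matched intent on the Doc."""

    def __init__(self, vocab):
        self.matcher = Matcher(vocab)
        self.matcher.add("book_flight", [[
            {"LOWER": {"IN": ["book", "reserve"]}},
            {"OP": "?"},  # optionally skip one token, e.g. "a"
            {"LOWER": {"IN": ["flight", "flights"]}},
        ]])

    def __call__(self, doc):
        matches = self.matcher(doc)
        if matches:
            match_id = matches[0][0]
            doc._.intent = doc.vocab.strings[match_id]
        return doc

@Language.factory("intent_detector")
def create_intent_detector(nlp, name):
    return IntentDetector(nlp.vocab)

nlp = spacy.blank("en")
nlp.add_pipe("intent_detector")

doc = nlp("I want to book a flight to Denver")
other = nlp("What meals are served?")
print(doc._.intent, other._.intent)
```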

Speaker 1

20:38

Now, we touched on performance earlier. What about processing large datasets, like the full ATIS corpus with thousands of utterances? Doing them one by one sounds slow.

Speaker 2

20:49

It would be. Processing doc = nlp(text) for each of the 4,978 utterances individually would take quite a while.

Speaker 1

20:56

So what's the efficient way?

Speaker 2

20:57

The key is the nlp.pipe method, or Language.pipe if you're referring to the base class.

Speaker 1

21:03

How does pipe help?

Speaker 2

21:04

It processes the texts as a stream, and crucially, it buffers them internally and processes them in batches. This allows spaCy to leverage optimizations and parallel processing much more effectively.

Speaker 1

21:15

And the speed difference is noticeable.

Speaker 2

21:17

Oh, absolutely dramatic. The sources mention going from something like twenty-seven seconds for processing the ATIS dataset one by one down to under six seconds using nlp.pipe. It's the standard way to process large volumes of text efficiently.
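The batched version is a one-line change from the per-text loop. A minimal sketch with a few made-up utterances standing in for the corpus; batch_size here is an arbitrary illustrative value.

```python
import spacy

nlp = spacy.blank("en")
texts = [
    "show me flights from Boston to Denver",
    "what is the cheapest flight",
    "what meals are served on this flight",
]

# One at a time (slow for large corpora):
docs_slow = [nlp(text) for text in texts]

# Streamed and batched:
docs = list(nlp.pipe(texts, batch_size=2))

print(len(docs), docs[0][0].text)
```

nlp.pipe also accepts an n_process argument for multiprocessing, which is worth measuring on your own data rather than assuming a speedup.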

Speaker 1

21:29

Okay, essential for any real-world application. Let's pivot now to the really cutting-edge stuff: transformers and large language models, LLMs. The Transformer architecture kind of kicked things off, right? The "Attention Is All You Need" paper.

Speaker 2

21:43

Yes, that 2017 paper was a landmark. Transformers really revolutionized NLP.

Speaker 1

21:49

What problem were they trying to solve?

Speaker 2

21:50

Well, previous models like LSTMs processed text sequentially. This meant they could struggle with long-range dependencies, forgetting information from the beginning of a long text, and they weren't easily parallelizable, which limited training speed.

Speaker 1

22:04

And transformers fix this how? With attention?

Speaker 2

22:09

Exactly. The core innovation is the self-attention mechanism, often implemented in a multi-head attention block. Instead of just looking at the immediately preceding words, self-attention allows the model to weigh the importance of all words in the input sequence when calculating the representation for a single word.

Speaker 1

22:24

So it looks at the whole context at once.

Speaker 2

22:27

Sort of, yeah. It calculates a word's embedding, its representation, by taking a weighted average of the embeddings of all the words in the sequence, where the weights, the attention scores, indicate relevance. This lets it understand language much more deeply in context.
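The weighted-average idea can be written out in a few lines of NumPy. This is a deliberately simplified sketch: a single head, and no learned query, key, or value projections (a real Transformer layer has those), with random vectors standing in for word embeddings.

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention: each row of X attends to every row of X."""
    d = X.shape[-1]
    # Similarity of every word to every other word, scaled by sqrt(d).
    scores = X @ X.T / np.sqrt(d)
    # Softmax over each row turns scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted average of all input vectors.
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 "words", 8-dimensional embeddings
context, weights = self_attention(X)
print(context.shape, weights.sum(axis=-1))
```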

Speaker 1

22:44

What was the big aha moment with this?

Speaker 2

22:47

A major one was that transformers could generate dynamic word vectors. Older methods like word2vec gave the same vector for bank every time, but a transformer can understand the context and give a different vector for bank in "river bank" versus bank in "investment bank."

Speaker 1

23:02

That's a huge leap in understanding nuance.

Speaker 2

23:04

It really was. And libraries like Hugging Face's Transformers library now provide access to literally thousands of these pre-trained transformer models.

Speaker 1

23:12

How does spaCy integrate with these? Can you use transformers within a spaCy pipeline?

Speaker 2

23:16

Yes, absolutely. A great example is text classification. Let's say you want to classify Amazon product reviews as positive or negative. You can use spaCy's TextCategorizer component.

Speaker 1

23:26

Which is trainable.

Speaker 2

23:27

Right, it's a trainable pipeline component. You'd prepare your training data, the reviews labeled as positive or negative, using spaCy's Example object, and then serialize it efficiently using DocBin.
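Preparing and serializing such training data might look roughly like this. The two reviews and the label names POSITIVE and NEGATIVE are made up for illustration; the sketch round-trips through bytes so it leaves no files behind, whereas in practice you would call to_disk with a path such as train.spacy.

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.blank("en")

# Made-up labeled reviews: (text, is_positive)
reviews = [
    ("Great product, works perfectly", True),
    ("Broke after two days", False),
]

doc_bin = DocBin()
for text, is_positive in reviews:
    cats = {"POSITIVE": float(is_positive), "NEGATIVE": float(not is_positive)}
    example = Example.from_dict(nlp.make_doc(text), {"cats": cats})
    # The reference Doc carries the gold-standard labels.
    doc_bin.add(example.reference)

data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
print([(doc.text, doc.cats) for doc in restored])
```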

Speaker 1

23:38

How do you manage the training process itself?

Speaker 2

23:42

spaCy has a really nice configuration system. Instead of hard-coding parameters, you define everything, the pipeline components, model settings, hyperparameters, data paths, in a single configuration file, config.cfg.

Speaker 1

23:56

Why is that better?

Speaker 2

23:57

It makes your experiments incredibly reproducible. There are no hidden defaults. Everything is explicit in the config file. You can then train your pipeline directly from the command line using spacy train.

Speaker 1

24:07

And can you include a transformer in that pipeline for text classification?

Speaker 2

24:12

Yes, you can configure the pipeline to include a transformer component. This component generates those context-aware embeddings we talked about, which are then fed into the text categorizer. Often, adding a transformer significantly boosts the accuracy of the classifier because it has a richer understanding of the text's meaning and sentiment.

Speaker 1

24:30

Okay, so let's name some names. What about famous transformer models like BERT and RoBERTa? What makes them special?

Speaker 2

24:37

Right. BERT, Bidirectional Encoder Representations from Transformers, was a huge step. Its key innovation was being bidirectional during pre-training.

Speaker 1

24:46

Meaning it looked forwards and backwards in the text simultaneously.

Speaker 2

24:49

Yeah. Previous models were often unidirectional or combined separate left-to-right and right-to-left models. BERT used a technique called masked language modeling, predicting hidden words, to learn context from both directions at the same time. This gave it a deeper understanding.

Speaker 1

25:05

And it produced those dynamic word vectors.

Speaker 2

25:08

Yes. It also used some special tokens, like [CLS] at the beginning of sequences and [SEP] to separate sentences, and it used WordPiece tokenization, breaking words into common subword units, so "playing" might become "play" and "##ing". This helps it handle large vocabularies and even words it hasn't explicitly seen before.
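The subword splitting can be illustrated with a greedy longest-match-first procedure over a toy vocabulary. This is a sketch of the general WordPiece lookup idea, not BERT's actual tokenizer code; the function name, the tiny vocabulary, and the [UNK] fallback are assumptions for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece("playing", toy_vocab))
print(wordpiece("unplayable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```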

Speaker 1

25:25

What about RoBERTa? How did that improve on BERT?

Speaker 2

25:28

RoBERTa, developed by Facebook AI, basically took the BERT architecture and optimized the training procedure. They used things like dynamic masking, changing the masked words during training, trained on much more data for longer, and removed one of BERT's training objectives, next sentence prediction, finding it didn't always help. These changes generally led to better performance on downstream tasks compared to the original BERT models.

Speaker 1

25:53

Okay, so transformers are powerful pre-trained models. What about the even bigger ones, the large language models, or LLMs? How do they fit in?

Speaker 2

26:01

LLMs are essentially an evolution, or maybe a scaling up, of those pre-trained language models like BERT. We're talking models with vastly more parameters, GPT-3 has 175 billion, for instance, trained on absolutely enormous amounts of text and code.

Speaker 1

26:14

And they can do, well, almost anything text-related?

Speaker 2

26:17

They're incredibly versatile, yeah: translation, summarization, question answering, code generation, creative writing. They've shown promise in specialized fields like medicine, law, and education, too.

Speaker 1

26:27

But they're not perfect, right, What are the downsides?

Speaker 2

26:30

Definitely not perfect. There are key limitations. One is the sheer computational cost: training and even running them requires massive resources. They can also be slower to generate responses compared to smaller models, and crucially, they have this tendency to hallucinate.

Speaker 1

26:46

Hallucinate, meaning they make stuff up?

Speaker 2

26:49

Essentially, yes. They can generate responses that sound perfectly plausible and grammatically correct, but are factually incorrect or nonsensical. They don't inherently know things. They're predicting probable sequences.

Speaker 1

27:00

Of words. So you need to be careful how you use them?

Speaker 2

27:02

Very careful. A lot of work goes into prompt engineering, carefully crafting the input prompt to guide the LLM towards the desired accurate output.

Speaker 1

27:11

How does spaCy help manage interactions with LLMs?

Speaker 2

27:14

There's a package called spacy-llm. It provides a structured way to integrate LLMs into spaCy workflows. It treats interactions with an LLM as defined tasks, like summarization or entity

Speaker 1

27:25

extraction. And it uses prompts?

Speaker 2

27:26

Yes, it uses Jinja templates to define the prompts for these tasks. You can use built-in tasks, or you can define your own custom tasks.

Speaker 1

27:32

Custom tasks like what?

Speaker 2

27:34

For example, the sources mention creating a custom task to extract specific quotes from a text along with the surrounding context sentences. You define the prompt template to ask the LLM for this specific output, and you also define how to parse the LLM's potentially messy response back into a structured format that spaCy can use.
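
The two halves of such a custom task, building the prompt and parsing the reply, might look roughly like the sketch below. It is plain Python and does not use spacy-llm itself: `str.format` stands in for a Jinja template, and the prompt wording, the `QUOTE: ... | CONTEXT: ...` line format, the `QuoteRecord` class, and the sample response are all hypothetical.

```python
import re
from dataclasses import dataclass

# Stand-in for a Jinja template; str.format keeps the sketch dependency-free.
PROMPT_TEMPLATE = (
    "Extract every direct quote from the text below.\n"
    "Answer with one line per quote, formatted as:\n"
    "QUOTE: <quote> | CONTEXT: <sentence containing the quote>\n\n"
    "Text:\n{text}\n"
)

# Tolerates leading bullets, mixed case, and uneven spacing around the separator.
LINE_PATTERN = re.compile(
    r"QUOTE:\s*(?P<quote>.+?)\s*\|\s*CONTEXT:\s*(?P<context>.+)",
    re.IGNORECASE,
)


@dataclass
class QuoteRecord:
    quote: str
    context: str


def build_prompt(text):
    """Fill the prompt template with the document text."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_response(response):
    """Keep only lines that follow the requested format; ignore any chatter."""
    records = []
    for line in response.splitlines():
        match = LINE_PATTERN.search(line)
        if match is None:
            continue
        quote = match.group("quote").strip().strip('"\u201c\u201d')
        records.append(QuoteRecord(quote, match.group("context").strip()))
    return records


# A made-up reply showing the kind of noise a parser has to survive.
messy_response = """Sure! Here are the quotes I found:
- QUOTE: "We ship on Friday" | CONTEXT: The manager said, "We ship on Friday".
quote: no comment   |  context: Asked about delays, she offered no comment.
Let me know if you need anything else."""

for record in parse_response(messy_response):
    print(record)
```

Keeping the parser strict about format and forgiving about surrounding text is one reasonable design choice: anything the model adds outside the requested lines is simply dropped.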

Speaker 1

27:54

So spacy-llm provides a bridge and some structure for using LLMs within a more controlled spaCy environment.

Speaker 2

28:00

Exactly, it helps make using LLMs more systematic and reproducible.

Speaker 1

28:04

Let's circle back to training your own models. We talked about NER. When would you actually need to train a custom NER model instead of using a pretrained one?

Speaker 2

28:12

That's a common question. The rule of thumb is, if a pretrained spaCy model, like one of the en_core_web pipelines, performs reasonably well on your data, maybe gets, say, seventy-five percent accuracy or higher on the entities you care about, you might not need full custom training.
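
One way to apply that rule of thumb is to score a pretrained model per entity label on a small hand-labeled sample. The sketch below is an assumption about how you might measure it, using the share of gold entities the model reproduced exactly; the function name, the character-offset spans, and the sample data are invented, and the model predictions are hard-coded rather than produced by spaCy.

```python
from collections import Counter


def per_label_hit_rate(gold_spans, predicted_spans):
    """Fraction of gold entities, per label, that the model reproduced exactly."""
    predicted = set(predicted_spans)
    totals, hits = Counter(), Counter()
    for span in gold_spans:
        label = span[2]
        totals[label] += 1
        if span in predicted:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# (start_char, end_char, label) triples; made-up values for illustration.
gold = [(0, 5, "ORG"), (10, 16, "GPE"), (20, 24, "ORG"), (30, 36, "MONEY")]
predicted = [(0, 5, "ORG"), (10, 16, "GPE"), (20, 24, "PERSON")]

rates = per_label_hit_rate(gold, predicted)
# Labels under the 75 percent mark are candidates for rules or custom training.
needs_attention = sorted(label for label, rate in rates.items() if rate < 0.75)

print(rates)
print(needs_attention)
```

Looking at the rate per label, rather than one overall number, shows which entity types the pretrained model already handles and which ones need extra work.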

Speaker 1

28:27

What would you do then?

Speaker 2

28:28

You could potentially fine-tune the existing model, or more often, you'd use other spaCy components like the Matcher or SpanRuler we discussed to add rules that catch the specific cases the pretrained model misses or gets wrong, kind of like augmenting it.

Speaker 1

28:42

But when is custom training unavoidable?

Speaker 2

28:45

It's usually necessary when your domain has many important entity types that are just completely absent from the pretrained models. Think about highly specialized fields: specific financial instruments, unique biological gene names, custom product codes, very niche legal terms. If the pretrained model doesn't even know these categories exist, rules alone won't cut it. You need to teach a model from scratch or significantly fine-tune one. And that

Speaker 1

29:12

involves getting data and labeling it? Exactly.

Speaker 2

29:15

Data collection is the first step. Then comes annotation, manually labeling examples of your text with the entities, parts of speech, dependencies, whatever your model needs to learn.

Speaker 1

29:24

Are there tools for that? Annotation sounds tedious?

Speaker 2

29:27

It can be, but there are great tools. Prodigy, also from the makers of spaCy, is a very modern annotation tool that often uses active learning to be more efficient. It suggests labels you can confirm or correct. There are also open source options like Nertwig, which integrates with Jupyter Notebooks.

Speaker 1

29:42

Okay, so you annotate your data using one of these tools. Then what? Then

Speaker 2

29:45

you convert that annotated data into spaCy's efficient binary format, DocBin. You typically split your data into training and evaluation sets. Then you use spaCy's training system, spacy train, with a config file to train your custom NER component, and finally use spacy evaluate to see how well your trained model performs on the unseen evaluation data.

Speaker 1

30:07

What's fascinating here is the possibility of combining models. Can you use your custom trained NER model alongside one of spaCy's pretrained ones?

Speaker 2

30:18

Yes, and that's often a very powerful approach. You get the best of both worlds.

Speaker 1

30:21

How does that work technically?

Speaker 2

30:22

First, you'd package your custom trained pipeline component, maybe the one that recognizes fashion brand entities, into an installable Python package using the spacy package command.

Speaker 1

30:31

Okay, so it's like distributing your own mini model? Exactly.

Speaker 2

30:35

Then you use another command, spacy assemble, with a special configuration file. This config file tells spaCy how to build a new pipeline by sourcing components from different places.

Speaker 1

30:44

So you could say, take my custom fashion brand component and also take the GPE, location, and money components from the standard en_core_web_sm model?

Speaker 2

30:51

Precisely. spacy assemble pulls these components together into a single, unified pipeline that can recognize entities from both your custom training and the general purpose pretrained model. It's a very neat way to create highly specialized, yet broadly capable NLP systems.

Speaker 1

31:10

Very cool. Let's touch on entity linking. That's about connecting mentions in text to actual entries in a knowledge base, right? Disambiguating Washington.

Speaker 2

31:18

Exactly. Is Washington referring to George Washington the person, Washington, DC, the city, or Washington state? Entity linking aims to resolve that ambiguity by linking the mention to a unique identifier, often in a knowledge base like Wikidata or a custom company database.

Speaker 1

31:33

How does spaCy handle this?

Speaker 2

31:34

spaCy has an EntityLinker component. Its architecture basically involves three main parts. First, you need a knowledge base, or KB.

Speaker 1

31:40

What's in the KB?

Speaker 2

31:41

It stores information about the entities you want to link to: their unique IDs, like Wikidata QIDs, names, descriptions, and aliases. spaCy provides tools to create this, for example an in-memory lookup KB. You'd add entries for, say, Taylor Swift the singer, Taylor Lautner the actor, and Taylor Fritz the tennis player, each with its unique ID and maybe a short description.

Speaker 1

32:04

What else is needed besides the KB?

Speaker 2

32:07

Second, you need a way to generate candidate entities from the KB for a given mention in the text. If the text says Taylor, the system needs to know that Swift, Lautner, and Fritz are all potential candidates. You also add aliases with prior probabilities. Maybe the alias Taylor Swift has a one hundred percent probability of linking to the singer's ID, while just Taylor has an equal chance for all three

32:28

initially. And the third part? The third part is a machine learning model, which is trained to look at the mention, its context in the sentence, and the information about the candidate entities from the KB, and then predict the most likely correct link, or predict NIL if none of the candidates seem right.
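
The first two parts, the knowledge base and candidate generation with prior probabilities, can be sketched without any library. `ToyKnowledgeBase` below is a plain-Python stand-in, not spaCy's own knowledge base class, and the entity IDs such as `ID_SWIFT` are placeholders rather than real Wikidata QIDs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKnowledgeBase:
    """Plain-Python stand-in for a knowledge base; not spaCy's own class."""

    descriptions: dict = field(default_factory=dict)  # entity ID -> description
    alias_priors: dict = field(default_factory=dict)  # alias -> {entity ID: prior}

    def add_entity(self, entity_id, description):
        self.descriptions[entity_id] = description

    def add_alias(self, alias, entity_ids, priors):
        unknown = [e for e in entity_ids if e not in self.descriptions]
        if unknown:
            raise ValueError(f"unknown entity IDs: {unknown}")
        if len(entity_ids) != len(priors) or sum(priors) > 1.0 + 1e-9:
            raise ValueError("need one prior per entity, summing to at most 1")
        self.alias_priors[alias] = dict(zip(entity_ids, priors))

    def get_candidates(self, mention):
        """Return (entity ID, prior) pairs for a mention, most likely first."""
        candidates = self.alias_priors.get(mention, {})
        return sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)


kb = ToyKnowledgeBase()
kb.add_entity("ID_SWIFT", "American singer-songwriter")
kb.add_entity("ID_LAUTNER", "American actor")
kb.add_entity("ID_FRITZ", "American tennis player")

# The full name points at one entity; the bare first name is split three ways.
kb.add_alias("Taylor Swift", ["ID_SWIFT"], [1.0])
kb.add_alias("Taylor", ["ID_SWIFT", "ID_LAUTNER", "ID_FRITZ"], [1 / 3] * 3)

print(kb.get_candidates("Taylor Swift"))
print(kb.get_candidates("Taylor"))
print(kb.get_candidates("Tyler"))  # no alias registered, so no candidates
```

The trained model described above would then rerank these candidates using the sentence context; an empty candidate list corresponds to the NIL case.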

Speaker 1

32:44

Does training this require special data?

Speaker 2

32:46

Yes. When you train the entity linker component, your training data needs to clearly specify which mentions should link to which KB IDs. You often need a custom corpus reader to handle this specific data format during training.

Speaker 1

33:00

Okay, we've built all these amazing models and pipelines. How do we actually put them into the hands of users or other systems? Let's talk deployment. Building apps and APIs. Right, moving

Speaker 2

33:10

from the lab to the real world. Two popular Python frameworks work great with spaCy for this. For building interactive web applications quickly, especially if you're not a front-end expert, Streamlit is fantastic.

Speaker 1

33:21

Streamlit? How does it work?

Speaker 2

33:23

It lets you build web apps purely in Python. You can create widgets like text boxes, with st.text_area, and buttons very easily. There's even a specific package, spacy-streamlit, that provides ready-made components to visualize spaCy's analysis, like NER results, directly in your Streamlit app. So you

Speaker 1

33:40

could build a quick demo tool for your spaCy pipeline? Exactly.

Speaker 2

33:43

And a key feature is Streamlit's caching, @st.cache or @st.cache_data. This prevents your spaCy models from having to reload every single time a user interacts with the app, which makes it much faster and more responsive.

Speaker 1

33:56

What if you need something more robust, like a back end API that other services can call.

Speaker 2

34:01

Then FastAPI is an excellent choice. It's a modern, high performance Python framework specifically designed for building APIs.

Speaker 1

34:09

What makes FastAPI good?

Speaker 2

34:10

It's known for its speed. It also leverages Python type hints heavily. You define the expected data types for your API inputs and outputs, which FastAPI uses for automatic data validation, catching errors early, and also for automatically generating interactive API documentation using Swagger UI.

Speaker 1

34:27

So it makes development faster and more reliable.

Speaker 2

34:29

Yes, significantly. You use Pydantic models to define your data structures, and FastAPI handles the validation and serialization. You could easily build an API endpoint that takes some text, runs it through your spaCy NER pipeline, and returns the extracted entities as structured JSON data. The autogenerated documentation makes it super easy for others, or yourself, to understand and test the API.

Speaker 1

34:53

Okay, building models, deploying apps, the whole process can get complex. How do you manage the entire end to end workflow, especially for reproducibility and collaboration.

Speaker 2

35:03

That's where workflow management tools come in. spaCy has a companion tool called Weasel. Weasel? Yeah, Weasel helps you structure your entire NLP project. You define your workflow steps, like downloading data, preprocessing, training, and evaluating, along with any data assets and custom commands, all within a configuration file, project.yml. It makes your project reproducible and easier for others to understand and run.

Speaker 1

35:25

What about managing the data itself and the models? They can get large and change often. For that,

Speaker 2

35:31

Data Version Control, DVC, is an increasingly popular tool, especially in the ML world. It works alongside Git. How does DVC help? It tackles several common problems. First, sharing large datasets and models is hard with Git alone. DVC lets you version your data and models, storing them in remote storage like S3 or Google Cloud Storage, while keeping small metafiles in Git. This makes collaboration much easier. What else? It helps make your data processing and model

35:58

training pipelines reliable and reproducible. You define the steps and their dependencies, and DVC can track everything. It also, crucially, helps with tracking model metrics over time, so you can see how performance changes as you modify data or code. It really embraces GitOps principles for MLOps, making your ML workflows versioned, automated, and continuously reconciled.

Speaker 1

36:20

So it brings better engineering practices to data science? Exactly.

Speaker 2

36:23

It helps manage the whole life cycle, and tools like DVC Studio even provide features like a model registry for managing and sharing your trained models effectively across a team. Weasel and DVC together provide a really solid foundation for managing serious NLP projects.

Speaker 1

36:36

Wow. From the core concepts and pipeline, through advanced analysis like dependency parsing and NER, rule-based matching, transformers, LLMs, training custom models, and finally, deployment and workflow management. That was an incredibly thorough deep dive into the spaCy ecosystem.

Speaker 2

36:55

It really covers a lot of ground, doesn't it.

Speaker 1

36:56

It absolutely does. It really reinforces that idea of spaCy being that practical, well optimized kitchen knife: precise, powerful, and adaptable for so many different NLP tasks.

Speaker 2

37:08

Yeah. And you know, if we connect this to the bigger picture, really understanding tools like spaCy empowers you not just to build things, but also to critically evaluate how these language AI systems work, how they understand or misunderstand the world.

Speaker 1

37:20

That's a great point.

Speaker 2

37:21

It kind of raises an important question for anyone listening. I think, now that you have this deeper insight, how will you use this knowledge, maybe to refine your own analysis, or perhaps even to build something new and impactful.

Speaker 1

37:31

A fantastic thought to end on. We definitely encourage you, our listeners, to keep exploring these fascinating topics and think about how you might apply this knowledge, whether it's in your work, your studies, or just your own curiosity about language and AI.

Transcript source: Provided by creator in RSS feed: download file