Okay, let's unpack this. You ever feel like you're just drowning in numbers?
Definitely. Spreadsheets everywhere, complex reports.
Exactly, graphs that maybe look cool but don't actually tell you much. It's so easy to get lost in all that data, right? But imagine if you could just see through it all, if the numbers could instantly, like, paint a clear picture.
That's the idea, isn't it? That's the real power of data visualization, turning that overwhelming information into something you can grasp, you know, quickly and really thoroughly.
Yeah, moving beyond just tables and stats. Exactly.
We're so used to looking at tables, maybe hearing about models, and those have their place for sure, but a good visualization it can give you that immediate kind of gut level understanding, seeing the patterns, the relationships hiding in there.
It's like reading a recipe versus actually seeing the finished dish.
Like you said, yeah, precisely. And that brings us nicely to our source for this deep dive, the book Data Visualization: A Practical Introduction.
Ah, yes, good one.
And this isn't just, you know, a gallery of pretty charts. It's a really practical, hands-on guide.
Mm-hmm. It walks you through using R, the programming language, and this really flexible tool called ggplot2.
Right. And what I found really insightful is how it focuses not just on the aesthetics, like, does it look nice?
Although that matters.
True, but more on how our brains actually process visual information, and designing charts that work with that process.
That connection is key, isn't it, between how we see and what we understand. The book really emphasizes that the best visualizations are the ones that kind of tap into how we intuitively interpret things like size, color, position, makes sense.
Like making the data speak directly through what we see.
Exactly. So our mission today really is to pull out the key insights from this book to help you, listening, become more data savvy.
Yeah, give you the tools to not just make your own effective charts, but also to look at any graph you see and really understand what it's telling you or maybe what it isn't telling you.
Good point. We want to help you avoid those common pitfalls and just feel more confident navigating all this data.
Okay, so where should we start? Maybe the big question: why even bother visualizing data? Why not just stick with the tables?
Right. The book makes a really strong case for moving beyond just the numbers.
It does. There's a great example from Jackman back in nineteen eighty.
Oh yeah, the voter turnout one.
That's the one. He was looking at voter turnout and income inequality across different countries. Okay, and the initial analysis, just crunching the numbers for eighteen countries suggested a pretty strong link.
Seems straightforward enough.
But then he just plotted the data, a simple scatterplot, and bam, it was instantly clear that the whole relationship was basically being driven by one single data point.
South Africa.
Wow, so one outlier was creating the entire trend?
Pretty much.
Now, you could find that eventually with more stats, sensitivity analysis, stuff like that.
Sure, dig deep enough. But the visual...
It made it obvious immediately.
That's a powerful demonstration right there. It really is.
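A small R sketch of the effect just described, using simulated numbers rather than Jackman's actual data (the values, the seed, and the variable names are all illustrative assumptions). Seventeen unrelated points plus one extreme case are enough to produce a strong apparent correlation, and the scatterplot shows the culprit at a glance:

```r
# Simulated illustration only; this is NOT the turnout/inequality data.
set.seed(1)
x <- c(rnorm(17, mean = 5, sd = 1), 15)    # 17 ordinary cases plus one extreme x
y <- c(rnorm(17, mean = 80, sd = 5), 30)   # ... and one extreme y

r_all  <- cor(x, y)              # correlation including the outlier
r_trim <- cor(x[-18], y[-18])    # correlation with the outlier removed

# The plot makes the influence of the single point obvious.
plot(x, y, pch = 19, main = "One point can drive the whole trend")
abline(lm(y ~ x), col = "red")                       # fit with the outlier
abline(lm(y[-18] ~ x[-18]), col = "blue", lty = 2)   # fit without it

c(with_outlier = r_all, without_outlier = r_trim)
```

Comparing the two fitted lines is a quick visual version of the sensitivity analysis mentioned in the conversation.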
Yeah.
And there's another great illustration in the book, inspired by Anscombe's quartet.
Ah, the classic.
Yeah. So another researcher, Vanhove, created sixteen different data sets, and here's the kicker: every single one had the exact same statistical correlation between x and y, r equals point six.
Okay, point six seems like a decent positive relationship.
Right, that's what the number tells you. But then you visualize them. You plot those sixteen data sets, and they look different, totally different. Some look like a nice cloud of points like you'd expect. Others have a crazy outlier way off the line. Some are clearly curved, some are just like two separate groups of dots.
So the single number, point six, completely hid all that variation?
Completely.
The core insight there is just critical. Yeah, always always look at a scatterplot of your correlations. Don't just trust the number.
It really hammers home that point, doesn't it? A single statistic can mask wildly different realities in the data. You just have to see the shape, spot the weird stuff.
Understand the distribution.
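Anscombe's original quartet ships with base R as the `anscombe` data frame, so the same check can be sketched without any extra packages. This is a stand-in for the sixteen-panel figure discussed here, not a reproduction of it: the four data sets share nearly the same correlation but look completely different once plotted.

```r
# `anscombe` is a built-in data frame with columns x1..x4 and y1..y4.
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(cors, 2)   # all four come out near 0.82

# Plot each pair in a 2 x 2 grid to see what the single number hides.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
}
par(op)
```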
Yeah, but it's also important, as the book notes, not to just blindly trust the visual either.
Right. Absolutely, visualizations have their own, what the book calls, rhetorical plausibility. They suggest things.
They frame the data in a certain way.
Exactly.
Just because it's in a chart doesn't make it the absolute truth. We still need to think critically.
Okay, so we're sold on why visualization is powerful. But what makes one good versus bad?
Right. The book kind of breaks down the problems into three buckets. Okay, there's issues of just plain bad taste, aesthetics. Then there are substantive problems, like how the data itself is shown. And finally perceptual problems, how our brains interpret the visual.
Let's start with bad taste. What falls into that category?
This is where aesthetics really come in. Things that make a graph distracting, cluttered, hard to read, inconsistent design choices.
Where there's just too much going on?
Yeah, exactly. The book uses an example, figure one point four, of what it calls chartjunk.
Classic term, chartjunk. Love it. What's an example?
Oh, think like bars that are hard to distinguish, labels repeated everywhere pointlessly, maybe those fake 3D effects that add nothing.
Oh, the worst. Or drop shadows.
Yes, drop shadows, pointless textures. It's just visual clutter getting in the way of the actual data.
So the idea is keep it clean?
Pretty much. Less is often more. Every little line, every color should be there for a reason, helping the data speak. If it's not adding understanding, maybe it shouldn't be there.
It reminds me of Edward Tufte's work. The book mentions him, right?
Oh, yeah, Tufte's foundational. His concept of the data-to-ink ratio.
Right, maximize the ink that shows data, minimize the rest.
Exactly, get rid of the chart junk. He also talked about graphical excellence, showing interesting data clearly, efficiently, telling the truth about it, getting the most ideas across with the least visual noise.
Makes sense, simplify, remove extra gridlines, pointless colors.
Usually, yes. But the book throws in a really interesting curveball here. Some research, by Bateman and by Borkin, found that sometimes those more visually embellished graphs, almost like mini infographics, can actually be more memorable than the super simple, clean ones.
Really? That's counterintuitive. More memorable even if they're harder to read initially?
It seems so. Yeah, people might recall something visually unique or novel more easily later on.
Huh. So there's a bit of a trade-off, maybe, between immediate clarity and long-term recall.
Potentially. But the key is, memorable doesn't automatically mean easy to interpret accurately.
Right, which brings us to that third category of problems, perceptual issues.
Exactly. This is where it gets really fascinating. Even a clean, well-designed graph can unintentionally mislead people just because of how our brains work.
How so?
Well, the book shows an example with stacked bar charts, trying to compare the size of, say, the middle segment across several different bars. It's surprisingly difficult for our eyes.
Yeah, I can picture that. Your baseline keeps changing.
Right. And there's another example with lines that look like they're converging, getting closer, just because of the aspect ratio, the shape of the plot, even if the underlying data shows they're staying parallel.
Wow, So good taste isn't enough. You really need to understand perception.
You absolutely do. And our perception isn't uniform, right? Like how we see color. Our ability to distinguish shades changes across the spectrum.
And it depends on lightness too, doesn't it.
Yeah, chroma depends on luminance. It gets complex. That's why the book really pushes for using perceptually uniform color palettes.
Perceptually uniform. Okay, what does that mean exactly?
Imagine a color ramp where each step up represents an equal increase in the data value. A perceptually uniform palette makes those steps look equally spaced in color intensity.
Ah, so a non uniform one might make some small data changes look huge visually or vice versa.
Exactly, it avoids accidentally emphasizing or de emphasizing parts of the data just because of quirks in the color scale.
Okay, so the book talks about different types of these palettes.
Yeah, three main ones. First, sequential scales. Think low-to-high data like income, or maybe temperature if it's all positive.
Makes sense, like light blue to dark blue.
Right. Then you have diverging scales. These are for data with a meaningful midpoint, like zero: temperature changes, maybe deviations from an average.
Like that blue-to-red scale example, figure one point seventeen?
That's a classic. Zero or the midpoint is usually a neutral color like white or light gray, and the extremes diverge to two different hues.
Okay.
And the third type, qualitative palettes. These are for categorical data where there's no inherent order. Think countries, types of products, political parties.
So the goal there is just distinct colors?
Distinct, but also ideally with similar visual weight, so one category doesn't just pop out unintentionally. The bottom palette in that same figure, one point seventeen, is a good example.
It's really about making sure the visual differences match the data differences accurately.
Precisely. Using the wrong palette can really mess with interpretation.
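The three palette types can be sketched with base R's `hcl.colors()` (in grDevices since R 3.6.0), which draws on the HCL color model. The specific palette names below are picked from `hcl.pals()` for illustration and are an assumption; they are not necessarily the palettes shown in the book's figure.

```r
# Sequential: ordered low-to-high data.
seq_pal  <- hcl.colors(7, palette = "Blues 3")
# Diverging: a meaningful midpoint with two contrasting extremes.
div_pal  <- hcl.colors(7, palette = "Blue-Red")
# Qualitative: unordered categories with similar visual weight.
qual_pal <- hcl.colors(5, palette = "Dark 3")

# Draw each palette as a row of swatches.
op <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
barplot(rep(1, 7), col = seq_pal,  border = NA, axes = FALSE, space = 0, main = "Sequential")
barplot(rep(1, 7), col = div_pal,  border = NA, axes = FALSE, space = 0, main = "Diverging")
barplot(rep(1, 5), col = qual_pal, border = NA, axes = FALSE, space = 0, main = "Qualitative")
par(op)
```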
The book also mentions complexity overload trying to map too many things at once.
Yeah, like using size and shape and color and position all in one go. Unless the data has a really, really clear structure, it just becomes noise. Figure one point nineteen shows that well. Hard to track everything.
Too much happening. And what about gestalt rules?
Ah, yeah, that's about how our brains naturally look for patterns. We group things, we connect things. We see shapes even if they aren't really there sometimes.
Like seeing faces in clouds?
Kind of like that, yeah. One of the chapter one figures shows seemingly random dots, but you can't help trying to see clusters or lines. This is powerful if you use it right in visualization design, but it can also trick people into seeing patterns that are just random chance.
So understanding perception is crucial, which leads to how we actually encode data visually. The book talks about Cleveland and McGill's research.
Foundational stuff. Figure one point twenty-three summarizes it. They basically figured out which visual tasks we're best at perceptually.
Okay, what's at the top? What are we best at?
Judging position along a common scale. Think comparing bar heights in a standard bar chart. We're really accurate at that.
Makes sense, they all start from zero.
Right. Then comes position on unaligned, separate scales. Still pretty good. Then judging lengths, like line segments, but only if they share a common baseline.
Hmm. Okay, and what are we worse at?
Our accuracy drops off for judging lengths without a common baseline. Then things like angles, which is why pie charts can be problematic for comparison. And area, volume, and color saturation or hue are further down the list.
So this hierarchy should guide our choices. If you want people to compare values accurately, use...
Position along a common scale. Bar charts are often great for that. If you're showing trends, maybe line charts work well for judging slope or angle, though even that's not top tier.
It really highlights why choosing the right chart type matters so much for effective communication. It's about how easily the viewer can decode the information.
Exactly, and the book also stresses it's not just which channel you choose, like color or position, but how you implement it.
Like using a good sequential palette for ordered data, or distinct hues for categories.
Precisely the details of the implementation matter hugely.
Okay, this is great theory, but the book is also very practical, right? It dives into using R and ggplot2.
It does. It shifts gears into how you actually make these visualizations using code.
Now, programming can sound a bit scary. The book suggests starting with something called R Markdown. Why is that helpful?
R Markdown is fantastic for reproducibility. It lets you combine your code, your notes, and your output, the plots, the tables, all in one document.
So you can see exactly how you got a result?
Exactly. You write in plain text and embed chunks of R code. When you process the document, the code runs and the results get inserted right there. It's great for keeping track, sharing work, and avoiding that "how did I make this chart again?" problem.
That sounds incredibly useful. And R itself?
R is a super powerful language widely used in statistics and data science, and ggplot2 is this amazing package within R for visualization.
Built on the grammar of graphics. What's that about?
Think of it like a system for building graphs piece by piece. You start with your data, then you define aesthetic mappings, linking data variables to visual properties like x position, y position, color, size.
Okay, mapping data to visuals.
Then you choose geoms, the geometric objects like points, lines, bars that actually represent the data, and you layer these things together.
So it's a structured way to think about building any kind of plot?
Exactly. Very flexible, very powerful once you grasp the core ideas, developed by Leland Wilkinson and implemented in ggplot2 by Hadley Wickham.
And the book mentions the ecology of assistance being better now?
Yeah, basically meaning there's just so much help available online now. Websites like Stack Overflow, R communities, tutorials, blogs. It's much easier to get started and find answers when you get stuck than it used to be.
That's encouraging. So to get started, the book says, install the tidyverse, right?
The tidyverse is a collection of R packages including ggplot2, dplyr for data manipulation, and others, all designed to work together really well. You install it in RStudio, usually with just install.packages("tidyverse").
And the book suggests typing out the examples.
Yeah, it's good advice. Actually typing the code helps it sink in much better than just copy-pasting.
Good tip. And reassuringly, ggplot's defaults are pretty good?
Generally, yes. The default settings for colors, themes, et cetera are thoughtfully chosen. You can often get a decent-looking, informative plot without much tweaking, which is great for beginners.
Okay, let's get into those core ggplot concepts. First, aesthetic mappings using aes(). Break that down again.
Right. aes() is where you tell ggplot which variables in your data control which visual property. So with aes(x = gdpPercap, y = lifeExp, color = continent), gdpPercap controls the x axis, lifeExp controls the y axis, and the continent column controls the color.
Crucially, you're not saying which color, just what controls the color.
Exactly. ggplot handles assigning the actual colors, positions, et cetera based on the data values.
Okay, then geoms.
Geoms are the visual markers. geom_point() makes a scatterplot, geom_line() draws lines, geom_bar() makes bar charts, geom_smooth() adds a smoothed trend line. You add them to your plot with a plus sign.
So ggplot() sets up the canvas and mappings. Then you add + geom_point() or + geom_bar().
You got it. You build plots layer by layer.
And the importance of tidy data?
Ah, yes. Tidy data is a way of structuring your data set that ggplot and the tidyverse really prefer. Basically, each variable gets its own column, each observation gets its own row.
Like a long format, not wide?
Exactly. It might seem like a small detail, but organizing your data this way makes working with ggplot much, much smoother and more intuitive.
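As a rough illustration of the wide-versus-long distinction, here is a sketch using `pivot_longer()` from the tidyr package. The tiny table and its values are invented for the example and do not come from the book.

```r
library(tidyr)

# Hypothetical wide table: one column per year (values invented for illustration).
wide <- data.frame(
  country = c("A", "B"),
  `1997`  = c(70.1, 65.3),
  `2007`  = c(72.4, 68.0),
  check.names = FALSE
)

# Tidy (long) form: one row per country-year observation,
# with the former column names collected in `year`.
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "lifeExp")
long
```

In the long form, `year` is an ordinary variable that can be mapped to an aesthetic such as x or color.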
Got it. And this idea of inheritance of mappings?
That just means if you define mappings in the main ggplot() call, like ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)), any geoms you add later, like + geom_point() or + geom_smooth(), will automatically use those x and y mappings.
Unless you override them specifically in the geom?
Right. You can give a geom its own aes() mapping if needed, but inheritance saves a lot of typing for common mappings.
Okay, let's run through some practical plot examples from the book. Basic scatterplot: life expectancy versus GDP per capita, using the gapminder data.
Yep. That would be ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)). That sets it up. Plus geom_point(). Boom, scatterplot.
Simple enough. Add a smoother?
Just add + geom_smooth() on the next line. ggplot adds a default trend line, usually with a confidence band around it.
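Written out as R code, the scatterplot and smoother described here look roughly like the sketch below. It assumes the ggplot2 and gapminder packages are installed; the object names (`p`, `p_basic`, `p_override`) are just labels for this example. The last plot also shows the inheritance point: the layer-specific color mapping applies only to the points, so the smoother still draws one overall line.

```r
library(ggplot2)
library(gapminder)

# The base call sets the data and the x/y mappings that later layers inherit.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Scatterplot plus a default smoother; both layers inherit x and y.
p_basic <- p + geom_point() + geom_smooth()

# Override for one layer only: color is mapped inside geom_point(),
# so geom_smooth() does not see it and fits a single line.
p_override <- p +
  geom_point(mapping = aes(color = continent)) +
  geom_smooth()

p_basic
p_override
```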
Nice. Now, that GDP data is probably skewed, right? Lots of lower values, a few very high ones.
It usually is. Makes the scatterplot bunch up on one side.
So transforming the scale, like a log scale for the x axis, good idea?
Yes, add another layer: + scale_x_log10(). That transforms the x axis to a base-ten log scale, spreading the data out much better visually.
Okay, and making it look more professional? Titles, axis labels?
Use the labs() function. Add + labs(title = "My plot title", x = "GDP per capita", y = "Life expectancy").
Simple. And what if you want to format the axis labels, like showing dollars on the x axis?
That's where the scales package comes in handy. You modify the scale function, maybe like + scale_x_log10(labels = scales::dollar). Gives you nice dollar formatting.
Cool.
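Putting the log scale, dollar labels, and `labs()` together gives something like the following sketch. It assumes ggplot2, gapminder, and scales are installed; the title text, the `alpha` value, and the choice of `method = "lm"` are illustrative, not prescribed by the book.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +                  # semi-transparent points reduce overplotting
  geom_smooth(method = "lm") +               # straight-line fit on the transformed scale
  scale_x_log10(labels = scales::dollar) +   # base-10 log axis with dollar-formatted ticks
  labs(title = "My plot title",
       x = "GDP per capita",
       y = "Life expectancy")
p
```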
Now, what about mapping categories, like coloring the points by continent?
You add color = continent inside the aes() function, so aes(x = gdpPercap, y = lifeExp, color = continent).
And ggplot handles the rest?
Yep. It assigns a color to each continent and automatically adds a legend explaining the colors. If you also have geom_smooth(), you'll likely get a separate smooth line for each continent in its corresponding color.
Okay, this brings up that crucial difference, mapping versus setting. Making all points purple, for instance.
Right. If you put color = "purple" inside aes(), ggplot treats "purple" as a data value. It gives all points the same default color and makes a useless legend entry for "purple".
Because you mapped it to data.
Exactly. If you just want to set all points to be purple, you put color = "purple" outside aes(), inside the geom function itself, like geom_point(color = "purple").
No mapping, just setting a fixed visual property, no legend needed.
Precisely. Huge difference. Common point of confusion.
Makes sense.
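The difference is easiest to see side by side. A minimal sketch, assuming ggplot2 and gapminder are installed (the object names are made up for this example):

```r
library(ggplot2)
library(gapminder)

base <- ggplot(data = gapminder,
               mapping = aes(x = gdpPercap, y = lifeExp))

# Mapping: "purple" is treated as a one-value variable, so the points get the
# first default palette color and a legend with a single "purple" entry.
p_mapped <- base + geom_point(mapping = aes(color = "purple"))

# Setting: color is a fixed property of the layer, so the points really are
# purple and no legend is drawn.
p_set <- base + geom_point(color = "purple")

p_mapped
p_set
```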
Then there's faceting, splitting the plot into panels.
Yes, super useful. facet_wrap() lets you split by one categorical variable, arranging panels in a grid. facet_grid() lets you split by two variables, creating a 2D grid of plots.
Like that age versus children example, faceted by sex and race?
Exactly. You can compare relationships across different groups really effectively.
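A sketch of both faceting functions follows. The book's age-versus-children example uses survey data that is not assumed to be available here, so this substitutes gapminder for `facet_wrap()` and ggplot2's built-in `mpg` data for `facet_grid()`; the variable choices are illustrative.

```r
library(ggplot2)
library(gapminder)

# One categorical variable: a panel per continent.
p_wrap <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_line(aes(group = country), color = "gray60") +
  facet_wrap(~ continent)

# Two categorical variables: rows by drive type, columns by cylinder count.
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```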
What about visualizing just one continuous variable?
Histograms: geom_histogram(). You map your variable to x, like aes(x = area). It bins the data and shows counts as bars. You might need to adjust the binwidth argument to get a good view. Or density plots, geom_density(). Similar idea, but it gives you a smooth curve, an estimated distribution. Often a nice alternative or complement to histograms.
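For a single continuous variable, a minimal sketch using the `midwest` data set that ships with ggplot2 (the `binwidth` of 0.01 is an arbitrary illustrative choice):

```r
library(ggplot2)

# Histogram: bin the `area` variable and show counts as bars.
p_hist <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_histogram(binwidth = 0.01)

# Density plot: a smoothed estimate of the same distribution.
p_dens <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_density()

p_hist
p_dens
```

Trying a few different `binwidth` values is usually worthwhile, since the apparent shape of a histogram can change with the bin size.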
And briefly, graph tables?
geom_table(), from an extension package rather than ggplot2 itself, allows embedding a small table right onto the plot. Can be handy for showing summary stats alongside the visual.
Okay, crucial step: saving your masterpiece. How do we save plots?
Easiest way is ggsave(). After you display your plot, type ggsave("myplot.pdf") or ggsave("myplot.png"). It saves the last plot by default.
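A short sketch of `ggsave()` in use. To avoid writing into a working directory, this example saves into R's temporary directory; the plot, file names, and dimensions are illustrative assumptions. Passing `plot = p` explicitly avoids relying on the last-displayed plot.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_histogram(binwidth = 0.01)

out_dir  <- tempdir()                          # temporary folder for this session
pdf_path <- file.path(out_dir, "myplot.pdf")   # vector format
png_path <- file.path(out_dir, "myplot.png")   # raster format

# The file extension decides the output format; sizes are in inches by default.
ggsave(pdf_path, plot = p, width = 6, height = 4)
ggsave(png_path, plot = p, width = 6, height = 4, dpi = 300)
```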
PDF versus PNG, you mentioned. Vector versus raster?
Yeah. Vector formats like PDF or SVG are usually best. They store the plot as lines and shapes, so you can resize them infinitely without getting blurry. Good for publications.
Raster formats like PNG or JPEG are pixel based, so they can get blocky if you enlarge them too much.
Right, use vector when you can, especially for line art like most plots.
And the here package for file paths?
Highly recommend it. It helps make your file paths relative to your project root directory, so your code doesn't break if you move the project folder or share it. Much more robust. here::here("output", "myplot.pdf"), kind of thing.
Okay. Chapter five delves into refining plots. Key takeaways?
Main things: every aesthetic mapping has a scale, a scale_ function you can adjust, like ticks and labels. Scales often have guides, legends, which you can tweak with guides().
And cosmetic changes?
Often done in the geom function itself, or using the big theme() function for overall plot appearance: background, gridlines, fonts, legend position.
Right, like theme(legend.position = "bottom"). And labs() for all labels?
Yep, labs() is your friend for axis titles, legend titles, plot titles, subtitles, captions. Keep things clearly labeled.
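The three levers mentioned here (scales and their guides, `labs()`, and `theme()`) can be combined as in this sketch. It assumes ggplot2 and gapminder are installed; the legend title and the `override.aes` tweak are illustrative choices.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder,
            aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +                            # adjust the x scale
  labs(color = "Continent") +                  # legend title via labs()
  guides(color = guide_legend(
    nrow = 1,                                  # lay the legend keys out in one row
    override.aes = list(alpha = 1)             # draw legend keys fully opaque
  )) +
  theme(legend.position = "bottom")            # cosmetic placement via theme()
p
```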
Moving on, Chapter six touches on visualizing models, not just raw data.
Yeah, visualization is huge for understanding model results too. The broom package is amazing here.
How so?
It tidies up messy model output. tidy() gives you coefficients, p-values, et cetera, in a nice table. augment() adds predictions and residuals back to your original data. glance() gives model summary stats.
So you can then plot those tidy results?
Exactly. You can plot coefficients using packages like coefplot. You can generate predictions from your model using predict(), then plot those predictions, maybe with confidence intervals using geom_ribbon(). Makes models much less abstract.
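A sketch of the broom workflow on a simple linear model. The model formula is an arbitrary illustration rather than one taken from the book, and the coefficient plot is built directly with ggplot2 instead of a dedicated coefficient-plot package.

```r
library(broom)
library(ggplot2)
library(gapminder)

# An illustrative model: life expectancy on log GDP per capita and continent.
fit <- lm(lifeExp ~ log(gdpPercap) + continent, data = gapminder)

coefs      <- tidy(fit, conf.int = TRUE)   # one row per term, with confidence intervals
fitted_df  <- augment(fit)                 # observations plus .fitted, .resid, etc.
summary_df <- glance(fit)                  # one-row model summary (R-squared, AIC, ...)

# Plot the estimates (dropping the intercept) with their intervals.
p_coef <- ggplot(subset(coefs, term != "(Intercept)"),
                 aes(x = estimate, y = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "Estimate", y = NULL)
p_coef
```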
Chapter seven: maps. Another visual form, but with unique issues.
Definitely, maps are tricky. Data is often tied to geographic units of varying sizes and populations, the modifiable areal unit problem.
Like how election maps can look different depending on whether you color states or counties or scale by population.
Precisely. The book shows twenty sixteen election maps illustrating this. It also stresses looking at baseline variables like population density when interpreting choropleth maps, maps colored by value.
Right, because a large, sparsely populated area colored dark red might represent fewer actual votes than a small, dense area colored lighter red.
Exactly. Gotta be careful with interpretation.
Chapter eight gets back to color. Importance of colorblind friendliness?
Crucial. About eight percent of men have some form of color vision deficiency. Use palettes designed to be distinguishable for everyone. Packages like dichromat and colorblindr help with this in R.
Good point. And using manual colors when they have meaning, like political parties, can be useful.
But the book cautions against stereotypes. Always test.
And themes?
Themes are great for changing the whole look and feel quickly. ggthemes has presets like theme_economist() or theme_wsj(), or you can customize endlessly with the base theme() function.
Chapter eight also warns about some plot types. Dual y axes?
Yeah, generally avoid dual y axes if possible. It's very easy to mislead by manipulating the axis ranges independently. The book suggests alternatives like indexing data to a common start point or plotting the difference.
And the perennial favorite target, pie charts?
Ha, yes. Generally poor for comparisons, especially if there are many slices or the values are close. Bar charts are almost always better for showing parts of a whole or comparing amounts.
Okay. Finally, the appendix emphasizes reproducibility and workflow.
Super important. Use R Markdown, keep your data tidy, write functions for repetitive tasks, know how to find help using the built-in help and package vignettes in R. It makes your analysis more reliable, understandable, and repeatable.
So wrapping up data visualization is incredibly powerful for insight.
But it requires thought: understanding perception, design principles, honest representation.
And tools like R and ggplot2, as the book shows, give you the practical means to do it well.
Absolutely hopefully you the listener feel a bit more ready to dive in and visualize your own data.
Yeah, those aha moments when you see a pattern are really rewarding.
Definitely, and don't be intimidated by the learning curve. The resources and community support are really strong now.
So here's a thought to leave you with. Think about the last confusing chart you saw. How could some of these ideas, clarity, perception, honesty, have made it better?
And maybe more importantly, what stories are hiding in your data waiting to be visualized.
We really encourage checking out the book, Data Visualization: A Practical Introduction, and exploring those R packages: tidyverse, ggplot2, scales, here, ggthemes, broom. Lots of great tools.
Plus, remember: looking at data, visualizing it, is not a substitute for thinking carefully.
Right, but it's an absolutely essential part of asking better questions and getting to those deeper insights.
