Welcome to the deep dive. This is where we take a whole stack of complex information, research notes, articles, and really boil it down to the essentials for you. And today we're plunging into a topic that's just, well, critical for any serious developer: high-performance PostgreSQL, specifically within Ruby on Rails applications. Our mission, really, is to give you those key insights that will make your apps faster,
more reliable, definitely more resilient. What's kind of cool is that our main source here is actually a beta book. It's still being developed, so you're getting some really cutting-edge, very practical info straight from the trenches, you might say.
Yeah, and that's exactly why this is so relevant right now. Getting really good at PostgreSQL and Rails, that's not just helpful for your career, it's massively in demand, right? Just think about it. A Hired survey from 2023 found Ruby on Rails was actually the most sought-after skill. We're talking 1.64 times more interview requests if you know your stuff.
Wow, 1.64 times?
Yeah. And PostgreSQL, it's consistently winning awards. It was number one in the 2022 Stack Overflow survey for most used database among pros, and it's topped the DB-Engines ranking three times. So yeah, the time spent learning this, it's definitely high-impact knowledge.
Okay, all right, let's get into it then. Before we can even, you know, talk about performance tuning, we need somewhere to actually do the tuning. We need a test bed. So this source material, it introduces a fictional app called Rideshare. What exactly is that? And, I guess, how does it help us learn this stuff?
Right.
So Rideshare is basically designed as this simplified, API-only web app. Think of it like a mini Uber or Lyft. You know, it's got the core Active Record models you'd expect: drivers, riders, trips, trip requests.
That's worth noting. And Active Record, just quickly, that's the ORM, right? The object-relational mapper in Rails?
Exactly. It's the magic layer that connects your Ruby code, your classes, directly to the database tables. And it really leans into that Rails philosophy of convention over configuration.
Ah, right, like a Driver model maps to a drivers table.
Automatically, precisely. You don't have to spell it all out. And for managing schema changes, Rideshare uses the standard Rails stuff: db/structure.sql along with Active Record migrations. Under the hood, that uses pg_dump to capture the structure.
You know, one thing I found really valuable in the source was this strong push to set up and use PostgreSQL locally. Like, really make it your own little lab.
Totally. It's not just theory.
Having it run locally gives you complete control, a safe space to experiment, which, let's face it, is absolutely essential when you're messing with performance settings. You don't want to test this stuff live. Definitely not. And getting Rideshare running is designed to be pretty smooth. You use Homebrew, rbenv for your Ruby version, Bundler for gems, and then just the standard bin/rails commands: db:create,
db:migrate, dbconsole. Simple stuff. Once you're set up, you immediately start bumping into core PostgreSQL ideas, like SQL being a declarative language.
Okay, what does that actually mean? Declarative?
It just means you tell PostgreSQL what you want, like "get me all the trips from yesterday," and you don't specify how to get it. PostgreSQL's optimizer, its brain, figures out the most efficient way to execute that request.
Ah, so you declare the result and it handles the process. Like ordering food.
Exactly, like ordering food. And how it figures that out, that's the query execution plan. It's like the database's internal recipe for fetching your data. Plus you've got functions, built-in ones and ones you can define yourself, which really let you push complex logic down into the database itself. Seriously, we can't stress this enough: get Rideshare set up locally. It really is the perfect lab for practicing everything we're about to cover.
Okay, performance lab established. Rideshare is running locally, safe space acquired. But to really know what's going on, we need tools to look inside PostgreSQL, right? How do we peek under the hood?
Yeah, the psql command line tool is kind of your main entry point there. There are a few meta-commands you'll use all the time, like \x, which toggles expanded view and makes query results way easier to read.
Oh yeah, that's super helpful.
And \e is great. It opens your current query in your text editor. Lifesaver for complex SQL. Then there's \l to just list your databases. You can also customize psql using a ~/.psqlrc file: add aliases, change the prompt, whatever makes you comfortable. And for enabling some really powerful extensions like pg_stat_statements, which we'll definitely circle back to, you need to edit your postgresql.conf file, specifically the shared_preload_libraries setting.
Ah, okay, and that config change needs a restart?
Right, that specific one does.
Yeah, shared_preload_libraries needs a full PostgreSQL restart to take effect.
Got it. So, okay, we can configure things, but how do we see what the database is doing, like, right now? Is it just log files, or is there a more direct way? I remember this one time a query just ran forever and almost took down the whole app. If only I'd known about pg_stat_activity.
Oh, absolutely. pg_stat_activity is exactly that. It's your real-time dashboard. You see every connection and what state it's in: active, idle, or, crucially, idle in transaction. And you see background processes too, like autovacuum.
So you can spot those long-running queries.
Yep.
You can see the query text, find its process ID, the PID, and then, if you really need to, you can try to cancel it gracefully with pg_cancel_backend or, in an emergency, terminate it with pg_terminate_backend. Use that last one carefully, though.
Right, termination is a bit heavy-handed.
It can be. Now, this ties into understanding pessimistic locking. It's what PostgreSQL does by default. You'll see shared locks and exclusive locks. The main takeaway: you really want to minimize how long you hold exclusive locks, because they block everyone else trying to access that same data.
And that can lead to deadlocks.
Exactly.
Deadlocks are the worst-case scenario: two transactions waiting for each other, stuck forever. PostgreSQL will detect and break them, but it means one transaction fails. You can see current lock information using the pg_locks view.
Okay, so lots to monitor. How about experimenting safely?
The source suggests using generate_series to create lots of fake data. You can do this in a separate experiments database. It's a great way to simulate production-level load without touching your real development data.
That's smart.
And here's something that often surprises people: PostgreSQL has transactional DDL.
Transactional DDL? Data definition language, like CREATE TABLE?
Exactly.
Schema changes, CREATE INDEX, ALTER TABLE ADD COLUMN, they happen inside a transaction just like data changes. So you can literally type BEGIN, then CREATE INDEX on a column of your table, then realize you made a mistake and type ROLLBACK, and that index just, poof, never happened.
Whoa.
Okay, that's huge for safety. No half-applied schema changes.
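A minimal sketch of that BEGIN, CREATE INDEX, ROLLBACK sequence, not taken from the source. To stay runnable without a server it uses Python's built-in sqlite3 module as a stand-in (SQLite also treats DDL as transactional); against PostgreSQL the same three statements would be typed into psql. The table and index names are made up for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN / ROLLBACK below are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, rider_id INTEGER)")


def index_names(connection):
    """Return the names of all indexes currently in the schema."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


conn.execute("BEGIN")
conn.execute("CREATE INDEX idx_trips_rider_id ON trips (rider_id)")
inside_txn = index_names(conn)      # the index is visible inside the transaction
conn.execute("ROLLBACK")
after_rollback = index_names(conn)  # and gone once the transaction is rolled back

print(inside_txn, after_rollback)
```

One PostgreSQL-specific caveat: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so this safety net applies to the plain, locking form of the command.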
It's an incredible safety net. It means your migrations either succeed completely or fail completely, which brings us back to safe experimentation. Always test schema changes in staging first, and maybe even use read-only database users in production for certain monitoring tasks. Using roles like pg_read_all_data or pg_monitor just adds another layer of safety.
Makes sense. Okay, let's switch gears a bit. Data correctness, data consistency, obviously super important. In Rails, we often reach for Active Record validations, but the source argues pretty strongly for using database-level constraints too. What's the thinking there? Are they redundant?
Not redundant, complementary. That's the key. Active Record validations are great, essential even, for catching errors early at the application layer and providing good user feedback. But database constraints offer stronger guarantees because they're enforced inside the database engine, which is built specifically to handle high-concurrency transactions and maintain data isolation in ways an application layer just can't.
Okay, stronger guarantees. What kind of constraints are we talking about beyond, say, primary key or NOT NULL?
Oh, there's a whole suite.
You've got unique constraints, obviously, foreign key constraints to maintain relationships between tables, check constraints for custom rules, and even more advanced exclusion constraints.
Let's take unique If I want to add one, but I already have duplicate data, what do I do?
Good question. You typically need to clean up that data first. The source mentions using a common table expression, a CTE, with the ROW_NUMBER window function. It's a neat trick to identify and then delete the duplicates before you apply the unique constraint.
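A small sketch of that cleanup, under stated assumptions: the riders table and its email column are hypothetical, and Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) stands in for PostgreSQL so the example runs on its own. The CTE and ROW_NUMBER() syntax shown here is also valid PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE riders (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO riders (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("a@example.com",)],
)

# Number the rows within each group of identical emails; anything
# numbered 2 or higher is a duplicate of the lowest-id row.
conn.execute("""
    WITH ranked AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
        FROM riders
    )
    DELETE FROM riders
    WHERE id IN (SELECT id FROM ranked WHERE rn > 1)
""")

# With the duplicates gone, the unique index can be created.
conn.execute("CREATE UNIQUE INDEX idx_riders_email ON riders (email)")
remaining = conn.execute("SELECT id, email FROM riders ORDER BY id").fetchall()
print(remaining)
```

Keeping the lowest id per email is just one possible policy; which duplicate survives is a business decision the query has to encode in its ORDER BY.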
Okay, and foreign keys. Rails didn't always support those natively, did it?
Right. Native support landed in Rails 4.2. Before that you used gems. But yeah, they're fundamental. They ensure, for example, that you can't delete a rider if they still have associated trips in the database. That prevents orphaned records.
Got it. What about check constraints? You said custom rules.
Yeah, they're really flexible. Anything that evaluates to true or false. Like, you could enforce that a trips table's completed_at timestamp must always be later than its created_at timestamp. Simple, powerful rule.
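To make that rule concrete, here is a hedged sketch rather than the book's own schema: the trips table, its column names, and the constraint name are assumptions, and Python's built-in sqlite3 module stands in for PostgreSQL so it can run anywhere. The CHECK clause itself reads the same in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trips (
        id INTEGER PRIMARY KEY,
        created_at TEXT NOT NULL,
        completed_at TEXT,
        CONSTRAINT chk_completed_after_created
            CHECK (completed_at IS NULL OR completed_at > created_at)
    )
""")

# A trip that completes after it was created passes the check.
conn.execute(
    "INSERT INTO trips (created_at, completed_at) VALUES (?, ?)",
    ("2024-01-01 10:00:00", "2024-01-01 10:30:00"),
)

# A trip that "completes" before it was created is rejected by the database.
try:
    conn.execute(
        "INSERT INTO trips (created_at, completed_at) VALUES (?, ?)",
        ("2024-01-01 10:00:00", "2024-01-01 09:00:00"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(rejected, row_count)
```

The timestamps are stored as ISO-8601 text here only because SQLite has no timestamp type; in PostgreSQL the columns would be real timestamp columns and the comparison would work the same way.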
Okay, that makes sense. But this raises a practical point. How do you add a check constraint like that to a table that's already huge and getting hammered with traffic? Wouldn't that lock it up while it checks millions of old rows?
Exactly the problem, and there's an elegant solution. You do it in two steps using Rails migrations. First you call add_check_constraint, but pass the option validate: false. This tells PostgreSQL: okay, enforce this rule for all new or updated rows from now on, but don't check the old ones yet. That part is super fast.
Ah, so it doesn't block.
Right.
Then, in a separate, later migration, you run validate_check_constraint for that same constraint. This tells PostgreSQL: okay, now go back and check all the existing rows. But it does so without taking such a heavy lock. It avoids that downtime.
Clever two steps. What about deferring constraints?
Yeah, DEFERRABLE INITIALLY DEFERRED. You can apply this to unique, primary key, foreign key, and exclusion constraints. It means the constraint check is postponed until the very end of the transaction. Super useful for things like, say, reordering items in a list where each item needs a unique position. You might temporarily have duplicate positions during the transaction while you swap things around, but as long as it's fixed by the time you commit, it's okay.
Interesting, okay. You also mentioned exclusion constraints. Those sound advanced.
They are powerful and less common, but they solve specific problems really well. They prevent overlapping data across multiple rows in the same table. The classic example is preventing overlapping time ranges, like booking a meeting room, or, in Rideshare's case,
maybe preventing overlapping vehicle reservations. They usually require an extension like btree_gist, and often use range types like tstzrange for timestamp ranges, along with the overlap operator, &&.
Okay, quick detour: case-insensitive unique emails. Common problem. How does PostgreSQL handle that?
Two main ways, really. You could use the citext extension, which provides a case-insensitive text type, or you can use generated columns.
Generated columns? Like virtual columns, sort of?
Yeah, you define a column that's automatically computed based on others, so you could have a lower_email generated column that always stores lower(email). Then you put a regular unique index on that generated column. Rails actually supports these now too.
Neat. And quickly, enums and domains?
Right. CREATE TYPE ... AS ENUM lets you define a fixed list of allowed string values for a column, like trip statuses: pending, active, completed. CREATE DOMAIN lets you create a custom data type based on an existing one, but add check constraints to it, making reusable validation rules. Both useful, different trade-offs.
Okay, this is great for ensuring data integrity, but let's talk about actually changing the database schema on a busy production system. That moment when you run rails db:migrate, it can be terrifying.
Oh yeah, the dreaded migration lock.
Exactly. Some ALTER TABLE operations take what's called an ACCESS EXCLUSIVE lock, right? And that just blocks everything, reads and writes. Your app grinds to a halt. How do we avoid that absolute nightmare?
Right.
That exclusive lock is the enemy on a busy system. The absolute key here, your best friend really, is the CONCURRENTLY keyword for operations like CREATE INDEX or DROP INDEX. Adding CONCURRENTLY tells PostgreSQL to do the work without taking that heavy lock. It takes longer and uses more resources, but your application stays online.
It's a life saver.
So CREATE INDEX CONCURRENTLY, DROP INDEX CONCURRENTLY. Are there other ways to stay safe?
Definitely.
There's a fantastic Ruby gem called Strong Migrations. You add it to your development environment and it actively watches your migrations. If it spots a potentially dangerous operation, one that would take an ACCESS EXCLUSIVE lock and likely cause downtime, it'll either warn you, suggest a safer multi-step alternative like the check constraint example, or even prevent the migration from running in production by default.
Oh wow, so it enforces safer practices during development.
Exactly. It catches things early. Beyond that, you need safeguards at the database level too, especially with high concurrency. Setting a lock_timeout is crucial.
lock_timeout, that limits how long a query waits for a lock?
Precisely. If a query can't get the lock it needs within, say, fifty milliseconds, it gets canceled instead of just sitting there waiting indefinitely and potentially holding up other processes. Similarly, statement_timeout puts a cap on how long any single SQL statement is allowed to run, which prevents runaway queries from hogging resources. The source also mentions enabling log_lock_waits and tuning deadlock_timeout for better visibility in your logs when contention happens.
Okay, timeouts are key. What about removing columns? I've heard that can cause weird errors too.
Ah, yes, the stale schema cache problem. This happens when you drop a column, but some of your running Rails application servers haven't picked up the schema change yet, and they try to query the column that no longer exists.
Boom, error.
Right, So how do you remove a column safely?
The recommended way is using ActiveRecord::Base.ignored_columns. It's a multi-step process. First, you add the column name to ignored_columns in your Rails model and deploy that code. Now Rails simply pretends the column doesn't exist, even though it's still in the database. Then, once you're sure no code is using it, you create and run a migration that calls remove_column to actually drop it. Finally, you remove the column name from ignored_columns in a later deploy. It's gradual and safe.
Makes sense, gradual removal. Now, this brings up another big one. What if you add a new column and need to populate it for millions of existing rows? Backfilling data without downtime?
Yeah, that's a classic challenge. Running one massive UPDATE statement is usually out of the question. Too slow, too much locking. You need online backfilling strategies. One approach is double writing, or dual writes. You modify your application code to write to both the old location, if any, and
the new column simultaneously. You run a background job to backfill the new column for old records, and once it's done, you switch reads to the new column and eventually remove the dual-write logic and the old column.
Okay, double writing. Any other ways?
Another technique involves using intermediate tables. You create a temporary table, maybe just with the primary key and the new column value. You populate that table, perhaps marking it UNLOGGED so it doesn't hit replication, maybe disabling autovacuum on it temporarily for speed. Then you batch update the main table
from this intermediate table. And crucially, all these backfilling processes need to be done in small, manageable batches, often with some kind of throttling or delay between batches, to avoid overwhelming the database or causing replication lag.
Batching and throttling, got it. Okay, let's shift focus to Active Record itself. It's amazing, and Rails' convention over configuration is great, but it can sometimes generate pretty inefficient queries if you're not paying attention, right?
Absolutely. The abstraction is powerful, but it can hide what's actually happening. So, connecting this to the bigger picture, it's about making your Rails app smarter in how it communicates with PostgreSQL.
How do we even spot the bad queries easily?
Well?
One really helpful thing in Rails 7 and later is the improved query logs. They can automatically add context, like which controller and action triggered the query, right into the SQL log output. That makes tracing a slow query back to your application code much, much easier.
That sounds useful. What's a common inefficiency pattern?
Oh?
The absolute classic is the N+1 query problem. You see it everywhere.
Ah, yes. Fetch one thing, then loop and fetch related things one by one.
Exactly. Like, load one hundred blog posts, and then inside the loop, for each post, run another query to get its author. That's 101 database queries when it could probably be just two. Kills performance.
Right. So how do we fix N+1?
The primary solution is eager loading. You tell Active Record up front what associated data you'll need, using methods like .preload or .includes. Rails then cleverly figures out how to load all that data in a minimal number of queries, usually just one extra query for each association.
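The 101-versus-2 arithmetic can be sketched outside Rails. What follows is an illustration, not Active Record's implementation: it uses Python's built-in sqlite3 module as a stand-in database, and the posts and authors tables, plus the select helper that counts queries, are hypothetical. The second half mimics what .preload does conceptually: one query for the posts, one IN query for their authors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors (id, name) VALUES (?, ?)",
                 [(i, f"author {i}") for i in range(1, 11)])
conn.executemany("INSERT INTO posts (author_id, title) VALUES (?, ?)",
                 [((i % 10) + 1, f"post {i}") for i in range(100)])

queries = []  # every SELECT issued through select() is recorded here


def select(sql, params=()):
    """Run a SELECT and record it so the queries can be counted."""
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


# N+1: one query for the posts, then one more per post for its author.
posts = select("SELECT id, author_id FROM posts")
for _, author_id in posts:
    select("SELECT name FROM authors WHERE id = ?", (author_id,))
n_plus_one = len(queries)

# Eager loading: one query for the posts, one for all the authors they need.
queries.clear()
posts = select("SELECT id, author_id FROM posts")
author_ids = sorted({author_id for _, author_id in posts})
placeholders = ", ".join("?" for _ in author_ids)
authors = dict(select(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})", author_ids))
eager = len(queries)

print(n_plus_one, eager)
```

The per-query cost here is tiny because the database is in memory; against a real server each of those 101 queries also pays a network round trip, which is where the N+1 pattern hurts most.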
.preload and .includes, got it.
And a quick tip: if you've already eager loaded data into an array of objects, use .size to get the count, not .count. .size works on the loaded array in memory, while .count might trigger another database query unnecessarily.
Good tip. Any other ways to prevent N+1?
Yeah, Rails 6.1 introduced strict loading. You can enable it per association or globally. If you try to access an association that wasn't explicitly eager loaded, it raises an error instead of silently running the N+1 query. It forces you to be explicit and prevents the problem by default.
Ooh, I like that force the good behavior. What about optimizing individual queries?
Simple things first: always use LIMIT if you don't need all possible results. Don't pull back ten thousand rows if you only display twenty. Also, the RETURNING clause on INSERT. Active Record's insert_all method supports this now. It lets you get back the IDs or other columns of the rows you just inserted without needing a separate SELECT query afterwards. Saves a round trip.
Nice. And processing large amounts of data?
Use Active Record's batching methods, .find_each or .in_batches. They retrieve records in batches, one thousand by default, which keeps memory usage low and avoids overwhelming the database or your application with enormous result sets. Much more reliable for large tables.
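As a rough illustration of the idea behind those batching methods (paging by primary key rather than loading everything), here is a sketch that is not Rails code. It uses Python's built-in sqlite3 module as a stand-in database; the trips table and the in_batches generator are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, rating INTEGER)")
conn.executemany("INSERT INTO trips (rating) VALUES (?)",
                 [(i % 5 + 1,) for i in range(2500)])


def in_batches(connection, batch_size=1000):
    """Yield lists of rows ordered by primary key, batch_size at a time."""
    last_id = 0
    while True:
        rows = connection.execute(
            "SELECT id, rating FROM trips WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id we saw


batch_sizes = [len(batch) for batch in in_batches(conn)]
print(batch_sizes)
```

Filtering on `id > last_id` instead of using OFFSET means each batch is an index range lookup, so later batches cost about the same as the first one.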
Okay, batching is key. What about more complex SQL logic? Does Active Record help there?
It does, increasingly so. Active Record has good support for subqueries now. You can construct queries where part of the WHERE clause, for instance, is itself another SQL query. That's useful for things like finding drivers who have completed more trips than the overall average number of trips per driver.
Okay, subqueries. I've also heard about CTEs, common table expressions.
Yes, CTEs, using the WITH keyword in SQL, are fantastic for complex queries. They let you break down a big, hairy query into smaller, named, logical steps. It makes the SQL vastly more readable and maintainable. Active Record has ways to build queries using CTEs too.
So, better organization. What about views?
Database views are another way to encapsulate complex SQL. You define the query logic as a view directly in PostgreSQL, and then you can query it from Rails almost like a regular table. The Scenic gem is very popular for managing database views within your Rails migrations.
Okay, views and materialized views. What's the difference?
Ah, now this is where it gets really interesting. Materialized views take it a step further. They don't just store the query definition. They actually execute the query and store the results physically, like a cache table.
So queries against them are super fast.
Lightning fast, because the complex calculation or join is already done. They're perfect for complex reports or dashboard data that doesn't need to be absolutely real time.
And the cool part?
You can often refresh them concurrently, updating the stored data without locking readers, provided the materialized view has a unique index defined on it. Zero-downtime updates for your cached data.
Wow, concurrent refresh that's powerful. What about caching within rails itself?
Right.
Rails has layers too. The query cache is automatic within a single controller action. If you run the exact same SQL query twice in one request, Rails caches the result of the first call and returns it instantly the second time, avoiding a redundant database hit.
Okay, automatic, per request.
Then there are prepared statements. PostgreSQL can parse and plan a query once and reuse that work for subsequent executions with different parameters. Rails uses these under the hood. Rails 7 improved how it handles them, making reuse more likely by enumerating columns, which helps reduce parsing overhead.
And counter caches?
Counter caches are a common pattern, not strictly a Rails feature, but supported by it. You add an integer column to a model, say trips_count on the User model. Then you configure the Trip model to automatically increment or decrement that counter whenever a trip is created or destroyed for that user.
So you get counts almost instantly, without a slow query.
Exactly. Reading the count is just reading an integer column. The trade-off is slightly increased write latency, since you have to update the counter, and the potential for slight inconsistencies if things go wrong. But they're often a huge performance win for frequently needed counts.
Makes sense. What about calculations like averages or sums?
Do them in the database. Don't pull a million numbers back to Ruby just to calculate the average. Use SQL aggregate functions like AVG, SUM, COUNT, MAX, MIN. Active Record provides methods like .average, .sum, et cetera, that generate the correct SQL. It drastically reduces data transfer and Ruby object allocation.
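A tiny side-by-side of the two approaches, as a sketch only: Python's built-in sqlite3 module stands in for PostgreSQL, and the trips table with its rating column is a made-up example. The point is the shape of the result, every row versus a single row, not the timing on five records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, rating INTEGER)")
conn.executemany("INSERT INTO trips (rating) VALUES (?)",
                 [(r,) for r in (5, 4, 3, 4, 4)])

# In the application: fetch every value, then compute.
ratings = [r for (r,) in conn.execute("SELECT rating FROM trips")]
app_side_avg = sum(ratings) / len(ratings)

# In the database: a single row comes back, however large the table is.
db_avg, db_count, db_max = conn.execute(
    "SELECT AVG(rating), COUNT(*), MAX(rating) FROM trips").fetchone()

print(app_side_avg, db_avg, db_count, db_max)
```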
Right, less object allocation. Is that a big deal?
It can be, especially in tight loops or high-throughput scenarios. Every Active Record object Rails creates consumes memory and CPU cycles.
If you only need raw
data, maybe for a report or an export, using lower-level methods like .find_by_sql or even ActiveRecord::Base.connection.execute can be significantly faster. The raw connection methods return simpler data structures, hashes or arrays, instead of full-blown Active Record objects, dramatically reducing that allocation overhead.
Okay, lots of ways to optimize within Rails, but let's go deeper into the database itself. When a query hits PostgreSQL, how do we see what it's really doing? And how do we choose the right index to speed it up?
Right, now we're getting into database-level observability. A really crucial tool here is the pg_stat_statements extension we mentioned earlier. You enable it in postgresql.conf.
What does it track?
It tracks execution statistics for every normalized query run against your database: total time spent, average time, how many times it was called, rows returned, et cetera. It's invaluable for finding your most expensive queries globally, the ones consuming the most database time overall.
So you can see the biggest hitters across the whole application.
Exactly. It helps you prioritize optimization efforts. For a more visual approach, PgHero is a great dashboard tool. Many teams use it. It gives you a web UI to see slow queries. It can analyze query plans using EXPLAIN, show you large tables and indexes, find unused indexes, and even show currently running queries.
PgHero, okay. But the core tool is EXPLAIN, right?
EXPLAIN is fundamental.
Running EXPLAIN before a query shows you PostgreSQL's plan for executing it. Adding ANALYZE actually runs the query and shows you the plan along with the actual time spent and rows returned at each step.
So EXPLAIN ANALYZE gives you the real picture.
Yes. You can also add BUFFERS to see buffer usage, like cache hits, and FORMAT YAML or JSON for easier parsing. It shows you things like whether it's doing a sequential scan, reading the whole table, which is usually bad, or an index scan, or ideally an index-only scan. It shows estimated costs and filter conditions. It's how you diagnose a slow query.
Which leads to the question: how do you make sure PostgreSQL uses the index you think it should, or figure out if you're missing one?
Finding missing indexes often starts with looking for those slow sequential scans in EXPLAIN plans or in pg_stat_statements data. PgHero also has a suggested indexes feature based on its analysis. The rails_best_practices gem can sometimes flag missing foreign key indexes too.
Can you make PostgreSQL log slow queries automatically?
Yep. Set log_min_duration_statement in postgresql.conf.
Yeah.
Any query taking longer than, say, 500 milliseconds gets logged automatically. Even better, you can use the auto_explain extension, which automatically logs the EXPLAIN plan for slow queries, so you see exactly why it was slow, right in your logs.
auto_explain, nice. Okay, let's talk index types. Single column, multi-column, what's the strategy?
Well, a multi-column index on (a, b, c) can be used for queries filtering on a, or a and b, or a, b, and c, but generally won't be used efficiently if you
only filter on b or c.
It's left to right. You need to think about your query patterns. Sometimes multiple single-column indexes are better. Sometimes a carefully ordered multi-column index is key. You also want to avoid redundant indexes, like having an index on a and another on (a, b). The first one might be unnecessary.
Okay, what about indexing on, like, a lowercased email?
That's indexes on expressions. You can literally do CREATE UNIQUE INDEX ON users (lower(email)). The index stores the result of the lower function, allowing fast case-insensitive lookups or uniqueness checks.
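Here is a short, hedged sketch of that expression index in action. It runs against Python's built-in sqlite3 module as a stand-in (SQLite also supports indexes on expressions), the index name is made up, and the users table is reduced to the one column that matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
# The index stores lower(email), so uniqueness is enforced case-insensitively.
conn.execute("CREATE UNIQUE INDEX idx_users_lower_email ON users (lower(email))")

conn.execute("INSERT INTO users (email) VALUES (?)", ("Henry@Example.com",))
try:
    # Same address, different casing: the unique expression index rejects it.
    conn.execute("INSERT INTO users (email) VALUES (?)", ("henry@example.com",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Lookups must use the same expression for the index to be applicable.
found = conn.execute(
    "SELECT email FROM users WHERE lower(email) = ?", ("henry@example.com",)
).fetchall()
print(duplicate_rejected, found)
```

The original casing is preserved in the table; only the index sees the lowercased value, which is the main difference from lowercasing emails before saving them.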
Cool. Now, GIN indexes, those sound different.
Yeah.
GIN stands for generalized inverted index, and this is where things get really interesting for certain data types. GIN indexes are optimized for indexing columns that contain multiple values. Think arrays, or especially JSONB columns.
So you can efficiently query inside a JSONB column?
Exactly. With a GIN index on a JSONB column holding, say, trip details, you could very quickly find all trips where the metadata's bags_in_trunk is 2, or where water_offered is true. Without a GIN index, searching inside JSONB is usually a slow sequential scan. GIN is also used by extensions like pg_trgm for fast text similarity searches, finding Henry when someone types Henrietta, for example. Really powerful for fuzzy searching.
Wow, okay, JSONB searching and fuzzy text. What are partial indexes?
Partial indexes are indexes that have a WHERE clause in their definition. This means the index only includes rows that match that WHERE clause.
Why would you do that?
To make the index much smaller and more efficient. A classic example is indexing soft-deleted records: CREATE INDEX on users WHERE deleted_at IS NOT NULL. If only one percent of your users are deleted, this index is tiny compared to indexing all users. It's perfect for queries that specifically target that subset of rows.
Smaller index, faster queries for that subset. Smart. And BRIN indexes?
BRIN, block range index. Now, connecting this to the bigger picture, think about very large tables, maybe append-only tables like event logs, where data is inserted in order, often correlated with a timestamp like created_at. For these tables, BRIN indexes can offer performance similar to a standard B-tree index for range queries, like finding events between two timestamps, but they're drastically smaller. We're talking maybe 40 kilobytes versus 200 megabytes for the same data range. It's a huge space saving.
Tiny indexes for ordered data, interesting. Do indexes help with ORDER BY?
Absolutely?
If you have an index on the columns you're ordering by, PostgreSQL can often just read the data directly from the index in the correct order, avoiding a costly sorting step. This includes handling NULLS LAST or NULLS FIRST efficiently, if the index is defined correctly.
And covering indexes?
Covering indexes use the INCLUDE keyword, added in PostgreSQL 11. It lets you add extra columns to an index just for their data, not for searching or ordering. The benefit: if a query can get all the columns it needs, for filtering, for ordering, and for the select list, directly from the index itself, PostgreSQL can perform an index-only scan. It never even has to visit the main table heap. That avoids a whole layer of I/O and can be a massive performance boost.
Index-only scans, got it. Okay, so we've optimized queries and picked good indexes, but databases need ongoing care, right? Like cars needing oil changes. The source talks about the VACUUM, ANALYZE, and REINDEX maintenance tasks.
Yeah, absolutely vital, and it loops back to how PostgreSQL handles changes, that MVCC system, multiversion concurrency control. When you update or delete a row, PostgreSQL doesn't immediately overwrite or remove the old data. It marks the old row version as invisible to new transactions, but the physical space isn't reclaimed right away. These invisible rows are called dead tuples.
And those dead tuples are bloat?
Exactly. They take up space in your tables and indexes, making them larger than they need to be, which can slow down queries. This is where VACUUM comes in.
So VACUUM cleans up the dead tuples.
Yes, and the really neat thing is autovacuum, which is PostgreSQL's built-in background process that does this automatically. It monitors tables for changes and periodically runs VACUUM to reclaim space and ANALYZE to update statistics for the query planner.
It's said it's often too conservative.
Yeah. The default autovacuum settings are often tuned for safety, not aggressiveness. On busy databases with high write rates, lots of updates and deletes, the default settings might not keep up and bloat can accumulate. You can, and often should, tune parameters like autovacuum_vacuum_cost_limit, how much work it does per run, and autovacuum_vacuum_cost_delay, how long it pauses, per table or globally, to make it run more often or more aggressively and keep bloat under control.
Okay, tune autovacuum. What about indexes? Do they get bloated too?
They do. And indexes can become fragmented or inefficient over time. For rebuilding them, the best practice now is REINDEX CONCURRENTLY.
Ah, CONCURRENTLY again. So no downtime, right?
Introduced in POSTGRESSCULL thirteen, it rebuilds the index in the background without blocking reads or rights to the table. Once the new index is ready, it swaps it in atomically. For older Postgres versions, the Prakic extension can do something similar for both tables and indexes.
So mostly rely on autovacuum and REINDEX CONCURRENTLY. Do you ever need a manual VACUUM?
Sometimes. You might run VACUUM ANALYZE on your table manually after a huge data load or deletion, just to make sure stats are up to date immediately. There's also a VACUUM VERBOSE option to see what it's doing, and since Postgres 12 the SKIP_LOCKED option of VACUUM is useful to make sure your manual vacuum doesn't get stuck waiting for a lock on the table.
Is there a way to see the bloat?
Yeah, you can actually simulate it locally for learning. Disable autovacuum on a test table, run a ton of UPDATEs, and then use SQL queries (you can find them online) or tools like PgHero to estimate the table and index bloat. Then you could run VACUUM FULL, which does lock the table heavily, so don't run it on production, just to see it reclaim the space aggressively. It helps build intuition.
Good for understanding. You mentioned unused indexes earlier.
Yes, crucial maintenance. Indexes aren't free. They take up disk space and, more importantly, they slow down writes (INSERT, UPDATE, DELETE) because the index has to be updated too. Regularly identifying and removing unused or redundant, overlapping indexes is a key optimization. Tools like PgHero are great for finding these based on usage statistics tracked by Postgres.
So clean up unused indexes. How can we schedule these maintenance tasks?
The pg_cron extension is super handy for this. It's a scheduler that runs inside PostgreSQL itself. You can use it to schedule recurring SQL commands, like running a manual VACUUM ANALYZE on specific busy tables every night, or maybe rebuilding certain indexes periodically with REINDEX CONCURRENTLY.
pg_cron for scheduling. Any other quick maintenance tools?
The rails-pg-extras gem provides some convenient Rake tasks for common checks right from your Rails app, things like checking database cache hit rates, finding long-running queries, diagnosing lock contention, and identifying null indexes.
Null indexes?
Yeah, indexes where a very high percentage of the entries are actually NULL. Depending on your queries, these might be candidates for conversion into more efficient partial indexes using a WHERE column IS NOT NULL clause.
Got it. Okay, final big area: scaling. More users, more traffic. This often leads to problems with database connections, right? Hitting limits.
Yeah, that's a common scaling bottleneck. PostgreSQL has a hard limit on simultaneous connections, set by the max_connections parameter. If your application tries to open more connections than that limit allows, you get the dreaded "FATAL: sorry, too many clients already" error.
Ugh yeah, not fun. How do you deal with that?
Well, first you need visibility.
Use pg_stat_activity again to see how many connections are open and what state they're in. Are they active, idle, or, worryingly, idle in transaction, meaning a connection is holding a transaction open but isn't doing any work, potentially holding locks?
Idle in transaction is bad.
Very bad.
You need to find and fix the application code causing that. But the main architectural solution for handling lots of connections efficiently is connection pooling.
Connection pooling, like PgBouncer?
Exactly.
PgBouncer is the standard, free, open source connection pooler for Postgres. What's fascinating here is how it works. It sits between your Rails application and your PostgreSQL database. Your Rails apps connect to PgBouncer, which can handle thousands of front-end connections, and PgBouncer maintains a smaller managed pool of actual connections to the back-end PostgreSQL database, reusing them efficiently. It drastically reduces the overhead of constantly opening and closing connections to Postgres itself.
So it lets you handle way more app servers without overwhelming the database's connection limits.
Precisely, it's essential for scaling Rails apps with PostgreSQL. You do need to configure PgBouncer's pool modes, though. Transaction pooling is very common and efficient, but it has a catch. It doesn't work well with PostgreSQL session-based features like prepared statements, so if you use transaction pooling, you generally need to disable prepared statements in your Rails database.yml.
Oh, okay, disable prepared statements for transaction pooling. What's the alternative?
Session pool mode in PgBouncer. It works fine with prepared statements, but it's less efficient at reusing connections, as it pins a back-end connection to a client for the entire session duration. It's often the fallback if you can't disable prepared statements.
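A minimal sketch of the database.yml change for transaction pooling. The host name, database name, and port are assumptions for illustration (6432 is PgBouncer's conventional listen port); prepared_statements is the Rails PostgreSQL adapter setting being switched off. The Ruby wrapper only parses the fragment to show the resulting values.

```ruby
require "yaml"

# Hypothetical database.yml fragment: host, port and database name are
# assumptions for illustration, not values from the source.
DATABASE_YML = <<~YAML
  production:
    adapter: postgresql
    host: pgbouncer.internal
    port: 6432
    database: rideshare_production
    prepared_statements: false
YAML

config = YAML.safe_load(DATABASE_YML)
production = config.fetch("production")

puts "connects to #{production["host"]}:#{production["port"]}"
puts "prepared statements enabled? #{production["prepared_statements"]}"
```

The app points at the pooler rather than at PostgreSQL directly, so the pooler decides how many real back-end connections exist.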
Got it. Transaction versus session pooling. And we still need those timeouts, right?
Absolutely. statement_timeout, lock_timeout, and especially idle_in_transaction_session_timeout become even more critical when you have a pooler involved, to prevent misbehaving application instances from hogging pooled connections.
Okay, let's dive a bit deeper into locking. We mentioned pessimistic locking.
Right. PostgreSQL's default locks are taken up front. The pg_locks view gives you the granular details. You can explicitly request locks in SQL. SELECT ... FOR UPDATE is common. It locks the rows you select, preventing any other transaction from updating or deleting them, or selecting them for update, until your transaction commits or rolls back.
What if the row is already locked?
By default FOR UPDATE will wait, but you can add NOWAIT: if the row is locked, the query fails immediately. Or you can use SKIP LOCKED. This tells Postgres to just ignore any locked rows and return only the ones it can lock immediately. Super useful for implementing job queues where multiple workers might try to grab the next available job.
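To make the difference between NOWAIT and SKIP LOCKED concrete, here is a small in-memory model. It is not a database client: the JobTable class, its method names, and the worker names are hypothetical, and the default waiting behaviour is left out because blocking cannot be shown in a single-threaded sketch.

```ruby
# In-memory model of row locks; not a real database client.
class LockNotAvailable < StandardError; end

class JobTable
  def initialize(ids)
    @ids = ids
    @locks = {} # row id => name of the worker holding the lock
  end

  # Models SELECT ... FOR UPDATE NOWAIT: fail at once if a candidate row is locked.
  def lock_nowait(worker, limit:)
    candidates = @ids.first(limit)
    blocked = candidates.find { |id| @locks.key?(id) && @locks[id] != worker }
    raise LockNotAvailable, "row #{blocked} is locked" if blocked
    candidates.each { |id| @locks[id] = worker }
  end

  # Models SELECT ... FOR UPDATE SKIP LOCKED: pass over rows that are locked.
  def lock_skip_locked(worker, limit:)
    free = @ids.reject { |id| @locks.key?(id) }.first(limit)
    free.each { |id| @locks[id] = worker }
  end
end

jobs = JobTable.new([1, 2, 3, 4])
first_claim = jobs.lock_skip_locked("worker-a", limit: 2)
second_claim = jobs.lock_skip_locked("worker-b", limit: 2)

nowait_error =
  begin
    jobs.lock_nowait("worker-c", limit: 1)
    nil
  rescue LockNotAvailable => e
    e.message
  end

puts "worker-a claimed #{first_claim.inspect}"
puts "worker-b claimed #{second_claim.inspect}"
puts "worker-c with NOWAIT: #{nowait_error}"
```

With SKIP LOCKED the second worker gets the next free jobs instead of queueing behind the first, which is why it suits job queues.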
NOWAIT and SKIP LOCKED. Okay, is there a weaker lock?
Yes, SELECT ... FOR SHARE. This takes a shared lock. Multiple transactions can hold a shared lock on the same row simultaneously. It prevents anyone from getting an exclusive lock, like FOR UPDATE or a DELETE, but it allows other transactions to also read or SELECT ... FOR SHARE. Useful if you need to read data and ensure it doesn't change before your transaction finishes, but you don't mind others reading it too.
Okay, that's pessimistic locking in Postgres. What about optimistic locking in Rails?
Right, that's different. Active Record optimistic locking is an application-level strategy. You add a lock_version integer column to your table. When Rails loads a record, it remembers the lock_version. When you go to save it, Rails checks if the lock_version in the database still matches the one it loaded. If it doesn't match, meaning someone else modified the record in the meantime, Rails raises an ActiveRecord::StaleObjectError. Instead of overwriting the changes, it forces you to handle the conflict in your application code.
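The version check can be modelled in a few lines of plain Ruby. This is an illustrative in-memory sketch, not Active Record itself: TripStore and StaleRecordError are hypothetical stand-ins, and the fare values are made up.

```ruby
# Hypothetical stand-in for ActiveRecord::StaleObjectError.
class StaleRecordError < StandardError; end

# In-memory "table": id => { fare:, lock_version: }
class TripStore
  def initialize
    @rows = {}
  end

  def insert(id, fare)
    @rows[id] = { fare: fare, lock_version: 0 }
  end

  # Returns a copy, the way an ORM hands each caller its own object.
  def find(id)
    @rows.fetch(id).dup
  end

  # Models UPDATE ... WHERE id = ? AND lock_version = ?
  # A version mismatch means someone else saved first.
  def save(id, record)
    current = @rows.fetch(id)
    unless current[:lock_version] == record[:lock_version]
      raise StaleRecordError, "trip #{id} was changed by another writer"
    end
    @rows[id] = { fare: record[:fare], lock_version: record[:lock_version] + 1 }
  end
end

store = TripStore.new
store.insert(1, 20)

alice = store.find(1)
bob = store.find(1)

alice[:fare] = 25
store.save(1, alice) # succeeds, lock_version becomes 1

bob[:fare] = 30
conflict =
  begin
    store.save(1, bob) # bob still holds lock_version 0
    false
  rescue StaleRecordError
    true
  end

final = store.find(1)
puts "conflict detected: #{conflict}"
puts "stored fare: #{final[:fare]}, lock_version: #{final[:lock_version]}"
```

The second writer is rejected rather than silently overwriting the first, and the application decides whether to reload and retry.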
So detect conflicts at write time rather than preventing them with locks.
Exactly. And then there are advisory locks. These are completely application-defined locks using functions like pg_advisory_lock(key). They don't lock table rows, they just lock an arbitrary number, the key. Useful for coordinating background jobs or ensuring only one process runs a specific task at a time without needing a dedicated table.
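Because an advisory lock key is just a 64-bit integer, applications usually need a way to turn a readable task name into a number. One possible approach, sketched here with hypothetical helper names and a made-up task name, is to hash the name and keep the first eight bytes. The sketch uses pg_try_advisory_lock, the non-blocking sibling of pg_advisory_lock, and only builds the SQL text rather than running it.

```ruby
require "digest/sha2"

# Advisory lock keys are 64-bit signed integers, so a readable lock name has
# to be mapped to a number. Hashing the name is one option; these helper
# names are hypothetical.
def advisory_lock_key(name)
  # First 8 bytes of the SHA-256 digest, read as a signed big-endian integer.
  Digest::SHA256.digest(name).unpack1("q>")
end

def try_lock_sql(name)
  "SELECT pg_try_advisory_lock(#{advisory_lock_key(name)})"
end

key = advisory_lock_key("nightly_report")
puts key
puts try_lock_sql("nightly_report")
```

Every process that hashes the same name gets the same key, so they contend for the same lock without sharing any table.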
Okay, lots of locking options. Shifting again: dealing with large data sets often means pagination. What are the pitfalls?
The biggest pitfall is traditional LIMIT/OFFSET pagination. It's easy to implement with .limit(20).offset(...), but it gets incredibly slow on large tables as the offset value grows.
Why is a high offset slow?
Because PostgreSQL still has to fetch all the rows up to the offset plus limit point, sort them, and then discard the offset rows. So OFFSET 500000 is going to be agonizingly slow because it has to process over five hundred thousand rows just to find the twenty you want.
Ouch. So what's better for large tables? Cursors?
Database cursors (DECLARE CURSOR, FETCH) solve the offset problem by maintaining state on the server. They're consistent, but they have drawbacks. They keep a transaction open, which consumes resources and doesn't play well with PgBouncer's transaction pooling mode. And Active Record doesn't natively support them easily.
Okay, so cursors are tricky. What's the recommended way, then?
Keyset pagination, sometimes called seek-based pagination. If we connect this to the bigger picture, this is generally the most efficient method for large data sets. Instead of OFFSET, you use a WHERE clause based on the value of the last item seen on the previous page, combined with ORDER BY and LIMIT. For example, WHERE id > last_seen_id ORDER BY id LIMIT 20. Because you're filtering directly on an indexed column, the database can jump almost directly to the starting point for the next page, avoiding scanning all the previous rows. It stays fast even on page ten thousand.
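The cost difference between the two approaches can be modelled without a database. In this sketch a sorted Ruby array stands in for an indexed id column, binary search stands in for the index seek, and rows_visited is a modelled count rather than a measurement. The function names and row counts are assumptions for illustration.

```ruby
# In-memory model: a sorted array of ids stands in for an indexed column.
ROWS = (1..100_000).to_a.freeze

# OFFSET/LIMIT: every skipped row is walked past before the page is returned.
def offset_page(rows, offset:, limit:)
  page = rows[offset, limit] || []
  { page: page, rows_visited: [offset + limit, rows.size].min }
end

# Keyset: binary search (like an index seek) to the first id > last_seen_id.
def keyset_page(rows, last_seen_id:, limit:)
  start = rows.bsearch_index { |id| id > last_seen_id } || rows.size
  page = rows[start, limit]
  { page: page, rows_visited: page.size }
end

by_offset = offset_page(ROWS, offset: 90_000, limit: 20)
by_keyset = keyset_page(ROWS, last_seen_id: 90_000, limit: 20)

puts "same page? #{by_offset[:page] == by_keyset[:page]}"
puts "offset visited #{by_offset[:rows_visited]} rows"
puts "keyset visited #{by_keyset[:rows_visited]} rows"
```

Both calls return the same twenty ids, but the modelled work for the offset version grows with the page number while the keyset version stays at the page size.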
Keyset pagination: use the last seen value in the WHERE clause, got it. Last topic: bulk data writes. How to do those efficiently?
Upserts are common, combining an insert with an update if the row already exists. PostgreSQL has INSERT ... ON CONFLICT DO NOTHING (if it exists, ignore the insert) or INSERT ... ON CONFLICT DO UPDATE SET (if it exists, update the specified columns). Rails's insert_all and upsert_all methods cover these cases, and Postgres 15 introduced the standard SQL MERGE command, which is even more powerful for complex upsert logic.
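The two ON CONFLICT forms differ only in their final clause, which a small string builder makes visible. This is a sketch for reading the SQL shape, not a replacement for the Rails bulk methods: upsert_sql is a hypothetical helper, the drivers table and its columns are assumed, and identifiers are assumed to be trusted constants rather than user input.

```ruby
# Builds a parameterized INSERT ... ON CONFLICT statement. Table and column
# names are hypothetical; identifiers are assumed to be trusted constants.
def upsert_sql(table, columns, conflict_column:, update_columns: [])
  placeholders = (1..columns.size).map { |i| "$#{i}" }.join(", ")
  action =
    if update_columns.empty?
      "DO NOTHING"
    else
      sets = update_columns.map { |c| "#{c} = EXCLUDED.#{c}" }.join(", ")
      "DO UPDATE SET #{sets}"
    end
  "INSERT INTO #{table} (#{columns.join(", ")}) VALUES (#{placeholders}) " \
    "ON CONFLICT (#{conflict_column}) #{action}"
end

ignore_sql = upsert_sql("drivers", %w[email name], conflict_column: "email")
update_sql = upsert_sql("drivers", %w[email name],
                        conflict_column: "email", update_columns: %w[name])
puts ignore_sql
puts update_sql
```

EXCLUDED refers to the row that was proposed for insertion, so the update branch copies the incoming values over the existing row.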
Okay, upserts handle conflicts. What about just loading lots of new data, like from a file?
For bulk loading from files, especially CSVs, the absolute fastest way is the native PostgreSQL COPY command: COPY my_table FROM 'my_file.csv' WITH (FORMAT csv, HEADER). It bypasses a lot of the normal SQL processing overhead and is incredibly efficient, especially if run in the same transaction as a CREATE TABLE or TRUNCATE, as it can minimize WAL logging.
Then COPY is king for bulk loads. Anything else cool for external data?
Yeah, this is a surprisingly powerful and clever tool: foreign data wrappers, specifically the file_fdw extension. file_fdw lets you define an external file, like a CSV or log file, as if it were a PostgreSQL table. You can then run SQL queries directly against that external file without importing the data.
Seriously? Query a CSV file with SQL without loading it?
Seriously. CREATE EXTENSION file_fdw, CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw, CREATE FOREIGN TABLE my_logs (...) SERVER file_server OPTIONS (filename '/path/to/logs.csv', format 'csv'). Then just SELECT * FROM my_logs WHERE error_code = 500. It's amazing for analyzing log files or accessing archive data stored externally. A real game changer sometimes.
Wow. Okay, what a journey we've been on. We started right at the beginning, setting up that local performance lab with Rideshare. Then we dove into admin basics, safe experimentation, then hit data integrity with constraints, which go way beyond basic Rails validations. We tackled those scary production migrations, learning about CONCURRENTLY and tools like Strong Migrations. Then we really got into optimizing Active Record itself: N+1 queries, eager loading, CTEs, materialized views. And finally we went deep into Postgres internals: EXPLAIN, advanced indexing like GIN, BRIN, and partial indexes, essential maintenance with VACUUM and REINDEX, and handling scaling challenges like connection pooling with PgBouncer, locking, and efficient bulk data handling with COPY and even FDWs. You really have gained a shortcut here to some seriously high performance PostgreSQL skills. That's an incredibly valuable step.
It really is, and that's the goal, right? To equip you with these practical, actionable insights, hopefully some surprising facts too, that can genuinely make a difference in your applications. You should now have a much deeper feel for how PostgreSQL operates and how to really harness its power with Rails. So maybe a final thought to leave you with. Given all this immense power and flexibility we've uncovered in PostgreSQL today, just think about your own work. Is there an existing application bottleneck, maybe something you struggled with, that could be dramatically simplified or sped up by using one of these database features, maybe one you hadn't even considered before? Keep asking that question, keep exploring, because continuously improving the operational excellence of your databases, well, it's an ongoing process, but it's definitely a rewarding one.
