59: Bayer CropScience

⁠¶ Intro / Opening

00:11

Chain of events, cause and effect. We analyse what went right and what went wrong, as we discover that many outcomes can be predicted, planned for, and even prevented. I'm John Chidgey and this is Causality. Causality is entirely supported by you. If you'd like to support us and keep the show ad free, you can by becoming a premium supporter.

00:30

Premium supporters have access to early release, high quality versions of episodes as well as bonus material from all of our shows not available anywhere else. Just visit engineered.network/causality to learn how you can help this show to continue to be made. Thank you. By popular request for the month of May only, we're bringing back the very popular offer where you can get an additional month's discount on top of the two months you already get discounted for annual patrons.

00:53

But it's only for the month of May. It's available for premium and above, so if you've been on the fence about supporting the show and want access to the bonus material and early episode releases, then it's never been easier. Get in while it lasts. Thank you. Bayer CropScience.

⁠¶ Bayer CropScience

01:09

Institute is a community located approximately fourteen kilometers or nine miles west of Charleston, West Virginia, in the United States of America. A large multi-tenant industrial facility is located beside the Kanawha River and today is bordered by the West Virginia State University. It was originally situated there for easy access by road, rail, or river barge for the transport of raw materials in and the export of finished products out of the facility.

01:36

The site was originally an airport called Wertz Field. However, this closed in nineteen forty two under World War Two provisions and was converted into a government sponsored synthetic rubber production plant that was managed for the government by the Carbide and Carbon Chemicals Corporation in conjunction with the United States Rubber Company.

01:56

Following the war, in nineteen forty seven, the Union Carbide Corporation or UCC purchased the facility and began to produce insecticides for industrial uses, and in nineteen eighty six a French owned chemicals company, Rhône-Poulenc,

02:10

acquired the Agricultural Division of UCC and took over operations until the year two thousand. Aventis, formed by a merger of Rhône-Poulenc and AgrEvo, took over the facility briefly until finally, in two thousand and two, Bayer CropScience acquired it.

02:28

In August two thousand and eight, the one hundred and eighty six hectare or four hundred and sixty acre facility, by now a multi-tenant site, was known as the Institute Manufacturing Industrial Park, which employed approximately six hundred and forty five workers across all seven tenants, with Bayer employing the lion's share of those, about 545 in total at that time. Those tenants were Bayer CropScience, Adisseo, FMC Corporation, Dow Chemical, Catalyst Refiners, Reagent Chemical, and Praxair.

02:59

There are sixteen production units and five utility and support units in total located on the industrial site, with some tenants producing chemicals that are required in part as feedstocks for other production units that are either owned or operated by some of the other tenants. Think of it a bit like this.

03:16

Company A makes product A, sells a bit of that to a direct customer, but also sells a bit of that same product A to company B that happens to be located at the same site, and that company B then uses it to make their own product. And so on. That way, companies with specific expertise in a given area, and following any strict operational licensing restrictions based on their experience,

03:41

can focus on manufacturing a specific product or product type, and then the adjacent companies benefit from sourcing that locally or on site for their own use. A win-win kind of thing. Bayer owned and operated nine production and utility units at that time, with two additional process units that were owned by Adisseo and FMC, but were actually operated by Bayer employees under contract.

04:03

Bayer AG, or the Bayer Group, is an independently operated multinational corporation based in Leverkusen, Germany, and consisted of three business areas: Bayer CropScience, Bayer HealthCare, and Bayer MaterialScience. Specifically now for the Bayer CropScience business unit, their headquarters are in Monheim in Germany, and they employ over eighteen thousand people across one hundred and twenty countries.

04:28

Wow. In the United States, however, their US headquarters are located in RTP, that's the Research Triangle Park in North Carolina. Bayer CropScience, as their name hopefully suggests, is a global provider of what are sometimes described as crop protection agents, including chemical categories like insecticides, herbicides, and fungicides, for both commercial and for domestic or consumer use.

04:52

Within CropScience itself there are three divisions: Crop Protection, which serves the agriculture sector; BioScience, which looks at genetically modifying crops; and Environmental Science, which provides services for professional weed and pest control. Within the site, Bayer had three insecticide manufacturing complexes with two power stations, also called powerhouses, and a wastewater treatment plant.

05:14

The East Carbamylation Complex or ECC included facilities for methyl isocyanate or MIC and phosgene production, as well as aldicarb and carbaryl. The MIC and phosgene units supplied feedstock to the aldicarb and carbaryl units for the production of insecticides. The methomyl-Larvin unit was part of the West Carbamylation Complex or WCC, alongside the carbosulfan and carbofuran units,

05:40

which were owned by FMC but operated by Bayer, and it had originally gone into service in nineteen eighty three. The Adisseo-owned Rhodimet unit made up the third complex within the multi tenant facility that Bayer also operated.

05:53

Of specific interest in this incident, however, given all of that background information, is the methomyl-Larvin unit, which we'll focus on moving forward. Bayer produced methomyl both for international customers and as an intermediate feedstock that was then used to make Larvin, also known as thiodicarb, which is an insecticide as well as an ovicide, which, if you don't know what that is, means it kills insect eggs.

06:18

Hopefully it does, I guess. Methomyl is a white, crystalline solid, and it has a very slight sulfur smell to it. Methomyl dust is combustible and can form explosive mixtures when dispersed in air, but then so can wheat flour. Methomyl was first introduced as a carbamate insecticide in nineteen sixty six and registered by the US Environmental Protection Agency or EPA in nineteen sixty eight as a restricted use pesticide, directly because it was highly toxic to humans.

06:46

Methomyl is a cholinesterase inhibitor that disrupts the central and peripheral nervous system. Charming stuff. For those interested in more detail about phosgene and methyl isocyanate, if any of these chemicals are sounding familiar to you, refer to the previous episode, number twenty six of Causality, about the incident in Bhopal, India.

07:07

The methomyl process unit wasn't operated continuously all year round since the demand for methomyl wasn't great enough to justify it, hence it only ran for a few months each time and had extended outages between those production runs. Manufacturing methomyl requires a multi stage process.

07:24

First, acetaldoxime, made in what they call the oxime unit, and chlorine react to make chloroacetaldoxime, which then reacts with sodium methyl mercaptide in methyl isobutyl ketone or MIBK solvent to produce methylthioacetaldoxime, or MSAO. The MSAO reacts with the methyl isocyanate in the MIBK to finally produce methomyl. Excess MIC is then removed from the methomyl solvent solution, which is then crystallized using an antisolvent and separated from those solvents using centrifuges.

08:00

The resulting methomyl cake is dried, cooled and stored in drums, and then warehoused ready for sale, or sent on to the Larvin unit for that process to then use as its feedstock. The liquid that remains from the centrifuge separation is what they refer to as mother liquor. It still contains MIBK and hexane, and since no separation system is completely perfect, there will be small quantities of non crystallized methomyl contained within it as well.

08:27

The mother liquor is directed to the solvent recovery flasher that separates out solvent for reuse and sends the so called flasher bottoms across to the residue treater. The residue treater decomposes methomyl using heated methyl isobutyl ketone (MIBK) solvent. So the mother liquor, with dissolved methomyl and other waste chemicals, is fed into the residue treater once it's been preheated and partially filled with solvent.

08:53

Once the concentration of methomyl inside the residue treater is less than 0.5% by weight, the liquid is then transferred to an auxiliary fuel tank where it's mixed with other liquid waste material and reused as a fuel for one of the boilers. Focusing on the residue treater unit: the original twenty five year old carbon steel residue treater vessel was being replaced with a new stainless steel residue treater at the same time as the DCS upgrade was being undertaken in the methomyl unit.

09:22

The new unit was a 17 kL or 4.5 thousand gallon pressure vessel with a maximum allowable operating pressure, or MAOP, of 50 psig, that's PSI gauge, with a venting system designed to handle up to a one percent concentration of methomyl. The normal operation of the residue treater was to maintain a fifty percent liquid level and a minimum recirculation flow of the liquid contents through a heat exchanger

09:46

designed to remove excess heat generated from the exothermic decomposition of the methomyl and to maintain a temperature of one hundred and thirty five degrees Celsius or two hundred and seventy five degrees Fahrenheit. The heat exchanger had both a cooling and a heating mode, and during startup it used a steam line to introduce heat into the system, although operators had complained it was undersized as this took considerable time to heat up the solvent, initially at least.

10:11

Once at operating temperature, and with enough heat being produced by the exothermic reaction, the heat exchanger was switched over into a cooling mode for normal operation. Having said that, the temperature was also regulated by condensing the solvent vapor produced in the reaction in the vent cooler, which was then returned to solution, ultimately using the recirc loop in the cooling mode as a secondary temperature control.

10:35

The decomposition reaction rate increased with increasing temperature and with increasing methomyl concentration as well. Therefore, control of both of these variables was key. The target temperature used, therefore, had to be a balancing act. It prevented uncontrolled auto decomposition of a higher concentration of methomyl whilst ensuring the incoming methomyl decomposed rapidly so it would not increase the vessel's concentration to a level that could be unsafe.

11:01

Since the plant was only run for parts of a given year, any major repairs or system upgrades were therefore undertaken during these large outage windows, including the previously mentioned distributed control system or DCS upgrade project. It had started in early two thousand six, but the actual cut over to the live system didn't begin until early two thousand seven, and the first plant section to be upgraded was the Larvin section of the methomyl-Larvin unit.

11:27

The methomyl section of the methomyl-Larvin unit was then upgraded during the next outage window, which was about the same time of year but in two thousand eight. The original control system was a Honeywell TDC system that was no longer actively supported by the vendor, and it was a more traditional or perhaps classic 80s style DCS from before the computer mouse had been popularized by Apple.

11:49

Hence, interactions between the operators and the control system were entirely keyboard driven. The newer Siemens PCS 7 system, however, was more modern and utilized both the keyboard and mouse for user interactions. The DCS contained three interlock matrices: one for safety interlocks, one for operational interlocks, and one for control interlocks.

12:10

The Safety Interlock Matrix consisted of predefined process deviations and computer controlled process actions that determined how and when fail-safe automatic control functions were activated. The status of all safety matrix interlocks was displayed on a color-coded, spreadsheet like table on the display console.

12:29

Process mimic screens also displayed a safety matrix commonly referred to in the business as a cause and effect matrix, along with safety system status next to relevant plant component symbols. Overriding a safety interlock required a supervisor password that normal board operators did not have access to.

12:46

Globally, during late 2007 and early 2008, there had been an increase in demand for Larvin, and whilst previous stockpiles of the methomyl feedstock needed to make the Larvin were still available in the latter half of the upgrade and maintenance turnaround, those stock levels were beginning to run low.

13:03

Whilst leadership at the facility had not mandated a deadline for bringing the methomyl plant back online again, operations staff indicated that, and I quote, they looked forward to resuming methomyl production and a return to the normal daily work routine after a long unit shutdown. End quote. And finally, at long last, we have enough background information to begin discussing the incident.

⁠¶ The Incident

13:26

So let's talk about the incident itself. The very first day of the unit restart was actually Thursday, the twenty first of August two thousand and eight. But those plant sections that fed into the subsequent parts of the methomyl unit aren't relevant to this incident, other than to acknowledge that a complex plant like this one takes at least a week to get fully up and running from a cold start.

13:48

The original intention of the plant operators was to start up the oxime section of the process on Monday morning, the twenty fifth of August two thousand eight. However, it quickly became obvious that the upgraded sections of the plant weren't completely ready for service. Some of the items they found as they attempted to get the plant up and running included a valve that had not been installed on a solvent feed line,

14:09

a crystallizer centrifuge that was experiencing multiple problems with electrical connections that prevented it from starting at all, a heat tracing system for one section of process pipework that wasn't operating, which then led to the contents of the pipe cooling and turning completely solid, and a broken stem on a water cooling system valve on one of the vapor condensers. And that's just to name some of the bigger issues. There were actually many others.

14:33

Beyond this, the new control system had yet to be tuned, with many of the closed loop controllers, or proportional integral derivative (PID) controllers, left at their default values, leading to unstable operation if left unattended, and thus requiring a significant amount of manual input and oversight by the operators.

14:52

By Wednesday the twenty seventh of August two thousand eight, methomyl was once again being synthesized in the methomyl reactor, and the crystallizers were brought back into service. The following startup stage required two centrifuges to separate out the methomyl cake.

15:07

Since the upstream process was near continuous, two centrifuges were normally required, one separating out a full batch whilst the other was filling with the next batch ready for the next cycle to begin, and then they would swap over centrifuges. However, since they only had one centrifuge available due to the electrical problems on the other, they chose to continue the startup process using only a single centrifuge and began feeding methomyl-solvent slurry into the working centrifuge anyway.

15:32

As the initial centrifuge separation sequence neared completion, the operators were surprised to find that no methomyl cake had actually been produced. The operators at that time assumed that methomyl was not actually being produced upstream, and hence no cake was being produced downstream.

15:49

During startup it's normal practice to collect samples from various process points to confirm chemicals are correctly balanced, and a total of four samples of the liquid exiting the centrifuge were actually taken during this restart. Normally, the output of the centrifuge contained approximately zero point five percent methomyl, however the four samples all exceeded one percent, with the highest one reading four percent, which was eight times the specified operating level.
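To put a number on that comparison, here's a minimal sketch in Python. The only figures taken from the episode are the 0.5 percent normal level, the one percent floor and the four percent highest reading; the two middle sample values are hypothetical placeholders, since the individual results aren't quoted.

```python
NORMAL_PCT = 0.5  # typical concentration in the centrifuge liquid, from the episode


def ratio_to_normal(sample_pct: float, normal_pct: float = NORMAL_PCT) -> float:
    """How many times the normal concentration a sample reading represents."""
    return sample_pct / normal_pct


def readings_above(samples: list[float], limit_pct: float) -> list[float]:
    """Return only the sample readings that exceed the given limit."""
    return [s for s in samples if s > limit_pct]


# Hypothetical readings: only "all above 1%" and "highest was 4%" are known.
samples = [1.2, 2.0, 3.1, 4.0]
print(ratio_to_normal(max(samples)))       # 8.0
print(len(readings_above(samples, 1.0)))   # 4
```

A check like this is trivial, which is rather the point: the information was available, but nobody looked at it.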

16:14

However, during the startup, with the many distractions going on at the time, the operators never reviewed the results of any of these sample tests. As a result, they were unaware that the solvent levels were far too high, and that was actually the reason why no cake was being produced at the centrifuge. During startup the solvent in a storage tank is used to start the process,

16:35

and via the residue treater, a large portion of that solvent is then reclaimed and returned back into the process, per the design of the system. However, the operators noted that they had consumed a reasonable amount of solvent already during startup, and it needed to be replenished soon, otherwise they would run out of solvent and the whole process would have to be stopped. They had to start the residue treater to recover some solvent.

16:57

At approximately four AM local time on Thursday morning, the twenty eighth of August two thousand and eight, the board operator manually opened the residue treater feed control valve and began feeding flasher bottoms into the residue treater vessel, which at that point was nearly empty. The exact amount of solvent that was already in the treater at that time was not specifically mentioned in the report,

17:19

with one transcript indicating it was completely empty and another that it was practically empty, whatever that means. However, we can assume that it was far less than the thirty percent startup level target. In order to open the flasher bottoms inlet valve, there were two interlocks that needed to be bypassed: the first, a minimum operating temperature safety interlock,

17:40

and the second, a minimum recirculation flow operational interlock. Technically there was a third safety interlock for vessel pressure, however, since it was well within the required range, it didn't need to be overridden to open the inlet valve in this case. For recirc flow, that interlock had been disabled by the DCS programmers as part of their testing leading up to the restart and had not yet been reinstated.

18:04

Though, having said that, the operators did have permissions to disable it had they wanted to. For the operating temperature bypass, it was common for some board operators to bypass the minimum temperature safety interlock and manually open the flasher bottoms inlet feed valve when the residue treater solvent temperature was within about five to ten degrees of the target operating temperature of one hundred and thirty five degrees Celsius or two hundred seventy five Fahrenheit.
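To make the interlock picture concrete, here's a minimal sketch in Python of the three permissives on the flasher bottoms inlet valve as described so far. The class and function names are hypothetical and this is not Bayer's actual DCS logic; all it borrows from the episode is the three conditions and which of them were defeated that morning.

```python
from dataclasses import dataclass


@dataclass
class Permissive:
    name: str
    kind: str              # "safety" or "operational"
    healthy: bool          # True when the process condition is satisfied
    bypassed: bool = False


def feed_valve_may_open(permissives: list[Permissive]) -> bool:
    """The valve may open only if every permissive is healthy or bypassed."""
    return all(p.healthy or p.bypassed for p in permissives)


def bypasses_in_effect(permissives: list[Permissive]) -> list[str]:
    """Names of permissives that are only passing because they are bypassed."""
    return [p.name for p in permissives if p.bypassed and not p.healthy]


# State at roughly 4 AM as described: cold and nearly empty, pressure in range.
state = [
    Permissive("minimum temperature", "safety", healthy=False, bypassed=True),
    Permissive("minimum recirculation flow", "operational", healthy=False, bypassed=True),
    Permissive("vessel pressure", "safety", healthy=True),
]
print(feed_valve_may_open(state))   # True
print(bypasses_in_effect(state))    # ['minimum temperature', 'minimum recirculation flow']
```

Written this way, it's easier to see that a bypass doesn't make the underlying condition true; it only stops the logic from objecting. Two of the three protections were passing purely on the strength of a bypass.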

18:28

After feeding methomyl and MSAO into the solvent, the exothermic reaction generated the remaining heat required to achieve the target temperature, and this was faster than waiting for the heating loop to catch up. That said, there was also supposed to be a large amount of solvent already in the tank, which there wasn't at this point of the startup, and hence the current plant state didn't come close to that practice. But the behavior had been well established in the past

18:55

and was normalized at this point. With an inlet flow rate of only three hundred and forty liters per hour, that's one and a half gallons a minute, it was going to take just over one full day to fill the residue treater to its normal operational level of 50%. There was no note or discussion regarding the filling of the residue treater at the 6 a.m. shift handover.
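The "just over one full day" figure can be checked with some quick arithmetic. The following Python sketch uses the 17 kL vessel volume and the 340 liters per hour feed rate quoted in the episode, and it assumes the vessel starts empty, the feed rate is constant, and the level percentage is directly proportional to volume, which ignores the shape of the vessel heads. The function names are just for illustration.

```python
VESSEL_VOLUME_L = 17_000      # 17 kL pressure vessel, from the episode
FEED_RATE_L_PER_H = 340       # flasher bottoms feed rate, from the episode
LITRES_PER_US_GALLON = 3.785411784


def hours_to_level(target_fraction: float,
                   start_fraction: float = 0.0,
                   volume_l: float = VESSEL_VOLUME_L,
                   rate_l_per_h: float = FEED_RATE_L_PER_H) -> float:
    """Hours to fill from start_fraction to target_fraction at a constant rate."""
    return (target_fraction - start_fraction) * volume_l / rate_l_per_h


def litres_per_hour_to_us_gpm(rate_l_per_h: float) -> float:
    """Convert liters per hour to US gallons per minute."""
    return rate_l_per_h / LITRES_PER_US_GALLON / 60


print(hours_to_level(0.5))                            # 25.0
print(round(litres_per_hour_to_us_gpm(340), 1))       # 1.5
```

Twenty five hours to reach the fifty percent level is consistent with the "just over one full day" estimate, and it spans at least two shift handovers, which is why the missing handover note matters so much.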

19:15

As a result, the outside operator on the day shift was unaware that the residue treater had been put into operation and no further samples were taken that shift. Following the handover back to night shift at six PM, at six fourteen PM the outside operator started the residue treater recirculation pump as requested by the panel operator. At that time, the residue treater's level was approximately 30%, or just under 5,000 liters, or 1300 gallons,

19:42

with its liquid temperature between sixty and sixty five degrees Celsius, that's between one hundred and forty and one hundred and forty nine degrees Fahrenheit, noting that this was far below the critical decomposition target temperature of one hundred and thirty five degrees Celsius. The vessel's pressure remained relatively constant at twenty two PSI gauge.

20:00

At six thirty eight PM the temperature began rising at about zero point six degrees Celsius or one point eight degrees Fahrenheit every minute. At approximately ten fifteen PM the vessel's pressure, which had been steady at twenty two psig since the start of the filling process, started to increase and ticked up very sharply, crossing thirty psig in a matter of minutes. At ten twenty one PM the level was fifty one percent when the recirculation flow suddenly dropped to zero.

20:27

And then in less than three minutes the temperature reached one hundred and forty one degrees Celsius or two hundred eighty six degrees Fahrenheit and was climbing at a rate of more than two degrees every minute, noting that the safe operating limit was 155 degrees Celsius or 311 degrees Fahrenheit. At approximately ten twenty five PM, the residue treater high pressure alarm sounded, and two minutes later the pressure had reached forty four PSI gauge when the pressure relief valve opened.

20:54

The pressure briefly dropped two PSI for the next two minutes, but then went back to climbing again rapidly. At approximately ten thirty two PM, the residue treater pressure had exceeded the MAOP of fifty psig. The board operator noted this and suspected a blockage in the vent line. He contacted the outside operator, Mr. Barry Withrow, and directed him to go to the residue treater to check the vent system for any potential blockages,

21:19

with the second outside operator, Mr. Bill Oxley, also attending to assist. The panel operator manually switched the residue treater recirculation system to full cooling in an attempt to reduce the increasing tank pressure. At ten thirty three PM, the temperature had now reached one hundred and fifty point three degrees Celsius, that's three hundred and two point five degrees Fahrenheit,

21:41

and with the pressure now at fifty one point two PSI gauge, the residue treater vessel exploded, releasing nine thousand five hundred liters or two and a half thousand gallons of methomyl-solvent solution onto the roadway. A fireball was seen on the south side of the unit, with multiple alarms sounding on the methomyl and Larvin operator workstations.

22:01

Operators began emergency shutdown procedures by closing isolation valves, de-energizing plant equipment, and activating stationary water cannons in the plant for fire suppression. Shortly after the explosion, Mr. Bill Oxley was seen struggling to walk back towards the control room, having suffered third degree burns.

22:18

The support leg bolts for the residue treater had been sheared off their foundations as the shell and top head of the two point six ton residue treater fell into the methomyl unit. The bottom head separated and came to rest about twenty feet from the residue treater's foundations. The explosion destroyed multiple adjacent pumps,

22:40

heat exchangers, as well as electrical switchgear. Within minutes the Bayer Fire Brigade that was stationed on site arrived at the scene of the explosion, started setting up a command post and began fighting the fire. At approximately ten thirty eight PM, Metro 911 had contacted the Kanawha County Emergency Ambulance Authority, that's the KCEAA, and advised there had been a large explosion at the Bayer facility.

23:05

At approximately ten forty PM emergency medical services or EMS personnel began staging at the main gate of the facility. Shortly afterward, Tyler Mountain firefighters arrived on site and joined the Bayer Fire Brigade at the methomyl unit to assist.

23:19

The fire was brought under control in the early hours of the morning. The estimated energy of the explosion was equivalent to about seventeen pounds of TNT, which wasn't as big as some explosions we've covered on the show, but it was enough to do serious damage. The body of Mr. Barry Withrow, who had been attempting to troubleshoot the residue treater, was located at approximately two thirty AM the next morning. He had died from blunt force trauma as a result of the explosion.

23:47

The other outside operator, Mr. Bill Oxley, who had been badly burned in the fireball, died forty one days later at the Western Pennsylvania Burn Center in Pittsburgh, Pennsylvania. Let's talk about the investigation.

⁠¶ The Investigation

24:02

The Chemical Safety Board or CSB were assigned to investigate the incident, along with the Bureau of Alcohol, Tobacco, Firearms and Explosives, or ATF, and the Occupational Safety and Health Administration, OSHA, and first arrived on site on the thirtieth of August two thousand and eight. The ATF concluded their investigation on the second of September two thousand and eight, finding that the incident was not a criminal act. The CSB investigation, however, took about six weeks on site,

24:28

but it wasn't until the twenty third of April two thousand nine that they released their preliminary findings, with the final report released on the thirtieth of January twenty eleven, just under two and a half years after the incident had occurred. So what went wrong? The CSB's incident summary concluded the following, and I quote:

⁠¶ What Went Wrong?

24:48

On the night of the incident, methomyl-containing solvent was pumped into the residue treater before the vessel was pre-filled with clean solvent and heated to the required minimum operating temperature specified in the operating procedure. The emergency vent system was overwhelmed by the evolving gas from the runaway decomposition reaction of methomyl and the residue treater violently exploded. End quote.

25:09

Okay? Well that's the chemistry and I get that. But how did we get to this point in the first place? There are safety systems, operators are trained, and there are standard operating procedures or SOPs for this sort of thing, right? The safety interlocks on the residue treater unit were in place to ensure that mother liquor flasher tails AKA flasher bottoms couldn't be introduced into the residue treater until there was enough heated solvent present first.

25:36

The conditions, technically, were therefore that the pressure wasn't high high, the tank's temperature wasn't high high or low low either, and the recirculation flow also wasn't low low. The SOP hadn't been fully updated to incorporate the newly developed HMI screens, however, it still contained important information on how to safely

25:55

start the residue treater unit. One of those points was an administrative control that had to be performed before putting the residue treater methomyl inlet feed valve into automatic. From the procedure, it states: if the tank is allowed to cool below one hundred thirty degrees Celsius or two hundred and sixty six degrees Fahrenheit for any reason, it must be sampled before being heated up again.

26:18

Of course, this applied not only to startup, but was also very relevant to startup. The problem was that there was no live measurement for methomyl concentration via instruments into the control system that you could then interlock against. And there's really good reasons why it needs to be a sampled, lab tested value. But then again, even if you could provide a real time measurement, they probably would have just bypassed that interlock as well.

26:43

But either way you slice it, it was there in writing in the SOP and they didn't check the concentration from any of the four samples that were taken during startup at all. So what does a correct startup sequence look like?

⁠¶ The Correct Start-up Procedure

26:57

The board operator and an outside operator manually pre-fill the residue treater with solvent to the minimum level of thirty percent, then start recirculation pumps. Test and confirm the methomyl concentration is less than zero point five percent. Then start the solvent heating cycle. Normally, this is a closed loop automatic controller handled by the DCS.

27:18

The outside operator then collects another sample of the residue treater contents to re-verify the liquid contains less than zero point five percent methomyl. The board operator then puts the flasher bottoms inlet flow control valve into automatic and the flasher bottoms would then begin entering the residue treater. The procedure ensures that when the flasher bottoms begin entering the residue treater, the flasher bottoms are diluted and heated such that the methomyl will decompose rather than accumulate.
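The sequence above is really an ordered checklist, and it can be sketched as one in Python. This is only an illustration of the ordering, not the actual SOP or DCS sequence: the names are hypothetical, the thirty percent level and the 0.5 percent limit from the first sample check come from the episode, and using the 135 degrees Celsius target as the "hot enough" threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

MIN_LEVEL_PCT = 30.0      # minimum solvent pre-fill level, from the episode
MAX_METHOMYL_PCT = 0.5    # sample limit quoted for the first check
TARGET_TEMP_C = 135.0     # assumption: target temperature used as the threshold


@dataclass
class TreaterState:
    level_pct: float
    recirc_running: bool
    first_sample_pct: Optional[float]    # None means no sample was taken
    temp_c: float
    second_sample_pct: Optional[float]   # None means no sample was taken


def sample_ok(value: Optional[float]) -> bool:
    """A sample passes only if it was taken and is below the limit."""
    return value is not None and value < MAX_METHOMYL_PCT


def next_unmet_step(s: TreaterState) -> Optional[str]:
    """Return the first startup step not yet satisfied, or None if ready."""
    if s.level_pct < MIN_LEVEL_PCT:
        return "pre-fill with clean solvent to the minimum level"
    if not s.recirc_running:
        return "start the recirculation pump"
    if not sample_ok(s.first_sample_pct):
        return "sample and confirm methomyl concentration"
    if s.temp_c < TARGET_TEMP_C:
        return "heat the solvent to operating temperature"
    if not sample_ok(s.second_sample_pct):
        return "re-sample and re-verify methomyl concentration"
    return None  # only now should the feed valve go into automatic


# Roughly the state when feeding began: nearly empty, cold, no pump, no samples.
# The 60.0 is a placeholder temperature, not a figure from the report.
at_4am = TreaterState(0.0, False, None, 60.0, None)
print(next_unmet_step(at_4am))  # pre-fill with clean solvent to the minimum level
```

On the night, the feed valve was opened with the very first step still unmet, so every later safeguard in the sequence was skipped along with it.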

27:46

So why did the recirculation flow drop to zero during the startup? Well, Bayer's internal investigation revealed that, and I quote, an undocumented change in the heating cooling control scheme was made during the control system upgrade that resulted in a flow restriction when changing from heating to cooling. End quote. So to paraphrase that, the programmers screwed up, and when transitioning from heating to cooling, all the valves shut off and killed the flow completely.

28:13

And whilst this is pretty shoddy and it smacks of a lack of functional testing, both Bayer and CSB determined that even if full coolant flow had been available, the cooler could not have removed enough heat to stop the reaction and it wouldn't have prevented the explosion. Still, not really good enough.

⁠¶ How The Operators Overrode So Much

28:31

So how did the operators override so many different things? When I was reading the report, I was curious how the DCS had its privileges set. Normally, we'll define three levels: operator, supervisor, and engineer. And sometimes we'll also define a fourth as an administrator, but that's not for operational use. During design and HAZOPs and PHAs, we'll look at the risks of incorrect operation and then split out set points, interlocks, and overrides based on risk.

28:59

For day to day process settings, these are left with an operator privilege, but for anything that's a safety related bypass, it's generally set to a supervisor level. It's an administrative control, of course, but the assumption made during HAZOPs is that operators are competent and that supervisors are competent, have more experience, and are ultimately accountable for the plant's safe operation.
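As a rough illustration of that privilege split, here's a minimal Python sketch. The role names follow the three levels just described, while the action names and the mapping are hypothetical examples rather than anything taken from the Siemens system or the report.

```python
from enum import IntEnum


class Role(IntEnum):
    OPERATOR = 1
    SUPERVISOR = 2
    ENGINEER = 3


# Hypothetical mapping of actions to the minimum role allowed to perform them.
REQUIRED_ROLE = {
    "setpoint_change": Role.OPERATOR,
    "safety_interlock_bypass": Role.SUPERVISOR,
}


def is_permitted(action: str, session_role: Role) -> bool:
    """Check the logged-in session's role against the action's minimum role."""
    return session_role >= REQUIRED_ROLE[action]


print(is_permitted("safety_interlock_bypass", Role.OPERATOR))    # False
print(is_permitted("safety_interlock_bypass", Role.SUPERVISOR))  # True
```

The weakness is visible in the function signature: the check only ever sees the role of the session that is logged in, never the person actually sitting at the keyboard. That's what makes it an administrative control rather than an engineered one.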

29:21

What's interesting here is that with Bayer's restructuring, they didn't really revisit that experience hierarchy in the control system, and I'll talk more about that in a minute. It had become a common practice for supervisors to leave themselves logged in to allow operators to bypass whatever they needed to, including safety system interlocks that became problematic when you were starting up.

29:44

Without this independent and more experienced check in place, better operating practices were sometimes ignored in order to get the process unit back up and running as quickly as possible. In this incident during startup, a supervisor remained logged in to the operator matrix edit screen, and this was left active all day and all night so that anyone could defeat the control functions if they wanted to.

30:07

And that's how the operators were able to override so many things so easily. I talked about an organizational change. So what happened?

⁠¶ Organisational Change

30:16

Between 2004 and 2007, Bayer Management restructured the unit supervisory and technical oversight staff structures to thin and streamline operational costs. Traditional first line supervisor positions for each operating unit were eliminated and replaced by self directed or empowered teams.

30:35

Four teams of operators worked in rotating shifts covering two shifts day and night since it was a twenty four seven operation, and those operator teams were supported by a technical advisor and a run plant engineer. But that was only on day shifts. If they had to, they could call the technical advisor who was on call on nights and weekends.

30:55

Whilst the report doesn't specifically mention this, the wording of it suggests that these roles were actually standard workweek day shift roles. I could be wrong, but it makes sense if you're trying to reduce costs, which Bayer were clearly trying to do here.

31:09

So instead of having a first line supervisor, all operators, including the technical advisor, or TA, reported to a production leader. There was a single industrial park site shift leader role, which management described as being a first among equals.

31:25

This person was responsible for overseeing all facility site operations, and this was a shift-based role. Bayer described their intention for shift leaders to be, and I quote, very good operators who have worked their way through the technical advisor role, end quote. But unfortunately this was not always true. Some of the people appointed to the shift leader role had prior experience as first line supervisors from the previous org structure in some of the process units. But it's

31:51

important to understand that their role was advisory only, meaning none of the operators reported to them. The technical advisor role, which was also not like a traditional first line supervisor, was intended for an experienced operator who worked the day shift, helped to schedule production, and provided advice to on-shift operators.

32:12

A traditional first line supervisor or a foreman had more of a work check or a looking over the shoulder sort of role, but in the new structure this didn't exist anymore. In practical terms, the reorganization resulted in a single technical advisor being assigned to the entire methomyl-Larvin unit, working day shift, and also acting as the liaison to the capital project team.

32:34

The shift leader was also available to assist if needed, but in reality they didn't work with the operators every day. Due to the scale of the upgrade project, Bayer assigned a second technical advisor to assist. That seems sensible to me. In practice, this meant that the first technical advisor focused on Larvin production, and the second technical advisor focused on methomyl production and project support. The only problem was that this TA had no methomyl unit operating experience at all.

33:05

In interviews following the incident, the TA expressed that whilst he lacked knowledge of the methomyl unit equipment and chemistry, he had, and I quote, hoped to learn more about the process by having a greater involvement in the unit startup, end quote. But when it came to the crunch, he wasn't even able to be present during startup anyhow due to operational difficulties on the Larvin unit, and he got dragged into that instead.

33:29

Bayer had also appointed that specific second TA since they had experience as a technical advisor with some DCS training. The problem was that this was in a completely different unit on a completely different control system. Not exactly transferable.

33:45

During the project's factory acceptance testing, or FAT, one of the more experienced methomyl unit operators actually helped the TA, but the technical advisor remained the primary contact for the project anyhow, despite his inexperience with this particular unit.

33:59

In the days leading up to the incident, the TA was noted as having worked somewhere between fifteen and seventeen hours every day, and when the operators were struggling to stabilize the methomyl unit during startup, the TA had already left for the day, and they were not contacted for any assistance. Though it remains unclear if the TA could have provided much useful assistance anyhow. It has to also be pointed out that the DCS upgrade just wasn't finished.

⁠¶ The DCS Upgrade Wasn't Finished

34:24

The PCS 7 system was provided with six physical screens to display SCADA HMI information in the methomyl unit. Five screens were available to monitor the methomyl process and one screen was dedicated to displaying the current alarms. However, in order to operate some of the methomyl unit equipment, operators found they needed to use three of the five available screens to have the required information all out in front of them, leaving only two for other plant information.

34:51

In short, the HMI design required more physical screens than could be physically installed. The operators asked the Siemens project engineers to add equipment overview screens to display multiple pieces of equipment, which are referred to in HMI design circles as level one graphics.

35:07

Whilst the project team had agreed to develop them, they were not ready for the startup, and whilst operators found the poor HMI design to be frustrating, they didn't see it as a showstopper to restarting the unit. When you're designing a HMI, the ground level pages are the ones you always start with, the so called level three and level four displays.

35:26

These have every single setting and controllable item displayed on them, so you can completely operate and commission the facility from those screens. But it's not easy if you have a large or complicated process. It comes up from time to time when I've worked with operators: we need more screens, over and over again. When in fact that's actually completely incorrect. You don't need more screens at all. You need properly designed overview graphics.

35:52

Level two, which is generally a train level, and level one, which is generally a unit level, and level zero, which is your entire site facility overview. If you design your SCADA system properly, you should only need three or four screens in total, which is good, since that's all that a human being can ingest at any one time anyway.
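
To make that display hierarchy concrete, here is a small sketch of it as a navigation tree in Python. The display names are hypothetical placeholders, and the level numbering follows the description above: zero for the site, one for a unit, two for a train, and three or four for ground level detail.

```python
from dataclasses import dataclass, field


@dataclass
class Display:
    """One HMI page and the more detailed pages reachable from it."""
    name: str
    level: int
    children: list["Display"] = field(default_factory=list)


def find_path(root: Display, target: str) -> list[str]:
    """Return the display names from root down to target, or [] if absent."""
    if root.name == target:
        return [root.name]
    for child in root.children:
        sub_path = find_path(child, target)
        if sub_path:
            return [root.name] + sub_path
    return []


# Hypothetical hierarchy: site -> unit -> train -> equipment detail.
detail = Display("Equipment detail", level=3)
train = Display("Train overview", level=2, children=[detail])
unit = Display("Unit overview", level=1, children=[train])
site = Display("Site overview", level=0, children=[unit])
```

With overview pages in place, an operator drills down from a handful of overview displays rather than keeping many ground level pages open at once.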

36:11

And that's something I think is worth exploring some more. There'll be a Causality Explored episode for supporters on that one. Watch for the link. So what do we learn from all of this?

⁠¶ What Do We Learn?

36:22

The CSB called out five root causes of the incident at Bayer CropScience and I'll read them verbatim. One, Bayer did not apply standard pre-start safety review, that's PSSR, and turnover practices to the methomyl control system redesign project.

36:38

Two, operations personnel were inadequately trained to operate the methomyl unit with the new DCS. Three, malfunctioning equipment and the inadequate DCS checkout prevented the operators from achieving correct operating conditions in the crystallizers and solvent recovery equipment.

36:53

Four, the out of specification methomyl-solvent mixture was fed to the residue treater before the residue treater was prefilled with solvent and heated to the minimum safe operating temperature. And five, the incoming process stream normally generated an exothermic decomposition reaction.

37:09

But methomyl that had not crystallized due to equipment problems greatly increased the methomyl concentration in the residue treater, which led to a runaway reaction that overwhelmed the relief system and over pressurized the residue treater. End quote. Now, point five is actually the reason why the tank exploded, and we've been through that in detail, but I'm more interested in how we got to that point.
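
Root cause four describes a startup permissive: feed should only be allowed once the residue treater is prefilled with solvent and at its minimum safe operating temperature. A minimal sketch of that check follows. The limit values and names are placeholders for illustration only, not figures from the CSB report.

```python
from dataclasses import dataclass


@dataclass
class TreaterState:
    """Snapshot of the residue treater conditions relevant to starting feed."""
    solvent_level_pct: float
    temperature_c: float


# Hypothetical limits for illustration only; not values from the CSB report.
MIN_SOLVENT_LEVEL_PCT = 30.0
MIN_SAFE_TEMP_C = 130.0


def feed_permitted(state: TreaterState) -> bool:
    """Allow feed only when the treater is both prefilled and hot enough."""
    return (
        state.solvent_level_pct >= MIN_SOLVENT_LEVEL_PCT
        and state.temperature_c >= MIN_SAFE_TEMP_C
    )
```

A permissive like this only protects the plant if it cannot be casually bypassed, which ties back to how the override privileges were being used.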

37:30

The importance of acceptance testing, whether it's an off-site FAT or an on-site SAT, or site acceptance test. You just need to know that when you flick a switch on the floor, the system is going to work before you turn it on. Where were the control systems engineers assisting with control loop adjustments on hand twenty four seven in case of system issues during startup? There's no mention of that anywhere.

37:55

But the simple fact is the system just wasn't ready to run yet. It just wasn't. So why the hell did they start it? Who decided that it would be just fine to fire it up? Not only that, who decided it was okay to fill the residue treater without solvent in it first? So for me, the most important root cause, which the CSB does cover a lot in the report, lies with the people that made the decision to start up, that didn't do any real PSSR. And to appreciate that, it comes back to experience,

38:22

the experience of the personnel making the decisions and the drivers behind those decisions, in other words, the organizational structure. Organizational structure should be subject to external review. And that's potentially a bold statement, but I'll explain what I mean. The CSB concluded that the organizational changes made at Bayer CropScience leading up to the incident were a direct contributor to the incident.

38:47

A self directed organization did not require management advice or input into the unit restart schedule. Therefore, the self-directed work teams themselves ultimately decided to just start the methomyl unit, despite the fact that the control system and some plant equipment weren't fully tested and the SOP hadn't been updated.

39:05

After being in this structure and the shuffling of personnel through positions over time, management had become so far removed from the day-to-day operation that they were completely unaware that the operators seldom used the SOP for startups. And some of them regularly bypassed critical safety interlocks, which reinforced behaviors that directly led to the incident.

39:26

Not only that, it's clear that the assigned technical advisor was not suitably qualified for the role and was not present when they should have been, since they weren't dedicated to the methomyl unit startup. What was needed was a more qualified and more experienced TA who was actually dedicated to the startup. And dedicated means they don't get dragged into other parts of the plant. They aren't allowed to be. They're dedicated. That's what dedicated means.

39:51

It's common practice to have the most experienced operators dedicated to plant startup as it's one of the most dangerous times in plant operation. And this isn't news. This is a well-established understanding. Having said that, the structure of a self-directed team plays a part in this.

40:08

In his excellent book Organising for Safety: How Structure Creates Culture, Dr. Andrew Hopkins states that, and I quote, decentralized organizational structures allow profit and production to take precedence over safety. End quote. And he has said on many occasions that in certain industries, meaning those where harm can come to personnel, the public or the environment, they should have their organizational structures regulated or approved by an independent body.

40:38

Currently, this isn't the case, at least so far as I'm aware, anywhere in the world. But the thing is, the management in large organizations regularly just decide to shuffle the deck, change who reports to whom, centralize, decentralize, downsize, right-size, whatever terminology you want to use. And the issue is that the problems don't show up the next day. In the case of BP Texas City, it took several years,

41:03

as it did in this case as well before those organizational changes led to the incident. The problem is that who reports to whom drives survival instincts. If you have a managing manager that is non-technical that you report to and they say get it running, you'll lean towards that as your priority front of mind. Since this person decides if you keep your job, if you get a pay rise next year, and it's a basic human survival instinct.

41:31

I've watched technically excellent engineers and operators be overruled by technically ignorant managers and, well, just roll over and give up. Now we saw another example of that with the Challenger Space Shuttle. That's just another data point. I've also watched people that I technically respected be moved into management-first roles and watched their priorities erode over time, as the pressures from management above them, who are even less technical, repoint

41:59

their priorities, and in effect their promotion into management makes them worse engineers overall. Structure does actually change people. It can elevate the best of the technical decisions to the forefront. And it can also do the exact opposite, taking wonderful engineers and making them ineffective, frustrated, and perhaps shadows of their former selves.

42:23

Organizational structure has to have both management and technical reporting alignment, and leadership roles should have relevant technical experience, otherwise they can't effectively manage their own teams and ensure safe operational procedures are actually followed. In high risk industries, this trend of splitting management and technical knowledge roles needs to end. It is a fantasy that just doesn't work. Anyway, rant over.

⁠¶ Aftermath

42:49

Let's talk about the aftermath. Bayer CropScience reported that, and I quote, no toxic chemicals were released because they were consumed in the intense fires, end quote. Despite that, during the investigation the CSB confirmed that the only air quality monitoring equipment near the unit that could detect toxic chemicals wasn't even operational at the time of the incident. So we don't really know.

43:09

Therefore, there is no information available to actually know what chemicals were released into the atmosphere in the incident, let alone what their concentrations were. To that point, on the sixteenth of September twenty ten, Portia Gray filed a lawsuit against Bayer, alleging her son, Ray Sean Gray, died weeks after breathing dust and powder fallout in his dormitory room that they claimed came from the Bayer facility following the

43:35

explosion. His dormitory room in the nearby university had its window open and was downwind from the facility, and upon Raishawn's return to his room to shelter in place per the instruction, his room was coated with a layer of this dust. Try as I might, I could find no information on the outcome of that case. Bayer CropScience paid a total of one hundred forty three thousand US dollars in fines to the US Occupational Safety and Health Administration

44:02

for thirteen serious and two repeated violations from the incident following their six month investigation. In September of 2015, Bayer CropScience paid $5.6 million to settle violations of U.S. chemical accident prevention laws. In addition to the nine hundred seventy five thousand US dollar civil penalty, Bayer committed to investing

44:23

four point two three million US dollars in improving emergency preparation and responses at the facility, as well as additional protective measures against accidental discharge into the Kanawha River within three years of the settlement date. The plant has since been brought back into production and continues to manufacture product today. It's easy to directly blame the operators for overriding the control system and the self directed team for deciding it was safe

⁠¶ Conclusion

44:50

to simply start the unit without any extended consultation, even though it hadn't even been fully commissioned yet. The problem is that our perception of risk can be diluted or normalized. It's because of how we learn, or one aspect of how we learn, and that's reinforcement learning. It's the same problem with the way we train machines. It has the same issue.

45:10

If we're operating a system repeatedly whilst it's running, and it's running maybe 95% of the time, with only 5% of the time where we're starting it up or shutting it down, or there are plant trips and trip recovery, then we just get used to system operation once it's balanced and operational. The 95% of the time, that's what we get used to. And this is one of the reasons why startup is just so dangerous.

45:32

It's the same equipment operated differently until you reach an operational equilibrium anyway. And that's why pre-start checklists are very important. This is why operator retraining for startups and trip recovery is so important too. It's also why independent oversight from someone technically experienced with the chemical process in question needs to be present during the entire startup sequence, or in this case there needed to be a set of technical leads

46:01

back to back since it was a twenty four seven startup over a week and a half. People plus structure plus procedures equals success. There are many causes for this incident that we've covered, but ultimately the one that worries me the most is this background lurking root cause of many incidents like this. Poor organizational structure.

46:22

It feels to me like everyone has an opinion about how you should structure an organization. Who reports to whom, matrix management, operations led, engineering led, safety led, or self-led, in this case, my goodness, it never really ends. There's always a different way of saying it. There are companies out there that will come in and help you restructure based on the latest matrix,

46:42

self-led, whatever is trendy, usually Silicon Valley led or from the latest highly financially successful company. This is what they're doing, we should do it too. Usually not in the process industries though. And they get paid huge amounts of money with promises of increased efficiency and cutting costs. And if it isn't outsourced, there are those internally within organizations that think that they can do the same.

47:05

But it's similar to a political election cycle kind of issue, meaning that most politicians only consider programs and projects within their current election cycle to point out and say, I achieved that in my four years in office. With that, anything that's longer term, with a longer time horizon, generally gets ignored or overlooked.

47:23

With an organizational change, it's worse than that. The people making these decisions generally exit the organization within a year or two of the change having been made, saying, see, I improved efficiencies and cut costs and made it better by shuffling people around, or something like that. But the impacts of these decisions are seldom felt in that time frame. It can take years for the problems to manifest,

47:45

by which time the instigators are generally long gone and they're never held to account. There is no feedback loop, and hence they never learn from their mistakes, and go on to make the same basic organizational mistakes all over again. The problem is we have a growing body of evidence that suggests that there are organizational structures that work better and those that work worse in terms of managing risk in critical operations.

48:09

When explosions kill people, we trace them back to bad design, we enact regulations, design standards, and hold people to account. They get fined. They go to jail. You know, punishment. When organizational change is the root cause though, we do what? Nothing? Anything? How many more incidents do we need to have around the world like this before we start taking it seriously?

48:36

If you're enjoying Causality and you'd like to support us and keep the show ad-free, you can by becoming a premium supporter. We have support options via Patreon, Apple Podcasts, and YouTube memberships

⁠¶ Outroduction

48:47

as well. Just visit engineer.network slash causality to learn how you can help this show to continue to be made. Thank you. A big thank you to all of our supporters, a special thank you to our silver producers Mitch Bilger, Shane O'Neal, Jared Roman, Katharina Will, Chad During, Ian Gallagher, and Jamie Russell. And an extra special thank you to our gold producers, Steven Breidel, Callan Frodelias Fujin, and our gold

49:10

producer known only as R. By popular request and for the month of May only, we're bringing back the very popular offer where you can get an additional month's discount on top of the two months discount you already get for annual patrons.

49:22

But it's only for the month of May. It's available for premium and above levels, so if you've been on the fence about supporting the show and want access to the bonus materials and early episode releases, then it's never been easier. Get in while it lasts. Thank you. Causality is heavily researched and links to all materials used for the creation of this episode are contained in the show notes. You can find them in the text of the episode description of your podcast player or on our website.

49:46

You can follow me on the Fediverse at chidgey at engineered.space or the network at engnet at engineered.space. This was Causality. I'm John Chidgey. Thank you so much for listening.

✨ This transcript was generated by Metacast using AI and may contain inaccuracies.