Crowdstrike outage globally spreads Windows Blue Screen of Death, company plans update rollback
IT security professionals are in for a long weekend following an outage at cybersecurity company Crowdstrike.
Some IT professionals are in for a long weekend following an outage at cybersecurity company Crowdstrike. A recent update has led to a series of outages across the world. At the time this article went live, the London Stock Exchange had seen its services disrupted. There are reports of issues at banks, airlines, media firms, and countless government services across the globe. 911 emergency services are disrupted in several states here in America. Many users are reporting the dreaded Windows Blue Screen of Death (BSOD).
If you're a @CrowdStrike customer and your machine is off, leave it that way.
— Jake Williams (@MalwareJake) July 19, 2024
Something has caused blue screen loops with csagent.sys and it's, um, not good... pic.twitter.com/PeYLH8qhGT
Crowdstrike told NBC that the company is now in the process of rolling back that update globally. There are reports of other fixes, but the outage will undoubtedly waste a lot of time for IT professionals everywhere. Quite literally around the world, today's outage has ruined many people's weekends.
Wow, Crowdstrike issue. Thoughts and prayers fellow IT guys and girls around the world.
— Steve (@JarOfSteve) July 19, 2024
Several media companies, including NBC, are affected by the outages. The United Kingdom's largest railway operator is experiencing a widespread IT outage as well. A poster on the r/sysadmin subreddit wrote that they "just had 160 all BSOD. This is NOT going to be a fun evening." One IT professional posted to our own Shacknews Chatty forum that "I am so glad we didn't switch to them just because "they are the best" or whatever our VAR was claiming. Dodged a fucking missile!" I am sure many others in the IT security field are jealous of CplBeaker today.
Microsoft's 365 suite of apps and some cloud services have been restored, according to the company, though “a small subset of services is still experiencing residual impact.” Some airline issues were tied to the Microsoft outage, but other airlines remain at a standstill due to the Crowdstrike issues.
Crowdstrike (CRWD) stock is down over 12% in premarket trading as the world stares at the company in disapproval. Dow Jones Futures are trading down 0.25%, though those numbers will likely change when the stock market opens later today.
-
Asif Khan posted a new article, Crowdstrike outage globally spreads Windows Blue Screen of Death, company plans update rollback
-
BBC freaking out and a global IT outage?
https://www.bbc.co.uk/news/live/cnk4jdwp49et
Summary
IT outages are reported across the world, affecting airlines, media, and banks
Airlines and airports have reported issues, with many flights grounded
The cause is not known - but Microsoft says it's taking mitigation actions
In the UK, railway companies say they're experiencing "widespread IT issues"
Sky News has not been able to broadcast live, its executive chairman says
The London Stock Exchange is also experiencing outages-
yeh lots of freaking out on the news in Oz here too.. very meh in my mind.
https://www.sbs.com.au/news/article/global-it-outage-impacting-australian-banks-supermarkets-media-outlets-and-more/ezge7qp0g
-
-
-
Crowdstrike it seems: https://reddit.com/r/sysadmin/comments/1e6vx6n/crowdstrike_bsod/
-
-
-
-
-
-
I’m joking but man it sure feels like that Jedi meme, I now know why I slept like shit tonight:
https://i.imgur.com/YyHyz0p.jpeg
Even the time for the second peak roughly aligns with the time this thread started, haha. -
-
-
I feel really bad for those people who have Crowdstrike on hardened Domain Controllers and either misplaced or didn't record their BitLocker recovery keys or domain recovery passwords. Hope your backups work and you have one from yesterday.
I also feel bad for remote workers, since if their laptops are bricked right now getting those fixed is going to be a huge pain as well if they are in an area where IT can't get to them. -
The problem is that this isn't fixable remotely. The crash is happening before the network stack even starts, so you are going to need manual intervention on every single machine affected by this. And the fix, booting into Safe Mode and deleting a file, is also impossible on AWS and Azure systems. For those you need a second VM to attach the affected VM's OS disk to, then go in and delete the file. If there is any sort of disk encryption in place you will need a recovery key to do that, and if the system that escrows the recovery keys is also affected then you have even more of a mess. (A minimal sketch of the file-removal step is below.)
This isn't something that is going to get fixed today, this is going to take weeks to fully resolve.
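If you end up scripting that file removal once you can actually get at the volume (Safe Mode, WinRE, or the OS disk attached to a helper VM), here's a minimal sketch. It assumes the widely circulated workaround of deleting the C-00000291*.sys channel files from the CrowdStrike drivers folder; the drive-letter argument is hypothetical, so point it at wherever the affected volume is mounted.
```python
import glob
import os
import sys

# Root of the affected Windows volume. In Safe Mode this is just C:\, but if
# the OS disk is attached to a helper VM, pass whatever drive letter it got
# (e.g. F:\).
windows_root = sys.argv[1] if len(sys.argv) > 1 else "C:\\"

# Widely circulated workaround: remove the channel file(s) matching
# C-00000291*.sys from the CrowdStrike drivers directory.
pattern = os.path.join(windows_root, "Windows", "System32", "drivers",
                       "CrowdStrike", "C-00000291*.sys")

matches = glob.glob(pattern)
if not matches:
    print(f"Nothing matching {pattern} -- nothing to delete.")
for path in matches:
    print(f"Deleting {path}")
    os.remove(path)
```
Run it elevated, and note it does nothing for the encryption problem above: a BitLocker'd disk still needs its recovery key before you can mount the volume at all.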
Now the decision is which security company to invest in, i.e. which one is going to be getting a bunch of new customers. Trellix or Microsoft are the 2 big ones, but Trellix is private, and Microsoft is so big that it's unclear whether a 10% boost in security sales would really move the stock that much. Maybe take a stab at something like Sophos.-
-
On the large scale yep pretty much just Defender and Crowdstrike. Trellix is probably next (formerly FireEye), and then all of the midrange ones like Sophos, etc. It is just such a huge lift to replace these systems that once they are in place they stay in place, so big enterprises just go with the market leaders.
-
-
-
-
-
-
-
-
I was terrified when I walked into my office this morning. Almost gave me a heart attack for like 15 minutes, but luckily the fix is simple
https://www.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_error_in_latest_crowdstrike_update/
Boot into safe mode...rename the folder, delete the file, or change the registry key...and then reboot. Sucks for all of the techs that need to do this manually to every single PC in their environment, but at least it's not an overly complicated process.-
-
I don't have BitLocker enabled, so I don't know enough about it, but I found this comment section
https://old.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_error_in_latest_crowdstrike_update/ldvxx62/
-
-
-
-
Just reboot 15 times!
https://i.imgur.com/oZTOk70.jpeg-
Azure https://azure.status.microsoft/en-gb/status
We've received feedback from customers that several reboots (as many as 15 have been reported) may be required, but overall feedback is that reboots are an effective troubleshooting step at this stage. -
from twatter (I won't link it, don't want to give elon hits)
@_aarony
Rebooting 3 and up to 15 or more times is working on a large percentage of machines. It appears that sometimes the network stack is up long enough and crowdstrike update mechanism is able to fix the broken .sys file. Try rebooting over and over and over and over. Seriously. -
This seems like the best fix I've seen so far that doesn't involve physically accessing the machines https://www.reddit.com/r/sysadmin/comments/1e708o0/fix_the_crowdstrike_boot_loopbsod_automatically/
-
-
-
-
-
-
-
Fixed it:
I imagine with the scope and reach of systems being affected that some investigation will happen by some government, intelligence service, or third party. If they find out they are lying about the cause and exposure is higher than they say, I doubt things will go well for the leadership or the programmer/scapegoat they blame it on.
-
-
Never underestimate the power of profit-seeking to cut corners on QA.
This wasn't a hack:
https://arstechnica.com/information-technology/2024/07/major-outages-at-crowdstrike-microsoft-leave-the-world-with-bsods-and-confusion/
-
-
Lol. I am being asked "why didn't Microsoft alert us to this issue". And now I need to spend time crafting a "because it isn't a Microsoft issue, and we don't use CrowdStrike" response that is polite and overly verbose because C-level people. 50/50 chance I need to make a Powerpoint slide deck and present it to the leadership team... about a product we don't even use.
-
oh man, this CrowdStrike ad ...
https://vid.crowdstrike.com/watch/hCVMAuN4BmyU9iGGA2XoQv?
-
-
-
-
-
-
https://twitter.com/US_Stormwatch/status/1814268813879206397 timelapse of US air traffic and the Crowdstrike outage.
-
-
-
-
Sooo my connecting flight in SFO was delayed because they couldn't find a ground crew to tow the plane in once it arrived. Then the flight attendant crew was missing because the flight they were supposed to arrive on was canceled. Thankfully they eventually found a replacement flight crew and we boarded, but then the ground crew that loads the plane was missing. Finally, since it was the last of only two flights to Eugene that day, we hung out for another 20-30 minutes in case there were any stragglers, which honestly seemed pretty decent of them.
All told it ended up being maybe an extra 4-5 hours, I got off pretty easy compared to a lot of people. Lucky that was my only connecting flight.
-
-
-
-
-
i don't think it did? they have an advisory up on their status page, but it very much reads like supporting customers that installed it on their own shit: https://azure.status.microsoft/en-us/status
-
-
this article specifically suggests that they are not
https://www.bleepingcomputer.com/news/microsoft/major-microsoft-365-outage-caused-by-azure-configuration-change/
-
-
-
-
-
I think this is where the confusion came from. It was terrible timing.
http://www.shacknews.com/chatty?id=42510992
-
-
-
I'm completely dead in the water for work. All of the network traffic (even non-intranet) on our work laptops is routed through VPN for security reasons, and the VPN is offline still.
Mobile Outlook/Teams works, but two years ago our work required us to install remote management software on our phones to continue using mobile Outlook/Teams, and I told them they can either give me a work phone or I just won't have mobile access. You can guess which direction they went, lol. So now my only means of updates is via text from my boss.
And yeah, can't access literally anything on my work laptop since it's all routed via VPN. So just sit and wait today (or sit and play games while I wait lol).-
-
-
Yup, lol.
My position was that I had mobile Outlook/Teams installed on my personal device as a convenience to the company. Their reasoning was that if the phone is lost, they want to have the ability to remote wipe it to protect company data. But I don't want some idiot screwing up and pushing the wrong button and wiping my personal device. Also, if they determine it's critical that I have remote availability, they can provide me with a work device.
So I told them they have my phone number and can call or text if they need something urgent, and it's been like that the last two years. It's definitely a hit to the company because it used to be sometimes I'd see an email or teams message after hours and if I wasn't doing anything I might hop on to respond or resolve it. But now it just waits until I'm back the next day. /shrug-
-
Honestly, in the almost 6 years I’ve been here, I’ve had one instance of an unscheduled, off-hours emergency.
Although we're a big company and have some divisions that run 24/7, we have a pretty good culture from the top down that most things can wait until regular business hours. Unless it's something that impacts production, it can usually wait.
Plus, I'm in the software development division. I'm sure our tech ops guys have been up all night dealing with this, but I didn't even find out we were impacted until I tried starting my work day.
-
-
-
-
-
it's very normal to have some sort of controls on devices that have access to work resources. in this case everything is working as intended: they should either provide a device if the person needs to work off hours or not have that expectation. but the expectation that even a BYOD device has controls on it for work resources is not odd
-
Controls within the apps themselves are fine. They don't need full control to wipe my phone if it's BYOD. As I said, they can block access to Slack, Teams, and Outlook trivially. They do not need, nor would I trust any company with, that power over my personal phone, as I have seen companies completely wipe a person's phone after they quit.
If it's a company provided phone sure they can do what they want with it. My personal phone, you can lock down the individual accounts and that's all you get as you don’t need anything more.-
If your org uses O365 it is possible (easy even) for them to wipe your device even without Company Portal or whatever MDM they are using. Simply disabling the accounts should work in principle, but in practice it doesn't work perfectly. Existing emails still remain on the device, for example, and other work-related applications or data could remain.
The Company Portal creates a segmented area on your device for 'work' stuff. If you leave the company they can easily and effectively remove only the work apps and data.
It's really not as invasive as you think. They can see stuff like phone model, OS version, if it's jailbroken and so on. Doesn't show them the list of apps, at least how we set it up. Now, if your org's policy is to completely wipe the device upon leaving, that's not necessary and extreme IMO. Maybe if you're in some sensitive sector I guess, but seems extreme.
Source: IT guy who is working on this exact thing and completely wiped test phones that had no MDM or other connection to the org except for signing in to email.
-
-
-
-
-
-
-
-
What's really crazy is the post on /r/wallstreetbets before the update was pushed:
https://www.reddit.com/r/wallstreetbets/s/Wy22QkZP0W-
-
This guy has things backwards. e.g. "Is CrowdStrike compatible with companies that run their IT systems on premises?" It was built for on-prem before Azure's cloud-based management of Windows, and part of the reason for its adoption was that you could deploy on-prem extremely easily compared to the competition, in minutes, and manage everything from the CrowdStrike console, just as you finally can today in Azure for Windows.
-
-
-
We're unaffected and I thought all my side work clients were unaffected too, but I have one that outsourced "security" to another firm, with me being the go-to for other stuff. Well, their Internet had been down, which isn't unheard of. But then the contact on-site asked if this was related to CS. I quickly wrote back no, but then thought better of it and had her contact the other IT firm to ask. Then I got a call from said firm confirming that yes, they use CS, and that was most likely the reason the server was down.
At first I was thinking, not my problem then, but these dudes are in Fort Collins and the affected office is in Castle Rock, where I live. So guess who felt morally obligated to take care of it.
More like ClownStrike, amirite.
-
-
Lol.. same here. Our company does not use Crowdstrike at all, but it feels like bad juju to update or reboot anything at this point even though we aren't affected. It can wait until Monday. I was supposed to work on another project offsite today too... but fuck that. The ol' SysAdmin side of me is at the office just in case.
Why? Because when you've been in the trenches.. the PTSD of reimaging or updating kicks in.
*Stay strong my fallen IT brothers.. we feel your pain!*
-
-
You can't make this shit up:
In 2010, McAfee released an update to their software that crashed Windows XP systems worldwide and required manual intervention to fix. https://www.zdnet.com/article/defective-mcafee-update-causes-worldwide-meltdown-of-xp-pcs/
McAfee's CTO at the time was George Kurtz.
The CEO of Crowdstrike today is George Kurtz. -
-
-
-
My buddy works for the county and has been at work since 10:00pm last night manually recovering workstations one by one.
Sheriff dept, emergency services, county hospitals, etc were all offline last night. Crazy impact.
He said he’s going to go into a coma after this because he’s been up for almost 30 hours straight. But on the flip side he’s hourly so making bank.-
-
He's told me cyber security has been a massive focus for his county the last several years. Apparently there's been a few government agencies hit with ransomware that has cost them a ton of money, so other government administrations are scrambling to make sure they're secure. I'm guessing they see CrowdStrike as a cheaper alternative to paying a ransom or liability from leaked data?
-
-
-
-
-
-
-
yeah, i know that you don't want to do that with definitions, but you could still do a slightly gradual rollout to avoid situations like this. or at least if you're making changes significant enough to cause a BSOD you'd think that would be a separate gradual rollout. maybe it's something really dumb in the definitions themselves that is causing the bsod though, idk.
regardless, it's probably worth slightly reduced security to avoid like, air traffic halting worldwide. -
-
-
in critical security patching, you actually *do* do a gradual rollout.
think about google chrome, windows updates, apple updates, etc. in any critical security patching, it's actually way more important to slow down the update into cohorts.
1) to make sure the update actually does what it's supposed to do
2) don't take down your own servers while patching the world.
the fact that there was no throttling means the folks at crowdstrike are rookies.
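for anyone who hasn't seen it done, here's a toy sketch of what that cohort gating looks like; the ring names, percentages, and the health check are made up for illustration, not anything crowdstrike (or anyone else) actually runs:
```python
import hashlib
import time

# Hypothetical rollout rings: (name, fraction of the fleet that should have the
# update after this stage). Canary first, everyone last.
ROLLOUT_RINGS = [
    ("canary", 0.001),
    ("early", 0.01),
    ("broad", 0.10),
    ("everyone", 1.0),
]

def in_ring(machine_id: str, fraction: float) -> bool:
    """Deterministically bucket each machine, so a box picked for an early ring
    stays included as the fraction grows."""
    bucket = int(hashlib.sha256(machine_id.encode()).hexdigest(), 16) % 10_000
    return bucket < fraction * 10_000

def fleet_is_healthy() -> bool:
    """Stub health check -- in reality: crash telemetry, sensor check-in rates,
    support ticket volume from the machines that already took the update."""
    return True  # placeholder so the sketch runs end to end

def rollout(machine_ids, bake_time_s=3600):
    updated = set()
    for ring_name, fraction in ROLLOUT_RINGS:
        targets = [m for m in machine_ids
                   if in_ring(m, fraction) and m not in updated]
        print(f"{ring_name}: pushing update to {len(targets)} machines")
        # push_update(targets)  # hypothetical deployment call
        updated.update(targets)
        time.sleep(bake_time_s)  # let it bake before widening the blast radius
        if not fleet_is_healthy():
            print("health check failed -- halting rollout")
            return
```
the exact numbers don't matter; the point is that the blast radius at the canary stage is a few hundred machines you can roll back, not every windows box on the planet at once.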
-
-
-
-
-
-
-
-
-
extreme case: you could imagine that a piece of data is now so big that it ends up accidentally allocated in a paged pool and then later, in a separate thread, minutes or even hours later, accessed in a non-paged context, and now there is a page fault in some OS code path owned by Microsoft that cannot handle page faults. This is probably not what happened, but it's just an example of the complexity of data changes.
-
-
-
-
-
-
-
-
-
-
-
-
-
Thankfully our entire fleet of Windows Servers (~300) wasn't affected; maybe around 30-40 were?
We were able to restore the OSes in about 5 hours between 2-3 of us.
Apparently there are people with 1000s of machines affected, maybe even tens of thousands, who are doing creative things like PXE booting and running embedded scripts to remove the files.
Looks like I'll actually salvage a weekend! -
-
-
-
We've been at it since 11pm PDT last night. Making steady progress. Our Windows admins are getting pretty adept with the Azure "mount the boot vol on a helper VM and edit, then reattach" dance.
I'm a hybrid admin (mostly Linux) so the few places I have Windows admin privs, I fixed those servers in about an hour at 1am. Now I'm just facilitating.
Our CrowdStrike sales rep sent $50 Uber Eats gift cards to each of us as apology / fuel. I almost wanted to reject it.
-
-
https://misskey.io/notes/9vw052ic52zv017x
"Forced reboot almost destroyed my asshole" -
-
-
A deep dive of the root cause from some twitter dude:
https://x.com/Perpetualmaniac/status/1814376668095754753
"Crowdstrike Analysis:
It was a NULL pointer from the memory unsafe C++ language."-
-
-
C++ is hard. Maybe they have a DEI engineer that did this but for mission-critical software like this Crowdstrike should have set up automated testing using address sanitizer and thread sanitizer that runs on every code update.
guy's bio: Google Whistleblower via James O'Keefe . Disclosed Google's "Machine Learning Fairness", the AI system that censors and controls your access to information.-
He was fired because he was racist: https://www.splcenter.org/hatewatch/2014/02/14/old-photos-surface-showing-breitbart-okeefe-hobnobbing-white-nationalists
-
-
-
-
-
-
-
I just rebooked my flight to Japan. It was supposed to be this morning at like 8am. Direct flight with my wife. Even dropped another $600 for premium seating.
Because of the crowdsuck crash they automatically put me on a flight tomorrow at 11pm with a 2 hour layover and coach seats, gave me a $600 coupon to use… eventually.
But they got my wife on a flight today at noon, so she’s already in the air.
We spent 4 hours in line at the airport to deal with rebooking. Airline only had 2 customer service reps.
I won’t believe I’m on this flight until I’m actually in the seat.
-
Take a break
https://i.imgur.com/cFWt2K6.jpeg
-
https://www.crowdstrike.com/blog/technical-details-on-todays-outage/
Logic error when evaluating named pipes. -
(Video with sound)
https://i.imgur.com/sRu0VW4.mp4 -
(Video with sound)
https://i.imgur.com/xvQcXzH.mp4
-