
So, there I was.
It was barely the crack of dawn, like, not even 10:30 in the morning yet. The smell of burned coffee was still wafting down the halls from the break room, developers were getting out of their morning scrum calls, and it was right about then that it happened.
The Food Network website was down. Now, we manage quite a few other websites, and this was our largest: something like 100 million unique users and hundreds of millions of page views. So this site being down was a big deal. I immediately grabbed a couple of engineers and threw them onto looking at what was going on. Right away it was apparent that the site itself wasn't actually down; that was a bit of an exaggeration. No, it was the recipe pages. Well, that's almost worse, because if you're going to Food Network, you're typically going to look at recipes.
Food Network's website is ad driven. If people aren't seeing the pages, we're not making any money. All of those ad impressions are pre-sold, and if we don't turn out the expected number of pages, then somebody's got to get a refund, and I don't want that coming out of my paycheck. So we're doing some digging, trying to find out what's going on. The pages themselves are not actually down; it's really only part of each page, and the pages are sort of corrupted.
All right, let me tell you a little bit about myself first. I'm Brian Saylor. I've spent a lot of years working in the software development field. I've worked at startup companies, I've worked for the government, and about 18 years back I kind of fell into working in the media business. I started out with E.W. Scripps down here in Knoxville, eventually moved over to Scripps Networks, then to Discovery, so I hope everybody is familiar with all the Discovery brands of sites and channels. And of course, as of a few weeks ago, it's Warner Bros. Discovery, so the list of websites, applications, and mobile apps that I'm working with is going to grow exponentially again. I've led software development, I've been a software development manager, and I've overseen multiple teams of software developers. Mostly I work with automation now. But who cares, that's just me. Let's get back to our story.
So we're looking at recipe pages: why are the recipe pages messed up? Not all of them are; it's maybe one in ten. We start digging into those, and it becomes clear after a little bit that it's the comments and ratings pieces. These websites are all modular, with different services feeding different parts of them, and the comments and ratings module is not just missing, it's messed up. Things are all broken, which just makes for a bad experience.
So we call the community team, who manages all of that. All of our things are split up: we've got a team handling search, a team handling the community pieces, a team handling content delivery. The community team looks into their things, and nothing's wrong. "There are no errors on our side. Our system's not reporting any problems, none of the monitors have tripped, no alerts. It's not on our side; it must be on your side."
So our engineers are still digging into what's going on. We start taking the actual server-to-server requests and looking at what's coming back, and by this time it's now 20% of the recipes. It's getting worse. If you pull out one of those calls, there's error handling for things like response codes. We're clearly getting 200 response codes, because anything else would be trapped; there's special handling for that. It's not timing out; again, that would get trapped, and we have special ways of handling it: the comments just don't show up.
No, what we find is that what we're getting back from this JSON endpoint is an HTML page, which tells us, "Hey, this site may not be safe. It's not categorized. Are you sure you want to proceed?" Where is this coming from? This is between two of our internal company services, and yet there's an HTML page. It doesn't come from their system, and it doesn't come from our system. Well, as some of you have probably figured out, this is a security system intended for employees: it checks the sites they're going to and whether those sites are in categories that are okay. But in this particular case it was triggering on service calls that were all internal to the company.
So we did work with the security team, and we got it straightened out eventually. Then we had to flush all the recipe pages again, because we were caching the bad responses. But you can imagine: these developers had now spent a number of hours tracking down a problem that wasn't part of any of their systems. No one knew where it had come from. We track, per minute, when sites are down and when we're not able to deliver our content to our customers. In this particular case something went wrong. Okay, not the end of the world. We dealt with it, but it was annoying.
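A minimal sketch of the kind of guard that would keep an HTML interstitial out of the page cache even when it arrives with a 200 status. The talk does not describe the real implementation, so the function names, the dictionary used as a cache, and the sample payload are all hypothetical.

```python
import json
from typing import Any, Optional


def parse_comments_response(status: int, content_type: str, body: str) -> Optional[Any]:
    """Return the decoded payload, or None if the response is not usable JSON."""
    if status != 200:
        return None
    # Assumption: an injected "are you sure you want to proceed" page is not
    # labeled as JSON, so the declared content type is the first cheap check.
    if "json" not in content_type.lower():
        return None
    try:
        # Second check: the body must actually decode, whatever the header says.
        return json.loads(body)
    except json.JSONDecodeError:
        return None


def cache_comments(cache: dict, recipe_id: str, status: int,
                   content_type: str, body: str) -> bool:
    """Cache the comments payload only if it validated; report whether it was cached."""
    payload = parse_comments_response(status, content_type, body)
    if payload is None:
        return False  # caller can hide the comments module instead of rendering garbage
    cache[recipe_id] = payload
    return True
```

With a check like this, a bad response degrades to "comments missing" for one request instead of being stored and served until the cache is flushed.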
So let's talk about another story. One day I'm just doing my regular stuff, working on my computer, doing email, and my computer starts getting really slow. I start digging into it, and there's a process running that's camping on one of the CPUs on my Mac, burning 100% of it, and using a huge amount of memory. Fine, I don't know what it is, so I kill it. Great, everything's good again, for about 30 minutes. The process spawns back up, camps on the CPU, everything's slow. Kill it again; of course it comes right back. I start asking around the development and engineering groups, and a couple of people report the same thing, but only a few.
By the next day everybody was having the problem. Everybody's computer was suddenly super slow. Take our UX designers: they're running Photoshop, creating mock-ups of pages that are going to get implemented by development teams, creating image assets, and Photoshop is suddenly taking super long to do anything. Things that would normally take them 30 minutes are taking an hour and a half or two hours. Engineering and quality assurance are all being impacted. Now, a few people are on Windows machines and they're not having a problem, but most of the team is on Macs, and they're all being impacted.
So what was going on? We couldn't get rid of the software. In fact, we did track down what the software was and uninstalled it, and 24 hours later it reinstalled itself. I reached out to our desktop support team and walked through the problem with them, and they confirmed: yes, this is security software that's intended to check whether somebody's computer has been compromised and to notify them that something's wrong. But why is it sitting there chewing up all the CPU and all the memory? They start working with the security team, who reach out to the vendor, and the vendor promises a fix.
So there's going to be a fix, and it's going to roll out the next Friday. This has now been a problem for two weeks. Friday comes, the fix rolls out, and it doesn't make any difference. Another thing we noticed while this was going on is that it's writing to the disk, and it's writing several terabytes a day. These are all SSDs on these Macs, so they've got a lifespan of maybe seven years of normal usage. But if you suddenly start writing terabytes of data day after day after day, that drive may die after a year or two. Who's going to pay for the replacement?
The development teams missed their deadlines, because everything was taking so much longer, or they were putting in extra hours to try to finish things. Still no word on how we're going to fix this. Eventually, continuing to work with our desktop support team, they walk us through exactly how to disable the software: not uninstall it, but just have it not run, because if you uninstall it, the software management software will reinstall it. So we worked around it. We provided those instructions to everybody, and they got it off their machines. Now, there was probably a really good reason to have that software doing what it was supposed to be doing, but it was certainly not working, it was a huge inconvenience, and it was making everything difficult to do.
Here's another one. We have a design team who's building an interactive prototype for a new update to one of our products, and they're about to present it to the vice president. They've been working on this for a while. It's still static pages, but the links on the pages work, so you can navigate between the different pages of the application. That makes it semi-interactive, and it's a really nice way to walk through a prototype. About an hour before their presentation, they came to get me. For the last hour they'd been trying to get it to work, and the whole thing had stopped working. It had been working for days. They'd been working on this prototype for two weeks, and suddenly it doesn't work. None of the things they click on do anything, and they can't figure out what's going on. They come down to me, and I look at it.
Does anyone want to wager a guess what happened? DNS? Good try, but no. Windows Update? Oh, I can tell you some stories about that. In the middle of a website launch, yeah, that's a good time for Windows Update. No. Come on, somebody's got it.
All right. Well, you dig in. The prototype is making calls to a back end to pull down the static pages; it pulls a little payload and substitutes it in, all in a React framework or something similar. Those pages are getting intercepted and replaced with "Are you sure you want to proceed? This site's uncategorized." It was working yesterday, it was working the day before. Why today, and why right before the presentation? I had to give them the bad news: I can't do anything about that, but I can tell you who you can reach out to, and maybe they can help. But at this point it's 30 minutes to the presentation. I think they wound up having to call off the presentation they had set up. Eventually they got it straightened out; they worked out something. But you can imagine the stress those engineers were under, trying to get something to work and then finding out that it didn't have anything to do with them. They were sure they broke something, and they hadn't.
Let's take a different team. We had a video streaming team that was building products for streaming, basically TV Everywhere stuff, so a lot of people were using it for second screen, and they had built it so you can run it through a Roku to your television. At this point this is not actually Discovery Plus or that kind of thing, but the precursor to those. They'd been working on those things for the past nine months, and they were going to set up a bank of TVs in the office. So they brought in three televisions, set them up, and just needed to install the applications: an HGTV application, a Food Network application, and a Travel Channel application. Those get installed from a store site, but installation doesn't work on any of the TVs. They spend hours and hours trying to figure it out. There's a networking problem, there's something going on; they just couldn't reach the store.
Well, this was actually the same problem. The store was categorized as games, so the Apple store or the Samsung store, wherever they were going to install their applications from, was being blocked. Meanwhile they'd spent the entire day trying to figure out what was wrong. They reached out to the security team, who were helpful: "Ah, we understand what's going on. We can help with that. Give us the MAC addresses of the televisions and we can get exceptions in for you." At this point the engineers are like, "I don't know what the MAC address is, and I can't figure out how to get the television to show it to me." What those engineers wound up doing was taking all those televisions home, setting them up there, and bringing them back to the office. Fine, problem worked around. But they spent the entire day on it, and the end result was, "Somebody is making my job difficult, and I will just go take this home and do it."
Now again, eventually this kind of thing was sorted out, so that months later, when they were doing more of these, they had a way around it. But again, developers are suddenly having to work around problems that are making their job difficult. Let's try something else.
Does anyone know if Big Al has been moved out of his winter quarters at the Knoxville Zoo yet? I don't know. If you ever get a chance, a 400-pound Aldabra tortoise being moved out to his enclosure is an interesting thing to see. Apparently it takes a lot of patience. But we're talking about a shell.
So, another team worked on internal applications around linear. They're building things like transcoding: taking raw footage and converting it over to what gets sent out to the internet or satellites. There are lots of different formats and lots of editing steps, and there are systems that manage all of that. Some of those sit on computers that have graphics cards in them, because they use the GPU, so they're basically PCs. At one point their monitors went off, saying that assets were stuck at a certain point in the pipeline and were not moving through it. So they began investigating what went wrong. This thing's been working for five years, and suddenly something's wrong. What they eventually found is that, in the shell, the SSH config had been modified.
A bunch of lines had been added for servers they had no idea about, along with the keys to access them. "We don't know what these are, and they're written on top of our configurations." And what they used to move assets between servers with SSH and SCP suddenly broke, because these lines were stuck on top of it. What this was, was an attempt at a honeypot. So of course they said, "I don't know what these servers are. Let's go access them and see; maybe they're part of our stuff for something." And of course then they triggered it.
So that was a great idea, right? The idea was to leave things on people's computers showing access to different systems in the company that are actually there just to get an intruder to go there. If something goes there, then they know your computer's compromised, and they come down with guns or something. But in this case no one knew about it, and no one had tested that it worked with different setups, and it was overwriting configurations that were in use. Some end users may have been affected too, but they would just have found that they couldn't get to one of the systems they were supposed to work on, and eventually figured it out and fixed it. This was not a common problem; you roll this out on a thousand computers. But why this computer? Why was it rolled out to one that was part of an automated system, when it was intended for employee computers? Not sure. But meanwhile, again, one of the production systems was offline for a period of time.
Okay. I had a team that was responsible for maintenance and quality assurance on some of the applications, and there was a third-party group that needed to swap out one of the libraries used on some of our websites. Our quality assurance group wanted to make sure that changing it out wasn't going to break anything. The only way to do that was to put it on the website and try it, and for whatever reason they couldn't put it on our test systems. They couldn't upload the asset, because the asset wasn't actually on our server. So fine, use a proxy like Charles Proxy and substitute it to pull the library from somewhere else. Unfortunately, we had some product people and other people who wanted to use it and weren't familiar with how to do that, and you'd have to set it up for everybody.
I was managing at the time, and I thought, you know what, I can just create a little proxy. You override your hosts file to send traffic to it, and it passes everything through to foodnetwork.com and swaps out just that one file. I wrote that up in an hour or so, they were able to use it, and it solved the problem. But the next day I came in and that server was down. Whatever; I just restarted it. The team hadn't finished with it yet, so they were still going to be using it. I come in the next day, and it's down again. That's weird. So I started digging into it and found that it was under attack. Now, this is a dead simple proxy. It won't proxy just anything; it would only proxy to our one website, and only if you had the right host headers. But somebody was sending requests directly at that proxy, trying to attack it, and it was crashing because it didn't know what to do with them: the host header doesn't match anything, I don't know what to do with it.
So I tracked down the IP address it was coming from. I'm figuring somebody's computer in the company has been compromised, so let's track that back and I'll go notify people. And I traced it back to somebody in the security organization. The security team was hacking one of my systems. Oh, okay. For a brief moment I considered hacking them back. I decided that probably wouldn't end well, and really, I didn't have the time either. So I just updated the application so that it dropped any request it didn't know how to handle, and that solved the problem. Now they can hack it all they want and it will just ignore them. But it was still annoying.
All right. I could talk for another three hours about stories like these; catch me later and I can share a bunch. Some of them are funny, some of them are sad. But let's talk about why. What is the goal of security? The security groups in your company are focused on protection: protecting the company's assets, the data, customer PII, things of that nature. They're about reducing risk, the risk of assets or data being compromised or misused. Software development, on the other hand, is about delivering products and data to customers, about making new and improved products and pushing them out on a rapid schedule. They're often tasked with innovation: coming up with new ways of doing something, how to do it faster, how to do it cheaper, how to meet a customer need that we can't yet figure out how to meet.
So on one side you've got the desire to limit risk, limit access to our assets, limit access to the data. And then you have another group in the company that is trying to share our assets and our data with customers, because that's what we're getting paid to do; that's where the company is making its money. In most cases, not all, the software development group is driving the company's value: this is the product the company is selling, or the data it's selling, or whatever. And that puts these two groups at odds. One is trying to limit access and one is trying to push access, and if we stop the one group from pushing that access, we don't make any money, and then the company goes out of business.
So who's the bad guy here? Is it the security teams? I've worked with a lot of security people one-on-one, and they are great people, and they are interested in protecting the company. Is it the software development team, who continue to develop and push boundaries and are, in some cases, creating some of the vulnerabilities that the security team is trying to patch up? They're good guys too. What they are doing is what allows the company to make money.
So really, there aren't any bad guys here. They're just two groups of people who are trying to do their jobs, and they're in conflict.
So what can we do to help? How can we make this better? I've got a couple of points I'd like to talk about. Let's start with number one: think first. What could go wrong? Who could be impacted? Identify those things. Whatever change you're making, whatever system you're putting in place, somehow it can go wrong. Who would it affect when it does? How bad would it be? Can we test it? How do we validate it? How do we check for the ways it could go wrong and make sure that's not happening? Can we put it in standby mode? Can we put it in front of beta testers in different groups to try it out?
Can we recruit people to actually do the testing? Having the same person implement something and test it usually works poorly. That's something we learned in software development long, long ago: you don't want the developer testing it. But what we see in a lot of cases is that the same principle isn't being applied elsewhere. Recruit some people to help test who can look at it from the outside.
Communicate. And let's be honest, this has got to be the hardest piece. Let people know what changes are coming. Let them know what the impacts could be, what the potential problems are, and who to contact when something happens.
Those things bring the other groups in and make them feel like their problems are understood, instead of things being dumped on them that they then have to deal with. Just knowing about it ahead of time means that when something comes up, they're like, "You know, this could be that thing they said last week was rolling out."
But again, of these three, I'd say communication is probably the most difficult. It always is. And I work for a media company.
So let's go a little beyond that. What I want to see is the security groups and the software engineering groups working as partners. Help them with training, help them understand problems, communicate with them, recruit those people to help test. Identify super users in those groups and say, "Hey, we're going to make a change. Can we roll it out to a select few people first and have them try it out?" And then, when something goes wrong, can we roll it back? If the teams are working together and partnering on it, things will go much, much better. When security, or anything else, is imposed on a group from the outside in an authoritarian manner, those groups try to fight back against it or work around it, and what they do to work around it just makes the security vulnerabilities worse. The engineers who really cared about security vulnerabilities start working around the controls, because they still have to get their job done, instead of trying to integrate security in the right way. They're like, "Fine, people are just smashing us with this."
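The "select few people first, then roll it back if something goes wrong" idea can be sketched as a small gating function. This is only an illustration under stated assumptions: the names `in_rollout` and `should_enable`, the pilot-user set, and the percentage knob are hypothetical and not taken from any tool mentioned in the talk.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and admit the first `percent`."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def should_enable(user_id: str, pilot_users: set, percent: int, rolled_back: bool) -> bool:
    """Decide whether a new control applies to this user."""
    if rolled_back:
        return False  # one switch turns the change off for everyone
    if user_id in pilot_users:
        return True  # recruited super users always get the change first
    return in_rollout(user_id, percent)
```

Starting with `percent=0` limits the change to the recruited pilot group; raising it widens the audience gradually, and because the bucket is derived from a hash of the user ID, the same people stay in the group from one check to the next.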
I've seen a lot of conference talks where people stand up and talk about their individual use case and how that must solve the problem for everybody. So let's just stop here and point out that that's not the case here. Everybody's situation is different. At some of the companies I've worked at, it's about delivering products quickly. We measure the time from ideation to the time it's delivered to the end user, and we measure that in hours: from when a product person says "this is what I want to do" to when it's actually in that application in the customer's hands.
In the past, and at other companies, that's measured in months. If you've got a three-month turnaround time, you've got plenty of time for things to go wrong and to get them fixed. But when developers are being told "we expect the change you're working on today to be in production tomorrow," a four-hour delay throws everything off. Like I said, it depends. If you're working in a financial institution, or in the military, it may be "we're fine with a much slower pace, and we want to, and can, prioritize security over speed." So just bear that in mind.
All right, I'm going to do one more story. I had moved out of software management and into other things; I was doing DevOps and working on automation. The security team had reached out to the development organization for the websites we were managing and asked for some help. While I wasn't managing software development at the time, they asked if I would step in, because I got along well with security people and could probably best understand what they were asking for and figure out what we needed to do to help them.
They wanted to talk about web application firewalls. Ah yes, WAF, familiar with it. At the time we had our own custom version sort of built into our applications. It was home built and worked pretty well, but it's not the same thing as having a professional product. I had previously pitched rolling out one of the high-end firewall systems to management, and they were not comfortable spending the money at the time. Okay, fine. So I let our security engineer know all of this, and they're like, "Well, we want to use this one." It's exactly the same web application firewall I wanted to use myself. So that's awesome, but I wasn't able to sell the cost last time. They're like, "No problem, we've already paid for it." Okay, well, then I can sell that now.
So what do you want to do? "We want to know how to work with you to roll this out." Okay, let's talk about a plan. Can we provide some training to some of the engineers and quality assurance people in the group on how to work with it and how to test it? "Great, we can do that." Let's set that up. How do we actually roll this out? How do we test and validate it? Let's make a plan.
So I worked with them and laid out a plan. First, change a configuration that would not affect our production systems, and have some people go through and make sure that just directing traffic through the web application firewall would not have any effect. They're like, "Oh no, it won't make any difference. The WAF won't do anything; it'll be turned off." But we've made a change, and something could go wrong, so let's make those changes and then test them. If that works, we can push it up to production and put the web application firewall in standby mode, so it'll execute its rules and just log anything it would have blocked. Let that run for a couple of weeks. Then have some engineers go through that log and look for things that need to be let through. Our biggest fear was it triggering on legitimate traffic that just looks really suspicious, and we certainly have plenty of cases where we've written stuff that looks suspicious even to ourselves. They were fine with that. Once we've gone through it and found any issues, work with the security team to modify the rules to fix them, go back through the process, and check again. Once everything's all good, push it to production.
That was the process we went through, and it worked great. The engineers were happy, the site was good, we put a new level of security in place, and the security team helped us out. This was an example of the two teams partnering together.
All right, I'm going to end on that note. But like I said earlier, I have lots of other stories. You can come catch me later if you want to hear more, like the security team getting mad at me while I was trying to stop an active security breach, or what I might have done to retaliate against the security team for hacking into my systems. All right, thank you. And I just want you all to remember that I hate you all. [Applause]
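Going back to the standby-mode step in the firewall story: reviewing two weeks of "would have blocked" entries by hand is easier if they are first grouped by rule and path. The sketch below assumes a hypothetical log format in which each entry is a dictionary with `action`, `rule_id`, and `path` keys; a real WAF product will have its own log schema, which the talk does not describe.

```python
from collections import Counter
from typing import Iterable


def summarize_would_block(entries: Iterable[dict]) -> list:
    """Count would-have-blocked entries per (rule_id, path), most frequent first."""
    counts = Counter(
        (entry["rule_id"], entry["path"])
        for entry in entries
        if entry.get("action") == "would_block"  # hypothetical field and value
    )
    return counts.most_common()


# Made-up sample entries, purely for illustration.
sample_log = [
    {"action": "would_block", "rule_id": "sqli-1", "path": "/search"},
    {"action": "allow", "rule_id": None, "path": "/recipes/123"},
    {"action": "would_block", "rule_id": "sqli-1", "path": "/search"},
    {"action": "would_block", "rule_id": "xss-3", "path": "/comments"},
]

for (rule_id, path), hits in summarize_would_block(sample_log):
    print(f"{rule_id} on {path}: {hits} would-be blocks")
```

A rule that fires heavily on a path known to carry legitimate traffic is a candidate to tune with the security team before the firewall is switched from logging to blocking.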