Updated: Jan 24, 2021
Below is my interview transcript with Timothy Chen talking about the impact of AI and machine learning on cloud infrastructure automation. The original interview can be found at the below link.
Mohamed Ahmed: All right. Hello, everyone. Welcome to one of our first episodes in Looking Ahead. In this one, we're going to have a quick chat with my friend Tim. I'm going to get shortly into the background about Tim and we're going to get started. Hey Tim.
Timothy Chen: Yeah, glad to be here.
Mohamed Ahmed: Good to see you. All right, cool. Cloud infrastructure has undeniably changed how we write and operate software. But there's an exponential increase in complexity, and this is basically going to cripple our innovation very soon if we do not rethink the tools and systems that we have. In this episode, we're going to chat about how adding some intelligence to cloud infrastructure management and operations can take us to the next wave of innovation. So, entrepreneurs, startups, enterprises, and research institutions have been rushing to run their infrastructure on top of the cloud. Software development and operations models have been constantly challenged at a faster pace and the complexity keeps increasing. The tools that we use right now to manage the software lifecycle definitely allowed us to move fast, but we still have to deal with lots and lots of complexity.
Let's get started discussing those tools. But before that, let me just introduce Tim in a proper way. Tim is a managing partner at Essence VC with a decade of experience leading engineering in enterprise infrastructure and open source communities and companies. Prior to Essence, Tim was the SVP of engineering at Cosmos, which is a popular open-source blockchain SDK. Prior to Cosmos, Tim was the CEO of Hyperpilot, which was a deep tech company in the enterprise infrastructure space. The company later exited to Cloudera. Prior to Hyperpilot, Tim was an early employee at Mesosphere and Cloud Foundry. Tim is also active in the open-source space as an Apache member. Tim, you've done a lot in the cloud infrastructure space and, as we were chatting about a few days ago, you've moved to both sides of the table as an entrepreneur, engineer, and now as a VC. We definitely look forward to hearing from you.
Timothy Chen: Trying to at least.
Mohamed Ahmed: All right, cool. Let's start with this. What is your take right now on the state of cloud infrastructure? Where do you see the problems and bottlenecks that engineers and companies are facing these days, from the early stages of just writing code to deploying it, running it, and maintaining it on cloud infrastructure?
Timothy Chen: Yeah. Well, I mean, right off the bat, that's a huge topic, actually, right? Because when you think about what cloud is, the definition of a cloud, I feel like, has not really been changing, but people's perception and how they adopt cloud infrastructure has actually become more and more mature, right? They're actually using the cloud for more and more things. In the past, the people using the cloud were mostly going to be startups in the very beginning. They were just using pure EC2 VMs running somewhere in the cloud, and I still had to do everything myself. I'd SSH into every node, I still had to do provisioning, figure out how to put the OS on, how to do all the configurations, right? The cloud providers are getting more and more sophisticated because they are getting more and more mature. At the same time, everyone from startups all the way to enterprises is getting so much more comfortable, right? It's a mandate for a lot of people to actually move to the cloud.
It's interesting, like, moving to the cloud means so many different things to a lot of people, actually. Moving to cloud infrastructure is no longer just give me a bunch of VMs to run in the cloud, right? It's actually all around.
Cloud-native has been a very hot buzzword for a while, right? Which means I'm going to be able to write my applications, manage my applications, test them, deploy them, and even do much more sophisticated things that haven't really been adopted in the mainstream yet, which we can always talk about. There's so much more in between that's moving. Basically, the mindset is moving toward being able to fully run my whole IT in the cloud at some point. The whole development and testing, the whole SDLC, shipping and updating everything in the cloud, and leveraging the cloud to do everything for me, from states to releases to everything.
Cloud infrastructure is so much more complicated now. It's trying to become this big gigantic toolbox for everyone, where I can actually put my code in there, I can use lambdas now, I can ship containers now, I have Kubernetes as a service, and I have all these different varieties of options. I now have multiple clouds to pick from. There are a lot of things going on. There's the infrastructure layer in general, there are applications, and everything in between. This space is just taking off. It's funny to still say cloud infrastructure is moving fast because the cloud has been there for 20 years or something like that. But it still feels like we're in some kind of infancy stage because not every single enterprise is fully on the cloud. That kind of gives you an idea that it's actually pretty complex to really get there. Since we're on this intelligence topic, I mean, there are a lot of things I think a lot of people really need right now. I mean, we can kind of stop there. I don't know if you're saying particularly [crosstalk 00:06:29].
Mohamed Ahmed: Got it. Let me dig deeper. The main topic is that it's getting complex, right? And getting complex means a steep learning curve, slower progress, and an array of other problems that we may have. But let's zoom in a bit deeper and maybe take the major stages of the software development and operations lifecycle. When it comes to writing code, what are the complexities that you see right now that the cloud model introduced, and how is this impacting engineers and companies and their ability or their desire to move faster? Do you think the cloud made us move faster? Slower? In what sense? What are the risks there that may actually slow us down the road?
Timothy Chen: Yeah, I think the cloud definitely has improved a lot of things but also introduced new complexities, though. That's probably always been a problem, but not as complicated, or not a disproportionately big problem. Like if you talk about how people are using the cloud today, we're in 2020 now. 2020 is the cloud-native days. It's no longer just give me five VMs and I'm going to SSH into each one of them and try to manage them on my own. This is Kubernetes; this is becoming the standard more and more so. Containers are the way you should ship and deploy things. I've tried to fast forward all of that to today because there's enough to talk about even just today.
If you're working on a new sort of platform or service, regardless of whether it's a larger or smaller company, if you have the choice to start fresh, like you don't have any legacy constraints, you're most likely going to be putting your code into a Docker container, right? Because that's probably the standard way to ship any code today. So, I'm going to create a Docker container, and then I have options. Now, once I have my service, I actually need to think about different things. Do I write some microservices? Do I have five or 10 of them? Do I need to put the containers into some cluster? Do I use Kubernetes or not? Kubernetes is probably the default choice, to be honest, today. There are some other options, but for the most part, if you're starting fresh with no legacy strings attached, you're probably going to pick Kubernetes today.
There's complexity because I think people that are getting into the cloud world today are most likely learning not just what Amazon EC2 is anymore, or what Amazon's or Google's or Azure's services are. You still need to get the basic stuff, which is: what is a VM? How does it work? Where do I store my sites? How do I even get a machine? How do I actually allocate the different things that I need, like load balancers and my cloud storage? There's the storage piece, there's the compute, there's the networking, and there's everything in between, like the security models, all of that. Those are the things you have to get in your head first. Then the next step is, okay, I'm going to try to do it the 2020 cloud-native way, which is putting everything into containers. I need to learn Docker. I need to learn how to push and tag images. I need to learn Kubernetes.
Kubernetes has its own suite of all these things. I can't treat Kubernetes like I have my VM anymore. I don't even know where my app is going to be deployed. So, I need to make sure it's resilient. It has HA, it has pods that can move around, right? If it's relying on databases. There's a lot of complexity at each step that you need to actually get ramped up on. I think that's a lot of complexity, actually, for a lot of people. That's one step. It's just getting closer to even knowing what to do, to even have something that's functional and closer to what Kubernetes was designed for, which is: I want applications to be able to run in a way that I don't have to care which individual servers they're running on. It's a declarative model. It's able to say I want these services, I want to have five of them running, I want them to be reachable by this IP and port, and for anything that can fail, I want some sort of backup. That concept needs to be tested, along with all the things in between. I think that's definitely the number one thing to think about, just learning all of those.
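To make the declarative idea concrete, here is a minimal Python sketch of a reconcile loop: the operator states how many copies of a service should run, and the loop starts or stops instances until reality matches. It is only an illustration under simplifying assumptions. `DesiredState`, `Cluster`, and `reconcile` are hypothetical names invented for this example and are not Kubernetes APIs.

```python
from dataclasses import dataclass, field


@dataclass
class DesiredState:
    """What the operator declares: how many replicas of a service should exist."""
    service: str
    replicas: int


@dataclass
class Cluster:
    """Toy stand-in for a cluster: maps a service name to its running instance ids."""
    running: dict = field(default_factory=dict)
    next_id: int = 0

    def start(self, service: str) -> None:
        self.running.setdefault(service, []).append(f"{service}-{self.next_id}")
        self.next_id += 1

    def stop(self, service: str) -> None:
        self.running[service].pop()


def reconcile(desired: DesiredState, cluster: Cluster) -> int:
    """Start or stop instances until the actual count matches the declared count.

    Returns the number of actions taken (0 means nothing needed to change).
    """
    actual = len(cluster.running.get(desired.service, []))
    actions = 0
    while actual < desired.replicas:
        cluster.start(desired.service)
        actual += 1
        actions += 1
    while actual > desired.replicas:
        cluster.stop(desired.service)
        actual -= 1
        actions += 1
    return actions
```

Calling `reconcile` repeatedly is safe: once the running count matches the declared count it does nothing, and if an instance disappears, the next call replaces it. The caller never says which server an instance lands on, which is the point Tim makes about not caring where things run.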
Mohamed Ahmed: Absolutely. You actually implied an interesting point, and let me also get your thoughts on that, which is the complexity for DevOps versus the complexity for developers. Which one of these do you see as hit hardest by the complexity that we're referring to?
Timothy Chen: It's a complicated issue because, to be honest, developers and DevOps used to be... well, DevOps are trying to get closer to being like developers. We call them SREs, system engineers, right? People that really look after systems and whose only job is to keep the cluster running. I mean, we defined more, smaller roles because things move faster that way. The DevOps movement was trying to make everything much faster: trying to run operations like a developer, building the tooling, building automation, all of that. Now developers and DevOps are in some ways not that different, in that you're also trying to build a lot of tooling to help operate. The larger the company you are, the reality is that you're more likely going to have more separation of concerns. I'm more likely going to be throwing my application code over to the other side, call it DevOps.
If you're in that situation where I can really just take my code, ship it somewhere else, and not really deal with it that much, then... I think the SREs will maybe have a tougher time in the beginning because operating all these new things is quite complicated. You have to learn a lot of new concepts, there are a lot of new things that can fail, and it's moving fast. Kubernetes ups its version every three months or so, Docker's upping its versions all the time, and all the tools in between that you're using are changing rapidly. So, it's not static, where I'll learn one tool and I can be good for a year or two. You're really learning a lot of things. That is not easy for a lot of SREs because you have to learn all of that in a rapid manner. Otherwise, you'll feel like you're left behind because those tools are moving so fast.
Developers, it's not like they have the easiest time. They might have it a little bit easier, but it's still a jump to knowing how to write my app for containers. I still need to learn what a container is. I can't just hard-code IPs anymore. I can't hard-code hostnames anymore. The things I'm used to doing are not that easy. I need to really separate a lot of concerns and do all that. I think there are probably more ways you can get by doing some simple things as a developer. But as an SRE, you want to make sure things are running all the time and also make sure they're running smoothly too. So, you have a lot more burden on your hands to make sure it's actually smooth sailing. And security, all of that, is really hard, right? I think SREs have a harder time in general.
Mohamed Ahmed: Absolutely. I could see this actually from my own personal experience. Now, let's maybe focus a bit on the interaction dynamics between the devs and SREs. Yeah, now the SREs have more to manage. And I guess the developers, now more than a decade into the DevOps movement, also need to stick their hands more into the infrastructure.
Timothy Chen: Yeah.
Mohamed Ahmed: How do you see the dynamics actually moving forward? Let's take a snapshot of your perspective right now on that, and on the interaction dynamics between developers and SREs moving forward if we just continue the same way we are right now.
Timothy Chen: You know, there's always this running joke in the software industry and even among VCs. It's like a trend that keeps going back and forth. There are the bundling effects and unbundling effects, and that cycle keeps going back and forth. When things are unbundled, someone is going to try to bundle them together because they feel the pain of being unbundled. Once they're bundled, you feel the pain of bundling. Dev and DevOps are kind of that way to me because I feel like at the beginning of the whole Kubernetes cloud-native... because I think in some way cloud infra now is really cloud-native. It's been sort of the default trend at least, you know. Unless you have legacy concerns or legacy infra, a lot of people are basically going with Kubernetes.
If you're on that train and you're just getting started, you're most likely not going to say, DevOps, go learn by yourselves, devs, go learn by yourselves, and let's meet once in six months to figure out what we learned. They're most likely going to be partnering together in the beginning. It's hard to separate devs and DevOps if you're using all these tools together because you need to develop your tools with Kubernetes in mind. You have to map your code, the things you write, and also how you're going to run all your dependent services, onto Kubernetes, and also figure out, oh, does that fit? How do we actually figure out load balancing, service discovery, or security? All this stuff that my application needs has to be figured out as well.
I've been seeing a lot of DevOps and devs sitting with each other very closely in the beginning just to try to figure everything out. That's kind of a partnering journey you have to make. Dev and DevOps aren't that far apart. You kind of have to partner really closely just because there's so much complexity in this whole journey. But of course, when things mature, which is what we saw in the VMware days, it's like, yeah, I have a team, it's running fine. I'm mostly just making sure things are running great and I might add some additional tooling, but it's not like I'm re-platforming every three months or so, or six months or so. So, that is probably going to be... I think we're probably going to get to a stage where Kubernetes becomes a lot more stable for a lot more people, and everyone kind of learns how to run a cloud-native shop, from system management all the way to applications. But right now, I feel like it's still the early days. So, they're sitting closer than ever.
Usually, what I see in larger companies is that you have your IT team that's managing all the infrastructure in IT, but they usually partner with developers and a new DevOps team just to be able to figure all this out.
That has been a very consistent pattern so far. It's like, this greenfield Kubernetes thing, we're going to have a brand new initiative and try to collaborate and go from there. I think it's never going to be a straight, fixed line. It's going to keep swinging back and forth for the most part. But I think the reality is, the larger you are, the less developers want to care about infra. That is probably true. Like if I'm writing front ends, I don't want to learn Kubernetes at all. The less I have to do, the better; I feel like it's easier on my side. A lot of developers in larger companies do have that mindset. Basically, just stay in my lane.
Mohamed Ahmed: A friend of mine at 500 Startups was actually joking about that when I was explaining Kubernetes and cloud-native to him and how that technology is trying to help engineers and companies move faster. He was telling me it's just like déjà vu. We've been going through those cycles since the 80s and 90s. The mainframes, desktops, client-server architecture, and then the cloud. It's all about just trying to save money, move faster, do a couple of things, as you said, and we're still doing that again. Why do you think that we're just going through that pattern over and over again rather than figuring out a completely different model that will just solve some of those problems from the get-go?
Timothy Chen: I mean, there are different fields in this world that are very complex but don't change every year. Like if you're a doctor, you're not learning, oh, humans have evolved. Now, from last year to this year, the change is like 50%. Let's all go relearn how the body works. There are a lot of mysteries still, right? But it's not just changing all the time. Software, by default, has been built on foundations that are basically a rapidly changing field altogether in the first place. Every standard we have, you know. I mean, there's IP, TCP/IP, all this stuff that we built the foundation on is not changing that much, but it could still be changed. We're in this space where a lot of things were built without really foreseeing all the complexity that we'd have in the future. We have a lot more servers, we have a lot more data, we have a lot more, you know, budgets, we have a lot more of everything now. It's been designed with one sort of frame of mind, but that frame quickly breaks in a couple of years. So, we're just redoing all of this.
That's why we see companies like Google, Microsoft, and in some ways Google has been living in the future, right? [inaudible 00:21:01] They've seen the complexity, seen this sort of scale. So, they've adopted a way that's kind of like what we call the containers way, right? Facebook and a lot of these companies, if you've seen, all of them had kind of moved to containers like 10 years ago. So, we're kind of living out that journey where complexity is still growing the same way we're looking at. Like the amount of data, the model application. We never had microservices before. We never had a startup running 50 services. That complexity just keeps growing all the time, and so does the amount of data we need. Everyone's using AI now. Everyone's doing some kind of AI and all. It seems like every startup will soon just be running some kind of AI stuff too. That complexity will just keep growing. I think it's a matter of fact that the needs of our apps and the needs of everything else beneath that are just changing every year or so. We're making new [crosstalk 00:22:01] all the time. It's just too hard to have [crosstalk 00:22:05].
Mohamed Ahmed: Makes sense. And actually, that's a really nice perspective. Now, since you mentioned the future, let's talk a bit about the... Okay, we talked about the complexities. What is your take in terms of what we should do? What are the different ways that startups, innovators, and engineers in companies can work around some of those complexities? There's some, of course, short-term practical stuff that we can do right now. Yes, understood. But let's just think even beyond that. How should we really rethink our tools, rethink how infrastructure is being run, provisioned, and so on? And of course, we want to also touch on the software development part if you see any relevance.
Timothy Chen: Yeah. You're talking more like just how do we handle the complexity of adopting cloud-native overall or?
Mohamed Ahmed: Yes, either in the adoption or in just understanding the model, right? It doesn't have to be specific to that unless you see this as the biggest risk at this moment. Whatever you see relevant.
Timothy Chen: I think it's certainly complicated because the tooling has moved quite fast. What we view as a standard today probably can change and will change next year. I think the reality is just going to be like everyone has to sort of actually... You know, everyone's journey to cloud-native, therefore, is very different. Because maybe I do have a lot more at stake, like I have a lot of live production users. I can't just change things that easily. Or I have 50 to 100 teams, or I have a lot of interesting, different requirements. Everyone's journey will differ quite a bit just because that's the constraint that you're playing with. I think everyone has to start somewhere. Basically, I think if you look at this as one big complexity and you want to eat it all at once, you're never going to really be able to do it successfully. You're going to have to go one piece at a time, and everyone's going to pick different places to start, I think.
But containerizing your software, your apps, is usually the first place for everyone. I think that you kind of have to take it one layer at a time. You don't want to make a big change, right? Because how do I actually get my apps to just run in containers? Right? That's already a big change for a lot of folks, right? By default, that means I can start to break apart some big applications that have really bad characteristics when things just fail, right? I need to break them into smaller services. You're going to do one step at a time. We're just talking about really the basics, right? We haven't really got to any sort of ways to even help things be run or developed much faster or smoother. I don't want to give out the advice where it's like, okay, to deal with complexity adopt these three tools? I think that is three ways. Because you know, everyone's complexity is so different and we just have to take one step at a time. I think the easiest way has always been, yeah, think about one problem you want to solve with new tooling, something that has been one of your big bottlenecks where you can actually see a win, and start from there. Yeah.
Mohamed Ahmed: Got it. If we maybe try to think a bit differently and see if we can use something like the AI to help us solve those problems. And I know that each company or each team will have its own complexities. But there is some sort of an overarching or kind of patterns or some sort of problems that are generic enough that can be solved with tools that can help everyone move faster.
Timothy Chen: Yeah.
Mohamed Ahmed: Engineers moving faster in general in their journey, either to become cloud-native or in writing the... you know, just the initial stages of writing their code and making it more resilient, more cloud-native ready. Any thoughts on that?
Timothy Chen: Yeah. Well, there's so much to do even before we touch AI. You know, because I think the reality is all the tools are so nascent today. Kubernetes is still a fast-moving piece of software, and all the ways people develop, operate, and manage are still... like if you look at the default Kubernetes dashboards that they have, right? It's showing you information but not really showing you the crucial information that you need at the moment, right? And that's what's shipped by default, right? You know, any tool takes a while to really mature and takes a while to even figure out what the way to help people is. I'm super excited about AI. At Hyperpilot, the company I started, we were putting machine learning into infrastructure and Kubernetes as well. I can see it as a superpower that you can add on top of any sort of problem. But at the same time, any superpower comes with responsibility. In some ways, it doesn't solve all problems. It can also actually introduce new problems.
If we talk about AI in general, AI is also still considered an early field, right? It's been there since the 60s and 70s, but new techniques and new ways to do AI keep changing, right? Deep learning is the biggest new trend, right? And deep learning is just affecting everything. It's affecting not just applications and services, it's affecting AI research too, right? Now you can put deep learning into all the traditional ML methods to make them go faster and better. But the whole point of deep learning is all about how I can avoid having to think of all the predictable patterns myself. And in some ways, if I have a lot of data to train on, I can automatically figure out the patterns and have much better accuracy. But you know, with any AI, there's always the issue of trust and the issue of accuracy, right? Trust comes with accuracy, basically. AI is not always right. An accuracy number of 80 or 90%, that's considered great.
Mohamed Ahmed: Yeah. I would actually like to dig into that part, if we first dig deeper into the solution, you know. So, it's interesting that you tried to use AI to optimize the performance, right, and certain configurations of infrastructure and the apps running on top of it.
Timothy Chen: Yeah.
Mohamed Ahmed: Now this is definitely an interesting area and as you said, it's basically trying to cover where there are lots of possibilities and I cannot really run through all of these and I need to get the answer very quickly in a reliable enough way. What are the other areas that you think that AI can be applied to that has also the same kind of problem attributes in the infrastructure space or in software development for cloud infrastructure?
Timothy Chen: There's the famous Marc Andreessen line, software is eating the world, right? I think that's been said before, I mean, [inaudible 00:29:38] or somebody, right? It's like AI is basically eating the world too. I think AI is slowly eating the infrastructure world as well. Like if we talk about what problems? I feel like every problem possibly has an AI solution, but you can't just cookie-cutter it, like just throw the same model everywhere to solve every single problem. Infrastructure has always been one of the harder places to apply AI because infrastructure by default needs to be stable, you know.
If I'm building my applications on top of your platform and it only works 80% of the time, then I, as a person depending on your platform, need to figure out how to mitigate that 20% failure myself, which is a lot of times what cloud is causing people to rethink in their applications. Being part of cloud-native is like, my machine can fail anytime. My container can die anytime, right? Therefore, we built in a lot of these patterns like StatefulSet, load balancers between different things, health checking everywhere, right? Migration, this and that. It's just because we know that by default the infrastructure is not going to be up all the time. Great. And then AI comes in. There are so many problems. We can enumerate every single problem that you see in infra, at every stack and every layer, and there have been people researching all kinds of ways to put AI into it, all the way from your networking, your storage, every flash controller you have on your computer, the CPU figuring out which cores to run on, scheduling this and that, picking what configurations to even set on your kernel, and picking every configuration you need on your stack [inaudible 00:31:30]. How do I route my network across my data centers? You know, everything in between. There's AI you can put in everything.
Because anytime you have complexity and a choice and some probability, you can probably apply AI, basically. There are definitely more and more vendors. Well, AI is in every startup pitch I've seen so far. That has been consistent. There are a lot of vendors who are putting AI into their software stacks, into their infrastructure. One example I think of right away is Harness. They're doing CI/CD. And one big part of their pitch is, we have AI. Because if you know what CI/CD is, it means anytime I update my code, I go through a rigorous test pipeline, which runs all my tests, and I immediately go deploy it somewhere, right? I don't have to wait. There might be a manual gate, but for the most part, CD really means going straight to production sometimes. Or at least to staging. You know, [inaudible 00:32:36] and I yet to go to CD, but that's probably the future, right? It's going straight into production. How they use AI is basically trying to figure out, with my new version versus my old version, right? Does it really give me any statistically significant difference in terms of the performance metrics that I care about? Say the new version, over the portion of traffic it has seen so far, is deviating by more than a standard deviation from my P90 latency. So, maybe I should consider rolling back to my last version. You know, some sort of way to learn how each version has impacted things.
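The version comparison Tim describes can be sketched in a few lines of Python. This is a simplified illustration, not how Harness or any other vendor implements it: the function names, the nearest-rank percentile, and the two-standard-deviation threshold (`max_sigmas`) are all assumptions chosen for the example.

```python
import statistics


def p90(samples):
    """90th percentile of latency samples using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def should_roll_back(baseline_ms, canary_ms, max_sigmas=2.0):
    """Flag a rollback when the new version's P90 latency drifts too far.

    The new version is considered degraded if its P90 exceeds the old
    version's P90 by more than `max_sigmas` standard deviations of the old
    version's latency. Needs at least two baseline samples.
    """
    threshold = p90(baseline_ms) + max_sigmas * statistics.stdev(baseline_ms)
    return p90(canary_ms) > threshold
```

In practice, a pipeline would call something like `should_roll_back` on latency samples collected from the slice of traffic routed to the new version, and a learned model would replace the fixed threshold with one that adapts to each service's normal variation.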
Because there's learning in different ways. There's learning about the metrics that you're exposing through your applications and everything in between, or there's even like... I kind of lost my train of thought. But there's a lot of complexity when it comes to just knowing how to read metrics overall and how to understand which changes had an impact. AI can be everywhere.
Mohamed Ahmed: Yeah, I agree. I mean, you can definitely apply AI to so many different things, and I think you also mentioned that criterion. If there's any kind of probabilistic, you know, kind of decision making on top of metrics or some basic stats, you can plug AI into that area. But if we just abstract this part, forget about where AI can be applied specifically, which areas do you think represent the biggest bottlenecks in the space right now? Then maybe we can say okay, maybe there might be an AI for that, or maybe we can solve it in a different way. What is the biggest bottleneck? What is the biggest problem that you see from your point of view?
Timothy Chen: I think actually AI can really solve a lot of problems, to be honest. But it's just like, is it really the top-of-mind problem for you, and are you willing to spend a lot of time and energy to solve that? Because AI takes time to train, takes time to prove, and it takes time to come up with a strategy to figure out how to make it work. The problems that we worked on at Hyperpilot were pretty interesting in a way that... one thing I definitely thought about right away is like, hey, it's so hard to configure all the things you use, from the software you use to the hardware you use, everything. There are a lot of choices that we actually don't bother making at all. If you think about it, a very normal stack people will write today is probably going to be like, you know, I pick a Node.js server running a front-end application, and that's connected to a database called Postgres, right? And you know, for the most part, what most people do is they package it, they find some default thing. They find a default configuration online, or somebody might have tuned something last year so they reuse it, and yeah, ship it. If you think about all these choices you're ignoring and all the choices you're just using by default, you don't even know why. All the way to [inaudible 00:35:59] have you picked, and what are all the numbers you picked in between, in your Node.js server configurations, your Postgres configurations? What Kubernetes configurations have you picked? Is it going to be three CPUs and two gigs of memory? Is it five CPUs and four gigs of memory? Is it five containers or three containers by default? If you have an autoscaling group, when should it autoscale? Should it autoscale at 30% CPU or 60% CPU? There are a lot of numbers and choices.
To me, AI obviously can apply everywhere, but what I like about the configuration part of things is that there's so much low-hanging fruit: if you configure even a few things, it can change your performance by up to 3X, at least sometimes. And we see a lot of these examples, like people tuning JVMs, people tuning Postgres, right? We don't really bother tuning because in many respects it's good enough. Sometimes it's good enough to start with, so that's fine. It's not that using defaults will crash and burn all the time. But there's always going to be an inflection point. People always wait until it's like, wow, why have I been buying so many servers? Why am I getting such bad performance? What should I do? The only thing people know to do is use a new programming language or a new framework. They don't even bother tuning the first one.
I think AI can actually help quite a bit because it's very complex. Humans can't really model every single tunable knob here. So, you might as well consider using some way to tune your software automatically. Because it has a huge impact in terms of your cost and performance. That, to me, has been a very good entry point for using AI to tune your infra. Because it makes a huge difference that a human just cannot make. Doing it manually, I mean, it's possible, but you're going to have to search so many different things. So, that is one good place where I feel like AI can be really powerful, right? Because it gives you immediate results, and there's not much to lose either. Because once you have a good enough configuration, you just ask, okay, does this software still work or not? If it does, and it actually improves quite a bit, it should just work. Yeah.
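As a rough illustration of the tuning problem described above, the following Python sketch enumerates a small configuration space and benchmarks only a random 5% of it. The knob names, their values, and the `measure_cost` formula are all hypothetical stand-ins invented for the example; a real tuner would deploy each candidate and measure it, and would likely use a smarter search than plain random sampling.

```python
import itertools
import random

# Hypothetical tunable knobs; names and values are illustrative only.
SEARCH_SPACE = {
    "cpus": [1, 2, 3, 5, 8],
    "memory_gb": [2, 4, 8, 16],
    "replicas": [3, 5, 7],
    "autoscale_cpu_percent": [30, 45, 60, 75],
    "db_pool_size": [10, 20, 50, 100],
}


def all_configs(space):
    """Enumerate every combination of knob values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def measure_cost(config):
    """Made-up stand-in for a real benchmark run; lower is better.

    In practice this would deploy the configuration and measure
    latency, throughput, or spend.
    """
    capacity = config["cpus"] * config["replicas"]
    shortfall = max(0, 12 - capacity)  # penalty for under-provisioning
    spend = 0.5 * capacity + 0.1 * config["memory_gb"] * config["replicas"]
    pool_penalty = abs(config["db_pool_size"] - 20) / 10
    scale_penalty = abs(config["autoscale_cpu_percent"] - 60) / 15
    return 10 * shortfall + spend + pool_penalty + scale_penalty


def random_search(space, fraction, seed=0):
    """Benchmark a random fraction of the space and keep the best config seen."""
    configs = all_configs(space)
    budget = max(1, int(len(configs) * fraction))
    sample = random.Random(seed).sample(configs, budget)
    best = min(sample, key=measure_cost)
    return best, budget, len(configs)


best, tried, total = random_search(SEARCH_SPACE, fraction=0.05)
print(f"tried {tried} of {total} configurations")
print("best found:", best)
```

Even five knobs with a handful of values each produce hundreds of combinations, which is why trying every one by hand is impractical.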
Mohamed Ahmed: I'm curious to know your personal story behind that. How did you come up with that possible solution and decide in the end to found Hyperpilot? Did you go through some experience that made you think about that problem and the solution for it using AI?
Timothy Chen: In any systems research conference, this has been a research topic for decades, right? So, there's nothing new. Like, it is really nothing new. There's a lot of preexisting literature on this kind of thing. It's always so specific. Like, either it's a big company like Microsoft coming out with a paper saying, I'm tuning this very specific thing in this specific scenario, or I'm doing this and that. I think it has just been the complexity too. It's just way too hard to even employ something like this because originally it's all math. It's a lot of math, and there are no preexisting libraries; people are actually spending time building all this. So, yeah, it all comes from systems research really.
My co-founder is a Stanford professor. He's been in this research area for a while. I met him at Mesosphere. He was our director of research. He was actually looking at this, how to bring a lot of the things he built at Stanford into this whole software infrastructure. That kind of gave me exposure as well. But yeah, it's definitely nothing new. I think it's just a matter of how do we take a lot of these research ideas and make them into production-ready software? It's really hard, because there are two worlds you really need to know quite well: the research side, the math, statistics, and ML, plus the systems, right? Because if you cannot reliably deploy this in production and trust it won't crash and burn on someone, nobody will ever want to try this, right?
Mohamed Ahmed: Absolutely.
Timothy Chen: So, that's like a very hard combination for anyone to really bring together, yeah.
Mohamed Ahmed: Yeah. One of my professors during my PhD told me that research in computer science and engineering is roughly five to seven years ahead of the industry. And again, I think the reasons that you just mentioned are the ones that keep academia a bit ahead of the industry. Ideas and research might look attractive and applicable, but once you bring them into the reality of complex systems running in cloud infrastructure, it's a much harder problem to apply them. But once you apply them, the gains are definitely huge. The ROI is huge. Let's now assume that AI gets used in some of those systems. If we try to project the dynamics between DevOps, the developers, and maybe other roles in the organization or within the team, how would AI make things easier, or more complex, or faster? What do you envision there?
Timothy Chen: Yeah. It's always fun to kind of like hey, what would 2025 or 2030...? You know it's like going to Disney when we were little, right? There's like this futuristic...
Mohamed Ahmed: Oh my god!
Timothy Chen: ...Florida Disneyland. There's like this futuristic... what will the bachelor people live like?
Mohamed Ahmed: But here's the reason for asking that question. The reason is, now we see that every time the technology becomes more and more complex, there is more and more human interaction that's needed. And the whole idea, the whole promise of technology, is to make us more productive. And part of being productive is being in sync together on the same page, moving at a fast pace in terms of developing applications and running them, but not being crippled by the slowest person, or by gaps or discrepancies between different team members. I wonder if AI would have a contribution to that. If AI would allow us to move faster as humans in terms of our interactions and in terms of our understanding of reality, and to make some progress here. That's what I'm referring to, more than something really far ahead in the future. It could be more about how we envision the role of AI, or what, from your perspective, is the right way to roll out AI inside teams and organizations to make some of those problems less severe.
Timothy Chen: If you look at how people are employing AI in the infra world today, I think the most obvious and probably the easiest place that people are applying it right now is on your monitoring data. There are so many monitoring dashboards and stats from everything, right? If we go talk to any enterprise, they're probably already playing with some idea of doing some sort of simple way to just sift through the noise. Like a lot of fatigue or whatever these people... there are a lot of problems people describe that infra people are facing. There's a lot of complexity, yes. But the harder part usually is there's just so much noise coming from these systems. There are logs and numbers from everything underneath your app. Like, it's just way too much data. Just even looking at logs is impossible, right? And the logs are just a small part of what's available. Like, your whole OS just dumps out potentially millions of data points for you to be able to observe. Like, what do you do? And you have so many clusters, you have so many apps, you have so many layers in between. They all can give you all kinds of numbers and things and signals, and it's just way too much data. Picking a problem that humans just cannot do well, but that has a lot of potential, is usually a good place to start. Because replacing humans is going to be a very hard job, and I don't think it will be possible, right?
I think taking over the low-hanging fruit, the low-effort, repeatable tasks where it's really hard to make mistakes, is one thing AI can do quite well, and sifting through the noise, because there's just way too much of it for a human to parse easily, is another. Those are two good places to start. Monitoring data has definitely been one of them. I think that has definitely been one trend. If you look at any IT monitoring system like Datadog or any vendors out there, they've been introducing this idea like, I'm going to give you some kind of AI. I'm going to give you a prediction. Like, hey, if these lines keep going up this way, I think you might end up in a state where your memory leak is way too high. You're going to actually crash and burn something. Or, you know, for the last alerts that happened, what is the last thing that happened in your system that most correlated with that problem? And try to figure that out as well.
So, just trying to find the signal that really matters. That has probably been one of the easiest and most impactful uses. But it's still not always easy; it's actually really hard. Because there's so much data. So, people would really just pick and choose a vertical, pick and choose some number of metrics, and really try to do that well. But you still can't trust them 100% of the time. At the end of the day, a human is always involved, and designing for humans to be part of that process as much as possible is going to be pretty important. I think that is definitely one big place where AI has really been introduced in the infra world.
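The "if these lines keep going up" style of prediction mentioned above can be sketched with a simple least-squares line fit. This is only a minimal illustration under assumed inputs (a list of `(minute, memory_mb)` samples and a hypothetical `limit_mb`); it is not how Datadog or any particular vendor implements forecasting, and real systems would need to handle noise, seasonality, and restarts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need samples at two or more distinct times")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def minutes_until_limit(samples, limit_mb):
    """Estimate minutes until memory crosses limit_mb if the trend continues.

    samples is a list of (minute, memory_mb) pairs. Returns None when
    memory is flat or falling, since no crossing is predicted.
    """
    xs = [minute for minute, _ in samples]
    ys = [memory for _, memory in samples]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing_minute = (limit_mb - intercept) / slope
    return max(0.0, crossing_minute - xs[-1])


# Hypothetical memory readings that grow by 10 MB per minute.
history = [(0, 1000), (1, 1010), (2, 1020), (3, 1030)]
print(minutes_until_limit(history, limit_mb=2000))
```

A human would still decide what to do with the estimate, which matches the point that these predictions cannot be trusted 100% of the time.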
Mohamed Ahmed: I think you're definitely referring to AI Ops in general. How can I use AI to simplify my operations, maybe in monitoring and handling live site incidents? One of the things that I personally noticed is that contention usually goes up between developers and DevOps in such live site incidents just from trying to figure out where the problem is, right? Is it an infrastructure problem? A resources problem? Is it an application problem? Is it a bug? Is it expected behavior? And just trying to distill that and understand what is going on sometimes creates a lot of contention, either during the live site incident or in the post mortems that you go through after that. And if AI can be applied in a way that helps those roles or teams work more in harmony and makes it easy for them to figure this out, I see a boost in the team's productivity if done right. That's why, when I personally think of AI, I don't only think of how we can make things faster, bigger, or whatever metric you have in mind; it's also about the interaction between the engineers. Is it really helping us to have a more fulfilling job? Are we really feeling that we're making progress in general or not?
Timothy Chen: Yeah.
Mohamed Ahmed: So, on the role of AI, I hope that as we are building the next generation of tools and systems, whether with AI or any other kind of technology, we consider that factor. Because that's important, right? Now we're able to move much faster on the cloud, but are we really having the work-life harmony or balance that many speak about a lot, right? It's now much tougher than before, especially with all of us working from home. So, the human factor is really important. I'm not sure if you have any thoughts on that.
Timothy Chen: You know, when you're talking about all this, I feel like what's really interesting to talk about is what a future AI Ops will look like. Because that to me is very intriguing, because I don't think we have actually seen it played out. We're in the earliest infancy of AI Ops, right? Like, people don't even know what AI Ops really is for the most part. All we know right now is that there's an AI on monitoring data and it's giving us some suggestions, right? That's kind of the only understood part for most of the IT industry now. And that's just really scratching the surface, I feel like. If we want to go there a little bit, I think what is super interesting is, what can the future of AI Ops look like?
I think there's a huge opportunity. Timing is debatable obviously, but if we just forget about timing, what can AI do in this whole IT world? I think we're going to see at some point that AI is not just giving you suggestions; AI is actually going to give you actionable things that would change the behavior of the clusters, and start to work all the way back. Because right now AI is being applied at the end. I already configured everything manually, I already deployed every application manually, I've gone through all the things and it's running, right? Now I use AI to tell me bits and pieces of information about what I can do. You know, it's going to start working its way back, because besides giving us recommendations, it's going to actually act on them. It's going to go change something, it's going to go do something for you. But I think before we even get there, it's going to creep down into more and more systems, and into how we automatically run a lot more experiments in our clusters for you.
One thing I definitely see LinkedIn, Facebook, Google, and all these larger players doing is, hey, I have a hard time hand-tuning everything, manually tuning everything. So, can I run experiments in production? Run five versions of the same software configured differently, with the same traffic teed into multiple replicas, and just see how they behave? Only one of them is actually the true application. But you just observe the behavior by copying traffic and copying the same app but doing different things. And you can apply this to anything, right? You can apply this to your JVM tunings, your database tunings, different instance types; you can actually run experiments in production through AI. Because AI can help you in a lot of different ways. It could actually give you the most optimal suggestions, so you don't have to try a huge snowflake of choices, right? Can I actually just try 5% of the choices to make near-optimal suggestions?
And so, I think we're going to get to a point where AI is going to run, you know, there's already the concept called chaos engineering, right? Which is like, we're going to run failures in production. We're going to unplug your power plugs in production and just see what happens, right? They call these controlled experiments, which is like, I'm going to try to make sure I'm only introducing a little bit of failure, and I have this blast radius boundary set up so it doesn't try to kill everything, but we want to make sure that it still works. And the reason why I'm doing it in production is because it's the only representation that's close enough to what's actually going to happen in production. It's really hard to replicate with synthetic data and traffic. It doesn't really represent the real world. If chaos engineering is going to do more field experiments just to learn how your system behaves in failures, we're going to see a lot more experiments.
Timothy Chen: All right. I think from chaos engineering doing field experiments, we're going to see a lot more tuning experiments happen as well. We're going to run four or five experiments and see which one sticks. Then, all the way at the end, we're actually going to see AI controllers able to just change things in real time in your systems for you. It's actually going to be able to route traffic in real time. It's going to be able to create more containers or change, in all different ways, how the operating system behaves based on different scenarios and situations, right?
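To make the blast radius idea concrete, here is a small Python sketch of a controlled failure experiment that caps how many instances may be disrupted. The function names, the in-memory `fleet`, and the `inject_failure` and `is_healthy` callbacks are all hypothetical; a real chaos tool would call into an orchestrator and real health checks, and this is only a simulation under those assumptions.

```python
import math
import random


def pick_targets(instances, blast_radius, seed=None):
    """Pick instances to disrupt, capped at blast_radius (a fraction in (0, 1])."""
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    limit = math.floor(len(instances) * blast_radius)
    return random.Random(seed).sample(instances, limit)


def run_experiment(instances, blast_radius, inject_failure, is_healthy, seed=None):
    """Inject failure into a bounded set and check that the rest stays healthy."""
    targets = pick_targets(instances, blast_radius, seed)
    for instance in targets:
        inject_failure(instance)
    survivors = [i for i in instances if i not in targets]
    return {
        "targets": targets,
        "system_ok": all(is_healthy(i) for i in survivors),
    }


# Simulated fleet: "failing" a node just records it in a set.
down = set()
fleet = [f"node-{n}" for n in range(20)]
result = run_experiment(
    fleet,
    blast_radius=0.1,  # at most 10% of the fleet may be touched
    inject_failure=down.add,
    is_healthy=lambda instance: instance not in down,
    seed=42,
)
print(result)
```

Rounding the limit down means a very small fleet yields no targets at all, which errs on the side of not killing everything.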
Mohamed Ahmed: Got it. Now, last part, let me just ask you to wear your VC hat and let us just talk a bit about the startup world in general. What is your take in terms of the startups that you see in this space? We talk obviously in an abstract way about areas that are promising. But do you see now startups that are emerging to maybe solve some of those problems using AI? Do you have a general kind of advice to entrepreneurs and innovators in the space in general just to pursue or solve those problems in a more practical way without going too much into just daydreams or [inaudible 00:54:51]?
Timothy Chen: Yeah. I mean, that is hard. I mean, with a VC hat on, this has actually always been a struggle from an investor's point of view, because I think a lot of technology is great, right? It can have a lot of impact, but it's sometimes really hard to understand. Technology on its own can be good and bad sometimes. It can actually cause some harm and will cause a lot of confusion. Because software is never perfect, right? It's like AI. If I have software that can ultimately save you 50% of your money with no changes at all, I just flip a switch, everyone will flip that switch. It's like if I could lose 5% of my fat just by doing one thing that has no side effects, no price, nothing. Everyone would make that choice. There's always a tradeoff.
I think a lot of teams sometimes don't understand what that tradeoff is. They only think about things from a technical point of view. That if you trust me enough, I trust my ability, I trust my background, I trust that I can solve any problem I see from a technical point of view. But they don't really understand that businesses don't view technology that way. Businesses don't look at shiny new techniques for doing a lot of these fancy things. They see it as a risk all the time; especially if we go to infrastructure, everything is a risk. Any change is a risk. Actually, any new version of code is a risk already, let alone doing anything crazy. Risk management is by default the number one job: any SRE is trying to make sure things don't fail, because I have SLAs, I have responsibilities, I have a salary I want to keep. I don't want to get fired for introducing some bad things and just not thinking clearly. I think one challenge for anybody who wants to apply AI in infra is to have the ability to really put themselves in the customers' shoes. To really be able to know who they are, how they think, and how they view your product and your solution, in a much more intimate way. Because I think it's a really hard space. This is a lot of systems ML. This is not just taking my simple scikit-learn and applying it to my data. No, it's actually a really hard field because you have to do a lot of things.
The people that are attacking this space tend to be more research-oriented or very technically inclined founders. But at the same time, they have the least amount of experience sitting in a Fortune 500 SRE's or head of infra's shoes. That world is just completely different. They're thinking about not getting fired, thinking about just doing the minimum to make sure I don't cause chaos, right? And so, there's a tension that any startup needs to figure out, like, how do I talk about my product? How do I introduce my product? How do I make it work and look like it's not really going to cause problems? Because AI is not going to be 100% accurate all the time, so then what problem can I start with, and how do I really build and sell something that seems like it's solving huge pain points, right? It's a top-of-mind problem, but it doesn't cause big crazy failures. And how do you guarantee that? Because people are not going to take chances.
Mohamed Ahmed: Yeah, absolutely.
Timothy Chen: There's this big problem, and I think any company that wants to get into this space has to figure out like... I think the way to mitigate that is really, you kind of have to solve a problem. A very specific problem. Like, if I say my AI is going to solve every problem for you, like every IT and performance problem or every IT cost problem, that's not going to fly. You really have to solve a specific problem. Because you have to have a very specific way to mitigate risk for them. If it goes south, do you have data that's even available to tell you that in a very quick way? Because one of the things I learned going through the Hyperpilot journey is that most monitoring systems aren't that reliable, and they're not that accurate either. People don't give you sub-second intervals. It's usually minutes. So, when things fail in between, you don't know it and you react too slowly. There are a lot of different failure patterns in between. So, unless you can really control the environment, it's really hard to make things do the right job.
I've been more excited about systems that come in a closed, more managed form, let's say. And not like an AI that you just plug into your existing systems and that's somehow able to know everything; that's the future. Maybe we'll get there. The best and easiest way is like, don't even think about "I have AI" at all. I'm going to run something for you. I'm a managed SaaS or I'm a managed PaaS or I'm a platform that can run something for you, some work or some exact task, but I'm going to manage everything end to end. So, if it fails, it's my responsibility to fix it, not yours, right? Then I feel at least you're on the hook. I'm not on the hook, right? I feel more comfortable at least trying it, right? Maybe not completely trusting it, but at least trying it. Because you're on the hook, you're going to make sure everything you do there works. That's good.
I feel like that's probably the easier way to get people to trust AI in general: you have a full end-to-end experience, and you're hiding the AI that helps you run that platform and do that job much better and more easily. Instead of saying like, hey, AI put into... you know, it's like, I don't know, some crazy solution put into water and it's just going to fix everything. No, it doesn't work that way. That's actually one of the hardest parts: how to get humans to trust it a little faster, right? Do you have a reliable system, or can you design a way to sell this so that they don't even care that it's AI? I don't even care what's behind it. All I know is you're running this for me and it's 10X faster and 10X cheaper, right?
Mohamed Ahmed: Yeah, absolutely. Maybe we can close with a small story. You know, I also tried to do the same thing for cloud infrastructure, and I faced the same risk-averse teams of engineers, which is completely understandable. What I did was interview a Tesla owner. Because Tesla has been talking about the autopilot, right? This is the thing. You give the machine learning or the AI control of your car and your life, right? If it's running on the highway. I asked my friend, "When did you really start running the autopilot for the first time? What did you do? What kind of risk did you have in mind, where you wanted to really make sure that the worst scenario would not happen?" He told me, "I started using the autopilot two years after owning the Tesla car. That's number one." It really took him a while to do that. Then he said, "I'm not going to use it while my family is with me in the car. I'm not going to use it while on the highway." And he said the first time he used it was at the Canadian border, in a long bumper-to-bumper queue, and then he used AI. If you think of it, that was a low-risk situation. What was the worst case? It's going to just bump into the person in front of him. He can stop it really quickly. He has full control, and if it just misbehaves, there's not much that can be lost. This is important for anyone who is trying to really think about AI.
As you also mentioned, it's important to be focused, but at the same time to derisk it for your users as much as you can. Even with a company that has a ton of resources, large teams, and AI, and that has way more training data for the AI than anyone on the planet right now, people are still not feeling comfortable using it in high-risk situations.
Timothy Chen: Yeah. I mean, this happens anywhere you have trust. We usually have trust to the platform, right? There's always going to be people that are willing to try it first and people that will wait until everyone else has tried it. Even when it comes to like the cloud, right? In the early days of the cloud, people did not even want to run anything sensitive at all in a cloud or trust your vendors who take our data. But hey, it becomes an industry norm. Like, hey, if those five people already trusted it, it makes no sense I don't, right? I know maybe there will be a point where everyone just uses Tesla autopilots or at least some kind of autopilot by default like I'll be dumb not to. If I'm going to be the one that fails, lots of other people will