Senior Developer Advocate for Generative AI at Amazon Web Services
Dec 1, 2022 56 min watch
Run a Hybrid Cloud with Amazon ECS Anywhere
I delivered this talk at AWS re:Invent 2022, alongside Cam Mac, Head of Product at Ocado Technology. This talk does a deep dive into the concept of hybrid cloud, and explains how Amazon ECS Anywhere works to help you orchestrate your own hardware using a cloud control plane. Cam Mac tells his own story of how Ocado Technology uses ECS Anywhere for grocery fulfillment.
You can watch the recording, download the deck, and read the transcript with slides below.
Transcript
Nathan Peck: Today I want to talk about building a hybrid cloud environment at the edge with Amazon ECS Anywhere, and we’re going to go through basically why and then how.
Nathan Peck: So let me start out by talking first about what hybrid cloud is. And I want to present it as the basic concept of your hardware combined with the power of the cloud. So for example, in this diagram, you see, you could have an on-premises server, you could have a point of sale, you could have even a vehicle that’s internet connected, or you could just have Internet of Things devices: I’ve used Raspberry Pis before. And these devices run your code, run your application, but you want more. You want to be able to use the power of some of these AWS cloud services as well, like EC2, AWS Fargate, or even AWS Lambda, and benefit from the cloud as well. Now, there’s a few reasons why.
Nathan Peck: First reason I want to talk about is capital expenditure investment. When you’ve invested in on premise hardware like a data center, you’ve already spent money on that and you want to get a return on your investment. So even if you do have a plan to move to the cloud or you like cloud services, you’re never going to want to just abandon your on premise data center before that thing actually breaks down. You want to get your value out of that.
But capital expenditure investment can be a dual edged sword there because sometimes you don’t want to spend more money. If you reach a point where your data center is not sized big enough, you’re starting to think, “Am I going to really want to spend a bunch more upfront money for on premise servers?” Maybe I want to make utilization of the cloud just to make sure that this growth and capacity requirement is going to last before I invest in buying for on premise in the long term.
Nathan Peck: Another reason for hybrid cloud is compliance requirements. In any industry that has sensitive data, for example healthcare data, or even an industry that operates in European regions, they have the General Data Protection Regulation, which requires that data be kept in the same physical region as the end user, even down to the country level in some cases. So you may have a requirement that your data be handled in a certain way or kept in a certain location, which requires you to have a hybrid approach if AWS doesn’t have infrastructure in your particular locale.
Nathan Peck: Data gravity and proximity is another reason. Sometimes the data is just so big, so heavy, and there’s so much of it that moving it to and from the cloud is not feasible. We’ve seen this in some clients that have just absolute petabytes of data on premise. Maybe it’s even old tapes or something like that. And you’re not necessarily going to want to digitize all that and move it to the cloud.
Alternatively, you have the requirement to work with that data on premise. For example, a video rendering studio. If you are building a big budget film these days, you’re probably using a 3D rendering studio and the artists are working with the asset files that could be gigabytes in size, maybe even tens of gigabytes in size, and they’re working with that locally on their desktop. So they have an asset store they need to grab, pull in those textures, pulling those 3D models, run test renders, and they do that locally inside the studio.
But then they also want to be able to render segments of the finished film and they want to render those quickly. So now we’re able to combine the power of the cloud for rendering the end product with the ability to keep data and assets and do some local processing as well.
Nathan Peck: The last benefit of hybrid cloud I want to talk about is consistent operations. So there’s many different ways of doing hybrid cloud and the easiest and first way you might think about it is just, well, I’ll treat cloud as one thing, and I’ll treat the on premise operations as a different thing.
But the reality is having a hybrid cloud gives you a fantastic ability to use consistent operations. You can actually have the same API locally as you do in the cloud, and you can create a system that allows you to move workloads between one and the other at will with minimal extra complexity. And we’ll talk about how that works.
Nathan Peck: Now, with the concepts of hybrid cloud introduced, we need to talk about why containers and there’s a few key benefits of containers I like to talk about. The first being velocity.
Containers increase your ability to deliver an application quickly by giving you a pre-built base to start from. For example, installing your runtime. I was thinking back a couple of months ago and I realized it had been several years, probably five to seven years, since I bothered actually installing Node.js or Python directly on a host, because now everything is inside of a container. The host operating system is just an empty shell into which I’m able to bring a pre-built container image and benefit from the velocity of someone else having prepared that container image for me, and I didn’t have to set up any installs or create a system for deploying software onto that host, I just brought my container image.
Reduced risk is another important reason for containers. First of all, containers automate builds. So we know that automation reduces your risk of things breaking later on. And additionally, when you have things automated with container build, it allows you to create reproducible builds which leads to increased quality. So normally if you’re running things by hand or you’re running your own build scripts, you’re likely going to make mistakes or there’s going to be edge cases that happen in the delivery of software to an end computer which are hard to reproduce and hard to handle. So the advantage of Docker containers for delivering software is that you build once and then you’re able to deliver that pattern to as many computers as you want. And that one delivery will either succeed or fail once, rather than trying to say, “I’m going to pull this off of NPM”, “I’m going to pull this package in and install it”, “I’m going to pull this binary and install”, from all these different sources.
Then last but not least, I want to talk about operational excellence. Containers allow you to focus on delivering your business logic rather than focusing as much on the hardware and the operational aspects. So you can have a strong division, you can have developers who just think about the code and provide the container and then other operations people that focus on providing the platform that executes that container on demand as needed. And quality ends up being better for both in many cases.
Nathan Peck: So containers give you this application artifact that works everywhere. You can use it from local development on your laptop, you can use it on premise server, and you can also use that same container image in the cloud because it’s a standard format.
It’ll run the same in all these places, reliably and reproducibly.
Nathan Peck: But when you’re doing a containerized deployment, there’s one more piece that you need if you want to have fantastic containers, and that is a container orchestrator.
You see, when you have one container, let’s say on your local development laptop, life is good. It’s very easy to spin up that container, rebuild it, restart it as needed. But what happens when you have ten containers, 100 containers, 1,000 containers?
Now, you run into an issue of keeping track of all these containers. How are you going to know if one of them crashes and needs to be restarted? Or what if the demand for a particular application rises and falls over time? You need to scale that up and down.
How are your end clients who are using your system going to get traffic to these containers as you have a large number of them? So this is where the orchestrator helps out.
It allows you to set up a high level command. And that high level command could be something like, “I want to run ten copies of this container, I want to register them into this load balancer, and I want to scale the number of containers up and down dynamically according to CPU utilization, and scale up if the CPU goes over a certain threshold, let’s say 80%.”
So what’s going to happen is orchestration is going to tie all this together. Orchestration is going to talk to your compute to launch your containers and the number of containers that you require. It’s going to be watching over those containers. If the containers crash, it’s going to restart them. It’s going to gather metrics from the containers and respond accordingly. So if CPU utilization goes too high, it’s going to launch more containers and it’s even going to reconfigure other resources either on-prem or in the cloud in accordance to the list of containers.
For example, a load balancer, you don’t have to go into a load balancer and by hand add different container IP addresses to the load balancer anymore because the orchestrator is going to be doing that for you on the fly as containers are stopped and started.
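As a rough illustration of what an orchestrator does with a high level command like that, here is a minimal Python sketch of a reconciliation step. It is not how Amazon ECS is implemented: the `Service` class, the `reconcile` function, and the container naming are all hypothetical, and only scaling out (not scaling in) is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical desired state for one service."""
    desired_count: int
    cpu_scale_out_threshold: float = 80.0  # percent, the example figure from the talk
    running: list = field(default_factory=list)  # IDs of running containers
    load_balancer_targets: set = field(default_factory=set)
    _next_id: int = 0

    def launch(self) -> str:
        """Pretend to start a container and return its ID."""
        self._next_id += 1
        container_id = f"container-{self._next_id}"
        self.running.append(container_id)
        return container_id


def reconcile(service: Service, crashed: set, avg_cpu_percent: float) -> None:
    """One pass of the control loop: compare actual state to desired state."""
    # Forget containers that have crashed.
    service.running = [c for c in service.running if c not in crashed]
    # Ask for one more copy when average CPU is above the threshold.
    if avg_cpu_percent > service.cpu_scale_out_threshold:
        service.desired_count += 1
    # Launch replacements or extra copies until the desired count is met.
    while len(service.running) < service.desired_count:
        service.launch()
    # Keep the load balancer's target list in sync with what is running.
    service.load_balancer_targets = set(service.running)
```

Calling `reconcile` repeatedly with fresh health and CPU readings is the whole idea: the operator states the desired outcome once, and the loop keeps converging the real state toward it.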
There’s a variety of tools that go into the container services landscape. And I want to show some of the ones that you’ll hear about again and again as you go through different container sessions here at re:Invent and as you look through the documentation.
The first core concept that a lot of people need that I’m going to talk about is application networking. Service discovery and service mesh: How do you actually know where containers are running? If you have a large pool of capacity, a bunch of VMs or a bunch of EC2 instances, or a bunch of physical servers in a data center rack, how are you going to know what is running on which server and what address to use when you want to talk to a particular container? So AWS Cloud Map is an in-cloud solution for that problem.
There’s also management. I talked about the benefits of using an orchestrator. Well, there’s two different orchestrators you can use: Amazon Elastic Container Service and Amazon Elastic Kubernetes Service. Amazon ECS is the main one we’re focusing on today. But you should also know that there is an Amazon EKS Anywhere which provides a Kubernetes deployment and distribution that’s designed to run on premises and be compatible with what we have in the cloud. There is a very fundamental difference in architecture, which we will discuss later on in the slides. But you should know that both exist.
I want to talk about the hosting layer. So on the cloud side, there’s multiple options for how you want to host your containers. Amazon EC2 is going to be the one that provides the cheapest way to say “give me this much CPU and memory capacity for my application to run”. But AWS Fargate provides a more hassle-free, easier experience to natively think about containers and not have to think about VMs anymore. And the reason why this is important is when you’re dealing with EC2 instances, you have to choose how many EC2 instances to run.
AWS Fargate allows you to think just in terms of the container. So I want to run ten containers, give me ten containers, whereas with EC2 you have to think, “Do I have enough EC2 instances to run ten containers? Or do I need to run a certain size of EC2 instance to host ten containers on?”
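The capacity question you face with EC2 (or with your own hardware) is simple arithmetic. A minimal sketch follows, assuming CPU is expressed in ECS CPU units (1024 units per vCPU) and ignoring the overhead of the operating system and agents. The function name and the example sizes are hypothetical.

```python
import math


def instances_needed(task_count: int, task_cpu: int, task_memory_mib: int,
                     instance_cpu: int, instance_memory_mib: int) -> int:
    """Return how many identical instances are needed to host task_count tasks."""
    # A task must fit on both dimensions, so the tighter one sets the limit.
    tasks_per_instance = min(instance_cpu // task_cpu,
                             instance_memory_mib // task_memory_mib)
    if tasks_per_instance == 0:
        raise ValueError("instance is too small to host even one task")
    return math.ceil(task_count / tasks_per_instance)


# Ten tasks of 0.5 vCPU / 1 GiB on instances with 2 vCPU / 8 GiB:
# CPU allows 4 tasks per instance, memory allows 8, so CPU is the limit.
print(instances_needed(10, 512, 1024, 2048, 8192))
```

With Fargate this calculation disappears from your side, which is the point being made above.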
Then the other component is the container registry. Once you’ve built that container locally, where are you going to store it and how are you going to get it into the cloud, so that these cloud services can actually make use of it? Amazon Elastic Container Registry is a solution for that.
Now that was the cloud side of things, but there’s also the hybrid side of things. So this is bridging the gap between the cloud and what you have on premises or on your own hardware.
You’re going to see AWS Outposts listed there first. This is a very cool one. If you swung by the Expo Center earlier, you may have seen some AWS Outposts hardware, which is very cool to see. I always love seeing those beautiful servers sitting there on the shelf. Think about AWS Outposts as AWS hardware inside your data center. With that you get an actual rack, or a server to slot into your rack, which is the same hardware that you would expect to be using if you were provisioning an EC2 instance in the console.
Amazon ECS Anywhere is your hardware but just being orchestrated by AWS. So you’re not actually buying or renting any physical hardware, but you’re providing your own hardware. And we’re providing management to help you utilize that hardware on premises.
Amazon Elastic Kubernetes Service also works with Outposts. And there’s also that Amazon EKS Anywhere that I talked about before.
AWS IoT Greengrass is a very cool service for managing containers specifically for IoT devices, or sensors. The manufacturing industry loves this one. We also have seen this adopted in certain connected cars, vehicles… farmers use this. Basically it’s for devices that may go up and down in terms of connection. Sometimes they’re connected, sometimes they’re not connected. Maybe it needs to store some data while it’s disconnected and then upload a lot of data when it reconnects. That’s a great use case for Greengrass.
Then Snowball Edge. This is a physical hardware device that you can ship back and forth between an AWS data center and your on-premises. Snowball has a variety of different sizes. It’s all the way from a small device that can essentially fit in an envelope, which allows you to ship terabytes of data up and down to the cloud without paying the same ingress costs or networking bandwidth constraints. But it also provides rugged computers which have a bunch of CPUs, graphics cards, memory in it, that allows you to essentially run an EC2 on premises. People use this at festivals, they use it for shooting film on location, things like that, where they need to have the same power of EC2 and certain AWS services but brought with them to a rugged location. And this plastic cage is going to keep that computer safe. When you fill it up with data, you might just send it back to AWS and say, “I want you to stuff all this data into an S3 bucket so I can start processing in the cloud.”
Nathan Peck: Now I want to dive a little bit into the focus of this talk, which is Amazon ECS Anywhere and how it works.
Nathan Peck: So here’s that spectrum that I showed earlier, so you understand where Amazon ECS lives. If you think about capacity on AWS, you’re going to see it ranging all the way from AWS regions down to your own hardware with Amazon ECS Anywhere. As you go from left to right on this diagram, it’s going to get more specific and closer to you or your end customer.
An AWS region, think of it as gigantic buildings full of computers that are located in one particular location, usually far from a city center. There’s going to be multiple buildings that are separated by geographical distance. This is a full service installation that has all the AWS services in it. It has a ton of capacity, which allows you to go elastically up and down in terms of how much capacity you’re utilizing.
Local Zones are smaller data centers that are located inside of city centers. For example, there’s one in Los Angeles, and there are others in certain cities that we’re deploying to. And this is more for when you want to be closer to your customer for establishing, for example, your own point of presence or content delivery network that’s closer to your end customer.
AWS Wavelength is applications being deployed onto hardware inside of a 5G network. So this gets even closer to your end customer right there in the cell tower or right in the cell provider’s data center, extremely close to mobile applications.
AWS Outposts gets even closer. It gets right there inside of your building, right there inside of your data center. AWS will bring a rack of hardware in or a server and install it for you and make sure that that runs and they’ll bring in maintenance for it as well.
But then if you want to manage your own hardware, that’s where Amazon ECS Anywhere lives.
Nathan Peck: So I want to talk about how it works by showing this diagram so we can understand the pieces. So we have the region, and the region is where Amazon ECS fundamentally runs.
The ECS control plane itself does not run on your hardware ever. The only thing that runs on your hardware is what’s called the ECS agent, which is a lightweight agent that establishes a connection back to the control plane that’s living inside of the AWS region.
And you’ll see a couple different pieces here inside of your server VM. You’ll see the ECS agent, you’ll see the SSM agent, you’ll see the operating system and containers. And as we build up the flow here, you’ll see how everything connects together.
Nathan Peck: So we have the ECS agent, which connects back to the Amazon ECS control plane, and we have the SSM agent, standing for Simple Systems Manager, or Systems Manager, which connects back to AWS Systems Manager.
Both of these agents run on your hardware to allow the cloud access to control it. Now interestingly, you do not need to have a public IP address for your software or for your hardware. You don’t have to open up any networking firewall rules or anything along those lines.
It’s designed so that as long as your hardware has an Internet connection, it establishes its own outbound connection to the AWS region. And then over that connection, the AWS services are going to communicate back down over it to send instructions to those agents.
So this makes it very secure. You don’t have to open up any firewall port rules or anything to AWS. So let’s go through the process of bootstrapping. So you have your data center, you have your hardware in there.
Nathan Peck: Let’s say you’re using VMware on premise. So you have a VM. You need to start providing a few things for Amazon ECS Anywhere to work, the first being an operating system. Obviously you need to run some kind of operating system inside your VM and the operating system is your responsibility because you’re the one installing on your own hardware.
But then on top of that, the first thing that’s going to get installed is the SSM agent, the Systems Manager agent. And when that agent installs, it’s going to use an activation key that you have prepared in advance inside of Systems Manager.
So you go into Systems Manager, you say I would like to register a device and you can get an activation key back. And the activation key can be used to activate one machine or up to a thousand machines at a time, I believe.
You can also specify an expiration on how long the activation key lasts. But you share that activation key out to all of your machines in your network as you install that SSM agent. And the SSM agent will send that activation key back to Systems Manager and say, hey, I would like to become a registered piece of hardware with Systems Manager.
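To make the activation step concrete, here is a minimal sketch of the request you might build for the Systems Manager `CreateActivation` API. The helper name, the role name, and the description text are hypothetical. The 1,000 machine ceiling is the figure mentioned in the talk, and the dictionary is only constructed here, not sent to AWS.

```python
from datetime import datetime, timedelta, timezone


def build_activation_request(iam_role_name: str, machine_count: int,
                             valid_for_days: int, now: datetime = None) -> dict:
    """Build parameters for an SSM CreateActivation call (nothing is sent)."""
    # The talk mentions up to a thousand machines per activation key.
    if not 1 <= machine_count <= 1000:
        raise ValueError("machine_count must be between 1 and 1000")
    now = now or datetime.now(timezone.utc)
    return {
        "Description": "ECS Anywhere hosts (illustrative)",
        "IamRole": iam_role_name,            # role the registered machines will assume
        "RegistrationLimit": machine_count,  # how many machines may use this key
        "ExpirationDate": now + timedelta(days=valid_for_days),
    }


request = build_activation_request("ecsAnywhereRole", machine_count=50, valid_for_days=7)
# With boto3 installed and credentials configured, this could be passed as
# boto3.client("ssm").create_activation(**request), which returns the
# activation ID and activation code to hand to each machine.
```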
As a result, when this happens, if approved, then the SSM agent will generate a private key locally, a key pair similar to an SSH connection. The private key stays on the hardware, the public key gets shared up to SSM and registered inside of SSM.
And so this allows SSM to validate that the piece of hardware is who it says it is. And the purpose of that is for SSM to be able to send back credentials in a secure manner to your hardware. It can actually encrypt those credentials using the public portion of the key.
Then the only one who can decrypt those credentials is the holder of the private portion of the key, which in this case is your particular piece of hardware running that particular SSM agent. Every single piece of hardware has its own key pair.
And so SSM has a whole list of all the public keys that it can use. But all those private keys, they stay on your hardware, inside your data center, inside your device. And so the SSM agent uses this whole process.
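The public and private key idea can be shown with a toy example. The sketch below uses textbook RSA with tiny primes purely to illustrate "encrypt with the public key, decrypt with the private key". It is not secure, and it is an assumption for illustration only: the talk does not say which algorithm or key size the SSM agent actually uses.

```python
def make_toy_keypair():
    """Return (public_key, private_key) built from tiny textbook primes."""
    p, q = 61, 53                 # far too small for real use
    n = p * q
    phi = (p - 1) * (q - 1)
    e = 17                        # public exponent, coprime with phi
    d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)
    return (e, n), (d, n)


def encrypt(public_key, message: int) -> int:
    """Anyone holding the public key can encrypt."""
    e, n = public_key
    return pow(message, e, n)


def decrypt(private_key, ciphertext: int) -> int:
    """Only the holder of the private key can decrypt."""
    d, n = private_key
    return pow(ciphertext, d, n)


public_key, private_key = make_toy_keypair()   # generated on the device
ciphertext = encrypt(public_key, 65)           # done by the service in the cloud
recovered = decrypt(private_key, ciphertext)   # done back on the device
```

In this picture the service only ever stores `public_key`, so a leak on the cloud side would not let anyone read credentials that were encrypted for a given device.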
At the point that it has the credential it can now communicate to any other AWS service in the cloud. But the main one for the functionality of Amazon ECS is the ECS control plane.
The ECS agent starts up, it says: “I have a role. The role allows me to talk to Amazon ECS and I am now able to register myself as a managed instance inside of Amazon ECS.” So now ECS has a list of all of the computers that have been registered with it.
And it’s starting to keep track of things like how much CPU. So the ECS agent looks at the number of CPU cores, it looks at the amount of memory in that device, and it says “Here’s the pool of resources that are available to run an application.”
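A small sketch of what "a pool of resources" means for scheduling: each registered machine advertises its CPU and memory, and a placement routine subtracts what each task reserves. The class, the function, and the first-fit rule are hypothetical stand-ins for illustration, not the actual ECS placement logic.

```python
from dataclasses import dataclass


@dataclass
class RegisteredInstance:
    """Remaining capacity that a hypothetical agent has reported for one machine."""
    name: str
    cpu_units: int    # assuming 1024 CPU units per core
    memory_mib: int


def place_task(instances, task_cpu: int, task_memory_mib: int):
    """Reserve capacity on the first instance with room and return its name."""
    for instance in instances:
        if instance.cpu_units >= task_cpu and instance.memory_mib >= task_memory_mib:
            instance.cpu_units -= task_cpu
            instance.memory_mib -= task_memory_mib
            return instance.name
    return None  # no registered machine has enough free capacity


pool = [
    RegisteredInstance("vm-a", cpu_units=4 * 1024, memory_mib=8192),
    RegisteredInstance("vm-b", cpu_units=4 * 1024, memory_mib=8192),
]
chosen = place_task(pool, task_cpu=3072, task_memory_mib=2048)
```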
Now, you can go into the Amazon ECS console or the API and you can issue those commands, like I said earlier, “run ten copies of my application hooked up to this load balancer”, and Amazon ECS will communicate back down to the ECS agent and say “I would like you to run a container.”
The ECS agent communicates to a Docker engine that runs locally inside of your VM. Once again, this is something that you do have to install. We provide a helper script that goes through this whole bootstrapping process.
So from your perspective, it’ll just look like one command to run. But all these components, SSM agent, ECS agent and Docker engine are running locally on your hardware. The Docker engine then spins up containers as instructed by the ECS agent.
And once again, all those connections, the SSM agent connection and the ECS agent connection are outbound connections from your hardware. They’re communicating over the internet or via a private link that you establish directly to the service in the cloud.
But the cloud is not ever establishing any connection into your service. It’s only communicating back over the connection that was established by those agents. And the cool thing about this is that now that this whole flow is set up, the ECS agent can now bring in other credentials.
So that first set of credentials that was established by the SSM agent is just a top level credential that allows Amazon ECS Anywhere to function. But each of those containers now that runs on top of your VM can have its own unique credentials that authorize that service to talk to a certain subset of resources inside AWS.
For example, in this diagram, Amazon S3 or CloudWatch. So an example of that might be gathering the logs out of an application and uploading them to CloudWatch, or just an application that needs to store and persist data inside S3.
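In ECS, per-task credentials are expressed by attaching a task role to the task definition. Below is a minimal sketch of such a definition for hardware registered through ECS Anywhere. The helper name, family, image, sizes, and role ARN are hypothetical, and the dictionary is only built here, not registered.

```python
def build_task_definition(family: str, image: str, task_role_arn: str) -> dict:
    """Build parameters for an ECS RegisterTaskDefinition call (nothing is sent)."""
    return {
        "family": family,
        # EXTERNAL marks the task as runnable on your own registered hardware.
        "requiresCompatibilities": ["EXTERNAL"],
        # Credentials for this role are what the containers in the task receive,
        # separate from the top level credential used by the agents.
        "taskRoleArn": task_role_arn,
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "cpu": 256,       # illustrative sizes
                "memory": 512,
                "essential": True,
            }
        ],
    }


task_definition = build_task_definition(
    family="report-uploader",
    image="example.com/report-uploader:latest",
    task_role_arn="arn:aws:iam::123456789012:role/report-uploader-s3-access",
)
```

Scoping the role to a single bucket or log group keeps each application limited to the subset of resources it needs.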
So following this process, you can bootstrap all the way from a bare OS in a VM in the data center to these agents, bringing in all the configuration necessary for an application to function as if it was running in a production cloud environment.
Nathan Peck: The cool thing about it is it sounds complicated. It sounds like there’s a lot of components there. But the reality is these agents are very lightweight. They’re very thin agents. They don’t consume a lot of resources.
In fact, this was a fun little project I set up with Raspberry Pis connected to ECS and registering as capacity. So this right here was a 16 core, 32 GB cluster. They registered with the cloud, and then Amazon ECS was placing tasks onto this piece of hardware.
Obviously, this is not a particularly powerful piece of hardware. These devices didn’t even actually have active cooling. I was just using a heat sink on the processor, so it wasn’t super powerful. But I could do quite a bit of processing on it because the agent wasn’t consuming a lot of overhead.
So that’s an important distinction from a lot of other container orchestrators, which run a full database and all this logic on your own hardware.
Nathan Peck: And so this gets into the key use cases.
Nathan Peck: The fun part, the first one I want to talk about is the consistent hybrid workload. So remember earlier when I was talking about use cases for hybrid: consistent operations. The cool thing about Amazon ECS Anywhere is you now have one set of APIs that can deploy both to the cloud resource and to the on premise resource. The two core APIs that you use day to day with ECS are the RunTask API and the CreateService API.
RunTask is used to run a single container on demand and run it to completion until it exits. So this would be used if you have a batch job or you have like a script that you schedule to run, maybe on a cron job, something along those lines. And you just want to run that from top to bottom until it ends. And you don’t want to really think about where it runs. You want Amazon ECS to figure that out and use the available capacity. But you do want to run something on demand.
CreateService is for when you have a website, an API, something along those lines where it needs to be up and running at all times and needs to restart if it ever crashes. And so that’s what you’ll use the CreateService API for.
But either way, those two APIs are Amazon ECS APIs that can deploy to any of those locations in region, in AWS Local Zone, in Wavelength, in an Outpost, AWS managed hardware, or even your own hardware. Now it’s one unified entry point for launching an application anywhere.
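Here is a minimal sketch of how similar the two requests look regardless of where the workload lands. The helper names and the cluster, task definition, and service names are hypothetical, only a few of the request fields are shown, and some launch types need extra fields (for example network configuration) that are omitted here.

```python
LAUNCH_TYPES = {"EC2", "FARGATE", "EXTERNAL"}  # EXTERNAL targets ECS Anywhere hardware


def build_run_task_request(cluster: str, task_definition: str,
                           launch_type: str = "EXTERNAL") -> dict:
    """Parameters for RunTask: run one task to completion, on demand."""
    if launch_type not in LAUNCH_TYPES:
        raise ValueError(f"unknown launch type: {launch_type}")
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": launch_type,
        "count": 1,
    }


def build_create_service_request(cluster: str, service_name: str, task_definition: str,
                                 desired_count: int, launch_type: str = "EXTERNAL") -> dict:
    """Parameters for CreateService: keep desired_count copies running at all times."""
    if launch_type not in LAUNCH_TYPES:
        raise ValueError(f"unknown launch type: {launch_type}")
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        "launchType": launch_type,
    }


batch_job = build_run_task_request("warehouse-cluster", "nightly-report:3")
web_api = build_create_service_request("warehouse-cluster", "inventory-api",
                                       "inventory-api:12", desired_count=10)
# With boto3 these could be sent as ecs.run_task(**batch_job) and
# ecs.create_service(**web_api); only launchType changes between cloud and on-premises.
```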
Nathan Peck: I want to talk about some example customers that have benefited from this particular approach. The first one here being Tempus Ex. Tempus Ex, they are a provider of video transcoding. And they needed a way to consume all this data from webcasts and different sports leagues around the world. And you’ll see some of the results there.
I don’t want to go through all that text. I don’t want to just read because that’s kind of boring. But the core idea there is that it facilitated processing speeds of up to 40 times faster. That leads into another use case of edge orchestration challenges.
Nathan Peck: As you are operating an application on the edge, you’ll find that you need to do certain maintenance to it. We’ve kind of become used to cloud services that are maintained and patched and upgraded by AWS engineers. But the reality is, when you’re operating your own software, on your own hardware, someone has to manage those patches and upgrades. And for a setup like Kubernetes, there’s a lot of different things that need to be upgraded.
Kubernetes requires you to upgrade not just the orchestrator, but also the etcd database, the store of state. And you also have to upgrade the different agents on the nodes that are providing the capacity for Kubernetes. You’re going to have to manage all those components.
Nathan Peck: And this problem becomes harder to deal with the more locations you’re dealing with. Let’s say you’re a restaurant chain, or let’s say you are a business that has a lot of different warehouses that you are working with, or a lot of different devices that are mobile devices, like cars.
If you’re running a complete orchestrator in each of these locations, they all need to be patched. And this problem becomes harder the more locations you have. Now let’s compare that and contrast it to the way it works with Amazon ECS.
Nathan Peck: If you remember in Amazon ECS the agents are connecting back to the cloud and that control plane is centralized in one place inside the cloud. Well AWS engineers are the ones managing, upgrading and patching that. The only thing that you have to upgrade and patch is the agents themselves. So you’ve removed several different components you would have to orchestrate and update in all these different locations and replaced it with the vast majority of that operation overhead happening on the cloud side.
And the cool thing about it is that these agents are backwards compatible too. So there’s technically not even a reason why you would need to upgrade the ECS agent or SSM agent unless there was a bug or there was a new feature of the platform that you wanted to adopt.
In many cases, you could leave a site running on an older version of the agent, and the agent will happily connect back to the Amazon ECS control plane, even if the control plane has been updated. That’s one of the nice benefits of Amazon ECS that I’ve run into.
Nathan Peck: An example of a customer who benefits from that is 3dEYE. 3dEYE is a video streaming software. And they have a lot of different third party data centers that they’re watching over. They’re doing, if I remember correctly, IP cameras. And so all of their customers are gathering IP camera data on premise from different security cameras. And obviously that creates a ton of different installations at different businesses, warehouses, different places with their security cameras all around the world.
And if they were to go in and try to manage the software on all of those, it would be a significant burden. But by using ECS anywhere, they’re able to keep the bulk of that centralized operational overhead in the cloud.
Then even like I said, they could use an old version of the agent, leave that on premise, or if they really want to, they can upgrade the agents. Once again, lightweight agents, very easy to upgrade compared to upgrading a database or an entire configuration stack of a hardcore orchestrator.
Nathan Peck: Now I want to talk about another one that I particularly like about ECS Anywhere, which is GPU scheduling. This has become one of the key use cases of Amazon ECS Anywhere as more people adopt machine learning and are trying to train machine learning algorithms.
You can scale and place applications on more than just the CPU and memory dimensions. You can now also say, “I want this application to have a GPU core attached to it.” And the ECS agent is aware of the GPUs that might be on that piece of physical hardware and can schedule workloads accordingly.
I mentioned the video rendering studios before, 3D rendering. They love GPUs. They need GPUs for rendering films and film frames.
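In a task definition, a GPU request is one more resource dimension on the container. A minimal sketch, assuming a container definition dictionary like the ones ECS accepts; the helper name and the image are hypothetical.

```python
def with_gpu(container_definition: dict, gpu_count: int) -> dict:
    """Return a copy of a container definition that also reserves GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    updated = dict(container_definition)
    # ECS expresses the GPU request as a resource requirement whose value is a string.
    updated["resourceRequirements"] = [{"type": "GPU", "value": str(gpu_count)}]
    return updated


renderer = with_gpu(
    {"name": "frame-renderer", "image": "example.com/frame-renderer:latest",
     "cpu": 2048, "memory": 8192, "essential": True},
    gpu_count=1,
)
```

The scheduler then treats GPUs like CPU and memory: a task is only placed on a machine that still has the requested number of GPUs free.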
Nathan Peck: But one customer that is using GPUs as well is Kepler. Kepler provides a machine learning service that uses film cameras to monitor elderly people in care residences. It’s essentially watching over them in case they would fall down or get hurt or have a medical emergency. Maybe they’re not even able to press the “Call for help” button, but the machine learning algorithm is able to detect that, call a nurse to save their life much faster.
Obviously, to run that model, they need particular hardware such as GPUs, and they need some way to orchestrate all of that throughout all these care homes that are distributed around the world. They benefit from Amazon ECS Anywhere.
There are some other core features of Amazon ECS Anywhere that I like.
ECS exec: So Exec is built into ECS as a way to get a shell inside of a running container, and the reason why this is important is that traditionally, if you have a device that’s running containers or that’s being used as compute capacity, how are you going to connect to it if you need to debug?
Well, it really feels bad to open port 22 to the world to allow SSH access from anywhere. Particularly now you have to give that device a public IP address on the Internet. You have to add firewall rules and make sure that you don’t have somebody brute forcing in there and trying to crack the password and get into your device and start running their arbitrary code on it.
I talked earlier about how that SSM agent opens the channel back up to the cloud. Well, ECS Exec uses that channel to provide essentially a built-in bastion host inside of SSM. You connect to Simple Systems Manager, AWS Systems Manager, to initiate your SSH connection to Systems Manager, and then Systems Manager connects down over that connection that the SSM agent opened up to SSM.
So you’re able to get a shell on the remote container on that remote host without ever having SSH installed and without ever having port 22 open to the world. And you control access using IAM policies so you can give each of the Ads users on your account different access to different services and authorize who’s able to connect to SSM, and then through SSM down to which particular containers that might be running in a particular rack or location around the world.
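To make the IAM-based access control concrete, here is a minimal sketch of a policy document that lets a user open ECS Exec sessions only into one named container in one cluster. The `ecs:ExecuteCommand` action and the `ecs:container-name` condition key are real IAM elements; the account ID, region, cluster name, and container name are placeholders chosen for illustration.

```python
import json


def exec_policy(region: str, account_id: str, cluster: str, container: str) -> dict:
    """Build an IAM policy allowing ECS Exec into one container name in one cluster."""
    # Matches every task in the given cluster.
    task_arn_pattern = f"arn:aws:ecs:{region}:{account_id}:task/{cluster}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ecs:ExecuteCommand"],
                "Resource": task_arn_pattern,
                # Restrict the session to a single container name.
                "Condition": {
                    "StringEquals": {"ecs:container-name": container}
                },
            }
        ],
    }


# Placeholder values, not taken from the talk.
policy = exec_policy("us-west-2", "123456789012", "store-rack-7", "inventory-app")
print(json.dumps(policy, indent=2))
```

With a policy like this attached, and with execute command enabled on the task, a session is typically opened with the AWS CLI command `aws ecs execute-command`, passing the cluster, task, container, and the command to run.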
Nathan Peck: Then last but not least, CloudWatch monitoring. One of the classic problems of running an application is what happens when the logs start to stack up and you’re trying to rotate them, and next thing you know, you get an alert that the disk space is running out on that particular host. And what do you do with those logs? You’re going to ship them off to a storage device somewhere or just delete them because you don’t really care about them?
Amazon ECS Anywhere, out of the box, gathers logs and metrics. It’s a built-in feature of Amazon ECS, because it has that IAM role that allows it to communicate back out to other AWS services. One of those is CloudWatch Logs. And so it can gather the logs of your application and ship them off that piece of physical hardware and up into the cloud for storing, exploration and querying later on.
Even if that particular piece of hardware was destroyed, you would still have the telemetry from that task. You would still have the log lines that that task had written prior to that piece of hardware being destroyed.
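As a sketch of how log shipping is usually switched on, the helper below adds an `awslogs` log configuration to a container definition so that the container's output goes to CloudWatch Logs instead of the local disk. The `awslogs` driver and its three options are standard ECS settings; the log group name, region, and stream prefix are assumed values for illustration.

```python
def with_cloudwatch_logs(container: dict, log_group: str, region: str, prefix: str) -> dict:
    """Return a copy of a container definition that ships its logs to CloudWatch Logs."""
    updated = dict(container)  # shallow copy so the input is left untouched
    updated["logConfiguration"] = {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,       # destination log group
            "awslogs-region": region,         # region that hosts the log group
            "awslogs-stream-prefix": prefix,  # prefix for per-task log streams
        },
    }
    return updated


# Hypothetical container and log group names.
base = {"name": "checkout", "image": "registry.example.com/checkout:latest"}
logged = with_cloudwatch_logs(base, "/edge/checkout", "us-west-2", "store")
print(logged["logConfiguration"])
```

The role associated with the task still needs permission to write to CloudWatch Logs for this to work.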
Nathan Peck: One more example customer I want to talk about is Just Walk Out Technology by Amazon. Just Walk Out technology is being used in different stores like the Amazon Go store, and you’ve probably seen it: they have all those cameras that kind of watch from overhead. And as you walk through the store, you can just grab something off the shelf and walk out of the store and you get a bill later on, so you never have to even talk to a cashier. I love the concept!
So obviously they want to be able to scale this out fast to many stores. And you’ll see the quote there: “As we continue to scale Just Walk Out technology, we look for ways to accelerate our deployment processes for in-store workloads. ECS Anywhere helps us to expand faster by maintaining the same deployment processes, metrics and tooling on premises and in the cloud.”
So, classic example of being able to utilize that same operational workflow both in the cloud as well as on-prem.
Nathan Peck: I’ve talked about several different customers at a sort of high level, but I want to bring up one customer who’s going to go into a deep dive on how they used Amazon ECS Anywhere, and that is Cam Mac from Ocado Technology.
Welcome, Cam.
Cam Mac: Cheers. Good evening, everyone. My name is Cam Mac at Ocado Technology. I’m head of product for an area that is responsible for providing connectivity and compute for people and automation.
Cam Mac: Before I start talking about ECS Anywhere at Ocado, I thought it might be useful to share a little bit about who we are and what we do.
Cam Mac: At Ocado Technology, we are solving some of the toughest technological challenges of our age. I’ve been quite fortunate and I’ve been with the company for two decades. And over the last 20 years, we have been transforming online grocery through cutting edge tech.
Our engineers build and support solutions for ensuring we have the right stock in the warehouse, an e-commerce offering to allow our customers to buy the products, and solutions to fulfill and deliver the orders.
Cam Mac: Rather than buying off the shelf technology to support our needs, we decided to build our own because what was out there did not match our exacting standards. Now we license our technology to retailers all over the world.
Today, a two and a half thousand strong team of developers, across twelve development areas in eight different countries, are developing our industry leading capabilities in automation, robotics, machine learning and more.
We have over 500 patents covering our technology estate. And we are tremendously proud to have been able to solve some really tough problems and develop solutions that are being used by retailers all over the world.
Cam Mac: Our advanced capabilities in machine learning and artificial intelligence enable us to achieve amazing outcomes. Fresher food, greater convenience and choice, and the lowest rates of food waste, all in a business model that brings the best economic returns for our business.
This is why some of the world’s largest retailers are using our technology to become leaders in their markets. We are delivering highly automated warehouses like this one all over the world. More on how these work later on.
Cam Mac: Our cutting edge technology supports the online operations of eleven of the most innovative and forward thinking retailers in the world. Close to where we are today in North America we have Sobeys and Kroger.
We have a number of retailers in Europe, and out in Australasia and the Far East we have Aeon and Coles. Since producing this slide, we have also signed a contract with Lotte in South Korea. So we have a new retailer coming on board.
Cam Mac: So what is our proposition? Well, it’s OSP, the Ocado Smart Platform.
Cam Mac: OSP is the most advanced platform for groceries in the world. It’s the end to end suite of capabilities for e-commerce, fulfillment and logistics solutions, using advanced technologies such as machine learning and artificial intelligence. Through using OSP, retailers can build loyalty and win market share. We are built for the cloud and adopt a microservices architecture.
The beauty of this model is that it is extremely scalable. Warehouses can be as big as seven football pitches to serve a large area or scale back to serve a smaller population density area. There are also micro sized facilities that offer quick delivery to customers in urban areas.
Cam Mac: So what makes the Ocado way of doing online grocery different? First of all, grocery is the most difficult retail segment to deliver online profitably. There are some hard problems and hard challenges that we face.
Big order sizes (50 plus items). Products that need to be stored as frozen, chill, or room temperature. Short shelf life. Low margins, and low tolerance for substitutions, and yet high expectations for on-time delivery.
Ocado entered the sector in the UK determined to take a new approach. Unlike most traditional retailers with an online presence, Ocado fulfills its orders without the need for brick and mortar stores. This allows for greater efficiency and flexibility.
Cam Mac: The OSP warehouse is the most sophisticated of its kind in the world. There are standard, micro and mini size warehouses in the ecosystem. Goods from our suppliers arrive at the inbound area and are decanted into storage bins.
The bins enter the hive. This is at the center of the warehouse, comprising a grid and thousands of bots collaborating with each other like a swarm of bees. The precise orchestration of this system allows us to collect a 50 plus item order in just a few minutes.
A key ingredient in the recipe for quick delivery: a pick station allows an operator or a robot to pick into the customer order. Total labor hours used to fulfill an order runs at just 15 minutes, compared to 1 hour and 14 minutes at a supermarket store.
We are cloud native at Ocado Technology. Even our most latency sensitive bot orchestration system runs in the cloud. But there is still a need for edge based provisioning of on premise compute. Depending on the size of the warehouse, we would typically have about 50 to 200 devices running different kinds of workloads to support key business functions such as pick and decant.
Cam Mac: And there lies our challenge. How do we enable this for multiple warehouses in a highly repeatable way? Well, that’s why I’m here today to tell you about how we leverage ECS Anywhere at Ocado Tech.
So the problem set for us was clear: we needed a way to allow our engineers to deploy container workloads to devices scattered in warehouses across the world. We knew that some of these key, critical business functions could require multiple workloads.
Our engineers are super proud of the Ocado Technology platform. This is our in-house built suite of tools, pipelines and governance frameworks to enable us to build and deploy applications in a consistent way.
It was incredibly important for us to leverage this and to not reinvent the wheel. And regardless of whether we’re deploying into the cloud or out at the edge, we wanted to ensure the engineers had the same experience.
In addition, we also had the following considerations:
Simplicity. We wanted to avoid creating something that required specialist training or creating complex interfaces which not only would distract our engineers from using the product, but would also reduce their time to focus on innovation.
At Ocado Technology, we strongly believe in continuous deployment. This means that anything that slows down their productivity is not a good thing. To make it even easier for our engineers, we also factored in the need to deploy to devices in logical groups. For example, on a Monday I may choose to only update pick stations in Kroger warehouses, and on a Tuesday, some other set of devices for some other retailer. The permutations are endless and we needed to provide support for this.
Support can become incredibly difficult due to the sheer number of devices. This means we needed to manage our devices in a way that was compatible with both our staff located at the retailer’s warehouse as well as our engineers supporting the product back at base. Making it easy for those who support our products was paramount.
Lastly, we needed it to be easy for engineers to move over to using the product, not just for their first application, but for any application.
Cam Mac: So why ECS Anywhere? Before we started our proof of concept with ECS Anywhere, we undertook research into the technologies available to us. We considered a number of options, including Greengrass, Rancher, and even a combination of IoT with an in-house built orchestration layer.
The choice was clear, however. Over and above the great coverage ECS Anywhere has of our requirements, it was a product that we felt at home with. This is because at Ocado Technology, we are already a heavy AWS ECS user.
We all know the economic and technological benefits of using managed services. We want to build our products without reinventing the wheel and want our engineers to focus on their missions, not on infrastructure. This sits firmly in line with our product strategy.
Our use case today centers on the need to deliver workloads to devices within the walls of a warehouse where you have supporting services such as VPNs. We know that one day we may need to deploy to devices located anywhere with just an Internet connection. ECS Anywhere gives us that straight out of the box. One of the key benefits of ECS Anywhere is that it allows you to scale out applications in the cloud to devices out at the edge.
This works really well for most vanilla scenarios. Through our proof of concept, we identified that this logic was not entirely compatible with our use case, and we worked collaboratively with the ECS Anywhere team to tweak the back plane to overcome this issue.
Cam Mac: We’re tremendously proud of the outcome. So how do we go about assembling all of this together? What we have here is a very high level simplification of the components that make up the solution. We have an ECS cluster per warehouse.
This separation allows us to deploy per warehouse, which we thought was the right balance. The innovative applications developed by our engineers are pushed as containers with the appropriate tags and attributes.
This is done using our deployment tool, which has integration to our deployment pipeline using CloudFormation. Together, this gives us the ability to deploy a particular workload to a particular device.
And at the receiving end is the compute device. The ECS agent is deployed as part of the standard operating system build. Information that allows us to identify the device is collated by our in-house platform agent, which passes this on to the ECS agent.
Upon successful validation that the device is an Ocado asset, the auto registration process enables the device to receive the next task. It’s as simple as that. So after approximately four months of product development and a rollout period, we now have about 1300 devices running across 18 different warehouses in five different countries.
Cam Mac: This is helping five of our retailers in their missions to use OSP to gain competitive advantage in the online grocery market in their territories. Phase two of the rollout plan will add devices to five more sites in the next quarter across two new warehouses for two retailers, and this will add another 300 to 400 devices as well.
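The talk does not show how Ocado's deployment tool maps workloads to devices, so the following is only a sketch of one way ECS can target a logical group of devices: custom instance attributes combined with a `memberOf` placement constraint. The `placementConstraints` field and the `attribute:name == value` expression syntax are standard ECS features; the attribute names (`retailer`, `role`), the cluster and service names, and the helper functions are hypothetical.

```python
def group_expression(attributes: dict) -> str:
    """Turn {"retailer": "kroger"} into an ECS cluster query expression."""
    # Sort the keys so the same group always yields the same expression.
    return " and ".join(
        f"attribute:{name} == {value}" for name, value in sorted(attributes.items())
    )


def service_request(cluster: str, name: str, task_definition: str, attributes: dict) -> dict:
    """Build create-service parameters that only place tasks on matching devices."""
    return {
        "cluster": cluster,  # e.g. one cluster per warehouse
        "serviceName": name,
        "taskDefinition": task_definition,
        "desiredCount": 1,
        "launchType": "EXTERNAL",  # ECS Anywhere instances
        "placementConstraints": [
            {"type": "memberOf", "expression": group_expression(attributes)}
        ],
    }


# Hypothetical group: pick stations belonging to one retailer.
request = service_request(
    "warehouse-01", "pick-ui", "pick-ui:12",
    {"retailer": "kroger", "role": "pick-station"},
)
print(request["placementConstraints"][0]["expression"])
# prints: attribute:retailer == kroger and attribute:role == pick-station
```

Parameters of this shape correspond to the ECS `CreateService` API, and the same structure can be expressed in a CloudFormation template, which is closer to the pipeline described here.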
We’re also seeing an incredibly low overhead in maintaining the product and have a high degree of confidence that it will continue to scale to support the known projected number of warehouses and retailers.
Beyond the pick and decant use case, we’re also seeing other business functions express interest in how we can help them deliver workloads to their devices. This ranges from decant imaging, where we use vision technology to detect unwanted packaging and guide the operator to remove it, through to developing the charge solution for the next generation bot.
Hopefully you have enjoyed today’s journey. We learnt about the challenges faced by retailers and how Ocado has developed innovations to help them turn these challenges into opportunities. This is achieved through the Ocado Smart Platform.
The Ocado Smart Platform is the end-to-end suite of e-commerce, fulfillment and logistics solutions. As part of the fulfillment operation, we needed a way to deploy container workloads to devices across the globe and we achieved this through integrating ECS Anywhere.
The integration was highly collaborative. We can’t thank the ECS Anywhere team enough for the way they responded, the way they engaged and worked tirelessly to support the tight deadlines of the project.
The end result? We have a product that easily allows our engineers to deliver workloads to devices across the globe. The solution runs off a single control plane and we have not had to upskill our engineers due to their familiarity with ECS.
Cam Mac: Thank you again. My name is Cam Mac at Ocado Technology and we hope you can also benefit from using ECS Anywhere in your organization to build great products to serve the needs of your business. Thank you very much.
Nathan Peck: High praise indeed! And if you’re interested in Amazon ECS, feel free to reach out. I’m sure Cam can answer some of your questions and you’ll also find my Twitter handle up there: @nathankpeck. DMs are always open, so reach out! I’m happy to answer some of your questions or connect you to an engineer inside of Amazon ECS Anywhere who can help you to build your incredible platform as well.