I'm Michael Krigsman, industry analyst and host of CXOTalk.
And we're here at FutureStack '16, which is New Relic's conference being held in San
Francisco.
And I'm talking with Cameron Tuckerman-Lee, who is a site reliability engineer for Airbnb.
Hey Cameron, how are you doing?
Really good!
How about you?
Good!
We all know what Airbnb does, but what does a site reliability engineer do?
I think that's a good question.
I think the role is very different depending on what company you're at.
So, at a lot of companies, your SREs are your operators.
You have developers on one part of your building that develop your applications, and then throw
them over the metaphorical wall over to your operators, who make sure that it's running
in production.
So, silos.
Yeah.
So, at Airbnb, we don't subscribe to that model; we are in the dev-ops model that is
becoming very popular lately.
So, the same engineers that are building applications are also the ones that are running them, scaling
them, and dealing with incidents.
But because of that, there's a new class of tools that are required to make sure that
they're doing that efficiently and using best practices; and so that's what the SRE team
does: it makes sure that the entire site is reliable and available, and we do that by
supporting the other teams that own their applications.
What kind of tools help with this?
So, some of it is ... a lot of it is learning.
So when there are incidents, how do you make sure that there's good follow-up to that;
that there's learning from that.
And so, there is this tooling around, like post-mortems, and making sure that when incidents
do occur, that if there are previous incidents that were like this, you are able to get that
data very quickly and understand it.
It's also getting the right people in the room.
So, how you do [that] with pager escalations, how you deal with alerting; those are also
owned by the site reliability team.
You know, we're also the ones that own and maintain the integrations with some of our
monitoring tools, like StatsD and New Relic.
This is how, when there are incidents, we're able to quickly triangulate where the
problem is and what the impact was.
So it's a combination of technology tools, but also processes and approaches combined
with data.
Absolutely.
So, I think there's lots of different good ways to go about incident response, but a
really not-great way to do that would be to have everybody be doing it their own way,
and have no consistency.
So, having a team like SRE means that Airbnb has a consistent approach to incident response,
so when there are problems that need to get escalated up the chain, they can get picked
up and handled very quickly.
And you're very focused on using the end user as a reference point.
Absolutely.
Tell us about that.
I think no business likes having downtime.
Obviously, there are financial implications to any business, but there is a really personal
human aspect to downtime at Airbnb.
The situation I like to remind myself of to motivate me is, you can imagine, you know:
you're going on vacation, just got off the plane, you're in the cab, you're heading to
your listing, you open up your application to get its address, and you just see a 500.
It would be a pretty bad or potentially scary situation.
Yeah, very painful.
Yeah.
And so, Airbnb really is nothing without our community.
I can't imagine what the product would be without the guests and hosts that trust us;
so, making sure that we're not just up and available for taking bookings, but that people
are able to rely on us is really important to our business.
You mentioned the word "trust".
How does trust relate to technology, relate to user experience; how does that web work?
It's a good question.
So, some might say that Airbnb is the hospitality company, but some might also argue that we're
selling trust: the trust that you're going to be able to go to a stranger's home, and
feel welcome and have a good experience, and be able to experience that neighborhood like
a local.
And so, the technology that goes into making sure that people are what they say they are,
that you're able to interact with your host, and get to know each other beforehand; that
you're able to, when you're searching for a listing, find a place that's going to fit
with the kind of neighborhood that you're looking for; I think all contribute to making
sure that when you go someplace, you trust that it's going to be a good experience.
And how does that, then, connect to site reliability engineering, and to other engineering functions
inside Airbnb?
How do you think about the connections?
I think this comes down to engineers feeling like they're very involved in the product.
I don't think that many engineers at Airbnb feel like they're just doing what they're
told - they're shipping code, and once it's deployed, they don't care about it anymore.
They really feel like they need to own their own impact; that's the term that we throw
around a lot.
"Own your own impact."
"Own your own impact."
So, if you think something needs to get done, if you think something's not being done the
right way, it's up to you to stand up and make that change happen.
And so, this is from everybody from product teams developing new features for guests and
hosts to make their experience better, all the way to the, say, reliability team that
- you see that there are issues that need to get resolved, or there are some parts of
processes that aren't working out, we need to step up and do something to make sure that
our guests and hosts are going to have the best experience that they can [get].
So you really do see it as a kind of chain of linked tools and processes that have this
ultimate combined impact on the user.
Absolutely.
We want to have teams build on top of each other, all the way until the teams that are
building the actual experience that our users see.
We want to have a really strong foundation for them, so that when they are building JavaScript
frameworks [for] user interfaces, that they're able to trust that the back-end is going to
stay up, that they're able to trust that if there are issues that go to production, that
we're able to tackle them very quickly and roll back.
And so, it really is a pyramid of supporting each other.
And finally, what's the data that you look at?
There are a couple different parts of the data that my team cares about.
It's everything from your traditional SRE metrics, mean time to resolve, mean time to
acknowledge, you know, when [it is] incident response.
My team is also starting to really care about metrics around making sure that our on-call
engineers are living healthy, productive lives; making sure that work-life balance is something
that extends [to] something when you're on call at 2 AM.
I think it's something important for industry to start looking at.
Lastly, the ones that are aligned with how our users are seeing things; and these are
what a lot of companies would call "service-level objectives," making sure that our response
time is up, our error rates low, that [it is] not just response time to sending out
bytes to our CDN as fast, but also making sure that when the browser does get that information,
it's also having fast load times.
And that's where things like application monitoring with companies and products like New Relic
come into play.
So, it is a very holistic view.
Absolutely.
We have been speaking with Cameron Tuckerman-Lee, who is a site reliability engineer at Airbnb.
Cameron, thanks a lot!
Thank you so much!