Insights from Product Reliability Engineers
Intro
Palantir’s platforms (AIP, Foundry, Gotham, Apollo) underpin mission-critical workflows throughout the world, whether it’s facilitating the resettling of over 100,000 refugees fleeing the war in Ukraine or reducing waiting times for life-saving cancer care. The Product Reliability Incident Management team’s core mandate is to address the highest-priority issues across our platforms so that our customers are able to perform mission-critical work. The team achieves this with a combination of proactive project work (around 70%) and reactive incident management (around 30%).
The team of product reliability engineers operates a 24/7 “follow-the-sun” support model, with major hubs in the United Kingdom, United States, and Australia. A typical on-call shift is action-packed: the team responds to the most critical, complex, and novel customer issues in real-time. Resolving issues requires simultaneously coordinating multiple product teams and other stakeholders while navigating metrics, dashboards, documentation, and logs.
On the more proactive side, the team has a broad mandate to own and implement any strategic project that will optimize Palantir’s product reliability incident management capabilities. Some recent projects include: developing new features and embedding LLM capabilities in our internal product reliability incident management tooling to accelerate issue resolution; building monitors and end-to-end tests to alert on current or imminent issues before users report them; and implementing readiness standards across our entire product development organization to ensure our teams can respond as quickly as possible whenever there’s a critical issue.
In this blog post, product reliability engineers (PREs) share what it’s like to work on the Product Reliability Incident Management team.
What does a day on Product Reliability Incident Management look like for you?
Lea (PRE, London)
Over the past few months, I’ve been working on building alerts that detect issues before users identify them. To do this, I’ve been building end-to-end tests that run continuously in production and simulate user workflows. When these tests fail, it means there’s a user-facing issue. With 10+ products involved in most user applications, making it simple for anyone to identify the cause of a failing test is challenging! To support this project, my days involve a mix of: collaborating with development teams to understand how their products fit into user workflows; creating dashboards that make it easy to identify the cause of test failures; using Foundry to analyse the accuracy and performance of the tests in production; and getting usability feedback from dashboard users. When I’m on call, it’s satisfying to see the tests and monitors I’ve written catch problems before they become issues for users.
Kwesi (PRE, London)
On days when I’m on-call, I particularly enjoy it when a really complex and unusual issue comes in that’s difficult to troubleshoot. It’s so satisfying to help resolve our most challenging product issues.
When I’m not on-call, I spend most of my time either mentoring our newer teammates or working on more “cutting-edge” features. I really enjoy doing both technical and UX design reviews, code reviews and pair programming to help teammates implement more complicated features or help them onboard to an unfamiliar area of our codebase. One recent example of a “cutting-edge” feature has been building a workflow that leverages Large Language Models within AIP and Foundry to help automatically flag when a deeper root cause analysis by the relevant product teams would be particularly valuable.
Thea (PRE, Denver)
I particularly enjoy the novelty of days when I’m on-call; no two days are the same! Some days, I’ll need to handle multiple issues affecting different customers; other days, I’ll only have one or two issues, but remediation might be particularly complex and require co-ordinating between four or five different product teams.
I tend to reserve meetings and tasks requiring deep focus for days I’m not on-call. I meet with a very wide variety of stakeholders across a number of forums, whether it’s attending project team check-ins to review progress and set priorities; presenting at all-hands meetings with Platform groups to announce process changes related to product reliability incident management communications; or having one-to-one meetings with my lead to discuss day-to-day questions and growth opportunities. Other deep-focus tasks might include spending an afternoon doing feature development to address a stakeholder need. A recent example is building a tool to help streamline impact assessment within the first few minutes of a critical issue being raised.
What does growth look like on Product Reliability Incident Management?
Lea
You get responsibility from the start. Within my first week, I was exposed to critical product reliability incidents on our platforms, shadowing an experienced “Field Marshal” who is trusted to ensure that any critical issue on our platforms is resolved. I’m now fully qualified as an “Independent Field Marshal.” The onboarding process was challenging, but I really value the fact that I’ve been able to help people independently with important problems so soon after ramping up on the team.
Learning to balance Field Marshaling with project work was also a rewarding challenge. Since project teams are small, you’re expected to become a relative expert in your project area within a few months. I’ve really enjoyed my first project on proactive product stability detection, because I can see the direct impact of my monitors catching issues before users do.
Kwesi
Prior to joining Palantir, I really wanted a role that would build my technical skills. Palantir empowered my growth from day one by putting me on a team responsible for building and maintaining our internal tools for accelerating issue resolution. I’ve done a bunch of front-end and back-end development work, and 18 months later, I focus on long-term architecture and design, stability improvements, and technical mentorship. Getting mentorship from extraordinary engineers really helped catalyse my growth, and I look forward to mentoring other team members to enable their growth as well.
On the non-technical side, before joining Palantir, I struggled with building consensus on how to solve problems. At Palantir, the only way to persuade people to do a thing is to build a shared agreement of the problem we’re trying to solve and demonstrate that a given solution is best; you cannot rely on job title or years of experience because we have a flat hierarchy. Developing these skills has been a challenge, but it’s very rewarding to see some of my ideas coming to fruition. An example from last year: we wanted to use LLMs to summarize product reliability issues in real-time to help share context during their resolution. We were able to persuade multiple different product teams to partner with us to integrate AIP and Foundry features into our issue management tooling to achieve this outcome.
Thea
During my time on the team, I’ve been able to grow my operational skills by taking on open-ended projects that require building consensus amongst a huge number of stakeholders, whether that’s designing and rolling out training for our business development organization to help them be more effective during critical product reliability issues or working with stakeholders to improve our speed in sending communications about user-facing issues to customers.
I’ve also had the opportunity to develop technical skills in parallel. Handling critical incidents across our entire product portfolio is a great way to understand product architecture and technical debugging techniques. I’ve even started to learn how to build features myself. As an example, we’ve been working on improving the speed at which we send user-facing communications for critical product support issues. We decided to build a new feature to auto-generate and surface a draft message based on the affected applications. Rather than delegating the work, I decided to build the feature. This involved learning Go and implementing the feature in our tooling. It was particularly rewarding to build a new feature in a couple of weeks, use the feature in production, and see the positive impact it had for my colleagues.
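As a rough illustration of what generating a draft message from the affected applications could look like, here is a small sketch. It is written in Python for brevity rather than Go, and it is not the team's implementation: the `draft_message` function, its parameters, and the message wording are all hypothetical.

```python
from typing import Iterable


def draft_message(affected_apps: Iterable[str], status: str = "investigating") -> str:
    """Build a draft customer-facing message from a list of affected applications.

    Hypothetical helper: the wording and the default status are assumptions.
    """
    # De-duplicate, drop blank entries, and sort for a stable, readable list.
    apps = sorted({a.strip() for a in affected_apps if a.strip()})
    if not apps:
        raise ValueError("at least one affected application is required")

    if len(apps) == 1:
        app_list = apps[0]
    elif len(apps) == 2:
        app_list = f"{apps[0]} and {apps[1]}"
    else:
        app_list = ", ".join(apps[:-1]) + f", and {apps[-1]}"

    return (
        f"We are aware of an issue affecting {app_list}. "
        f"Current status: {status}. "
        "We will share an update as soon as we have more information."
    )


# Example with made-up application names.
print(draft_message(["Search", "Reports", "Search"]))
```

The output is only a starting draft; in a sketch like this, a person would still review and edit the message before it is sent.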
What do you like most about being on the Product Reliability Incident Management team?
Lea
It gives you a unique opportunity to have impact across such a broad range of applications, environments, and products. For example, I’ve been working on improving alerting for critical applications in our disconnected environments, which is a problem that cuts across multiple product and business development teams. I love that being on the Product Reliability Incident Management team gives me the chance to work on such widely-scoped problems.
Kwesi
The Product Reliability Incident Management team allows you to work on a very wide variety of projects, as it operates almost like a startup within a larger company. On the technical side, in 18 months, I’ve done back-end and front-end development, lots of stability improvement work, and have made architecture and design decisions on new technologies the team is developing. There are few roles that allow for this amount of variety in such a short space of time.
Thea
There are almost too many to list! First, I love how collaborative the team is, and I get to work with different teams and different areas of the business every week. I also really enjoy the fast development pace; we frequently go from noticing a problem, to brainstorming a solution, to rolling out the solution and releasing multiple versions based on feedback, all within a couple of weeks. Finally, I’ve had the chance to work on such a diverse array of projects. I’ve built dashboards to speed up troubleshooting for critical product support issues and rolled out training for how the entire business development organization can best respond when issues arise. I’ve also improved our processes and tooling to ensure that less complex critical product support issues can be handled by the product teams themselves.
Why do you think someone should join the Product Reliability Incident Management team at Palantir?
Lea
If you like unstructured problems that affect critical software and require a wide variety of technical and people skills, then this is a role that will make you shine.
Kwesi
Managing our most critical product reliability issues provides a strong feeling of direct impact from your work. It’s also an excellent team for getting exposure to a wide range of product development skills. Finally, it’s a great team to join to develop mentoring skills because of the breadth of skills the overall team requires.
Thea
This team is a great fit for someone who wants to work on a product development team that requires cross-collaboration with a wide variety of stakeholders. It’s also an excellent team to join if you’re looking to develop a broad range of skills, both technical and non-technical. If you love to go from zero to one on many different topics, join us on Product Reliability Incident Management!
Join the Product Reliability Incident Management team
Palantir’s Product Reliability Incident Management team has a business-critical responsibility to help resolve some of the most impactful product issues while continually innovating as our business and offerings grow. We’re looking for people who can handle unstructured problems in a vast domain space, have strong technical or operational skills, enjoy collaborating with multiple teams across the business, and have a passion for personal development and growth.
If this sounds like you, consider applying to join our team as a Product Reliability Incident Management Engineer.
Product Reliability Incident Management at Palantir was originally published in Palantir Blog on Medium.