Google SRE interview questions — mid-level & senior#
Questions verified from candidate interview reports on Glassdoor, Blind, Reddit (r/cscareerquestions, r/sre), engineering blogs, and interview prep platforms (IGotAnOffer, Interview Kickstart). These reflect the 2023–2025 interview cycle.
How Google SRE interviews are structured#
Google SRE interviews typically consist of 5–6 rounds across two stages. Understanding the format helps you prepare for what each round tests.
Phone screen (45 min): Linux fundamentals, basic networking, one system-design or troubleshooting scenario. Tests breadth — the bar to continue is knowing fundamentals cold.
Onsite / virtual onsite rounds:
| Round | Focus | What Google evaluates |
|---|---|---|
| NALSD | Non-Abstract Large System Design | Quantitative reasoning, scalability math, reliability tradeoffs |
| Linux / Systems | OS internals, kernel behavior, debugging | Depth of systems knowledge, not surface-level definitions |
| Casualty (×2) | Incident troubleshooting scenarios | Structured diagnostic reasoning using SRE-STAR(M) |
| Operational coding | Reliability-focused scripting | Writing correct, memory-efficient, production-grade code |
Google’s stated evaluation criteria emphasize operational maturity and execution sequencing over pattern-matching to memorized answers. In troubleshooting rounds specifically, interviewers score the path — not just whether you reach the right answer.
1. NALSD — Non-Abstract Large System Design#
NALSD is the SRE-specific flavor of system design. Unlike software engineering system design interviews that focus on architecture diagrams, NALSD starts with a quantitative load analysis: how many requests per second, how much storage, how many machines? Architectural decisions follow from the numbers.
What interviewers look for: Can you turn a vague problem statement into concrete capacity estimates? Do you identify the bottleneck before proposing a solution? Do your reliability choices (replication factor, retry strategy, failover time) match the stated SLO?
Q1. Design a disaster recovery plan to replicate a 5 Petabyte storage cluster with a 4-hour RTO.#
Reported source: DEV Community / AceInterviews, Google SRE NALSD Round Walkthrough
Step 1 — Establish the numbers
- 5 PB over a dedicated 10 Gbps link ≈ 46 days for a full bulk transfer; even with ~1 Tbps of aggregate cross-region bandwidth it is still roughly 11 hours. Either way, a full restore far exceeds a 4-hour RTO, so a cold-restore-from-backup approach won’t work. The DR design must use continuous replication, not periodic snapshots.
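A quick back-of-envelope check of the transfer time (a sketch; the 5 PB size and the two link speeds are the assumptions stated above):

```python
SIZE_BYTES = 5 * 10**15      # 5 PB

def transfer_hours(link_gbps: float) -> float:
    """Hours to move SIZE_BYTES over a link of the given capacity, ignoring protocol overhead."""
    seconds = SIZE_BYTES * 8 / (link_gbps * 10**9)
    return seconds / 3600

print(f"10 Gbps: {transfer_hours(10):,.0f} h (~{transfer_hours(10) / 24:.0f} days)")
print(f"1 Tbps:  {transfer_hours(1000):.1f} h")   # still nearly 3x the 4-hour RTO
```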
Step 2 — Choose a replication strategy
| Strategy | RTO | RPO | Notes |
|---|---|---|---|
| Synchronous replication | ~0 | ~0 | Adds write latency; only viable if DR site is <~10ms away |
| Asynchronous replication | <4h (with pre-warmed standby) | Minutes to hours | Lower latency impact; acceptable RPO for most storage workloads |
| Snapshot + incremental sync | Hours–days | Hours | Too slow for a 4h RTO at 5 PB scale unless deltas are tiny |
Choose async replication to a warm standby cluster in a separate region. Track replication lag as an SLI; alert if lag exceeds 30 minutes.
Step 3 — Design the warm standby
- Run a secondary cluster at 100% capacity in a second region (active-passive). The standby receives continuous delta sync (e.g., journal-based replication, similar to DRBD or cloud-native equivalents like Google Cloud Storage replication).
- Replicate metadata separately (namespace, permissions, chunk maps) — metadata volume is small enough for synchronous or near-synchronous replication.
Step 4 — Define and test the failover process
- Failover steps: (1) detect primary failure via health checks (auto or manual), (2) promote standby, (3) redirect DNS / load balancer, (4) verify data integrity on standby.
- Target each step to complete in <1 hour, totaling <4 hours end-to-end.
- Schedule quarterly failover drills; measure actual RTO, not estimated RTO.
Step 5 — Monitor replication health
SLIs to track: replication lag (bytes behind), replication throughput, successful heartbeat pings from standby. Alert on lag > 30 minutes, matching the SLI threshold above.
Common follow-up: “What if the replication link goes down for 6 hours?” — answer: differentiate between planned and unplanned gaps. For planned (e.g., maintenance), pause writes or switch to synchronous temporarily. For unplanned, queue writes with a journal; on reconnect, replay the journal and verify checksums before marking standby ready.
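A sketch of the reconnect path described above; the journal record format and checksum field are illustrative:

```python
import hashlib
import json
from typing import Callable

def replay_journal(journal_path: str, apply_write: Callable[[str, bytes], None]) -> int:
    """Replay queued writes in order, verifying each record's checksum before applying it."""
    applied = 0
    with open(journal_path) as f:
        for line in f:
            record = json.loads(line)            # one JSON record per line (illustrative format)
            payload = record["payload"].encode()
            if hashlib.sha256(payload).hexdigest() != record["sha256"]:
                raise ValueError(f"corrupt journal record after {applied} applied writes")
            apply_write(record["key"], payload)  # writes to the standby must be idempotent
            applied += 1
    return applied
```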
Q2. Design a metrics pipeline to handle 10 million events per second.#
Reported source: DEV Community / AceInterviews
Step 1 — Estimate the data volume
- 10M events/s × assume 200 bytes/event = 2 GB/s ingest rate. Per day: ~170 TB. Per month: ~5 PB. Storage will be the dominant cost; raw data must be compressed and downsampled for long retention.
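The same arithmetic as a short sketch (the 200-byte average event size is the assumption above):

```python
EVENTS_PER_SEC = 10_000_000
BYTES_PER_EVENT = 200                              # assumed average event size

ingest_bps = EVENTS_PER_SEC * BYTES_PER_EVENT      # bytes per second
per_day_tb = ingest_bps * 86_400 / 10**12          # ~173 TB/day
per_month_pb = per_day_tb * 30 / 1_000             # ~5.2 PB/month

print(f"Ingest: {ingest_bps / 10**9:.1f} GB/s, {per_day_tb:.0f} TB/day, {per_month_pb:.1f} PB/month")
```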
Step 2 — Define the pipeline stages
Producers → Ingestion layer → Processing / aggregation → Storage → Query layer
- Ingestion layer: Use a horizontally scalable message queue (Kafka with 100+ partitions, or Google Pub/Sub). Each partition handles ~100K events/s; 100 partitions → 10M/s. Auto-scale consumers based on lag.
- Processing / aggregation: Stateless stream processors (Apache Beam, Flink, or Dataflow) aggregate raw events into 1-minute rollups before writing to long-term storage. Reduces storage writes ~60×.
- Storage: Write raw data to columnar object storage (e.g., Parquet on GCS/S3) for ad-hoc queries; write aggregated rollups to a time-series database (e.g., Prometheus remote write, BigTable, or InfluxDB) for dashboards.
- Query layer: Serve dashboards from the TSDB; run historical queries against columnar storage with a query engine (BigQuery, Presto).
Step 3 — Design for reliability
- Ingestion is the highest-risk point: Kafka replication factor = 3, min in-sync replicas = 2. Consumer groups commit offsets after processing, not before — at-least-once semantics with idempotent writes downstream.
- Pipeline SLO: p99 latency from event emission to dashboard visibility < 30 seconds. Error budget: 0.1% (≈43 minutes/month of ingestion outage).
- Monitor: consumer lag (alert if > 5 minutes of lag), ingest throughput, processing error rate.
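Because at-least-once delivery means consumers may reprocess events after a crash or rebalance, downstream writes must be idempotent. A minimal sketch of the idea; the `event_id` field and the in-memory dedup set are illustrative, and a production version would use a keyed upsert or a TTL'd dedup store:

```python
from typing import Callable, Dict, Iterable

# Illustrative in-memory dedup set; a real pipeline would use a keyed store with a TTL.
seen_ids: set = set()

def write_rollup(event: Dict) -> None:
    """Apply one aggregated event to storage at most once per event_id."""
    if event["event_id"] in seen_ids:   # duplicate caused by redelivery, skip it
        return
    # ... write the rollup to the TSDB / object store here ...
    seen_ids.add(event["event_id"])

def consume(batch: Iterable[Dict], commit_offset: Callable[[], None]) -> None:
    for event in batch:
        write_rollup(event)
    commit_offset()   # commit only after processing: a crash causes redelivery, not data loss
```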
Common follow-up: “How do you handle a sudden 10× traffic spike?” — answer: Kafka partition counts can be increased, but not automatically or transparently under load (key-to-partition mapping changes and consumers rebalance), so provision headroom (150% of expected peak). Use Kubernetes HPA on stream processors to scale horizontally within seconds.
Q3. Design a feature flag system resilient to control plane failures.#
Reported source: DEV Community / AceInterviews
Core problem: If the service that serves feature flag decisions goes down, clients must still be able to operate. A feature flag system that fails closed (disables all features) or fails open (enables all features) is unacceptable — services need predictable behavior.
Design principles:
- Client-side caching: Each service client maintains an in-memory snapshot of its flag evaluations, refreshed on a configurable TTL (e.g., 30 seconds). On control plane failure, the client continues serving from cache.
- Persistent local fallback: On startup, the client reads from a local file or embedded snapshot (written at last successful sync). Cold-starts during an outage use the last known good state.
- Decentralized propagation: Flag configs are pushed to a distributed store (e.g., Bigtable, Redis cluster) that clients read directly — the “control plane” only writes; clients don’t call it at query time.
- Safe defaults: Each flag definition includes a `default_value` field used when the flag is absent from cache. Defaults are chosen to be safe (usually “off” for new features, “on” for critical paths that existed before flagging).
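A minimal client sketch combining these principles (in-memory cache, on-disk snapshot, safe defaults). The names `FlagClient`, the snapshot path, and the injected fetch function are illustrative, not a specific Google system:

```python
import json
import os
import time
from typing import Any, Callable, Dict

class FlagClient:
    """Evaluates feature flags from an in-memory cache, with file and default fallbacks."""

    def __init__(self, fetch_flags: Callable[[], Dict[str, Any]],
                 snapshot_path: str = "/var/run/flags.json", ttl_s: float = 30.0):
        self._fetch = fetch_flags                    # reads the distributed flag store
        self._path = snapshot_path
        self._ttl = ttl_s
        self._cache = self._load_snapshot()          # cold start: last known good state
        self._last_sync = float("-inf")              # force a refresh on first evaluation

    def _load_snapshot(self) -> Dict[str, Any]:
        if os.path.exists(self._path):
            with open(self._path) as f:
                return json.load(f)
        return {}

    def _refresh(self) -> None:
        try:
            flags = self._fetch()                    # may fail if the flag store is unreachable
            self._cache = flags
            self._last_sync = time.monotonic()
            with open(self._path, "w") as f:         # persist last known good for cold starts
                json.dump(flags, f)
        except Exception:
            pass                                     # keep serving from cache on failure

    def is_enabled(self, name: str, default: bool = False) -> bool:
        if time.monotonic() - self._last_sync > self._ttl:
            self._refresh()
        return bool(self._cache.get(name, default))  # safe default when the flag is absent
```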
Failure modes to test:
| Failure | Expected behavior |
|---|---|
| Control plane API down | Clients use cached values; no impact |
| Client cache expired + control plane down | Client falls back to persistent snapshot |
| Client cold start + control plane down | Client uses embedded defaults |
| Flag store corruption | Client detects checksum mismatch; uses last valid snapshot |
SLO: Flag evaluation latency < 1ms p99 (must be in hot path). Flag propagation delay < 60 seconds (time for a new flag to reach all clients).
2. Linux internals and systems knowledge#
Google SRE interviews go deeper into Linux than most SRE interviews. Expect questions about kernel behavior, memory management, and process states — not just “what command do you use.”
Q4. How does malloc() work at the kernel level?#
Reported source: Keypressure.com — “How I Failed a Google SRE Interview”
`malloc()` is a libc function, not a syscall. It manages a heap allocator in user space. Two kernel interfaces back it:
- `brk()`/`sbrk()`: Extends the process heap by moving the program break (end of the data segment). Used for small allocations. Fast — no page fault until memory is actually touched.
- `mmap(MAP_ANONYMOUS)`: Allocates a new memory region for large allocations (typically > 128 KB by default in glibc). The OS returns zero-filled pages. Freed memory can be returned to the OS with `munmap()`.
What the allocator does between these syscalls:
glibc’s malloc (ptmalloc2) maintains free lists (“bins”) of various sizes. When `malloc(n)` is called:
- Search the appropriate bin for a free chunk of size ≥ n.
- If found, return it (no syscall).
- If not, call `brk()` or `mmap()` to get more memory from the kernel.
Why this matters in SRE contexts:
- A process with high RSS but low actual memory use is likely experiencing heap fragmentation — many small allocations scattered across a large heap. The allocator can’t return fragmented regions to the OS.
- `strace` showing repeated `brk()` calls with small increments signals inefficient small-allocation patterns.
- `mmap()`/`munmap()` pairs in a tight loop indicate a large-allocation workload that might benefit from an alternative allocator (e.g., tcmalloc, jemalloc).
Q5. Why does a process end up in uninterruptible sleep (D-state)? How do you debug it?#
Reported source: DEV Community / AceInterviews, multiple Glassdoor reports
D-state (`TASK_UNINTERRUPTIBLE`): The process is waiting for an I/O operation that cannot be interrupted — typically a kernel-level wait on disk I/O or an NFS call. Unlike sleeping in S-state (interruptible), a D-state process ignores signals, including `SIGKILL`.
Common causes:
| Cause | Details |
|---|---|
| Slow or hung disk I/O | Waiting for a read/write to return from a degraded or failing storage device |
| NFS hang | NFS server unresponsive; with a hard mount the kernel waits indefinitely (soft mounts time out instead) |
| Kernel bug or deadlock | Rare; process stuck inside a kernel critical section |
| Memory pressure + swap | Kernel waiting for a page to be swapped in from slow swap device |
How to debug:
```bash
# Identify D-state processes
ps aux | awk '$8 ~ /^D/'

# Show the kernel call stack for a stuck process (as root)
cat /proc/<pid>/wchan    # The kernel function the process is waiting in
cat /proc/<pid>/stack    # Full kernel stack trace

# Check for I/O wait system-wide
iostat -x 1    # Look for await > 1000ms or util% near 100%
iotop -o       # Live per-process I/O

# For NFS specifically
nfsstat -c     # Check retransmit counts
dmesg | grep -i nfs
```
What to do:
- If the storage device is failing: failover to a healthy device, then investigate the hardware.
- If NFS: remounting with the `soft` option (vs. `hard`) prevents indefinite hangs but risks data integrity — evaluate against the workload.
- If a kernel deadlock is suspected: capture a kernel thread dump (via the magic SysRq key, e.g. `echo t > /proc/sysrq-trigger`, which dumps all task states to the kernel log) and file a kernel bug.
Q6. Service latency doubled but CPU utilization is only at 50%. What do you investigate?#
Reported source: DEV Community / AceInterviews — described as “tests understanding of CFS throttling”
This is a deliberate trap — the obvious guess (CPU is the bottleneck) is disproven by the 50% figure. The real cause is often cgroup CPU throttling from a misconfigured `cpu.cfs_quota_us`, which limits a container’s CPU even when the host has spare capacity.
Investigation sequence (SRE-STAR(M)):
1. Symptom: Latency doubled; CPU at 50% (as reported by `top` or host-level metrics).
2. Triage — rule out the obvious:
   - Is the latency increase uniform across all endpoints or only some? (Narrow scope)
   - Did anything change recently — deploy, config update, traffic pattern?
3. Assess — check CFS throttling (a small parsing helper follows this list):
   ```bash
   # Inside the container or via kubelet stats
   cat /sys/fs/cgroup/cpu/cpu.stat
   # Look for: throttled_time (nanoseconds)
   # If throttled_time is increasing, cgroup quota is the bottleneck
   ```
   Kubernetes exposes this as the `container_cpu_cfs_throttled_seconds_total` metric in cAdvisor/Prometheus.
4. Other hypotheses to check in parallel:
   - Lock contention: Application threads blocking on a mutex; CPU is not busy but work isn’t progressing. Use `perf lock` or profiling.
   - I/O wait: Disk or network I/O; `iostat`, `netstat -s` for retransmits, `ss -s`.
   - GC pressure: For JVM/.NET workloads, GC pauses cause latency spikes without raising CPU. Check GC logs.
   - Downstream dependency: A slow database query or external API call. Check distributed traces.
   - Memory bandwidth saturation: Rare, but possible on memory-intensive workloads.
5. Root cause: If CFS throttling is confirmed, fix by either increasing `cpu.cfs_quota_us` (i.e., raising the CPU limit in the Kubernetes pod spec) or by reducing the CPU request/limit mismatch.
6. Mitigation: Increase CPU limits or restructure the workload to avoid bursty CPU patterns that exceed quota.
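A small helper to quantify throttling from `cpu.stat`, as mentioned in step 3. This is a sketch assuming the cgroup v1 path used above; cgroup v2 exposes equivalent `nr_periods` and `nr_throttled` counters in `/sys/fs/cgroup/cpu.stat`:

```python
def read_cpu_stat(path: str = "/sys/fs/cgroup/cpu/cpu.stat") -> dict:
    """Parse the cgroup cpu.stat file into a dict of integer counters."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

stats = read_cpu_stat()
periods = stats.get("nr_periods", 0)
throttled = stats.get("nr_throttled", 0)
if periods:
    pct = 100 * throttled / periods
    print(f"Throttled in {throttled}/{periods} CFS periods ({pct:.1f}%), "
          f"total throttled time {stats.get('throttled_time', 0) / 1e9:.1f}s")
```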
Q7. How would you delete a file named -rf?#
Reported source: Keypressure.com — noted as a “basic UNIX” question used to test attention to detail
```bash
rm -- -rf    # The -- signals end of options; -rf is treated as a filename
rm ./-rf     # Prefixing with ./ makes it a path, not a flag
```
This tests that you understand how option parsing works — `rm` interprets `-rf` as the `-r -f` flags unless you force it to be treated as a literal filename.
3. Troubleshooting / casualty rounds#
In these rounds, the interviewer presents a real-looking incident scenario and asks you to drive the response. They evaluate your reasoning process using the SRE-STAR(M) framework: Symptom → Triage → Assess → Root Cause → Mitigation.
What interviewers watch for:
- Do you start with impact assessment before diving into root cause?
- Do you ask for data before speculating?
- Do you sequence actions logically (stop the bleeding before investigating)?
- Do you communicate your reasoning, not just your conclusions?
Q8. You’re on-call and receive: “Shakespeare-BlackboxProbe_SearchFailure — no search results for the past 5 minutes.” Walk through your response.#
Reported source: This scenario comes directly from the Google SRE Workbook (publicly available). Glassdoor candidates confirm similar black-box probe scenarios are used in interviews.
Symptom: Black-box probe reports search failure for 5 minutes. Black-box failure means the end-user experience is broken — this is high severity.
Triage — establish scope and impact:
- Is this alert firing for all queries or specific ones? Check if other black-box probes are failing (different queries, different regions).
- Is the white-box monitoring (internal metrics) also showing errors, or only the black-box probe? A divergence suggests the internal service is up but something in the serving path is broken.
- Check user-facing error rates in dashboards. Quantify impact: how many QPS affected?
Assess — what changed recently?:
- Check the deployment/change log: any deploy in the last hour?
- Check correlated alerts: are any backend services (indexing, serving, load balancers) showing alerts?
Root cause investigation:
Work down the serving stack:
- Are the frontend serving nodes returning errors? Check HTTP response codes.
- Are backend index servers reachable from frontends? Check internal RPC error rates.
- Is the search index itself current? (Could indicate an indexing pipeline failure, not a serving failure.)
- Is the load balancer routing correctly? Check backend health check status.
Mitigation (before root cause is confirmed):
- If a bad deploy is suspected: roll back the serving binary immediately — don’t wait for root cause confirmation.
- If specific backends are unhealthy: drain traffic away from them and let the load balancer route to healthy ones.
- Communicate: page the owning team if you’re not part of it; post a status update every 15 minutes.
Resolution + Postmortem:
- Once service is restored, document the timeline and root cause.
- Write a blameless postmortem: What failed? Why wasn’t it caught earlier? What automation or monitoring would detect this sooner?
Q9. Global load balancers are returning 503 errors but all backend health checks pass as healthy. What happened?#
Reported source: DEV Community / AceInterviews — NALSD troubleshooting scenario
This is a classic “the monitoring doesn’t match the reality” scenario.
Triage:
- Confirm the 503s are real: check the load balancer access logs directly, not just aggregated dashboards (could be a monitoring collection lag).
- Are 503s coming from all regions or one? All load balancers, or only a subset?
Hypotheses (backends pass health checks but traffic fails):
1. Health check and production traffic use different paths: The health check hits `/healthz` (always returns 200), but real requests hit `/api/search`, which is broken. The LB sees a healthy backend but serves broken traffic.
   - Check: Manually send a real production-shaped request to one backend (a probe sketch follows this list). Does it return an error?
2. Backend connection pool exhausted: Backends are alive but all connections are in use; new requests are immediately rejected. Health check uses its own dedicated connection, so it succeeds.
   - Check: Backend metrics for active connections, connection queue depth, thread pool saturation.
3. TLS certificate mismatch between LB and backend: LB can reach the backend for health checks (if health checks are HTTP), but the TLS handshake for real HTTPS traffic fails.
   - Check: `openssl s_client -connect backend:443` from the load balancer.
4. Downstream dependency failure: Backends are healthy but every request fails because a downstream dependency (database, cache) is down. The backend returns 503 to the LB.
   - Check: Distributed traces or backend logs for dependency error messages.
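A quick way to exercise hypothesis 1 from the load balancer or a bastion host. This is a sketch; the backend address and the `/healthz` and `/api/search` paths are the illustrative endpoints from that hypothesis:

```python
import urllib.error
import urllib.request

def probe(url: str) -> str:
    """Return the HTTP status (or error) for a single GET against a backend."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return f"{url} -> {resp.status}"
    except urllib.error.HTTPError as e:        # server responded with an error status
        return f"{url} -> {e.code}"
    except Exception as e:                     # connection refused, TLS failure, timeout...
        return f"{url} -> {type(e).__name__}: {e}"

backend = "http://10.0.0.12:8080"              # illustrative backend address
print(probe(f"{backend}/healthz"))             # what the health check sees
print(probe(f"{backend}/api/search?q=test"))   # what real traffic sees
```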
Mitigation: Identify which hypothesis matches the evidence and address specifically. If hypothesis 1 (bad health check definition), update the health check endpoint to use a representative request. If hypothesis 2 (pool exhausted), either reduce traffic (rate limiting) or scale out backends immediately.
Q10. A BGP route leak is suspected — traffic to your service from South America suddenly routes through Asia, adding 200ms of latency. How do you respond?#
Reported source: DEV Community / AceInterviews
Triage:
- Confirm the routing change: use traceroute or MTR from multiple vantage points in South America. Look for unexpected ASNs in the path.
- Check BGP looking glass (e.g., RIPE RIS, route-views) to see if your prefixes are being announced with an unexpected path.
- Quantify impact: what percentage of South American traffic is affected?
Assess:
- Which ASN is leaking the route? The leak usually originates at a transit ISP or peer that accepted a more-specific prefix from a customer and propagated it upstream.
- Do you control the BGP announcements (if you run BGP), or are you dependent on a cloud provider?
Mitigation options:
- If you manage BGP: Adjust local preference or MED to force traffic back through the correct AS path. If you have anycast, ensure your South American PoP is announcing with correct community tags.
- If using a cloud provider: Open an emergency ticket with the provider; they can de-preference the leaked route. Meanwhile, consider announcing more-specific prefixes of your address space so they take precedence over the leaked route.
- Application-level mitigation: If the provider fix will take hours, consider geo-routing at the application layer (DNS-based routing) to send South American users to a closer endpoint directly, bypassing the broken BGP path.
Postmortem actions: Implement BGP monitoring (e.g., BGPalerter, RIPE BGP monitoring) to detect future route leaks within minutes rather than hours.
4. SLO, SLI, and error budget questions#
Google invented the SLO/SLI/error budget framework and expects SRE candidates to apply it fluently — not just define the terms.
Q11. How do you define SLOs for a service that has never had them before?#
Reported source: InterviewBit, IGotAnOffer — confirmed by multiple Glassdoor reports
Step 1 — Identify the critical user journeys
SLOs should measure what users actually care about. For a search service: can users get results? Are results returned fast enough to be useful? For a storage API: can users read and write data reliably?
Start with 2–3 user journeys; each gets its own SLI and SLO. Avoid measuring everything — a dashboard with 50 SLOs is noise.
Step 2 — Define SLIs from user journeys
An SLI is a ratio: (good events) / (total events) over a rolling window.
| User journey | SLI |
|---|---|
| “Search returns results” | % of search requests returning HTTP 200 with ≥1 result |
| “Reads complete quickly” | % of read requests completing in < 200ms |
| “Writes are durable” | % of acknowledged writes visible within 5 seconds |
Avoid SLIs you cannot measure. If you don’t have latency histograms today, add them before defining a latency SLO.
Step 3 — Set the initial SLO conservatively
For a new service with no historical data:
- Set a “comfort” SLO: what level would you be embarrassed to miss? Start at 99% (two 9s). This gives you 7.3 hours of downtime per month.
- After 4–8 weeks of data, review: are you consistently achieving 99.5%? Tighten the SLO. Frequently burning the budget? Loosen or fix the underlying reliability issues.
Step 4 — Calculate error budget and define policy
- SLO = 99.9% → error budget = 0.1% ≈ 43 minutes/month
- Policy: if error budget is >50% consumed in the first 2 weeks, freeze non-critical feature work and focus on reliability.
- Policy: if error budget is unused for 3 consecutive months, the SLO may be too loose — revisit.
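The budget arithmetic as a sketch (assuming a 30-day month):

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_pct / 100)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):,.1f} min/month")
# 99.0% -> 432.0, 99.9% -> 43.2, 99.99% -> 4.3 min/month
```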
Step 5 — Socialize and get buy-in
SLOs only work if the product team and engineering leadership agree on the policy. A unilaterally declared SLO that nobody follows is worse than no SLO.
Q12. How do you use an error budget to decide between shipping a new feature and doing reliability work?#
Reported source: InterviewBit, FinalRoundAI — core Google SRE philosophy question
The error budget is the quantitative answer to the usually-political question of “reliability vs. velocity.”
The framework:
| Error budget remaining | Action |
|---|---|
| > 50% remaining | Green light for feature work; reliability is healthy |
| 25–50% remaining | Increase testing rigor on new releases; slow canary rollouts |
| < 25% remaining | Freeze new feature deployments; focus engineering time on reliability |
| Budget exhausted | Full freeze on feature work until budget partially recovers |
Why this works: The decision is not made by the SRE team vs. the product team arguing in a meeting — it follows mechanically from the SLO and the measured reliability. This removes the politics.
Nuances:
- Budget exhaustion from causes outside engineering control (e.g., a third-party provider outage): the team might choose to exclude this from the budget calculation for the policy decision, since engineering can’t fix it.
- Planned downtime: Maintenance windows consume error budget. Schedule them when budget is healthy, not when it’s almost exhausted.
- Differentiate symptoms: A budget burned by a single large incident (→ fix the root cause) differs from one burned by many small incidents (→ fix the systemic fragility).
5. Operational coding#
Google SRE coding rounds are not LeetCode algorithmic puzzles. They test whether you can write production-quality code for reliability tasks: log parsing, monitoring scripts, concurrent system utilities.
Q13. Write a script that reads a 50 GB log file, extracts HTTP 5xx errors, and outputs a summary — using no more than 512 MB of RAM.#
Reported source: DEV Community / AceInterviews — operational coding round question
The constraint rules out loading the file into memory. You must stream it line by line.
```python
import sys
from collections import Counter


def summarize_5xx(filepath: str) -> None:
    counts: Counter = Counter()
    total_lines = 0
    with open(filepath, "r", buffering=1 << 20) as f:  # 1 MB read buffer
        for line in f:
            total_lines += 1
            # Assumes Combined Log Format: ... "GET /path HTTP/1.1" 503 ...
            parts = line.split()
            if len(parts) >= 9:
                try:
                    status = int(parts[8])
                    if 500 <= status < 600:
                        counts[status] += 1
                except (ValueError, IndexError):
                    pass

    print(f"Total lines processed: {total_lines:,}")
    print(f"HTTP 5xx errors found: {sum(counts.values()):,}")
    print("\nBreakdown by status code:")
    for code, count in sorted(counts.items()):
        print(f"  {code}: {count:,}")


if __name__ == "__main__":
    summarize_5xx(sys.argv[1])
```
Why this stays within 512 MB:
- The file is never loaded into memory; `for line in f` reads one line at a time using Python’s file iterator.
- `buffering=1 << 20` sets a 1 MB read buffer — trades syscall count for memory.
- The `Counter` only grows with the number of distinct 5xx codes (at most ~100 entries), not with file size.
Production additions to discuss:
- Handle compressed logs (`.gz`): use `gzip.open()` with the same streaming approach.
- Parallel processing: if runtime matters, split the file into chunks and process with `multiprocessing.Pool`, then merge counters.
- Progress reporting: `tqdm` wrapping the file iterator for long-running jobs.
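For the compressed-log case, the same streaming pattern holds. A sketch using `gzip.open` in text mode (the helper name is illustrative):

```python
import gzip

def iter_log_lines(filepath: str):
    """Yield lines from a plain or gzip-compressed log without loading it into memory."""
    opener = gzip.open if filepath.endswith(".gz") else open
    with opener(filepath, "rt") as f:   # "rt" = text mode, decompressed on the fly
        yield from f
```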
What the interviewer checks: Do you immediately reach for `readlines()` or `f.read()` (wrong — loads everything) or do you reason about memory constraints before writing a line of code?
Q14. Implement a function that fetches URLs concurrently, respects a rate limit, and enforces a per-request timeout.#
Reported source: DEV Community / AceInterviews — operational coding round
```python
import asyncio
import aiohttp
from typing import List, Dict


async def fetch_with_limit(
    urls: List[str],
    max_concurrent: int = 10,
    rate_per_second: int = 5,
    timeout_seconds: float = 10.0,
) -> Dict[str, dict]:
    results: Dict[str, dict] = {}
    semaphore = asyncio.Semaphore(max_concurrent)
    # Token bucket: release one token every (1 / rate_per_second) seconds
    rate_limiter = asyncio.Semaphore(rate_per_second)

    async def release_rate_token():
        await asyncio.sleep(1.0 / rate_per_second)
        rate_limiter.release()

    async def fetch_one(session: aiohttp.ClientSession, url: str) -> None:
        await rate_limiter.acquire()
        asyncio.ensure_future(release_rate_token())
        async with semaphore:
            try:
                async with session.get(
                    url,
                    timeout=aiohttp.ClientTimeout(total=timeout_seconds)
                ) as resp:
                    results[url] = {
                        "status": resp.status,
                        "body": await resp.text(),
                        "error": None,
                    }
            except asyncio.TimeoutError:
                results[url] = {"status": None, "body": None, "error": "timeout"}
            except aiohttp.ClientError as e:
                results[url] = {"status": None, "body": None, "error": str(e)}

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_one(session, url) for url in urls))

    return results
```
Key points to explain:
- `Semaphore(max_concurrent)` caps in-flight requests.
- The rate limiter uses a token bucket pattern: one token is released per `1 / rate_per_second` seconds.
- `aiohttp.ClientTimeout` enforces the per-request timeout at the HTTP client level.
- Errors are captured per-URL, not raised — a reliability script must handle partial failures gracefully.
6. SRE philosophy and behavioral questions#
Q15. What is the difference between toil and engineering work? Give an example of converting toil into engineering.#
Reported source: Glassdoor, InterviewBit — standard SRE philosophy question
Toil is operational work with these properties:
- Manual (requires a human each time)
- Repetitive (the same task runs over and over)
- Automatable (a machine could do it)
- Scales linearly with service growth (more traffic = more toil)
- Produces no lasting improvement to the system
Examples of toil: manually restarting a flaky service, manually provisioning servers for each new customer, manually checking alert dashboards.
Engineering work permanently improves the system: it reduces future toil, adds reliability, or improves developer productivity.
Conversion example:
- Toil: On-call engineers manually restart the payment service every Thursday night because a memory leak causes it to OOM crash around that time (weekly job triggers high load).
- Engineering response: Profile the service, identify the allocation causing the leak, fix the root cause in code. Alternatively: instrument the service with automatic restart on memory threshold and alert on leak rate rather than on crash. The Thursday manual restart becomes unnecessary.
Google’s guideline: SREs should spend no more than 50% of their time on toil. If an engineer is spending more, they should escalate to management — toil growth signals that engineering investment in automation is overdue.
Q16. Describe a stressful incident you were part of. What did you do, and what would you do differently?#
Reported source: Glassdoor — behavioral round question at Google SRE interviews
This is a behavioral question that Google uses to assess incident management maturity. Use the STAR format, but emphasize what you specifically did (not “we”), what the impact was, and the lessons.
What Google looks for:
- Calm under pressure: Do you describe a chaotic situation in organized terms?
- Communication: Did you keep stakeholders informed?
- Postmortem mindset: Do you focus on systemic fixes, not blame?
- Honest reflection: Do you acknowledge mistakes without being defensive?
Example structure:
Situation: During peak load on Black Friday, our checkout service latency jumped 10× — users were abandoning carts.
Task: I was the on-call SRE.
Action: I immediately checked our error budget dashboard — we were burning budget at 100× the normal rate. I correlated the latency spike with a database query regression in a deploy 2 hours earlier. I rolled back the deploy within 15 minutes of alert, communicated status in our incident Slack channel every 10 minutes, and paged the database team to verify no data corruption.
Result: Service recovered within 30 minutes. Error budget impact was significant but within our quarterly budget.
Reflection: The deploy passed all staging tests because the load pattern in staging didn’t reproduce the Black Friday traffic profile. I would add load testing with production-shaped traffic to the release gate for high-risk deployments.
Interview preparation tips for Google SRE#
For NALSD: Practice starting every design by estimating load, storage, and throughput. If you can’t write down numbers in the first 5 minutes, that’s a gap to address. Work through the questions in the Google SRE Workbook — it contains NALSD exercises.
For Linux rounds: Go deeper than `top`, `ps`, and `df`. Understand how the kernel scheduler works, what CFS throttling means, and how to read `/proc` files. Brendan Gregg’s website and his book Systems Performance are the standard references.
For troubleshooting rounds: Practice narrating your reasoning out loud. The interviewer needs to hear your thought process, not just your conclusion. If you hit a dead end, say so explicitly: “I’ve ruled out X and Y — the next hypothesis I’d check is Z by looking at…”
For coding: Test your solution against edge cases before declaring it done. Mention memory and runtime characteristics unprompted — Google SRE interviewers expect you to reason about production constraints.
Related pages#
- DevOps & SRE interview questions — general — multi-region architecture, CI/CD design, SLOs, incident management, and Kubernetes at scale
- Linux — OS fundamentals for SRE
- Kubernetes — deep dive on Kubernetes concepts
- Networking (TCP/IP, load balancing) — networking for distributed systems
- AWS & cloud concepts — cloud architecture and services
- Company SRE interview questions — Google, Meta, Amazon and other top tech companies