https://github.com/aws/s2n-netbench/tree/main/netbench-orchestrator
https://github.com/aws/s2n-quic/pull/1986
S
- Add a mitigation for the optimistic ACK attack.
- The RFC was clear about how to do this (skip packet numbers), but vague because it didn't specify how many packets to skip or how often.
T
- Come up with a strategy for skipping packets and mitigating the attack.
- Implement the mitigation in s2n-quic.
A
- Audited other QUIC implementations and conducted analysis to answer two key questions:
- How many packets to skip?
- How often should packets be skipped?
- There were 2 other implementations using a "static" approach (skipping did not evolve with cwnd). Their strategy was to:
  - Track a single skipped pn, overwriting the value when a new pn needs to be skipped.
  - Skip a random pn in some static range.
- The "static" approach raised network-dependent questions (DC vs public wifi):
  - Does overwriting the skip pn nullify the mitigation?
  - Is a static skip range effective for all networks?
  - A cwnd is very different in a DC vs the public internet.
- Considered the option of skipping multiple packets:
  - Pro: would allow us to skip more frequently.
  - Con: requires storing multiple skip pns. How many should we store?
- Analyzed the purpose of the mitigation to come up with an optimal solution.
  - The goal of the mitigation is to prevent cwnd bloat,
  - which an attacker can cause by acking packets before actually receiving them.
  - So the skip range should be based on the number of packets that can be received within one cwnd.
- Solution: evolve the skip packet range based on cwnd and only store 1 skip pn.
  - Calculate the range based on the packets we expect to send in a single period:
pkt_per_cwnd = cwnd / mtu
rand = pkt_per_cwnd/2..pkt_per_cwnd*2
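A minimal sketch of that calculation (illustrative names, not the exact s2n-quic code), assuming cwnd and mtu are in bytes:

    use rand::Rng;

    // Illustrative sketch of the evolving skip range, not the actual s2n-quic code.
    // `cwnd` and `mtu` are assumed to be in bytes.
    fn next_skip_offset(cwnd: u64, mtu: u64, rng: &mut impl Rng) -> u64 {
        // packets we expect to send in a single period (one congestion window)
        let pkts_per_cwnd = (cwnd / mtu).max(1);
        // pick the next packet number to skip somewhere between half and twice that
        // count, so the skip range grows and shrinks with the congestion window
        rng.gen_range(pkts_per_cwnd / 2..=pkts_per_cwnd * 2)
    }

Because the range tracks cwnd, the same logic works in a DC and on the public internet without tuning a static constant.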
R
- Successfully implemented the mitigation.
- Only had to store 1 skip pn.
- By evolving the skip range based on cwnd, s2n-quic would scale to all networks.
summary
- tracking issue: https://github.com/aws/s2n-quic/issues/1276
- CPU increase: https://github.com/aws/s2n-quic/pull/1298
- revert PR with some analysis: https://github.com/aws/s2n-quic/pull/1368
- measurements of batching multiple packets: https://user-images.githubusercontent.com/4350690/174196918-6af428e4-9ab7-4458-b3b9-e27ed89c3318.png
metrics
S
- Explore the ACK delay RFC and see if s2n-quic could benefit from implementing it.
Pros of delaying acks:
- sending/receiving acks is CPU expensive
  - on the send-heavy server it takes 24.33% of CPU
- asymmetric links like satellite benefit from fewer acks
Cons of delaying acks:
- progress: delayed acknowledgments can delay data being sent by the peer
  - the RFC lets you adjust thresholds
- ECN: congestion signal from the network
  - ack packets with ECN markings immediately
- loss recovery: detecting that a packet was not received and retransmitting it
- CC: regular acks help establish good RTT estimates and drive congestion control
- BBR: requires a high-precision signal to work, so it was unclear how delaying acks would affect it
RFC features:
- negotiating the extension: min_ack_delay, the minimum amount of time that the endpoint sending this value is willing to delay an acknowledgment
- ACK_FREQUENCY frame:
  - Ack-Eliciting Threshold: max ack-eliciting packets received before sending an ack
  - Requested Max Ack Delay: max time to wait before sending an ack
  - Reordering Threshold: max number of reordered packets before sending an ack
- IMMEDIATE_ACK frame: asks the peer to send an immediate ack
- Expedite ECN: ECN-marked packets should not be delayed
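For illustration, the frame fields listed above written as plain Rust types (paraphrasing the draft; these are not s2n-quic's actual definitions, and field widths are illustrative):

    // Illustrative only; field names paraphrase the ack-frequency draft.
    struct AckFrequencyFrame {
        ack_eliciting_threshold: u64,    // max ack-eliciting packets received before sending an ack
        requested_max_ack_delay_us: u64, // max time (microseconds) to wait before sending an ack
        reordering_threshold: u64,       // max number of reordered packets before sending an ack
    }

    // IMMEDIATE_ACK carries no fields: it just asks the peer to ack right away.
    struct ImmediateAckFrame;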
T
Choose an implementation.
Options:
- single-round batching (10-packet GSO payload)
  - easy to implement (1-2 sprints)
  - this solution creates building blocks for the others after it
- multi-round batching (multi-GSO payload) (2-4 sprints)
  - medium difficulty to implement
  - requires tuning with the application, production use case, and traffic pattern
- implement the ACK delay RFC (4-8 sprints)
  - difficult to implement
  - requires tuning with the application, production use case, and traffic pattern
  - requires negotiating the extension
A
Impl: Batch ACK Processing (single round)
- single-round batching (10-packet GSO payload)
- connection can signal interest in processing acks
- store pending acks on the connection
- refactor congestion control (CC) and loss recovery (LR) to accept signals from batched ack processing (timestamps)
- swap from processing acks one at a time to batched processing (sketched below)
- emit metrics
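A rough sketch of the single-round batching shape (illustrative types and names, not the actual s2n-quic internals):

    // Illustrative sketch of single-round ACK batching.
    struct PendingAck {
        largest_acked: u64,
        timestamp_us: u64, // receive timestamp forwarded to loss recovery / CC
    }

    #[derive(Default)]
    struct ConnectionAckState {
        pending: Vec<PendingAck>,
    }

    impl ConnectionAckState {
        // Per received packet: record the ack and signal interest, defer the work.
        fn on_ack_frame(&mut self, ack: PendingAck) {
            self.pending.push(ack);
        }

        // Once per receive round (e.g. after a 10-packet GSO payload): drain the
        // batch and drive loss recovery / congestion control in a single pass.
        fn process_batch(&mut self) {
            for ack in self.pending.drain(..) {
                // hand `ack.largest_acked` / `ack.timestamp_us` to LR and CC here
                let _ = ack;
            }
        }
    }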
R
- Flamegraph result: 24.33% -> 25.81% https://github.com/aws/s2n-quic/pull/1298
- Batching different amounts of packets (2, 10, 40, 80): https://user-images.githubusercontent.com/4350690/174196918-6af428e4-9ab7-4458-b3b9-e27ed89c3318.png
lessons learned:
- we had to be cautious about delaying acks because we were operating on the public internet
- I wonder if an environment like a DC would be better suited for delaying acks
- acks are a signal within the noise of loss, delay, congestion, etc.
- within a DC there is less noise, so it makes sense that we could get away with less signal
metrics
- 4000 tests
- ~1500 scoped-down TLS 1.2 tests
- ~33 "default" policy tests
S
- Include TLS 1.3 support by default.
- A good modern default: TLS 1.3 is provably secure and uses modern ciphers.
T
- Enabling this can be risky because customers will see a change in behavior.
- It would also affect test coverage for TLS 1.2:
  - tests written against the TLS 1.2 policy can pass when switched to TLS 1.3, because they were written before TLS 1.3 support existed and might not assert the negotiated protocol
  - some tests intentionally exercise the "default" policy
  - some tests intentionally exercise the TLS 1.2 policy
  - it is not possible to tell the difference between these two kinds of tests
A
- assess the different options
- balance the security risk against the risk of not making progress
- lay out the options and make a case for one
R
- manually audit 4000 tests
- scope bad TLS 1.2 tests to certain files
- pin all tests that are using the default to a numbered policy
  - test regression for "default" policy tests
- don't pin tests
  - test regression for TLS 1.2 tests
- run all tests twice (different platform, libcrypto, fuzz, valgrind)
- run a single TLS 1.2 test and accept a minimal risk of TLS 1.2 test regression
https://github.com/aws/s2n-quic/issues/1137
S
Currently s2n-quic performs certificate lookup/loading operations synchronously.
This is not ideal for applications which serve multiple domains concurrently and need to load multiple certificates, since the lookup would block the thread.
T
Allow certificate lookup/loading operations to be performed asynchronously and enable non-blocking behavior.
A
- s2n-quic:
  - pass the connection waker down to the TLS libraries so that they can wake the connection on progress
- s2n-tls:
  - The work involved converting the invoked-only-once callback model to a poll-the-callback model in s2n-tls.
  - s2n-tls by default did not allow callbacks to be polled.
  - s2n-tls previously called the callback exactly once, which does not match the Rust model and has quite a few drawbacks:
    - Invoking the callback only once means the application/s2n-quic has to schedule the completion of the callback itself (possibly on a separate thread).
    - It also has to manage the state associated with the callback separately.
  - The Rust polling model allows all state associated with the future to live within the object being polled (sketched below).
  - Additionally, the future can make progress as part of the runtime that s2n-quic already starts with.
- s2n-tls bindings:
  - glue the new callback polling behavior in an extensible way for other callbacks.
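A minimal sketch of the poll-the-callback shape (illustrative trait and types, not the actual s2n-tls bindings API):

    use core::task::{Context, Poll};

    // Placeholder for whatever the async lookup resolves to.
    struct ResolvedCert;

    trait AsyncCertCallback {
        // Polled repeatedly (instead of being invoked exactly once) until the lookup
        // completes. All lookup state lives inside `self`, and the waker in `cx` is
        // the connection waker passed down from s2n-quic, so progress wakes the
        // connection on its existing runtime.
        fn poll_cert(&mut self, cx: &mut Context<'_>) -> Poll<ResolvedCert>;
    }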
R
- set up the framework for future async callbacks in s2n-tls
S
The previous implementation closed the connection by default if the bytes transferred dropped below some user-specified amount.
T
While a simple and effective implementation, this seemed like a sharp edge and made me uneasy, because:
- the user-specified value could become stale
- the default action of closing the connection could be an availability risk
- in a worst-case scenario this could lead to a stampede of all connections closing
A
R
summary
metrics
- conduct initial review of doc written by SDE1
- provide feedback on structure and clarity of doc
S
T
A
R
Project plan: https://quip-amazon.com/Wa4iADaP4cqI/Intern-Project-LM-Detective
S
- The Dispatcher is a critical component in LM but our team lacked metrics.
- The intern project "LM Detective" would give us:
- better understanding of Dispatcher Service and ongoing migrations
- allow LM to take a data driven approach to improving the Dispatcher service and scheduling
T
- Draft the intern project plan, omitting some information to probe for understanding.
- Help onboard the intern to AWS and LM-specific technology.
- Educate the intern about LM technology and help guide them to a successful delivery.
A
- Met with the intern multiple times a week to ensure progress.
- Guided the intern to set milestones and define a project plan.
- Helped define the stretch goals and helped prepare their final presentation.
R
- The intern successfully completed the project and received a return offer.
- LM could use the project to answer questions such as:
- What is the state of the dispatcher given a specific migration id?
- For all migrations in pending/executing state, what are the src droplets and tgt droplets?
- Why is a particular migration in status pending?
- The intern also completed the stretch goal for the project
- Create a UI (graph viz) to visualize the data
Project plan: https://quip-amazon.com/NtptAZ55eOpb/LM-in-PDT-Work
S
- LM was tasked to make the service available in PDT.
- PDT should be the last region to launch, which meant LM also had to be available in all other regions.
T
- Build the entire LM stack in PDT to meet gov compliance. (canaries, dispatcher, alarms, metrics)
- Created a project plan to identify all components and missing regions.
- Automate the infrastructure to make builds reproducible.
A
- Cleaned up our LPT rules to make service launches reproducible, extensible and maintainable.
- Synthesized new alarms, cleaning up/creating dashboards along the way.
R
- Full LM availability in all regions, including PDT (met gov compliance).
- A clean, normalized LPT stack for all services owned by LM.
metrics
- helped surface questions and drive conversations for support in restricted regions
- created and managed a doc to track work for the PDT region build
- debugged various issues as they came up during the rebuild
metrics
- while fixing the workflow in PDT, I took the initiative to simplify metrics for all regions
- removed 2 metrics packages
- replaced packaged code with ~100 lines of code that capture the same metrics in a less convoluted manner
- dug into the old profiler library codebase to understand the implementation
  - verified metrics in igraph against what the code was posting
S
T
A
R
bias for action: post-LSE, I took the initiative to dig into customer impact. Created a doc to track impact, pulled in senior engineers for advice, and drafted a remediation script. Helped discover another race-condition bug which was customer impacting.
metrics
- remediated ~40 customer impacting instances
S
T
A
R
metrics
- conducted initial deep research into ICEing and its various causes
  - categorized data into actionable and non-actionable
- started conversations with other teams to help gather more metrics
S
T
A
R
metrics
- auto-deploy rules
- integrate with the workflow
- write a requirements doc for deprecating the old andon
- change matching rules from regex to string (eliminate errors)
- created a pipeline for auto-releasing the andon rules
- added alarms to catch if rules were not being read from the LM workflow
- added validation logic to better catch malformed rules during code review
S
T
A
R
- invent and simplify: reuse the hibernate codebase, especially the tests and validation. Also simplify the logic to make it more adaptable.
- bias for action: researched the other team's use case; simplified and structured the codebase to allow for extensibility. Drove consensus when technical issues (Java type generics) or framework choices made the right choice non-obvious.
FAQ: https://quip-amazon.com/3ujuAnetkBr1 Dashboard: https://w.amazon.com/bin/view/EC2/LiveMigration/Dashboards/DropletFingerprintService/
S
- KaOS owns fingerprint rules.
- KaOS deletes and then creates a DynamoDB table with updated rules. This causes elevated ICEing.
- LM consumed these rules.
T
- Create a single highly available service to serve fingerprint rules.
- De-risk KaOS rolling out new fingerprint rules.
- Take the opportunity to change the format of the fingerprint rules. This was a high-risk item since a change in format made it difficult to detect mistakes in the parsing logic.
- Emit metrics for fingerprint matching rules to track the flow of migrations.
A
- Wrote a FAQ to convince leadership: https://quip-amazon.com/3ujuAnetkBr1
- Coordinated with the KaOS team and launched a multi-region LM service.
- Wrote extensive tests to ensure the fingerprint format change produced equivalent rules.
- Mentored a junior teammate (karthiga) on creating a detailed service health dashboard.
- Conducted an ORR to ensure service best practices.
- Created a fully automated service using LPT, managed fleets, and paperwork.
R
- Launched the service under MCM.
- There was zero impact to LM during the migration (due to the detailed metrics and tests).
- The ORR and automation served as a template for future service launches for the team.
Extra info:
- wrote a PRFAQ to convince leadership
- wrote a deserializer to safely transition from old parser + old rule format -> new parser + new rule format
- multiple unit tests to ensure the rules were exactly the same during the migration (see the sketch at the end of this list)
- automated release of service with pipeline + rollback alarms
- replaced MCM practice (manual inspection of metrics over ~2 weeks)
- regional dashboards along with alarms for automated and manual debugging
- set a template for regional dashboards
- managed SDE1 who helped create alarms for the service
- learned and bootstrapped our paperwork
- unlocking the team to move from substrate to prod
- unblocked myself by communicating with other teams about paperwork in substrate
- provisioned ToD workers for the LM team
- lpt managed
- quilt pipelines
- conducted an ORR for service pre/post launch (balancing the risk of doing some task post launch)
- successfully transferred the ownership of 'rules' to a sister team (8hr time difference)
  - email + meetings + ongoing support
- released under MCM
- set team standards for not having false alarms prior to launch
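A sketch of the parser-equivalence check mentioned above (hypothetical names and fixtures; the real rule types and parsers are internal):

    // Hypothetical types and names; illustrates the "both parsers must agree" check
    // that guarded the format migration.
    #[derive(Debug, PartialEq)]
    struct FingerprintRule {
        pattern: String,
        action: String,
    }

    // Stubs standing in for the real (internal) parsers.
    fn parse_old_format(_raw: &str) -> Vec<FingerprintRule> { unimplemented!() }
    fn parse_new_format(_raw: &str) -> Vec<FingerprintRule> { unimplemented!() }

    #[test]
    fn old_and_new_formats_produce_identical_rules() {
        // The same rule set, serialized in the old and new formats (hypothetical fixtures).
        let old_raw = include_str!("fixtures/rules_old.txt");
        let new_raw = include_str!("fixtures/rules_new.txt");
        assert_eq!(parse_old_format(old_raw), parse_new_format(new_raw));
    }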
disagree and commit: present the risks, push back on execution and technical ability. Support the decision and provide help going forward.
metrics
- disagree
- found an alternative contractor
- interviewed 2 contractors
- quantified the actual costs vs quoted costs from the contractor (asked for cost breakdown rather than lump sum)
- pointed out faults with architecture (lambda is ephemeral and not good for DB connections)
- being US-based was a positive for the contractor
- nudged them to find another US based contractor for comparison
- commit
- decision was based on business needs (far greater than future tech challenges)
- learned the value and a process of how to end a stalemate!
- reviewed future API spec and noted on the strengths of the contractor
- kept trust so that I could prove useful in helping them hire a full-time person to augment contractor
S
I was a technical advisor for SP. They had a business goal: choose a contractor to help them build a website and backend, and also conduct some user surveys.
T
I started the process by first identifying the overall architecture of the application. The second step was to find a second contractor and then interview both based on their technical expertise.
A
After speaking with both and reviewing the technical spec, I was of the opinion that the original contractor did not have the technical expertise; however, they did have the best business insight.
I recommended that there was a high risk of technical debt and that we seek out the second contractor.
In the end my client decided to go with the first contractor. Their primary concern was to reduce the number of parties involved, and since the first contractor was someone they were familiar with, they chose the more expensive and less technically stable option.
R
Once the decision was made I supported it. The primary goal was to move forward and do everything in our power to allow the first contractor to succeed.
Recently my gut feeling has proven correct, in that the contractor is not very flexible technically or process-wise, and is very expensive.
invent and simplify: come up with a solution and then simplify
earn trust: negotiate a solution with DS to further the relationship
frugality: reuse DynamoDB and Jenkins rather than inventing a new solution
metrics
- nightly job; system designed for time agnostic release
- kept 7 days of backups
- provide interface for testing
- reuse existing infrastructure (dynamo, jenkins, schema models)
- poll for new dataset every 5 min
S
- Data Science (DS) ran a nightly job to generate music recommendations for users.
- The dataset lived in DynamoDB, and the old DS workflow was to rewrite the same Dynamo table with a new dataset each night.
- This was a high-risk operation for my team (APIs). Additionally, it caused some outages due to accidental schema changes.
T
- I was in charge of creating an HA and resilient workflow.
- The ownership of the data would remain with DS.
- We simply wanted assurances that there were some sanity checks on newly published data.
- Treat data as immutable
- Enforce that DS publish the dataset to a new table each night.
- Maintain a backlog of 7 datasets into the past.
- This gives the benefit of 'rollback' if a new dataset is broken
- Given multiple datasets (a, b, c), DS can 'point' to the latest dataset
- DS can maintain a few versions of the data. "Version Table"
- DS can run tests on new datasets to confirm schema compatibility
- rollbacks are as easy as pointing to an older known dataset
- Maintain a log of actions (publish new dataset, test pass/fail, point to new dataset)
A
- Worked closely with DS to come up with the system design.
- Added a Jenkins test which could be triggered by DS
  - used to verify that the data schema would not be a breaking change
  - extensible model for other types of tests
  - potential to track failing/passing metrics
- Poll-based mechanism within the API code to look at the "Version Table" and start using the latest dataset (sketched below).
  - A poll interval of 5 min was used since 'older' data would still produce good enough recommendations.
  - Log when a new dataset is detected (helps correlate errors with a new dataset).
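A minimal sketch of the poll-and-swap idea (abstracted away from the actual DynamoDB calls; names are illustrative):

    use std::time::Duration;

    // Abstraction over the "Version Table" lookup (DynamoDB in practice).
    trait VersionStore {
        // Name of the dataset table the "latest" pointer currently refers to.
        fn latest_dataset_table(&self) -> String;
    }

    struct RecsApi {
        active_table: String,
    }

    impl RecsApi {
        // Runs every POLL_INTERVAL; swapping is just updating which table we read from.
        fn poll_for_new_dataset(&mut self, store: &dyn VersionStore) {
            let latest = store.latest_dataset_table();
            if latest != self.active_table {
                // log the switch so errors can be correlated with a new dataset
                println!("switching dataset: {} -> {}", self.active_table, latest);
                self.active_table = latest;
            }
        }
    }

    // ~5 minutes was good enough, since slightly older data still produced good recommendations.
    const POLL_INTERVAL: Duration = Duration::from_secs(5 * 60);

Rollback is then just DS pointing the version table back at an older dataset table.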
R
- We now had a system that could be used to audit the creation of new datasets.
- Outages due to breaking schemas were eliminated.
right a lot: pinpoint the differences between stg and prod
bias for action: execute a non-invasive solution which could go out quickly while still preventing issues
metrics
- errors occurred at 1-2 week increments
- random 500 errors for a particular API only
- different behavior across stg and prod envs
- 5 max connections
- fixed the issue for 2 additional microservices
S
An API that I wrote and owned would sporadically start returning 500 errors. Additionally, when I inspected the stg vs prod environments, the results would differ and not always align, and the errors happened at seemingly random times.
Restarting the container would, however, fix the issue.
T
After 2-3 more occurrences, it became apparent that there was a deeper issue. In response I decided to add logging to the service around the relevant code.
At the next occurrence we noticed from the logs that the IP of the service did not match the one being requested.
A
I then took a look at the mechanism that was doing service discovery and added additional logs. This showed that not all of the kube watch events were working correctly.
A further dive revealed that this condition was hit at the suspiciously convenient number of 5 connections. Digging into the HTTP library used by the service discovery library (pandorum), I discovered that OkHttp has a default limit of 5 persistent connections.
R
I added a config change to increase the max connection limit. Additionally, I added a check at startup to confirm that the limit matched the number of services we were trying to connect to so as to prevent future failures.
highest standards: rather than fixing the symptom at hand, fix the core issue
earn trust: outages mean lower oncall morale. Rolling out the changes in percentage increments meant gaining the trust of the team
customer obsession: outages are also bad for customers
metrics
- approx 8 blacklist rules
- added approx 15 whitelist rules
- rolled out in increments of 10%-20%
- rolled out over 3 weeks
- maintained a hit ratio of approx 89% over the rollout
S
Realized that our recs were being incorrectly cached. The cause was the historic configuration of specifying blacklisted paths.
T
The initial fix for the recs service was an easy config change. However, to avoid such a mistake in the future, I proposed that we change from a blacklist to a whitelist.
A
Initially this seemed like a very risky maneuver, especially when changing it for live production traffic, so I took a few precautions to reduce the risk.
Used the randomint() VCL function to distribute the traffic.
I decided on a percentage-based rollout and progressed through it over 3 weeks.
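The actual change was done in VCL using randomint(); the gist of the percentage gate, sketched in Rust here for consistency with the other examples in these notes (names illustrative):

    use rand::Rng;

    // Send `rollout_pct` percent of requests through the new whitelist config and
    // the rest through the old blacklist config. Raised in 10-20% steps over ~3 weeks.
    fn use_new_cache_config(rollout_pct: u32, rng: &mut impl Rng) -> bool {
        rng.gen_range(0..100) < rollout_pct
    }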
R
At each 10% increase I gained additional confidence in the new configs: the cache hit ratio held at ~89% and the traffic pattern remained the same. Additionally, by communicating with the rest of my team and staying aware of other company events, I avoided the rollout being mistakenly suspected for unrelated failures.
The result was a clean migration, a more resilient system, and probably higher morale since there were no outages.
learn and be curious: learn how to do proc macros. read other source code
metrics
- added 2 methods
- selectively replaced old code with new features
- fixed broken functionality with tokio-postgres
- reduced code update locations from 3 to 1
- hit 0 runtime errors due to mismatched fields thereafter
- technique allowed for slow adoption rather than breaking code
S
The crate was a side project written by someone to provide simple deserialization capabilities for Postgres in Rust. It addressed an important need but did not do any field-name checks at compile time.
T
Add additional methods to reduce runtime errors.
A
Added simple methods that read annotations and provided two methods, get_sql_table and get_fields. This was something the compiler could do reliably without erroring (see the sketch below).
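A sketch of the shape of the addition (hand-written here; the crate generated these methods from annotations), showing why field mismatches can surface at compile time instead of at runtime:

    // Illustrative trait; the crate derived these methods from struct annotations.
    trait SqlTable {
        fn get_sql_table() -> &'static str;
        fn get_fields() -> &'static [&'static str];
    }

    struct User {
        id: i64,
        email: String,
    }

    // What the annotation-driven code generation would produce for `User`.
    impl SqlTable for User {
        fn get_sql_table() -> &'static str { "users" }
        fn get_fields() -> &'static [&'static str] { &["id", "email"] }
    }

    // Queries built from the generated methods stay in sync with the struct
    // definition, so a renamed field shows up at compile time rather than as a
    // runtime deserialization error.
    fn select_all<T: SqlTable>() -> String {
        format!("SELECT {} FROM {}", T::get_fields().join(", "), T::get_sql_table())
    }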
R
As a result, the runtime errors due to table migrations went down to zero.
bias for action: took action with limited info, while evaluating risk.
metrics
- 10K messages queued
- 10pm page with no response from the rest of the team
- silenced alarms in 2 hr increments
- after the 3rd occurrence and a 2am alarm, made the decision to terminate the job
S
I was on call and got a page at approx 10pm. A RabbitMQ queue had backlogged and was alerting. The first hurdle was that there was no login/password information for accessing the queue to see what was going on.
T
I tried to reach the secondary in the hopes of gathering more information, but to no avail. I tried the tertiary (boss) and also the primary on the ingestion team but received no answer.
My task at this point became deciding how to proceed.
A
I was able to get login information from the SRE oncall and inspect the queue. At this point I saw that a message was causing a backlog, so I cleared it manually.
Once the queue started draining I silenced the specific alarm and went about the night. However, the error happened again and I noticed that another message was now causing the backlog.
It became apparent that manually skipping the error message was not a solution. There were approx 10k messages queued, which was the limit of the queue, and I suspected that if the backlog continued it could fill up the drives, causing more damage.
R
It was also apparent that the batch job was not correct, in the sense that it was unable to handle all message types, which was resulting in the backlog. I therefore decided to cancel the batch job and drain the queue.
Since it was a batch job and it was being developed (not customer facing), there was little harm in stopping the alert and job, which could be looked at in the morning.
customer obsession: subscription is the core of customer experience
dive deep: look at neighbouring code for amazon, google and apple
metrics
- solved issue across 2 different services (amazon and heroku)
- added a test to cover the edge case
- traced and rebated a few thousand customers affected
S
The subscription micro-service was responsible for determining subscription status for users. A CS report indicated that a customer was charged for a subscription but was not seeing premium features.
T
Verify that the customer was actually experiencing an error. Figure out a fix to the issue.
A
First I verified that the error was actually happening. This was done using a combination of two internal endpoints: subscription status and paid status of the user. I also verified that the subscription had not expired.
The error was occurring for an Amazon user.
I explored the other subscription services, since the code should be similar, including Amazon, Apple, Google...
Oddly enough, the code was slightly different for Amazon and Heroku users compared to the Google subscription code. Looking at the git history, one could see that the two had been added later.
R
I was able to fix the subscription code and fix the user experience for future users. The fix applied not only to Amazon but also to Heroku users.
A follow-up task was to track other affected users and provide them with rebates/extended experience.
disagree and commit: point out the lack of resources. end up at a compromise
develop the best: involve and promote the design lead as the project leader
metrics
- attrition of 2 product managers, 1 iOS dev, 1 Android dev
- rollout/feedback taking up to 2 weeks
- preliminary work being delayed for over a quarter
- raised the issue with 1 product manager, my manager, 2 designers, 1 coworker
S
The company had experimented with a new team structure "vertical team". The goal of the team would be to focus on improving a KPI.
However, due to some employee churn and other org changes, the squads had started to lose direction.
The recs team did not have enough resources but was expected to produce results. This worked alright for previously started projects; however, new projects were being proposed.
T
I had a strong hunch that without more product involvement and dev resources the squad would not be able to succeed in its mission.
A
I took it upon myself to speak with the different stakeholders involved: the design team, a product manager I had a good relationship with, a coworker, and my manager.
After speaking with these individuals, I was able to propose a meeting where we could come together and discuss risks and goals. We also took this time to re-evaluate the goals under the current circumstances.
R
I was able to convince my manager and the org to reach a middle compromise. There would be a shift in roles: the designer, who had the best idea about the product direction, would assume a temporary product manager role. Additionally, we would get an iOS dev for 2 sprints to consume and execute the feature.
With a reduction of scope and a more focused involvement of those on the squad, I was able to shift my focus to other work that needed to be done while wrapping up and supporting previous features.
think big: improve reliability and reduce work exponentially
invent and simplify: utilize helm chart and then customize for internal tools
metrics
- able to upgrade Jenkins from 1.x to 2.x
- able to upgrade all plugins with confidence
- able to address security warnings in Jenkins
- launched 3 slaves and restricted jobs to specific slaves
- later, additional slaves were added for Rust
- time to launch a new Jenkins server was < 2 hrs
- added detailed documentation so others would be able to contribute
S
Jenkins was the test automation server we used. However, it was deployed on a VM without any way to recover, upgrade or replicate the instance and data.
This resulted in tech debt and a fear of doing anything new with the instance.
T
I took it upon myself to create a kube-deployed instance which could be replaced and therefore upgraded.
A
I created a declarative instance of the server based on an existing Jenkins Helm recipe. I then tweaked it to have custom values and secrets.
The secrets were applied via an API call to the Jenkins server (decrypted via KMS).
R
We were able to migrate to the latest version of Jenkins.
right a lot: choosing Java and solving part of the process was the correct small step to allow for adoption
deliver results: delivered a project despite it not being the original all-encompassing solution
metrics
- reduced manual testing from 10-30 minutes to seconds
- removed human error during testing
- created an extensible framework, initially targeting 2 types of tests
- agnostic testing framework for both android and ios
S
This was during a hack week project and the goal of our 3 person team was to help automate QA-testing done for apps at the company.
The current way of testing was to open the app in a simulator and use Charles Proxy to collect HTTP traffic logs. The logs would then be parsed manually.
This could take entire days and sometimes resulted in incomplete testing if an urgent release was scheduled. It also meant long hours and sometimes weekend work for the QA team.
The team was composed of a non-technical QA member, an iOS dev and myself (backend).
T
The initial goal was to use a UI testing framework for iOS and thereby automate the collection and then verification of the tests. We spent 1.5 days trying to get the UI framework to work; however, it was unpredictable on different computers and eventually just didn't work.
A
Realizing that we were at risk of not having any product I decided to take a step back and see if we could produce something useful.
I took half the day to create a simple POC that tackled the testing portion and not the data gathering.
This was a much simpler and more predictable problem to solve. The log data could be ingested as JSON and a series of tests could be run on it. Java was chosen since it was a language most people would be comfortable with, it was typed, and the non-technical QA members could augment it.
We created a somewhat abstract testing framework. It had 3 broad scopes: ingest data from a file, filter data based on the test to be run (user id, header), and verify data based on custom test rules.
At some point I devoted the majority of my time to training and guiding the QA member on understanding and augmenting the codebase. Transferring ownership to the QA team was an important goal for me if the project was to succeed.
R
We were also able to tackle time-sensitive and manually arduous tests (token refresh every 5 min). By the end of the week we were able to execute 6 different tests and implement 2 different testing scenarios (find a value, stateful testing).
Needless to say the QA team was very happy and with the testing abstraction in place they were then able to implement more tests themselves.
metrics
- goods:
- used the helm template command to avoid creating a new tool; would be able to slowly template and move over existing configs
- POC worked and was able to represent the current config
- bads:
- would be 1 additional tool to install
- helm template was no longer supported
- Go templating had complicated syntax
- failures:
- project got hijacked from under me (loss of trust)
- should have consulted seniors more (to get buy-in)
- should have demonstrated a small example rather than going for the whole cake
- should have asked for more feedback from the team
- should have demonstrated the stability of helm template
S
Instead of updating configs across multiple envs and regions (qa, stg, us, au, nz), I wanted to create a template that would allow us to update a value in a single file. The remaining content could then be defined once.
T
A
R
Once the implementation was done and the team clearly supported the other implementation, I made it a point to verbally commend it in a meeting and show support.
metrics
- took 7 months; the actual work after mentorship took 4 months to release
- build times were up to 2 hrs; incremental builds were 10-30 min
- added 48 different test scenarios
- working on the Rust codebase was an exponential step forward for me
- 1,043,243 lines of code total in the project
- touched up to 45 files
- added 4 additional fixes post PR
S
I wanted to get involved in the OSS community, learn and give back. To force myself to do this I got involved during the impl days and claimed a feature that offered mentorship.
T
Understand the codebase. Understand the context around the feature. Understand the feature. Learn to work with the codebase. Commit a PR to implement the feature.
A
Learned to build and work in a large codebase. It took 7 months, but from the moment I asked to work on the feature I knew that I had to finish it.
R
Added the feature to infer predicates and thus made a small contribution to the ergonomics of the language. Added docs for the feature.
ownership: the api is owned by the team and not only the responsibility of an individual.
customer obsession: prod outage means users are being affected
metrics
- less than 1/2 hr outage
- restarted 3-4 failing services
- restarted up to 15 pods across all services that were in a bad state
S
Towards the end of the work day a systemic outage started to happen. The oncall member was in transit and not available to handle the issue.
Later we found out that the outage was due to a combination of an AWS upgrade event and the Weave CNI not being able to maintain the network mesh.
T
Step up and represent the API team in fixing the outage.
A
Restarted services that were showing errors in Prometheus. Also tracked each service individually to ensure that there were no lingering bad pods.
R
Outage lasted less than 1/2 hour. Monitored the state of the system for a total of 1 hour.
customer obsession: the customer was my client... he would need to read and develop the code afterwards
highest standards: rather than hacking a solution together, there was clearly a better but more time-consuming way to write the code
metrics:
S
T
A
R
ownership: take charge and take on responsibility
develop the best: promote people to get involved and grow replacement
metrics
- started with 1 old and 2 new co-organizers
- the old organizer dropped out; found 1 additional co-organizer
- hosted at 5 different companies
- found 1 repeat sponsor
- found and organized approx 36 speakers
- gathered approx 10 core repeating members
- gave 2 talks and led 1 learning session
S
Others bailed when it came time to actually put in the time and organize the meetup. The tasks were not very rewarding and involved coordinating with companies and speakers and trying to get sponsors.
T
I decided to take charge and try and build a sustainable community.
A
I created a list of companies. I spoke to attendees and convinced a few to speak or otherwise host. I invited everyone to speak about their side projects. I organized an un-meetup and volunteered to teach the beginner session.
R
I was able to organize a meetup each month for approx 1.5 years before transitioning it to a co-organizer. We averaged 30 attendees per meetup, with upwards of 60 at a few events. We had a home venue where we could meet each month and a sponsor who provided food and drinks. There were approx 10 consistent members who showed up very often and helped carry it forward.