https://github.com/aws/s2n-netbench/tree/main/netbench-orchestrator
https://github.com/aws/s2n-quic/pull/1986
S
- Add a mitigation for the optimistic ACK attack.
- The RFC was clear about how to do this (skip packet numbers), but vague because it didn't specify how many packets to skip or how often.
T
- Come up with a strategy for skipping packets and mitigating the attack.
- Implement the mitigation in s2n-quic.
A
- Audited other QUIC implementations and conducted analysis to answer two key questions:
- How many packets to skip?
- How often should packets be skipped?
- There were 2 other implementations using a "static" approach (skipping did not evolve with cwnd). Their strategy was to:
  - Track a single skipped pn, overwriting the value when a new pn needs to be skipped.
  - Skip a random pn in some static range.
- The "static" approach raised network-dependent questions (DC vs public wifi):
  - Does overwriting the skip pn nullify the mitigation?
  - Is a static skip range effective for all networks?
  - A cwnd is very different in a DC vs the public internet.
- Considered the option of skipping multiple packets:
  - Pro: would allow us to skip more frequently.
  - Con: requires storing multiple skip pns. How many should we store?
- Analyzed the purpose of the mitigation to come up with an optimal solution.
  - The goal of the mitigation is to prevent cwnd bloat,
  - which an attacker can cause by acking packets before actually receiving them.
  - So the skip range should be based on the number of packets that can be received within one cwnd.
- Solution: evolve the skip packet range based on cwnd and only store 1 skip pn.
  - Calculate the range based on the packets we expect to send in a single period:
pkt_per_cwnd = cwnd / mtu
rand = pkt_per_cwnd/2..pkt_per_cwnd*2
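A minimal sketch of that calculation (illustrative names, not the exact s2n-quic code), assuming cwnd and mtu are in bytes:

    use rand::Rng;

    // Illustrative sketch of the evolving skip range, not the actual s2n-quic code.
    // `cwnd` and `mtu` are assumed to be in bytes.
    fn next_skip_offset(cwnd: u64, mtu: u64, rng: &mut impl Rng) -> u64 {
        // packets we expect to send in a single period (one congestion window)
        let pkts_per_cwnd = (cwnd / mtu).max(1);
        // pick the next packet number to skip somewhere between half and twice that
        // count, so the skip range grows and shrinks with the congestion window
        rng.gen_range(pkts_per_cwnd / 2..=pkts_per_cwnd * 2)
    }

Because the range tracks cwnd, the same logic works in a DC and on the public internet without tuning a static constant.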
R
- Successfully implemented the mitigation.
- Only had to store 1 skip pn.
- By evolving the skip range based on cwnd, s2n-quic would scale to all networks.
summary
- tracking issue: https://github.com/aws/s2n-quic/issues/1276
- CPU increase: https://github.com/aws/s2n-quic/pull/1298
- revert PR with some analysis: https://github.com/aws/s2n-quic/pull/1368
- measurements of batching multiple packets: https://user-images.githubusercontent.com/4350690/174196918-6af428e4-9ab7-4458-b3b9-e27ed89c3318.png
metrics
S
- Explore the ACK delay RFC and see if s2n-quic could benefit from implementing it.
Pros of delaying acks:
- sending/receiving acks is CPU expensive
  - on the send-heavy server it takes 24.33% of CPU
- asymmetric links like satellite benefit from fewer acks
Cons of delaying acks:
- progress: delayed acknowledgments can delay data being sent by the peer
  - the RFC lets you adjust thresholds
- ECN: congestion signal from the network
  - ack packets with ECN markings immediately
- loss recovery: detecting that a packet was not received and retransmitting it
- CC: regular acks help establish good RTT estimates and drive congestion control
- BBR: requires a high-precision signal to work, so it was unclear how delaying acks would affect it
RFC features:
- negotiating the extension: min_ack_delay, the minimum amount of time that the endpoint sending this value is willing to delay an acknowledgment
- ACK_FREQUENCY frame:
  - Ack-Eliciting Threshold: max ack-eliciting packets received before sending an ack
  - Requested Max Ack Delay: max time to wait before sending an ack
  - Reordering Threshold: max number of reordered packets before sending an ack
- IMMEDIATE_ACK frame: asks the peer to send an immediate ack
- Expedite ECN: ECN-marked packets should not be delayed
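For illustration, the frame fields listed above written as plain Rust types (paraphrasing the draft; these are not s2n-quic's actual definitions, and field widths are illustrative):

    // Illustrative only; field names paraphrase the ack-frequency draft.
    struct AckFrequencyFrame {
        ack_eliciting_threshold: u64,    // max ack-eliciting packets received before sending an ack
        requested_max_ack_delay_us: u64, // max time (microseconds) to wait before sending an ack
        reordering_threshold: u64,       // max number of reordered packets before sending an ack
    }

    // IMMEDIATE_ACK carries no fields: it just asks the peer to ack right away.
    struct ImmediateAckFrame;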
T
Choose an implementation.
Options:
- single-round batching (10-packet GSO payload)
  - easy to implement (1-2 sprints)
  - this solution creates building blocks for the others after it
- multi-round batching (multi-GSO payload) (2-4 sprints)
  - medium difficulty to implement
  - requires tuning with the application, production use case, and traffic pattern
- implement the ACK delay RFC (4-8 sprints)
  - difficult to implement
  - requires tuning with the application, production use case, and traffic pattern
  - requires negotiating the extension
A
Impl: Batch ACK Processing (single round)
- single-round batching (10-packet GSO payload)
- connection can signal interest in processing acks
- store pending acks on the connection
- refactor congestion control (CC) and loss recovery (LR) to accept signals from batched ack processing (timestamps)
- swap from processing acks one at a time to batched processing (sketched below)
- emit metrics
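A rough sketch of the single-round batching shape (illustrative types and names, not the actual s2n-quic internals):

    // Illustrative sketch of single-round ACK batching.
    struct PendingAck {
        largest_acked: u64,
        timestamp_us: u64, // receive timestamp forwarded to loss recovery / CC
    }

    #[derive(Default)]
    struct ConnectionAckState {
        pending: Vec<PendingAck>,
    }

    impl ConnectionAckState {
        // Per received packet: record the ack and signal interest, defer the work.
        fn on_ack_frame(&mut self, ack: PendingAck) {
            self.pending.push(ack);
        }

        // Once per receive round (e.g. after a 10-packet GSO payload): drain the
        // batch and drive loss recovery / congestion control in a single pass.
        fn process_batch(&mut self) {
            for ack in self.pending.drain(..) {
                // hand `ack.largest_acked` / `ack.timestamp_us` to LR and CC here
                let _ = ack;
            }
        }
    }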
R
- Flamegraph result: 24.33% -> 25.81% https://github.com/aws/s2n-quic/pull/1298
- Batching different amounts of packets (2, 10, 40, 80): https://user-images.githubusercontent.com/4350690/174196918-6af428e4-9ab7-4458-b3b9-e27ed89c3318.png
lessons learned:
- we had to be cautious about delaying acks because we were operating on the public internet
- I wonder if an environment like a DC would be better suited for delaying acks
- acks are a signal within the noise of loss, delay, congestion, etc.
- within a DC there is less noise, so it makes sense that we could get away with less signal
metrics
- 4000 tests
- ~1500 scoped-down TLS 1.2 tests
- ~33 "default" policy tests
S
- Include TLS 1.3 support by default.
- A good modern default: TLS 1.3 is provably secure and uses modern ciphers.
T
- Enabling this can be risky because customers will see a change in behavior.
- It would also affect test coverage for TLS 1.2:
  - tests written against the TLS 1.2 policy can pass when switched to TLS 1.3, because they were written before TLS 1.3 support existed and might not assert the negotiated protocol
  - some tests intentionally exercise the "default" policy
  - some tests intentionally exercise the TLS 1.2 policy
  - it is not possible to tell the difference between these two kinds of tests
A
- assess the different options
- balance the security risk against the risk of not making progress
- lay out the options and make a case for one
R
- manually audit 4000 tests
- scope bad TLS 1.2 tests to certain files
- pin all tests that are using the default to a numbered policy
  - test regression for "default" policy tests
- don't pin tests
  - test regression for TLS 1.2 tests
- run all tests twice (different platform, libcrypto, fuzz, valgrind)
- run a single TLS 1.2 test and accept a minimal risk of TLS 1.2 test regression
https://github.com/aws/s2n-quic/issues/1137
S
Currently s2n-quic performs certificate lookup/loading operations synchronously.
This is not ideal for applications which serve multiple domains concurrently and need to load multiple certificates, since the lookup would block the thread.
T
Allow certificate lookup/loading operations to be performed asynchronously and enable non-blocking behavior.
A
- s2n-quic:
  - pass the connection waker down to the TLS libraries so that they can wake the connection on progress
- s2n-tls:
  - The work involved converting the invoked-only-once callback model to a poll-the-callback model in s2n-tls.
  - s2n-tls by default did not allow callbacks to be polled.
  - s2n-tls previously called the callback exactly once, which does not match the Rust model and has quite a few drawbacks:
    - Invoking the callback only once means the application/s2n-quic has to schedule the completion of the callback itself (possibly on a separate thread).
    - It also has to manage the state associated with the callback separately.
  - The Rust polling model allows all state associated with the future to live within the object being polled (sketched below).
  - Additionally, the future can make progress as part of the runtime that s2n-quic already starts with.
- s2n-tls bindings:
  - glue the new callback polling behavior in an extensible way for other callbacks.
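A minimal sketch of the poll-the-callback shape (illustrative trait and types, not the actual s2n-tls bindings API):

    use core::task::{Context, Poll};

    // Placeholder for whatever the async lookup resolves to.
    struct ResolvedCert;

    trait AsyncCertCallback {
        // Polled repeatedly (instead of being invoked exactly once) until the lookup
        // completes. All lookup state lives inside `self`, and the waker in `cx` is
        // the connection waker passed down from s2n-quic, so progress wakes the
        // connection on its existing runtime.
        fn poll_cert(&mut self, cx: &mut Context<'_>) -> Poll<ResolvedCert>;
    }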
R
- set up the framework for future async callbacks in s2n-tls
S
The previous implementation closed the connection by default if the bytes transferred dropped below some user-specified amount.
T
While a simple and effective implementation, this seemed like a sharp edge and made me uneasy, because:
- the user-specified value could become stale
- the default action of closing the connection could be an availability risk
- in a worst-case scenario this could lead to a stampede of all connections closing
A
R
summary
metrics
- conduct initial review of doc written by SDE1
- provide feedback on structure and clarity of doc
S
T
A
R
Project plan: https://quip-amazon.com/Wa4iADaP4cqI/Intern-Project-LM-Detective
S
- The Dispatcher is a critical component in LM but our team lacked metrics.
- The intern project "LM Detective" would give us:
- better understanding of Dispatcher Service and ongoing migrations
- allow LM to take a data driven approach to improving the Dispatcher service and scheduling
T
- Draft the intern project plan, omitting some information to probe for understanding.
- Help onboard the intern to AWS and LM-specific technology.
- Educate the intern about LM technology and help guide them to a successful delivery.
A
- Met with the intern multiple times a week to ensure progress.
- Guided the intern to set milestones and define a project plan.
- Helped define the stretch goals and helped prepare their final presentation.
R
- The intern successfully completed the project and received a return offer.
- LM could use the project to answer questions such as:
- What is the state of the dispatcher given a specific migration id?
- For all migrations in pending/executing state, what are the src droplets and tgt droplets?
- Why is a particular migration in status pending?
- The intern also completed the stretch goal for the project
- Create a UI (graph viz) to visualize the data
Project plan: https://quip-amazon.com/NtptAZ55eOpb/LM-in-PDT-Work
S
- LM was tasked to make the service available in PDT.
- PDT should be the last region to launch, which meant LM also had to be available in all other regions.
T
- Build the entire LM stack in PDT to meet gov compliance. (canaries, dispatcher, alarms, metrics)
- Created a project plan to identify all components and missing regions.
- Automate the infrastructure to make builds reproducible.
A
- Cleaned up our LPT rules to make service launches reproducible, extensible and maintainable.
- Synthesized new alarms, cleaning up/creating dashboards along the way.
R
- Full LM availability in all regions, including PDT (met gov compliance).
- A clean, normalized LPT stack for all services owned by LM.
metrics
- helped surface questions and drive conversations for support in restricted regions
- created and managed a doc to track work for the PDT region build
- debugged various issues as they came up during the rebuild
metrics
- while fixing the workflow in PDT, I took the initiative to simplify metrics for all regions
- removed 2 metrics packages
- replaced packaged code with ~100 lines of code that capture the same metrics in a less convoluted manner
- dug into the old profiler library codebase to understand the implementation
  - verified metrics in igraph against what the code was posting
S
T
A
R
bias for action: post-LSE, I took the initiative to dig into customer impact. Created a doc to track impact, pulled in senior engineers for advice, and drafted a remediation script. Helped discover another race-condition bug which was customer impacting.
metrics
- remediated ~40 customer impacting instances
S
T
A
R
metrics
- conducted initial deep research into ICEing and its various causes
  - categorized data into actionable and non-actionable
- started conversations with other teams to help gather more metrics
S
T
A
R
metrics
- auto-deploy rules
- integrate with the workflow
- write a requirements doc for deprecating the old andon
- change matching rules from regex to string (eliminate errors)
- created a pipeline for auto-releasing the andon rules
- added alarms to catch if rules were not being read from the LM workflow
- added validation logic to better catch malformed rules during code review
S
T
A
R
- invent and simplify: reuse the hibernate codebase, especially the tests and validation. Also simplify the logic to make it more adaptable.
- bias for action: researched the other team's use case; simplified and structured the codebase to allow for extensibility. Drove consensus when technical issues (Java type generics) or framework choices made the right choice non-obvious.
FAQ: https://quip-amazon.com/3ujuAnetkBr1 Dashboard: https://w.amazon.com/bin/view/EC2/LiveMigration/Dashboards/DropletFingerprintService/
S
- KaOS owns fingerprint rules.
- KaOS deletes and then creates a DynamoDB table with updated rules. This causes elevated ICEing.
- LM consumed these rules.
T
- Create a single highly available service to serve fingerprint rules.
- De-risk KaOS rolling out new fingerprint rules.
- Take the opportunity to change the format of the fingerprint rules. This was a high-risk item since a change in format made it difficult to detect mistakes in the parsing logic.
- Emit metrics for fingerprint matching rules to track the flow of migrations.
A
- Wrote a FAQ to convince leadership: https://quip-amazon.com/3ujuAnetkBr1
- Coordinated with the KaOS team and launched a multi-region LM service.
- Wrote extensive tests to ensure the fingerprint format change produced equivalent rules.
- Mentored a junior teammate (karthiga) on creating a detailed service health dashboard.
- Conducted an ORR to ensure service best practices.
- Created a fully automated service using LPT, managed fleets, and paperwork.
R
- Launched the service under MCM.
- There was zero impact to LM during the migration (due to the detailed metrics and tests).
- The ORR and automation served as a template for future service launches for the team.
Extra info:
- wrote a PRFAQ to convince leadership
- wrote a deserializer to safely transition from old parser + old rule format -> new parser + new rule format
- multiple unit tests to ensure the rules were exactly the same during the migration (see the sketch at the end of this list)
- automated release of service with pipeline + rollback alarms
- replaced MCM practice (manual inspection of metrics over ~2 weeks)
- regional dashboards along with alarms for automated and manual debugging
- set a template for regional dashboards
- managed SDE1 who helped create alarms for the service
- learned and bootstrapped our paperwork
- unlocking the team to move from substrate to prod
- unblocked myself by communicating with other teams about paperwork in substrate
- provisioned ToD workers for the LM team
- lpt managed
- quilt pipelines
- conducted an ORR for service pre/post launch (balancing the risk of doing some task post launch)
- successfully transferred the ownership of 'rules' to a sister team (8hr time difference)
  - email + meetings + ongoing support
- released under MCM
- set team standards for not having false alarms prior to launch
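A sketch of the parser-equivalence check mentioned above (hypothetical names and fixtures; the real rule types and parsers are internal):

    // Hypothetical types and names; illustrates the "both parsers must agree" check
    // that guarded the format migration.
    #[derive(Debug, PartialEq)]
    struct FingerprintRule {
        pattern: String,
        action: String,
    }

    // Stubs standing in for the real (internal) parsers.
    fn parse_old_format(_raw: &str) -> Vec<FingerprintRule> { unimplemented!() }
    fn parse_new_format(_raw: &str) -> Vec<FingerprintRule> { unimplemented!() }

    #[test]
    fn old_and_new_formats_produce_identical_rules() {
        // The same rule set, serialized in the old and new formats (hypothetical fixtures).
        let old_raw = include_str!("fixtures/rules_old.txt");
        let new_raw = include_str!("fixtures/rules_new.txt");
        assert_eq!(parse_old_format(old_raw), parse_new_format(new_raw));
    }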
disagree and commit: present the risks, push back on execution and technical ability. Support the decision and provide help going forward.
metrics
- disagree
- found an alternative contractor
- interviewed 2 contractors
- quantified the actual costs vs quoted costs from the contractor (asked for cost breakdown rather than lump sum)
- pointed out faults with architecture (lambda is ephemeral and not good for DB connections)
- being US-based was a positive for the contractor
- nudged them to find another US based contractor for comparison
- commit
- decision was based on business needs (far greater than future tech challenges)
- learned the value and a process of how to end a stalemate!
- reviewed future API spec and noted on the strengths of the contractor
- kept trust so that I could prove useful in helping them hire a full-time person to augment contractor
S
I was a technical advisor for SP. They had a business goal: choose a contractor to help them build a website and backend, and also conduct some user surveys.
T
I started the process by first identifying the overall architecture of the application. The second step was to find a second contractor and then interview both based on their technical expertise.
A
After speaking with both and reviewing the technical spec, I was of the opinion that the original contractor did not have the technical expertise; however, they did have the best business insight.
I recommended that there was a high risk of technical debt and that we seek out the second contractor.
In the end my client decided to go with the first contractor. Their primary concern was to reduce the number of parties involved, and since the first contractor was someone they were familiar with, they chose the more expensive and less technically stable option.
R
Once the decision was made I supported it. The primary goal was to move forward and do everything in our power to allow the first contractor to succeed.
Recently my gut feeling has proven correct, in that the contractor is not very flexible technically or process-wise, and is very expensive.
invent and simplify: come up with a solution and then simplify
earn trust: negotiate a solution with DS to further the relationship
frugality: reuse DynamoDB and Jenkins rather than inventing a new solution
metrics
- nightly job; system designed for time agnostic release
- kept 7 days of backups
- provide interface for testing
- reuse existing infrastructure (dynamo, jenkins, schema models)
- poll for new dataset every 5 min
S
- Data Science (DS) ran a nightly job to generate music recommendations for users.
- The dataset lived in DynamoDB, and the old DS workflow was to rewrite the same Dynamo table with a new dataset each night.
- This was a high-risk operation for my team (APIs). Additionally, it caused some outages due to accidental schema changes.
T
- I was in charge of creating an HA and resilient workflow.
- The ownership of the data would remain with DS.
- We simply wanted assurances that there were some sanity checks on newly published data.
- Treat data as immutable
- Enforce that DS publish the dataset to a new table each night.
- Maintain a backlog of 7 datasets into the past.
- This gives the benefit of 'rollback' if a new dataset is broken
- Given multiple datasets (a, b, c), DS can 'point' to the latest dataset
- DS can maintain a few versions of the data. "Version Table"
- DS can run tests on new datasets to confirm schema compatibility
- rollbacks are as easy as pointing to an older known dataset
- Maintain a log of actions (publish new dataset, test pass/fail, point to new dataset)
A
- Worked closely with DS to come up with the system design.
- Added a Jenkins test which could be triggered by DS
  - used to verify that the data schema would not be a breaking change
  - extensible model for other types of tests
  - potential to track failing/passing metrics
- Poll-based mechanism within the API code to look at the "Version Table" and start using the latest dataset (sketched below).
  - A poll interval of 5 min was used since 'older' data would still produce good enough recommendations.
  - Log when a new dataset is detected (helps correlate errors with a new dataset).
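A minimal sketch of the poll-and-swap idea (abstracted away from the actual DynamoDB calls; names are illustrative):

    use std::time::Duration;

    // Abstraction over the "Version Table" lookup (DynamoDB in practice).
    trait VersionStore {
        // Name of the dataset table the "latest" pointer currently refers to.
        fn latest_dataset_table(&self) -> String;
    }

    struct RecsApi {
        active_table: String,
    }

    impl RecsApi {
        // Runs every POLL_INTERVAL; swapping is just updating which table we read from.
        fn poll_for_new_dataset(&mut self, store: &dyn VersionStore) {
            let latest = store.latest_dataset_table();
            if latest != self.active_table {
                // log the switch so errors can be correlated with a new dataset
                println!("switching dataset: {} -> {}", self.active_table, latest);
                self.active_table = latest;
            }
        }
    }

    // ~5 minutes was good enough, since slightly older data still produced good recommendations.
    const POLL_INTERVAL: Duration = Duration::from_secs(5 * 60);

Rollback is then just DS pointing the version table back at an older dataset table.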
R
- We now had a system that could be used to audit the creation of new datasets.
- Outages due to breaking schemas were eliminated.
right a lot: pinpoint the differences between stg and prod
bias for action: execute a non-invasive solution which could go out quickly while still preventing issues
metrics
- errors occurred at 1-2 week increments
- random 500 errors for a particular API only
- different behavior across stg and prod envs
- 5 max connections
- fixed the issue for 2 additional microservices
S
An API that I wrote and owned would sporadically start returning 500 errors. Additionally, when I inspected the stg vs prod environments, the results would differ and not always align, and the errors happened at seemingly random times.
Restarting the container would, however, fix the issue.
T
After 2-3 more occurrences, it became apparent that there was a deeper issue. In response I decided to add logging to the service around the relevant code.
At the next occurrence we noticed from the logs that the IP of the service did not match the one being requested.
A
I then took a look at the mechanism that was doing service discovery and added additional logs. This showed that not all of the kube watch events were working correctly.
A further dive revealed that this condition was hit at the suspiciously convenient number of 5 connections. Digging into the HTTP library used by the service discovery library (pandorum), I discovered that OkHttp has a default limit of 5 persistent connections.
R
I added a config change to increase the max connection limit. Additionally, I added a check at startup to confirm that the limit matched the number of services we were trying to connect to so as to prevent future failures.
highest standards: rather than fixing the symptom at hand, fix the core issue
earn trust: outages mean lower oncall morale. Rolling out the changes in percentage increments meant gaining the trust of the team
customer obsession: outages are also bad for customers
metrics
- approx 8 blacklist rules
- added approx 15 whitelist rules
- rolled out in increments of 10%-20%
- rolled out over 3 weeks
- maintained a hit ratio of approx 89% over the rollout
S
Realized that our recs were being incorrectly cached. The cause was the historic configuration of specifying blacklisted paths.
T
The initial fix for the recs service was an easy config change. However, to avoid such a mistake in the future, I proposed that we change from a blacklist to a whitelist.
A
Initially this seemed like a very risky maneuver, especially when changing it for live production traffic, so I took a few precautions to reduce the risk.
Used the randomint() VCL function to distribute the traffic.
I decided on a percentage-based rollout and progressed through it over 3 weeks.
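The actual change was done in VCL using randomint(); the gist of the percentage gate, sketched in Rust here for consistency with the other examples in these notes (names illustrative):

    use rand::Rng;

    // Send `rollout_pct` percent of requests through the new whitelist config and
    // the rest through the old blacklist config. Raised in 10-20% steps over ~3 weeks.
    fn use_new_cache_config(rollout_pct: u32, rng: &mut impl Rng) -> bool {
        rng.gen_range(0..100) < rollout_pct
    }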
R
At each 10% increase I gained additional confidence in the new configs: the cache hit ratio held at ~89% and the traffic pattern remained the same. Additionally, by communicating with the rest of my team and staying aware of other company events, I avoided the rollout being mistakenly suspected for unrelated failures.
The result was a clean migration, a more resilient system, and probably higher morale since there were no outages.
learn and be curious: learn how to do proc macros. read other source code
metrics
- added 2 methods
- selectively replaced old code with new features
- fixed broken functionality with tokio-postgres
- reduced code update locations from 3 to 1
- hit 0 runtime errors due to mismatched fields thereafter
- technique allowed for slow adoption rather than breaking code
S
The crate was a side project written by someone to provide simple deserialization capabilities for Postgres in Rust. It addressed an important need but did not do any field-name checks at compile time.
T
Add additional methods to reduce runtime errors.
A
Added simple methods that read annotations and provided two methods, get_sql_table and get_fields. This was something the compiler could do reliably without erroring (see the sketch below).
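A sketch of the shape of the addition (hand-written here; the crate generated these methods from annotations), showing why field mismatches can surface at compile time instead of at runtime:

    // Illustrative trait; the crate derived these methods from struct annotations.
    trait SqlTable {
        fn get_sql_table() -> &'static str;
        fn get_fields() -> &'static [&'static str];
    }

    struct User {
        id: i64,
        email: String,
    }

    // What the annotation-driven code generation would produce for `User`.
    impl SqlTable for User {
        fn get_sql_table() -> &'static str { "users" }
        fn get_fields() -> &'static [&'static str] { &["id", "email"] }
    }

    // Queries built from the generated methods stay in sync with the struct
    // definition, so a renamed field shows up at compile time rather than as a
    // runtime deserialization error.
    fn select_all<T: SqlTable>() -> String {
        format!("SELECT {} FROM {}", T::get_fields().join(", "), T::get_sql_table())
    }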
R
As a result, the runtime errors due to table migrations went down to zero.
bias for action: took action with limited info, while evaluating risk.
metrics
- 10K messages queued
- 10pm page with no response from the rest of the team
- silenced alarms in 2 hr increments
- after the 3rd occurrence and a 2am alarm, made the decision to terminate the job
S
I was on call and got a page at approx 10pm. A RabbitMQ queue had backlogged and was alerting. The first hurdle was that there was no login/password information for accessing the queue to see what was going on.
T
I tried to reach the secondary in the hopes of gathering more information, but to no avail. I tried the tertiary (boss) and also the primary on the ingestion team but received no answer.
My task at this point became deciding how to proceed.
A
I was able to get login information from the SRE oncall and inspect the queue. At this point I saw that a message was causing a backlog, so I cleared it manually.
Once the queue started draining I silenced the specific alarm and went about the night. However, the error happened again and I noticed that another message was now causing the backlog.
It became apparent that manually skipping the error message was not a solution. There were approx 10k messages queued, which was the limit of the queue, and I suspected that if the backlog continued it could fill up the drives, causing more damage.
R
It was also apparent that the batch job was not correct, in the sense that it was unable to handle all message types, which was resulting in the backlog. I therefore decided to cancel the batch job and drain the queue.
Since it was a batch job and it was being developed (not customer facing), there was little harm in stopping the alert and job, which could be looked at in the morning.
customer obsession: subscription is the core of customer experience
dive deep: look at neighbouring code for amazon, google and apple
metrics
- solved issue across 2 different services (amazon and heroku)
- added a test to cover the edge case
- traced and rebated a few thousand customers affected
S
The subscription micro-service was responsible for determining subscription status for users. A CS report indicated that a customer was charged for a subscription but was not seeing premium features.
T
Verify that the customer was actually experiencing an error. Figure out a fix to the issue.
A
First I verified that the error was actually happening. This was done using a combination of two internal endpoints: subscription status and paid status of the user. I also verified that the subscription had not expired.
The error was occurring for an Amazon user.
I explored the other subscription services, since the code should be similar, including Amazon, Apple, Google...
Oddly enough, the code was slightly different for Amazon and Heroku users compared to the Google subscription code. Looking at the git history, one could see that the two had been added later.
R
I was able to fix the subscription code and fix the user experience for future users. The fix applied not only to Amazon but also to Heroku users.
A follow-up task was to track other affected users and provide them with rebates/extended experience.
disagree and commit: point out the lack of resources. end up at a compromise
develop the best: involve and promote the design lead as the project leader
metrics
- attrition of 2 product managers, 1 iOS dev, 1 Android dev
- rollout/feedback taking up to 2 weeks
- preliminary work being delayed for over a quarter
- raised the issue with 1 product manager, my manager, 2 designers, 1 coworker
S
The company had experimented with a new team structure "vertical team". The goal of the team would be to focus on improving a KPI.
However, due to some employee churn and other org changes, the squads had started to lose direction.
The recs team did not have enough resources but was expected to produce results. This worked alright for previously started projects; however, new projects were being proposed.
T
I had a strong hunch that without more product involvement and dev resources the squad would not be able to succeed in its mission.
A
I took it upon myself to speak with the different stakeholders involved: the design team, a product manager I had a good relationship with, a coworker, and my manager.
After speaking with these individuals, I was able to propose a meeting where we could come together and discuss risks and goals. We also took this time to re-evaluate the goals under the current circumstances.
R
I was able to convince my manager and the org to reach a middle compromise. There would be a shift in roles: the designer, who had the best idea about the product direction, would assume a temporary product manager role. Additionally, we would get an iOS dev for 2 sprints to consume and execute the feature.
With a reduction of scope and a more focused involvement of those on the squad, I was able to shift my focus to other work that needed to be done while wrapping up and supporting previous features.
think big: improve reliability and reduce work exponentially
invent and simplify: utilize helm chart and then customize for internal tools
metrics
- able to upgrade Jenkins from 1.x to 2.x
- able to upgrade all plugins with confidence
- able to address security warnings in Jenkins
- launched 3 slaves and restricted jobs to specific slaves
- later, additional slaves were added for Rust
- time to launch a new Jenkins server was < 2 hrs
- added detailed documentation so others would be able to contribute
S
Jenkins was the test automation server we used. However, it was deployed on a VM without any way to recover, upgrade or replicate the instance and data.
This resulted in tech debt and a fear of doing anything new with the instance.
T
I took it upon myself to create a kube-deployed instance which could be replaced and therefore upgraded.
A
I created a declarative instance of the server based on an existing Jenkins Helm recipe. I then tweaked it to have custom values and secrets.
The secrets were applied via an API call to the Jenkins server (decrypted via KMS).
R
We were able to migrate to the latest version of Jenkins.
right a lot: choosing Java and solving part of the process was the correct small step to allow for adoption
deliver results: delivered a project despite it not being the original all-encompassing solution
metrics
- reduced manual testing from 10-30 minutes to seconds
- removed human error during testing
- created an extensible framework, initially targeting 2 types of tests
- agnostic testing framework for both android and ios
S
This was during a hack week project and the goal of our 3 person team was to help automate QA-testing done for apps at the company.
The current way of testing was to open the app in a simulator and use Charles Proxy to collect HTTP traffic logs. The logs would then be parsed manually.
This could take entire days and sometimes resulted in incomplete testing if an urgent release was scheduled. It also meant long hours and sometimes weekend work for the QA team.
The team was composed of a non-technical QA member, an iOS dev and myself (backend).
T
The initial goal was to use a UI testing framework for iOS and thereby automate the collection and then verification of the tests. We spent 1.5 days trying to get the UI framework to work; however, it was unpredictable on different computers and eventually just didn't work.
A
Realizing that we were at risk of not having any product I decided to take a step back and see if we could produce something useful.
I took half the day to create a simple POC that tackled the testing portion and not the data gathering.
This was a much simpler and more predictable problem to solve. The log data could be ingested as JSON and a series of tests could be run on it. Java was chosen since it was a language most people would be comfortable with, it was typed, and the non-technical QA members could augment it.
We created a somewhat abstract testing framework. It had 3 broad scopes: ingest data from a file, filter data based on the test to be run (user id, header), and verify data based on custom test rules.
At some point I devoted the majority of my time to training and guiding the QA member on understanding and augmenting the codebase. Transferring ownership to the QA team was an important goal for me if the project was to succeed.
R
We were also able to tackle time-sensitive and manually arduous tests (token refresh every 5 min). By the end of the week we were able to execute 6 different tests and implement 2 different testing scenarios (find a value, stateful testing).
Needless to say the QA team was very happy and with the testing abstraction in place they were then able to implement more tests themselves.
metrics
- goods:
- used the helm template command to avoid creating a new tool; would be able to slowly template and move over existing configs
- POC worked and was able to represent the current config
- bads:
- would be 1 additional tool to install
- helm template was no longer supported
- Go templating had complicated syntax
- failures:
- project got hijacked from under me (loss of trust)
- should have consulted seniors more (to get buy-in)
- should have demonstrated a small example rather than going for the whole cake
- should have asked for more feedback from the team
- should have demonstrated the stability of helm template
S
Instead of updating configs across multiple envs and regions (qa, stg, us, au, nz), I wanted to create a template that would allow us to update a value in a single file. The remaining content could then be defined once.
T
A
R
Once the implementation was done and the team clearly supported the other implementation, I made it a point to verbally commend it in a meeting and show support.
metrics
- took 7 months; the actual work after mentorship took 4 months to release
- build times were up to 2 hrs; incremental builds were 10-30 min
- added 48 different test scenarios
- working on the Rust codebase was an exponential step forward for me
- 1,043,243 lines of code total in the project
- touched up to 45 files
- added 4 additional fixes post PR
S
I wanted to get involved in the OSS community, learn and give back. To force myself to do this I got involved during the impl days and claimed a feature that offered mentorship.
T
Understand the codebase. Understand the context around the feature. Understand the feature. Learn to work with the codebase. Commit a PR to implement the feature.
A
Learned to build and work in a large codebase. It took 7 months, but from the moment I asked to work on the feature I knew that I had to finish it.
R
Added the feature to infer predicates and thus made a small contribution to the ergonomics of the language. Added docs for the feature.
ownership: the api is owned by the team and not only the responsibility of an individual.
customer obsession: prod outage means users are being affected
metrics
- less than 1/2 hr outage
- restarted 3-4 failing services
- restarted up to 15 pods across all services that were in a bad state
S
Towards the end of the work day a systemic outage started to happen. The oncall member was in transit and not available to handle the issue.
Later we found out that the outage was due to a combination of an AWS upgrade event and the Weave CNI not being able to maintain the network mesh.
T
Step up and represent the API team in fixing the outage.
A
Restarted services that were showing errors in Prometheus. Also tracked each service individually to ensure that there were no lingering bad pods.
R
Outage lasted less than 1/2 hour. Monitored the state of the system for a total of 1 hour.
customer obsession: the customer was my client... he would need to read and develop the code afterwards
highest standards: rather than hacking a solution together, there was clearly a better but more time-consuming way to write the code
metrics:
S
T
A
R
ownership: take charge and take on responsibility
develop the best: promote people to get involved and grow replacement
metrics
- started with 1 old and 2 new co-organizers
- the old organizer dropped out; found 1 additional co-organizer
- hosted at 5 different companies
- found 1 repeat sponsor
- found and organized approx 36 speakers
- gathered approx 10 core repeating members
- gave 2 talks and led 1 learning session
S
Others bailed when it came time to actually put in the time and organize the meetup. The tasks were not very rewarding and involved coordinating with companies and speakers and trying to get sponsors.
T
I decided to take charge and try and build a sustainable community.
A
I created a list of companies. I spoke to attendees and convinced a few to speak or otherwise host. I invited everyone to speak about their side projects. I organized an un-meetup and volunteered to teach the beginner session.
R
I was able to organize a meetup each month for approx 1.5 years before transitioning it to a co-organizer. We averaged 30 attendees per meetup, with upwards of 60 at a few events. We had a home venue where we could meet each month and a sponsor who provided food and drinks. There were approx 10 consistent members who showed up very often and helped carry it forward.