lps
----------
netbench orchestrator
amzn_crypto
ktls
amzn_crypto
s2n-quic ack freq analysis
amzn_crypto
s2n-quic async client hello
amzn_crypto
s2n-quic rustls testing and parity
amzn_crypto
s2n-quic advocate better slowloris mitigation
amzn_crypto
s2n-quic handshake status
amzn_crypto
s2n-quic path challenge
amzn_crypto
s2n-quic client implementation
amzn_crypto
s2n-quic connection migration
amzn_crypto
Mentor auto pulling andon
amzn_lm
PDT build
amzn_lm
Simplify metrics
amzn_lm #invent and simplify | #dive deep
Customer impact remediation
amzn_lm #bias for action
ICE investigation
amzn_lm
Granular andon
amzn_lm
Detect stuck migrations
amzn_lm
Fingerprint service
amzn_lm #invent and simplify | #bias for action
Advise SP on technical contractor
shatterproof #right a lot | #disagree and commit
Access and expose recs generated by Data Science
ihr #invent and simplify | #frugality | #earn trust
Max connections and failing kube watch
ihr #right a lot | #bias for action | #dive deep
Fastly blacklist to whitelist migration
ihr #highest standard | #earn trust
PostgresMapper - increasing code reliability
rust #learn and be curious | #think big
Batch job alerting and little knowledge of system (on-call)
ihr #bias for action
Amazon subscription not enabling features
ihr #customer obsession | #dive deep
Vertical squad lacking resources
ihr #disagree and commit | #develop the best
Migrate jenkins from standalone to k8s
ihr #think big | #invent and simplify
Skynet - delivering results for hackday project
ihr #deliver results | #right a lot
Templating kube config
ihr #think big | #disagree and commit
PR to rust lang - infer outlives predicates
rust #learn and be curious | #ownership | #think big
Own production outage when on-call was in transit
ihr #ownership | #customer obsession
Android rotation work took much longer than expected
s&j #customer obsession | #highest standard
Starting and leading the Rust meetup
rust #ownership | #develop the best
Replaced Miata timing belt and other components
life #learn and be curious | #frugality
_list_
#highest standard
---------- ----------
netbench orchestrator
amzn_crypto

summary https://github.com/toidiu/netbench_orchestrator

metrics

  • bla

S

T

A

R

ktls
amzn_crypto

summary https://github.com/aws/s2n-tls/issues/3711

metrics

  • bla

S

T

A

R

s2n-quic ack freq analysis
amzn_crypto

summary https://github.com/aws/s2n-quic/issues/1276

metrics

  • bla

S

T

A

R

s2n-quic async client hello
amzn_crypto

summary

metrics

  • bla

S

T

A

R

s2n-quic rustls testing and parity
amzn_crypto

summary

metrics

  • bla

S

T

A

R

s2n-quic advocate better slowloris mitigation
amzn_crypto

summary

metrics

  • bla

S

The previous implementation involved closing the connection by default if the bytes transferred dropped below some user-specified amount.

T

While simple and effective, this implementation seemed like a sharp edge and made me uneasy, because:

  • the user-specified value could become stale
  • the default action of closing the connection could be an availability risk
  • in a worst-case scenario this could lead to a stampede of connection closures
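
A minimal Rust sketch of that approach (illustrative only, not s2n-quic's actual API):

    // Illustrative sketch only, not s2n-quic's actual API. It shows the kind of
    // user-configured byte-rate threshold described above, and why a stale value
    // is risky: every connection below the threshold gets closed by default.
    enum Action {
        Close,
        Keep,
    }

    struct SlowlorisPolicy {
        // User-specified and static; can become stale as traffic patterns change.
        min_bytes_per_interval: u64,
    }

    impl SlowlorisPolicy {
        fn on_interval_end(&self, bytes_transferred: u64) -> Action {
            if bytes_transferred < self.min_bytes_per_interval {
                // Default action is to close. A mis-tuned threshold can close
                // healthy-but-slow connections, or stampede-close all of them.
                Action::Close
            } else {
                Action::Keep
            }
        }
    }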

A

R

s2n-quic handshake status
amzn_crypto

summary

metrics

  • bla

S

T

A

R

s2n-quic path challenge
amzn_crypto

summary

metrics

  • bla

S

T

A

R

s2n-quic client implementation
amzn_crypto

summary

metrics

  • bla

S

T

A

R

s2n-quic connection migration
amzn_crypto

summary

metrics

  • bla

S

T

A

R

Mentor auto pulling andon
amzn_lm

summary

metrics

  • conduct initial review of doc written by SDE1
  • provide feedback on structure and clarity of doc

S

T

A

R

PDT build
amzn_lm

metrics

  • helped surface questions and drive conversation for support in restricted regions

  • create and manage doc to track work for PDT region build

  • debug various issues as they came up during rebuild

S

T

A

R

Simplify metrics
amzn_lm #invent and simplify | #dive deep

metrics

  • while fixing the workflow in PDT, I took the initiative to simplify metrics for all regions

  • removed 2 metrics packages

  • replaced packaged code with ~100 lines of code to capture the same metrics in a less convoluted manner

  • dug into old profiler library codebase to understand implementation

    • verified metrics in igraph against what the code was posting

S

T

A

R

Customer impact remediation
amzn_lm #bias for action

bias for action: post-LSE, I took the initiative to dig into customer impact. Created a doc to track impact, pulled in senior engineers for advice, and drafted a remediation script. Helped discover another customer-impacting race condition bug.

metrics

  • remediated ~40 customer impacting instances

S

T

A

R

ICE investigation
amzn_lm

metrics

  • conducted initial deep research into ICEing and its various causes

    • categorized data into actionable and non-actionable
  • started cross-team conversations to help gather more metrics

S

T

A

R

Granular andon
amzn_lm

summary

metrics

  • auto deploy rules

  • integrate with workflow

  • write requirement doc for deprecating old andon

  • change matching rules from regex to string (eliminate errors)

  • created pipeline for auto releasing the andon rules

  • added alarms to catch if rules were not being read from lm workflow

  • added validation logic to better catch malformed rules during code review

S

T

A

R

Detect stuck migrations
amzn_lm

summary

metrics

  • bla

S

T

A

R

Fingerprint service
amzn_lm #invent and simplify | #bias for action

invent and simplify: reuse the hibernate codebase, especially the tests and validation. Also simplify the logic to make it more adaptable.

bias for action: researched other teams' use cases; simplified and structured the codebase to allow for extensibility. Drove consensus when technical issues (Java type generics) or framework choices made the right choice non-obvious.

metrics

  • eliminate 5-6 abstractions across 30 files

  • structure code to support various types of fingerprints for various teams

  • deliver a working first version of the API and provide value to the team. Added testing for places where Java type erasure made it difficult to write fully type-checked code.

  • wrote a PRFAQ to convince leadership

  • wrote a deserializer to safely transition from old parser + old rule format -> new parser + new rule format

    • multiple unit tests to ensure rules were exactly the same during migration
  • automated release of service with pipeline + rollback alarms

    • replaced MCM practice (manual inspection of metrics over ~2 weeks)
  • regional dashboards along with alarms for automated and manual debugging

    • set a template for regional dashboards
  • managed SDE1 who helped with some tasks

  • learn and bootstrap our paperwork

    • unlocking team to move from substrate to prod
  • unblocked self by communicating across teams for paperwork in substrate

  • provision tod workers for lm team

    • lpt managed
    • quilt pipelines
  • conducted an ORR for service pre/post launch (balancing the risk of doing some task post launch)

  • successfully transferred the ownership of 'rules' to a sister team (8hr time difference)

    • email + meetings + ongoing support
  • released under MCM

    • set team standards for not having false alarms prior to launch

S

T

A

R

Advise SP on technical contractor
shatterproof #right a lot | #disagree and commit

disagree and commit: present the risks, push back on execution and technical ability. support decision and provide help going forward

metrics

  • disagree
    • found an alternative contractor
      • interviewed 2 contractors
    • quantified the actual costs vs quoted costs from the contractor (asked for cost breakdown rather than lump sum)
    • pointed out faults with architecture (lambda is ephemeral and not good for DB connections)
    • being US-based was a positive for the contractor
      • nudged them to find another US-based contractor for comparison
  • commit
    • decision was based on business needs (far greater than future tech challenges)
      • learned the value of, and a process for, ending a stalemate!
    • reviewed the future API spec and noted the strengths of the contractor
    • kept trust so that I could prove useful in helping them hire a full-time person to augment contractor

S

I was a technical advisor for SP. They had a business goal, which was to choose a contractor to help them build a website and backend, and also conduct some user surveys.

T

I started the process by first identifying the overall architecture of the application. The second step was to find a second contractor and then interview both based on their technical expertise.

A

After speaking with both and reviewing the technical spec, I was of the opinion that the original contractor did not have the technical expertise; however, they did have the best business insight.

I flagged a high risk of technical debt and recommended that we seek out the second contractor.

In the end my client decided to go with the first contractor. Their primary concern was to reduce the number of parties involved, and since the first contractor was someone they were familiar with, they chose the more expensive and less technically stable option.

R

Once the decision was made I supported it. The primary goal was to move forward and do everything in our power to allow the first contractor to succeed.

Recently my gut feeling has proven correct: the contractor is not very flexible, technically or process-wise, and is very expensive.

Access and expose recs generated by Data Science
ihr #invent and simplify | #frugality | #earn trust

invent and simplify: come up with a solution and then simplify

earn trust: negotiate a solution with DS to further the relationship

frugality: reuse dynamo, jenkins rather than invent new solution

metrics

  • nightly job; system designed for time agnostic release
  • kept 7 days of backups
  • provide interface for testing
  • reuse existing infrastructure (dynamo, jenkins, schema models)
  • poll for new dataset every 5 min

S

Data Science (DS) ran a nightly job to generate music recommendations for users. The dataset lived in DynamoDB, and the old DS workflow was to rewrite the same dynamo table with the new dataset each night. This was a high-risk operation for my team (APIs). Additionally, it caused some outages due to accidental schema changes.

T

I was in charge of creating an HA and resilient workflow. Ownership of the data would remain with DS; we simply wanted some sanity checks on newly published data:

  • treat data as immutable: DS publishes each night's dataset to a new table
  • maintain a backlog of the last 7 datasets, which gives the benefit of 'rollback' if a new dataset is broken
  • given multiple datasets (a, b, c), DS can 'point' to the latest dataset and keep a few versions around via a "Version Table"
  • DS can run tests on new datasets to confirm schema compatibility
  • rollbacks are as easy as pointing to an older known-good dataset
  • maintain a log of actions (publish new dataset, test pass/fail, point to new dataset)

A

Worked closely with DS to come up with the system design.

  • added a Jenkins test, triggerable by DS, to verify that a new dataset's schema would not be a breaking change; the model is extensible to other types of tests and can track passing/failing metrics
  • added a poll-based mechanism in the API code that checks the "Version Table" and switches to the latest dataset; a 5 minute poll interval was used since slightly older data would still produce good enough recommendations
  • logged when a new dataset was detected (helps correlate errors with a new dataset)
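
A hedged Rust sketch of the poll-based switchover described above (the real system used DynamoDB and JVM services; the names here are hypothetical):

    // Hypothetical names; the real system read a DynamoDB "Version Table".
    trait VersionStore {
        // Returns the dataset table that the Version Table currently points at.
        fn current_dataset_table(&self) -> String;
    }

    struct RecsClient {
        active_table: String,
    }

    impl RecsClient {
        // Called on a ~5 minute interval; slightly stale data is acceptable.
        fn poll_for_new_dataset(&mut self, store: &dyn VersionStore) {
            let latest = store.current_dataset_table();
            if latest != self.active_table {
                // Log the switch so errors can be correlated with the new dataset.
                println!("switching recs dataset: {} -> {}", self.active_table, latest);
                self.active_table = latest;
            }
        }
    }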

R

We ended up with a system that could be used to audit the creation of new datasets. Outages due to breaking schemas were eliminated.

Max connections and failing kube watch
ihr #right a lot | #bias for action | #dive deep

right a lot: pinpoint the differences between stg and prod

bias for action: execute a non-invasive solution that could go out quickly while still preventing issues

metrics

  • errors occurred at 1-2 week intervals
  • random 500 errors for a particular api only
  • different behavior across stg and prod envs
  • 5 max connections
  • fixed issue for 2 additional micro services

S

An API that I wrote and owned would sporadically start returning 500 errors. Additionally, when I inspected the stg vs prod environments the results would differ and not always align, and the errors would happen at seemingly random times.

Restarting the container would fix the issue however.

T

After 2-3 more occurrences, it became apparent that there was a deeper issue. In response I decided to add logging to the service around the relevant code.

At the next occurrence we noticed from the logs that the IP of the service did not match the one being requested.

A

I then took a look at the mechanism doing service discovery and added additional logs. This showed that not all of the kube watch events were being handled correctly.

A deeper dive revealed that the failures appeared at the suspiciously convenient number of 5 connections. Digging further into the HTTP library used by the service discovery library (pandorum), I discovered that OkHttp has a default limit of 5 persistent connections.

R

I added a config change to increase the max connection limit. Additionally, I added a check at startup to confirm that the limit covered the number of services we were trying to connect to, so as to prevent future failures.
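
A Rust stand-in for that startup check (illustrative only; the real service was JVM-based and used OkHttp, whose connection pool defaults to 5 persistent connections):

    // Illustrative only; the actual guard lived in the JVM service at startup.
    fn assert_pool_covers_services(max_connections: usize, watched_services: usize) {
        assert!(
            max_connections >= watched_services,
            "connection pool ({}) is smaller than the number of watched services ({})",
            max_connections,
            watched_services
        );
    }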

Fastly blacklist to whitelist migration
ihr #highest standard | #earn trust

highest standard: rather than fix the symptom at hand fix the core issue

earn trust: outages mean lower oncall morale. rolling out the changes in % meant gaining the trust of the team

customer obsession: also bad for customers

metrics

  • approx 8 blacklist rules
  • added approx 15 whitelist rules
  • rolled out in increments of 10%-20%
  • rolled out over 3 weeks
  • maintained the hit ratio of approx 89% over the rollout

S

Realized that our recs were being incorrectly cached. The reason for this was the historic configuration of specifying the blacklisted paths.

T

The initial fix for the recs service was an easy config change. However, to avoid such a mistake in the future, I proposed that we change from a blacklist to a whitelist.

A

Initially this seemed like a very risky maneuver, especially when changing the config for live production traffic. Therefore I took a few precautions to eliminate the risk.

Used the randomint() VCL function to distribute the traffic.

I decided on a percentage-based rollout, progressing over 3 weeks.
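
A Rust stand-in for the VCL gate (the real change used Fastly's randomint() in VCL); it sketches the percentage split used during the rollout:

    // Illustrative only; rollout_percent stepped from 10 up to 100 over ~3 weeks.
    fn use_whitelist_rules(rollout_percent: u32, random_0_to_99: u32) -> bool {
        // Requests below the cutoff take the new whitelist path; the rest keep
        // the old blacklist behavior.
        random_0_to_99 < rollout_percent
    }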

R

At each 10% increase I gained additional confidence in the new configs. The cache hit ratio held at approximately 89%, and the traffic pattern remained the same. Additionally, by communicating with the rest of my team and staying aware of other company events, I avoided the rollout being mistakenly blamed for unrelated failures.

The result was a clean migration, a more resilient system, and probably higher morale, since there were no outages.

PostgresMapper - increasing code reliability
rust #learn and be curious | #think big

learn and be curious: learn how to do proc macros. read other source code

metrics

  • added 2 methods
  • selectively replaced old code with new features
  • fixed broken functionality with tokio-postgres
  • reduced code update locations from 3 to 1
    • hit 0 runtime errors because of mismatched fields thereafter
  • technique allowed for slow adoption rather than breaking code

S

The crate was a side project written by someone to provide simple deserialization capabilities for Postgres in Rust. It addressed an important need but did not do any field name checks at compile time.

T

Add additional methods to reduce runtime errors.

A

Added simple methods that read the struct annotations and provided two methods, get_sql_table and get_fields. Generating these was something the compiler could do reliably without error.
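
A hand-written Rust sketch of what the generated methods amount to (illustrative only; the real crate derives this from the annotated struct):

    // Hand-written stand-in for the derive output; illustrative only.
    trait SqlTable {
        fn get_sql_table() -> &'static str;
        fn get_fields() -> &'static str;
    }

    #[allow(dead_code)]
    struct Account {
        id: i64,
        email: String,
    }

    impl SqlTable for Account {
        fn get_sql_table() -> &'static str {
            "accounts"
        }
        fn get_fields() -> &'static str {
            "id, email"
        }
    }

    fn list_accounts_query() -> String {
        // Queries built from these methods stay in sync with the struct, so a
        // schema mismatch becomes a single-location fix instead of a runtime error.
        format!(
            "SELECT {} FROM {}",
            Account::get_fields(),
            Account::get_sql_table()
        )
    }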

R

As a result, the runtime errors due to table migrations went down to zero.

Batch job alerting and little knowledge of system (on-call)
ihr #bias for action

bias for action: took action with limited info, while evaluating risk.

metrics

  • 10K messages queued
  • 10pm with no response from the rest of the team
  • silenced alarms in 2 hr increments
  • after 3 occurrences and a 2am alarm, made the decision to terminate the job

S

I was on call and got a page at approx 10pm. A RabbitMQ queue had backlogged and was alerting. The first hurdle was that there was no login/password information for accessing the queue to see what was going on.

T

I tried to reach the secondary in the hope of gathering more information, but to no avail. I tried the tertiary (boss) and also the primary on the ingestion team but received no answer.

My task at this point became deciding how to proceed.

A

I was able to get login information from the SRE on-call and inspect the queue. At this point I saw that a message was causing the backlog, so I cleared it manually.

Once the queue started draining I silenced the specific alarm and went about the night. However, the error happened again, and I noticed that another message was now causing the backlog.

It became apparent that manually skipping the offending message was not a solution. There were approx 10k messages queued, which was the limit of the queue, and I suspected that if the backlog continued it could fill up the drives, causing more damage.

R

It was also apparent that the batch job was not correct, in the sense that it was unable to handle all message types, which was resulting in the backlog. I therefore decided to cancel the batch job and drain the queue.

Since it was a batch job and it was still being developed (not customer-facing), there was little harm in stopping the alert and the job, which could be looked at in the morning.

Amazon subscription not enabling features
ihr #customer obsession | #dive deep

customer obsession: subscription is the core of customer experience

dive deep: look at neighbouring code for amazon, google and apple

metrics

  • solved issue across 2 different services (amazon and heroku)
  • added a test to cover the edge case
  • traced and rebated a few thousand customers affected

S

The subscription microservice was responsible for determining subscription status for users. A CS report indicated that a customer was charged for a subscription but was not seeing premium features.

T

Verify that the customer was actually experiencing an error. Figure out a fix to the issue.

A

First I verified that the error was actually happening. This required using a combination of two internal endpoints to check the subscription and paid status of the user. I also verified that the subscription had not expired.

The error was occurring for an Amazon user.

I explored the other subscription services, since the code should be similar, including Amazon, Apple, Google...

Oddly enough, the code was slightly different for Amazon and Heroku users compared to the Google subscription code. Looking at the git history, one could see that the two had been added afterwards.

The problem came down to finding a shared piece of code logic that was incorrect.

R

I was able to fix the subscription code and fix the user experience for future users. The fix applied not only to Amazon but also to Heroku users.

A follow up task was created to track other users affected and provide them with rebates/extended experience.

Vertical squad lacking resources
ihr #disagree and commit | #develop the best

disagree and commit: point out the lack of resources. end up at a compromise

develop the best: involve and promote the design lead as the project leader

metrics

  • attrition of 2 product managers, 1 iOS, 1 Android
  • rollout/feedback taking up to 2 weeks
  • preliminary work being delayed for over a quarter
  • raised issue with 1 prod manager, my manager, 2 design, 1 coworker

S

The company had experimented with a new team structure "vertical team". The goal of the team would be to focus on improving a KPI.

However, due to some employee churn and other org changes, the squads had started to lose direction.

The recs team did not have enough resources but was expected to produce results. This worked alright for previously started projects; however, new projects were being proposed.

T

I had a strong hunch that without more product involvement and dev resources the squad would not be able to succeed in its mission.

A

I took it upon myself to speak with the different stakeholders involved: the design team, a product manager I had a good relationship with, a coworker, and my manager.

After speaking with these individuals, I proposed a meeting where we could come together and discuss risks and goals. We also took this time to re-evaluate goals under the current circumstances.

R

I was able to convince my manager and the org to reach a middle-ground compromise. There would be a shift in roles: the designer, who had the best idea about the product direction, would assume a temporary product manager role. Additionally, we would get an iOS dev for 2 sprints to consume and execute the feature.

With a reduction in scope and more focused involvement from those on the squad, I was able to shift my focus to other work that needed to be done, while wrapping up and supporting previous features.

Migrate jenkins from standalone to k8s
ihr #think big | #invent and simplify

think big: improve reliability and reduce work exponentially

invent and simplify: utilize helm chart and then customize for internal tools

metrics

  • able to upgrade Jenkins from 1.x to 2.x
  • able to upgrade all plugins with confidence
  • able to address security warnings in Jenkins
  • launched 3 slaves and restricted jobs to specific slaves
    • later, additional slaves were added for Rust
  • time to launch a new Jenkins server was < 2 hrs
  • added detailed documentation so others would be able to contribute

S

Jenkins was the test automation server we used. However, it was deployed on a VM without any way to recover, upgrade, or replicate the instance and its data.

This resulted in tech debt and a fear of doing anything new with the instance.

T

I took it upon myself to create a kube-deployed instance which could be replaced and therefore upgraded.

A

I created a declarative instance of the server based on an existing Jenkins helm chart. I then tweaked it with custom values and secrets.

The secrets were applied via an API call to the Jenkins server (decrypted via KMS).

R

We were able to migrate to the latest version of Jenkins.

Skynet - delivering results for hackday project
ihr #deliver results | #right a lot

right a lot: choosing Java and solving part of the process was the correct small step to allow for adoption

deliver results: deliver a project despite it not being the original all-encompassing solution

metrics

  • reduced manual testing from 10-30 minutes to seconds
  • removed human error during testing
  • created an extensible framework, initially targeting 2 types of tests
  • agnostic testing framework for both Android and iOS

S

This was a hack week project, and the goal of our 3-person team was to help automate the QA testing done for apps at the company.

The current way of testing was to open the app in a simulator and use Charles Proxy to collect HTTP traffic logs. The logs would then be parsed manually.

This could take entire days and sometimes resulted in incomplete testing if an urgent release was scheduled. It also meant long hours and sometimes weekend work for the QA team.

The team was composed of a non-technical QA member, an iOS dev and myself (backend).

T

The initial goal was to use a UI testing framework for iOS and thereby automate the collection and then verification of the tests. We spent 1.5 days trying to get the UI framework to work; however, it was unpredictable on different computers and eventually just didn't work.

A

Realizing that we were at risk of not having any product I decided to take a step back and see if we could produce something useful.

I took half a day to create a simple POC that tackled the testing portion rather than the data-gathering portion.

This was a much simpler and more predictable problem to solve. The log data could be ingested as JSON and a series of tests could be run on it. Java was chosen since it was a language most people would be comfortable with, it was typed, and the non-technical QA members could augment it.

We created a somewhat abstract testing framework. It had 3 broad scopes: ingest data from a file, filter data based on the test to be run (user id, header), and verify data based on custom test rules.

At some point I devoted the majority of my time to training and guiding the QA member on understanding and augmenting the codebase. Transferring ownership to the QA team was an important goal for me if the project was to succeed.

R

We were also able to tackle time-sensitive and manually arduous tests (token refresh every 5 min). By the end of the week we were able to execute 6 different tests and implement 2 different testing scenarios (find a value, stateful testing).

Needless to say, the QA team was very happy, and with the testing abstraction in place they were then able to implement more tests themselves.

Templating kube config
ihr #think big | #disagree and commit

metrics

  • goods:
    • used helm template command to avoid creating new tool
    • would be able to slowly template and move over existing configs
    • poc worked and was able to represent current config
  • bads:
    • would be 1 additional tool to install
    • helm template was no longer supported
    • go templating had complicated syntax
  • failures:
    • project got hijacked from under me (loss of trust)
    • should have consulted seniors more (get buy in)
    • should have demonstrated a small example rather than going for the whole cake
    • should have asked for more feedback from the team
    • should have demonstrated the stability of helm template

S

Instead of updating configs across multiple envs and regions (qa, stg, us, au, nz), I wanted to create a template that would let us update values in a single file. The remaining content could then be defined once.

T

A

R

Once the implementation was done and the team clearly supported the other implementation, I made it a point to verbally commend it in a meeting and show support.

PR to rust lang - infer outlives predicates
rust #learn and be curious | #ownership | #think big

metrics

  • took 7 months. actual work after mentorship took 4 months to release
  • build times were up to 2hrs. incremental builds were 10-30 min
  • added 48 different test scenarios
  • working on the Rust codebase was an exponential step forward for me
    • 1,043,243 lines of code total in the project
    • touched up to 45 files
  • added 4 additional fixes post PR

S

I wanted to get involved in the OSS community, learn, and give back. To force myself to do this, I got involved during impl days and claimed a feature that offered mentorship.

T

Understand the codebase. Understand the context around the feature. Understand the feature. Learn to work with the codebase. Commit a PR to implement the feature.

A

Learned to build and work in a large codebase. Took 7 months but from the moment I asked to work on the feature I knew that I had to finish it.

R

Added the feature to infer outlives predicates and thus made a small contribution to the ergonomics of the language. Added docs for the feature.
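
A small example of what the feature enables (the general shape of outlives inference, not the PR's actual diff):

    // Before outlives inference, this struct required an explicit `T: 'a` bound:
    //
    //     struct Ref<'a, T: 'a> {
    //         field: &'a T,
    //     }
    //
    // With inferred outlives predicates, the compiler derives `T: 'a` from the
    // `&'a T` field, so the bound no longer needs to be written:
    struct Ref<'a, T> {
        field: &'a T,
    }

    fn main() {
        let x = 42;
        let r = Ref { field: &x };
        println!("{}", r.field);
    }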

Own production outage when on-call was in transit
ihr #ownership | #customer obsession

ownership: the api is owned by the team and not only the responsibility of an individual.

customer obsession: prod outage means users are being affected

metrics

  • less than 1/2 hr outage
  • restarted 3-4 failing services
  • restarted up to 15 pods across all services that were in a bad state

S

Towards the end of the work day a systemic outage started to happen. The on-call member was in transit and not available to handle the issue.

Later we found out that the outage was due to a combination of an AWS upgrade event and the Weave CNI not being able to maintain the network mesh.

T

Step up and represent the API team in fixing the outage.

A

Restarted services that were showing errors in prometheus. Also tracked each service individually to ensure that there were no lingering bad pods.

R

Outage lasted less than 1/2 hour. Monitored the state of the system for a total of 1 hour.

Android rotation work took much longer than expected
s&j #customer obsession | #highest standard

customer obsession: the customer was my client; he would need to read and develop the code afterwards

highest standard: rather than hack a solution together, there was clearly a better but more time-consuming way to write the code

metrics:

S

T

A

R

Starting and leading the Rust meetup
rust #ownership | #develop the best

ownership: take charge and take on responsibility

develop the best: promote people to get involved and grow replacement

metrics

  • started as 1 old and 2 new co-organizers
  • the original organizer dropped out and we found 1 additional
  • hosted at 5 different companies
  • found 1 repeat sponsor
  • found and organized approx 36 speakers
  • gathered approx 10 core repeating members
  • gave 2 talks and led 1 learning session

S

Others bailed when it came time to actually put in the time and organize the meetup. The tasks were not very rewarding and involved coordinating with companies and speakers and trying to get sponsors.

T

I decided to take charge and try and build a sustainable community.

A

I created a list of companies. I spoke to attendees and convinced a few to speak or otherwise host. I invited everyone to speak on their side projects. I organized an un-meetup and volunteered to teach the beginner session.

R

I was able to organize a meetup each month for approx 1.5 years before transitioning it to a co-organizer. We averaged 30 attendees per meetup, with upwards of 60 at a few events. We had a home venue where we could meet each month and a sponsor who would provide food and drinks. There were approx 10 consistent members who would show up very often and helped carry it forward.

Replaced Miata timing belt and other components
life #learn and be curious | #frugality

learn and be curious: curiosity of how things work. what is a timing belt. what is the effort involved

frugality: save money by doing the changes myself and then also in the future

metrics

  • car was at 60k miles
  • $200 in tools(reusable) and $200 in parts
  • took 5 days of work
  • changed parts
    • timing belt
    • radiator fluid
    • water pump
    • serpentine belts
    • engine gasket
    • shifter oil
  • asked for help with 1 portion (turning the crankshaft bolt)

S

My first car was a Mazda Miata. I got it at 30K miles, and at 60K it needed the timing belt changed.

T

I could have taken it to a mechanic but instead of paying upwards of hundreds of dollars, I decided to read and fix the car myself.

The parts cost around $200 and the tools cost another $200.

A

I confirmed that the risk was not too high. The engine was a non-interference engine, so even if I messed up the timing there would be no risk of destroying the engine.

Two other things aided me: the amazing miata.net blog, and a friend I could ask for help when needed. He actually had a large wrench that I borrowed to turn the crankshaft bolt.

I read the instructions multiple times as well as comments from various other owners to predict all the things that could go wrong.

R

I was able to replace the belts and get the car running. I felt more comfortable debugging further problems with the car as they came up rather than trust a mechanic.

At the end I had the knowledge, the tools, and a working car.

_list_
#highest standard
  • jvm debugging

  • committing to a rust-lang issue and following through on it (7 months)

  • migrating old recs api to new micro service logic

  • debugging and optimizing Mongo query performance across different services

  • write unit and embedded Mongo tests across the many services

  • I interface with Kubernetes on a daily basis. k8s is quite complex, so many of the debugging tasks involve creating and refining a mental picture of how it actually works. This is then used to debug and further improve service performance.