Sub sections

  • Brag
  • Interview Prep
  • What are Brag documents

    STAR
  • Situation: Describe the situation or task you faced
  • Task: Explain what you were required to achieve
  • Action: Describe the actions you took to complete the task
  • Result: Explain the outcome of your actions and what you learned from the experience
  • Amazon Leadership Principles

    Table of contents

    • ktls (aws_crypto)
    • s2n-quic ack delay (aws_crypto)
    • Mentor Intern (aws_lm)
    • PDT build (aws_lm)
    • Simplify metrics (aws_lm) #invent and simplify | #dive deep
    • Customer impact remediation (aws_lm) #bias for action
    • Fingerprint service (aws_lm) #invent and simplify | #bias for action
    • Advise SP on technical contractor (shatterproof) #right a lot | #disagree and commit
    • Access and expose recs generated by Data Science (ihr) #invent and simplify | #frugality | #earn trust
    • Max connections and failing kube watch (ihr) #right a lot | #bias for action | #dive deep
    • Fastly blacklist to whitelist migration (ihr) #highest standard | #earn trust
    • PostgresMapper - increasing code reliability (rust) #learn and be curious | #think big
    • Amazon subscription not enabling features (ihr) #customer obsession | #dive deep
    • Vertical squad lacking resources (ihr) #disagree and commit | #develop the best
    • Migrate jenkins from standalone to k8s (ihr) #think big | #invent and simplify
    • Skynet - delivering results for hackday project (ihr) #deliver results | #right a lot
    • Templating kube config (ihr) #think big | #disagree and commit
    • PR to rust lang - infer outlives predicates (rust) #learn and be curious | #ownership | #think big
    • Own production outage when on-call was in transit (ihr) #ownership | #customer obsession
    • Android rotation work took much longer than expected (s&j) #customer obsession | #highest standard
    • Starting and leading the Rust meetup (rust) #ownership | #develop the best
    netbench orchestrator
    aws_crypto

    https://github.com/aws/s2n-netbench/tree/main/netbench-orchestrator

    ktls
    aws_crypto

    https://github.com/aws/s2n-tls/issues/3711

    s2n-quic optimistic ack mitigation
    aws_crypto

    https://github.com/aws/s2n-quic/pull/1986

    S

    • Add mitigation for optimistic acks
    • The RFC was clear about how to do this (skip packet numbers), but vague because it didn't specify how many packets to skip or how often.

    T

    • Come up with a strategy for skipping packets and mitigating the attack.
    • Implement the mitigation in s2n-quic.

    A

    • Audited other QUIC implementations and conducted analysis to answer two key questions:
      • How many packets to skip?
      • How often should packets be skipped?
    • Two other implementations used a "static" approach (the skip range did not evolve with cwnd). Their strategy was to:
      • Track a single skipped pn, overwriting the value when a new pn needs to be skipped.
      • Skip a random pn in some static range.
    • The "static" approach was network dependent (DC vs public wifi):
      • Does overwriting the skip pn nullify the mitigation?
      • Is a static skip range effective for all networks?
        • A cwnd is very different in a DC vs the public internet
    • Considered the option of skipping multiple packets:
      • Pro: this would allow us to skip more frequently
      • Con: requires storing multiple skip pns. How many pns should we store?
    • Analyzed the purpose of the mitigation to come up with an optimal solution.
      • The goal of the mitigation is to prevent cwnd bloat.
      • cwnd bloat happens when packets are acked before the peer has actually received them.
      • So the skip range should be based on the number of packets that could be received within a cwnd.
    • Solution: evolve the skip packet range based on cwnd and only store 1 skip pn.
    • Calculate the range based on the packets we expect to send in a single period (see the sketch below):
      • pkt_per_cwnd = cwnd / mtu
      • rand = pkt_per_cwnd/2..pkt_per_cwnd*2
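
    A minimal sketch of that calculation, assuming the rand crate (names and types are illustrative, not the actual s2n-quic internals):

      use rand::Rng;

      /// Pick the next packet number to skip. The skip range grows with the
      /// congestion window so the mitigation scales from datacenter links to
      /// the public internet.
      fn next_skip_packet_number(current_pn: u64, cwnd: u64, mtu: u64) -> u64 {
          // Number of packets we expect to send in roughly one congestion window.
          let pkt_per_cwnd = (cwnd / mtu).max(1);

          // Skip a random packet number between half and twice that estimate.
          let offset = rand::thread_rng().gen_range(pkt_per_cwnd / 2..=pkt_per_cwnd * 2);

          // Only one skipped packet number needs to be stored; it is simply
          // overwritten once a new pn needs to be skipped.
          current_pn + offset.max(1)
      }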

    R

    • Successfully implemented the mitigation.
    • Only had to store 1 skip pn.
    • By evolving the skip range based on cwnd, s2n-quic would scale to all networks.
    s2n-quic ack delay
    aws_crypto

    summary

    • tracking issue: https://github.com/aws/s2n-quic/issues/1276
    • cpu increase: https://github.com/aws/s2n-quic/pull/1298
    • revert PR with some analysis: https://github.com/aws/s2n-quic/pull/1368
    • measure batching multi-packets: https://user-images.githubusercontent.com/4350690/174196918-6af428e4-9ab7-4458-b3b9-e27ed89c3318.png

    metrics

    S

    • Explore the ACK delay RFC and see if s2n-quic could benefit from implementing it.

    Pros of delay:

    • sending/receiving acks is CPU expensive
      • on a send-heavy server it accounted for 24.33% (see the flamegraph link below)
    • asymmetric links like satellite benefit, since fewer acks travel over the constrained return path

    Cons of delay:

    • progress: delayed acknowledgments can delay data being sent from the peer
      • the RFC lets you adjust thresholds
    • ECN: congestion signal by the network
      • ack packets with ECN markings immediately
    • loss recovery: detect a packet was not received and retransmit
    • cc: regular acks help establish good RTT estimates and drive CC
    • BBR: requires a high precision signal to work so it was unclear how delaying acks would affect this

    RFC features:

    • negotiating the extension: min_ack_delay, the minimum amount of time that the endpoint sending this value is willing to delay an acknowledgment
    • ACK_FREQUENCY frame (sketched after this list):
      • Ack-Eliciting Threshold: max ack eliciting packets received before sending an ack
      • Requested Max Ack Delay: max time to wait before sending an ack
      • Reordering Threshold: max number of reordered packet before sending an ack
    • IMMEDIATE_ACK frame: asks peer to send an immediate ack
    • Expedite ECN: ecn marked packets should not be delayed
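
    A rough sketch of the frame fields and how the thresholds could drive the "send an ack now?" decision (hypothetical types, not the s2n-quic implementation):

      use std::time::{Duration, Instant};

      /// Fields carried by an ACK_FREQUENCY frame.
      struct AckFrequency {
          ack_eliciting_threshold: u64,      // max ack-eliciting packets before acking
          requested_max_ack_delay: Duration, // max time to wait before acking
          reordering_threshold: u64,         // max reordered packets before acking
      }

      /// Hypothetical helper: should an ack be sent immediately?
      fn should_ack_now(
          cfg: &AckFrequency,
          ack_eliciting_since_last_ack: u64,
          reordered_packets: u64,
          oldest_unacked: Instant,
          ecn_marked: bool,
          immediate_ack_requested: bool,
      ) -> bool {
          immediate_ack_requested // IMMEDIATE_ACK frame
              || ecn_marked       // expedite ECN: don't delay congestion signals
              || ack_eliciting_since_last_ack > cfg.ack_eliciting_threshold
              || reordered_packets > cfg.reordering_threshold
              || oldest_unacked.elapsed() >= cfg.requested_max_ack_delay
      }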

    T

    Choose an implementation.

    Options:

    • single round batching (10 packet GSO payload)
      • easy to implement (1-2 sprints)
      • this solution creates building blocks for others after it
    • multi round batching (multi-GSO payload) (2-4 sprints)
      • medium difficulty to implement
      • requires tuning with the application, production use case, and traffic pattern
    • implement ACK delay RFC (4-8 sprints)
      • difficult to implement
      • requires tuning with the application, production use case, and traffic pattern
      • requires negotiating the extension

    A

    Impl: Batch ACK Processing (single round)

    • single round batching (10 packet GSO payload); see the sketch after this list
      • connection can signal interest in processing acks
      • store pending acks on connection
      • refactor CC and LR to accept signals from batched ack processing (timestamp)
      • swap from single processing to batched processing
      • emit metrics
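
    A simplified sketch of the single-round batching idea: record acks as packets arrive, then process the batch once per round and feed CC/LR with the recorded timestamps (hypothetical types, not the actual refactor):

      use std::time::Instant;

      /// Minimal stand-in for the real ack/range types.
      struct AckInfo {
          largest_acked: u64,
          receive_time: Instant,
      }

      #[derive(Default)]
      struct Connection {
          pending_acks: Vec<AckInfo>,
          ack_interest: bool,
      }

      impl Connection {
          /// On packet receipt, just record the ack and signal interest.
          fn on_ack_received(&mut self, ack: AckInfo) {
              self.pending_acks.push(ack);
              self.ack_interest = true;
          }

          /// Once per round, drain the batch and feed congestion control and
          /// loss recovery with the recorded timestamps.
          fn process_pending_acks(&mut self) {
              for ack in self.pending_acks.drain(..) {
                  // congestion_controller.on_ack(ack.largest_acked, ack.receive_time);
                  // loss_recovery.on_ack(ack.largest_acked, ack.receive_time);
                  let _ = (ack.largest_acked, ack.receive_time);
              }
              self.ack_interest = false;
          }
      }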

    R

    • Flamegraph result 24.33% -> 25.81% https://github.com/aws/s2n-quic/pull/1298
    • Batching different amount of packets (2, 10, 40, 80): https://user-images.githubusercontent.com/4350690/174196918-6af428e4-9ab7-4458-b3b9-e27ed89c3318.png

    lessons learned:

    • we had to be cautious about delaying acks because we were operating on the public internet
    • I wonder if an environment like a DC would be better suited for delaying acks
    • acks are a signal within the noise of loss, delay, congestion, etc.
    • within a DC there is less noise, so it makes sense that we could get away with less signal
    s2n-tls default TLS 1.3 support
    aws_crypto

    metrics

    • 4000 tests
    • ~1500 scoped down tls 1.2 tests
    • ~33 "default" policy tests

    S

    • include TLS 1.3 support by default
    • TLS 1.3 is a good modern default: a provably secure handshake and modern ciphers

    T

    • this can be risky because customers will see a change in behavior
    • it would also affect test coverage for TLS 1.2
      • tests that test the TLS 1.2 policy
        • can pass when switched to TLS 1.3 because they were written before TLS 1.3 support and might not assert the negotiated protocol
      • tests that test the "default" policy

    It is not possible to automatically tell these two kinds of tests apart.

    A

    • assess different options
    • balance the risk of security vs not making progress
    • lay out the options, make a case for it

    R

    • manually audit 4000 tests
    • scope bad tls 1.2 to certain files
    • pin all tests that are using default to a numbered policy
      • test regression for "default" policy tests
    • don't pin tests
      • test regression for tls 1.2 tests
    • run all tests twice (different platform, libcrypto, fuzz, valgrind)
    • run a single tls 1.2 test and accept a minimal risk of tls 1.2 test regression
    s2n-quic async client hello
    aws_crypto

    https://github.com/aws/s2n-quic/issues/1137

    S

    Currently s2n-quic performs certificate lookup/loading operations synchronously.

    This is not ideal for applications which serve multiple domains concurrently and need to load multiple certificates, since the lookup blocks the thread.

    T

    Allow certificate lookup/loading operations to be performed asynchronously, enabling non-blocking behavior.

    A

    • s2n-quic:

      • pass the connection waker down to the tls libraries so that they could wake on progress
    • s2n-tls:

      • The work involved converting the callback from an only-invoked-once model to a poll-the-callback model in s2n-tls (see the sketch after this list).
      • s2n-tls by default did not allow for polled callbacks.
      • s2n-tls previously only called the callback once, which is not the Rust model and has quite a few drawbacks.
        • Invoking the callback only once means the application/s2n-quic has to schedule (possibly on a separate thread) the completion of the callback.
        • It also needs to manage the state associated with the callback separately.
        • The Rust polling model allows all state associated with the future to live within the object being polled.
        • Additionally, the future can make progress as part of the runtime that s2n-quic already starts with.
    • s2n-tls bindings:

      • gluing the new callback polling behavior in an extensible way for other callbacks.
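
    A rough sketch of the difference between a fire-once callback and the poll-the-callback model described above (the traits and types here are hypothetical, not the real s2n-tls bindings):

      use std::task::Poll;

      /// Hypothetical certificate handle.
      struct Certificate;

      /// Fire-once model: the callback runs a single time and the application
      /// must schedule completion and track its state separately.
      trait ClientHelloCallbackOnce {
          fn on_client_hello(&mut self, server_name: &str);
      }

      /// Poll model: the callback is re-polled by the existing runtime until the
      /// certificate is ready; all intermediate state lives in the polled object.
      trait ClientHelloCallbackPoll {
          fn poll_client_hello(&mut self, server_name: &str) -> Poll<Certificate>;
      }

      /// Example poll-based lookup that resolves once some async work completes.
      struct AsyncCertLookup {
          loaded: Option<Certificate>,
      }

      impl ClientHelloCallbackPoll for AsyncCertLookup {
          fn poll_client_hello(&mut self, _server_name: &str) -> Poll<Certificate> {
              match self.loaded.take() {
                  Some(cert) => Poll::Ready(cert),
                  // Pending: the connection waker (passed down from s2n-quic)
                  // triggers a re-poll when progress can be made.
                  None => Poll::Pending,
              }
          }
      }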

    R

    • set up the framework for future async callbacks in s2n-tls
    s2n-quic rustls testing and parity
    aws_crypto
    s2n-quic advocate better slowloris mitigation
    aws_crypto

    S

    The previous implementation involved closing the connection by default if the bytes transferred dropped below some user-specified amount.
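
    A minimal sketch of that default behavior, to make the sharp edge concrete (names and thresholds are hypothetical):

      use std::time::Duration;

      /// Hypothetical check run periodically on each connection: if observed
      /// throughput falls below a user-specified floor, the connection is closed.
      fn should_close_for_slowloris(
          bytes_in_window: u64,
          window: Duration,
          min_bytes_per_sec: u64, // user-specified, and can go stale over time
      ) -> bool {
          let observed = bytes_in_window as f64 / window.as_secs_f64();
          // Returning true closes the connection by default, which is the
          // availability risk discussed below.
          observed < min_bytes_per_sec as f64
      }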

    T

    While a simple and effective implementation, this seemed like a sharp edge and made me uneasy, because:

    • the user-specified value could become stale
    • the default action of closing the connection could be an availability risk
    • in a worst case scenario this could lead to a stampede of all connections being closed

    A

    R

    s2n-quic handshake status
    aws_crypto
    s2n-quic path challenge
    aws_crypto
    s2n-quic client implementation
    aws_crypto
    s2n-quic connection migration
    aws_crypto
    Mentor auto pulling andon
    aws_lm

    summary

    metrics

    • conduct initial review of doc written by SDE1
    • provide feedback on structure and clarity of doc

    S

    T

    A

    R

    Mentor Intern
    aws_lm

    Project plan: https://quip-amazon.com/Wa4iADaP4cqI/Intern-Project-LM-Detective

    S

    • The Dispatcher is a critical component in LM but our team lacked metrics.
    • The intern project "LM Detective" would give us:
      • better understanding of Dispatcher Service and ongoing migrations
      • allow LM to take a data driven approach to improving the Dispatcher service and scheduling

    T

    • Draft the intern project plan, omitting some information to probe for understanding.
    • Help onboard the intern to AWS and LM specific technology.
    • Educate the intern about LM technology and help guide to a successful delivery.

    A

    • Met with the intern multiple times a week to ensure progress.
    • Guided the intern to set milestones and define a project plan.
    • Helped define the stretch goals and helped prepare their final presentation.

    R

    • The intern successfully completed the project and received a return offer.
    • LM could use the project to answer questions such as:
      • What is the state of the dispatcher given a specific migration id?
      • For all migrations in pending/executing state, what are the src droplets and tgt droplets?
      • Why is a particular migration in status pending?
    • The intern also completed the stretch goal for the project
      • Create a UI (graph viz) to visualize the data
    PDT build
    aws_lm

    Project plan: https://quip-amazon.com/NtptAZ55eOpb/LM-in-PDT-Work

    S

    • LM was tasked to make the service available in PDT.
    • PDT should be the last region; LM should therefore also be available in all other regions.

    T

    • Build the entire LM stack in PDT to meet gov compliance. (canaries, dispatcher, alarms, metrics)
    • Created a project plan to identify all components and missing regions.
    • Automate the infrastructure to make builds reproducible.

    A

    • Cleaned up our LPT rules to make service launches reproducible, extensible and maintainable.
    • Synthesized new alarms, cleaning-up/creating dashboards along the way.

    R

    • Full LM availability in all regions, including PDT (met gov compliance).
    • A clean, normalized LPT stack for all services owned by LM.

    metrics

    • helped surface questions and drive conversation for support in restricted regions

    • create and manage doc to track work for PDT region build

    • debug various issues as they came up during rebuild

    Simplify metrics
    aws_lm #invent and simplify | #dive deep

    metrics

    • while fixing the workflow in PDT, I took the initiative to simplify metrics for all regions

    • removed 2 metrics packages

    • replaced packaged code with ~100 lines of code to capture the same metrics in a less convoluted manner

    • dug into old profiler library codebase to understand implementation

      • verified metrics in igraph against what the code was posting

    S

    T

    A

    R

    Customer impact remediation
    aws_lm #bias for action

    bias for action: post LSE, I took the initiative to dig into customer impact. Created a doc to track impact, pulled in senior engineers for advice and drafted a remediation script. Helped discover another race condition bug which was customer impacting.

    metrics

    • remediated ~40 customer impacting instances

    S

    T

    A

    R

    ICE investigation
    aws_lm

    metrics

    • conducted initial deep research into ICEing and its various causes

      • categorized data into actionable and non-actionable
    • started conversations with cross teams to help gather more metrics

    S

    T

    A

    R

    Granular andon
    aws_lm

    metrics

    • auto deploy rules

    • integrate with workflow

    • write requirement doc for deprecating old andon

    • change matching rules from regex to string (eliminate errors)

    • created pipeline for auto releasing the andon rules

    • added alarms to catch if rules were not being read from lm workflow

    • added validation logic to better catch malformed rules during code review

    S

    T

    A

    R

    Detect stuck migrations
    aws_lm
    Fingerprint service
    aws_lm #invent and simplify | #bias for action
    • invent and simplify: reuse the hibernate codebase, especially the tests and validation. Also simplify the logic to make it more adaptable.
    • bias for action: research other teams' use cases; simplify and structure the codebase to allow for extensibility. Drove consensus when technical issues (Java type generics) or framework choices made the right choice non-obvious.

    FAQ: https://quip-amazon.com/3ujuAnetkBr1 Dashboard: https://w.amazon.com/bin/view/EC2/LiveMigration/Dashboards/DropletFingerprintService/

    S

    • KaOS owns fingerprint rules.
    • KaOS deletes and then creates a DynamoDB table with updated rules. This causes elevated ICEing.
    • LM consumed these rules.

    T

    • Create a single highly available service to serve fingerprint rules.
    • De-risk KaOS rolling out new fingerprint rules.
    • Take the opportunity to change the format of fingerprint rules. This was a high risk item since a change in format made it difficult to detect mistakes in parsing logic.
    • Emit metrics for fingerprint matching rules to track the flow migrations.

    A

    • Wrote a FAQ to convince leadership: https://quip-amazon.com/3ujuAnetkBr1
    • Coordinated with the KaOS team and launched a multi-region LM service.
    • Wrote a lot of tests to ensure the fingerprint format change was equivalent (see the sketch after this list).
    • Mentored junior teammate (karthiga) on creating a detailed service health dashboard.
    • Conducted ORR to ensure service best practices.
    • Created a fully automated service using LPT, managed fleets, and paperwork.
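
    A sketch of the kind of equivalence test used to de-risk the format change: parse the same rules with the old and new parsers and assert identical results (parser and type names are hypothetical):

      /// Hypothetical normalized representation of a fingerprint rule.
      #[derive(Debug, PartialEq)]
      struct FingerprintRule {
          name: String,
          pattern: String,
      }

      // Stand-ins for the real old-format and new-format parsers.
      fn parse_old_format(_input: &str) -> Vec<FingerprintRule> { vec![] }
      fn parse_new_format(_input: &str) -> Vec<FingerprintRule> { vec![] }

      #[test]
      fn old_and_new_parsers_agree_on_existing_rules() {
          // In the real tests this would iterate over every production rule set.
          for input in ["rule-set-a", "rule-set-b"] {
              assert_eq!(
                  parse_old_format(input),
                  parse_new_format(input),
                  "format migration changed the meaning of: {input}"
              );
          }
      }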

    R

    • Launched the service under MCM.
    • There was zero impact to LM during the migration (due to the detailed metrics and tests).
    • The ORR and automation served as a template for future service launches for the team.

    Extra info:

    • wrote a PRFAQ to convince leadership
    • wrote a deserializer to safely transition from old parser + old rule format -> new parser + new rule format
      • multiple unit tests to ensure rules were exactly the same during migration
    • automated release of service with pipeline + rollback alarms
      • replaced MCM practice (manual inspection of metrics over ~2 weeks)
    • regional dashboards along with alarms for automated and manual debugging
      • set a template for regional dashboards
    • managed SDE1 who helped create alarms for the service
    • learned and bootstrapped our paperwork
      • unlocking team to move from substrate to prod
    • unblocked self by communicating cross teams for paperwork in substrate
    • provision tod workers for lm team
      • lpt managed
      • quilt pipelines
    • conducted an ORR for service pre/post launch (balancing the risk of doing some task post launch)
    • successfully transferred the ownership of 'rules' to a sister team (8hr time difference)
      • email + meetings + ongoing support
    • released under MCM
      • set team standards for not having false alarms prior to launch
    Advise SP on technical contractor
    shatterproof #right a lot | #disagree and commit

    disagree and commit: present the risks, push back on execution and technical ability. support decision and provide help going forward

    metrics

    • disagree
      • found an alternative contractor
        • interviewed 2 contractors
      • quantified the actual costs vs quoted costs from the contractor (asked for cost breakdown rather than lump sum)
      • pointed out faults with architecture (lambda is ephemeral and not good for DB connections)
      • US based was a positive for contractor
        • nudged them to find another US based contractor for comparison
    • commit
      • decision was based on business needs (far greater than future tech challenges)
        • learned the value and a process of how to end a stalemate!
      • reviewed future API spec and noted on the strengths of the contractor
      • kept trust so that I could prove useful in helping them hire a full-time person to augment contractor

    S

    I was a technical advisor for SP. They had a business goal, which was to choose a contractor to help them build a website and backend, and also conduct some user surveys.

    T

    I started the process by first identifying the overall architecture of the application. The second step was to find a second contractor and then interview both based on their technical expertise.

    A

    After speaking with both and reviewing the technical spec, I was of the opinion that the original contractor did not have the technical expertise; however, they did have the best business insight.

    I pointed out the high risk of technical debt and recommended that we seek out the second contractor.

    In the end my client decided to go with the first contractor. Their primary concern was to reduce the number of parties involved, and since the first contractor was someone they were familiar with, they chose the more expensive and less technically stable option.

    R

    Once the decision was made I supported it. The primary goal was to move forward and do everything in our power to allow the first contractor to succeed.

    Recently my gut feeling has proven correct, in that the contractor is not very flexible technically or process-wise, and is very expensive.

    Access and expose recs generated by Data Science
    ihr #invent and simplify | #frugality | #earn trust

    invent and simplify: come up with a solution and then simplify

    earn trust: negotiate a solution with DS to further the relationship

    frugality: reuse dynamo, jenkins rather than invent new solution

    metrics

    • nightly job; system designed for time agnostic release
    • kept 7 days of backups
    • provide interface for testing
    • reuse existing infrastructure (dynamo, jenkins, schema models)
    • poll for new dataset every 5 min

    S

    • Data Science (DS) ran a nightly job to generate music recommendations for users.
    • The dataset would live in DynamoDB and the old DS workflow was to rewrite the same dynamo table with new dataset each night.
    • This was a high risk operation for my team (APIs). Additionally it caused some outages due to accidental schema changes.

    T

    • I was in charge of creating a HA and resilient workflow.
    • The ownership of the data would continue to reside with DS.
    • We simply wanted assurances that there were some sanity checks for newly published data.
    • Treat data as immutable
      • Enforce that DS publish the dataset to a new table each night.
      • Maintain a backlog of 7 datasets into the past.
      • This gives the benefit of 'rollback' if a new dataset was broken
    • Given multiple datasets (a,b,c); DS can 'point' to latest dataset
      • DS can maintain a few versions of the data. "Version Table"
      • DS can run tests on new datasets to confirm schema compatibility
      • rollbacks are as easy as pointing to an older known dataset
    • Maintain a log of actions (publish new dataset, test pass/fail, point to new dataset)

    A

    • Work closely with DS to come up with the system design.
    • Added Jenkins tests which could be triggered by DS
      • used to verify that data schema would not be a breaking change
      • extensible model for other types of tests
      • potential to track failing/passing metrics
    • Poll-based mechanism within the API code to look at the "Version Table" and start using the latest dataset (see the sketch after this list).
      • A poll interval of 5 min was used since slightly 'older' data would still produce good enough recommendations.
      • Log when a new dataset was detected (help correlate errors with new dataset)
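
    A sketch of that poll mechanism: every few minutes the API checks the "Version Table" for the newest dataset pointer and swaps over when it changes (the trait and names are hypothetical, not the real service code):

      use std::time::Duration;

      /// Hypothetical handle to the "Version Table" that DS updates to point
      /// at the latest published dataset.
      trait VersionTable {
          fn latest_dataset(&self) -> String;
      }

      struct RecsSource {
          active_dataset: String,
      }

      impl RecsSource {
          /// Called on an interval; a 5 minute poll was acceptable because
          /// slightly stale recommendations are still good enough.
          fn poll_for_new_dataset(&mut self, table: &dyn VersionTable) {
              let latest = table.latest_dataset();
              if latest != self.active_dataset {
                  // Log the switch so errors can be correlated with a new dataset.
                  println!("switching recs dataset {} -> {}", self.active_dataset, latest);
                  self.active_dataset = latest;
              }
          }
      }

      const POLL_INTERVAL: Duration = Duration::from_secs(5 * 60);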

    R

    • We gained a system that could be used to audit the creation of new datasets.
    • Outages due to breaking schemas were eliminated.
    Max connections and failing kube watch
    ihr #right a lot | #bias for action | #dive deep

    right a lot: pinpoint the differences between stg and prod

    bias for action: execute a non-invasive solution, which could go out quickly while still preventing issues

    metrics

    • errors occurred at 1-2 week increments
    • random 500 errors for a particular api only
    • different behavior across stg and prod envs
    • 5 max connections
    • fixed issue for 2 additional micro services

    S

    An API that I wrote and owned would sporadically start returning 500 errors. Additionally, when I inspected the stg vs prod environments the results would differ and not always align, and the errors happened at seemingly random times.

    Restarting the container would fix the issue however.

    T

    After 2-3 more occurrences, it became apparent that there was a deeper issue. In response I decided to add logging to the service around the relevant code.

    At the next occurrence we noticed from the logs that the IP of the service did not match the one being requested.

    A

    I then took a look at the mechanism that was doing service discovery and added additional logs. This showed that not all of the kube watch events were working correctly.

    Diving further revealed that this condition was hit at the suspiciously convenient number of 5 connections. Digging into the HTTP library used by the service discovery library (pandorum), I discovered that OkHttp has a default limit of 5 persistent connections.

    R

    I added a config change to increase the max connection limit. Additionally, I added a check at startup to confirm that the limit matched the number of services we were trying to connect to so as to prevent future failures.

    Fastly blacklist to whitelist migration
    ihr #highest standard | #earn trust

    highest standard: rather than fix the symptom at hand fix the core issue

    earn trust: outages mean lower oncall morale. rolling out the changes in % meant gaining the trust of the team

    customer obsession: also bad for customers

    metrics

    • approx 8 blacklist rules
    • added approx 15 whitelist rules
    • rolled out in increments of 10%-20%
    • rolled out over 3 weeks
    • maintained the hit ratio of approx 89% over the rollout

    S

    Realized that our recs were being incorrectly cached. The root cause was the historic configuration, which specified only blacklisted paths.

    T

    The initial fix for the recs service was an easy config change. However, to avoid such a mistake in the future I proposed that we change from a blacklist to a whitelist.

    A

    Initially this seemed like a very risky maneuver, especially when changing the behavior for live production traffic. Therefore I took a few precautions to reduce the risk.

    Used the randomint() VCL function to distribute the traffic.

    I decided on a percentage-based rollout, progressing over 3 weeks.

    R

    At each 10% increase I gained additional confidence in the new configs: the cache hit ratio held at roughly 89% and the traffic pattern remained the same. Additionally, by communicating with the rest of my team and staying aware of other company events, I avoided the changes being wrongly suspected for unrelated failures.

    The result was a clean migration, a more resilient system and probably higher morale, since there were no outages.

    PostgresMapper - increasing code reliability
    rust #learn and be curious | #think big

    learn and be curious: learn how to do proc macros. read other source code

    metrics

    • added 2 methods
    • selectively replaced old code with new features
    • fixed broken functionality with tokio-postgres
    • reduced code update locations from 3 to 1
      • hit 0 runtime errors caused by mismatched fields thereafter
    • technique allowed for slow adoption rather than breaking code

    S

    The crate was a side project written by someone to provide simple deserialization capabilities for postgres in Rust. It addressed an important need but failed to do any field name checks at compile time.

    T

    Add additional methods to reduce runtime errors.

    A

    Added simple methods that read the annotations and provided two methods, get_sql_table and get_fields; this was something the compiler could do reliably without erroring at runtime.
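
    A rough sketch of what those derive-generated helpers enable; the table name and column list come from the struct definition, so queries stay in sync with the fields (illustrative only, written out by hand here instead of by the macro):

      /// Illustrative only: the derive macro generates methods equivalent to
      /// these from the struct's annotation and field names.
      #[allow(dead_code)]
      struct User {
          id: i64,
          email: String,
      }

      impl User {
          /// Table name taken from the struct annotation.
          fn get_sql_table() -> &'static str {
              "users"
          }

          /// Column list derived from the field names, so a renamed or removed
          /// field becomes a compile-time change in one place instead of a
          /// runtime error in every hand-written query.
          fn get_fields() -> &'static str {
              "id, email"
          }
      }

      fn select_all() -> String {
          format!("SELECT {} FROM {}", User::get_fields(), User::get_sql_table())
      }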

    R

    As a result, the runtime errors due to table migrations went down to zero.

    Batch job alerting and little knowledge of system (on-call)
    ihr #bias for action

    bias for action: took action with limited info, while evaluating risk.

    metrics

    • 10K messages queued
    • 10pm with no response from the rest of the team
    • silenced alarms in 2 hr increments
    • after 3 occurrences and a 2am alarm, made the decision to terminate the job

    S

    I was on call and got a page at approx 10pm. A RabbitMQ queue had backlogged and was alerting. The first hurdle was that there was no login/password information about how to access the queue and see what was going on.

    T

    I tried to reach the secondary in the hopes of gathering more information, but to no avail. I tried the tertiary (boss) and also the primary on the ingestion team but received no answer.

    My task at this point became deciding how to proceed.

    A

    I was able to get login information from the SRE oncall and was able to inspect the queue. At this point I saw that a message was causing a backlog, so I cleared it manually.

    Once the queue started draining I silenced the specific alarm and went about the night. However the error happened again and I noticed that there was another message now causing the backlog.

    It became apparent that manually skipping the error message was not a solution. There were approx 10k messages queued, which was the limit of the queue, but I suspected that if the backlog continued it could fill up the drives, causing more damage.

    R

    It was also apparent that the batch job was not correct, in the sense that it was unable to handle all message types, which was resulting in the backlog. I therefore decided to cancel the batch job and drain the queue.

    Since it was a batch job and it was being developed (not customer facing), there was little harm in stopping the alert and job, which could be looked at in the morning.

    Amazon subscription not enabling features
    ihr #customer obsession | #dive deep

    customer obsession: subscription is the core of customer experience

    dive deep: look at neighbouring code for amazon, google and apple

    metrics

    • solved issue across 2 different services (amazon and heroku)
    • added a test to cover the edge case
    • traced and rebated a few thousand customers affected

    S

    Subscription micro-service was responsible for determining subscription status for users. A CS report indicated that a customer was charged for subscription but was not seeing premium features.

    T

    Verify that the customer was actually experiencing an error. Figure out a fix to the issue.

    A

    First I verified that the error was actually happening. This was done using a combination of two internal endpoints: subscription status and paid status of the user. I also verified that the subscription had not expired.

    The error was occurring for an Amazon user.

    Explored the other subscription services since the code should be similar, including Amazon, Apple, Google...

    Oddly enough, the code was slightly different for Amazon and Heroku users compared to the Google subscription code. Looking at the git history, one could see that the two had been added afterwards.

    R

    I was able to fix the subscription code and fix user experience for future users. This fix got applied not only to Amazon, but also Heroku users.

    A follow up task was to track other users affected and provide them with rebates/extended experience.

    Vertical squad lacking resources
    ihr #disagree and commit | #develop the best

    disagree and commit: point out the lack of resources. end up at a compromise

    develop the best: involve and promote the design lead as the project leader

    metrics

    • attrition of 2 prod manager, 1 ios, 1 android
    • rollout/feedback taking up to 2 weeks
    • preliminary work being delayed for over a quarter
    • raised issue with 1 prod manager, my manager, 2 design, 1 coworker

    S

    The company had experimented with a new team structure "vertical team". The goal of the team would be to focus on improving a KPI.

    However, due to some employee churn and other org changes, the squads had started to lose direction.

    The recs team did not have enough resources but was expected to produce results. This worked alright for previously started projects; however, new projects were being proposed.

    T

    I had a strong hunch that without more product involvement and dev resources the squad would not be able to succeed in its mission.

    A

    I took it upon myself to speak with the different stakeholders involved: the design team, a product manager I had a good relationship with, a coworker and my manager.

    After speaking with individuals I was able to propose a meeting where we could come together and discuss risks and goals. We also took this time to re-evaluate goals under the current circumstances.

    R

    I was able to convince my manager and the org to reach a middle-ground compromise. There would be a shift in roles: the designer, who had the best idea about the product direction, would assume a temporary product manager role. Additionally, we would be able to get an iOS dev for 2 sprints to consume and execute the feature.

    With a reduction of scope and a more focused involvement of those part of the squad, I was able to shift my focus on other work that needed to be done while wrapping up and supporting previous features.

    Migrate jenkins from standalone to k8s
    ihr #think big | #invent and simplify

    think big: improve reliability and reduce work exponentially

    invent and simplify: utilize helm chart and then customize for internal tools

    metrics

    • able to upgrade jenkins from 1x to 2x
    • able to upgrade all plugins with confidence
    • able to address security warning in jenkins
    • launched 3 slaves and restrict jobs to specific slaves
      • later, additional slaves were added for rust
    • time to launch a new jenkins server took < 2 hrs
    • added detailed documentation so others would be able to contribute

    S

    Jenkins was the test automation server we used. However, it was deployed on a VM without any way to recover, upgrade or replicate the instance and data.

    This resulted in tech debt and a fear of doing anything new with the instance.

    T

    I took it upon myself to create a kube deployed instance which could be replaced and therefore upgraded.

    A

    I created a declarative instance of the server based on an existing jenkins helm recipe. I then tweaked it to have custom values and secrets.

    The secrets were applied via an API call to the jenkins server (decrypting from KMS).

    R

    We were able to migrate to the latest version of Jenkins.

    Skynet - delivering results for hackday project
    ihr #deliver results | #right a lot

    right a lot: choosing java and solving part of the process was the correct small step to allow for adoption

    deliver results: deliver a project despite it not being the original all encompassing solution

    metrics

    • reduced manual testing from 10-30 minutes to seconds
    • removed human error during testing
    • created an extensible framework, initially targeting 2 types of test
    • agnostic testing framework for both android and ios

    S

    This was during a hack week project and the goal of our 3 person team was to help automate QA-testing done for apps at the company.

    The current way to do testing was to open the app in a simulator and use Charles proxy to collect HTTP traffic logs. The logs would then be parsed manually.

    This could take entire days and sometimes resulted in incomplete testing if an urgent release was scheduled. It also meant long hours and sometimes weekend work for the QA team.

    The team was composed of a non-technical QA member, an iOS dev and myself (backend).

    T

    The initial goal was to use a UI testing framework for iOS and thereby automate the collection and then verification of the tests. We spent 1.5 days trying to get the UI framework to work; however it was unpredictable on different computers and eventually just didn't work.

    A

    Realizing that we were at risk of not having any product I decided to take a step back and see if we could produce something useful.

    I took half the day to create a simple POC to tackle the testing portion rather than the data gathering portion.

    This was a much simpler and more predictable problem to solve. The log data could be ingested as JSON and a series of tests could be run on it. Java was chosen since it was a language most people would be comfortable with, it was typed, and the non-technical QA members could augment it.

    We created a somewhat abstract testing framework. It had 3 broad scopes: ingest data from a file, filter data based on the test to be run (user id, header), and verify data based on custom test rules.
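
    The real tool was written in Java; purely to illustrate the three-stage shape (ingest, filter, verify), a small sketch:

      /// A single request/response record from the Charles proxy export.
      #[derive(Clone)]
      struct LogEntry {
          user_id: String,
          header: String,
      }

      /// 1. Ingest: load the exported traffic log (JSON in the real tool).
      fn ingest(_path: &str) -> Vec<LogEntry> {
          vec![] // parse the file here
      }

      /// 2. Filter: narrow the log down to the traffic relevant to one test.
      fn filter_for_test(entries: &[LogEntry], user_id: &str) -> Vec<LogEntry> {
          entries.iter().filter(|e| e.user_id == user_id).cloned().collect()
      }

      /// 3. Verify: run a custom rule over the filtered traffic.
      fn verify(entries: &[LogEntry], rule: impl Fn(&LogEntry) -> bool) -> bool {
          entries.iter().all(rule)
      }

      fn main() {
          let entries = ingest("charles-export.json");
          let scoped = filter_for_test(&entries, "test-user-123");
          let passed = verify(&scoped, |e| e.header.contains("Authorization"));
          println!("test passed: {passed} ({} requests checked)", scoped.len());
      }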

    At some point I devoted the majority of my time to training and guiding the QA member on understanding and augmenting the codebase. Transferring ownership to the QA team was an important goal for me if the project was to succeed.

    R

    We were also able to tackle time-sensitive and manually arduous tests (token refresh every 5 min). By the end of the week we were able to execute 6 different tests and implement 2 different testing scenarios (find a value, stateful testing).

    Needless to say the QA team was very happy and with the testing abstraction in place they were then able to implement more tests themselves.

    Templating kube config
    ihr #think big | #disagree and commit

    metrics

    • goods:
      • used helm template command to avoid creating new tool
      • would be able to slowly template and move over existing configs
      • poc worked and was able to represent current config
    • bads:
      • would be 1 additional tool to install
      • helm template was no longer supported
      • go templating had complicated syntax
    • failures:
      • project got hijacked from under me (loss of trust)
      • should have consulted seniors more (get buy in)
      • should have demonstrated a small example rather than going for the whole cake
      • should have asked for more feedback from team
      • should have demonstrated the stability of helm template

    S

    Instead of updating configs across multiple envs and regions (qa, stg, us, au, nz), I wanted to create a template that would allow us to update a value in a single file. The remaining content could then be defined once.

    T

    A

    R

    Once the implementation was done and the team clearly supported the other implementation, I made it a point to verbally commend it in a meeting and show support.

    PR to rust lang - infer outlives predicates
    rust #learn and be curious | #ownership | #think big

    metrics

    • took 7 months. actual work after mentorship took 4 months to release
    • build times were up to 2hrs. incremental builds were 10-30 min
    • added 48 different test scenarios
    • working on the rust codebase was an exponential step forward for me
      • 1,043,243 lines of code total in the project
      • touched up to 45 files
    • added 4 additional fixes post PR

    S

    I wanted to get involved in the OSS community, learn and give back. To force myself to do this I got involved during the impl days and claimed a feature that offered mentorship.

    T

    Understand the codebase. Understand the context around the feature. Understand the feature. Learn to work with the codebase. Commit a PR to implement the feature.

    A

    Learned to build and work in a large codebase. Took 7 months but from the moment I asked to work on the feature I knew that I had to finish it.

    R

    Added the feature to infer outlives predicates and thus made a small contribution to the ergonomics of the language. Added docs for the feature.
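
    For context, the feature is outlives-requirement inference (RFC 2093): struct definitions can omit bounds like T: 'a that the compiler can now infer from the field types. A small before/after example:

      // Before the change, this definition required an explicit bound:
      //     struct Wrapper<'a, T: 'a> { field: &'a T }
      // With outlives inference, the `T: 'a` requirement is inferred from the
      // `&'a T` field, so the bound can be omitted:
      struct Wrapper<'a, T> {
          field: &'a T,
      }

      fn main() {
          let x = 5;
          let w = Wrapper { field: &x };
          println!("{}", w.field);
      }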

    Own production outage when on-call was in transit
    ihr #ownership | #customer obsession

    ownership: the api is owned by the team and not only the responsibility of an individual.

    customer obsession: prod outage means users are being affected

    metrics

    • less than 1/2 hr outage
    • restarted 3-4 failing services
    • restarted up to 15 pods across all services that were in a bad state

    S

    Towards the end of the work day a systemic outage started to happen. The oncall member was in transit and not available to handle the issue.

    Later we found out that the outage was due to a combination of AWS upgrade event and the weave CNI not being able to maintain a network mesh.

    T

    Step up and represent the API team in fixing the outage.

    A

    Restarted services that were showing errors in prometheus. Also tracked each service individually to ensure that there were no lingering bad pods.

    R

    Outage lasted less than 1/2 hour. Monitored the state of the system for a total of 1 hour.

    Android rotation work took much longer than expected
    s&j #customer obsession | #highest standard

    customer obsession: the customer was my client... he would need to read and develop the code afterwards

    highest standard: rather than hack a solution, there was clearly a better but more time-consuming way to code it

    metrics:

    S

    T

    A

    R

    Starting and leading the Rust meetup
    rust #ownership | #develop the best

    ownership: take charge and take on responsibility

    develop the best: promote people to get involved and grow replacement

    metrics

    • started as 1 old and 2 new co-organizers
    • the old organizer dropped out; found 1 additional co-organizer
    • hosted at 5 different companies
    • found 1 repeat sponsor
    • found and organized approx 36 speakers
    • gathered approx 10 core repeating members
    • gave 2 talks and led 1 learning session

    S

    Others bailed when it came time to actually put in the time and organize the meetup. The tasks were not very rewarding and involved coordinating with companies and speakers, and trying to get sponsors.

    T

    I decided to take charge and try and build a sustainable community.

    A

    I created a list of companies. I spoke to attendees and convinced a few to speak or host. I invited everyone to speak on their side projects. I organized an un-meetup and volunteered to teach the beginner session.

    R

    I was able to organize a meetup each month for approx 1.5 years before transitioning it to a co-organizer. We averaged 30 attendees per meetup, with upwards of 60 at a few events. We had a home for the meetup where we could meet each month and a sponsor who would provide food and drinks. There were approx 10 consistent members who would show up very often and helped carry it forward.