qwertyboss

The only difference between ordinary and extraordinary is just that little "extra".

Day 10/21

- **MLOps with k8s - twiml (Page 16/31)**
- Steps to consider: data acquisition, preprocessing, experiment management, model development, deployment, and monitoring (reporting).
- ML at scale, focus on eliminating the *incidental* complexity
- *incidental* complexity of machine learning ⇒ getting access to data, setting up servers, and scaling model training and inference.
- As opposed to its *intrinsic* complexity ⇒ identifying the right model, selecting the right features, and tuning the model’s performance to meet business goals.

- Key requirements
- **Multi-tenancy:** Dedicating a group of hardware to a specific team is inefficient; instead, create a shared environment for concurrent projects.
- **Elasticity:** The hardware should expand/shrink based on the requirements of the workload.
- **Immediacy:** Data scientists should have self-service access to the platform.
- **Programmability:** APIs to enable automated provisioning and maximise utilisation.
- Cloud does meet the above requirements; however, latency and economics can be optimised significantly on-prem. If you want to know more about a hybrid approach, watch “How Dukaan moved from cloud to on-prem” - Asli Engineering [link](https://www.youtube.com/watch?v=vFxQyZX84Ro)
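
To make the elasticity requirement concrete, here is a minimal sketch of a proportional scaling rule, the same idea the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × currentMetric / targetMetric)). The function name and the min/max bounds are my own assumptions for illustration, not something from the ebook.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling: grow or shrink so the metric moves toward its target.

    min_replicas / max_replicas are hypothetical bounds chosen for this sketch.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike cannot scale the workload without limit.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas running at 90% utilisation against a 60% target
    print(desired_replicas(4, 90, 60))
```

The clamp is what keeps a shared, multi-tenant cluster from being starved by one noisy workload.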

- Container and K8s
- K8s hierarchy - a declarative system
- Cluster: master → multiple workers (nodes), each running a kubelet (agent)
- Kubeflow is one of the options that utilises K8s to deliver mlops capabilities.
- Other solutions: [TWIML Solutions Guide](https://twimlai.com/solutions/)
- Containers are, in general, ephemeral in nature
- Volumes (available as long as the pod exists), Persistent Volumes (lifecycle managed by the cluster)
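
A toy sketch of what "declarative" means in the notes above: you state the desired state, and a control loop works out the actions needed to get there. This is not the Kubernetes API; `reconcile` and the action tuples are hypothetical names made up for illustration.

```python
def reconcile(desired_replicas, running_pods):
    """Compare desired state with observed state and return the actions to take.

    running_pods is a list of pod names currently observed (hypothetical input).
    """
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: ask for the missing ones to be created.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus, oldest-listed first.
        return [("delete", name) for name in running_pods[:-diff]]
    # Observed state already matches desired state: nothing to do.
    return []


if __name__ == "__main__":
    print(reconcile(3, ["pod-a"]))
```

The caller never says "start two pods"; it only says "there should be three", which is why the same loop also handles crashes and scale-downs.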

![Untitled](https://prod-files-secure.s3.us-west-2.amazonaws.com/d2df9e4d-9311-4c0c-9701-1e0536a3aba8/d49fc8f2-e8f8-4d09-aafc-dc57f87b24ea/Untitled.png)

pg 17

- CSI and other extension points: custom resources, operators, scheduler extensions, CNI (Container Network Interface), device plugins
- **Exercise idea:** Containerise the training and inference part of a simple machine learning use case and orchestrate the process using K8s.


**Extras:**

1. Read: What if the load balancer goes down? [Saurabh Dashora on X](https://twitter.com/ProgressiveCod2/status/1735561521869283339)
1. Remove a single point of failure using Floating IP and Active-passive switchover.
2. Completed AWS LI assessment: [LinkedIn](https://www.linkedin.com/skill-assessments/Amazon%20Web%20Services%20(AWS)/quiz-intro/)
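
A small simulation of the floating IP / active-passive idea from the first item, as I understood it: clients always hit one address, and that address is re-pointed to the standby load balancer when the active one fails its health check. The class, the names `lb-a` / `lb-b`, and the address are all hypothetical.

```python
class FloatingIP:
    """One public address that points at whichever load balancer is active."""

    def __init__(self, address, active, passive):
        self.address = address
        self.active = active
        self.passive = passive

    def route(self, healthy):
        """Return the load balancer that receives traffic.

        healthy: set of load balancer names currently passing health checks.
        """
        if self.active not in healthy and self.passive in healthy:
            # Switchover: the standby takes over the floating IP.
            self.active, self.passive = self.passive, self.active
        # If both are down, traffic still targets the (failed) active node.
        return self.active


if __name__ == "__main__":
    fip = FloatingIP("203.0.113.10", "lb-a", "lb-b")
    print(fip.route({"lb-a", "lb-b"}))
    print(fip.route({"lb-b"}))
```

In this sketch there is no automatic failback: once `lb-b` takes over, it stays active even after `lb-a` recovers, which avoids flapping.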

**Retro**

- Progress >>>> Feelings: You don’t have to “feel like” doing the thing. If you know deep inside that it is good for you in the long or short run, just embrace the pain and do it anyway, or else you will have to endure the pain of regret. Choose your pain wisely.

Day 9/21:

- LC 133 - Clone graph
- Core idea: Iterate through the original graph with a mechanism to keep track of graph nodes and new graph nodes => dictionary (oldNode, newNode)
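
A minimal sketch of that core idea, using BFS and a dictionary from original nodes to their clones. The `Node` class here is my assumption of the usual shape (a value plus a neighbour list); DFS would work just as well.

```python
from collections import deque


class Node:
    def __init__(self, val, neighbors=None):
        self.val = val
        self.neighbors = neighbors if neighbors is not None else []


def clone_graph(node):
    """Return a deep copy of the connected graph reachable from node."""
    if node is None:
        return None
    # Maps each original node to its clone; doubles as the visited set.
    old_to_new = {node: Node(node.val)}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in current.neighbors:
            if neighbor not in old_to_new:
                # First time we see this node: create its clone and visit it later.
                old_to_new[neighbor] = Node(neighbor.val)
                queue.append(neighbor)
            # Wire the cloned edge current -> neighbor.
            old_to_new[current].neighbors.append(old_to_new[neighbor])
    return old_to_new[node]
```

The dictionary is what stops cycles from looping forever and guarantees each original node is cloned exactly once.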

- Gradio UI + Falcon connection
- Hugging Face's Performance and scalability section is beautiful: https://huggingface.co/docs/transformers/performance
- Wasted time making flash attention work on my local system but eventually got it working on a remote instance. https://huggingface.co/docs/transformers/perf_infer_gpu_one

Retro
- Take breaks when stuck, and let that mind rest. Sometimes tackling a problem head-on in one stretch doesn't work.

Extra
- Atomic Habits: It doesn’t matter how successful or unsuccessful you are right now. What matters is whether your habits are leading you to success. You should be far more concerned with your current trajectory than with your current results.

Day 8/21: Chapter 3, SD Volume 1 Alex Xu

**Chapter 3: Framework for System Design**

- Step 1: Understand the problem statement (Requirement, Assumption) - 3 to 10 mins
- Step 2: Propose a high-level design and get buy-in: 10-15 mins
- Step 3: Design Deep dive: 10-25 mins
- Step 4: Wrap-up (+Discuss bottlenecks): 3-5 mins

**Notes**

- You are not expected to build a perfect large-scale design in 1 hour, but you should be able to defend your design choices.
- It simulates real-life problem-solving with co-workers on ambiguous problem statements.
- Red flags: Over-engineering, narrow-mindedness, stubbornness, etc.
- In an SD interview, giving an answer quickly without thinking earns no bonus points. Slow down. Think deeply and ask clarifying questions before giving the final design.
- State your assumptions and clarify them with the interviewer.
- Suggest multiple approaches if possible.
- Most importantly → Never give up, fight until the end.

**Extras**

- Attended Google Applied AI Summit - Found one interesting work: Talking papers with Mistral 7B on Kaggle: https://www.kaggle.com/code/philculliton/talking-papers-with-mistral-7b/notebook
- Revamped CV

Day 7: Code and containerise Gradio Chat UI for LLM

- Familiarised and curated a list of top 100 startups across various domains in the UK
- Scheduled LinkedIn content

An interesting piece about habit building from Rajan Singh, Habitstrong

It is harder to stop a Netflix binge. It is easier to not start.

It is harder to stop smoking. It is easier to not start.

For anything that gives instant gratification, stopping is harder than not starting.

Human beings have limited willpower, so use it early on.

E.g., to avoid eating ice cream, use your willpower in the supermarket, not after buying ice cream and stuffing it in your refrigerator.

Day 6/21: Falcon model Inference optimisations

Retro: Chunking task lists helped reduce friction in work.

Day 5/21: Scheduled some LinkedIn content

Not so deep work:

Read about the LLM Hallucination Index. Insights from the article:
- Q&A with RAG: GPT-3.5-turbo-0613
- Q&A without RAG: GPT-4-0613
- Long-form text generation: Llama-2-70b-chat
Reference: LLM Hallucination Index. Galileo released an LLM Hallucination… | by Cobus Greyling | Nov, 2023 | Medium

Retro:
It was a lazy a** day
Reason: overwhelming task list and doom scrolling

Solutions:
Made the tasks smaller
Replace scrolling with quick workouts & re-reading the book Atomic Habits

Day 4/21: Wrote Terraform scripts for infra provisioning

- Read about GitOps practices
- Looked into ArgoCD, which enables the separation of CI and CD for Kubernetes-based deployments. Its multi-cluster support (the same Argo instance can manage a fleet of clusters) and multi-environment deployment (overlays using Kustomize) are interesting.

MLOps training

Completed
1.5.2 Streaming data inference step

Next
2. Kubeflow pipelines
3. E2E example
4. [Optional] TFX pipeline

Training content prep

Completed
1.3 Hyperparameter tuning
1.4 Kubeflow pipeline
1.5.1 Streaming data training

Plan
1. Steps to build production ml system
2. E2E example
3. Kubeflow pipelines
4. [Optional] TFX pipeline
qwertyboss Author

Right now it is for company training, but YouTube sounds like a good idea; I will try that in the future.

Daniel

For YouTube?


MLOps

1.1 (Prep data from BQ to GCS) and 1.2 (training job) done

Plan
1. Steps to build production ml system
2. E2E example
3. Kubeflow pipelines
4. [Optional] TFX pipeline

MLOps training

1. Steps to build production ml system
2. E2E example
3. Kubeflow pipelines
4. [Optional] TFX pipeline