Here's some information to help you get to know me. Let's go!
TL;DR: Check out the PDF or image versions of my CV.
- 01/2022 -> present: Data Engineer at MoMo (M_Service), via the MoMo Talents Program.
- 2018 -> 05/2022: Student at the University of Science - Vietnam National University, Ho Chi Minh City, Faculty of Information Technology; Data Science major in Computer Science. GPA: 8.5.
- Sep 13, 2023: MEP: A Comprehensive Medicines Extraction System on Prescriptions, in ICCCI 2023: Computational Collective Intelligence.
- Sep 21, 2022: Medical Prescription Recognition Using Heuristic Clustering and Similarity Search, in ICCCI 2022: Computational Collective Intelligence.
- Agile / Scrum concepts
- Programming Languages (C/C++, Java, Kotlin, Python, SQL,...)
- MS SQL Server / Oracle OCI / BigQuery / Vertica / Trino
- Open Table Format (Delta Lake / Apache Iceberg)
- Command line (on Linux/Unix and other systems)
- Git and Version Control
- CI / CD
- Shell / Linux
- Docker
- Kubernetes
- ETL / ELT
- Spark Application
- Data modeling
- Data Observability / Data Quality / Data Catalog / Data Security
- Data Governance
- Google Cloud Platform (BigQuery / PubSub / Dataproc / GKE / GCS / Cloud Functions / Resource monitoring / Looker / GCP gRPC API)
- Oracle APEX
- Scikit-learn
- Machine Learning Algorithms
- Generative AI
- MS Office
- Kubectl / Helm / Skaffold
- Bazel
- Infrastructure as Code (IaC) with Pulumi
- Policy as code
- Great Expectations
- dbt
- Airflow
- Datahub
- Oracle APEX
- Trino
- Apache Spark
- Apache Ranger
- SQLGlot
- Data Visualization Tools
- Machine Learning Tools
- LangChain
Golden Record - Process to achieve a high-value Data Mart at MoMo
Build tools and services on top of open-source projects to control the data model's quality, freshness, and extensibility. Golden Record currently serves many dataflows, such as events and transactions of the MoMo Super App.
Used: dbt, Great Expectations, Airflow, GitLab, Kubernetes, Oracle OCI, and Oracle APEX.
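Quality checks in this kind of setup are typically declared alongside the dbt models. Here is a minimal, hypothetical sketch of what such a declaration can look like (the model and column names are illustrative only, not MoMo's actual schema):

```yaml
# Hypothetical dbt schema tests for a "golden" model.
# Model and column names are invented for illustration.
version: 2
models:
  - name: golden_transactions
    columns:
      - name: transaction_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
```

Tests like these run as part of the pipeline, so a broken record fails fast instead of reaching downstream consumers.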
Cost Optimization - Reduce cost on GCP
Support other teams in optimizing queries: move services, ETL, and ELT workloads to on-premise Kubernetes, and shift suitable workloads from BigQuery to Vertica. Manage GCP resources for each team at MoMo following a divide-and-conquer principle.
Result: 40% of cost saved without any stuck workload.
Used: BigQuery, Vertica, Kubernetes, Oracle APEX, GCP gRPC API.
Data Observability - Data Governance
A project that helps end users monitor the five pillars of data: Freshness, Volume, Quality, Schema, and Lineage. It aims to reduce the data-platform team's workload when responding to both data questions and incidents.
Used: Datahub, dbt, Great Expectations, Airflow.
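To give a flavor of one of the five pillars, here is a minimal, self-contained sketch of a freshness check (the table, timestamps, and SLA window are hypothetical; the real project builds on Datahub and Great Expectations rather than hand-rolled code):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded_at: datetime, sla: timedelta, now: datetime) -> bool:
    """Freshness check: True if the table's last load is within the SLA window."""
    return now - last_loaded_at <= sla

# A table loaded 2 hours ago, checked against a 6-hour SLA, is still fresh;
# against a 1-hour SLA, it is stale.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
loaded_at = now - timedelta(hours=2)
print(is_fresh(loaded_at, timedelta(hours=6), now))  # True
print(is_fresh(loaded_at, timedelta(hours=1), now))  # False
```

The other pillars (Volume, Quality, Schema, Lineage) follow the same pattern: a metric is collected per table, compared against an expectation, and surfaced to the end user.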
Data Lakehouse
Collaborate with the team to build a lakehouse solution that reduces the cost of all workloads at MoMo. Trino and Spark run on GKE as query engines to process large batches of data stored in GCS. Cost per workload drops by up to 70% thanks to Spot instances, without missing any data SLA.
Used: Trino, Spark, GKE, GCS, BigQuery Storage, dbt, Airflow, Apache Ranger, Delta Lake, Apache Iceberg.
Data Pipeline Migration
Build a transpiling tool on top of open-source projects to migrate SQL end-to-end from the current production environment to the Lakehouse, reducing the human cost of the migration phase at MoMo by up to 90%.
Used: SQLGlot, Trino/Presto, BigQuery, Airflow.
- Citizens problems detection (Deep Learning @ AI4VN)
- Predict COVID-19 (Machine Learning / Data Visualization / Data Analysis)
- Plant Pathology (Deep Learning)
- Hospital Inpatient Discharges (Data Visualization / Data Analysis)
- Image Color Compression using K-means (Image Processing)
- Image Transformation (Image Processing / Image Transformation)
- Data Preprocessing Toolkits from scratch in Python (Data Processing)
I have earned many badges (in AI, Machine Learning, Deep Learning, and Data Science) on Google Cloud Platform.
Check out my Qwiklabs Public Profile.
- Email: [email protected]
- GitHub Page: viplazylmht.github.io