AI FOUNDATIONS & DATA ENGINEERING

Applied ML & Data Engineering Foundations

Large-scale data pipelines, applied ML workflows, and distributed data processing systems that built the foundation for my current GenAI engineering work.

These projects span large-scale ETL, data transformation, and applied analytics, and that experience now shapes how I design robust AI systems and retrieval pipelines.

Government Data Processing — India Data Portal

ISB Hyderabad | Large-Scale ETL

Processed and transformed 500GB+ of public-sector data from multiple government portals, including MNREGA physical, financial, and mandays datasets.

Built end-to-end ETL workflows using PySpark for distributed processing and Pandas for downstream transformation, including long-to-wide restructuring for portal integration.

Stored processed outputs in Parquet format on cloud object storage to support reliable large-scale data access and downstream analytics.
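As an illustration of the long-to-wide restructuring step, here is a minimal Pandas sketch. It is not the project's actual code: the column names (district, year, indicator, value), the sample figures, and the helper long_to_wide are all hypothetical, chosen only to show the shape of the transformation.

```python
import pandas as pd


def long_to_wide(df, index_cols, key_col, value_col):
    """Pivot a long table (one row per indicator) into a wide table
    (one column per indicator). Duplicate rows are summed; this is an
    assumption, and another aggregation may suit real data better."""
    wide = df.pivot_table(
        index=index_cols, columns=key_col, values=value_col, aggfunc="sum"
    )
    wide.columns.name = None  # drop the leftover "indicator" axis label
    return wide.reset_index()


# Made-up sample data, for illustration only.
long_df = pd.DataFrame(
    {
        "district": ["A", "A", "B", "B"],
        "year": [2020, 2020, 2020, 2020],
        "indicator": ["households", "persondays", "households", "persondays"],
        "value": [120, 3400, 95, 2100],
    }
)

wide_df = long_to_wide(long_df, ["district", "year"], "indicator", "value")
print(wide_df)
```

Each (district, year) pair becomes one row, with one column per indicator, which is typically the layout a portal table or dashboard expects.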

500GB+ Data Processed
20+ Datasets Built
30% Efficiency Gain
PySpark Pandas Selenium BeautifulSoup Parquet Wasabi

Import Export Data — National Trade ETL Pipeline

ISB Hyderabad | 16-Year Dataset

Built an ETL pipeline that extracts import-export data from the Trade Statistics portal of the Ministry of Commerce, Government of India. Produced a 100GB+ dataset spanning 16 years of monthly trade data at the country level. Parallelized processing with Python's ThreadPoolExecutor, which significantly reduced execution time.
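The parallel extraction pattern can be sketched roughly as follows. This is an assumption-laden outline, not the pipeline itself: fetch_month is a hypothetical placeholder for the real portal request, and the worker count and year range are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_month(year, month):
    """Hypothetical stand-in for downloading one month of trade data.
    A real version would request and parse the portal's response."""
    return {"year": year, "month": month, "rows": []}


def fetch_all(years, max_workers=8):
    """Fetch every (year, month) pair concurrently.
    Returns (results sorted by date, list of failed tasks)."""
    tasks = [(y, m) for y in years for m in range(1, 13)]
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_month, y, m): (y, m) for y, m in tasks}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # keep going; record the failure for a retry
                failures.append((key, exc))
    results.sort(key=lambda r: (r["year"], r["month"]))
    return results, failures


results, failures = fetch_all([2000, 2001])
print(len(results), "months fetched,", len(failures), "failed")
```

Threads suit this kind of workload because the time is spent waiting on network I/O rather than on CPU, so many requests can be in flight at once.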

100GB+ Dataset Size
16yrs Data Span
Python Pandas ThreadPoolExecutor boto3 Wasabi

Real-Time Stock Market Data with Kafka

Streaming Data Engineering

Built a real-time stock market data system using Apache Kafka for streaming ingestion. Used AWS Glue for schema management and Athena for SQL-based analytics on the ingested stream data. Added performance tuning and error handling to keep real-time processing reliable.

Apache Kafka AWS Glue AWS Athena Python SQL

Drone Detection Model

Defense Sector | YOLOv3 Computer Vision

Designed and trained a real-time drone detection model for defense applications using YOLOv3 with PTZ camera and sensor integration. Built the image dataset via automated web scraping, integrated the detection system with jammer hardware, and reduced PTZ system cost by one-third versus alternatives.
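YOLO-style detectors typically emit several overlapping boxes per object, which are then filtered with non-maximum suppression (NMS). The sketch below shows that post-processing step in plain Python. It is a generic illustration, not the project's code: the box format (x1, y1, x2, y2), the 0.5 threshold, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than the
    threshold. Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two overlapping detections of one object plus one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In a real-time setting this step usually runs through a vectorized library routine rather than a Python loop. The loop form is used here only for readability.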

35% Jammer Enhancement
YOLOv3 OpenCV Python NumPy PTZ Camera

Melanoma Detection

Medical AI | CNN Classification

Built a CNN-based model in TensorFlow that detects melanoma from skin images. Implemented multiclass classification with a custom architecture, handled class imbalance through targeted sampling, and evaluated the model using ROC-AUC and confusion matrices.
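One common way to handle class imbalance is random oversampling of the minority classes. The sketch below assumes that approach. The source says only "targeted sampling", so the exact method, the file names, and the class labels here are illustrative guesses.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class has as
    many items as the largest class. Returns (samples, labels)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies, with replacement, to reach the target count.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for item in items + extra:
            out_samples.append(item)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical image paths and labels, for illustration only.
paths = ["img1.jpg", "img2.jpg", "img3.jpg", "img4.jpg"]
labels = ["nevus", "nevus", "nevus", "melanoma"]
balanced_paths, balanced_labels = oversample(paths, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that the validation and test metrics (ROC-AUC, confusion matrix) still reflect the real class distribution.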

TensorFlow CNN Python ROC-AUC

Automatic Ticket Classification

NLP | Multi-Algorithm Comparison

Built an NLP classification system for automatic customer complaint routing. Implemented and compared RNN, LSTM, GRU, Random Forest, and SVM models using Word2Vec and GloVe embeddings. Achieved 91.2% accuracy with a preprocessing pipeline that includes multilingual translation and lemmatization.

91.2% Accuracy
RNN LSTM Word2Vec GloVe NLP

Image Captioning with Transformers

Vision + Language | Streamlit App

Built a transformer-based image captioning system using VisionEncoderDecoderModel. Deployed it as an interactive Streamlit application on Hugging Face Spaces, with ViTImageProcessor handling image preprocessing and AutoTokenizer handling caption text.

Transformers ViT Streamlit HuggingFace