This work reflects my background in large-scale ETL, data transformation, and applied analytics, which now informs how I design robust AI systems and retrieval pipelines.
Government Data Processing — India Data Portal
ISB Hyderabad | Large-Scale ETL
Processed and transformed 500GB+ of public-sector data from multiple government portals, including MGNREGA physical, financial, and mandays datasets.
Built end-to-end ETL workflows using PySpark for distributed processing and Pandas for downstream transformation, including long-to-wide restructuring for portal integration.
Stored processed outputs in Parquet format on cloud object storage to support reliable large-scale data access and downstream analytics.
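The long-to-wide restructuring mentioned above can be illustrated with a small, library-free sketch. The column names (`district`, `indicator`, `value`) and the sample rows are hypothetical, and the original workflow used Pandas; the plain-Python function below only mirrors the shape of the reshape that a pivot performs.

```python
from collections import defaultdict

def long_to_wide(rows, index_key, column_key, value_key):
    """Pivot long-format records into one wide record per index value."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index_key]][row[column_key]] = row[value_key]
    # Emit one dict per index value, with the index column first.
    return [{index_key: idx, **cols} for idx, cols in wide.items()]

# Hypothetical long-format sample: one row per (district, indicator) pair.
long_rows = [
    {"district": "A", "indicator": "households", "value": 120},
    {"district": "A", "indicator": "persondays", "value": 4300},
    {"district": "B", "indicator": "households", "value": 95},
    {"district": "B", "indicator": "persondays", "value": 3100},
]

wide_rows = long_to_wide(long_rows, "district", "indicator", "value")
```

In Pandas, the equivalent step is typically a `DataFrame.pivot` or `pivot_table` call on the same three columns.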
Import Export Data — National Trade ETL Pipeline
ISB Hyderabad | 16-Year Dataset
Built an ETL pipeline extracting import-export data from the Trade Statistics portal of the Ministry of Commerce, Government of India. Produced a 100GB+ dataset spanning 16 years of monthly trade data at the country level. Used ThreadPoolExecutor to parallelize data processing, significantly reducing execution time.
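A minimal sketch of the ThreadPoolExecutor pattern described above, assuming the work splits into one independent task per (year, month) pair. `fetch_month`, the year range, and the worker count are hypothetical placeholders, not the pipeline's actual code; a real version would issue the portal request inside `fetch_month`.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_month(year, month):
    """Hypothetical stand-in for one portal request; returns a placeholder record."""
    return {"year": year, "month": month, "rows": []}

def fetch_all(years, max_workers=8):
    """Fetch every (year, month) pair concurrently, preserving input order."""
    tasks = [(y, m) for y in years for m in range(1, 13)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the submitted tasks.
        return list(pool.map(lambda t: fetch_month(*t), tasks))

# Illustrative two-year range only.
results = fetch_all(range(2006, 2008))
```

Threads suit this kind of workload when each task is I/O-bound (for example, waiting on HTTP responses); CPU-bound transformation would usually call for processes instead.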
Real-Time Stock Market Data with Kafka
Streaming Data Engineering
Built a real-time stock market data system using Apache Kafka for streaming ingestion. Leveraged AWS Glue for schema management and Athena for SQL-based analytics on streaming data. Implemented performance optimization and error handling for reliable real-time processing.
Drone Detection Model
Defense Sector | YOLOv3 Computer Vision
Designed and trained a real-time drone detection model for defense applications using YOLOv3 with PTZ camera and sensor integration. Built the image dataset via automated web scraping, integrated the detection system with jammer hardware, and reduced PTZ system cost by one-third versus alternatives.
Melanoma Detection
Medical AI | CNN Classification
Built a CNN-based melanoma detection model from skin images using TensorFlow. Implemented multiclass classification with a custom architecture, handled class imbalance through targeted sampling, and evaluated the model using ROC-AUC and confusion matrices.
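The description above does not specify the sampling scheme, so the sketch below assumes plain random oversampling of minority classes, one common way to handle class imbalance. The image identifiers and class labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra samples with replacement to fill the gap to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical image identifiers and class labels.
samples = ["img0", "img1", "img2", "img3", "img4", "img5"]
labels = ["nevus", "nevus", "nevus", "nevus", "melanoma", "bcc"]
balanced_samples, balanced_labels = oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only, so that duplicated images do not leak into validation or test data.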
Automatic Ticket Classification
NLP | Multi-Algorithm Comparison
Built an NLP classification system for automatic customer complaint routing. Implemented and compared RNN, LSTM, GRU, Random Forest, and SVM with word2vec and GloVe embeddings. Delivered 91.2% accuracy with a preprocessing pipeline including multilingual translation and lemmatization.
Image Captioning with Transformers
Vision + Language | Streamlit App
Built a transformer-based image captioning system using VisionEncoderDecoderModel, with ViTImageProcessor for image preprocessing and AutoTokenizer for decoding generated captions. Deployed as an interactive Streamlit application on Hugging Face Spaces.