Data Engineer Projects

Title: Government Data Processing and Integration for India Data Portal

Technology Stack: Python, BeautifulSoup (bs4), Selenium, PySpark, Pandas, Parquet, Wasabi cloud storage

Problem Statement:

The project aimed to process and clean over 500GB of data sourced from various government websites, specifically focusing on datasets such as MNREGA (Mahatma Gandhi National Rural Employment Guarantee Act) Physical, Financial, Mandays, and others. The primary challenge was to efficiently gather, clean, and transform the data into a suitable format for integration with the India Data Portal.

Steps Followed:

  1. Data Collection:
    • Utilized web scraping techniques with BeautifulSoup (bs4) and Selenium to extract data from government websites.
  2. Data Processing:
    • Leveraged PySpark for distributed data processing to handle the large dataset efficiently.
    • Used Pandas for certain data manipulation tasks, ensuring accuracy and ease of handling.
  3. Quality Check:
    • Conducted a rigorous quality check by comparing processed data with the source to ensure accuracy and consistency.
  4. Data Transformation:
    • Applied the necessary transformations, including converting data from long to wide format, aligning with the requirements of the India Data Portal.
  5. Data Storage:
    • Converted the final processed data into the Parquet file format, optimizing storage and query performance.
  6. Cloud Storage and Accessibility:
    • Uploaded the processed data to Wasabi, making it readily available and accessible for integration with the India Data Portal.
  7. Project Naming:
    • The processed datasets were categorized into distinct project names, such as MNREGA Physical, MNREGA Financial, MNREGA Mandays, and others, facilitating easy identification and organization.
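Step 1 can be illustrated with a minimal BeautifulSoup sketch. The inline HTML table below is a stand-in for a fetched government report page; the real project also used Selenium for pages that render data with JavaScript, and the column names here are illustrative only:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched report page; real pages were retrieved with
# requests or Selenium before being handed to BeautifulSoup.
html = """
<table id="report">
  <tr><th>District</th><th>Mandays</th></tr>
  <tr><td>Patna</td><td>4500</td></tr>
  <tr><td>Gaya</td><td>2100</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("#report tr")[1:]:  # skip the header row
    district, mandays = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({"district": district, "mandays": int(mandays)})
```

Each scraped table becomes a list of records that can be handed to the Pandas/PySpark processing stage.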

Conclusion:

This project addressed the challenges of handling massive government datasets: ensuring data accuracy and transforming them into a format compatible with the India Data Portal. A robust technology stack, including PySpark and Pandas, together with cloud storage on Wasabi, enabled efficient processing and easy access to the data.
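The long-to-wide transformation from step 4 can be sketched with Pandas; the column names and values below are illustrative, and at full scale the project performed the same operations in PySpark:

```python
import pandas as pd

# Long-format sample resembling MNREGA data (illustrative values only)
long_df = pd.DataFrame({
    "state": ["Bihar", "Bihar", "Kerala", "Kerala"],
    "indicator": ["households_employed", "persondays",
                  "households_employed", "persondays"],
    "value": [120, 4500, 80, 2100],
})

# Long-to-wide: one column per indicator, one row per state
wide_df = long_df.pivot(index="state", columns="indicator",
                        values="value").reset_index()

# Step 5 (sketch): write the result as Parquet for compact, columnar storage
# wide_df.to_parquet("mnrega_physical.parquet")

# Step 6 (sketch): Wasabi is S3-compatible, so boto3 can upload the file,
# e.g. boto3.client("s3", endpoint_url="https://s3.wasabisys.com", ...)
#          .upload_file("mnrega_physical.parquet", "<bucket>", "<key>")
```

The Parquet and Wasabi lines are left as comments because bucket names and credentials are deployment-specific.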

Title: Flask API for LGD Data Processing

Technology Stack: Python, Flask, JSON

Description: Developed a Flask API for processing LGD (Local Government Directory) related JSON data in India. The API supports processing raw LGD data, retrieving state mappings, and creating a mapped dataset from user-defined inputs.

Key Features

Endpoints

POST /process_json: Processes raw LGD JSON data.

GET /state_mappings: Retrieves the available state mappings.

POST /create_mapped_dataset: Creates a mapped dataset based on user-defined inputs.
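A minimal sketch of how these three endpoints could be wired up in Flask. The route logic, payload shapes, and the in-memory state mapping are illustrative stand-ins, not the real LGD processing:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory mapping; the real API reads LGD reference data
STATE_MAPPINGS = {"10": "Bihar", "32": "Kerala"}

@app.route("/process_json", methods=["POST"])
def process_json():
    payload = request.get_json(force=True)
    # Stand-in for the real LGD processing step
    return jsonify({"records_processed": len(payload.get("records", []))})

@app.route("/state_mappings", methods=["GET"])
def state_mappings():
    return jsonify(STATE_MAPPINGS)

@app.route("/create_mapped_dataset", methods=["POST"])
def create_mapped_dataset():
    payload = request.get_json(force=True)
    # Attach the mapped state name to each user-supplied record
    mapped = [
        {**row, "state_name": STATE_MAPPINGS.get(row.get("state_code"), "UNKNOWN")}
        for row in payload.get("records", [])
    ]
    return jsonify({"mapped": mapped})
```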

Examples

Processing LGD Data

Retrieving State Mappings

Creating Mapped Dataset
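The three examples above can be exercised as client calls. The sketch below assumes the API is running locally on port 5000; the base URL, helper names, and payload shapes are hypothetical:

```python
import requests

BASE = "http://localhost:5000"  # assumed local development address

def process_lgd(records):
    """POST raw LGD JSON for processing (payload shape is illustrative)."""
    return requests.post(f"{BASE}/process_json", json={"records": records}).json()

def get_state_mappings():
    """GET the available state mappings."""
    return requests.get(f"{BASE}/state_mappings").json()

def create_mapped_dataset(records):
    """POST user-defined records to receive them back with mapped state names."""
    return requests.post(f"{BASE}/create_mapped_dataset",
                         json={"records": records}).json()

if __name__ == "__main__":
    print(get_state_mappings())
```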

Title: Import Export Data

Title: Real-Time Stock Market Data with Kafka

Title: MinDepExpScrapper

Title: PMAY Data Scrapper
