diff --git a/docs/database_utils.md b/docs/database_utils.md new file mode 100644 index 0000000..fd1162f --- /dev/null +++ b/docs/database_utils.md @@ -0,0 +1,487 @@ +# Database Utilities – Grade Snapshot System + +## Overview + +This module implements a **ClickHouse-backed grade snapshot and diffing system**. +It ingests grade data from an external API, persists **immutable snapshots**, tracks **stable entities** (users, classes, assignments), and computes **changes over time** (new / updated / removed grades). + +The design emphasizes: + +* **Idempotent ingestion** +* **Historical accuracy** +* **Efficient change detection** +* **Append-only semantics** (ClickHouse-friendly) + +All functionality lives in the `database_utils` namespace. + +--- + +## Core Concepts + +### 1. Stable Entities vs Snapshots + +| Concept | Description | +| ----------------------- | ---------------------------------------------------------- | +| **User** | A logical account | +| **Class** | A course belonging to a user (stable across time) | +| **Assignment** | A specific graded item within a class (stable across time) | +| **Snapshot (response)** | A point-in-time capture of all grades returned by the API | +| **Grade history** | Per-assignment grades linked to a snapshot | +| **Diffs** | Computed changes between two snapshots | + +Stable entities are **created once and reused**. +Snapshots are **immutable** and **time-ordered**. + +--- + +## UUID Utilities + +### `parse_uuid(string) → UUID` + +Parses a standard UUID string (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) into ClickHouse’s `UUID { high, low }` format. + +* Validates format +* Throws on malformed input +* Used everywhere UUIDs enter the DB + +### `uuid_to_string(UUID) → string` + +Converts ClickHouse UUIDs back into standard string format. + +--- + +## Date Handling + +### `parse_date_to_clickhouse(string) → uint16_t` + +Converts an API date string (`YYYY-MM-DD`) into ClickHouse `Date` format (days since Unix epoch). + +* Empty or invalid dates → epoch (`0`) +* Logs parsing failures instead of throwing + +--- + +## Database Handle + +```cpp +using CHClient = std::shared_ptr; +``` + +All functions accept a shared ClickHouse client to allow: + +* Connection reuse +* Thread-safe sharing +* Easy dependency injection + +--- + +## User Operations + +### `get_all_users()` + +Returns all users with: + +* `user_id` +* `username` +* `password` + +Primarily for administration/debugging. + +--- + +### `register_user(username, password)` + +Inserts a new user row. + +⚠️ **Note:** Passwords are currently stored in plaintext. +Hashing should be added before production use. + +--- + +### `authenticate_user(username, password)` + +Returns `true` if a matching user exists. + +* Uses `count()` for minimal payload +* Simple boolean authentication check + +--- + +### `get_user_uuid(username)` + +Returns the user’s UUID if found. + +--- + +## Snapshot Insertion Flow + +### `insert_grade_snapshot(user_id, api_response) → response_id` + +This is the **main ingestion pipeline**. + +#### Step-by-Step Flow + +1. **Insert `grade_responses`** + + * One row per API fetch + * Contains metadata (`success`, `total_classes`, timestamp) + +2. **Fetch generated `response_id`** + + * Most recent response for that user + +3. **Process each class** + + * `get_or_create_class()` + * Ensures a stable `class_id` + * Links class to the response (`response_classes`) + +4. **Process assignments** + + * `get_or_create_assignment()` + * Ensures stable `assignment_id` + +5. **Insert grade history** + + * Batched inserts into `assignment_grade_history` + * Each row ties: + + * response + * assignment + * score + * attempts + +Snapshots are **never updated**, only appended. + +--- + +## Stable Entity Management + +### `get_or_create_class(user_id, class_data) → class_id` + +* Searches by `(user_id, class_name)` +* If found: + + * Updates metadata (teacher, period, category) +* If not found: + + * Inserts a new class record +* Returns the stable `class_id` + +--- + +### `get_or_create_assignment(user_id, class_id, assignment_data) → assignment_id` + +* Searches by `(user_id, class_id, assignment_name)` +* Updates due date / major flag if it exists +* Otherwise inserts a new assignment +* Returns stable `assignment_id` + +--- + +## Snapshot Loading + +### `load_latest_snapshot(user_id)` + +Loads the most recent snapshot for a user. + +Returns `std::nullopt` if none exists. + +--- + +### `load_snapshot_by_id(response_id)` + +Loads a **fully hydrated snapshot**, including: + +* User +* Classes +* Assignments +* Grades + +#### Result Structure + +```text +GradeSnapshot +├── response_id +├── user_id +├── classes[class_name] -> ClassRecord +├── assignments[class::assignment] -> AssignmentRecord +└── grades[assignment_id] -> GradeRecord +``` + +Used for: + +* Diffing +* UI display +* Historical comparisons + +--- + +## Change Detection + +### `has_changes(user_id, new_api_response) → bool` + +Fast pre-check before inserting a new snapshot. + +Detects: + +* New assignments +* Removed assignments +* Score changes +* Attempt changes + +If no prior snapshot exists → **changes detected**. + +--- + +## Snapshot Diffing + +### `diff_snapshots(old, new) → vector` + +Computes **semantic differences** between two snapshots. + +#### Change Types + +| Type | Meaning | +| --------- | ------------------------------- | +| `NEW` | Assignment did not exist before | +| `UPDATED` | Score or attempts changed | +| `REMOVED` | Assignment disappeared | + +Each diff includes: + +* Assignment ID +* Class name +* Assignment name +* Old grade (optional) +* New grade + +--- + +## Grade Update Logging + +### `insert_grade_updates(user_id, old_response_id, new_response_id, diffs)` + +Persists diffs into `grade_updates`. + +#### Features + +* Nullable old values for new assignments +* Placeholder values for removed assignments +* Compact enum encoding for change type +* Batched insert for efficiency + +This table provides a **clear audit trail** of grade changes over time. + +--- + +## Assignment Key Strategy + +```cpp +"class_name::assignment_name" +``` + +Used as a human-stable lookup key when comparing snapshots. + +* Avoids relying on database IDs during diffing +* Keeps logic resilient to ID reuse or refactors + +--- + +## Design Guarantees + +* ✅ Snapshots are immutable +* ✅ Stable IDs persist across time +* ✅ Changes are explicitly logged +* ✅ Efficient ClickHouse-friendly inserts +* ⚠️ SQL string concatenation is used (should be parameterized later) +* ⚠️ Password hashing not implemented + +# `database_utils` – Function Documentation + +--- + +## **UUID Utilities** + +### `clickhouse::UUID parse_uuid(const std::string& str)` + +* **Purpose:** Converts a UUID string (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) into a ClickHouse `UUID` object. +* **Input:** `str` – UUID string. +* **Output:** `clickhouse::UUID` (high/low 64-bit parts). +* **Behavior:** + + * Throws `std::runtime_error` if string is not 36 characters or malformed. + * Removes hyphens and parses hex into two 64-bit integers. +* **Usage:** Any place UUID strings need to be stored in ClickHouse. + +--- + +### `std::string uuid_to_string(const clickhouse::UUID& u)` + +* **Purpose:** Converts a ClickHouse `UUID` back into a human-readable UUID string. +* **Input:** `u` – ClickHouse UUID. +* **Output:** Standard UUID string. +* **Behavior:** Formats `high` and `low` 64-bit integers as a zero-padded 36-character string with hyphens. + +--- + +## **Date Utilities** + +### `uint16_t parse_date_to_clickhouse(const std::string& date_str)` + +* **Purpose:** Converts a date string from the API into ClickHouse `Date` format. +* **Input:** `date_str` – string in `YYYY-MM-DD` format. +* **Output:** `uint16_t` representing days since 1970-01-01. +* **Behavior:** + + * Empty or invalid strings return `0` (epoch). + * Logs warnings for parse failures. +* **Usage:** Assignments’ `due_date` conversion. + +--- + +## **User Operations** + +### `std::vector get_all_users(const CHClient& client)` + +* **Purpose:** Retrieves all users from the database. +* **Input:** `client` – ClickHouse client. +* **Output:** Vector of `UserRecord` (includes `user_id`, `login` info). +* **Behavior:** Logs batch size and total retrieved. + +--- + +### `bool register_user(const CHClient& client, const std::string& username, const std::string& password)` + +* **Purpose:** Inserts a new user into the database. +* **Input:** `username`, `password`. +* **Output:** `true` on success. +* **Behavior:** Currently stores passwords in plaintext; logs success. + +--- + +### `bool authenticate_user(const CHClient& client, const std::string& username, const std::string& password)` + +* **Purpose:** Checks if a user exists with given credentials. +* **Input:** `username`, `password`. +* **Output:** `true` if valid, `false` otherwise. +* **Behavior:** Uses `SELECT count()` for boolean check. Logs results. + +--- + +### `std::optional get_user_uuid(const CHClient& client, const std::string& username)` + +* **Purpose:** Retrieves a user’s UUID based on their username. +* **Input:** `username`. +* **Output:** `optional`; empty if user not found. +* **Behavior:** Uses `LIMIT 1` for efficiency. + +--- + +## **Class & Assignment Management** + +### `std::string get_or_create_class(const CHClient& client, const std::string& user_id, const api_utils::ClassGrades& class_data)` + +* **Purpose:** Ensures a stable `class_id` exists for a user. +* **Input:** `user_id`, `class_data` (name, teacher, period, category). +* **Output:** `class_id` as string. +* **Behavior:** + + * Searches for existing class. + * Updates metadata if found. + * Inserts new record if not found. + * Links class to the user in `user_classes`. + +--- + +### `std::string get_or_create_assignment(const CHClient& client, const std::string& user_id, const std::string& class_id, const api_utils::AssignmentGrade& assignment_data)` + +* **Purpose:** Ensures a stable `assignment_id` exists within a class. +* **Input:** `user_id`, `class_id`, `assignment_data` (name, dueDate, isMajorGrade). +* **Output:** `assignment_id` as string. +* **Behavior:** + + * Updates existing assignment if found. + * Inserts a new assignment if not found. + * Uses `parse_date_to_clickhouse()` for due date. + +--- + +## **Snapshot Insertion** + +### `std::string insert_grade_snapshot(const CHClient& client, const std::string& user_id, const api_utils::GradesResponse& api_response)` + +* **Purpose:** Inserts a complete snapshot of grades from the API. +* **Input:** `user_id`, `api_response` (success flag, total classes, grades per class/assignment). +* **Output:** `response_id` of the inserted snapshot; empty string on failure. +* **Behavior:** + + 1. Inserts metadata into `grade_responses`. + 2. Retrieves `response_id`. + 3. Processes each class: + + * Calls `get_or_create_class()` + * Links to response in `response_classes` + 4. Processes each assignment: + + * Calls `get_or_create_assignment()` + * Inserts grades into `assignment_grade_history`. +* **Notes:** Immutable snapshots; append-only. + +--- + +## **Snapshot Loading** + +### `std::optional load_latest_snapshot(const CHClient& client, const std::string& user_id)` + +* **Purpose:** Loads the most recent snapshot for a user. +* **Output:** Fully populated `GradeSnapshot` or `nullopt` if none exists. +* **Behavior:** Uses `fetched_at DESC LIMIT 1`. + +--- + +### `std::optional load_snapshot_by_id(const CHClient& client, const std::string& response_id)` + +* **Purpose:** Loads a snapshot by `response_id`. +* **Output:** `GradeSnapshot` including: + + * Classes + * Assignments + * Grades +* **Behavior:** Joins `user_classes`, `user_assignments`, `assignment_grade_history`, `response_classes`. + +--- + +## **Diffing & Change Detection** + +### `bool has_changes(const CHClient& client, const std::string& user_id, const api_utils::GradesResponse& new_api_response)` + +* **Purpose:** Detects if new API response differs from the latest snapshot. +* **Output:** `true` if changes exist, `false` otherwise. +* **Checks for:** + + * New assignments + * Removed assignments + * Score/attempt changes + +--- + +### `std::vector diff_snapshots(const GradeSnapshot& old_snapshot, const GradeSnapshot& new_snapshot)` + +* **Purpose:** Returns detailed differences between two snapshots. +* **Output:** Vector of `AssignmentDiff`. +* **Change Types:** `NEW`, `UPDATED`, `REMOVED`. +* **Behavior:** Compares old and new grades per assignment key. + +--- + +### `void insert_grade_updates(const CHClient& client, const std::string& user_id, const std::string& old_response_id, const std::string& new_response_id, const std::vector& diffs)` + +* **Purpose:** Inserts diffs into `grade_updates` table. +* **Behavior:** + + * Maps `AssignmentDiff` to ClickHouse columns. + * Handles nullable old values for new assignments. + * Uses placeholder values for removed assignments. + * Logs number of inserted updates. +