# Database Utilities: Grade Snapshot System
## Overview
This module implements a **ClickHouse-backed grade snapshot and diffing system**.
It ingests grade data from an external API, persists **immutable snapshots**, tracks **stable entities** (users, classes, assignments), and computes **changes over time** (new / updated / removed grades).
The design emphasizes:
* **Idempotent ingestion**
* **Historical accuracy**
* **Efficient change detection**
* **Append-only semantics** (ClickHouse-friendly)
All functionality lives in the `database_utils` namespace.
---
## Core Concepts
### 1. Stable Entities vs Snapshots
| Concept | Description |
| ----------------------- | ---------------------------------------------------------- |
| **User** | A logical account |
| **Class** | A course belonging to a user (stable across time) |
| **Assignment** | A specific graded item within a class (stable across time) |
| **Snapshot (response)** | A point-in-time capture of all grades returned by the API |
| **Grade history** | Per-assignment grades linked to a snapshot |
| **Diffs** | Computed changes between two snapshots |
Stable entities are **created once and reused**.
Snapshots are **immutable** and **time-ordered**.
---
## UUID Utilities
### `parse_uuid(string) → UUID`
Parses a standard UUID string (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) into ClickHouse's `UUID { high, low }` format.
* Validates format
* Throws on malformed input
* Used everywhere UUIDs enter the DB
### `uuid_to_string(UUID) → string`
Converts ClickHouse UUIDs back into standard string format.
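
The round trip between the two helpers can be sketched as follows. This is a minimal illustration, not the module's code: `Uuid128`, `parse_uuid_sketch`, and `uuid_to_string_sketch` are hypothetical names, and the assumption is that the first 16 hex digits land in `high` and the last 16 in `low`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the ClickHouse UUID value (two 64-bit halves).
struct Uuid128 {
    std::uint64_t high;
    std::uint64_t low;
};

// Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"; throws on malformed input.
Uuid128 parse_uuid_sketch(const std::string& str) {
    if (str.size() != 36) {
        throw std::runtime_error("UUID must be 36 characters: " + str);
    }
    std::uint64_t parts[2] = {0, 0};
    int digits = 0;  // hex digits consumed so far (0..32)
    for (std::size_t i = 0; i < str.size(); ++i) {
        const char c = str[i];
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (c != '-') throw std::runtime_error("Malformed UUID: " + str);
            continue;
        }
        std::uint64_t nibble = 0;
        if (c >= '0' && c <= '9') nibble = static_cast<std::uint64_t>(c - '0');
        else if (c >= 'a' && c <= 'f') nibble = static_cast<std::uint64_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') nibble = static_cast<std::uint64_t>(c - 'A' + 10);
        else throw std::runtime_error("Malformed UUID: " + str);
        // The first 16 digits fill parts[0] (high), the rest parts[1] (low).
        parts[digits / 16] = (parts[digits / 16] << 4) | nibble;
        ++digits;
    }
    return Uuid128{parts[0], parts[1]};
}

// Format the two halves back into the zero-padded, lowercase 36-character form.
std::string uuid_to_string_sketch(const Uuid128& u) {
    char buf[37];
    std::snprintf(buf, sizeof(buf), "%08llx-%04llx-%04llx-%04llx-%012llx",
                  static_cast<unsigned long long>(u.high >> 32),
                  static_cast<unsigned long long>((u.high >> 16) & 0xFFFFULL),
                  static_cast<unsigned long long>(u.high & 0xFFFFULL),
                  static_cast<unsigned long long>(u.low >> 48),
                  static_cast<unsigned long long>(u.low & 0xFFFFFFFFFFFFULL));
    return std::string(buf);
}
```

Under these assumptions, `uuid_to_string_sketch(parse_uuid_sketch(s))` returns `s` for any well-formed lowercase UUID string.
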
---
## Date Handling
### `parse_date_to_clickhouse(string) → uint16_t`
Converts an API date string (`YYYY-MM-DD`) into ClickHouse `Date` format (days since Unix epoch).
* Empty or invalid dates → epoch (`0`)
* Logs parsing failures instead of throwing
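
One way to get "days since the Unix epoch" without depending on the local time zone is plain calendar arithmetic. The sketch below is an assumption about how such a conversion could look, not the module's implementation; `days_from_civil` and `parse_date_sketch` are hypothetical names, and the range check only rejects impossible months and days (it does not know that February has fewer than 31 days).

```cpp
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <string>

// Days from 1970-01-01 to the given date in the proleptic Gregorian calendar.
static long days_from_civil(int y, unsigned m, unsigned d) {
    y -= m <= 2;  // treat January and February as months of the previous year
    const long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);             // [0, 399]
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;  // [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;            // [0, 146096]
    return era * 146097 + static_cast<long>(doe) - 719468;
}

// Convert "YYYY-MM-DD" to a 16-bit day count; empty or invalid input yields 0.
std::uint16_t parse_date_sketch(const std::string& date_str) {
    if (date_str.empty()) return 0;
    int y = 0;
    unsigned m = 0, d = 0;
    char trailing = 0;
    // Exactly three fields must parse; a fourth match means trailing garbage.
    const int fields = std::sscanf(date_str.c_str(), "%4d-%2u-%2u%c", &y, &m, &d, &trailing);
    if (fields != 3 || m < 1 || m > 12 || d < 1 || d > 31) {
        std::cerr << "warning: could not parse date '" << date_str << "'\n";
        return 0;
    }
    const long days = days_from_civil(y, m, d);
    if (days < 0 || days > 65535) {  // outside what a 16-bit day count can hold
        std::cerr << "warning: date out of range '" << date_str << "'\n";
        return 0;
    }
    return static_cast<std::uint16_t>(days);
}
```

Because failures collapse to `0`, a stored value of `0` is ambiguous between "1970-01-01" and "no usable date"; callers that need to tell them apart would have to check the input string themselves.
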
---
## Database Handle
```cpp
using CHClient = std::shared_ptr<clickhouse::Client>;
```
All functions accept a shared ClickHouse client to allow:
* Connection reuse
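* Shared ownership across components (the `shared_ptr` itself is safe to copy between threads, but a `clickhouse::Client` instance is not thread-safe, so concurrent use needs external locking or one client per thread)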
* Easy dependency injection
---
## User Operations
### `get_all_users()`
Returns all users with:
* `user_id`
* `username`
* `password`
Primarily for administration/debugging.
---
### `register_user(username, password)`
Inserts a new user row.
⚠️ **Note:** Passwords are currently stored in plaintext.
Hashing should be added before production use.
---
### `authenticate_user(username, password)`
Returns `true` if a matching user exists.
* Uses `count()` for minimal payload
* Simple boolean authentication check
---
### `get_user_uuid(username)`
Returns the user's UUID if found.
---
## Snapshot Insertion Flow
### `insert_grade_snapshot(user_id, api_response) → response_id`
This is the **main ingestion pipeline**.
#### Step-by-Step Flow
1. **Insert `grade_responses`**
   * One row per API fetch
   * Contains metadata (`success`, `total_classes`, timestamp)
2. **Fetch generated `response_id`**
   * Most recent response for that user
3. **Process each class**
   * `get_or_create_class()`
   * Ensures a stable `class_id`
   * Links class to the response (`response_classes`)
4. **Process assignments**
   * `get_or_create_assignment()`
   * Ensures stable `assignment_id`
5. **Insert grade history**
   * Batched inserts into `assignment_grade_history`
   * Each row ties:
     * response
     * assignment
     * score
     * attempts

Snapshots are **never updated**, only appended.
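
The ordering of the five steps can be illustrated with an in-memory stand-in for the tables. Everything here is a simplified assumption: `StoreSketch`, `ApiClassSketch`, and the `resp-N` / `class-N` ID scheme are hypothetical, a single user is assumed, and no ClickHouse calls are made.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified shapes of the API payload.
struct ApiAssignmentSketch { std::string name; double score; int attempts; };
struct ApiClassSketch { std::string name; std::vector<ApiAssignmentSketch> assignments; };

struct HistoryRowSketch {
    std::string response_id;
    std::string assignment_id;
    double score;
    int attempts;
};

// In-memory stand-in for the tables named in the steps above.
struct StoreSketch {
    std::vector<std::string> grade_responses;                           // one entry per fetch
    std::map<std::string, std::string> class_ids;                       // class_name -> class_id
    std::map<std::string, std::string> assignment_ids;                  // "class_id::name" -> assignment_id
    std::vector<std::pair<std::string, std::string>> response_classes;  // (response_id, class_id)
    std::vector<HistoryRowSketch> assignment_grade_history;
};

// Return the ID stored under key, minting "<prefix>-<n>" the first time it is seen.
static std::string get_or_mint(std::map<std::string, std::string>& ids,
                               const std::string& key, const std::string& prefix) {
    const auto it = ids.find(key);
    if (it != ids.end()) return it->second;
    return ids.emplace(key, prefix + "-" + std::to_string(ids.size() + 1)).first->second;
}

std::string insert_snapshot_sketch(StoreSketch& db, const std::vector<ApiClassSketch>& classes) {
    // Steps 1-2: record the fetch and obtain its response_id.
    const std::string response_id = "resp-" + std::to_string(db.grade_responses.size() + 1);
    db.grade_responses.push_back(response_id);

    std::vector<HistoryRowSketch> batch;  // rows are collected first, then appended once
    for (const auto& cls : classes) {
        // Step 3: stable class ID, linked to this response.
        const std::string class_id = get_or_mint(db.class_ids, cls.name, "class");
        db.response_classes.emplace_back(response_id, class_id);
        for (const auto& a : cls.assignments) {
            // Step 4: stable assignment ID within the class.
            const std::string assignment_id =
                get_or_mint(db.assignment_ids, class_id + "::" + a.name, "assignment");
            batch.push_back({response_id, assignment_id, a.score, a.attempts});
        }
    }
    // Step 5: a single append of all history rows; existing rows are never touched.
    db.assignment_grade_history.insert(db.assignment_grade_history.end(),
                                       batch.begin(), batch.end());
    return response_id;
}
```

Ingesting the same payload twice in this sketch yields two responses and twice the history rows, while the class and assignment IDs stay the same.
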
---
## Stable Entity Management
### `get_or_create_class(user_id, class_data) → class_id`
* Searches by `(user_id, class_name)`
* If found:
  * Updates metadata (teacher, period, category)
* If not found:
  * Inserts a new class record
* Returns the stable `class_id`
---
### `get_or_create_assignment(user_id, class_id, assignment_data) → assignment_id`
* Searches by `(user_id, class_id, assignment_name)`
* Updates due date / major flag if it exists
* Otherwise inserts a new assignment
* Returns stable `assignment_id`
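
Both helpers follow the same get-or-create pattern: look up by a natural key, refresh metadata on a hit, mint an ID on a miss. A minimal in-memory sketch of that pattern for classes is shown below; `ClassRegistrySketch`, `ClassMetaSketch`, and the `class-N` ID format are hypothetical and stand in for the real table and UUIDs.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical subset of the class metadata mentioned above.
struct ClassMetaSketch {
    std::string teacher;
    std::string period;
};

class ClassRegistrySketch {
public:
    // Returns the stable ID for (user_id, class_name), creating it on first sight.
    std::string get_or_create(const std::string& user_id, const std::string& class_name,
                              const ClassMetaSketch& meta) {
        const auto key = std::make_pair(user_id, class_name);
        auto it = rows_.find(key);
        if (it != rows_.end()) {
            it->second.meta = meta;  // found: refresh metadata, keep the ID
            return it->second.id;
        }
        const std::string id = "class-" + std::to_string(rows_.size() + 1);
        rows_.emplace(key, Row{id, meta});  // not found: insert a new record
        return id;
    }

    // Teacher currently stored for the class, or an empty string if unknown.
    std::string teacher_of(const std::string& user_id, const std::string& class_name) const {
        const auto it = rows_.find(std::make_pair(user_id, class_name));
        return it == rows_.end() ? std::string() : it->second.meta.teacher;
    }

private:
    struct Row {
        std::string id;
        ClassMetaSketch meta;
    };
    std::map<std::pair<std::string, std::string>, Row> rows_;
};
```

Calling `get_or_create` twice with the same `(user_id, class_name)` returns the same ID even when the teacher or period changed in between, which is what keeps IDs stable across snapshots.
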
---
## Snapshot Loading
### `load_latest_snapshot(user_id)`
Loads the most recent snapshot for a user.
Returns `std::nullopt` if none exists.
---
### `load_snapshot_by_id(response_id)`
Loads a **fully hydrated snapshot**, including:
* User
* Classes
* Assignments
* Grades
#### Result Structure
```text
GradeSnapshot
├── response_id
├── user_id
├── classes[class_name] -> ClassRecord
├── assignments[class::assignment] -> AssignmentRecord
└── grades[assignment_id] -> GradeRecord
```
Used for:
* Diffing
* UI display
* Historical comparisons
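
The three maps in the structure above chain together: a name-based key finds the assignment, whose ID then finds the grade. The sketch below shows that lookup with assumed field types; `SnapshotSketch` and its record types are hypothetical simplifications of `GradeSnapshot`, `ClassRecord`, `AssignmentRecord`, and `GradeRecord`, whose real fields are not listed in this document.

```cpp
#include <map>
#include <string>

// Hypothetical, reduced versions of the record types.
struct ClassRecordSketch { std::string class_id; std::string class_name; };
struct AssignmentRecordSketch {
    std::string assignment_id;
    std::string class_name;
    std::string assignment_name;
};
struct GradeRecordSketch { double score; int attempts; };

struct SnapshotSketch {
    std::string response_id;
    std::string user_id;
    std::map<std::string, ClassRecordSketch> classes;           // keyed by class_name
    std::map<std::string, AssignmentRecordSketch> assignments;  // keyed by "class::assignment"
    std::map<std::string, GradeRecordSketch> grades;            // keyed by assignment_id
};

// Two-step lookup: name-based key -> assignment_id -> grade.
// Returns nullptr when either the assignment or its grade is missing.
const GradeRecordSketch* find_grade(const SnapshotSketch& snap,
                                    const std::string& class_name,
                                    const std::string& assignment_name) {
    const auto a = snap.assignments.find(class_name + "::" + assignment_name);
    if (a == snap.assignments.end()) return nullptr;
    const auto g = snap.grades.find(a->second.assignment_id);
    return g == snap.grades.end() ? nullptr : &g->second;
}
```
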
---
## Change Detection
### `has_changes(user_id, new_api_response) → bool`
Fast pre-check before inserting a new snapshot.
Detects:
* New assignments
* Removed assignments
* Score changes
* Attempt changes
If no prior snapshot exists → **changes detected**.
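
The pre-check only needs a yes/no answer, so it can stop at the first difference. A minimal sketch, assuming both sides have already been reduced to a map from assignment key to score and attempts (`GradeMapSketch` and `has_changes_sketch` are hypothetical names, and scores are assumed to be numeric):

```cpp
#include <map>
#include <optional>
#include <string>

struct GradeValueSketch { double score; int attempts; };

// Key: "class_name::assignment_name".
using GradeMapSketch = std::map<std::string, GradeValueSketch>;

bool has_changes_sketch(const std::optional<GradeMapSketch>& previous,
                        const GradeMapSketch& incoming) {
    if (!previous) return true;                            // no prior snapshot
    if (previous->size() != incoming.size()) return true;  // something was added or removed
    for (const auto& [key, grade] : incoming) {
        const auto it = previous->find(key);
        if (it == previous->end()) return true;            // new assignment
        if (it->second.score != grade.score ||
            it->second.attempts != grade.attempts) {
            return true;                                   // score or attempt change
        }
    }
    // Same size and every incoming key matched, so nothing was removed either.
    return false;
}
```
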
---
## Snapshot Diffing
### `diff_snapshots(old, new) → vector<AssignmentDiff>`
Computes **semantic differences** between two snapshots.
#### Change Types
| Type | Meaning |
| --------- | ------------------------------- |
| `NEW` | Assignment did not exist before |
| `UPDATED` | Score or attempts changed |
| `REMOVED` | Assignment disappeared |
Each diff includes:
* Assignment ID
* Class name
* Assignment name
* Old grade (optional)
* New grade
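
The three change types fall out of two passes over the keyed grades: one over the new side (new and updated entries) and one over the old side (removed entries). The sketch below is an illustration under assumptions, not the actual `diff_snapshots`: it works on plain maps rather than full snapshots, and unlike the description above it models the new grade as optional so that removed entries need no placeholder. All `*Sketch` names are hypothetical.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

struct GradeValueSketch { double score; int attempts; };
using GradeMapSketch = std::map<std::string, GradeValueSketch>;  // key: "class::assignment"

enum class ChangeTypeSketch { NEW, UPDATED, REMOVED };

struct DiffSketch {
    ChangeTypeSketch type;
    std::string key;
    std::optional<GradeValueSketch> old_grade;  // empty for NEW
    std::optional<GradeValueSketch> new_grade;  // empty for REMOVED
};

std::vector<DiffSketch> diff_sketch(const GradeMapSketch& old_grades,
                                    const GradeMapSketch& new_grades) {
    std::vector<DiffSketch> out;
    // Pass 1: everything in the new snapshot is either NEW, UPDATED, or unchanged.
    for (const auto& [key, now] : new_grades) {
        const auto it = old_grades.find(key);
        if (it == old_grades.end()) {
            out.push_back({ChangeTypeSketch::NEW, key, std::nullopt, now});
        } else if (it->second.score != now.score || it->second.attempts != now.attempts) {
            out.push_back({ChangeTypeSketch::UPDATED, key, it->second, now});
        }
    }
    // Pass 2: anything only present in the old snapshot was REMOVED.
    for (const auto& [key, before] : old_grades) {
        if (new_grades.find(key) == new_grades.end()) {
            out.push_back({ChangeTypeSketch::REMOVED, key, before, std::nullopt});
        }
    }
    return out;
}
```

Diffing a map against itself returns an empty vector, which is the case the `has_changes` pre-check is meant to short-circuit.
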
---
## Grade Update Logging
### `insert_grade_updates(user_id, old_response_id, new_response_id, diffs)`
Persists diffs into `grade_updates`.
#### Features
* Nullable old values for new assignments
* Placeholder values for removed assignments
* Compact enum encoding for change type
* Batched insert for efficiency
This table provides a **clear audit trail** of grade changes over time.
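
How a diff might be flattened into a row can be sketched as below. The concrete choices are assumptions made for illustration only: the numeric enum codes (1, 2, 3), the `-1.0` placeholder for a removed assignment's new score, and the reduced column set are not taken from the actual `grade_updates` schema, and all `*Sketch` names are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Assumed 8-bit codes for the change type; the real table may use other values.
enum class ChangeKindSketch : std::int8_t { NEW = 1, UPDATED = 2, REMOVED = 3 };

struct DiffInSketch {
    ChangeKindSketch kind;
    std::string assignment_id;
    std::optional<double> old_score;  // empty for NEW
    std::optional<double> new_score;  // empty for REMOVED
};

struct UpdateRowSketch {
    std::int8_t change_type;          // compact enum encoding
    std::string assignment_id;
    std::optional<double> old_score;  // maps to a nullable column
    double new_score;                 // non-nullable, so REMOVED needs a placeholder
};

// Assumed placeholder written when there is no new score.
constexpr double kRemovedPlaceholderSketch = -1.0;

// Build all rows up front so they can be written in one batched insert.
std::vector<UpdateRowSketch> to_update_rows_sketch(const std::vector<DiffInSketch>& diffs) {
    std::vector<UpdateRowSketch> rows;
    rows.reserve(diffs.size());
    for (const auto& d : diffs) {
        rows.push_back({static_cast<std::int8_t>(d.kind), d.assignment_id, d.old_score,
                        d.new_score.value_or(kRemovedPlaceholderSketch)});
    }
    return rows;
}
```
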
---
## Assignment Key Strategy
```cpp
"class_name::assignment_name"
```
Used as a human-stable lookup key when comparing snapshots.
* Avoids relying on database IDs during diffing
* Keeps logic resilient to ID reuse or refactors
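
Building and splitting such a key takes only a few lines. The helper names below are hypothetical; the split assumes the class name itself never contains `::`, since the first occurrence is treated as the separator.

```cpp
#include <optional>
#include <string>
#include <utility>

// Compose the lookup key used when comparing snapshots.
std::string make_assignment_key_sketch(const std::string& class_name,
                                       const std::string& assignment_name) {
    return class_name + "::" + assignment_name;
}

// Split a key at the first "::"; returns nullopt if the separator is missing.
std::optional<std::pair<std::string, std::string>>
split_assignment_key_sketch(const std::string& key) {
    const auto pos = key.find("::");
    if (pos == std::string::npos) return std::nullopt;
    return std::make_pair(key.substr(0, pos), key.substr(pos + 2));
}
```
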
---
## Design Guarantees
* ✅ Snapshots are immutable
* ✅ Stable IDs persist across time
* ✅ Changes are explicitly logged
* ✅ Efficient ClickHouse-friendly inserts
* ⚠️ SQL string concatenation is used (should be parameterized later)
* ⚠️ Password hashing not implemented
# `database_utils` Function Documentation
---
## **UUID Utilities**
### `clickhouse::UUID parse_uuid(const std::string& str)`
* **Purpose:** Converts a UUID string (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) into a ClickHouse `UUID` object.
* **Input:** `str`, a UUID string.
* **Output:** `clickhouse::UUID` (high/low 64-bit parts).
* **Behavior:**
  * Throws `std::runtime_error` if the string is not 36 characters long or is malformed.
  * Removes hyphens and parses the hex digits into two 64-bit integers.
* **Usage:** Any place UUID strings need to be stored in ClickHouse.
---
### `std::string uuid_to_string(const clickhouse::UUID& u)`
* **Purpose:** Converts a ClickHouse `UUID` back into a human-readable UUID string.
* **Input:** `u`, a ClickHouse UUID.
* **Output:** Standard UUID string.
* **Behavior:** Formats `high` and `low` 64-bit integers as a zero-padded 36-character string with hyphens.
---
## **Date Utilities**
### `uint16_t parse_date_to_clickhouse(const std::string& date_str)`
* **Purpose:** Converts a date string from the API into ClickHouse `Date` format.
* **Input:** `date_str`, a string in `YYYY-MM-DD` format.
* **Output:** `uint16_t` representing days since 1970-01-01.
* **Behavior:**
  * Empty or invalid strings return `0` (epoch).
  * Logs warnings for parse failures.
* **Usage:** Converting an assignment's `due_date`.
---
## **User Operations**
### `std::vector<UserRecord> get_all_users(const CHClient& client)`
* **Purpose:** Retrieves all users from the database.
* **Input:** `client`, a ClickHouse client.
* **Output:** Vector of `UserRecord` (includes `user_id`, `login` info).
* **Behavior:** Logs batch size and total retrieved.
---
### `bool register_user(const CHClient& client, const std::string& username, const std::string& password)`
* **Purpose:** Inserts a new user into the database.
* **Input:** `username`, `password`.
* **Output:** `true` on success.
* **Behavior:** Currently stores passwords in plaintext; logs success.
---
### `bool authenticate_user(const CHClient& client, const std::string& username, const std::string& password)`
* **Purpose:** Checks if a user exists with given credentials.
* **Input:** `username`, `password`.
* **Output:** `true` if valid, `false` otherwise.
* **Behavior:** Uses `SELECT count()` for boolean check. Logs results.
---
### `std::optional<clickhouse::UUID> get_user_uuid(const CHClient& client, const std::string& username)`
* **Purpose:** Retrieves a user's UUID based on their username.
* **Input:** `username`.
* **Output:** `optional<UUID>`; empty if user not found.
* **Behavior:** Uses `LIMIT 1` for efficiency.
---
## **Class & Assignment Management**
### `std::string get_or_create_class(const CHClient& client, const std::string& user_id, const api_utils::ClassGrades& class_data)`
* **Purpose:** Ensures a stable `class_id` exists for a user.
* **Input:** `user_id`, `class_data` (name, teacher, period, category).
* **Output:** `class_id` as string.
* **Behavior:**
  * Searches for existing class.
  * Updates metadata if found.
  * Inserts new record if not found.
  * Links class to the user in `user_classes`.
---
### `std::string get_or_create_assignment(const CHClient& client, const std::string& user_id, const std::string& class_id, const api_utils::AssignmentGrade& assignment_data)`
* **Purpose:** Ensures a stable `assignment_id` exists within a class.
* **Input:** `user_id`, `class_id`, `assignment_data` (name, dueDate, isMajorGrade).
* **Output:** `assignment_id` as string.
* **Behavior:**
  * Updates existing assignment if found.
  * Inserts a new assignment if not found.
  * Uses `parse_date_to_clickhouse()` for due date.
---
## **Snapshot Insertion**
### `std::string insert_grade_snapshot(const CHClient& client, const std::string& user_id, const api_utils::GradesResponse& api_response)`
* **Purpose:** Inserts a complete snapshot of grades from the API.
* **Input:** `user_id`, `api_response` (success flag, total classes, grades per class/assignment).
* **Output:** `response_id` of the inserted snapshot; empty string on failure.
* **Behavior:**
  1. Inserts metadata into `grade_responses`.
  2. Retrieves `response_id`.
  3. Processes each class:
     * Calls `get_or_create_class()`
     * Links to response in `response_classes`
  4. Processes each assignment:
     * Calls `get_or_create_assignment()`
     * Inserts grades into `assignment_grade_history`.
* **Notes:** Immutable snapshots; append-only.
---
## **Snapshot Loading**
### `std::optional<GradeSnapshot> load_latest_snapshot(const CHClient& client, const std::string& user_id)`
* **Purpose:** Loads the most recent snapshot for a user.
* **Output:** Fully populated `GradeSnapshot` or `nullopt` if none exists.
* **Behavior:** Orders by `fetched_at DESC` and takes the first row (`LIMIT 1`).
---
### `std::optional<GradeSnapshot> load_snapshot_by_id(const CHClient& client, const std::string& response_id)`
* **Purpose:** Loads a snapshot by `response_id`.
* **Output:** `GradeSnapshot` including:
  * Classes
  * Assignments
  * Grades
* **Behavior:** Joins `user_classes`, `user_assignments`, `assignment_grade_history`, `response_classes`.
---
## **Diffing & Change Detection**
### `bool has_changes(const CHClient& client, const std::string& user_id, const api_utils::GradesResponse& new_api_response)`
* **Purpose:** Detects if new API response differs from the latest snapshot.
* **Output:** `true` if changes exist, `false` otherwise.
* **Checks for:**
  * New assignments
  * Removed assignments
  * Score/attempt changes
---
### `std::vector<AssignmentDiff> diff_snapshots(const GradeSnapshot& old_snapshot, const GradeSnapshot& new_snapshot)`
* **Purpose:** Returns detailed differences between two snapshots.
* **Output:** Vector of `AssignmentDiff`.
* **Change Types:** `NEW`, `UPDATED`, `REMOVED`.
* **Behavior:** Compares old and new grades per assignment key.
---
### `void insert_grade_updates(const CHClient& client, const std::string& user_id, const std::string& old_response_id, const std::string& new_response_id, const std::vector<AssignmentDiff>& diffs)`
* **Purpose:** Inserts diffs into `grade_updates` table.
* **Behavior:**
  * Maps `AssignmentDiff` to ClickHouse columns.
  * Handles nullable old values for new assignments.
  * Uses placeholder values for removed assignments.
  * Logs number of inserted updates.