docs for the database utils brought to you 100% by claude

This commit is contained in:
2025-12-18 18:24:55 -06:00
parent d305a96ca3
commit 9a31225360

487
docs/database_utils.md Normal file
View File

@@ -0,0 +1,487 @@
# Database Utilities Grade Snapshot System
## Overview
This module implements a **ClickHouse-backed grade snapshot and diffing system**.
It ingests grade data from an external API, persists **immutable snapshots**, tracks **stable entities** (users, classes, assignments), and computes **changes over time** (new / updated / removed grades).
The design emphasizes:
* **Idempotent ingestion**
* **Historical accuracy**
* **Efficient change detection**
* **Append-only semantics** (ClickHouse-friendly)
All functionality lives in the `database_utils` namespace.
---
## Core Concepts
### 1. Stable Entities vs Snapshots
| Concept | Description |
| ----------------------- | ---------------------------------------------------------- |
| **User** | A logical account |
| **Class** | A course belonging to a user (stable across time) |
| **Assignment** | A specific graded item within a class (stable across time) |
| **Snapshot (response)** | A point-in-time capture of all grades returned by the API |
| **Grade history** | Per-assignment grades linked to a snapshot |
| **Diffs** | Computed changes between two snapshots |
Stable entities are **created once and reused**.
Snapshots are **immutable** and **time-ordered**.
---
## UUID Utilities
### `parse_uuid(string) → UUID`
Parses a standard UUID string (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) into ClickHouses `UUID { high, low }` format.
* Validates format
* Throws on malformed input
* Used everywhere UUIDs enter the DB
### `uuid_to_string(UUID) → string`
Converts ClickHouse UUIDs back into standard string format.
---
## Date Handling
### `parse_date_to_clickhouse(string) → uint16_t`
Converts an API date string (`YYYY-MM-DD`) into ClickHouse `Date` format (days since Unix epoch).
* Empty or invalid dates → epoch (`0`)
* Logs parsing failures instead of throwing
---
## Database Handle
```cpp
using CHClient = std::shared_ptr<clickhouse::Client>;
```
All functions accept a shared ClickHouse client to allow:
* Connection reuse
* Thread-safe sharing
* Easy dependency injection
---
## User Operations
### `get_all_users()`
Returns all users with:
* `user_id`
* `username`
* `password`
Primarily for administration/debugging.
---
### `register_user(username, password)`
Inserts a new user row.
⚠️ **Note:** Passwords are currently stored in plaintext.
Hashing should be added before production use.
---
### `authenticate_user(username, password)`
Returns `true` if a matching user exists.
* Uses `count()` for minimal payload
* Simple boolean authentication check
---
### `get_user_uuid(username)`
Returns the users UUID if found.
---
## Snapshot Insertion Flow
### `insert_grade_snapshot(user_id, api_response) → response_id`
This is the **main ingestion pipeline**.
#### Step-by-Step Flow
1. **Insert `grade_responses`**
* One row per API fetch
* Contains metadata (`success`, `total_classes`, timestamp)
2. **Fetch generated `response_id`**
* Most recent response for that user
3. **Process each class**
* `get_or_create_class()`
* Ensures a stable `class_id`
* Links class to the response (`response_classes`)
4. **Process assignments**
* `get_or_create_assignment()`
* Ensures stable `assignment_id`
5. **Insert grade history**
* Batched inserts into `assignment_grade_history`
* Each row ties:
* response
* assignment
* score
* attempts
Snapshots are **never updated**, only appended.
---
## Stable Entity Management
### `get_or_create_class(user_id, class_data) → class_id`
* Searches by `(user_id, class_name)`
* If found:
* Updates metadata (teacher, period, category)
* If not found:
* Inserts a new class record
* Returns the stable `class_id`
---
### `get_or_create_assignment(user_id, class_id, assignment_data) → assignment_id`
* Searches by `(user_id, class_id, assignment_name)`
* Updates due date / major flag if it exists
* Otherwise inserts a new assignment
* Returns stable `assignment_id`
---
## Snapshot Loading
### `load_latest_snapshot(user_id)`
Loads the most recent snapshot for a user.
Returns `std::nullopt` if none exists.
---
### `load_snapshot_by_id(response_id)`
Loads a **fully hydrated snapshot**, including:
* User
* Classes
* Assignments
* Grades
#### Result Structure
```text
GradeSnapshot
├── response_id
├── user_id
├── classes[class_name] -> ClassRecord
├── assignments[class::assignment] -> AssignmentRecord
└── grades[assignment_id] -> GradeRecord
```
Used for:
* Diffing
* UI display
* Historical comparisons
---
## Change Detection
### `has_changes(user_id, new_api_response) → bool`
Fast pre-check before inserting a new snapshot.
Detects:
* New assignments
* Removed assignments
* Score changes
* Attempt changes
If no prior snapshot exists → **changes detected**.
---
## Snapshot Diffing
### `diff_snapshots(old, new) → vector<AssignmentDiff>`
Computes **semantic differences** between two snapshots.
#### Change Types
| Type | Meaning |
| --------- | ------------------------------- |
| `NEW` | Assignment did not exist before |
| `UPDATED` | Score or attempts changed |
| `REMOVED` | Assignment disappeared |
Each diff includes:
* Assignment ID
* Class name
* Assignment name
* Old grade (optional)
* New grade
---
## Grade Update Logging
### `insert_grade_updates(user_id, old_response_id, new_response_id, diffs)`
Persists diffs into `grade_updates`.
#### Features
* Nullable old values for new assignments
* Placeholder values for removed assignments
* Compact enum encoding for change type
* Batched insert for efficiency
This table provides a **clear audit trail** of grade changes over time.
---
## Assignment Key Strategy
```cpp
"class_name::assignment_name"
```
Used as a human-stable lookup key when comparing snapshots.
* Avoids relying on database IDs during diffing
* Keeps logic resilient to ID reuse or refactors
---
## Design Guarantees
* ✅ Snapshots are immutable
* ✅ Stable IDs persist across time
* ✅ Changes are explicitly logged
* ✅ Efficient ClickHouse-friendly inserts
* ⚠️ SQL string concatenation is used (should be parameterized later)
* ⚠️ Password hashing not implemented
# `database_utils` Function Documentation
---
## **UUID Utilities**
### `clickhouse::UUID parse_uuid(const std::string& str)`
* **Purpose:** Converts a UUID string (`xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`) into a ClickHouse `UUID` object.
* **Input:** `str` UUID string.
* **Output:** `clickhouse::UUID` (high/low 64-bit parts).
* **Behavior:**
* Throws `std::runtime_error` if string is not 36 characters or malformed.
* Removes hyphens and parses hex into two 64-bit integers.
* **Usage:** Any place UUID strings need to be stored in ClickHouse.
---
### `std::string uuid_to_string(const clickhouse::UUID& u)`
* **Purpose:** Converts a ClickHouse `UUID` back into a human-readable UUID string.
* **Input:** `u` ClickHouse UUID.
* **Output:** Standard UUID string.
* **Behavior:** Formats `high` and `low` 64-bit integers as a zero-padded 36-character string with hyphens.
---
## **Date Utilities**
### `uint16_t parse_date_to_clickhouse(const std::string& date_str)`
* **Purpose:** Converts a date string from the API into ClickHouse `Date` format.
* **Input:** `date_str` string in `YYYY-MM-DD` format.
* **Output:** `uint16_t` representing days since 1970-01-01.
* **Behavior:**
* Empty or invalid strings return `0` (epoch).
* Logs warnings for parse failures.
* **Usage:** Assignments `due_date` conversion.
---
## **User Operations**
### `std::vector<UserRecord> get_all_users(const CHClient& client)`
* **Purpose:** Retrieves all users from the database.
* **Input:** `client` ClickHouse client.
* **Output:** Vector of `UserRecord` (includes `user_id`, `login` info).
* **Behavior:** Logs batch size and total retrieved.
---
### `bool register_user(const CHClient& client, const std::string& username, const std::string& password)`
* **Purpose:** Inserts a new user into the database.
* **Input:** `username`, `password`.
* **Output:** `true` on success.
* **Behavior:** Currently stores passwords in plaintext; logs success.
---
### `bool authenticate_user(const CHClient& client, const std::string& username, const std::string& password)`
* **Purpose:** Checks if a user exists with given credentials.
* **Input:** `username`, `password`.
* **Output:** `true` if valid, `false` otherwise.
* **Behavior:** Uses `SELECT count()` for boolean check. Logs results.
---
### `std::optional<clickhouse::UUID> get_user_uuid(const CHClient& client, const std::string& username)`
* **Purpose:** Retrieves a users UUID based on their username.
* **Input:** `username`.
* **Output:** `optional<UUID>`; empty if user not found.
* **Behavior:** Uses `LIMIT 1` for efficiency.
---
## **Class & Assignment Management**
### `std::string get_or_create_class(const CHClient& client, const std::string& user_id, const api_utils::ClassGrades& class_data)`
* **Purpose:** Ensures a stable `class_id` exists for a user.
* **Input:** `user_id`, `class_data` (name, teacher, period, category).
* **Output:** `class_id` as string.
* **Behavior:**
* Searches for existing class.
* Updates metadata if found.
* Inserts new record if not found.
* Links class to the user in `user_classes`.
---
### `std::string get_or_create_assignment(const CHClient& client, const std::string& user_id, const std::string& class_id, const api_utils::AssignmentGrade& assignment_data)`
* **Purpose:** Ensures a stable `assignment_id` exists within a class.
* **Input:** `user_id`, `class_id`, `assignment_data` (name, dueDate, isMajorGrade).
* **Output:** `assignment_id` as string.
* **Behavior:**
* Updates existing assignment if found.
* Inserts a new assignment if not found.
* Uses `parse_date_to_clickhouse()` for due date.
---
## **Snapshot Insertion**
### `std::string insert_grade_snapshot(const CHClient& client, const std::string& user_id, const api_utils::GradesResponse& api_response)`
* **Purpose:** Inserts a complete snapshot of grades from the API.
* **Input:** `user_id`, `api_response` (success flag, total classes, grades per class/assignment).
* **Output:** `response_id` of the inserted snapshot; empty string on failure.
* **Behavior:**
1. Inserts metadata into `grade_responses`.
2. Retrieves `response_id`.
3. Processes each class:
* Calls `get_or_create_class()`
* Links to response in `response_classes`
4. Processes each assignment:
* Calls `get_or_create_assignment()`
* Inserts grades into `assignment_grade_history`.
* **Notes:** Immutable snapshots; append-only.
---
## **Snapshot Loading**
### `std::optional<GradeSnapshot> load_latest_snapshot(const CHClient& client, const std::string& user_id)`
* **Purpose:** Loads the most recent snapshot for a user.
* **Output:** Fully populated `GradeSnapshot` or `nullopt` if none exists.
* **Behavior:** Uses `fetched_at DESC LIMIT 1`.
---
### `std::optional<GradeSnapshot> load_snapshot_by_id(const CHClient& client, const std::string& response_id)`
* **Purpose:** Loads a snapshot by `response_id`.
* **Output:** `GradeSnapshot` including:
* Classes
* Assignments
* Grades
* **Behavior:** Joins `user_classes`, `user_assignments`, `assignment_grade_history`, `response_classes`.
---
## **Diffing & Change Detection**
### `bool has_changes(const CHClient& client, const std::string& user_id, const api_utils::GradesResponse& new_api_response)`
* **Purpose:** Detects if new API response differs from the latest snapshot.
* **Output:** `true` if changes exist, `false` otherwise.
* **Checks for:**
* New assignments
* Removed assignments
* Score/attempt changes
---
### `std::vector<AssignmentDiff> diff_snapshots(const GradeSnapshot& old_snapshot, const GradeSnapshot& new_snapshot)`
* **Purpose:** Returns detailed differences between two snapshots.
* **Output:** Vector of `AssignmentDiff`.
* **Change Types:** `NEW`, `UPDATED`, `REMOVED`.
* **Behavior:** Compares old and new grades per assignment key.
---
### `void insert_grade_updates(const CHClient& client, const std::string& user_id, const std::string& old_response_id, const std::string& new_response_id, const std::vector<AssignmentDiff>& diffs)`
* **Purpose:** Inserts diffs into `grade_updates` table.
* **Behavior:**
* Maps `AssignmentDiff` to ClickHouse columns.
* Handles nullable old values for new assignments.
* Uses placeholder values for removed assignments.
* Logs number of inserted updates.