almost done

2025-04-30 01:03:24 -05:00
parent a56dacdcca
commit a4e3e72863
13 changed files with 144 additions and 57 deletions


@@ -68,7 +68,7 @@ February 28 2025\\
%%%%Title
\begin{center}
\vspace{1em}
Rule-based Tensor Mutations Embedded within LLMs for Low-Cost Mathematical Computation
\end{center}
@@ -92,9 +92,9 @@ In recent years, a specialized kind of Machine Learning models have hit the mark
These techniques were later commercialized with the advent of GPT-2, GPT-3, and BERT from AI labs like OpenAI and Google \parencite[3]{Wang2024}. With an increased supply of Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), these models began snowballing in scale. This was especially evident starting in 2019, when an iteration of GPT-2 was released at a production size of 1.5 billion parameters. In 2020, GPT-3 scaled up to 175 billion parameters, achieving a previously unseen level of coherence in machine-generated reasoning. GPT-4 was released by OpenAI in 2023, with an undisclosed scale in the trillions of parameters. Development investment also climbed into the hundreds of billions of dollars, with new firms such as Anthropic and xAI entering the market. Open-source projects also gained popularity, some backed by multi-billion dollar R\&D teams, such as Meta's Llama series.
Functionally, there is no fundamental algorithmic difference between generative and classification models. Indeed, most LLMs are initially trained to generate new sequences of words by setting the loss function to expect the next word in the series of an existing corpus, through a process known as Causal Language Modeling (CLM). For the purposes of commercialization, they have been re-purposed to be prompted as chatbots by users. This is done by performing backpropagation on the generation of conversational sequences, with the LLM often instructed to act as if filling out a conversation's transcript.
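To make the CLM objective concrete, the sketch below (a hypothetical, standalone example; the function name and toy token IDs are not drawn from any existing codebase) shows how a tokenized corpus is turned into next-word prediction pairs, where each position's target is simply the token that follows its context.
\begin{minted}{rust}
/// Build (input, target) pairs for causal language modeling:
/// the model sees tokens[..i] as context and must predict tokens[i].
fn clm_pairs(tokens: &[u32]) -> Vec<(&[u32], u32)> {
    (1..tokens.len())
        .map(|i| (&tokens[..i], tokens[i]))
        .collect()
}

fn main() {
    let tokens = [12_u32, 7, 93, 4]; // toy token IDs
    for (context, target) in clm_pairs(&tokens) {
        println!("context {:?} -> target {}", context, target);
    }
}
\end{minted}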
Several underlying technologies are involved in the lifecycle of an LLM. The process of creating one usually starts with the definition of a vocabulary. Sequences of language are broken into tokens by algorithms called tokenizers. Tokenizers split text into smaller units, which are then encoded into vectors by another MLP. This is done so that semantically similar words are mapped to mathematically similar vectors. The similarity of two vectors can be calculated using the cosine-similarity formula, which computes the angle $\phi$ between two vectors.
\[
\cos\phi=\frac{\vec{A}\cdot\vec{B}}{||\vec{A}||||\vec{B}||}
\]
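As a direct illustration of the formula above (a standalone sketch rather than code from the proposed system), cosine similarity can be computed from the dot product and the two vector norms:
\begin{minted}{rust}
/// Cosine similarity: cos(phi) = (A . B) / (||A|| * ||B||)
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Two toy embedding vectors; similar directions give a value near 1.0.
    let a = [0.9_f32, 0.1, 0.3];
    let b = [0.8_f32, 0.2, 0.25];
    println!("cos(phi) = {:.3}", cosine_similarity(&a, &b));
}
\end{minted}
Values near $1$ indicate closely aligned embedding vectors, while values near $0$ indicate near-orthogonal, semantically unrelated ones.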
@@ -145,14 +145,14 @@ This research aims to investigate the potential integration of rule-based tensor
\textbf{RQ:} How can deterministic rule-based tensor mutations be embedded within LLM architectures to enable more accurate and efficient mathematical operations?
\end{quote}
The significance of this line of inquiry lies in its potential to address a fundamental limitation of current generative AI systems like ChatGPT, Anthropic's Claude, etc. While specialized numeric compute systems exist (e.g. RAG with Wolfram Alpha), they operate independently of the SIMD, low-latency systems of LLMs, leading to sizable communication latency. This is especially prevalent in workflows involving both mathematical and linguistic reasoning. Integrating such computation directly within LLMs could substantially reduce the computational resources required for complex tasks that involve both natural language processing and mathematical reasoning.
This investigation focuses specifically on the following mathematical operations (a Rust sketch of this operation set follows the list):
\begin{itemize}
\item Basic arithmetic (addition, subtraction, multiplication, division)
\item Matrix Operations (multiplication, inversion, determinant)
\item Binary Operations (XOR, AND, NAND, left shift, right shift, OR, complement)
\item Array Operations (array sum, as well as the mean, median, mode, standard deviation, variance, and other single variable metrics of a data set)
\end{itemize}
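The operation set above can be collected into a single dispatch type. The enum below is a hypothetical sketch; the type and variant names are illustrative and not taken from an existing implementation.
\begin{minted}{rust}
/// Hypothetical taxonomy of the deterministic operations under investigation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MathOp {
    // Basic arithmetic
    Add, Sub, Mul, Div,
    // Matrix operations
    MatMul, MatInverse, Determinant,
    // Binary operations
    Xor, And, Nand, Or, Complement, ShiftLeft, ShiftRight,
    // Single-variable array statistics
    Sum, Mean, Median, Mode, StdDev, Variance,
}

fn main() {
    let op = MathOp::Determinant;
    println!("selected operation: {:?}", op);
}
\end{minted}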
@@ -188,10 +188,34 @@ The implementation architecture utilizes predetermined index relationships withi
Each mathematical operation type is assigned specific input and output indices within the tensor, creating a predictable computational graph that can be optimized during compilation using the CUDA compiler, \mintinline{c}|gcc|, and manual assembler optimization like with DeepSeekV3 \parencite[16]{deepseekai2025deepseekv3technicalreport}. Addition operations, for instance, use indices $(i,j)$ and $(i+1,j)$ as inputs, with results stored at $(i,j+d/2)$, effectively partitioning the embedding space into operand and result regions. Multiplication operations utilize indices $(i,j)$ and $(i,j+1)$ as inputs, with results projected to $(i+1,j+d/2)$, maintaining a consistent pattern of spatial relationships within the tensor. More complex operations like matrix determinant calculations employ a $3\times3$ submatrix starting at index $(i,j)$ with results consolidated at $(i+3,j)$. This systematic approach to index mapping enables highly efficient computation on GPU architectures, as the fixed patterns allow for optimized memory access patterns due to hard-coded indexing at compile time, and reduced cache thrashing during tensor operations. Modern GPUs excel at these fixed-pattern operations, particularly when they can be expressed as fused operations within CUDA kernels or optimized through tensor cores designed specifically for matrix multiplication \parencite{cuda_programming_guide_2025}.
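To illustrate the fixed-index addition rule described above, the following standalone sketch applies it to a plain row-major structure in Rust rather than a GPU tensor; the index convention $(i,j)$, $(i+1,j) \rightarrow (i,j+d/2)$ follows the paragraph, while the function name and toy dimensions are hypothetical.
\begin{minted}{rust}
/// Apply the fixed-index addition rule to a row-major 2-D "tensor":
/// operands at (i, j) and (i + 1, j), result stored at (i, j + d/2),
/// where d is the embedding (column) dimension.
fn apply_addition_rule(tensor: &mut Vec<Vec<f32>>, i: usize, j: usize) {
    let d = tensor[0].len();
    assert!(j < d / 2 && i + 1 < tensor.len(), "indices out of range");
    let sum = tensor[i][j] + tensor[i + 1][j];
    tensor[i][j + d / 2] = sum; // result region occupies the upper half of each row
}

fn main() {
    // 4 x 8 toy embedding block: first half operands, second half results.
    let mut t = vec![vec![0.0_f32; 8]; 4];
    t[0][0] = 2.0;
    t[1][0] = 3.0;
    apply_addition_rule(&mut t, 0, 0);
    println!("result at (0, 4) = {}", t[0][4]); // 5.0
}
\end{minted}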
The architecture maintains parallel processing paths that preserve the dual nature of the system's capabilities. The standard language processing path continues to leverage the probabilistic, statistical nature of the transformer architecture, preserving the original LLM capabilities that have proven effective for natural language understanding and generation. Simultaneously, the mathematical computation path applies fixed-index transformations for specific operations, creating a deterministic subsystem within the larger stochastic network. These parallel streams capitalize on the inherent parallelism of GPU architectures, allowing different CUDA cores and cache regions to process distinct streams simultaneously. The fixed-index nature of the mathematical operations enables compiler optimizations that can allocate dedicated tensor cores for these operations, maximizing throughput and minimizing latency. Existing models, as shown in Table \ref{tab:model-sizes}, tend to use far more VRAM than cores, leading to an allocation that is inefficient in terms of performance per millisecond of inference. The paths are later merged through concatenation and a projection layer, a process that similarly benefits from the warp-level primitives available in modern GPU architectures for efficient tensor manipulation.
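The merge step at the end of this path, concatenation followed by a projection layer, can be sketched in plain Rust as follows; the dimensions and the projection matrix are illustrative placeholders rather than values from the proposed architecture.
\begin{minted}{rust}
/// Concatenate the language-path and math-path vectors, then project the
/// combined vector back to the model dimension with a dense matrix.
fn merge_paths(lang: &[f32], math: &[f32], proj: &[Vec<f32>]) -> Vec<f32> {
    let combined: Vec<f32> = lang.iter().chain(math.iter()).copied().collect();
    assert!(proj.iter().all(|row| row.len() == combined.len()));
    // Each output element is a dot product of a projection row with the concatenation.
    proj.iter()
        .map(|row| row.iter().zip(&combined).map(|(w, x)| w * x).sum())
        .collect()
}

fn main() {
    let lang = [0.5_f32, -0.2];
    let math = [1.0_f32, 0.0];
    // 2 x 4 projection: maps the concatenated 4-vector back to dimension 2.
    let proj = vec![vec![0.25_f32; 4], vec![0.1_f32; 4]];
    let merged = merge_paths(&lang, &math, &proj);
    println!("merged representation: {:?}", merged);
}
\end{minted}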
The attention mechanism serves as a noise filter and integration component, allowing the model to selectively focus on either standard language representations or mathematically transformed representations based on input context. This selective focusing behavior effectively routes information through the appropriate pathway based on the input's semantic requirements. From a hardware acceleration perspective, this mechanism benefits from recent advancements in GPU architecture specifically designed for transformer models. The attention operations leverage dedicated tensor cores in NVIDIA's Ampere and Hopper architectures, which provide specialized hardware acceleration for matrix multiplication and accumulation operations at various precisions. The fixed-index nature of the approach enables further optimization of these operations through persistent CUDA kernels that maintain tensor data in high-bandwidth on-chip memory (shared memory and L2 cache), reducing expensive global memory access operations during the attention computation phase.
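As a toy illustration of attention acting as a soft router between the two pathways (a self-contained sketch, not the transformer's actual multi-head attention), a softmax over two pathway scores yields the mixing weights:
\begin{minted}{rust}
/// Softmax over raw scores, producing weights that sum to 1.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f32 = exps.iter().sum();
    exps.iter().map(|e| e / total).collect()
}

fn main() {
    // Higher score for the mathematical pathway on a numeric prompt.
    let scores = [0.4_f32, 2.1]; // [language path, math path]
    let weights = softmax(&scores);
    println!("language weight = {:.2}, math weight = {:.2}", weights[0], weights[1]);
}
\end{minted}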
\newpage
{\raggedright \normalsize \textbf{Implementation Hardware \& Software}}
\section{Implementation in Rust Using Burn}
Rust was selected for its memory safety guarantees, zero-cost abstractions, and deterministic concurrency model. The neural network is implemented using the \mintinline{toml}{burn} crate, a modular, backend-agnostic deep learning framework designed for Rust. Burn enables explicit architectural definition via trait-based modules and supports GPU acceleration using backends such as \mintinline{toml}{burn-wgpu} and \mintinline{toml}{burn-candle}. This design aligns with IB Computer Science principles of modularity, abstraction, and system performance.
\begin{minted}{toml}
[dependencies]
burn = "0.12"
burn-wgpu = "0.12"
log = "0.4"
env_logger = "0.10"
\end{minted}
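The trait-based module style mentioned above can be outlined as follows. The struct, field names, and dimensions are hypothetical, and derive macros and constructor signatures vary between \mintinline{toml}{burn} releases, so this should be read as a sketch rather than verified 0.12 code.
\begin{minted}{rust}
use burn::module::Module;
use burn::nn::Linear;
use burn::tensor::backend::Backend;
use burn::tensor::Tensor;

/// A hypothetical module holding the projection layer that merges the
/// language and mathematical pathways after concatenation.
#[derive(Module, Debug)]
struct PathwayMerge<B: Backend> {
    projection: Linear<B>,
}

impl<B: Backend> PathwayMerge<B> {
    /// Project the concatenated representation back to the model dimension.
    fn forward(&self, concatenated: Tensor<B, 2>) -> Tensor<B, 2> {
        self.projection.forward(concatenated)
    }
}
\end{minted}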
The system targets an MSI RTX 4090 (24GB VRAM, 900W), utilizing \mintinline{toml}{burn-wgpu} to leverage WebGPU for GPU-accelerated training. This setup maximizes throughput for the floating-point operations critical in gradient descent and backpropagation.
\begin{minted}{rust}
use log::info;

fn main() {
    // Read the log level configuration (e.g. RUST_LOG=info) from the environment.
    env_logger::init();
    info!("Training initialized");
}
\end{minted}
The \mintinline{toml}{log} crate provides structured runtime logging, while \mintinline{toml}{env_logger} parses environment variables to configure log levels. Logging supports traceability, a key aspect of IB standards emphasizing system reliability and maintainability. Modular logging also illustrates core software engineering practices, such as separation of concerns and system observability, during neural network training and mutation processes.
%%%%Works cited
\newpage
\begin{center}