%https://arxiv.org/pdf/2505.03977?
\chapter{Materials and Methods}

This chapter details the methodology of the Seriguela pipeline. It describes: the data engineering process for generating, cleaning, and augmenting the mathematical expression dataset; the model's training methodology, including the initial supervised fine-tuning and the subsequent reinforcement learning refinement with Proximal Policy Optimization (PPO); the benchmark datasets and evaluation metrics used to assess accuracy and model complexity; and the essential implementation details regarding the software and hardware environment.

\section{Overview of the Proposed Seriguela Pipeline}

\subsection{Conceptual Pipeline}

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.9\linewidth]{figures/Overview.pdf} % Adjusted width for better margins
    \caption[End-to-end pipeline of the Seriguela Pipeline]{End-to-end pipeline of the Seriguela Pipeline.}
    \label{fig:pipeline}
\end{figure}

As illustrated in Figure~\ref{fig:pipeline}, our approach implements a three-stage methodology for symbolic regression:
\begin{enumerate}
    \item \textbf{Data Engineering Phase}: A dataset of expressions is generated, evaluated to remove invalid expressions, and finally augmented with the prompt format and new variables.
    \item \textbf{Model Fine-Tuning}: A pre-trained language model (LM) is specialized through supervised learning on mathematical expressions, transforming it into a domain-specific expression generator.
    \item \textbf{Expression Discovery}: The fine-tuned language model is iteratively optimized using Proximal Policy Optimization (PPO). This process takes tabular data as input, with columns representing features and a designated target variable, to explore and refine mathematical expressions. The intended outcome is a mathematical expression that effectively fits the input data.
\end{enumerate}

The first and second stages are applied only once. Once fine-tuned, the expression-generation LLM is used in an optimization loop driven by the input data; the process restarts for each new input dataset.

\section{Dataset Engineering}

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.9\linewidth]{figures/Data Engeneering.pdf}
    \caption[Data Engineering pipeline of the expression dataset]{Data Engineering pipeline of the expression dataset, from 1.1 Data Generation, through 1.2 Data Cleaning, to 1.3 Prompt Engineering.}
    \label{fig:DataEngineering}
\end{figure}

Large Language Models (LLMs), such as GPT-2, are pre-trained on vast amounts of natural language text. However, their core objective of predicting the next most probable token is not inherently suited to generating the syntactically precise and computable language of mathematics. When prompted to formulate an equation, a base GPT-2 model often produces descriptive text, incorrect formats, or syntactically invalid expressions instead of a usable formula, as illustrated by the examples in Table~\ref{tab:gpt2-comparison}. To address this limitation, the model must be specialized through supervised fine-tuning on a dedicated dataset of mathematical expressions. This process recalibrates the model's internal weights, teaching it the specific structure and tokens required to generate valid formulas when prompted.

The data engineering pipeline developed for this work consists of three main stages: Dataset Generation, Data Cleaning, and Prompt Engineering. This pipeline is visually outlined in Figure~\ref{fig:DataEngineering}.
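As a quick illustration of the limitation motivating this pipeline, the snippet below prompts an off-the-shelf GPT-2 model and checks whether its continuation parses as a computable expression. It is an illustrative check only, not part of the Seriguela implementation, and it assumes the Hugging Face \texttt{transformers} library and SymPy.

\begin{verbatim}
from transformers import pipeline
import sympy

# Prompt a base GPT-2 model and test whether the continuation is a
# parseable expression; with the unmodified model it usually is not.
generator = pipeline("text-generation", model="gpt2")
prompt = ("Instruction: Generate a mathematical expression using variables "
          "[x_1, x_2], operands [+, *, sin], and [C] as a constant. Expression:")
output = generator(prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
continuation = output[len(prompt):].strip()
try:
    sympy.sympify(continuation)          # raises if not a valid expression
    print("Usable formula:", continuation)
except Exception:                        # descriptive text, bad syntax, etc.
    print("Not a usable formula:", continuation)
\end{verbatim}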
\subsection{Dataset Generation}

\begin{table}[h]
    \centering
    \caption{Equation generation configuration parameters}
    \label{tab:eq_config}
    \begin{tabular}{p{3cm}p{5cm}p{4cm}}
        \toprule
        \textbf{Parameter} & \textbf{Value} & \textbf{Description} \\
        \midrule
        \texttt{max\_len} & 20 & Maximum token length for generated equations \\
        \addlinespace
        \texttt{operators} &
        \begin{minipage}[t]{\linewidth}
            \texttt{add:10, mul:10, sub:5, div:5,}
            \texttt{sqrt:4, pow2:4,}
            \texttt{exp:4, sin:4, cos:4,}
            \texttt{tan:4, asin:2}
        \end{minipage} & Operator sampling weights \\
        \addlinespace
        \texttt{max\_ops} & 5 & Maximum operations per equation \\
        \addlinespace
        \texttt{variables} & \texttt{x\_1, x\_2, x\_3, ..., x\_10} &
        \begin{minipage}[t]{\linewidth}
            \vspace{-0.5em}
            \begin{itemize}[leftmargin=*,nosep,noitemsep]
                \item Base variables
                \item Extensible via \texttt{x\_n} convention
            \end{itemize}
        \end{minipage} \\
        \bottomrule
    \end{tabular}
\end{table}

The first step in engineering the dataset was to generate a large volume of diverse mathematical expressions for the fine-tuning process. In this work, mathematical expressions are treated as tree structures, where the internal nodes represent operators (e.g., $+$, $\sin$, $\times$) and the leaves represent operands such as variables (e.g., $x_1, x_2$) or constants.

To generate these expression trees randomly while avoiding a bias towards overly simple or complex structures, this work adopted the algorithm proposed by Lample and Charton~\cite{lample2019deep}. This method ensures that different tree structures have a more uniform probability of being generated, which is crucial for creating a balanced and diverse dataset. The algorithm operates on trees where internal nodes can have one child (unary operators such as $\sin$) or two children (binary operators such as $+$). A detailed technical explanation of this generation algorithm is provided in Chapter 2.

Using this approach, an initial dataset of 500,000 expressions was generated. The expressions were first created in prefix notation and then converted to the more common infix format for subsequent steps. The key parameters that configured the expression generation are outlined in Table~\ref{tab:eq_config}.
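For illustration, the sketch below samples a random expression in prefix notation from a weighted operator set in the spirit of Table~\ref{tab:eq_config} and converts it to infix, checking the result with SymPy. It is a deliberately simplified stand-in: it does not reproduce the uniform tree-shape sampling of Lample and Charton~\cite{lample2019deep}, it covers only a subset of the operators, and the function names are hypothetical.

\begin{verbatim}
import random
import sympy

# Weighted operators (unary take one child, binary take two).
BINARY = {"+": 10, "*": 10, "-": 5, "/": 5}
UNARY  = {"sqrt": 4, "exp": 4, "sin": 4, "cos": 4, "tan": 4, "asin": 2}
VARS   = [f"x_{i}" for i in range(1, 11)]

def sample_prefix(max_ops=5):
    """Simplified recursive sampler (not the uniform-tree algorithm itself)."""
    if max_ops == 0 or random.random() < 0.3:
        return [random.choice(VARS)]                     # leaf: a variable
    ops = list(BINARY) + list(UNARY)
    weights = list(BINARY.values()) + list(UNARY.values())
    op = random.choices(ops, weights=weights, k=1)[0]
    if op in UNARY:
        return [op] + sample_prefix(max_ops - 1)
    return [op] + sample_prefix(max_ops - 1) + sample_prefix(max_ops - 1)

def prefix_to_infix(tokens):
    """Consume one subtree from the token list and return an infix string."""
    tok = tokens.pop(0)
    if tok in BINARY:
        left = prefix_to_infix(tokens)
        right = prefix_to_infix(tokens)
        return f"({left} {tok} {right})"
    if tok in UNARY:
        return f"{tok}({prefix_to_infix(tokens)})"
    return tok                                           # variable leaf

expr = prefix_to_infix(sample_prefix())
print(expr)                     # e.g. "(x_3 + sin(x_7))"
print(sympy.sympify(expr))      # parses because the tree is well formed
\end{verbatim}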
\begin{table}[h!]
    \centering
    \caption{Comparison of the GPT-2 model's output before and after fine-tuning, demonstrating its improved ability to generate syntactically correct expressions.}
    \label{tab:gpt2-comparison}
    \ttfamily
    \begin{tabularx}{\textwidth}{ >{\RaggedRight}X >{\RaggedRight}X }
        \toprule
        % Sub-header for the first section
        \multicolumn{2}{c}{\textbf{Base Model}} \\
        \midrule
        \textbf{Prompt} & \textbf{Result} \\
        \midrule
        % First example
        Generate a mathematical expression that represents the kinetic energy of an object with mass m and velocity v. & For example, if you want to calculate the kinetic energy of a baseball, you can use the following formula: $K=nx$ \\
        \addlinespace
        % Second example
        Write a trigonometric expression involving sine and cosine that simplifies to 1. & Write a trigonometric expression involving sin and cosine that simplifies to 1. \\
        \addlinespace
        % Third example
        Instruction: Generate a mathematical expression using variables [x\_1, x\_2, x\_3, x\_4], operands [+, -, *, exp], and [C] as a constant. Expression: & $x_1 = -(x_2 \cdot x_3) + (x_4 \cdot \exp)$ The expression above is a mathematical expression. \\
        \midrule[1pt]
        % Sub-header for the second section
        \multicolumn{2}{c}{\textbf{After Fine-Tuning}} \\
        \midrule
        % The prompt section - now consistent with the rest
        vars: x\_1, x\_2, x\_3, x\_4, x\_5, x\_6, x\_7, x\_8, x\_9 \\
        oper: *, **, +, -, abs, asin, cos, exp, log, sin, sqrt, tan \\
        cons: C \\
        expr: & $x_7 + \exp(C \cdot x_2^C) + C$ \\
        \bottomrule
    \end{tabularx}
\end{table}

\subsection{Data Cleaning}

To ensure the dataset's integrity, expressions were validated and cleaned. A SymPy~\cite{sympy} parser was used to check the syntactic validity of each expression, specifically identifying issues such as missing closing parentheses. Additionally, any duplicate expressions were removed to maintain uniqueness within the dataset.

\subsection{Prompt Engineering}
\label{sec:prompt_engineering}

After the generation and cleaning stages, the dataset consisted of a simple list of valid mathematical expressions. However, this raw format was unsuitable for fine-tuning the LLM for two main reasons:
\begin{enumerate}
    \item \textbf{Lack of a Guiding Structure:} The expressions alone did not provide a contextual prompt that could be used to guide the model's generation process during inference.
    \item \textbf{Limited Diversity:} The generation algorithm did not produce expressions with a wide range of variables (often limited to five or fewer), and the constants were not yet optimized for exploration.
\end{enumerate}

To address these issues, a prompt engineering phase was implemented to transform each raw expression into a structured training sample. The goal was to create a format that was \textbf{human-readable}, \textbf{token-efficient}, and provided the model with clear context about the elements available for generation. The final prompt structure aggregates the available variables, operators, and constants, followed by the target expression. This resulted in the following format:
\begin{verbatim}
vars: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9
oper: *, **, +, -, abs, asin, cos, exp, log, sin, sqrt, tan
cons: C
expr: x_7 + exp(C*x_2**C) + C
\end{verbatim}

To enhance the diversity and robustness of the training data, the context provided in the \texttt{vars:} and \texttt{oper:} lists was intentionally expanded. For each training sample, a random number of additional variables and operators were added to these lists, beyond what was strictly required by the target expression. This strategy teaches the model to be flexible and to generate expressions that utilize only a subset of the available elements, thereby better simulating real-world scenarios where not all features are relevant.

Finally, to evaluate model performance on different notations, both the \textbf{infix} (standard notation) and \textbf{prefix} (Polish notation) representations of the expressions were retained in the final dataset.
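To make the cleaning and prompt-construction steps concrete, the sketch below filters a list of candidate expressions with SymPy and then formats each survivor into the prompt structure shown above, padding the \texttt{vars:} and \texttt{oper:} lists with randomly chosen extra elements. It is a simplified illustration: the helper names are hypothetical, the operator detection is a crude substring check, and the random padding does not reproduce the exact scheme used in this work.

\begin{verbatim}
import random
import re
import sympy

ALL_VARS = [f"x_{i}" for i in range(1, 10)]
ALL_OPS  = ["*", "**", "+", "-", "abs", "asin", "cos", "exp",
            "log", "sin", "sqrt", "tan"]

def clean(expressions):
    """Keep only syntactically valid, unique expressions (Data Cleaning)."""
    valid, seen = [], set()
    for expr in expressions:
        try:
            sympy.sympify(expr)          # rejects e.g. missing parentheses
        except Exception:                # SympifyError, SyntaxError, ...
            continue
        if expr not in seen:
            seen.add(expr)
            valid.append(expr)
    return valid

def build_prompt(expr, max_extra=4):
    """Format one cleaned expression as a training sample (Prompt Engineering)."""
    used_vars = sorted(set(re.findall(r"x_\d+", expr)))
    used_ops  = [op for op in ALL_OPS if op in expr]   # crude substring check
    # Pad the context with elements the expression does not use, so the
    # model learns to select a subset of what is offered.
    extra_vars = random.sample([v for v in ALL_VARS if v not in used_vars],
                               k=random.randint(0, max_extra))
    extra_ops  = random.sample([o for o in ALL_OPS if o not in used_ops],
                               k=random.randint(0, max_extra))
    return ("vars: " + ", ".join(sorted(set(used_vars + extra_vars))) + "\n"
            "oper: " + ", ".join(sorted(set(used_ops + extra_ops))) + "\n"
            "cons: C\n"
            "expr: " + expr)

candidates = ["x_7 + exp(C*x_2**C) + C", "sin(x_1", "x_3 * x_4"]
for expr in clean(candidates):           # the malformed "sin(x_1" is dropped
    print(build_prompt(expr), end="\n\n")
\end{verbatim}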
\section{Supervised Fine-Tuning}

\subsection{Pretraining Motivation}
\label{subsec:pretraining}

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.9\linewidth]{figures/Supervised-Fine-Tuning.pdf}
    \caption[Supervised Fine-tuning process]{An illustration of the supervised fine-tuning process for a machine learning model, showing the key stages from data preparation to model refinement.}
    \label{fig:SupFineTuning}
\end{figure}

Once a dataset of mathematical expressions is prepared, the GPT-2 model can be trained to generate expressions that adhere to a given prompt and context. As illustrated in Table~\ref{tab:gpt2-comparison}, a base GPT-2 model is not inherently capable of generating tokens in a format that allows an expression to be extracted and computed. This limitation exists because the model's pretraining objective focuses on predicting the most probable next token in a sentence, rather than adhering to a specific output format. Therefore, to achieve the desired output, the model requires explicit examples of how to produce the targeted mathematical expressions, so that its weights can be updated to produce the correct tokens whenever a prompt following the specified format is given as input.

\subsection{Fine-tuning Setup}

The dataset was partitioned into training, testing, and validation sets with an 80\%, 10\%, and 10\% distribution, respectively. For input processing, the original GPT-2 tokenizer was employed, keeping tokenization consistent with the pre-trained model while adapting it to our specific task. No additional tokenizer modifications were required, as its existing vocabulary adequately covered our data.

Figure~\ref{fig:GPT2Token} illustrates the outcome of this tokenization process. Notably, words such as ``var'', ``asin'', and ``sqrt'' are split into multiple tokens. Similarly, variables like ``$x_1$'' are tokenized in parts (e.g., ``$x$'', ``\_'', ``1''). This fine-grained tokenization is advantageous for this work, as it enables the model to generalize and generate variables beyond its initial training, for instance ``$x_{99}$''.
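The sub-token behaviour noted above can be reproduced with, for example, the Hugging Face \texttt{transformers} tokenizer. The snippet below is a small illustration; the library choice and the exact token boundaries shown in the comments are assumptions for this example, and Figure~\ref{fig:GPT2Token} remains the reference for the actual tokenization used in this work.

\begin{verbatim}
from transformers import GPT2TokenizerFast

# Load the standard GPT-2 tokenizer; no vocabulary changes are needed.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "vars: x_1, x_2\noper: +, sin, sqrt\ncons: C\nexpr:"
print(tokenizer.tokenize(prompt))
# Identifiers such as "x_1" or "sqrt" come out as several sub-tokens
# (roughly "x", "_", "1", ...), which is what lets the fine-tuned model
# compose unseen variables such as x_99 from the same pieces.

ids = tokenizer(prompt)["input_ids"]
print(tokenizer.decode(ids))   # byte-level BPE round-trips the prompt
\end{verbatim}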