College of Graduate Studies: Theses & Dissertations
Term of Award
Spring 2026
Degree Name
Master of Science, Computer Science (M.S.C.S.)
Document Type and Release Option
Thesis (open access)
Copyright Statement / License for Reuse
This work is licensed under a Creative Commons Attribution 4.0 License.
Department
Department of Computer Science
Committee Chair
Vijayalakshmi Ramasamy
Committee Member 1
Weitian Tong
Committee Member 2
Lixin Li
Abstract
Genome assembly — the reconstruction of a complete DNA sequence from short, overlapping reads — remains a fundamental challenge in computational biology. A central difficulty is distinguishing true genomic overlaps from spurious connections arising from repetitive sequences, a task that traditional assemblers address through hand-tuned heuristic rules applied to de Bruijn or overlap graphs. This thesis introduces COGRAM (Coggins–Ramasamy Assembly Method), a genome assembly pipeline that reframes sequence reconstruction as an edge classification task on a k-mer overlap graph, replacing heuristic graph cleaning with a learned model.
COGRAM constructs a directed overlap graph from raw sequencing reads using a k-mer length of 31, yielding approximately 4.95 million edges. Node representations are derived from three structural features — in-degree, out-degree, and k-mer coverage — and refined through a two-layer Graph Convolutional Network (GCN) with batch normalization and edge dropout regularization, producing 64-dimensional hidden representations and 3-dimensional node embeddings. Edge embeddings are formed via four-way concatenation of endpoint representations, capturing both absolute and relational signals in a 12-dimensional vector that is passed to a multilayer perceptron for binary classification.
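The arithmetic behind the 12-dimensional edge vector can be illustrated with a minimal sketch: four 3-dimensional blocks concatenated per edge. The abstract does not say which four blocks are used. The choice below (source embedding, target embedding, absolute difference, element-wise product) is an assumption made only for illustration, and the function name `edge_embedding` is hypothetical rather than taken from the thesis.

```python
from typing import List, Sequence


def edge_embedding(h_u: Sequence[float], h_v: Sequence[float]) -> List[float]:
    """Build one feature vector for a directed edge u -> v.

    ASSUMPTION: the four concatenated blocks (source, target, absolute
    difference, element-wise product) are illustrative. The abstract only
    states that the concatenation is four-way.
    """
    if len(h_u) != len(h_v):
        raise ValueError("endpoint embeddings must have the same length")
    # Relational blocks: how the two endpoints differ and how they interact.
    abs_diff = [abs(a - b) for a, b in zip(h_u, h_v)]
    product = [a * b for a, b in zip(h_u, h_v)]
    # Absolute blocks first (the endpoints themselves), then relational ones.
    return list(h_u) + list(h_v) + abs_diff + product


# Hypothetical 3-dimensional node embeddings for a single edge.
source = [0.5, -1.0, 2.0]
target = [1.5, 2.0, -2.0]
features = edge_embedding(source, target)
print(len(features))  # 4 blocks x 3 dimensions
```

Because the source and target blocks keep their order, the vector is direction-aware: swapping the endpoints changes the first six entries while leaving the symmetric difference and product blocks unchanged.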
The pipeline is evaluated on Escherichia coli K-12 MG1655, a well-characterized reference genome, with a stratified 70/15/15 train-validation-test split. Because true overlap edges constitute a small minority of all candidate edges, standard accuracy is a misleading metric; model performance is instead assessed using area under the precision-recall curve (AUPRC). COGRAM achieves an AUPRC of 0.9995, a precision of 99.99%, and a recall of 97.67% on the held-out test set. Assembly quality is evaluated using minimap2 alignment against the reference genome, yielding a genome fraction of 98.6%, three misassemblies, an N50 of 9,710 bp, and 1,365 contigs. The full pipeline trains on consumer-grade hardware (NVIDIA RTX 3070) in approximately seven hours, demonstrating practical feasibility without institutional infrastructure.
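For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of its standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name `n50` and the toy contig lengths are illustrative and are not drawn from the thesis or its evaluation tooling.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (standard definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running sum always reaches half")


# Toy example: total = 300, so the halfway point is 150 bases.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # 100 covers 100 bases, 100 + 80 covers 180 >= 150
```

Read this way, an N50 of 9,710 bp means that contigs of at least that length account for half or more of the assembled bases.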
These results establish that a graph neural network can learn the structural signatures of true genomic adjacencies with sufficient accuracy to support high-quality sequence reconstruction. COGRAM contributes a proof of concept for GNN-based edge classification as an alternative to heuristic-driven graph cleaning in genome assembly pipelines.
OCLC Number
1588662443
Catalog Permalink
https://galileo-georgiasouthern.primo.exlibrisgroup.com/permalink/01GALI_GASOUTH/c9nn09/alma9916659742502950
Recommended Citation
Coggins, William. "COGRAM: A Computational Pipeline for Genome Assembly and Reconstruction via Graph Neural Networks." Master's thesis, Georgia Southern University, 2026.
Research Data and Supplementary Material
No
Included in
Artificial Intelligence and Robotics Commons, Bioinformatics Commons, Theory and Algorithms Commons