College of Graduate Studies: Theses & Dissertations

Term of Award

Spring 2026

Degree Name

Master of Science, Computer Science (M.S.C.S.)

Document Type and Release Option

Thesis (open access)

Copyright Statement / License for Reuse

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.

Department

Department of Computer Science

Committee Chair

Vijayalakshmi Ramasamy

Committee Member 1

Weitian Tong

Committee Member 2

Lixin Li

Abstract

Genome assembly — the reconstruction of a complete DNA sequence from short, overlapping reads — remains a fundamental challenge in computational biology. A central difficulty is distinguishing true genomic overlaps from spurious connections arising from repetitive sequences, a task that traditional assemblers address through hand-tuned heuristic rules applied to de Bruijn or overlap graphs. This thesis introduces COGRAM (Coggins–Ramasamy Assembly Method), a genome assembly pipeline that reframes sequence reconstruction as an edge classification task on a k-mer overlap graph, replacing heuristic graph cleaning with a learned model.

COGRAM constructs a directed overlap graph from raw sequencing reads using a k-mer length of 31, yielding approximately 4.95 million edges. Node representations are derived from three structural features — in-degree, out-degree, and k-mer coverage — and refined through a two-layer Graph Convolutional Network (GCN) with batch normalization and edge dropout regularization, producing 64-dimensional hidden representations and 3-dimensional node embeddings. Edge embeddings are formed via four-way concatenation of endpoint representations, capturing both absolute and relational signals in a 12-dimensional vector that is passed to a multilayer perceptron for binary classification.
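As an illustration only, the feature and edge-embedding steps described above can be sketched in plain Python. The abstract does not give the exact graph construction or the four components of the concatenation, so this sketch makes three labeled assumptions: nodes are k-mers, with a directed edge between consecutive k-mers in a read; the raw three-feature vectors stand in for the 3-dimensional GCN outputs (the GCN itself is omitted); and the four-way concatenation is taken to be [h_u, h_v, |h_u - h_v|, h_u * h_v]. All function names are hypothetical and are not taken from COGRAM.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Assumed construction: nodes are k-mers, and a directed edge u -> v
    links each pair of consecutive k-mers within a read."""
    coverage = Counter()  # how many times each k-mer was observed
    edges = set()
    for read in reads:
        ks = kmers(read, k)
        coverage.update(ks)
        edges.update(zip(ks, ks[1:]))
    return coverage, edges


def node_features(coverage, edges):
    """Three structural features per node: in-degree, out-degree, coverage."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return {
        n: (float(in_deg[n]), float(out_deg[n]), float(coverage[n]))
        for n in coverage
    }


def edge_embedding(h_u, h_v):
    """Assumed four-way concatenation: both endpoints (absolute signal) plus
    their absolute difference and elementwise product (relational signal).
    Two 3-dimensional inputs give a 12-dimensional edge vector."""
    diff = [abs(a - b) for a, b in zip(h_u, h_v)]
    prod = [a * b for a, b in zip(h_u, h_v)]
    return list(h_u) + list(h_v) + diff + prod


if __name__ == "__main__":
    # Toy reads with k = 3; the thesis itself uses k = 31.
    coverage, edges = build_graph(["ACGTA", "CGTAC"], k=3)
    feats = node_features(coverage, edges)
    for u, v in sorted(edges):
        print(u, "->", v, edge_embedding(feats[u], feats[v]))
```

In the pipeline as described, the 12-dimensional vector would then be passed to the multilayer perceptron for binary classification; that step is left out of this sketch.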

The pipeline is evaluated on Escherichia coli K-12 MG1655, a well-characterized reference genome, with a stratified 70/15/15 train-validation-test split. Because true overlap edges constitute a small minority of all candidate edges, standard accuracy is a misleading metric; model performance is instead assessed using area under the precision-recall curve (AUPRC). COGRAM achieves an AUPRC of 0.9995, a precision of 99.99%, and a recall of 97.67% on the held-out test set. Assembly quality is evaluated using minimap2 alignment against the reference genome, yielding a genome fraction of 98.6%, three misassemblies, an N50 of 9,710 bp, and 1,365 contigs. The full pipeline trains on consumer-grade hardware (NVIDIA RTX 3070) in approximately seven hours, demonstrating practical feasibility without institutional infrastructure.
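For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how N50 and an AUPRC estimate can be computed. It is not the evaluation code used in the thesis. The abstract does not say which AUPRC estimator was used, so average precision (a common step-wise estimate) is assumed here, and the function names and toy inputs are hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a true overlap edge, 0 otherwise.
    scores: classifier scores, higher meaning more likely true.
    Tied scores are ranked in arbitrary order in this simple sketch.
    """
    positives = sum(labels)
    if positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total_precision = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total_precision += hits / rank  # precision at this recall step
    return total_precision / positives


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Unlike accuracy, average precision ignores correctly rejected negatives, which is why it remains informative when true edges are a small minority of candidates.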

These results establish that a graph neural network can learn the structural signatures of true genomic adjacencies with sufficient accuracy to support high-quality sequence reconstruction. COGRAM contributes a proof of concept for GNN-based edge classification as an alternative to heuristic-driven graph cleaning in genome assembly pipelines.

OCLC Number

1588662443

Research Data and Supplementary Material

No
