A Focused Review of Bioinformatics Pipelines for Circulating Tumor DNA (ctDNA) Analysis
Maryam Radmanfard,1,* Asal Naghipour_Kordlar,2
1. Department of Basic Sciences, Ta.C., Islamic Azad University, Tabriz, Iran 2. Faculty of Nursing, Tabriz University of Medical Sciences, Tabriz, Iran
Introduction: The analysis of circulating tumor DNA (ctDNA) from blood, a key component of liquid biopsy, offers a minimally invasive alternative to traditional tissue biopsies, which are limited by invasiveness and their inability to capture tumor heterogeneity. ctDNA analysis enables dynamic monitoring of a patient's cancer, with critical applications in guiding targeted therapies, monitoring treatment response, detecting acquired resistance, and assessing minimal residual disease (MRD). The successful translation of ctDNA analysis from research to clinical practice is entirely dependent on sophisticated bioinformatics pipelines capable of extracting faint tumor signals from a noisy background. These pipelines must overcome significant challenges, including the low fraction of ctDNA in early-stage disease, errors introduced during sequencing, and biological noise from non-tumor somatic mutations. A standardized, multi-stage workflow is therefore essential to convert raw sequencing data into reliable, clinically actionable insights {Dang, 2022 #44}.
2. Pre-analytical and Technical Considerations for ctDNA Analysis
The reliability of a ctDNA bioinformatics pipeline begins long before the first line of code is executed. Pre-analytical variables can introduce biases and artifacts that no downstream algorithm can fully correct.
2.1 Sample Collection and Processing
The choice of blood collection tube and the timing of plasma separation are critical. Standard EDTA tubes are suitable if plasma is separated within four to six hours. For longer delays, specialized tubes containing cell-stabilizing reagents (e.g., from Streck, Roche, or Qiagen) are necessary to prevent the lysis of white blood cells, which releases contaminating genomic DNA (gDNA) and dilutes the ctDNA signal. To further minimize gDNA contamination, a standardized double-centrifugation protocol is recommended, typically involving an initial low-speed spin (e.g., 1,600 × g) followed by a high-speed spin of the plasma supernatant (e.g., 16,000 × g) {Bai, 2019 #45}.
2.2 Library Preparation and Sequencing
Two primary methods are used for targeted enrichment of ctDNA: hybrid-capture and amplicon-based sequencing. Hybrid-capture methods offer more uniform coverage but may require larger DNA inputs, whereas amplicon-based methods work with less input and can achieve higher on-target rates, at the cost of uniformity. A key technology for sensitive ctDNA analysis is the use of Unique Molecular Identifiers (UMIs), short random DNA sequences ligated to each cell-free DNA (cfDNA) fragment before PCR amplification. This allows all reads originating from a single molecule to be grouped computationally, enabling the correction of PCR and sequencing errors and dramatically improving the signal-to-noise ratio {Cescon, 2020 #39}.
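The UMI grouping idea can be illustrated with a minimal Python sketch. It assumes reads are already available as (UMI, start position, sequence) tuples of equal length within a family; the function name `umi_consensus` and the `min_family_size` parameter are hypothetical and do not correspond to any particular tool. Production consensus callers additionally weight base qualities and tolerate errors in the UMI itself.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=2):
    """Collapse reads sharing a (UMI, position) key into one consensus sequence.

    reads: iterable of (umi, start_position, sequence) tuples. Sequences in a
    family are assumed to have equal length (a simplification for this sketch).
    """
    families = defaultdict(list)
    for umi, pos, seq in reads:
        families[(umi, pos)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to out-vote a PCR or sequencing error
        bases = []
        for column in zip(*seqs):
            # Majority vote per position; an exact tie is reported as 'N'
            (top, n_top), *rest = Counter(column).most_common(2)
            tied = bool(rest) and rest[0][1] == n_top
            bases.append("N" if tied else top)
        consensus[key] = "".join(bases)
    return consensus


example_reads = [
    ("AAT", 100, "ACGT"),
    ("AAT", 100, "ACGT"),
    ("AAT", 100, "ACTT"),  # one read carries an error at the third base
    ("GGC", 100, "ACGT"),  # singleton family, dropped
]
example_consensus = umi_consensus(example_reads)
```

In this toy input, the error in the third read is out-voted by the two concordant reads from the same original molecule, while the singleton family is discarded because it offers no redundancy.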
2.3 Quality Control and Performance Metrics
Rigorous quality control (QC) is essential at multiple stages. Key metrics for accepting a sequencing run include a high percentage of bases with a quality score ≥30 (%Q30), typically >80%, and a low overall error rate as determined by a PhiX spike-in control. The analytical sensitivity, or limit of detection (LOD), defines the lowest variant allele frequency (VAF) an assay can reliably detect. For MRD applications, an LOD below 0.01% (1 in 10,000) is often required, which necessitates ultra-deep sequencing (e.g., >20,000× coverage) and effective error suppression. UMI-based error correction is most effective when the UMI family size (number of reads per original molecule) is between 2 and 5, providing sufficient redundancy for error correction without wasting sequencing capacity {Tzanikou, 2020 #36}.
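The relationship between depth, VAF, and detectability can be made concrete with a short calculation. The sketch below assumes that mutant molecules are sampled independently (a binomial model) and that "depth" means unique, error-corrected molecules rather than raw reads; both function names are illustrative, and the model ignores sequencing error entirely.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def prob_at_least_k(depth, vaf, k=1):
    """P(X >= k) for X ~ Binomial(depth, vaf): the chance of sampling at
    least k mutant molecules among `depth` unique molecules."""
    p_fewer = sum(
        math.comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


q30_error = phred_to_error(30)               # 1 error in 1,000 bases
p_one = prob_at_least_k(20_000, 0.0001, k=1)  # about 0.86
p_two = prob_at_least_k(20_000, 0.0001, k=2)  # about 0.59
```

Under these assumptions, 20,000 unique molecules at a VAF of 0.01% yield only two expected mutant molecules, so the chance of observing at least one is about 86%, and the chance of observing the two or more that a caller might require is about 59%. This is one way to see why an LOD at this level needs both very deep sequencing and error suppression well below the raw Q30 error rate.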
3. Core Bioinformatic Pipeline for ctDNA Data
3.1 Pre-processing and Alignment
The pipeline begins with raw sequencing reads (FASTQ files), which are first assessed for quality using tools like FastQC. Subsequently, adapter sequences and low-quality bases are removed using tools such as Trimmomatic. For UMI-tagged data, reads are grouped into consensus families to generate error-corrected sequences. The cleaned reads are then aligned to a human reference genome, typically GRCh38. For this step, BWA-MEM is a widely used aligner valued for its accuracy, particularly with the -M flag to mark shorter split hits as secondary, which aids downstream compatibility. To improve alignment accuracy and reduce false positives, it is best practice to use a reference genome that includes "decoy" sequences and to filter out reads mapping to problematic genomic regions defined in blacklists, such as those provided by the ENCODE project {Shah, 2025 #43}.
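The blacklist filtering step can be sketched as a simple interval lookup. The example assumes BED-style, 0-based half-open intervals given as (chromosome, start, end) tuples; `build_index` and `overlaps_blacklist` are hypothetical helper names, and real pipelines would typically delegate this to an established interval tool.

```python
from bisect import bisect_right


def build_index(blacklist):
    """Sort and merge blacklist intervals per chromosome.

    blacklist: iterable of (chrom, start, end), 0-based half-open.
    Returns {chrom: (starts, ends)} with non-overlapping, sorted intervals.
    """
    index = {}
    for chrom, start, end in sorted(blacklist):
        starts, ends = index.setdefault(chrom, ([], []))
        if ends and start <= ends[-1]:
            ends[-1] = max(ends[-1], end)  # merge overlapping or adjacent intervals
        else:
            starts.append(start)
            ends.append(end)
    return index


def overlaps_blacklist(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps a blacklisted region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval that starts at or before the final base of the query
    i = bisect_right(starts, end - 1)
    return i > 0 and ends[i - 1] > start


index = build_index([("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)])
alignments = [("chr1", 250, 260), ("chr1", 300, 350), ("chr3", 10, 20)]
kept = [a for a in alignments if not overlaps_blacklist(index, *a)]
```

Merging intervals up front keeps each lookup to a single binary search, which matters when millions of alignments are checked against the same blacklist.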
3.2 Somatic Variant Calling (SNVs and Indels)
Variant calling can be performed in a tumor-normal mode, where cfDNA is compared to a matched germline sample (e.g., from white blood cells), or in a tumor-only mode. In the latter, which is common for cfDNA, a Panel of Normals (PoN) is crucial for filtering out recurrent technical artifacts and germline variants. A PoN should be constructed from at least 40 technically matched normal samples. Several variant callers have been benchmarked for ctDNA analysis {Gašperšič, 2020 #37}:
• Standard Callers: MuTect2 is highly sensitive but can be prone to false positives without careful filtering. VarDict also demonstrates high sensitivity, while LoFreq provides a good balance between sensitivity and precision.
• ctDNA-Specific Callers: A new generation of tools is designed for low-VAF detection. shearwater shows excellent precision in tumor-informed analyses (where tumor mutations are known), achieving a ROC-AUC of 0.984 for sample classification. In tumor-agnostic settings, the deep-learning-based DREAMS-vc performs best, with a ROC-AUC of 0.808.
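The Panel of Normals filter used in tumor-only mode can be sketched as a set operation. The example assumes each normal sample has already been reduced to a set of variant keys of the form (chromosome, position, ref, alt); the function names and the `min_samples` threshold are illustrative assumptions, not the behavior of any specific caller.

```python
from collections import Counter


def build_pon(normal_callsets, min_samples=2):
    """Return variants seen in at least `min_samples` normal samples.

    normal_callsets: list of iterables of (chrom, pos, ref, alt) keys,
    one per normal sample.
    """
    counts = Counter()
    for calls in normal_callsets:
        counts.update(set(calls))  # count each variant once per sample
    return {variant for variant, n in counts.items() if n >= min_samples}


def filter_with_pon(candidates, pon):
    """Drop candidate variants that recur in the panel of normals."""
    return [v for v in candidates if v not in pon]


artifact = ("chr7", 55_181_378, "C", "T")  # hypothetical recurrent artifact
private = ("chr2", 1_000, "G", "A")        # seen in a single normal only
somatic = ("chr12", 25_245_350, "C", "A")  # hypothetical candidate, absent from normals

pon = build_pon([{artifact, private}, {artifact}, set()])
passed = filter_with_pon([artifact, private, somatic], pon)
```

Requiring recurrence across several normals removes systematic artifacts while leaving variants private to a single normal sample to be handled by other filters.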
3.3 Detection of Copy Number and Structural Variations
Beyond small mutations, ctDNA can reveal large-scale genomic alterations.
• Copy Number Variations (CNVs): For genome-wide CNV analysis from low-coverage whole-genome sequencing (lcWGS), ichorCNA is a standard tool that can estimate tumor fraction and detect large, arm-level CNVs without prior tumor information. For targeted panel data, CNVkit is effective as it leverages both on-target and off-target reads for normalization, though care must be taken to correct for batch effects, often by normalizing against a panel of normal samples.
• Structural Variants (SVs): Detecting SVs from fragmented cfDNA is challenging. Specialized tools have been developed, such as SViCT, which uses local assembly of reads to achieve high sensitivity and precision even at ctDNA fractions of 0.01%. Aperture offers a novel, alignment-free k-mer-based approach that is extremely fast and excels at detecting SVs in repetitive regions of the genome {Moon, 2025 #42}.
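Normalizing coverage against a panel of normal samples, as mentioned for CNV analysis above, can be sketched as a per-bin log2 ratio. This is a deliberately simplified illustration and not the algorithm of ichorCNA or CNVkit, which also apply steps such as GC correction and segmentation; the function name and the `pseudocount` parameter are assumptions.

```python
import math
from statistics import median


def log2_ratios(sample_counts, normal_counts_list, pseudocount=0.5):
    """Per-bin log2 copy ratio of a sample against the median of normals.

    Each input is a list of read counts over the same genomic bins. Counts
    are converted to fractions of the library so that samples sequenced to
    different depths are comparable.
    """
    def normalize(counts):
        adjusted = [c + pseudocount for c in counts]  # avoid log of zero
        total = sum(adjusted)
        return [a / total for a in adjusted]

    sample = normalize(sample_counts)
    normals = [normalize(counts) for counts in normal_counts_list]
    reference = [median(column) for column in zip(*normals)]
    return [math.log2(s / r) for s, r in zip(sample, reference)]


normals = [[100, 100, 100, 100], [200, 200, 200, 200], [50, 50, 50, 50]]
ratios = log2_ratios([100, 100, 100, 200], normals, pseudocount=0)
```

In this toy case the last bin has doubled coverage, which appears as a positive log2 ratio, while the remaining bins shift slightly negative because the gain inflates the sample's total read count. In plasma, such shifts are further attenuated in proportion to the tumor fraction.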
Methods:
5.1 The Conundrum of Clonal Hematopoiesis (CHIP)
A major challenge to the specificity of ctDNA assays is Clonal Hematopoiesis of Indeterminate Potential (CHIP), an age-related phenomenon in which hematopoietic stem cells acquire somatic mutations (e.g., in DNMT3A, TET2, ASXL1, TP53) and clonally expand. These non-cancer mutations are shed into the blood and can be mistaken for tumor-derived variants. The gold standard for filtering CHIP is paired sequencing of cfDNA and matched white blood cells (WBCs). However, because of the cost of paired sequencing, plasma-only computational strategies are in development. These include blacklisting variants in common CHIP genes and monitoring VAFs over time, as CHIP variants tend to remain stable while ctDNA levels often fluctuate with treatment response {Gašperšič, 2020 #37}.
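Both CHIP-filtering strategies described above can be sketched in a few lines. The example assumes variants are stored as dictionaries mapping a variant key to its VAF; the thresholds `min_wbc_vaf` and `max_cv`, as well as the function names and toy values, are illustrative assumptions rather than validated cut-offs.

```python
from statistics import mean, pstdev


def split_by_wbc(plasma_vafs, wbc_vafs, min_wbc_vaf=0.001):
    """Separate plasma variants into likely tumor-derived and likely
    CHIP/germline, based on their presence in matched WBC DNA."""
    tumor, chip = {}, {}
    for variant, vaf in plasma_vafs.items():
        if wbc_vafs.get(variant, 0.0) >= min_wbc_vaf:
            chip[variant] = vaf
        else:
            tumor[variant] = vaf
    return tumor, chip


def is_stable(vafs, max_cv=0.2):
    """Plasma-only heuristic: a VAF series with a low coefficient of
    variation across time points is more consistent with CHIP."""
    m = mean(vafs)
    return m > 0 and pstdev(vafs) / m <= max_cv


plasma = {"DNMT3A:R882H": 0.012, "KRAS:G12D": 0.004}
wbc = {"DNMT3A:R882H": 0.015}
tumor, chip = split_by_wbc(plasma, wbc)

stable = is_stable([0.020, 0.021, 0.019])    # flat over time
fluctuating = is_stable([0.05, 0.01, 0.001])  # falls with treatment
```

The longitudinal heuristic is only suggestive: a tumor variant in a non-responding patient can also remain stable, which is one reason matched WBC sequencing remains the reference approach.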
5.2 The Role of Machine Learning
Machine learning (ML) is increasingly being integrated into ctDNA pipelines to improve accuracy. ML models can learn complex error signatures from sequencing data and filter false-positive variants more accurately than traditional hard-filtering methods. Furthermore, ML is being used to integrate multi-modal data, such as ctDNA methylation patterns and fragmentomics (analysis of cfDNA fragment sizes), with SNV data to enhance the sensitivity of early cancer detection {Oliveira, 2020 #40}.
Results: The efficacy of any ctDNA bioinformatics pipeline is ultimately measured by its performance in detecting low-VAF mutations accurately. Numerous benchmarking studies have evaluated various variant callers, revealing a clear hierarchy of performance that is highly dependent on the use of UMIs and the specific analytical context.
In standard analyses of non-UMI data, a distinct trade-off between sensitivity and precision is observed. Callers like MuTect2 and VarDict consistently demonstrate high sensitivity, able to detect a large number of true positive variants. However, this sensitivity often comes at the cost of specificity, with MuTect2 in particular returning a higher number of uniquely called variants, which is an indicator of potential false positives. In contrast, LoFreq often provides the best balance, achieving high precision by detecting the fewest putative false positives in some benchmarks, making it a reliable choice when specificity is paramount.
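The sensitivity and precision trade-off discussed here can be computed directly from a caller's output and a truth set. The sketch below assumes both are available as collections of variant keys; the function name and the toy variants are hypothetical.

```python
def sensitivity_precision(called, truth):
    """Return (sensitivity, precision) of a call set against a truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else float("nan")
    precision = true_positives / len(called) if called else float("nan")
    return sensitivity, precision


truth_set = {"v1", "v2", "v3", "v4"}
sensitive_caller = {"v1", "v2", "v3", "v4", "x1", "x2", "x3", "x4"}
conservative_caller = {"v1", "v2", "v3"}

sens_a, prec_a = sensitivity_precision(sensitive_caller, truth_set)
sens_b, prec_b = sensitivity_precision(conservative_caller, truth_set)
```

In this toy comparison the first call set recovers every true variant but half of its calls are false positives, whereas the second misses one variant and reports no false positives, mirroring the pattern described for high-sensitivity versus high-precision callers.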
The integration of UMIs fundamentally improves performance, enabling callers to better distinguish true variants from technical artifacts. UMI-aware callers have the potential to outperform standard tools by directly incorporating error-corrected consensus reads into their models. For instance, UMI-VarCal has been shown to detect fewer false positives and a higher percentage of known COSMIC variants in real-world samples, indicating high specificity and sensitivity. UMIErrorCorrect is exceptionally sensitive at the lowest VAFs, though its false-positive rate can increase with sequencing depth.
Further performance gains have been achieved with a new generation of callers specifically designed for the unique error profiles of cfDNA. In tumor-informed analyses, where a tumor's mutational profile is known beforehand, shearwater demonstrates the highest precision, achieving a ROC-AUC of 0.984 for classifying samples. For tumor-agnostic applications, such as initial cancer screening, the deep learning-based DREAMS-vc shows the best performance, with a ROC-AUC of 0.808. These results underscore that the optimal variant caller is not universal; its selection depends critically on the specific application, data type (UMI vs. non-UMI), and whether prior tumor information is available.
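For readers less familiar with the metric, ROC-AUC for sample classification can be computed as the probability that a randomly chosen cancer sample receives a higher score than a randomly chosen control. The sketch below uses this pairwise definition; the function name and the scores are made up for illustration and are unrelated to the benchmark values quoted above.

```python
def roc_auc(scores_positive, scores_negative):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a correct pair."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


cancer_scores = [0.9, 0.8, 0.4]   # hypothetical per-sample ctDNA scores
control_scores = [0.5, 0.3, 0.1]
auc = roc_auc(cancer_scores, control_scores)  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for interpreting the difference between the tumor-informed and tumor-agnostic results.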
Conclusion:
6.1 Summary and Unmet Needs
Bioinformatics pipelines for ctDNA analysis have matured to the point of enabling sensitive and specific detection of various genomic alterations. UMI-based error correction and ctDNA-aware variant callers are now standard for high-sensitivity applications. However, significant unmet needs remain, particularly the development of robust and cost-effective methods for filtering CHIP variants from plasma-only data and improving the detection of complex structural variants.
6.2 Standardization, Reproducibility, and Clinical Translation
For ctDNA analysis to become a routine clinical tool, standardization and reproducibility are paramount. The lack of standard protocols is a major barrier to comparing results across studies. Consortia like BloodPAC are working with regulatory agencies to establish guidelines for the analytical validation of ctDNA assays. For computational reproducibility, the use of workflow managers like Nextflow and Snakemake, combined with containerization technologies such as Docker or Singularity, is becoming essential for creating portable and scalable pipelines. Finally, for clinical reporting, variants must be interpreted and classified according to established guidelines from organizations like the Association for Molecular Pathology (AMP), American Society of Clinical Oncology (ASCO), and College of American Pathologists (CAP), which provide a four-tiered system for categorizing somatic variants based on clinical significance. Continued innovation in these areas will be critical to realizing the full transformative potential of ctDNA analysis in precision oncology {Malla, 2022 #38}.