Configuration

This page covers the container file structure, config.yaml settings, input file formats, and custom primer/index setup.

Container File Structure

The Docker and Singularity images create the following directory layout:

Within the main 16s-demux directory:

config/ — configuration files, fastq file list, samplesheet, and index files
workflow/rules/ — analysis code
workflow/out/ — output directory for real analyses
workflow/test_out/ — output directory for test runs
fastq_data/ — test sequencing data

The Snakefile and Slurm submission scripts are in the top-level 16s-demux directory.

config.yaml Reference

The config/config.yaml file controls all pipeline settings. The key fields are:

Field	Description	Default
`samplesheet`	Path to the samplesheet file	`samplesheet.txt`
`indices`	Path to the index file for demultiplexing	`indexfordemux.txt`
`fastqlist`	Path to the fastq file list	`fastq.txt`
`lenR1index`	Length of the read 1 index (longest phase)	`7`
`lenR2index`	Length of the read 2 index (longest phase)	`7`
`lenR1primer`	Length of read 1 primer (gene-specific + spacer)	`23`
`lenR2primer`	Length of read 2 primer (gene-specific + spacer)	`24`

All paths are relative to the config/ directory unless absolute paths are provided.

Input Files

Three inputs are required beyond the sequencing data itself: a fastq file list, a samplesheet, and the sequencing fastq data.

Note

Most common pipeline errors arise from input file formatting issues. The pipeline validates inputs automatically — check workflow/out/inputCheck_log.txt for any warnings.

Fastq Data

Naming rules:

Sample names must not contain spaces, underscores, or periods (hyphens are fine)
Rename files before demultiplexing if they violate these rules

Format:

Gzipped fastq files (.fastq.gz)

Location:

Fastq files can be located anywhere accessible to the container. Their paths are defined by combining the fastqdir variable in config.yaml with the paths in the fastq file list.

`fastqdir` (config.yaml)	Fastq file list entry
`""`	`/home/users/TEST/sequencing/exp01/fastqs/TEST_R1_001.fastq.gz`
`/home/users/TEST/sequencing/exp01/fastqs/`	`TEST_R1_001.fastq.gz`

Tip

Periods in sample names are accepted for demultiplexing but will cause issues with downstream tools like DADA2. Avoid them if you plan to use DADA2.

Fastq File List

Location: Place in the config/ directory and update the fastqlist field in config.yaml (default name: fastq.txt).

Format: Tab-delimited text file with three columns:

Column	Description
`read1`	Path to read 1 fastq.gz file
`read2`	Path to read 2 fastq.gz file
`file`	Shortened file identifier

The file column should use the format {RunName}-{round2plate}-{well}, where:

RunName — any identifier without underscores, spaces, or periods
round2plate — plate identifier for round 2 barcodes
well — well number

Example:

read1	read2	file
`../fastq_data/test_inputs/KKRP-001_S441_R1_001.fastq.gz`	`../fastq_data/test_inputs/KKRP-001_S441_R2_001.fastq.gz`	`15mc-003-P08B01-A01`
`../fastq_data/test_inputs/KKRP-002_S442_R1_001.fastq.gz`	`../fastq_data/test_inputs/KKRP-002_S442_R2_001.fastq.gz`	`15mc-003-P08B01-A02`

Important

The file field in the fastq file list must match the first part of the filename field in the samplesheet.

Samplesheet

Location: Place in the config/ directory and update the samplesheet field in config.yaml (default name: samplesheet.txt).

Format: Tab-delimited file (.tsv or .txt) with a header row. Three columns are required:

Column	Description
`filename`	Format: `{RunName}-{round2plate}-{well}-L{round1index}`
`sample`	Final sample name after demultiplexing (no underscores or periods)
`group`	Group identifier for output organization (no spaces or slashes)

Additional columns can be included freely — only filename, sample, and group are used by the pipeline.

The {RunName}-{round2plate}-{well} portion of filename must match the corresponding file entries in the fastq file list. The round1index value should match the phase entry in the indexfordemux.txt table.

Group behavior: Samples in different groups are output into separate subdirectories within trimmed/. If you don't need grouping, use a single group name for all samples or leave the column blank (keep the group header).

Example (key columns shown):

filename	sample	group
`15mc-003-P08B01-A01-L4`	`P5-A01-plateP-wellA1`	`group1`
`15mc-003-P08B01-A02-L4`	`P5-A02-plateP-wellA2`	`group1`
`15mc-003-P08B01-A01-L5`	`P6-A01-plateW-wellA5`	`group2`
`15mc-003-P08B01-A02-L5`	`P6-A02-plateW-wellA6`	`group2`

Index Files

The indexfordemux.txt file contains the inline indexes used for demultiplexing. The default file uses the 16S V4 index set.

Pre-built Index Sets

Index files for 13 regions are provided in the other_index_for_demux/ directory:

Region	File
V1 - V2	`V1-V2_index_for_demux.txt`
V1 - V3	`V1-V3_index_for_demux.txt`
V2 - V3	`V2-V3_index_for_demux.txt`
V3	`V3_index_for_demux.txt`
V3 - V4	`V3-V4_index_for_demux.txt`
V4	`V4_index_for_demux.txt` (default)
V4 - V5	`V4-V5_index_for_demux.txt`
V5	`V5_index_for_demux.txt`
V5 - V7	`V5-V7_index_for_demux.txt`
V6	`V6_index_for_demux.txt`
V6 - V7	`V6-V7_index_for_demux.txt`
V6 - V8	`V6-V8_index_for_demux.txt`
V7 - V9	`V7-V9_index_for_demux.txt`

To use a different region, copy the appropriate file into config/ and update the indices field in config.yaml.

Tip

For all pre-built index sets, the default config.yaml length parameters are correct — no changes needed for lenR1index, lenR2index, lenR1primer, or lenR2primer.

Custom Index Files

If you are using custom primers, you need to create a custom indexfordemux.txt file and may need to update the length parameters in config.yaml.

An Excel template (other_index_for_demux.xlsx) is provided in the other_index_for_demux/ directory to help derive custom index files.

Understanding the Read Structure

The final read structure after library prep looks like this (lengths not to scale):

The round 1 indexes use variable-length "phases" (0–7). Each phase has a different number of index bases, with the remaining positions filled by spacer and gene-specific primer sequence:

Phase	Variable FP	Variable RP
0		`ATGGACT`
1	`T`	`GCTAGC`
2	`GG`	`TGACT`
3	`ACT`	`CGGT`
4	`TAAC`	`GTA`
5	`CAGTC`	`AA`
6	`ATCGAT`	`C`
7	`GCAAGTC`

During demultiplexing, reads are treated as having 7 base pair indexes on both ends. Any positions not filled by the actual index contain spacer or gene-specific primer sequence. In the default V4 index set:

Underlined = actual index bases
lowercase = spacer bases
BOLD UPPERCASE = gene-specific primer region

Phase	read1index	read2index	bc
0	cagtAGA	ATGGACT	CAGTAGAATGGACT
1	TcagtAG	GCTAGCa	TCAGTAGGCTAGCA
2	GGcagtA	TGACTat	GGCAGTATGACTAT
3	ACTcagt	CGGTatc	ACTCAGTCGGTATC
4	TAACcag	GTAatcc	TAACCAGGTAATCC
5	CAGTCca	AAatccT	CAGTCCAAAATCCT
6	ATCGATc	CatccTA	ATCGATCCATCCTA
7	GCAAGTC	atccTAC	GCAAGTCATCCTAC

Changing Only the Gene-Specific Region

If you use the same primer design and index scheme but target a different gene, only the gene-specific regions within the indexes need to change.

Example: For the default V4 primers, the gene starts with AGA... and ends with GTA on the forward strand, giving regions of homology AGA and TAC (both 5' to 3').

For a gene reading ATG ... CGT, the regions of homology become ATG and ACG (both 5' to 3'). The updated index table would be:

Phase	read1index	read2index	bc
0	cagtATG	ATGGACT	CAGTAGAATGGACT
1	TcagtAT	GCTAGCa	TCAGTAGGCTAGCA
2	GGcagtA	TGACTat	GGCAGTATGACTAT
3	ACTcagt	CGGTatc	ACTCAGTCGGTATC
4	TAACcag	GTAatcc	TAACCAGGTAATCC
5	CAGTCca	AAatccA	CAGTCCAAAATCCT
6	ATCGATc	CatccAC	ATCGATCCATCCTA
7	GCAAGTC	atccACG	GCAAGTCATCCTAC

Only the gene-specific regions (bold) have changed — index and spacer sequences remain identical.

Mixed Base Characters

Avoid mixed base characters such as W or N in the first three positions of your gene-specific primer. If such bases are present, you must include extra entries in the indexfordemux.txt table — one for each possible base (e.g., for W, one entry with A and one with T). Each affected sample should appear twice in the samplesheet, and the output files will need to be merged downstream.

Updating Length Parameters

If your custom primers change the index or primer lengths, update config.yaml:

Parameter	Description	Default
`lenR1index`	Longest read 1 index length	`7`
`lenR2index`	Longest read 2 index length	`7`
`lenR1primer`	Gene-specific primer + spacer length (read 1)	`23`
`lenR2primer`	Gene-specific primer + spacer length (read 2)	`24`

Note

For all pre-built 16S index sets (V1–V2 through V7–V9), the default length values are correct and do not need to be changed.