Extract the first n sequences from a FASTA file
Published on
A FASTA file consists of a series of biological sequences (DNA, RNA, or protein). It looks like this:
>gi|173695|gb|M59083.1|AETRR16S Acetomaculum ruminis 16S ribosomal RNA
NNTAAACAAGAGAGTTCGATCCTGGCTCAGGATNAACGCTGGCGGCATGCCTAACACATGCAAGTCGAAC
GGAGTGCTTGTAGAAGCTTTTTCGGAAGTGGAAATAAGTTACTTAGTGGCGGACGGGTGAGTAACGCGTG
>gi|310975154|ref|NR_037018.1| Acidaminococcus fermentans strain VR4 16S ribosomal RNA gene, partial sequence
GGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAGAACTTTCTTCGGAATGTTC
TTAGTGGCGAACGGGTGAGTAACGCGTAGGCAACCTNCCCCTCTGTTGGGGACAACATTCCGAAAGGGAT
There probably exist dozens of python scripts to extract the first \(n\) sequences from a FASTA file. Here I will show an awk one-liner that performs this task, and explain how it works.
Here it is (assuming the number of sequences is stored in the
environment variable NSEQS
):
awk "/^>/ {n++} n>$NSEQS {exit} {print}"
This one-liner can read from standard input (e.g. as part of a pipe), or you can append one or more file names to the end of the command, e.g.
awk "/^>/ {n++} n>$NSEQS {exit} {print}" file.fasta
An awk script consists of one or more statements of the form
pattern { actions }
. The input is read line-by-line, and if
the current line matches the pattern, the corresponding actions are
executed.
Our script consists of 3 statements:
/^>/ {n++}
increments the counter each time a new sequence is started./.../
denotes a regular expression pattern, and^>
is a regular expression that matches the>
sign at the beginning of a line.An uninitialized variable in awk has the value 0, which is exactly what we want here. If we needed some other initial value (say, 1), we could have added a
BEGIN
pattern like this:BEGIN {n=1}
.n>$NSEQS {exit}
aborts processing once the counter reaches the desired number of sequences.{print}
is an action without a pattern (and thus matching every line), which prints every line of the input until the script is aborted byexit
.
A shorter and more cryptic way to write the same is
awk "/^>/ {n++} n>$NSEQS {exit} 1"
Here I replaced the action-without-pattern by a
pattern-without-action. The pattern 1
(meaning “true”)
matches every line, and when the action is omitted, it is assumed to be
{print}
.