Extract the first n sequences from a FASTA file

Published on

A FASTA file consists of a series of biological sequences (DNA, RNA, or protein). It looks like this:

>gi|173695|gb|M59083.1|AETRR16S Acetomaculum ruminis 16S ribosomal RNA

>gi|310975154|ref|NR_037018.1| Acidaminococcus fermentans strain VR4 16S ribosomal RNA gene, partial sequence

There probably exist dozens of python scripts to extract the first \(n\) sequences from a FASTA file. Here I will show an awk one-liner that performs this task, and explain how it works.

Here it is (assuming the number of sequences is stored in the environment variable NSEQS):

awk "/^>/ {n++} n>$NSEQS {exit} {print}"

This one-liner can read from standard input (e.g. as part of a pipe), or you can append one or more file names to the end of the command, e.g.

awk "/^>/ {n++} n>$NSEQS {exit} {print}" file.fasta

An awk script consists of one or more statements of the form pattern { actions }. The input is read line-by-line, and if the current line matches the pattern, the corresponding actions are executed.

Our script consists of 3 statements:

  1. /^>/ {n++} increments the counter each time a new sequence is started. /.../ denotes a regular expression pattern, and ^> is a regular expression that matches the > sign at the beginning of a line.

    An uninitialized variable in awk has the value 0, which is exactly what we want here. If we needed some other initial value (say, 1), we could have added a BEGIN pattern like this: BEGIN {n=1}.

  2. n>$NSEQS {exit} aborts processing once the counter reaches the desired number of sequences.

  3. {print} is an action without a pattern (and thus matching every line), which prints every line of the input until the script is aborted by exit.

A shorter and more cryptic way to write the same is

awk "/^>/ {n++} n>$NSEQS {exit} 1"

Here I replaced the action-without-pattern by a pattern-without-action. The pattern 1 (meaning “true”) matches every line, and when the action is omitted, it is assumed to be {print}.