FASTA format

In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.

Write a program that reads a FASTA file such as:

>Rosetta_Example_1
THERECANBENOSPACE
>Rosetta_Example_2
THERECANBESEVERAL
LINESBUTTHEYALLMUST
BECONCATENATED

And prints the following output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.

C++

<lang cpp>#include <iostream>

include <fstream>

int main( int argc, char **argv ){

   if( argc <= 1 ){
       std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl;
       return -1;
   }

   std::ifstream input(argv[1]);
   if(!input.good()){
       std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl;
       return -1;
   }

   std::string line, name, content;
   while( std::getline( input, line ).good() ){
       if( line.empty() || line[0] == '>' ){ // Identifier marker
           if( !name.empty() ){ // Print out what we read from the last entry
               std::cout << name << " : " << content << std::endl;
               name.clear();
           }
           if( !line.empty() ){
               name = line.substr(1);
           }
           content.clear();
       } else if( !name.empty() ){
           if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed
               name.clear();
               content.clear();
           } else {
               content += line;
           }
       }
   }
   if( !name.empty() ){ // Print out what we read from the last entry
       std::cout << name << " : " << content << std::endl;
   }
   
   return 0;

}</lang>

Output:

Rosetta_Example_1 : THERECANBENOSPACE
Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

D

<lang d>import std.stdio, std.string;

void main() {

   immutable fileName = "fasta_format_data.fasta";

   bool first = true;

   foreach (const line; fileName.File.byLine) {
       if (line[0] == '>') {
           if (first) {
               first = false;
           } else {
               writeln;
           }

           write(line[1 .. $].strip, ": ");
       } else {
           line.strip.write;
       }
   }

   writeln;

}</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Perl 6

Certainly not the most elegant way to do it, but that's a start: <lang Perl 6>say "{.[0]}: {.[1]>>.chomp.join}" for ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED".comb: / '>' (\N+)\n (<!before '>'>\N+\n?)+ /, :match</lang>

Python

I use a string to mimic an input file. If it was an input file, then the file is read line-by-line and I use a generator expression yielding key, value pairs as soon as they are read keeping the minimum in memory. <lang python>import io

FASTA=\ >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED

infile = io.StringIO(FASTA)

def fasta_parse(infile):

   key = 
   for line in infile:
       if line.startswith('>'):
           if key:
               yield key, val
           key, val = line[1:].rstrip().split()[0], 
       elif key:
           val += line.rstrip()
   if key:
       yield key, val

print('\n'.join('%s: %s' % keyval for keyval in fasta_parse(infile)))</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Racket

lang racket

(let loop ([m #t])

 (when m
   (when (regexp-try-match #rx"^>" (current-input-port))
     (unless (eq? #t m) (newline))
     (printf "~a: " (read-line)))
   (loop (regexp-match #rx"\n" (current-input-port) 0 #f
                       (current-output-port)))))

(newline) </lang>

REXX

version 1

This REXX version correctly processes the examples shown. <lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */

  do  while  lines(iFID)\==0          /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')          /*read a line (record) from file,*/
                                      /*     and strip trailing blanks.*/
  if left(x,1)=='>'  then do
                          if $\==  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=$||x
  end   /*j*/

if $\== then say name':' $

                                      /*stick a fork in it, we're done.*/</lang>

output when using the default input file

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

version 2

This REXX version handles (see the talk page):

blank lines
sequences that end in an asterisk [*]
sequences that contain blanks, tabs, and other whitespace
sequence names that are identified with a semicolon [;]

<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */

  do  while  lines(iFID)\==0          /*process the FASTA file contents*/
  x=strip(linein(iFID), 'T')          /*read a line (record) from file,*/
                                      /*     and strip trailing blanks.*/
  if x==  then iterate              /*ignore blank lines.            */
  if left(x,1)==';'  then do
                          if name== then name=substr(x,2)
                          say x
                          iterate
                          end
  if left(x,1)=='>'  then do
                          if $\==  then say name':'  $
                          name=substr(x,2)
                          $=
                          end
                     else $=space($||translate(x,,'*'),0)
  end   /*j*/

if $\== then say name':' $

                                      /*stick a fork in it, we're done.*/</lang>

input The FASTA2.IN file is shown below:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

output when the FASTA2.IN input file is used:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC
MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK
gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY

Run BASIC

<lang runbasic>a$ = ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED"

i = 1 while i <= len(a$)

 if mid$(a$,i,17) = ">Rosetta_Example_" then 
   print 
   print mid$(a$,i,18);": ";
   i = i + 17
  else
   if asc(mid$(a$,i,1)) > 20 then print mid$(a$,i,1);
 end if
 i = i + 1

wend</lang>

Output:

>Rosetta_Example_1: THERECANBENOSPACE
>Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED

Tcl

<lang tcl>proc fastaReader {filename} {

   set f [open $filename]
   set sep ""
   while {[gets $f line] >= 0} {

if {[string match >* $line]} { puts -nonewline "$sep[string range $line 1 end]: " set sep "\n" } else { puts -nonewline $line }

   }
   puts ""
   close $f

}

fastaReader ./rosettacode.fas</lang>

Output:

Rosetta_Example_1: THERECANBENOSPACE
Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED