FASTA format
In bioinformatics, long character strings are often encoded in a format called FASTA. A FASTA file can contain several strings, each identified by a name marked by a “>” character at the beginning of the line.
Write a program that reads a FASTA file such as:
>Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED
And prints the following output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Note that a high-quality implementation will not hold the entire file in memory at once; real FASTA files can be multiple gigabytes in size.
C++
<lang cpp>#include <iostream>
- include <fstream>
int main( int argc, char **argv ){
if( argc <= 1 ){ std::cerr << "Usage: "<<argv[0]<<" [infile]" << std::endl; return -1; }
std::ifstream input(argv[1]); if(!input.good()){ std::cerr << "Error opening '"<<argv[1]<<"'. Bailing out." << std::endl; return -1; }
std::string line, name, content; while( std::getline( input, line ).good() ){ if( line.empty() || line[0] == '>' ){ // Identifier marker if( !name.empty() ){ // Print out what we read from the last entry std::cout << name << " : " << content << std::endl; name.clear(); } if( !line.empty() ){ name = line.substr(1); } content.clear(); } else if( !name.empty() ){ if( line.find(' ') != std::string::npos ){ // Invalid sequence--no spaces allowed name.clear(); content.clear(); } else { content += line; } } } if( !name.empty() ){ // Print out what we read from the last entry std::cout << name << " : " << content << std::endl; } return 0;
}</lang>
- Output:
Rosetta_Example_1 : THERECANBENOSPACE Rosetta_Example_2 : THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
D
<lang d>import std.stdio, std.string;
void main() {
immutable fileName = "fasta_format_data.fasta";
bool first = true;
foreach (const line; fileName.File.byLine) { if (line[0] == '>') { if (first) { first = false; } else { writeln; }
write(line[1 .. $].strip, ": "); } else { line.strip.write; } }
writeln;
}</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Perl 6
Certainly not the most elegant way to do it, but that's a start: <lang Perl 6>say "{.[0]}: {.[1]>>.chomp.join}" for ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED".comb: / '>' (\N+)\n (<!before '>'>\N+\n?)+ /, :match</lang>
Python
I use a string to mimic an input file. If it was an input file, then the file is read line-by-line and I use a generator expression yielding key, value pairs as soon as they are read keeping the minimum in memory. <lang python>import io
FASTA=\ >Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED
infile = io.StringIO(FASTA)
def fasta_parse(infile):
key = for line in infile: if line.startswith('>'): if key: yield key, val key, val = line[1:].rstrip().split()[0], elif key: val += line.rstrip() if key: yield key, val
print('\n'.join('%s: %s' % keyval for keyval in fasta_parse(infile)))</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Racket
<lang racket>
- lang racket
(let loop ([m #t])
(when m (when (regexp-try-match #rx"^>" (current-input-port)) (unless (eq? #t m) (newline)) (printf "~a: " (read-line))) (loop (regexp-match #rx"\n" (current-input-port) 0 #f (current-output-port)))))
(newline) </lang>
REXX
version 1
This REXX version correctly processes the examples shown. <lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */
do while lines(iFID)\==0 /*process the FASTA file contents*/ x=strip(linein(iFID), 'T') /*read a line (record) from file,*/ /* and strip trailing blanks.*/ if left(x,1)=='>' then do if $\== then say name':' $ name=substr(x,2) $= end else $=$||x end /*j*/
if $\== then say name':' $
/*stick a fork in it, we're done.*/</lang>
output when using the default input file
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
version 2
This REXX version handles (see the talk page):
- blank lines
- sequences that end in an asterisk [*]
- sequences that contain blanks, tabs, and other whitespace
- sequence names that are identified with a semicolon [;]
<lang rexx>/*REXX pgm reads a (bioinformational) FASTA file and displays contents.*/ parse arg iFID _ . /*iFID = input file to be read.*/ if iFID== then iFID='FASTA.IN' /*Not specified? Use the default*/ $=; name= /*default values (so far). */
do while lines(iFID)\==0 /*process the FASTA file contents*/ x=strip(linein(iFID), 'T') /*read a line (record) from file,*/ /* and strip trailing blanks.*/ if x== then iterate /*ignore blank lines. */ if left(x,1)==';' then do if name== then name=substr(x,2) say x iterate end if left(x,1)=='>' then do if $\== then say name':' $ name=substr(x,2) $= end else $=space($||translate(x,,'*'),0) end /*j*/
if $\== then say name':' $
/*stick a fork in it, we're done.*/</lang>
input The FASTA2.IN file is shown below:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC* >MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA DIDGDGQVNYEEFVQMMTAK* >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY
output when the FASTA2.IN input file is used:
;LCBO - Prolactin precursor - Bovine ; a sample sequence in FASTA format LCBO - Prolactin precursor - Bovine: MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSSEMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHLVTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDEDARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken: ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]: LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY
Run BASIC
<lang runbasic>a$ = ">Rosetta_Example_1 THERECANBENOSPACE >Rosetta_Example_2 THERECANBESEVERAL LINESBUTTHEYALLMUST BECONCATENATED"
i = 1 while i <= len(a$)
if mid$(a$,i,17) = ">Rosetta_Example_" then print print mid$(a$,i,18);": "; i = i + 17 else if asc(mid$(a$,i,1)) > 20 then print mid$(a$,i,1); end if i = i + 1
wend</lang>
- Output:
>Rosetta_Example_1: THERECANBENOSPACE >Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED
Tcl
<lang tcl>proc fastaReader {filename} {
set f [open $filename] set sep "" while {[gets $f line] >= 0} {
if {[string match >* $line]} { puts -nonewline "$sep[string range $line 1 end]: " set sep "\n" } else { puts -nonewline $line }
} puts "" close $f
}
fastaReader ./rosettacode.fas</lang>
- Output:
Rosetta_Example_1: THERECANBENOSPACE Rosetta_Example_2: THERECANBESEVERALLINESBUTTHEYALLMUSTBECONCATENATED