Bioinformatics/base count
Given this string representing ordered DNA bases:
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
- Task
- "Pretty print" the sequence, followed by a summary of the counts of each of the bases (A, C, G, and T) in the sequence.
- Print the total count of each base in the string.
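As a language-neutral reference for the two steps of the task (pretty printing, then counting), here is a minimal Python sketch. The names `SEQUENCE`, `pretty_lines` and `base_counts` are illustrative assumptions, not part of the task; the row width of 50 and the 0-based offsets are one possible layout among those used by the solutions on this page.

```python
from collections import Counter

# The task's sequence, copied row by row from the task description.
SEQUENCE = (
    "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
    "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
    "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
    "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
    "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
    "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
    "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
    "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
    "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
    "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
)


def pretty_lines(seq, width=50):
    """Return one string per row: a 0-based start offset, then up to `width` bases."""
    return [
        f"{start:5}: {seq[start:start + width]}"
        for start in range(0, len(seq), width)
    ]


def base_counts(seq):
    """Count A, C, G and T; any other character is simply ignored in this sketch."""
    tally = Counter(seq)
    return {base: tally[base] for base in "ACGT"}


if __name__ == "__main__":
    print("\n".join(pretty_lines(SEQUENCE)))
    counts = base_counts(SEQUENCE)
    for base, n in counts.items():
        print(f"{base}: {n}")
    print(f"Total: {sum(counts.values())}")
```

Returning the rows as a list rather than printing them directly keeps the formatting step easy to check separately from the counting step.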
11l
F basecount(dna)
   DefaultDict[Char, Int] d
   L(c) dna
      d[c]++
   R sorted(d.items())
F seq_split(dna, n = 50)
   R (0 .< dna.len).step(n).map(i -> @dna[i .+ @n])
F seq_pp(dna, n = 50)
   L(part) seq_split(dna, n)
      print(‘#5: #.’.format(L.index * n, part))
   print("\n BASECOUNT:")
   V tot = 0
   L(base, count) basecount(dna)
      print(‘ #3: #.’.format(base, count))
      tot += count
   V (base, count) = (‘TOT’, tot)
   print(‘ #3= #.’.format(base, count))
print(‘SEQUENCE:’)
V sequence = "\
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG\
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG\
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT\
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT\
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG\
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA\
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT\
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG\
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC\
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
seq_pp(sequence)
- Output:
SEQUENCE:
    0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
   50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
  100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
  150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
  200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
  250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
  300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
  350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
  400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
  450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

 BASECOUNT:
   A: 129
   C: 97
   G: 119
   T: 155
 TOT= 500
AArch64 Assembly
/* ARM assembly AARCH64 Raspberry PI 3B */
/* program cptAdn64.s */
/************************************/
/* Constants */
/************************************/
/* for this file see task include a file in language AArch64 assembly*/
.include "../includeConstantesARM64.inc"
.equ LIMIT, 30
.equ SHIFT, 8
//.include "../../ficmacros64.inc" // use for debugging
/************************************/
/* Initialized data */
/************************************/
.data
szMessResult: .asciz "Result: "
szDNA1: .ascii "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
.ascii "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
.ascii "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
.ascii "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
.ascii "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
.ascii "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
.ascii "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
.ascii "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
.ascii "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
.asciz "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
szCarriageReturn: .asciz "\n"
szMessStart: .asciz "Program 64 bits start.\n"
szMessCounterA: .asciz "Base A : "
szMessCounterC: .asciz "Base C : "
szMessCounterG: .asciz "Base G : "
szMessCounterT: .asciz "Base T : "
szMessTotal: .asciz "Total : "
sPrintLine: .fill LIMIT + SHIFT + 2,1,' ' // init line with spaces
/************************************/
/* UnInitialized data */
/************************************/
.bss
sZoneConv: .skip 24
/************************************/
/* code section */
/************************************/
.text
.global main
main: // entry of program
ldr x0,qAdrszMessStart
bl affichageMess
ldr x0,qAdrszDNA1
bl printDNA
ldr x0,qAdrszDNA1
bl countBase
100: // standard end of the program
mov x0, #0 // return code
mov x8, #EXIT // request to exit program
svc 0 // perform the system call
qAdrszDNA1: .quad szDNA1
qAdrsZoneConv: .quad sZoneConv
qAdrszMessResult: .quad szMessResult
qAdrszCarriageReturn: .quad szCarriageReturn
qAdrszMessStart: .quad szMessStart
/***************************************************/
/* count dna line and print */
/***************************************************/
/* x0 contains dna string address */
printDNA:
stp x1,lr,[sp,-16]!
stp x2,x3,[sp,-16]!
stp x4,x5,[sp,-16]!
stp x6,x7,[sp,-16]!
stp x8,x9,[sp,-16]!
mov x8,x0 // save address
mov x3,#0 // index string
mov x4,#0 // byte line counter
mov x5,#1 // start line value
ldr x7,qAdrsPrintLine
ldr x9,qAdrsZoneConv
1:
ldrb w6,[x8,x3] // load byte of dna
cmp x6,#0 // end string ?
beq 4f // yes -> end
add x1,x7,#SHIFT
strb w6,[x1,x4] // store byte in display line
add x4,x4,#1 // increment index line
cmp x4,#LIMIT // end line ?
blt 3f
mov x0,x5 // convert decimal counter base
mov x1,x9
bl conversion10
mov x2,xzr
2: // copy decimal conversion in display line
ldrb w6,[x9,x2]
strb w6,[x7,x2]
add x2,x2,1
cmp x2,x0
blt 2b
mov x0,#0 // Zero final
add x1,x7,#LIMIT
add x1,x1,#SHIFT + 1
strb w0,[x1]
mov x0,x7 // line display
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
add x5,x5,#LIMIT // add line size to counter
mov x4,#0 // and init line index
3:
add x3,x3,#1 // increment index string
b 1b // and loop
4: // display end line if line contains base
cmp x4,#0
beq 100f
mov x0,x5
mov x1,x9
bl conversion10
mov x2,xzr
5: // copy decimal conversion in display line
ldrb w6,[x9,x2]
strb w6,[x7,x2]
add x2,x2,1
cmp x2,x0
blt 5b
mov x0,#0 // Zero final
add x1,x7,x4
add x1,x1,#SHIFT
strb w0,[x1]
mov x0,x7 // last line display
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
100:
ldp x8,x9,[sp],16
ldp x6,x7,[sp],16
ldp x4,x5,[sp],16
ldp x2,x3,[sp],16
ldp x1,lr,[sp],16
ret
qAdrsPrintLine: .quad sPrintLine
/***************************************************/
/* count bases */
/***************************************************/
/* x0 contains dna string address */
countBase:
stp x1,lr,[sp,-16]!
stp x2,x3,[sp,-16]!
stp x4,x5,[sp,-16]!
stp x6,x7,[sp,-16]!
mov x2,#0 // string index
mov x3,#0 // A counter
mov x4,#0 // C counter
mov x5,#0 // G counter
mov x6,#0 // T counter
1:
ldrb w1,[x0,x2] // load byte of dna
cmp x1,#0 // end string ?
beq 2f
cmp x1,#'A'
cinc x3,x3,eq
cmp x1,#'C'
cinc x4,x4,eq
cmp x1,#'G'
cinc x5,x5,eq
cmp x1,#'T'
cinc x6,x6,eq
add x2,x2,#1
b 1b
2:
mov x0,x3 // convert decimal counter A
ldr x1,qAdrsZoneConv
bl conversion10
ldr x0,qAdrszMessCounterA
bl affichageMess
ldr x0,qAdrsZoneConv
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
mov x0,x4 // convert decimal counter C
ldr x1,qAdrsZoneConv
bl conversion10
ldr x0,qAdrszMessCounterC
bl affichageMess
ldr x0,qAdrsZoneConv
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
mov x0,x5 // convert decimal counter G
ldr x1,qAdrsZoneConv
bl conversion10
ldr x0,qAdrszMessCounterG
bl affichageMess
ldr x0,qAdrsZoneConv
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
mov x0,x6 // convert decimal counter T
ldr x1,qAdrsZoneConv
bl conversion10
ldr x0,qAdrszMessCounterT
bl affichageMess
ldr x0,qAdrsZoneConv
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
add x0,x3,x4 // convert decimal total
add x0,x0,x5
add x0,x0,x6
ldr x1,qAdrsZoneConv
bl conversion10
ldr x0,qAdrszMessTotal
bl affichageMess
ldr x0,qAdrsZoneConv
bl affichageMess
ldr x0,qAdrszCarriageReturn
bl affichageMess
100:
ldp x6,x7,[sp],16
ldp x4,x5,[sp],16
ldp x2,x3,[sp],16
ldp x1,lr,[sp],16 // TODO: return to be completed
ret
qAdrszMessCounterA: .quad szMessCounterA
qAdrszMessCounterC: .quad szMessCounterC
qAdrszMessCounterG: .quad szMessCounterG
qAdrszMessCounterT: .quad szMessCounterT
qAdrszMessTotal: .quad szMessTotal
/***************************************************/
/* ROUTINES INCLUDE */
/***************************************************/
/* for this file see task include a file in language AArch64 assembly*/
.include "../includeARM64.inc"
- Output:
Program 64 bits start.
1       CGTAAAAAATTACAACGTCCTTTGGCTATC
31      TCTTAAACTCCTGCTAAATGCTCGTGCTTT
61      CCAATTATGTAAGCGTTCCGAGACGGGGTG
91      GTCGATTCTGAGGACAAAGGTCAAGATGGA
121     GCGCATCGAACGCAATAAGGATCATTTGAT
151     GGGACGTTTCGTCGACAAAGTCTTGTTTCG
181     AGAGTAACGGCTACCGTCTTCGATTCTGCT
211     TATAACACTATGTTCTTATGAAATGGATGT
241     TCTGAGTTGGTCAGTCCCAATGTGCGGGGT
271     TTCTTTTAGTACGTCGGGAGTGGTATTATA
301     TTTAATTTTTCTATATAGCGATCTGTATTT
331     AAGCAATTCATTTAGGTTATCGCCGCGATG
361     CTCGGTTCGGACCGCCAAGCATCTGGCTCC
391     ACTGCTAGTGTCCTAAATTTGAATGGCAAA
421     CACAAATAAGATTTAGCAATTCGTGTAGAC
451     GACCGGGGACTTGCATGATGGGAGCAGCTT
481     TGTTAAACTACGAACGTAAT
Base A : 129
Base C : 97
Base G : 119
Base T : 155
Total : 500
Action!
In this solution the number of nucleotides per row is 30, so that the output fits the screen of an Atari 8-bit computer.
DEFINE PTR="CARD"
PROC PrettyPrint(PTR ARRAY data INT count,gsize,gcount)
INT index,item,i,ingroup,group,a,t,c,g
CHAR ARRAY s
CHAR ch
index=0 item=0 i=1 ingroup=0 group=0
a=0 t=0 g=0 c=0
s=data(0)
DO
WHILE i>s(0)
DO
i=1 item==+1
IF item>=count THEN EXIT FI
s=data(item)
OD
IF item>=count THEN EXIT FI
index==+1
IF group=0 AND ingroup=0 THEN
IF index<10 THEN Put(32) FI
IF index<100 THEN Put(32) FI
PrintI(index) Print(":")
FI
IF ingroup=0 THEN Put(32) FI
ch=s(i) i==+1
Put(ch)
IF ch='A THEN a==+1
ELSEIF ch='T THEN t==+1
ELSEIF ch='C THEN c==+1
ELSEIF ch='G THEN g==+1 FI
ingroup==+1
IF ingroup>=gsize THEN
ingroup=0 group==+1
IF group>=gcount THEN
group=0
FI
FI
OD
PrintF("%E%EBases: A:%I, T:%I, C:%I, G:%I%E",a,t,c,g)
PrintF("%ETotal: %I",a+t+g+c)
RETURN
PROC Main()
PTR ARRAY data(10)
BYTE LMARGIN=$52,oldLMARGIN
oldLMARGIN=LMARGIN
LMARGIN=0 ;remove left margin on the screen
Put(125) PutE() ;clear the screen
data(0)="CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
data(1)="CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
data(2)="AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
data(3)="GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
data(4)="CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
data(5)="TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
data(6)="TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
data(7)="CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
data(8)="TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
data(9)="GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
PrettyPrint(data,10,5,6)
LMARGIN=oldLMARGIN ;restore left margin on the screen
RETURN
- Output:
Screenshot from Atari 8-bit computer
  1: CGTAA AAAAT TACAA CGTCC TTTGG CTATC
 31: TCTTA AACTC CTGCT AAATG CTCGT GCTTT
 61: CCAAT TATGT AAGCG TTCCG AGACG GGGTG
 91: GTCGA TTCTG AGGAC AAAGG TCAAG ATGGA
121: GCGCA TCGAA CGCAA TAAGG ATCAT TTGAT
151: GGGAC GTTTC GTCGA CAAAG TCTTG TTTCG
181: AGAGT AACGG CTACC GTCTT CGATT CTGCT
211: TATAA CACTA TGTTC TTATG AAATG GATGT
241: TCTGA GTTGG TCAGT CCCAA TGTGC GGGGT
271: TTCTT TTAGT ACGTC GGGAG TGGTA TTATA
301: TTTAA TTTTT CTATA TAGCG ATCTG TATTT
331: AAGCA ATTCA TTTAG GTTAT CGCCG CGATG
361: CTCGG TTCGG ACCGC CAAGC ATCTG GCTCC
391: ACTGC TAGTG TCCTA AATTT GAATG GCAAA
421: CACAA ATAAG ATTTA GCAAT TCGTG TAGAC
451: GACCG GGGAC TTGCA TGATG GGAGC AGCTT
481: TGTTA AACTA CGAAC GTAAT

Bases: A:129, T:155, C:97, G:119

Total: 500
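The row layout used by the Action! solution (30 bases per row, split into six groups of five, each row labelled with the 1-based position of its first base) can be sketched in Python as follows. The function name `grouped_rows` and its parameters are illustrative assumptions; the defaults mirror the `PrettyPrint(data,10,5,6)` call above.

```python
def grouped_rows(seq, group_size=5, groups_per_row=6):
    """Format `seq` into rows of `groups_per_row` space-separated groups.

    Each row starts with the 1-based position of its first base,
    right-aligned in a field of three characters.
    """
    row_len = group_size * groups_per_row
    rows = []
    for start in range(0, len(seq), row_len):
        row = seq[start:start + row_len]
        # Cut the row into fixed-size groups; the last group may be shorter.
        groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
        rows.append(f"{start + 1:3}: " + " ".join(groups))
    return rows


if __name__ == "__main__":
    for line in grouped_rows("CGTAAAAAATTACAACGTCCTTTGGCTATCTCTT"):
        print(line)
```

Narrow rows of this kind trade more output lines for a fit on small displays; changing the two parameters gives the 50-per-row, groups-of-ten layout used by some of the other solutions.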
Ada
with Ada.Text_Io;
procedure Base_Count is
type Sequence is new String;
Test : constant Sequence :=
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" &
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" &
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" &
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" &
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" &
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" &
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" &
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" &
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" &
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT";
Line_Width : constant := 70;
procedure Put (Seq : Sequence) is
use Ada.Text_Io;
package Position_Io is new Ada.Text_Io.Integer_Io (Natural);
First : Natural := Seq'First;
Last : Natural;
begin
loop
Last := Natural'Min (Seq'Last, First + Line_Width - 1);
Position_Io.Put (First, Width => 3);
Put (String'(".."));
Position_Io.Put (Last, Width => 3);
Put (String'(" "));
Put (String (Seq (First .. Last)));
New_Line;
exit when Last = Seq'Last;
First := First + Line_Width;
end loop;
end Put;
procedure Count (Seq : Sequence) is
use Ada.Text_Io;
A_Count, C_Count : Natural := 0;
G_Count, T_Count : Natural := 0;
begin
for B of Seq loop
case B is
when 'A' => A_Count := A_Count + 1;
when 'C' => C_Count := C_Count + 1;
when 'G' => G_Count := G_Count + 1;
when 'T' => T_Count := T_Count + 1;
when others =>
raise Constraint_Error;
end case;
end loop;
Put_Line ("A: " & A_Count'Image);
Put_Line ("C: " & C_Count'Image);
Put_Line ("G: " & G_Count'Image);
Put_Line ("T: " & T_Count'Image);
Put_Line ("Total: " & Seq'Length'Image);
end Count;
begin
Put (Test);
Count (Test);
end Base_Count;
- Output:
  1.. 70 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGT
 71..140 AAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGG
141..210 ATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCT
211..280 TATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGT
281..350 ACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
351..420 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAA
421..490 CACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTA
491..500 CGAACGTAAT
A:  129
C:  97
G:  119
T:  155
Total:  500
ALGOL 68
Includes a count for non-bases if they are present in the sequence, as this would presumably indicate an error.
BEGIN # count DNA bases in a sequence #
# returns an array of counts of the characters in s that are in c #
# an extra final element holds the count of characters not in c #
PRIO COUNT = 9;
OP COUNT = ( STRING s, STRING c )[]INT:
BEGIN
[ LWB c : UPB c + 1 ]INT results; # extra element for "other" #
[ 0 : 255 ]INT counts; # only counts ASCII characters #
FOR i FROM LWB counts TO UPB counts DO counts[ i ] := 0 OD;
FOR i FROM LWB results TO UPB results DO results[ i ] := 0 OD;
# count the occurrences of each ASCII character in s #
FOR i FROM LWB s TO UPB s DO
IF INT ch pos = ABS s[ i ];
ch pos >= LWB counts AND ch pos <= UPB counts
THEN
# have a character we can count #
counts[ ch pos ] +:= 1
ELSE
# not an ASCII character ? #
results[ UPB results ] +:= 1
FI
OD;
# return the counts of the required characters #
# set the results for the expected characters and clear their #
# counts so we can count the "other" characters #
FOR i FROM LWB results TO UPB results - 1 DO
IF INT ch pos = ABS c[ i ];
ch pos >= LWB counts AND ch pos <= UPB counts
THEN
results[ i ] := counts[ ch pos ];
counts[ ch pos ] := 0
FI
OD;
# count the "other" characters #
FOR i FROM LWB counts TO UPB counts DO
IF counts[ i ] /= 0 THEN
results[ UPB results ] +:= counts[ i ]
FI
OD;
results
END; # COUNT #
# returns the combined counts of the characters in the elements of s #
# that are in c #
# an extra final element holds the count of characters not in c #
OP COUNT = ( []STRING s, STRING c )[]INT:
BEGIN
[ LWB c : UPB c + 1 ]INT results;
FOR i FROM LWB results TO UPB results DO results[ i ] := 0 OD;
FOR i FROM LWB s TO UPB s DO
[]INT counts = s[ i ] COUNT c;
FOR i FROM LWB results TO UPB results DO
results[ i ] +:= counts[ i ]
OD
OD;
results
END; # COUNT #
# returns the length of s #
OP LEN = ( STRING s )INT: ( UPB s - LWB s ) + 1;
# count the bases in the required sequence #
[]STRING seq = ( "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
, "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
, "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
, "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
, "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
, "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
, "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
, "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
, "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
, "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
);
STRING bases = "ATCG";
[]INT counts = seq COUNT bases;
# print the sequence with leading character positions #
# find the overall length of the sequence #
INT seq len := 0;
FOR i FROM LWB seq TO UPB seq DO
seq len +:= LEN seq[ i ]
OD;
# compute the minimum field width required for the positions #
INT s len := seq len;
INT width := 1;
WHILE s len >= 10 DO
width +:= 1;
s len OVERAB 10
OD;
# show the sequence #
print( ( "Sequence:", newline, newline ) );
INT start := 0;
FOR i FROM LWB seq TO UPB seq DO
print( ( " ", whole( start, - width ), " :", seq[ i ], newline ) );
start +:= LEN seq[ i ]
OD;
# show the base counts #
print( ( newline, "Bases: ", newline, newline ) );
INT total := 0;
FOR i FROM LWB bases TO UPB bases DO
print( ( " ", bases[ i ], " : ", whole( counts[ i ], - width ), newline ) );
total +:= counts[ i ]
OD;
# show the count of other characters (invalid bases) - if there are any #
IF INT others = UPB counts;
counts[ others ] /= 0
THEN
# there were characters other than the bases #
print( ( newline, "Other: ", whole( counts[ others ], - width ), newline, newline ) );
total +:= counts[ UPB counts ]
FI;
# totals #
print( ( newline, "Total: ", whole( total, - width ), newline ) )
END
- Output:
Sequence:

   0 :CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
  50 :CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
 100 :AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
 150 :GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
 200 :CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
 250 :TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
 300 :TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
 350 :CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
 400 :TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
 450 :GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

Bases:

 A : 129
 T : 155
 C :  97
 G : 119

Total: 500
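The idea behind the ALGOL 68 solution, keeping a separate count for characters that are not one of the expected bases because they presumably indicate an error, can be sketched in Python as follows. The function name `count_with_others` is an illustrative assumption; the default base order "ATCG" follows the `bases` string in the ALGOL 68 code.

```python
def count_with_others(seq, bases="ATCG"):
    """Return (per-base counts, number of characters not in `bases`)."""
    counts = {base: 0 for base in bases}
    others = 0
    for ch in seq:
        if ch in counts:
            counts[ch] += 1
        else:
            # Anything outside the expected alphabet is tallied separately.
            others += 1
    return counts, others


if __name__ == "__main__":
    counts, others = count_with_others("ACGTNNA-")
    for base, n in counts.items():
        print(f"{base} : {n}")
    if others:
        print(f"Other: {others}")
    print(f"Total: {sum(counts.values()) + others}")
```

Reporting the "other" count only when it is non-zero matches the ALGOL 68 output above, where no such line appears for the task's sequence. The Ada solution takes the stricter route of raising an error on the first unexpected character.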
APL
bases←'CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCC',
'GAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGG',
'GACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTC',
'TTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTA',
'TATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGA',
'CCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGT',
'GTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT'
50 {w←⍺⋄s←⍵⋄{(w×1-⍨⍵),((w÷⍨≢s) w ⍴s)[⍵;]} ⍳(≢s)÷⍺} bases
{⍵,':',+/bases=⍵}¨∪bases[⍋∪bases]
'Total:',≢bases
- Output:
  0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
 50 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
A: 129 C: 97 G: 119 T: 155
Total: 500
ARM Assembly
/* ARM assembly Raspberry PI */
/* program cptAdn.s */
/************************************/
/* Constants */
/************************************/
/* for this file see task include a file in language ARM assembly*/
.include "../constantes.inc"
.equ LIMIT, 50
.equ SHIFT, 11
/************************************/
/* Initialized data */
/************************************/
.data
szMessResult: .asciz "Result: "
szDNA1: .ascii "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
.ascii "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
.ascii "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
.ascii "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
.ascii "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
.ascii "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
.ascii "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
.ascii "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
.ascii "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
.asciz "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
szCarriageReturn: .asciz "\n"
szMessStart: .asciz "Program 32 bits start.\n"
szMessCounterA: .asciz "Base A : "
szMessCounterC: .asciz "Base C : "
szMessCounterG: .asciz "Base G : "
szMessCounterT: .asciz "Base T : "
szMessTotal: .asciz "Total : "
/************************************/
/* UnInitialized data */
/************************************/
.bss
sZoneConv: .skip 24
sPrintLine: .skip LIMIT + SHIFT + 2
/************************************/
/* code section */
/************************************/
.text
.global main
main: @ entry of program
ldr r0,iAdrszMessStart
bl affichageMess
ldr r0,iAdrszDNA1
bl printDNA
ldr r0,iAdrszDNA1
bl countBase
100: @ standard end of the program
mov r0, #0 @ return code
mov r7, #EXIT @ request to exit program
svc 0 @ perform the system call
iAdrszDNA1: .int szDNA1
iAdrsZoneConv: .int sZoneConv
iAdrszMessResult: .int szMessResult
iAdrszCarriageReturn: .int szCarriageReturn
iAdrszMessStart: .int szMessStart
/***************************************************/
/* count dna line and print */
/***************************************************/
/* r0 contains dna string address */
printDNA:
push {r1-r8,lr} @ save registers
mov r8,r0 @ save address
mov r3,#0 @ index string
mov r4,#0 @ byte line counter
mov r5,#1 @ start line value
ldr r7,iAdrsPrintLine
1:
ldrb r6,[r8,r3] @ load byte of dna
cmp r6,#0 @ end string ?
beq 4f
add r1,r7,#SHIFT
strb r6,[r1,r4] @ store byte in display line
add r4,r4,#1 @ increment index line
cmp r4,#LIMIT @ end line ?
blt 3f
mov r0,r5 @ convert decimal counter base
mov r1,r7
bl conversion10
mov r0,#0 @ Zero final
add r1,r7,#LIMIT
add r1,r1,#SHIFT + 1
strb r0,[r1]
mov r0,r7 @ line display
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
add r5,r5,#LIMIT @ add line size to counter
mov r4,#0 @ and init line index
3:
add r3,r3,#1 @ increment index string
b 1b @ and loop
4: @ display end line if line contains base
cmp r4,#0
beq 100f
mov r0,r5
mov r1,r7
bl conversion10
mov r0,#0 @ Zero final
add r1,r7,r4
add r1,r1,#SHIFT
strb r0,[r1]
mov r0,r7 @ last line display
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
100:
pop {r1-r8,pc}
iAdrsPrintLine: .int sPrintLine
/***************************************************/
/* count bases */
/***************************************************/
/* r0 contains dna string address */
countBase:
push {r1-r6,lr} @ save registers
mov r2,#0 @ string index
mov r3,#0 @ A counter
mov r4,#0 @ C counter
mov r5,#0 @ G counter
mov r6,#0 @ T counter
1:
ldrb r1,[r0,r2] @ load byte of dna
cmp r1,#0 @ end string ?
beq 2f
cmp r1,#'A'
addeq r3,r3,#1
cmp r1,#'C'
addeq r4,r4,#1
cmp r1,#'G'
addeq r5,r5,#1
cmp r1,#'T'
addeq r6,r6,#1
add r2,r2,#1
b 1b
2:
mov r0,r3 @ convert decimal counter A
ldr r1,iAdrsZoneConv
bl conversion10
ldr r0,iAdrszMessCounterA
bl affichageMess
ldr r0,iAdrsZoneConv
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
mov r0,r4 @ convert decimal counter C
ldr r1,iAdrsZoneConv
bl conversion10
ldr r0,iAdrszMessCounterC
bl affichageMess
ldr r0,iAdrsZoneConv
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
mov r0,r5 @ convert decimal counter G
ldr r1,iAdrsZoneConv
bl conversion10
ldr r0,iAdrszMessCounterG
bl affichageMess
ldr r0,iAdrsZoneConv
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
mov r0,r6 @ convert decimal counter T
ldr r1,iAdrsZoneConv
bl conversion10
ldr r0,iAdrszMessCounterT
bl affichageMess
ldr r0,iAdrsZoneConv
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
add r0,r3,r4 @ convert decimal total
add r0,r0,r5
add r0,r0,r6
ldr r1,iAdrsZoneConv
bl conversion10
ldr r0,iAdrszMessTotal
bl affichageMess
ldr r0,iAdrsZoneConv
bl affichageMess
ldr r0,iAdrszCarriageReturn
bl affichageMess
100:
pop {r1-r6,pc}
iAdrszMessCounterA: .int szMessCounterA
iAdrszMessCounterC: .int szMessCounterC
iAdrszMessCounterG: .int szMessCounterG
iAdrszMessCounterT: .int szMessCounterT
iAdrszMessTotal: .int szMessTotal
/***************************************************/
/* ROUTINES INCLUDE */
/***************************************************/
/* for this file see task include a file in language ARM assembly*/
.include "../affichage.inc"
- Output:
Program 32 bits start.
1 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
51 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
101 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
151 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
201 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
251 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
301 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
351 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
401 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
451 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base A : 129
Base C : 97
Base G : 119
Base T : 155
Total : 500
Arturo
dna: {
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
}
prettyPrint: function [in][
count: #[ A: 0, T: 0, G: 0, C: 0 ]
loop.with:'i split.lines in 'line [
prints [pad to :string i*50 3 ":"]
print split.every:10 line
loop split line 'ch [
case [ch=]
when? -> "A" -> count\A: count\A + 1
when? -> "T" -> count\T: count\T + 1
when? -> "G" -> count\G: count\G + 1
when? -> "C" -> count\C: count\C + 1
else []
]
]
print ["Total count => A:" count\A, "T:" count\T "G:" count\G "C:" count\C]
]
prettyPrint dna
- Output:
  0 : CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG
 50 : CTCGTGCTTT CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG
100 : AGGACAAAGG TCAAGATGGA GCGCATCGAA CGCAATAAGG ATCATTTGAT
150 : GGGACGTTTC GTCGACAAAG TCTTGTTTCG AGAGTAACGG CTACCGTCTT
200 : CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT TCTGAGTTGG
250 : TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA
300 : TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT
350 : CGCCGCGATG CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG
400 : TCCTAAATTT GAATGGCAAA CACAAATAAG ATTTAGCAAT TCGTGTAGAC
450 : GACCGGGGAC TTGCATGATG GGAGCAGCTT TGTTAAACTA CGAACGTAAT
Total count => A: 129 T: 155 G: 119 C: 97
AutoHotkey
test := "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
tek := 1, bases := " "
loop,parse,test
{
if (A_LoopField = "A")
countA += 1
else if (A_LoopField = "C")
countC += 1
else if (A_LoopField = "G")
countG += 1
else if (A_LoopField = "T")
countT += 1
if (mod(a_index,50) = 0)
{
bases .= a_index . " -> " . substr(test,tek,50) . "`n"
tek += 50
}
}
MsgBox % bases "`nA: " countA "`nC: " countC "`nG: " countG "`nT: " countT "`nTotal = " countA+countC+countG+countT
ExitApp
- Output:
 50 -> CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
100 -> CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
150 -> AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
200 -> GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
250 -> CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
300 -> TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
350 -> TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
400 -> CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
450 -> TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
500 -> GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

A: 129
C: 97
G: 119
T: 155
Total = 500
AWK
# syntax: GAWK -f BIOINFORMATICS_BASE_COUNT.AWK
# converted from FreeBASIC
#
# sorting:
# PROCINFO["sorted_in"] is used by GAWK
# SORTTYPE is used by Thompson Automation's TAWK
#
BEGIN {
dna = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" \
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" \
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" \
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" \
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" \
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" \
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" \
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" \
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" \
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
curr = first = 1
while (curr <= length(dna)) {
curr_base = substr(dna,curr,1)
base_arr[curr_base]++
rec = sprintf("%s%s",rec,curr_base)
curr++
if (curr % 10 == 1) {
rec = sprintf("%s ",rec)
}
if (curr % 50 == 1) {
printf("%3d-%3d: %s\n",first,curr-1,rec)
rec = ""
first = curr
}
}
PROCINFO["sorted_in"] = "@ind_str_asc" ; SORTTYPE = 1
printf("\nBase count\n")
for (i in base_arr) {
printf("%s %8d\n",i,base_arr[i])
total += base_arr[i]
}
printf("%10d total\n",total)
exit(0)
}
- Output:
1- 50: CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG
51-100: CTCGTGCTTT CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG
101-150: AGGACAAAGG TCAAGATGGA GCGCATCGAA CGCAATAAGG ATCATTTGAT
151-200: GGGACGTTTC GTCGACAAAG TCTTGTTTCG AGAGTAACGG CTACCGTCTT
201-250: CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT TCTGAGTTGG
251-300: TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA
301-350: TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT
351-400: CGCCGCGATG CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG
401-450: TCCTAAATTT GAATGGCAAA CACAAATAAG ATTTAGCAAT TCGTGTAGAC
451-500: GACCGGGGAC TTGCATGATG GGAGCAGCTT TGTTAAACTA CGAACGTAAT
Base count
A 129
C 97
G 119
T 155
500 total
BBC BASIC
DNA$="CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" +\
\ "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" +\
\ "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" +\
\ "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" +\
\ "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" +\
\ "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" +\
\ "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" +\
\ "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" +\
\ "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" +\
\ "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" + CHR$0
@%=3 : REM Width of the print zone
P%=!^DNA$ : REM Address of string in memory
WHILE ?P%
IF I% MOD 50 == 0 PRINT 'I% ": ";
VDU ?P% : REM Output ASCII value at address P%
CASE ?P% OF
WHEN ASC"A" A+=1
WHEN ASC"C" C+=1
WHEN ASC"G" G+=1
WHEN ASC"T" T+=1
ENDCASE
I%+=1
P%+=1
ENDWHILE
PRINT '' "A: " A ' "C: " C ' "G: " G ' "T: " T
PRINT "Total: " A + C + G + T
- Output:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
A: 129
C: 97
G: 119
T: 155
Total: 500
C
Reads genome from a file, determines string length to ensure optimal formatting
#include<string.h>
#include<stdlib.h>
#include<stdio.h>
typedef struct genome{
char* strand;
int length;
struct genome* next;
}genome;
genome* genomeData;
int totalLength = 0, Adenine = 0, Cytosine = 0, Guanine = 0, Thymine = 0;
int numDigits(int num){
int len = 1;
while(num>=10){ /* >= so that exact powers of ten (10, 100, ...) count correctly */
num = num/10;
len++;
}
return len;
}
void buildGenome(char str[100]){
int len = strlen(str),i;
genome *genomeIterator, *newGenome;
totalLength += len;
for(i=0;i<len;i++){
switch(str[i]){
case 'A': Adenine++;
break;
case 'T': Thymine++;
break;
case 'C': Cytosine++;
break;
case 'G': Guanine++;
break;
};
}
if(genomeData==NULL){
genomeData = (genome*)malloc(sizeof(genome));
genomeData->strand = (char*)malloc((len+1)*sizeof(char)); /* +1 for the terminating '\0' copied by strcpy */
strcpy(genomeData->strand,str);
genomeData->length = len;
genomeData->next = NULL;
}
else{
genomeIterator = genomeData;
while(genomeIterator->next!=NULL)
genomeIterator = genomeIterator->next;
newGenome = (genome*)malloc(sizeof(genome));
newGenome->strand = (char*)malloc((len+1)*sizeof(char)); /* +1 for the terminating '\0' copied by strcpy */
strcpy(newGenome->strand,str);
newGenome->length = len;
newGenome->next = NULL;
genomeIterator->next = newGenome;
}
}
void printGenome(){
genome* genomeIterator = genomeData;
int width = numDigits(totalLength), len = 0;
printf("Sequence:\n");
while(genomeIterator!=NULL){
printf("\n%*d%3s%3s",width+1,len,":",genomeIterator->strand);
len += genomeIterator->length;
genomeIterator = genomeIterator->next;
}
printf("\n\nBase Count\n----------\n\n");
printf("%3c%3s%*d\n",'A',":",width+1,Adenine);
printf("%3c%3s%*d\n",'T',":",width+1,Thymine);
printf("%3c%3s%*d\n",'C',":",width+1,Cytosine);
printf("%3c%3s%*d\n",'G',":",width+1,Guanine);
printf("\n%3s%*d\n","Total:",width+1,Adenine + Thymine + Cytosine + Guanine);
free(genomeData);
}
int main(int argc,char** argv)
{
char str[100];
int counter = 0, len;
if(argc!=2){
printf("Usage : %s <Gene file name>\n",argv[0]);
return 0;
}
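FILE *fp = fopen(argv[1],"r");
if(fp==NULL){
printf("Cannot open file : %s\n",argv[1]);
return 1;
}
/* str holds 100 chars, so read at most 99 per token */
while(fscanf(fp,"%99s",str)==1)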
buildGenome(str);
fclose(fp);
printGenome();
return 0;
}
Run and output :
abhishek_ghosh@Azure:~/doodles$ ./a.out genome.txt
Sequence:
0 :CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 :CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 :AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 :GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 :CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 :TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 :TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 :CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 :TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 :GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base Count
----------
A : 129
T : 155
C : 97
G : 119
Total: 500
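The formatting idea in the file-based version above (size the offset column from the number of digits in the total length) can be sketched on its own. The following Python sketch is only an illustration: the names `num_digits` and `format_strands` are hypothetical and not part of the C listing, and the column layout merely imitates its `printf` format.

```python
def num_digits(n: int) -> int:
    """Count the decimal digits of a non-negative integer."""
    digits = 1
    while n >= 10:          # 10, 100, ... must count as 2, 3, ... digits
        n //= 10
        digits += 1
    return digits


def format_strands(strands: list[str]) -> list[str]:
    """Prefix each strand with its starting offset, right-aligned.

    The column is one character wider than the digit count of the
    total length, imitating the width+1 used in the C listing.
    """
    total = sum(len(s) for s in strands)
    width = num_digits(total) + 1
    lines, offset = [], 0
    for strand in strands:
        lines.append(f"{offset:>{width}}  :{strand}")
        offset += len(strand)
    return lines


if __name__ == "__main__":
    for line in format_strands(["CGTAAAAAAT", "TACAACGTCC", "TTTGG"]):
        print(line)
```

Because the width is computed once from the total, every offset lines up regardless of how many strands the input contains.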
Naive solution
#include <stdio.h>
int main(void) {
char dna[] = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT";
int c_count = 0, t_count = 0, a_count = 0, g_count = 0, total;
for (total = 0; dna[total]; total++) {
if (total % 50 == 0)
printf("\n%3d - %3d: %c", total + 1, total + 50, dna[total]);
else if (total % 5 == 0)
printf(" %c", dna[total]);
else
printf("%c", dna[total]);
switch (dna[total]) {
case 'C': c_count++; break;
case 'T': t_count++; break;
case 'A': a_count++; break;
case 'G': g_count++; break;
}
}
printf("\n\nC count: %3d\nT count: %3d\nA count: %3d\nG count: %3d\n Total: %3d\n\n",
c_count, t_count, a_count, g_count, total);
return 0;
}
- Output:
1 - 50: CGTAA AAAAT TACAA CGTCC TTTGG CTATC TCTTA AACTC CTGCT AAATG
51 - 100: CTCGT GCTTT CCAAT TATGT AAGCG TTCCG AGACG GGGTG GTCGA TTCTG
101 - 150: AGGAC AAAGG TCAAG ATGGA GCGCA TCGAA CGCAA TAAGG ATCAT TTGAT
151 - 200: GGGAC GTTTC GTCGA CAAAG TCTTG TTTCG AGAGT AACGG CTACC GTCTT
201 - 250: CGATT CTGCT TATAA CACTA TGTTC TTATG AAATG GATGT TCTGA GTTGG
251 - 300: TCAGT CCCAA TGTGC GGGGT TTCTT TTAGT ACGTC GGGAG TGGTA TTATA
301 - 350: TTTAA TTTTT CTATA TAGCG ATCTG TATTT AAGCA ATTCA TTTAG GTTAT
351 - 400: CGCCG CGATG CTCGG TTCGG ACCGC CAAGC ATCTG GCTCC ACTGC TAGTG
401 - 450: TCCTA AATTT GAATG GCAAA CACAA ATAAG ATTTA GCAAT TCGTG TAGAC
451 - 500: GACCG GGGAC TTGCA TGATG GGAGC AGCTT TGTTA AACTA CGAAC GTAAT
C count: 97
T count: 155
A count: 129
G count: 119
Total: 500
C++
Creates a class DnaBase which either uses a provided string or the default DNA sequence.
#include <map>
#include <string>
#include <iostream>
#include <iomanip>
const std::string DEFAULT_DNA = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT";
class DnaBase {
public:
DnaBase(const std::string& dna = DEFAULT_DNA, int width = 50) : genome(dna), displayWidth(width) {
// Map each character to a counter
for (auto elm : dna) {
if (count.find(elm) == count.end())
count[elm] = 0;
++count[elm];
}
}
void viewGenome() {
std::cout << "Sequence:" << std::endl;
std::cout << std::endl;
int limit = genome.size() / displayWidth;
if (genome.size() % displayWidth != 0)
++limit;
for (int i = 0; i < limit; ++i) {
int beginPos = i * displayWidth;
std::cout << std::setw(4) << beginPos << " :" << std::setw(4) << genome.substr(beginPos, displayWidth) << std::endl;
}
std::cout << std::endl;
std::cout << "Base Count" << std::endl;
std::cout << "----------" << std::endl;
std::cout << std::endl;
int total = 0;
for (auto elm : count) {
std::cout << std::setw(4) << elm.first << " : " << elm.second << std::endl;
total += elm.second;
}
std::cout << std::endl;
std::cout << "Total: " << total << std::endl;
}
private:
std::string genome;
std::map<char, int> count;
int displayWidth;
};
int main(void) {
DnaBase d; // automatic storage: no new/delete needed
d.viewGenome();
return 0;
}
- Output:
Sequence:
0 :CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 :CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 :AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 :GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 :CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 :TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 :TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 :CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 :TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 :GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base Count
----------
A : 129
C : 97
G : 119
T : 155
Total: 500
Delphi
program base_count;
{$APPTYPE CONSOLE}
uses
System.SysUtils,
Generics.Collections,
System.Console;
const
DNA = 'CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG' +
'CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG' +
'AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT' +
'GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT' +
'CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG' +
'TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA' +
'TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT' +
'CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG' +
'TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC' +
'GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT';
procedure Println(code: ansistring);
var
c: ansichar;
begin
console.ForegroundColor := TConsoleColor.Black;
for c in code do
begin
case c of
'A':
console.BackgroundColor := TConsoleColor.Red;
'C':
console.BackgroundColor := TConsoleColor.Blue;
'T':
console.BackgroundColor := TConsoleColor.Green;
'G':
console.BackgroundColor := TConsoleColor.Yellow;
else
console.BackgroundColor := TConsoleColor.Black;
end;
console.Write(c);
end;
console.ForegroundColor := TConsoleColor.White;
console.BackgroundColor := TConsoleColor.Black;
console.WriteLine;
end;
begin
console.WriteLine('SEQUENCE:');
var le := Length(DNA);
var index := 0;
while index < le do
begin
Write(index: 5, ': ');
Println(dna.Substring(index, 50));
inc(index, 50);
end;
var baseMap := TDictionary<byte, integer>.Create;
for var i := 1 to le do
begin
var key := ord(dna[i]);
if baseMap.ContainsKey(key) then
baseMap[key] := baseMap[key] + 1
else
baseMap.Add(key, 1);
end;
var bases: TArray<byte>;
for var k in baseMap.Keys do
begin
SetLength(bases, Length(bases) + 1);
bases[High(bases)] := k;
end;
TArray.Sort<Byte>(bases);
console.WriteLine(#10'BASE COUNT:');
for var base in bases do
console.WriteLine(' {0}: {1}', [ansichar(base), baseMap[base]]);
console.WriteLine(' ------');
console.WriteLine(' Σ: {0}', [le]);
console.WriteLine(' ======');
readln;
end.
- Output:
SEQUENCE:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
BASE COUNT:
A: 129
C: 97
G: 119
T: 155
------
Σ: 500
======
DuckDB
The following highlights the histogram() function, which computes a dictionary of frequency counts.
It is assumed that a file named rc-bioinformatics-base-count.txt contains the DNA sequence; any characters in the file apart from upper-case alphabetic characters will be ignored.
-- Filter the contents of the file into a table (id, c) of uppercase letters
create or replace table bases as (
select row_number() OVER () as id, c
from (select unnest(regexp_extract_all(content, '[A-Z]') ) as c
from read_text('rc-bioinformatics-base-count.txt') )
);
.print DNA sequence:
with tbl as (select (id - 1 - mod(id - 1, 50)) as "offset", c from bases)
select "offset", string_agg(c, '') as sequence
from tbl
group by "offset"
order by "offset" ;
.print
.print Distribution of bases:
select histogram(c), count(*) as N
from bases ;
- Output:
DNA sequence:
┌────────┬────────────────────────────────────────────────────┐
│ offset │                      sequence                      │
│ int64  │                      varchar                       │
├────────┼────────────────────────────────────────────────────┤
│      0 │ CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG │
│     50 │ CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG │
│    100 │ AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT │
│    150 │ GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT │
│    200 │ CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG │
│    250 │ TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA │
│    300 │ TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT │
│    350 │ CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG │
│    400 │ TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC │
│    450 │ GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT │
├────────┴────────────────────────────────────────────────────┤
│ 10 rows                                           2 columns │
└─────────────────────────────────────────────────────────────┘
Distribution of bases:
┌─────────────────────────────┬───────┐
│        histogram(c)         │   N   │
│    map(varchar, ubigint)    │ int64 │
├─────────────────────────────┼───────┤
│ {A=129, C=97, G=119, T=155} │   500 │
└─────────────────────────────┴───────┘
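For readers who do not use DuckDB, the two ideas in the entry above (grouping bases under the offset `id - 1 - mod(id - 1, 50)` and building a frequency map) can be sketched in Python. This is an illustrative analogue, not a description of how DuckDB implements `histogram()`; the function names `chunk_by_offset` and `histogram` are hypothetical.

```python
from collections import Counter


def uppercase_letters(text: str) -> list[str]:
    """Keep only the characters matched by the regex [A-Z]."""
    return [ch for ch in text if "A" <= ch <= "Z"]


def chunk_by_offset(text: str, width: int = 50) -> dict[int, str]:
    """Group bases under the offset of the chunk they fall in."""
    chunks: dict[int, str] = {}
    for i, base in enumerate(uppercase_letters(text)):
        offset = i - i % width        # i plays the role of id - 1
        chunks[offset] = chunks.get(offset, "") + base
    return chunks


def histogram(text: str) -> dict[str, int]:
    """Frequency map of the bases, with keys in sorted order."""
    return dict(sorted(Counter(uppercase_letters(text)).items()))


if __name__ == "__main__":
    sample = "CGTAAAAAAT\nTACAACGTCC\n"
    print(chunk_by_offset(sample, 10))
    print(histogram(sample))
```

Filtering first and numbering afterwards means line breaks and other stray characters never shift the offsets.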
EasyLang
len d[] 26
pos = 1
numfmt 0 4
repeat
s$ = input
until s$ = ""
for c$ in strchars s$
if pos mod 40 = 1
write pos & ":"
.
if pos mod 4 = 1
write " "
.
write c$
if pos mod 40 = 0
print ""
.
pos += 1
c = strcode c$
d[c - 64] += 1
.
.
print ""
for i in [ 1 3 7 20 ]
write strchar (64 + i) & ": "
print d[i]
.
print "Total: " & d[1] + d[3] + d[7] + d[20]
input_data
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
ed
Uses a potentially non-portable \> special regex sequence.
# by Artyom Bologov
H
,j
# Repeat the line 5 times
t0
t0
t0
t0
# Split the line in chunks of 50
1s/[ACGT]\{0,50\}/&\
/g
# Count every base by leaving X letters on each row
$-3s/[^A]//g
$-2s/[^C]//g
$-1s/[^G]//g
$s/[^T]//g
# Turn all letters into i's
$-3s/A/i/g
$-2s/C/i/g
$-1s/G/i/g
$s/T/i/g
# unary -> decimal (up to 10^10)
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
g/i/s/^\(i*\)\1\1\1\1\1\1\1\1\1\(i\{0,9\}\)/\1 \2/
,p
g/i/s/[ ]$/ 0 /g
g/i/s/[ ][ ]/ 0 /g
g/i/s/[ ][ ]/ 0 /g
g/i/s/[ ]iiiiiiiii\>/ 9 /g
g/i/s/[ ]iiiiiiii\>/ 8 /g
g/i/s/[ ]iiiiiii\>/ 7 /g
g/i/s/[ ]iiiiii\>/ 6 /g
g/i/s/[ ]iiiii\>/ 5 /g
g/i/s/[ ]iiii\>/ 4 /g
g/i/s/[ ]iii\>/ 3 /g
g/i/s/[ ]ii\>/ 2 /g
g/i/s/[ ]i\>/ 1 /g
g/[ ]/s///g
g/^0\{1,\}\([0-9]\)/s//\1/
$-3s/.*/A &/g
$-2s/.*/C &/g
$-1s/.*/G &/g
$s/.*/T &/g
,p
Q
- Output:
$ ed -s base-count.input < base-count.ed
Newline appended
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
A 129
C 97
G 119
T 155
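The "unary -> decimal" step in the ed script relies on a back-reference trick: a group repeated ten times captures the quotient of a division by ten, and whatever is left over (at most nine marks) is the remainder. The Python sketch below isolates that idea. It is an assumption-laden illustration rather than a transcription of the script: the function name `unary_to_decimal` is hypothetical, and it peels off digits in a loop instead of rewriting the line in place as ed does.

```python
import re

# (i*) is the quotient, \1{9} repeats it nine more times (ten copies in
# total), and (i{0,9}) is the remainder left after dividing by ten.
DIVIDE_BY_TEN = re.compile(r"(i*)\1{9}(i{0,9})")


def unary_to_decimal(unary: str) -> str:
    """Convert a run of 'i' marks into its decimal representation."""
    digits = []
    rest = unary
    while True:
        match = DIVIDE_BY_TEN.fullmatch(rest)
        if match is None:
            raise ValueError("input must consist only of 'i' characters")
        quotient, remainder = match.group(1), match.group(2)
        digits.append(str(len(remainder)))   # least significant digit first
        if not quotient:
            break
        rest = quotient
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(unary_to_decimal("i" * 129))
```

Each pass yields one decimal digit, which is why the ed script repeats the same substitution once per digit position it wants to support.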
Factor
USING: assocs formatting grouping io kernel literals math
math.statistics prettyprint qw sequences sorting ;
CONSTANT: dna
$[
qw{
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
} concat
]
: .dna ( seq n -- )
"SEQUENCE:" print [ group ] keep
[ * swap " %3d: %s\n" printf ] curry each-index ;
: show-counts ( seq -- )
"BASE COUNTS:" print histogram >alist [ first ] sort-with
[ [ " %c: %3d\n" printf ] assoc-each ]
[ "TOTAL: " write [ second ] [ + ] map-reduce . ] bi ;
dna [ 50 .dna nl ] [ show-counts ] bi
- Output:
SEQUENCE:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
BASE COUNTS:
A: 129
C: 97
G: 119
T: 155
TOTAL: 500
Forth
( Gforth 0.7.3 )
: dnacode s" CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" ;
variable #A \ Gforth initialises variables to 0
variable #C
variable #G
variable #T
variable #ch
50 constant pplength
: basecount ( adr u -- )
." Sequence:"
swap dup rot + swap ?do \ count while pretty-printing
#ch @ pplength mod 0= if cr #ch @ 10 .r 2 spaces then
i c@ dup emit
dup 'A = if drop #A @ 1+ #A ! else
dup 'C = if drop #C @ 1+ #C ! else
dup 'G = if drop #G @ 1+ #G ! else
dup 'T = if drop #T @ 1+ #T ! else drop then then then then
#ch @ 1+ #ch !
loop
cr cr ." Base counts:"
cr 4 spaces 'A emit ': emit #A @ 5 .r
cr 4 spaces 'C emit ': emit #C @ 5 .r
cr 4 spaces 'G emit ': emit #G @ 5 .r
cr 4 spaces 'T emit ': emit #T @ 5 .r
cr ." ----------"
cr ." Sum:" #ch @ 5 .r
cr ." ==========" cr cr
;
( demo run: )
dnacode basecount
- Output:
Sequence:
0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base counts:
A: 129
C: 97
G: 119
T: 155
----------
Sum: 500
==========
FreeBASIC
#define SCW 36
#define GRP 3
function padto( n as integer, w as integer ) as string
dim as string r = str(n)
while len(r)<w
r = " "+r
wend
return r
end function
dim as string dna = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"+_
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"+_
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"+_
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"+_
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"+_
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"+_
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"+_
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"+_
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"+_
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
dim as string outstr = "", currb
dim as integer bases(0 to 3), curr = 1, first = 1
while curr <= len(dna)
currb = mid(dna, curr, 1)
if currb = "A" then bases(0) += 1
if currb = "C" then bases(1) += 1
if currb = "G" then bases(2) += 1
if currb = "T" then bases(3) += 1
outstr += currb
curr += 1
if curr mod GRP = 1 then outstr += " "
if curr mod SCW = 1 or curr=len(dna)+1 then
outstr = padto(first,3) + "--" + padto(curr-1,3) + ": " + outstr
print outstr
outstr = ""
first = curr
end if
wend
print
print "Base counts"
print "-----------"
print " A: " + str(bases(0))
print " C: " + str(bases(1))
print " G: " + str(bases(2))
print " T: " + str(bases(3))
print
print " total: " + str(bases(0)+bases(1)+bases(2)+bases(3))
- Output:
1-- 36: CGT AAA AAA TTA CAA CGT CCT TTG GCT ATC TCT TAA
37-- 72: ACT CCT GCT AAA TGC TCG TGC TTT CCA ATT ATG TAA
73--108: GCG TTC CGA GAC GGG GTG GTC GAT TCT GAG GAC AAA
109--144: GGT CAA GAT GGA GCG CAT CGA ACG CAA TAA GGA TCA
145--180: TTT GAT GGG ACG TTT CGT CGA CAA AGT CTT GTT TCG
181--216: AGA GTA ACG GCT ACC GTC TTC GAT TCT GCT TAT AAC
217--252: ACT ATG TTC TTA TGA AAT GGA TGT TCT GAG TTG GTC
253--288: AGT CCC AAT GTG CGG GGT TTC TTT TAG TAC GTC GGG
289--324: AGT GGT ATT ATA TTT AAT TTT TCT ATA TAG CGA TCT
325--360: GTA TTT AAG CAA TTC ATT TAG GTT ATC GCC GCG ATG
361--396: CTC GGT TCG GAC CGC CAA GCA TCT GGC TCC ACT GCT
397--432: AGT GTC CTA AAT TTG AAT GGC AAA CAC AAA TAA GAT
433--468: TTA GCA ATT CGT GTA GAC GAC CGG GGA CTT GCA TGA
469--500: TGG GAG CAG CTT TGT TAA ACT ACG AAC GTA AT
Base counts
-----------
A: 129
C: 97
G: 119
T: 155
total: 500
FutureBasic
window 1, @"Bioinformatics/base count"
local fn SubstringCount( string as CFStringRef, substring as CFStringRef ) as long
CFStringRef tempString = fn StringByReplacingOccurrencesOfString( string, substring, @"" )
end fn = len(string) - len(tempString)
void local fn DoIt
CFArrayRef sequence
CFStringRef string
long index = 0
long a = 0, c = 0, g = 0, t = 0
sequence = @[@"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG",
@"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG",
@"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT",
@"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT",
@"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG",
@"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA",
@"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT",
@"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG",
@"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC",
@"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"]
for string in sequence
printf @"%3ld: %@",index,string
index += len(string)
a += fn SubstringCount( string, @"A" )
c += fn SubstringCount( string, @"C" )
g += fn SubstringCount( string, @"G" )
t += fn SubstringCount( string, @"T" )
next
print
printf @"A:\t\t%3ld",a
printf @"C:\t\t%3ld",c
printf @"G:\t\t%3ld",g
printf @"T:\t\t%3ld",t
printf @"\t\t---"
printf @"Total:\t%ld",a+c+g+t
end fn
fn DoIt
HandleEvents
- Output:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
A: 129
C: 97
G: 119
T: 155
---
Total: 500
Fōrmulæ
Fōrmulæ programs are not textual; they are visualized and edited by showing and manipulating structures rather than text. Moreover, there can be multiple visual representations of the same program. Although textual representations are possible (for example XML or JSON), they are intended for storage and transfer rather than for visualization and editing.
Programs in Fōrmulæ are created and edited online on its website.
In this page you can see and run the program(s) related to this task and their results. You can also change either the programs or the parameters they are called with, for experimentation, but remember that these programs were created with the main purpose of showing a clear solution of the task, and they generally lack any kind of validation.
Solution
Test case
Go
package main
import (
"fmt"
"sort"
)
func main() {
dna := "" +
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" +
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" +
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" +
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" +
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" +
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" +
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" +
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" +
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" +
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
fmt.Println("SEQUENCE:")
le := len(dna)
for i := 0; i < le; i += 50 {
k := i + 50
if k > le {
k = le
}
fmt.Printf("%5d: %s\n", i, dna[i:k])
}
baseMap := make(map[byte]int) // allows for 'any' base
for i := 0; i < le; i++ {
baseMap[dna[i]]++
}
var bases []byte
for k := range baseMap {
bases = append(bases, k)
}
sort.Slice(bases, func(i, j int) bool { // get bases into alphabetic order
return bases[i] < bases[j]
})
fmt.Println("\nBASE COUNT:")
for _, base := range bases {
fmt.Printf(" %c: %3d\n", base, baseMap[base])
}
fmt.Println(" ------")
fmt.Println(" Σ:", le)
fmt.Println(" ======")
}
- Output:
SEQUENCE:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
BASE COUNT:
A: 129
C: 97
G: 119
T: 155
------
Σ: 500
======
Haskell
import Data.List (group, sort)
import Data.List.Split (chunksOf)
import Text.Printf (printf, IsChar(..), PrintfArg(..), fmtChar, fmtPrecision, formatString)
data DNABase = A | C | G | T deriving (Show, Read, Eq, Ord)
type DNASequence = [DNABase]
instance IsChar DNABase where
toChar = head . show
fromChar = read . pure
instance PrintfArg DNABase where
formatArg x fmt = formatString (show x) (fmt { fmtChar = 's', fmtPrecision = Nothing })
test :: DNASequence
test = read . pure <$> concat
[ "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
, "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG"
, "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT"
, "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT"
, "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG"
, "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA"
, "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT"
, "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG"
, "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC"
, "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" ]
chunkedDNASequence :: DNASequence -> [(Int, [DNABase])]
chunkedDNASequence = zip [50,100..] . chunksOf 50
baseCounts :: DNASequence -> [(DNABase, Int)]
baseCounts = fmap ((,) . head <*> length) . group . sort
main :: IO ()
main = do
putStrLn "Sequence:"
mapM_ (uncurry (printf "%3d: %s\n")) $ chunkedDNASequence test
putStrLn "\nBase Counts:"
mapM_ (uncurry (printf "%2s: %2d\n")) $ baseCounts test
putStrLn (replicate 8 '-') >> printf " Σ: %d\n\n" (length test)
- Output:
Sequence:
50: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
100: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
150: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
200: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
250: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
300: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
350: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
400: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
450: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
500: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base Counts:
A: 129
C: 97
G: 119
T: 155
--------
Σ: 500
J
Solution:
countBases=: (({.;#)/.~)@,
totalBases=: #@,
require 'format/printf'
printSequence=: verb define
'Sequence:' printf ''
'%4d: %s' printf ((- {.)@(+/\)@:(#"1) ,.&<"_1 ]) y
'\n Base Count\n-----------' printf ''
'%5s: %4d' printf countBases y
'-----------\nTotal = %3d' printf totalBases y
)
Required Example:
DNABases=: ];._2 noun define
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
)
printSequence DNABases
Sequence:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base Count
-----------
C: 97
G: 119
T: 155
A: 129
-----------
Total = 500
Java
This can be quickly achieved using a for-loop and the String.toCharArray method.
Additionally, use a BufferedReader to utilize the readLine method.
To "pretty print" the output, we can use a String formatter.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
void printBaseCount(String string) throws IOException {
BufferedReader reader = new BufferedReader(new StringReader(string));
int index = 0;
String sequence;
int A = 0, C = 0, G = 0, T = 0;
int a, c, g, t;
while ((sequence = reader.readLine()) != null) {
System.out.printf("%d %s ", index++, sequence);
a = c = g = t = 0;
for (char base : sequence.toCharArray()) {
switch (base) {
case 'A' -> {
A++;
a++;
}
case 'C' -> {
C++;
c++;
}
case 'G' -> {
G++;
g++;
}
case 'T' -> {
T++;
t++;
}
}
}
System.out.printf("[A %2d, C %2d, G %2d, T %2d]%n", a, c, g, t);
}
reader.close();
int total = A + C + G + T;
System.out.printf("%nTotal of %d bases%n", total);
System.out.printf("A %3d (%.2f%%)%n", A, ((double) A / total) * 100);
System.out.printf("C %3d (%.2f%%)%n", C, ((double) C / total) * 100);
System.out.printf("G %3d (%.2f%%)%n", G, ((double) G / total) * 100);
System.out.printf("T %3d (%.2f%%)%n", T, ((double) T / total) * 100);
}
- Output:
0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG [A 16, C 12, G 6, T 16]
1 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG [A 8, C 11, G 15, T 16]
2 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT [A 19, C 8, G 14, T 9]
3 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT [A 10, C 11, G 14, T 15]
4 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG [A 12, C 7, G 11, T 20]
5 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA [A 9, C 8, G 15, T 18]
6 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT [A 14, C 5, G 6, T 25]
7 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG [A 7, C 18, G 15, T 10]
8 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC [A 20, C 8, G 8, T 14]
9 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT [A 14, C 9, G 15, T 12]
Total of 500 bases
A 129 (25.80%)
C 97 (19.40%)
G 119 (23.80%)
T 155 (31.00%)
Alternatively
For counting the bases, we simply use a HashMap and call Map.merge for each character, inserting 1 and using Integer::sum as the aggregation function. This effectively creates a Map that keeps a running count for us. Java does provide the groupingBy and counting collectors, which would generally make this kind of operation easier. However, String’s chars() method returns an IntStream, so every value has to be boxed back to a Character before it can be grouped, which makes the stream version more verbose and less efficient. Ultimately, doing it by hand is easier and more efficient than with streams. The best tool for this job, though, would be Guava’s Multiset, which is a dedicated key-to-count container.
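As a point of comparison, the collector-based approach mentioned above might look like the following sketch. The class name BaseCountStreams and the method countBases are illustrative and not part of the original solution; TreeMap is an assumption made here only so that the bases come out in alphabetical order.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class BaseCountStreams {
    /** Count each character of dna using the groupingBy and counting collectors. */
    static Map<Character, Long> countBases(String dna) {
        return dna.chars()                      // IntStream of UTF-16 code units
                  .mapToObj(c -> (char) c)      // box every int back to a Character
                  .collect(Collectors.groupingBy(
                          Function.identity(),  // key: the base itself
                          TreeMap::new,         // sorted keys (illustrative choice)
                          Collectors.counting()));
    }

    public static void main(String[] args) {
        // First 20 bases of the task sequence, just to keep the demo short.
        countBases("CGTAAAAAATTACAACGTCC")
                .forEach((base, count) -> System.out.printf("%5s : %d%n", base, count));
    }
}
```

Note the extra mapToObj step and the Long counts: this is the boxing overhead the paragraph above refers to, which the Map.merge loop avoids.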
Note that Java’s native strings are UTF-16: each char is a 2-byte code unit. If parsing from a very large ASCII/UTF-8 text file, String is a poor choice compared with, say, byte[]. For the purpose of this exercise, though, using byte[] would just add uninteresting casts and bloat to the code, so we stick to String.
import java.util.HashMap;
import java.util.Map;
public class orderedSequence {
public static void main(String[] args) {
Sequence gene = new Sequence("CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT");
gene.runSequence();
}
}
/** Separate class for defining behaviors */
public class Sequence {
private final String seq;
public Sequence(String sq) {
this.seq = sq;
}
/** print the organized structure of the sequence */
public void prettyPrint() {
System.out.println("Sequence:");
int i = 0;
for ( ; i < seq.length() - 50 ; i += 50) {
System.out.printf("%5s : %s\n", i + 50, seq.substring(i, i + 50));
}
System.out.printf("%5s : %s\n", seq.length(), seq.substring(i));
}
/** display a base vs. frequency chart */
public void displayCount() {
Map<Character, Integer> counter = new HashMap<>();
for (int i = 0 ; i < seq.length() ; ++i) {
counter.merge(seq.charAt(i), 1, Integer::sum);
}
System.out.println("Base vs. Count:");
counter.forEach(
(key, value) -> System.out.printf("%5s : %s\n", key, value));
System.out.printf("%5s: %s\n", "SUM", seq.length());
}
public void runSequence() {
this.prettyPrint();
this.displayCount();
}
}
- Output:
Sequence:
50 : CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
100 : CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
150 : AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
200 : GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
250 : CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
300 : TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
350 : TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
400 : CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
450 : TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
500 : GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base vs. Count:
A : 129
C : 97
T : 155
G : 119
SUM: 500
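The note above about byte[] can be illustrated with a small sketch, under the assumption that the input is plain ASCII. The class name BaseCountBytes and the method histogram are hypothetical names chosen for this example; a real program would read the bytes from a file rather than from a String literal.

```java
import java.nio.charset.StandardCharsets;

class BaseCountBytes {
    /** Build a histogram of byte values; index it with 'A', 'C', 'G' or 'T'. */
    static long[] histogram(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;   // mask so the index is never negative
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from a file (assumption: ASCII input).
        byte[] data = "CGTAAAAAATTACAACGTCC\n".getBytes(StandardCharsets.US_ASCII);
        long[] counts = histogram(data);
        for (char base : new char[] {'A', 'C', 'G', 'T'}) {
            System.out.printf("%5s : %d%n", base, counts[base]);
        }
    }
}
```

Because every byte value gets its own slot, line breaks and other non-base bytes are counted separately and never disturb the four base totals.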
JavaScript
const rowLength = 50;
const bases = ['A', 'C', 'G', 'T'];
// Create the starting sequence
const seq = `CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT`
.split('')
.filter(e => bases.includes(e))
/**
* Convert the given array into an array of smaller arrays each with the length
* given by n.
* @param {number} n
* @returns {function(!Array<*>): !Array<!Array<*>>}
*/
const chunk = n => a => a.reduce(
(p, c, i) => (!(i % n)) ? p.push([c]) && p : p[p.length - 1].push(c) && p,
[]);
const toRows = chunk(rowLength);
/**
* Given a number, return function that takes a string and left pads it to n
* @param {number} n
* @returns {function(string): string}
*/
const padTo = n => v => ('' + v).padStart(n, ' ');
const pad = padTo(5);
/**
* Count the number of elements that match the given value in an array
* @param {Array<string>} arr
* @returns {function(string): number}
*/
const countIn = arr => s => arr.filter(e => e === s).length;
/**
* Utility logging function
* @param {string|number} v
* @param {string|number} n
*/
const print = (v, n) => console.log(`${pad(v)}:\t${n}`)
const prettyPrint = seq => {
const chunks = toRows(seq);
console.log('SEQUENCE:')
chunks.forEach((e, i) => print(i * rowLength, e.join('')))
}
const printBases = (seq, bases) => {
const filterSeq = countIn(seq);
const counts = bases.map(filterSeq);
console.log('\nBASE COUNTS:')
counts.forEach((e, i) => print(bases[i], e));
print('Total', counts.reduce((p,c) => p + c, 0));
}
prettyPrint(seq);
printBases(seq, bases);
- Output:
SEQUENCE:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
BASE COUNTS:
A: 129
C: 97
G: 119
T: 155
Total: 500
jq
Naive (in-memory) solution
First, some general utility functions:
def lpad($len; $fill): tostring | ($len - length) as $l | ($fill * $l)[:$l] + .;
# Create a bag of words, i.e. a JSON object with counts of the items in the stream
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
Next, some helper functions:
def read_seq:
reduce inputs as $line (""; . + $line);
# Emit a bow of the letters in the input string
def counts:
. as $in | bow(range(0;length) | $in[.:.+1]);
def pp_counts:
"BASE COUNTS:",
(counts | to_entries | sort[] | " \(.key): \(.value | lpad(6;" "))"),
"Total: \(length|lpad(7;" "))" ;
def pp_sequence($cols):
range(0; length / $cols) as $i
| "\($i*$cols | lpad(5; " ")): " + .[ $i * $cols : ($i+1) * $cols] ;
Finally, the task at hand:
read_seq | pp_sequence(50), "", pp_counts
- Output:
The invocation:
jq -nrR -f base_count.jq base_count.txt
produces:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
BASE COUNTS:
A: 129
C: 97
G: 119
T: 155
Total: 500
Memory-efficient solution
def lpad($len; $fill): tostring | ($len - length) as $l | ($fill * $l)[:$l] + .;
# "bow" = bag of words, i.e. a JSON object with counts
# Input: a bow or null
# Output: augmented bow
def bow(stream):
reduce stream as $word (.; .[($word|tostring)] += 1);
# The main function ignores its input in favor of `stream`:
def report(stream; $cols):
# input: a string, possibly longer than $cols
def pp_sequence($start):
range(0; length / $cols) as $i
| "\($start + ($i*$cols) | lpad(5; " ")): " + .[ $i * $cols : ($i+1) * $cols] ;
# input: a bow
def pp_counts:
"BASE COUNTS:",
(to_entries | sort[] | " \(.key): \(.value | lpad(6;" "))"),
"Total: \( [.[]] | add | lpad(7;" "))" ;
# state: {bow, emit, pending, start}
foreach (stream,null) as $line ({start: - $cols};
.start += $cols
| if $line == null
then .emit = .pending
else .bow |= bow(range(0; $line|length) | $line[.:.+1])
| (($line|length) + (.pending|length) ) as $len
| if $len >= $cols
then (.pending + $line) as $new
| .emit = $new[:$cols]
| .pending = $new[$cols:]
else .pending = $line
end
end;
(select(.emit|length > 0) | .start as $start | .emit | pp_sequence($start)),
(select($line == null) | "", (.bow|pp_counts) ) )
;
# To illustrate reformatting:
report(inputs; 33)
- Output:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCT
33: TAAACTCCTGCTAAATGCTCGTGCTTTCCAATT
66: ATGTAAGCGTTCCGAGACGGGGTGGTCGATTCT
99: GAGGACAAAGGTCAAGATGGAGCGCATCGAACG
132: CAATAAGGATCATTTGATGGGACGTTTCGTCGA
165: CAAAGTCTTGTTTCGAGAGTAACGGCTACCGTC
198: TTCGATTCTGCTTATAACACTATGTTCTTATGA
231: AATGGATGTTCTGAGTTGGTCAGTCCCAATGTG
264: CGGGGTTTCTTTTAGTACGTCGGGAGTGGTATT
297: ATATTTAATTTTTCTATATAGCGATCTGTATTT
330: AAGCAATTCATTTAGGTTATCGCCGCGATGCTC
363: GGTTCGGACCGCCAAGCATCTGGCTCCACTGCT
396: AGTGTCCTAAATTTGAATGGCAAACACAAATAA
429: GATTTAGCAATTCGTGTAGACGACCGGGGACTT
462: GCATGATGGGAGCAGCTTTGTTAAACTACGAAC
495: GTAAT
BASE COUNTS:
A: 129
C: 97
G: 119
T: 155
Total: 500
Julia
const sequence =
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" *
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" *
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" *
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" *
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" *
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" *
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" *
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" *
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" *
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
function dnasequenceprettyprint(seq, colsize=50)
println(length(seq), "nt DNA sequence:\n")
rows = [seq[i:min(length(seq), i + colsize - 1)] for i in 1:colsize:length(seq)]
for (i, r) in enumerate(rows)
println(lpad(colsize * (i - 1), 5), " ", r)
end
end
dnasequenceprettyprint(sequence)
function printcounts(seq)
bases = [['A', 0], ['C', 0], ['G', 0], ['T', 0]]
for c in seq, base in bases
if c == base[1]
base[2] += 1
end
end
println("\nNucleotide counts:\n")
for base in bases
println(lpad(base[1], 10), lpad(string(base[2]), 12))
end
println(lpad("Other", 10), lpad(string(length(seq) - sum(x[2] for x in bases)), 12))
println(" _________________\n", lpad("Total", 10), lpad(string(length(seq)), 12))
end
printcounts(sequence)
- Output:
500nt DNA sequence:
0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Nucleotide counts:
A 129
C 97
G 119
T 155
Other 0
_________________
Total 500
Kotlin
For the first part, we can leverage the built-in String.chunked to transform a String into a List<String>, where each String has a defined chunk size.
For counting the bases, we use groupingBy, which is a versatile tool for aggregating objects based on a key function. In this case, the key function is the identity function (it), and the aggregation function is the counting function: eachCount.
Finally, the total count is simply the input’s length.
fun printSequence(sequence: String, width: Int = 50) {
fun printWithLabel(label: Any, data: Any) =
label.toString().padStart(5).also { println("$it: $data") }
println("SEQUENCE:")
sequence.chunked(width).forEachIndexed() { i, chunk ->
printWithLabel(i * width + chunk.length, chunk)
}
println("BASE:")
sequence.groupingBy { it }.eachCount().forEach { (k, v) ->
printWithLabel(k, v)
}
printWithLabel("TOTALS", sequence.length)
}
const val BASE_SEQUENCE = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
fun main() = printSequence(BASE_SEQUENCE)
- Output:
SEQUENCE:
50: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
100: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
150: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
200: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
250: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
300: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
350: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
400: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
450: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
500: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
BASE:
C: 97
G: 119
T: 155
A: 129
TOTALS: 500
Lambdatalk
{def DNA CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT}
-> DNA
{def base_count
{def base_count.r
{lambda {:dna :b :n :i :count}
{if {> :i :n}
then :count
else {base_count.r :dna :b :n {+ :i 1}
{if {W.equal? {W.get :i :dna} :b}
then {+ :count 1}
else :count}} }}}
{lambda {:dna :b}
{base_count.r :dna :b {- {W.length :dna} 1} 0 0} }}
-> base_count
{def S {S.map {base_count {DNA}} A C G T}}
-> S
[A C G T] = (129 97 119 155)
A+C+G+T = {+ {S}}
-> A+C+G+T = 500
Lua
function prettyprint(seq) -- approx DDBJ format
seq = seq:gsub("%A",""):lower()
local sums, n = { a=0, c=0, g=0, t=0 }, 1
seq:gsub("(%a)", function(c) sums[c]=sums[c]+1 end)
local function printf(s,...) io.write(s:format(...)) end
printf("LOCUS AB000000 %12d bp mRNA linear HUM 01-JAN-2001\n", #seq)
printf(" BASE COUNT %12d a %12d c %12d g %12d t\n", sums.a, sums.c, sums.g, sums.t)
printf("ORIGIN\n")
while n <= #seq do -- <= so that a final one-base row is not skipped
local sub60 = seq:sub(n,n+59)
printf("%9d %s\n", n, sub60:gsub("(..........)","%1 "))
n = n + #sub60
end
end
prettyprint[[
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
]]
- Output:
LOCUS AB000000 500 bp mRNA linear HUM 01-JAN-2001
BASE COUNT 129 a 97 c 119 g 155 t
ORIGIN
1 cgtaaaaaat tacaacgtcc tttggctatc tcttaaactc ctgctaaatg ctcgtgcttt
61 ccaattatgt aagcgttccg agacggggtg gtcgattctg aggacaaagg tcaagatgga
121 gcgcatcgaa cgcaataagg atcatttgat gggacgtttc gtcgacaaag tcttgtttcg
181 agagtaacgg ctaccgtctt cgattctgct tataacacta tgttcttatg aaatggatgt
241 tctgagttgg tcagtcccaa tgtgcggggt ttcttttagt acgtcgggag tggtattata
301 tttaattttt ctatatagcg atctgtattt aagcaattca tttaggttat cgccgcgatg
361 ctcggttcgg accgccaagc atctggctcc actgctagtg tcctaaattt gaatggcaaa
421 cacaaataag atttagcaat tcgtgtagac gaccggggac ttgcatgatg ggagcagctt
481 tgttaaacta cgaacgtaat
Mathematica / Wolfram Language
seq = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCA\
ATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGC\
AATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGA\
TTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTC\
TTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTT\
AGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAA\
TGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGT\
TAAACTACGAACGTAAT";
size = 70;
parts = StringPartition[seq, UpTo[size]];
begins = Most[Accumulate[Prepend[StringLength /@ parts, 1]]];
ends = Rest[Accumulate[Prepend[StringLength /@ parts, 0]]];
StringRiffle[MapThread[ToString[#1] <> "-" <> ToString[#2] <> ": " <> #3 &, {begins, ends, parts}], "\n"]
StringRiffle[#1 <> ": " <> ToString[#2] & @@@ Tally[Characters[seq]], "\n"]
- Output:
1-70: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGT
71-140: AAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGG
141-210: ATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCT
211-280: TATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGT
281-350: ACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
351-420: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAA
421-490: CACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTA
491-500: CGAACGTAAT
C: 97
G: 119
T: 155
A: 129
MATLAB / Octave
function r = base_count(f)
fid = fopen(f,'r');
nn=[0,0,0,0];
while ~feof(fid)
s = fgetl(fid);
fprintf(1,'%5d :%s\n', sum(nn), s(s=='A'|s=='C'|s=='G'|s=='T'));
nn = nn+[sum(s=='A'),sum(s=='C'),sum(s=='G'),sum(s=='T')];
end
fclose(fid);
fprintf(1, '\nBases:\n\n A : %d\n C : %d\n G : %d\n T : %d\n', nn);
fprintf(1, '\nTotal: %d\n\n', sum(nn));
end;
- Output:
base_count('base_count_data.txt');
0 :CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 :CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 :AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 :GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 :CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 :TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 :TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 :CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 :TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 :GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Bases:
A : 129
C : 97
G : 119
T : 155
Total: 500
Nim
Rather than inventing a new presentation format, we have chosen to use the EMBL (European Molecular Biology Laboratory) format which is well documented. See specifications here: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt
import strformat
import strutils
const Source = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" &
               "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" &
               "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" &
               "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" &
               "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" &
               "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" &
               "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" &
               "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" &
               "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" &
               "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
# Enumeration type for bases.
type Base* {.pure.} = enum A, C, G, T, Other = "other"
proc display*(dnaSeq: string) =
  ## Display a DNA sequence using EMBL format.
  var counts: array[Base, Natural] # Count of bases.
  for c in dnaSeq:
    inc counts[parseEnum[Base]($c, Other)] # Use Other as default value.
  # Display the SQ line.
  var sqline = fmt"SQ {dnaSeq.len} BP; "
  for (base, count) in counts.pairs:
    sqline &= fmt"{count} {base}; "
  echo sqline
  # Display the sequence.
  var idx = 0
  var row = newStringOfCap(80)
  var remaining = dnaSeq.len
  while remaining > 0:
    row.setLen(0)
    row.add(" ")
    # Add groups of 10 bases.
    for group in 1..6:
      let nextIdx = idx + min(10, remaining)
      row.add(dnaSeq[idx..<nextIdx] & ' ')
      dec remaining, nextIdx - idx
      idx = nextIdx
      if remaining == 0:
        break
    # Append the number of the last base in the row.
    row.add(spaces(72 - row.len))
    row.add(fmt"{idx:>8}")
    echo row
  # Add termination.
  echo "//"
when isMainModule:
  Source.display()
- Output:
SQ 500 BP; 129 A; 97 C; 119 G; 155 T; 0 other;
CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG CTCGTGCTTT 60
CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG AGGACAAAGG TCAAGATGGA 120
GCGCATCGAA CGCAATAAGG ATCATTTGAT GGGACGTTTC GTCGACAAAG TCTTGTTTCG 180
AGAGTAACGG CTACCGTCTT CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT 240
TCTGAGTTGG TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA 300
TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT CGCCGCGATG 360
CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG TCCTAAATTT GAATGGCAAA 420
CACAAATAAG ATTTAGCAAT TCGTGTAGAC GACCGGGGAC TTGCATGATG GGAGCAGCTT 480
TGTTAAACTA CGAACGTAAT
Pascal
program DNA_Base_Count;
{$IFDEF FPC}
{$MODE DELPHI}//String = AnsiString
{$ELSE}
{$APPTYPE CONSOLE}
{$ENDIF}
const
dna =
'CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG' +
'CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG' +
'AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT' +
'GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT' +
'CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG' +
'TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA' +
'TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT' +
'CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG' +
'TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC' +
'GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT';
var
CntIdx : array of NativeUint;
DNABases : String;
SumBaseTotal : NativeInt;
procedure OutFormatBase(var DNA: String;colWidth:NativeInt);
var
j: NativeInt;
Begin
j := 0;
Writeln(' DNA base sequence');
While j<Length(DNA) do
Begin
writeln(j:5,copy(DNA,j+1,colWidth):colWidth+2);
inc(j,colWidth);
end;
writeln;
end;
procedure Cnt(const DNA: String);
var
i,p :NativeInt;
Begin
//dynamic arrays are 0-based, Pos() is 1-based: index 0 stays unused
SetLength(CntIdx,Length(DNABases)+1);
i := 1;
while i <= Length(DNA) do
Begin
p := Pos(DNA[i],DNABases);
//found new base so extend list
if p = 0 then
Begin
DNABases := DNABases+DNA[i];
p := length(DNABases);
Setlength(CntIdx,p+1);
end;
inc(CntIdx[p]);
inc(i);
end;
Writeln('Base Count');
SumBaseTotal := 0;
For i := 1 to Length(DNABases) do
Begin
p := CntIdx[i];
inc(SumBaseTotal,p);
writeln(DNABases[i]:4,p:10);
end;
Writeln('Total base count ',SumBaseTotal);
writeln;
end;
var
TestDNA: String;
Begin
DNABases :='ACGT';// predefined
TestDNA := DNA;
OutFormatBase(TestDNA,50);
Cnt(TestDNA);
end.
- Output:
DNA base sequence
0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base Count
A 129
C 97
G 119
T 155
Total base count 500
Perl
use strict;
use warnings;
use feature 'say';
my %cnt;
my $total = 0;
while ($_ = <DATA>) {
chomp;
printf "%4d: %s\n", $total+1, s/(.{10})/$1 /gr;
$total += length;
$cnt{$_}++ for split //
}
say "\nTotal bases: $total";
say "$_: " . ($cnt{$_}//0) for <A C G T>;
__DATA__
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
- Output:
1: CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG
51: CTCGTGCTTT CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG
101: AGGACAAAGG TCAAGATGGA GCGCATCGAA CGCAATAAGG ATCATTTGAT
151: GGGACGTTTC GTCGACAAAG TCTTGTTTCG AGAGTAACGG CTACCGTCTT
201: CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT TCTGAGTTGG
251: TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA
301: TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT
351: CGCCGCGATG CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG
401: TCCTAAATTT GAATGGCAAA CACAAATAAG ATTTAGCAAT TCGTGTAGAC
451: GACCGGGGAC TTGCATGATG GGAGCAGCTT TGTTAAACTA CGAACGTAAT
Total bases: 500
A: 129
C: 97
G: 119
T: 155
Phix
constant dna = substitute("""
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
""","\n","")
sequence acgt = repeat(0,5)
for i=1 to length(dna) do
acgt[find(dna[i],"ACGT")] += 1
end for
acgt[$] = sum(acgt)
sequence s = split(trim(join_by(split(join_by(dna,1,10,""),"\n"),1,5," ")),"\n")
for i=1 to length(s) do
printf(1,"%3d: %s\n",{(i-1)*50+1,s[i]})
end for
printf(1,"\nBase counts: A:%d, C:%d, G:%d, T:%d, total:%d\n",acgt)
- Output:
1: CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG
51: CTCGTGCTTT CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG
101: AGGACAAAGG TCAAGATGGA GCGCATCGAA CGCAATAAGG ATCATTTGAT
151: GGGACGTTTC GTCGACAAAG TCTTGTTTCG AGAGTAACGG CTACCGTCTT
201: CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT TCTGAGTTGG
251: TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA
301: TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT
351: CGCCGCGATG CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG
401: TCCTAAATTT GAATGGCAAA CACAAATAAG ATTTAGCAAT TCGTGTAGAC
451: GACCGGGGAC TTGCATGATG GGAGCAGCTT TGTTAAACTA CGAACGTAAT
Base counts: A:129, C:97, G:119, T:155, total:500
Picat
main =>
dna(DNA, ChunkSize),
Count = 0,
println("Sequence:"),
Map = new_map(['A'=0,'C'=0,'G'=0,'T'=0]),
foreach(Chunk in DNA.chunks_of(ChunkSize))
printf("%4d: %s\n", Count, Chunk),
Count := Count + Chunk.len,
foreach(C in Chunk)
Map.put(C,Map.get(C)+1)
end
end,
println("\nBase count:"),
foreach(C in "ACGT")
printf("%5c: %3d\n", C, Map.get(C))
end,
printf("Total: %d\n", Count),
nl.
dna(DNA,ChunkSize) =>
DNA = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT".delete_all('\n'),
ChunkSize = 50.
- Output:
Sequence:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base count:
A: 129
C: 97
G: 119
T: 155
Total: 500
PicoLisp
(let
(S (chop "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG\
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG\
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT\
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT\
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG\
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA\
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT\
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG\
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC\
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" )
R )
(for I S (accu 'R I 1))
(for I R (println I))
(println 'Total: (sum cdr R)) )
- Output:
("A" . 129)
("T" . 155)
("G" . 119)
("C" . 97)
Total: 500
PureBasic
dna$ = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" +
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" +
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" +
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" +
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" +
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" +
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" +
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" +
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" +
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
NewMap basecount.i()
If OpenConsole("")
For i = 1 To Len(dna$)
If (i % 50) = 1
Print(~"\n" + RSet(Str(i - 1), 5) + " : ")
EndIf
t$ = Mid(dna$, i, 1)
basecount(t$) + 1
Print(t$)
Next
PrintN(~"\n\n" + Space(2) + "Base count")
PrintN(Space(2) + ~"---- -----")
ForEach basecount()
PrintN(RSet(MapKey(basecount()), 5) + " : " + RSet(Str(basecount()), 5))
sigma + basecount()
Next
PrintN(~"\n" + "Total = " + RSet(Str(sigma), 5))
Input()
EndIf
- Output:
0 : CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
50 : CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
100 : AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
150 : GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
200 : CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
250 : TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
300 : TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
350 : CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
400 : TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
450 : GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
Base count
---- -----
A : 129
C : 97
G : 119
T : 155
Total = 500
Python
Procedural
from collections import Counter
def basecount(dna):
    return sorted(Counter(dna).items())
def seq_split(dna, n=50):
    return [dna[i: i+n] for i in range(0, len(dna), n)]
def seq_pp(dna, n=50):
    for i, part in enumerate(seq_split(dna, n)):
        print(f"{i*n:>5}: {part}")
    print("\n BASECOUNT:")
    tot = 0
    for base, count in basecount(dna):
        print(f" {base:>3}: {count}")
        tot += count
    base, count = 'TOT', tot
    print(f" {base:>3}= {count}")
if __name__ == '__main__':
    print("SEQUENCE:")
    sequence = '''\
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG\
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG\
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT\
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT\
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG\
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA\
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT\
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG\
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC\
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT'''
    seq_pp(sequence)
- Output:
SEQUENCE: 0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT BASECOUNT: A: 129 C: 97 G: 119 T: 155 TOT= 500
Procedural (dictionary version)
"""
Python 3.10.5 (main, Jun 6 2022, 18:49:26) [GCC 12.1.0] on linux
Created on Wed 2022/08/17 11:19:31
"""
def main ():
    def DispCount () :
        return f'\n\nBases :\n\n' + f''.join ( [ f'{i} =\t{D [ i ]:4d}\n' for i in sorted ( BoI ) ] )
    S = 'CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG' \
        'AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT' \
        'CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA' \
        'TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG' \
        'TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT'
    All = set( S )
    BoI = set ( [ "A","C","G","T" ] )
    other = All - BoI
    D = { k : S.count ( k ) for k in All }
    print ( 'Sequence:\n\n')
    print ( ''.join ( [ f'{k:4d} : {S [ k: k + 50 ]}\n' for k in range ( 0, len ( S ), 50 ) ] ) )
    print ( f'{DispCount ()} \n------------')
    print ( '' if ( other == set () ) else f'Other\t{sum ( [ D [ k ] for k in sorted ( other ) ] ):4d}\n\n' )
    print ( f'Σ = \t {sum ( [ D [ k ] for k in sorted ( All ) ] ) } \n============\n')
    pass
def test ():
    pass
## START
LIVE = True
if ( __name__ == '__main__' ) :
    main () if LIVE else test ()
JPD 2022/08/17
- Output:
Sequence: 0 :CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50 :CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100 :AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150 :GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200 :CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250 :TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300 :TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350 :CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400 :TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450 :GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT Bases : A = 129 C = 97 G = 119 T = 155 ------------ Σ = 500 ============
Functional
Sequence and base counts displayed in GenBank format.
'''Bioinformatics – base count'''
from itertools import count
from functools import reduce
# genBankFormatWithBaseCounts :: String -> String
def genBankFormatWithBaseCounts(sequence):
    '''DNA Sequence displayed in a subset of the GenBank format.
       See example at foot of:
       https://www.genomatix.de/online_help/help/sequence_formats.html
    '''
    ks, totals = zip(*baseCounts(sequence))
    ns = list(map(str, totals))
    w = 2 + max(map(len, ns))
    return '\n'.join([
        'DEFINITION len=' + str(sum(totals)),
        'BASE COUNT ' + ''.join(
            n.rjust(w) + ' ' + k.lower() for (k, n)
            in zip(ks, ns)
        ),
        'ORIGIN'
    ] + [
        str(i).rjust(9) + ' ' + k for i, k
        in zip(
            count(1, 60),
            [
                ' '.join(row) for row in
                chunksOf(6)(chunksOf(10)(sequence))
            ]
        )
    ] + ['//'])
# baseCounts :: String -> Zip [(String, Int)]
def baseCounts(baseString):
    '''Sums for each base type in the given sequence string, with
       a fifth sum for any characters not drawn from {A, C, G, T}.'''
    bases = {
        'A': 0,
        'C': 1,
        'G': 2,
        'T': 3
    }
    return zip(
        list(bases.keys()) + ['Other'],
        foldl(
            lambda a: compose(
                nthArrow(succ)(a),
                flip(curry(bases.get))(4)
            )
        )((0, 0, 0, 0, 0))(baseString)
    )
# -------------------------- TEST --------------------------
# main :: IO ()
def main():
    '''Base counts and sequence displayed in GenBank format
    '''
    print(
        genBankFormatWithBaseCounts('''\
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG\
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG\
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT\
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT\
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG\
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA\
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT\
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG\
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC\
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT''')
    )
# ------------------------ GENERIC -------------------------
# chunksOf :: Int -> [a] -> [[a]]
def chunksOf(n):
    '''A series of lists of length n, subdividing the
       contents of xs. Where the length of xs is not evenly
       divisible, the final list will be shorter than n.
    '''
    return lambda xs: reduce(
        lambda a, i: a + [xs[i:n + i]],
        range(0, len(xs), n), []
    ) if 0 < n else []
# compose :: ((a -> a), ...) -> (a -> a)
def compose(*fs):
    '''Composition, from right to left,
       of a series of functions.
    '''
    def go(f, g):
        def fg(x):
            return f(g(x))
        return fg
    return reduce(go, fs, lambda x: x)
# curry :: ((a, b) -> c) -> a -> b -> c
def curry(f):
    '''A curried function derived
       from an uncurried function.
    '''
    return lambda x: lambda y: f(x, y)
# flip :: (a -> b -> c) -> b -> a -> c
def flip(f):
    '''The (curried or uncurried) function f with its
       arguments reversed.
    '''
    return lambda a: lambda b: f(b)(a)
# foldl :: (a -> b -> a) -> a -> [b] -> a
def foldl(f):
    '''Left to right reduction of a list,
       using the binary operator f, and
       starting with an initial value a.
    '''
    def go(acc, xs):
        return reduce(lambda a, x: f(a)(x), xs, acc)
    return lambda acc: lambda xs: go(acc, xs)
# nthArrow :: (a -> b) -> Tuple -> Int -> Tuple
def nthArrow(f):
    '''A simple function lifted to one which applies
       to a tuple, transforming only its nth value.
    '''
    def go(v, n):
        return v if n > len(v) else [
            x if n != i else f(x)
            for i, x in enumerate(v)
        ]
    return lambda tpl: lambda n: tuple(go(tpl, n))
# succ :: Enum a => a -> a
def succ(x):
    '''The successor of a value.
       For numeric types, (1 +).
    '''
    return 1 + x
# MAIN ---
if __name__ == '__main__':
    main()
- Output:
DEFINITION len=500 BASE COUNT 129 a 97 c 119 g 155 t 0 other ORIGIN 1 CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG CTCGTGCTTT 61 CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG AGGACAAAGG TCAAGATGGA 121 GCGCATCGAA CGCAATAAGG ATCATTTGAT GGGACGTTTC GTCGACAAAG TCTTGTTTCG 181 AGAGTAACGG CTACCGTCTT CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT 241 TCTGAGTTGG TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA 301 TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT CGCCGCGATG 361 CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG TCCTAAATTT GAATGGCAAA 421 CACAAATAAG ATTTAGCAAT TCGTGTAGAC GACCGGGGAC TTGCATGATG GGAGCAGCTT 481 TGTTAAACTA CGAACGTAAT //
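For comparison, the ORIGIN layout shown in the output above (rows of 60 bases split into blocks of 10, each row labelled with its 1-based offset) can also be sketched directly with slices. This is only an illustrative sketch: the name `genbank_rows` and its `row_len` and `block_len` parameters are assumptions introduced here, not part of the solution above or of any GenBank tooling.

```python
def genbank_rows(sequence, row_len=60, block_len=10):
    """Yield ORIGIN-style rows: a right-justified 1-based offset,
    then the row's bases in space-separated blocks.
    (Hypothetical helper, named here for illustration only.)"""
    for start in range(0, len(sequence), row_len):
        row = sequence[start:start + row_len]
        # Split the row into blocks; the last block may be shorter.
        blocks = [row[i:i + block_len] for i in range(0, len(row), block_len)]
        yield str(start + 1).rjust(9) + ' ' + ' '.join(blocks)


if __name__ == '__main__':
    # First 70 bases of the task's sequence.
    sample = ('CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG'
              'CTCGTGCTTTCCAATTATGT')
    for line in genbank_rows(sample):
        print(line)
```

Slicing keeps the offset arithmetic explicit, at the cost of the composability that the curried `chunksOf` version offers.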
Quackery
[ over size -
space swap of
swap join ] is justify ( $ n --> $ )
[ 0 swap
[ dup $ "" != while
cr over number$
4 justify echo$
5 times
[ dup $ "" = iff
conclude done
sp
10 split swap echo$ ]
dip [ 50 + ] again ]
2drop ] is prettyprint ( $ --> )
[ stack ] is adenine ( --> s )
[ stack ] is cytosine ( --> s )
[ stack ] is guanine ( --> s )
[ stack ] is thymine ( --> s )
[ table
adenine cytosine
guanine thymine ] is bases ( --> [ )
[ 4 times
[ 0 i^ bases put ]
witheach
[ $ "ACGT" find bases
1 swap tally ]
4 times
[ sp
i^ bases dup echo
sp share echo cr ]
0 4 times
[ i^ bases take + ]
cr say " total " echo ] is tallybases ( [ --> )
$ "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
$ "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" join
$ "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" join
$ "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" join
$ "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" join
$ "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" join
$ "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" join
$ "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" join
$ "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" join
$ "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" join
dup prettyprint cr cr tallybases
- Output:
0 CGTAAAAAAT TACAACGTCC TTTGGCTATC TCTTAAACTC CTGCTAAATG 50 CTCGTGCTTT CCAATTATGT AAGCGTTCCG AGACGGGGTG GTCGATTCTG 100 AGGACAAAGG TCAAGATGGA GCGCATCGAA CGCAATAAGG ATCATTTGAT 150 GGGACGTTTC GTCGACAAAG TCTTGTTTCG AGAGTAACGG CTACCGTCTT 200 CGATTCTGCT TATAACACTA TGTTCTTATG AAATGGATGT TCTGAGTTGG 250 TCAGTCCCAA TGTGCGGGGT TTCTTTTAGT ACGTCGGGAG TGGTATTATA 300 TTTAATTTTT CTATATAGCG ATCTGTATTT AAGCAATTCA TTTAGGTTAT 350 CGCCGCGATG CTCGGTTCGG ACCGCCAAGC ATCTGGCTCC ACTGCTAGTG 400 TCCTAAATTT GAATGGCAAA CACAAATAAG ATTTAGCAAT TCGTGTAGAC 450 GACCGGGGAC TTGCATGATG GGAGCAGCTT TGTTAAACTA CGAACGTAAT adenine 129 cytosine 97 guanine 119 thymine 155 total 500
R
#Data
gene1 <- "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
#Analysis:
gene2 <- gsub("\n", "", gene1) #remove \n chars
gene3 <- strsplit(gene2, split = character(0)) #split into list
gene4 <- gene3[[1]] #pull out character vector from list
basecounts <- as.data.frame(table(gene4)) #make table of base counts
#quick helper function to print table results
print_row <- function(df, row){paste0(df$gene[row],": ", df$Freq[row])}
#Print Function for Data with Results:
cat(" Data: \n",
" 1:",substring(gene2, 1, 50),"\n",
" 51:",substring(gene2, 51, 100),"\n",
"101:",substring(gene2, 101, 150),"\n",
"151:",substring(gene2, 151, 200),"\n",
"201:",substring(gene2, 201, 250),"\n",
"251:",substring(gene2, 251, 300),"\n",
"301:",substring(gene2, 301, 350),"\n",
"351:",substring(gene2, 351, 400),"\n",
"401:",substring(gene2, 401, 450),"\n",
"451:",substring(gene2, 451, 500),"\n",
"\n",
"Base Count Results: \n",
print_row(basecounts,1), "\n",
print_row(basecounts,2), "\n",
print_row(basecounts,3), "\n",
print_row(basecounts,4), "\n",
"\n",
"Total Base Count:", paste(length(gene4))
)
- Output:
Data: 1: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 51: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 101: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 151: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 201: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 251: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 301: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 351: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 401: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 451: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT Base Count Results: A: 129 C: 97 G: 119 T: 155 Total Base Count: 500
Racket
#lang racket
(define (fold-sequence seq kons #:finalise (finalise (λ x (apply values x))) . k0s)
(define (recur seq . ks)
(if (null? seq)
(call-with-values (λ () (apply finalise ks)) (λ vs (apply values vs)))
(call-with-values (λ () (apply kons (car seq) ks)) (λ ks+ (apply recur (cdr seq) ks+)))))
(apply recur (if (string? seq) (string->list (regexp-replace* #px"[^ACGT]" seq "")) seq) k0s))
(define (sequence->pretty-printed-string seq)
(define (fmt idx cs-rev) (format "~a: ~a" (~a idx #:width 3 #:align 'right) (list->string (reverse cs-rev))))
(fold-sequence
seq
(λ (b n start-idx lns-rev cs-rev)
(if (zero? (modulo n 50))
(values (+ n 1) n (if (pair? cs-rev) (cons (fmt start-idx cs-rev) lns-rev) lns-rev) (cons b null))
(values (+ n 1) start-idx lns-rev (cons b cs-rev))))
0 0 null null
#:finalise (λ (n idx lns-rev cs-rev)
(string-join (reverse (if (null? cs-rev) lns-rev (cons (fmt idx cs-rev) lns-rev))) "\n"))))
(define (count-bases b as cs gs ts n)
(values (+ as (if (eq? b #\A) 1 0))
(+ cs (if (eq? b #\C) 1 0))
(+ gs (if (eq? b #\G) 1 0))
(+ ts (if (eq? b #\T) 1 0))
(add1 n)))
(define (bioinformatics-Base_count s)
(define-values (as cs gs ts n) (fold-sequence s count-bases 0 0 0 0 0))
(printf "SEQUENCE:~%~%~a~%~%" (sequence->pretty-printed-string s))
(printf "BASE COUNT:~%-----------~%~%~a~%~%"
(string-join (map (λ (c n) (format " ~a :~a" c (~a #:width 4 #:align 'right n)))
'(A T C G)
(list as ts cs gs)) "\n"))
(newline)
(printf "TOTAL: ~a~%" n))
(module+
main
(define the-string
#<<EOS
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
EOS
)
(bioinformatics-Base_count the-string))
- Output:
SEQUENCE: 0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT BASE COUNT: ----------- A : 129 T : 155 C : 97 G : 119 TOTAL: 500
Raku
(formerly Perl 6)
It's the Letter frequency task all over again, just simpler and dressed up in different clothes.
The specs for what "pretty print" means are sadly lacking. Ah well, just makes it easily defensible if I do anything at all.
my $dna = join '', lines q:to/END/;
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
END
put pretty($dna, 80);
put "\nTotal bases: ", +my $bases = $dna.comb.Bag;
put $bases.sort(~*.key).join: "\n";
sub pretty ($string, $wrap = 50) {
$string.comb($wrap).map( { sprintf "%8d: %s", $++ * $wrap, $_ } ).join: "\n"
}
- Output:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCG 80: AGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTC 160: GTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGT 240: TCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCG 320: ATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTT 480: TGTTAAACTACGAACGTAAT Total bases: 500 A 129 C 97 G 119 T 155
REXX
A little extra boilerplate was added to verify correct coding of the bases in a DNA string and the alignment of the (totals) numbers.
/*REXX program finds the number of each base in a DNA string */
/* (along with a total). */
Parse Arg dna .
If dna==''|dna==',' Then
dna='cgtaaaaaattacaacgtcctttggctatctcttaaactcctgctaaatg',
'ctcgtgctttccaattatgtaagcgttccgagacggggtggtcgattctg',
'aggacaaaggtcaagatggagcgcatcgaacgcaataaggatcatttgat',
'gggacgtttcgtcgacaaagtcttgtttcgagagtaacggctaccgtctt',
'cgattctgcttataacactatgttcttatgaaatggatgttctgagttgg',
'tcagtcccaatgtgcggggtttcttttagtacgtcgggagtggtattata',
'tttaatttttctatatagcgatctgtatttaagcaattcatttaggttat',
'cgccgcgatgctcggttcggaccgccaagcatctggctccactgctagtg',
'tcctaaatttgaatggcaaacacaaataagatttagcaattcgtgtagac',
'gaccggggacttgcatgatgggagcagctttgttaaactacgaacgtaat'
dna=translate(space(dna,0)) /* elide blanks from DNA; uppercase */
Say '--------length of the DNA string: ' length(dna)
count.=0 /* initialize the count for all bases*/
w=1 /* the maximum width of a base count */
names='' /* list of all names */
Do j=1 To length(dna) /* traipse through the DNA string */
name=substr(dna,j,1) /* obtain a base name from the DNA */
If pos(name,names)==0 Then
names=names||name /* if not found, add it to the list */
count.name=count.name+1 /* bump the count of this base. */
w=max(w,length(count.name)) /* compute the maximum number width */
End
Say
Do k=0 To 255
z=d2c(k) /* traipse through all possibilities */
If pos(z,names)>0 Then Do
Say ' base ' z ' has a basecount of: ' right(count.z,w)
count.tot=count.tot+count.z /* add to a grand total to verify */
End
End
Say
Say '--------total for all basecounts:' right(count.tot,w+1)
- output when using the default input:
────────length of the DNA string: 500 base A has a basecount of: 129 base C has a basecount of: 97 base G has a basecount of: 119 base T has a basecount of: 155 ────────total for all basecounts: 500
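The verification idea described for the REXX entry (normalise the string, count every character that actually occurs, then compare the summed counts with the string length) might be sketched in Python as follows. This is a rough sketch under stated assumptions: the function name `checked_base_counts`, its `alphabet` parameter, and the explicit list of unexpected characters are additions made here for illustration and are not part of the REXX program.

```python
from collections import Counter


def checked_base_counts(dna, alphabet="ACGT"):
    """Hypothetical helper: return the cleaned string, its character
    counts, and any characters outside the expected alphabet."""
    # Normalise as the REXX version does: drop blanks, fold to upper case.
    cleaned = "".join(dna.split()).upper()
    counts = Counter(cleaned)
    # Anything outside the alphabet suggests a mis-coded DNA string.
    unexpected = sorted(set(counts) - set(alphabet))
    return cleaned, counts, unexpected


if __name__ == "__main__":
    cleaned, counts, unexpected = checked_base_counts("cgta aaaa ttac\nxacg")
    # Right-align the counts to the widest number, as the REXX entry does.
    width = max(len(str(n)) for n in counts.values())
    for base in sorted(counts):
        print(f"base {base} has a basecount of: {counts[base]:>{width}}")
    # Cross-check: the per-character counts must add up to the length.
    print("total:", sum(counts.values()), "length:", len(cleaned))
    print("unexpected characters:", unexpected)
```

Counting every character, rather than only A, C, G and T, is what makes the total a meaningful check: a stray character shows up both in the listing and in the `unexpected` list instead of being silently dropped.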
Ring
dna = "" +
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" +
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" +
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" +
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" +
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" +
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" +
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" +
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" +
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" +
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
dnaBase = [:A=0, :C=0, :G=0, :T=0]
lenDna = len(dna)
for n = 1 to lenDna
dnaStr = substr(dna,n,1)
switch dnaStr
on "A"
strA = dnaBase["A"]
strA++
dnaBase["A"] = strA
on "C"
strC = dnaBase["C"]
strC++
dnaBase["C"] = strC
on "G"
strG = dnaBase["G"]
strG++
dnaBase["G"] = strG
on "T"
strT = dnaBase["T"]
strT++
dnaBase["T"] = strT
off
next
? "A : " + dnaBase["A"]
? "T : " + dnaBase["T"]
? "C : " + dnaBase["C"]
? "G : " + dnaBase["G"]
- Output:
A : 129 T : 155 C : 97 G : 119
RPL
Bases are grouped by codons, then output by groups of 4 codons to match the 22-character screen
≪ "ACGT" → sequence nucleotides
  ≪ { 4 } 0 CON
     1 sequence SIZE FOR j
        sequence j DUP SUB
        IF nucleotides SWAP POS THEN
           LAST GET LAST ROT 1 + PUT END
     NEXT
≫ ≫ 'BASECOUNT' STO
≪ "" 1 3 PICK SIZE FOR j
     OVER j DUP 2 + SUB + " " + 3 STEP
  SWAP DROP → codons
  ≪ 1 codons SIZE FOR j
        codons j DUP 15 + SUB 16 STEP
     codons BASECOUNT
     DUP 1 CON DOT    @ calculate the sum of the vector returned by BASECOUNT
≫ ≫ 'SHOWSEQ' STO
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATATTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" SHOWSEQ
- Output:
44: "CGT AAA AAA TTA " 43: "CAA CGT CCT TTG " ... 4: "TGT TAA ACT ACG " 3: "AAC GTA AT " 2: [ 129 97 119 155 ] 1: 500
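The codon grouping used by the RPL entry (split the sequence into triplets, then show four triplets per row) can be illustrated in Python. This is a minimal sketch, not a translation of the RPL program: the name `codon_rows` and the `codons_per_row` parameter are assumptions made here, and unlike the RPL output the rows carry no trailing space.

```python
def codon_rows(sequence, codons_per_row=4):
    """Hypothetical helper: group bases into codons (triplets) and
    join a fixed number of codons per output row."""
    # The final codon is shorter when the length is not a multiple of 3.
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return [
        " ".join(codons[i:i + codons_per_row])
        for i in range(0, len(codons), codons_per_row)
    ]


if __name__ == "__main__":
    # First 18 bases of the task's sequence.
    for row in codon_rows("CGTAAAAAATTACAACGT"):
        print(row)
```

With four codons per row each line is at most 15 characters wide, which is how the RPL version fits its 22-character screen.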
Ruby
dna = <<DNA_STR
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
DNA_STR
chunk_size = 60
dna = dna.delete("\n")
size = dna.size
0.step(size, chunk_size) do |pos|
puts "#{pos.to_s.ljust(6)} #{dna[pos, chunk_size]}"
end
puts dna.chars.tally.sort.map{|ar| ar.join(" : ") }
puts "Total : #{dna.size}"
- Output:
0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATGCTCGTGCTTT 60 CCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTGAGGACAAAGGTCAAGATGGA 120 GCGCATCGAACGCAATAAGGATCATTTGATGGGACGTTTCGTCGACAAAGTCTTGTTTCG 180 AGAGTAACGGCTACCGTCTTCGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGT 240 TCTGAGTTGGTCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTATCGCCGCGATG 360 CTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTGTCCTAAATTTGAATGGCAAA 420 CACAAATAAGATTTAGCAATTCGTGTAGACGACCGGGGACTTGCATGATGGGAGCAGCTT 480 TGTTAAACTACGAACGTAAT A : 129 C : 97 G : 119 T : 155 Total : 500
Rust
use std::collections::HashMap;
fn main() {
let dna = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG\
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG\
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT\
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT\
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG\
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA\
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT\
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG\
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC\
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT";
let mut base_count = HashMap::new();
let mut total_count = 0;
print!("Sequence:");
for base in dna.chars() {
if total_count % 50 == 0 {
print!("\n{:3}: ", total_count);
}
print!("{}", base);
total_count += 1;
let count = base_count.entry(base).or_insert(0); // Return current count for base or insert 0
*count += 1;
}
println!("\n");
println!("Base count:");
println!("-----------");
let mut base_count: Vec<_> = base_count.iter().collect(); // HashMaps can't be sorted, so collect into Vec
base_count.sort_by_key(|bc| bc.0); // Sort bases alphabetically
for (base, count) in base_count.iter() {
println!(" {}: {:3}", base, count);
}
println!();
println!("Total: {}", total_count);
}
- Output:
Sequence: 0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT Base count: ----------- A: 129 C: 97 G: 119 T: 155 Total: 500
Swift
import Foundation
let dna = """
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT
"""
print("input:\n\(dna)\n")
let counts =
dna.replacingOccurrences(of: "\n", with: "").reduce(into: [:], { $0[$1, default: 0] += 1 })
print("Counts: \(counts)")
print("Total: \(counts.values.reduce(0, +))")
- Output:
input: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT ["C": 97, "T": 155, "G": 119, "A": 129] Total: 500
Tcl
namespace path ::tcl::mathop
proc process {data {width 50}} {
set len [string length $data]
set addrwidth [string length [* [/ $len $width] $width]]
for {set i 0} {$i < $len} {incr i $width} {
puts "[format %${addrwidth}u $i] [string range $data $i $i+[- $width 1]]"
}
puts "\nBase count:"
foreach base {A C G T} {
puts "$base [regexp -all $base $data]"
}
puts "Total $len"
}
set test [string cat \
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG \
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG \
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT \
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT \
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG \
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA \
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT \
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG \
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC \
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT]
process $test 50
- Output:
0 CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50 CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100 AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150 GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200 CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250 TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300 TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350 CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400 TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450 GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT Base count: A 129 C 97 G 119 T 155 Total 500
uBasic/4tH
a := "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG"
a = Join(a, "CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG")
a = Join(a, "AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT")
a = Join(a, "GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT")
a = Join(a, "CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG")
a = Join(a, "TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA")
a = Join(a, "TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT")
a = Join(a, "CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG")
a = Join(a, "TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC")
a = Join(a, "GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT")
For x = 0 To Len(a)-1
If (x % 50) = 0 Then Print : Print Using "__#: "; x;
Print Chr(Set(b, Peek (a, x)));
@(b - Ord("A")) = @(b - Ord("A")) + 1
Next
Print : Print : Push Ord("T"), Ord("G"), Ord("C"), Ord("A")
For x = 1 To Used()
Print Chr(Set(b, Pop())); Using ": __#"; @(b - Ord("A"))
Next
- Output:
0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT A: 129 C: 97 G: 119 T: 155 0 OK, 0:957
VBScript
b=_
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" &_
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" &_
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" &_
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" &_
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" &_
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" &_
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" &_
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" &_
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" &_
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
s="SEQUENCE:"
acnt=0:ccnt=0:gcnt=0:tcnt=0
for i=0 to len(b)-1
if (i mod 30)=0 then s = s & vbcrlf & right(" "& i+1,3)&": "
if (i mod 5)=0 then s=s& " "
m=mid(b,i+1,1)
s=s & m
select case m
case "A":acnt=acnt+1
case "C":ccnt=ccnt+1
case "G":gcnt=gcnt+1
case "T":tcnt=tcnt+1
case else
wscript.echo "error at ",i+1, m
end select
next
wscript.echo s & vbcrlf
wscript.echo "Count: A="&acnt & " C=" & ccnt & " G=" & gcnt & " T=" & tcnt
- Output:
SEQUENCE: 1: CGTAA AAAAT TACAA CGTCC TTTGG CTATC 31: TCTTA AACTC CTGCT AAATG CTCGT GCTTT 61: CCAAT TATGT AAGCG TTCCG AGACG GGGTG 91: GTCGA TTCTG AGGAC AAAGG TCAAG ATGGA 121: GCGCA TCGAA CGCAA TAAGG ATCAT TTGAT 151: GGGAC GTTTC GTCGA CAAAG TCTTG TTTCG 181: AGAGT AACGG CTACC GTCTT CGATT CTGCT 211: TATAA CACTA TGTTC TTATG AAATG GATGT 241: TCTGA GTTGG TCAGT CCCAA TGTGC GGGGT 271: TTCTT TTAGT ACGTC GGGAG TGGTA TTATA 301: TTTAA TTTTT CTATA TAGCG ATCTG TATTT 331: AAGCA ATTCA TTTAG GTTAT CGCCG CGATG 361: CTCGG TTCGG ACCGC CAAGC ATCTG GCTCC 391: ACTGC TAGTG TCCTA AATTT GAATG GCAAA 421: CACAA ATAAG ATTTA GCAAT TCGTG TAGAC 451: GACCG GGGAC TTGCA TGATG GGAGC AGCTT 481: TGTTA AACTA CGAAC GTAAT Count: A=129 C=97 G=119 T=155
V (Vlang)
fn main() {
dna := "" +
"CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" +
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" +
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" +
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" +
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" +
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" +
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" +
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" +
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" +
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
println("SEQUENCE:")
le := dna.len
for i := 0; i < le; i += 50 {
mut k := i + 50
if k > le {
k = le
}
println("${i:5}: ${dna[i..k]}")
}
mut base_map := map[byte]int{} // allows for 'any' base
for i in 0..le {
base_map[dna[i]]++
}
mut bases := base_map.keys()
bases.sort()
println("\nBASE COUNT:")
for base in bases {
println(" $base: ${base_map[base]:3}")
}
println(" ------")
println(" Σ: $le")
println(" ======")
}
- Output:
SEQUENCE: 0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG 50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG 100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT 150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT 200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG 250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA 300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT 350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG 400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC 450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT BASE COUNT: A: 129 C: 97 G: 119 T: 155 ------ Σ: 500 ======
Wren
import "./fmt" for Fmt
import "./sort" for Sort
import "./iterate" for Stepped
var dna = "CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG" +
"CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG" +
"AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT" +
"GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT" +
"CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG" +
"TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA" +
"TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT" +
"CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG" +
"TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC" +
"GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT"
System.print("SEQUENCE:")
var le = dna.count
for (i in Stepped.new(0...le, 50)) {
    var k = i + 50
    if (k > le) k = le
    System.print("%(Fmt.d(5, i)): %(dna[i...k])")
}
var baseMap = {} // allows for 'any' base
for (i in 0...le) {
    var d = dna[i]
    var v = baseMap[d]
    baseMap[d] = !v ? 1 : v + 1
}
var bases = baseMap.keys.toList
Sort.quick(bases)
System.print("\nBASE COUNT:")
for (base in bases) {
    System.print(" %(base): %(Fmt.d(3, baseMap[base]))")
}
System.print(" ------")
System.print(" Σ: %(le)")
System.print(" ======")
- Output:
SEQUENCE:
    0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
   50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
  100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
  150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
  200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
  250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
  300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
  350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
  400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
  450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

BASE COUNT:
 A: 129
 C:  97
 G: 119
 T: 155
 ------
 Σ: 500
 ======
XPL0
char Bases;
int Counts(256), Cnt, I, Ch;
[Bases:= "
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAATx ";
for I:= 0 to 255 do Counts(I):= 0;
Format(5, 0);
Cnt:= 0;
I:= 0;
loop    [repeat Ch:= Bases(I);
                I:= I+1;
                if Ch = ^x then quit;
                Counts(Ch):= Counts(Ch)+1;
                ChOut(0, Ch);
        until   Ch = \LF\$0A;
        RlOut(0, float(Cnt));  Text(0, ": ");
        Cnt:= Cnt + 50;
        ];
CrLf(0);  CrLf(0);
Text(0, "Base counts A: ");  IntOut(0, Counts(^A));
Text(0, " C: ");  IntOut(0, Counts(^C));
Text(0, " G: ");  IntOut(0, Counts(^G));
Text(0, " T: ");  IntOut(0, Counts(^T));
Text(0, "
Total: "); IntOut(0, Cnt); CrLf(0);
]
- Output:
    0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
   50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
  100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
  150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
  200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
  250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
  300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
  350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
  400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
  450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

Base counts A: 129 C: 97 G: 119 T: 155
Total: 500
zkl
bases:=
#<<<"
CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT" - " \n";
#<<<
[0..*,50].zipWith(fcn(n,bases){ println("%6d: %s".fmt(n,bases.concat())) },
bases.walker().walk.fp(50)).pump(Void); // .pump forces the iterator
println("\nBase Counts: ", bases.counts().pump(String,Void.Read,"%s: %d ".fmt));
println("Total: ",bases.len());
- Output:
     0: CGTAAAAAATTACAACGTCCTTTGGCTATCTCTTAAACTCCTGCTAAATG
    50: CTCGTGCTTTCCAATTATGTAAGCGTTCCGAGACGGGGTGGTCGATTCTG
   100: AGGACAAAGGTCAAGATGGAGCGCATCGAACGCAATAAGGATCATTTGAT
   150: GGGACGTTTCGTCGACAAAGTCTTGTTTCGAGAGTAACGGCTACCGTCTT
   200: CGATTCTGCTTATAACACTATGTTCTTATGAAATGGATGTTCTGAGTTGG
   250: TCAGTCCCAATGTGCGGGGTTTCTTTTAGTACGTCGGGAGTGGTATTATA
   300: TTTAATTTTTCTATATAGCGATCTGTATTTAAGCAATTCATTTAGGTTAT
   350: CGCCGCGATGCTCGGTTCGGACCGCCAAGCATCTGGCTCCACTGCTAGTG
   400: TCCTAAATTTGAATGGCAAACACAAATAAGATTTAGCAATTCGTGTAGAC
   450: GACCGGGGACTTGCATGATGGGAGCAGCTTTGTTAAACTACGAACGTAAT

Base Counts: A: 129 C: 97 G: 119 T: 155
Total: 500