Tokenize a string

From Rosetta Code
Revision as of 04:34, 2 December 2007 by rosettacode>Mwn3d (→‎{{header|Java}}: Grammar counts)

Separate the string "Hello,How,Are,You,Today" by commas into an array (or list) so that each element of it stores a different word. Display the words to the 'user', in the simplest manner possible, separated by a period. To simplify, you may display a trailing period.

Ada

with Ada.Strings.Fixed; use Ada.Strings.Fixed;
with Ada.Text_Io; use Ada.Text_Io;

procedure Parse_Commas is
   Source_String : String := "Hello,How,Are,You,Today";
   Index_List : array(1..256) of Natural;
   Next_Index : Natural := 1;
begin
   Index_List(Next_Index) := 1;
   while Index_List(Next_Index) < Source_String'Last loop
      Next_Index := Next_Index + 1;
      Index_List(Next_Index) := 1 + Index(Source_String(Index_List(Next_Index - 1)..Source_String'Last), ",");
      if Index_List(Next_Index) = 1 then 
         Index_List(Next_Index) := Source_String'Last + 2;
      end if;
      Put(Source_String(Index_List(Next_Index - 1)..Index_List(Next_Index)-2) & ".");
   end loop;
end Parse_Commas;

C

Standard: ANSI C

Compiler: gcc 3.3.3

Library: POSIX (strdup())

This example uses the strtok() function to separate the tokens. This function is destructive (replacing token separators with '\0'), so we have to make a copy of the string (using strdup()) before tokenizing. strdup() is not part of ANSI C, but is available on most platforms. It can easily be implemented with a combination of strlen(), malloc(), and strcpy().

#include<string.h>
#include<stdio.h>
#include<stdlib.h>

int main(void)
{
	char *a[6];	/* 5 tokens plus the NULL that terminates the list */
	const char *s="Hello,How,Are,You,Today";
	int n=0, nn;

	char *ds=strdup(s);

	a[n]=strtok(ds, ",");
	while(a[n] && n<5) a[++n]=strtok(NULL, ",");

	for(nn=0; nn<n; ++nn) printf("%s.", a[nn]);
	putchar('\n');

	free(ds);

	return 0;
}

C#

string str = "Hello,How,Are,You,Today";
string[] strings = str.Split(',');
foreach (string s in strings)
{
    Console.Write(s + ".");
}

C++

Standard: ANSI C++

Compiler: GCC g++ (GCC) 3.4.4 (cygming special)

Library: STL

This is not the most efficient method, since it makes redundant copies behind the scenes, but it is very easy to use. In most cases it is a good choice, as long as it is not called in the inner loop of a performance-critical system.

The Doxygen tags in the comments before the function describe the details of its interface.

#include <string>
#include <vector>
/// \brief convert input string into vector of string tokens
///
/// \note consecutive delimiters will be treated as single delimiter
/// \note delimiters are _not_ included in return data
///
/// \param str string to be parsed
/// \param delims list of delimiters.

std::vector<std::string> tokenize_str(const std::string & str,
                                      const std::string & delims=", \t")
{
  using namespace std;
  // Skip delims at beginning, find start of first token
  string::size_type lastPos = str.find_first_not_of(delims, 0);
  // Find next delimiter @ end of token
  string::size_type pos     = str.find_first_of(delims, lastPos);

  // output vector
  vector<string> tokens;

  while (string::npos != pos || string::npos != lastPos)
    {
      // Found a token, add it to the vector.
      tokens.push_back(str.substr(lastPos, pos - lastPos));
      // Skip delims.  Note the "not_of". this is beginning of token
      lastPos = str.find_first_not_of(delims, pos);
      // Find next delimiter at end of token.
      pos     = str.find_first_of(delims, lastPos);
    }

  return tokens;
}


Here is sample usage code:

#include <iostream>
int main() {
  using namespace std;
  string s("Hello,How,Are,You,Today");

  vector<string> v(tokenize_str(s));

  for (unsigned i  = 0; i < v.size(); i++) 
    cout << v[i] << ".";
  
  cout << endl;
  return 0;
}

E

".".rjoin("Hello,How,Are,You,Today".split(","))

Forth

There is no standard string split routine, but it is easily written. The results are saved temporarily to the dictionary.

: split ( str len separator len -- tokens count )
  here >r 2swap
  begin
    2dup 2,             \ save this token ( addr len )
    2over search        \ find next separator
  while
    dup negate  here 2 cells -  +!  \ adjust last token length
    2over nip /string               \ start next search past separator
  repeat
  2drop 2drop
  r>  here over -   ( tokens length )
  dup negate allot           \ reclaim dictionary
  2 cells / ;                \ turn byte length into token count

: .tokens ( tokens count -- )
  1 ?do dup 2@ type ." ." cell+ cell+ loop 2@ type ;

s" Hello,How,Are,You,Today" s" ," split .tokens  \ Hello.How.Are.You.Today

Haskell

The necessary operations are unfortunately not in the standard library (yet), but simple to write:

splitBy :: (a -> Bool) -> [a] -> [[a]]
splitBy _ [] = []
splitBy f list = first : splitBy f (dropWhile f rest) where
  (first, rest) = break f list

joinWith :: [a] -> [[a]] -> [a]
joinWith _ []     = []
joinWith _ [x]    = x
joinWith d (x:xs) = x ++ d ++ joinWith d xs

main :: IO ()
main = putStrLn $ joinWith "." $ splitBy (== ',') $ "Hello,How,Are,You,Today"

Groovy

println 'Hello,How,Are,You,Today'.split(',').join('.')


Java

Compiler: JDK 1.4 and up (String.split() was added in JDK 1.4; StringTokenizer is available from JDK 1.0)

There are multiple ways to tokenize a String in Java. The first is to split the String into an array of Strings with split(); the second is to use a StringTokenizer with a delimiter. The second way skips empty tokens: if two commas appear in a row, the array returned by split() contains an empty string, but the StringTokenizer object never returns one.

String toTokenize = "Hello,How,Are,You,Today";

//First way
String word[] = toTokenize.split(",");
for(int i=0; i<word.length; i++) {
    System.out.print(word[i] + ".");
}

//Second way (needs: import java.util.StringTokenizer;)
StringTokenizer tokenizer = new StringTokenizer(toTokenize, ",");
while(tokenizer.hasMoreTokens()) {
    System.out.print(tokenizer.nextToken() + ".");
}
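
The difference in empty-token handling described above can be checked with a small program. The following is a minimal sketch, not part of the original example: the class name EmptyTokenDemo, the helper methods viaSplit and viaTokenizer, and the input "Hello,,How" are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names and input are illustrative only.
class EmptyTokenDemo {
    // Tokens as produced by String.split: an empty string between
    // two adjacent commas is kept as an element.
    static List<String> viaSplit(String s) {
        return Arrays.asList(s.split(","));
    }

    // Tokens as produced by StringTokenizer: adjacent commas never
    // yield an empty token.
    static List<String> viaTokenizer(String s) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer(s, ",");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Hello,,How";  // two commas in a row
        System.out.println(viaSplit(input).size());      // prints 3
        System.out.println(viaTokenizer(input).size());  // prints 2
    }
}
```

For this input, split() yields three elements (the middle one empty) and StringTokenizer yields two. Note that split() still discards trailing empty strings, so a string that ends in a comma does not produce a final empty element.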

JavaScript

Interpreter: Firefox 2.0

alert( "Hello,How,Are,You,Today".split(",").join(".") );

MAXScript

output = ""
for word in (filterString "Hello,How,Are,You,Today" ",") do
(
    output += (word + ".")
)
format "%\n" output

Perl

Interpreter: Perl any 5.X

As a one-liner without a trailing period. This is the most compact form, since you don't have to define an array:

print join('.', split(/,/, "Hello,How,Are,You,Today"));

If you need to keep the array for later use (again without a trailing period):

my @words = split(/,/, "Hello,How,Are,You,Today");
print join('.', @words);

If you really want a trailing period, here is an example:

my @words = split(/,/, "Hello,How,Are,You,Today");
print $_.'.' for (@words);

PHP

Interpreter: PHP any 5.x

<?php
$str = 'Hello,How,Are,You,Today';
$arr = explode(',', $str);
echo implode('.', $arr);
?>

Pop11

The natural solution in Pop11 uses lists:

;;; Declare and initialize variables
lvars str='Hello,How,Are,You,Today';
lvars ls = [], i, j = 1;
;;; Iterate over string
for i from 1 to length(str) do
    ;;; If comma
    if str(i) = `,` then
       ;;; Prepend word (substring) to list
       cons(substring(j, i - j, str), ls) -> ls;
       i + 1 -> j;
    endif;
endfor;
;;; Prepend final word (if needed)
if j <= length(str) then
    cons(substring(j, length(str) - j + 1, str), ls) -> ls;
endif;
;;; Reverse the list
rev(ls) -> ls;

Since the task asks for an array, we convert the list to an array (a vector):

;;; Put list elements and length on the stack
destlist(ls);
;;; Build a vector from them
lvars ar = consvector();
;;; Display in a loop, putting trailing period
for i from 1 to length(ar) do
   printf(ar(i), '%s.');
endfor;
printf('\n');

We could use the list directly for printing:

for i in ls do
    printf(i, '%s.');
endfor;

so the conversion to a vector is purely to satisfy the task formulation.

Python

Interpreter: Python 2.5

text = "Hello,How,Are,You,Today"
tokens = text.split(',')
print '.'.join(tokens)

If you want to print each word on its own line:

for token in tokens:
    print token

or

print "\n".join(tokens)

or the one liner

print '.'.join('Hello,How,Are,You,Today'.split(','))

Raven

'Hello,How,Are,You,Today' ',' split '.' join print

Ruby

words = "Hello,How,Are,You,Today".split(',')
words.each do |w|
  print "#{w}."
end
puts "Hello,How,Are,You,Today".split(',').join('.')

Seed7

var array string: tokens is 0 times "";
tokens := split("Hello,How,Are,You,Today", ",");

Smalltalk

|array |
array := ('Hello,How,Are,You,Today' findTokens: $,) asArray.
Transcript show:
         (array inject: ''
                  into:
                   [:concatenation :string | concatenation , string , '.'])

Standard ML

val splitter = String.tokens (fn c => c = #",");
val main = (String.concatWith ".") o splitter;

Test:

- main "Hello,How,Are,You,Today"
val it = "Hello.How.Are.You.Today" : string

Tcl

Generating a list from a string by splitting on a comma:

 split string ,

Joining the elements of a list by a period:

 join list .

Thus the whole thing would look like this:

 puts [join [split "Hello,How,Are,You,Today" ,] .]

If you'd like to retain the list in a variable with the name "words", it would only be marginally more complex:

 puts [join [set words [split "Hello,How,Are,You,Today" ,]] .]