Remove duplicate elements: Difference between revisions

Revision as of 00:07, 5 April 2009

Given an array, or any other data structure containing a sequence of elements, derive a sequence of elements in which all duplicates are removed.

There are basically three approaches seen here:

Put the elements into a hash table which does not allow duplicates. The complexity is O(n) on average, and O(n^2) in the worst case. This approach requires a hash function for your type (one that is compatible with equality), either built into your language or provided by the user.
Sort the elements and remove consecutive duplicate elements. The complexity of the best sorting algorithms is O(n log n). This approach requires that your type be "comparable", i.e. have an ordering. Putting the elements into a self-balancing binary search tree is a special case of sorting.
Go through the list, and for each element, check the rest of the list to see if it appears again, and discard it if it does. The complexity is O(n^2). The upside is that this works on any type (provided that you can test for equality).
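The three approaches above can be sketched in Python roughly as follows. This is only an illustration: the function names are made up for this example, and the hash-based version assumes the elements are hashable while the sorting version assumes they can be ordered.

```python
def unique_hashed(items):
    """Approach 1: remember seen elements in a hash-based set (needs hashable items)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def unique_sorted(items):
    """Approach 2: sort, then drop consecutive duplicates (needs orderable items)."""
    result = []
    for item in sorted(items):
        if not result or result[-1] != item:
            result.append(item)
    return result


def unique_pairwise(items):
    """Approach 3: discard an element if it appears again later (needs only equality)."""
    result = []
    for i, item in enumerate(items):
        if item not in items[i + 1:]:   # linear scan of the rest, using ==
            result.append(item)
    return result


data = [1, 2, 3, 2, 3, 4]
print(unique_hashed(data))
print(unique_sorted(data))
# Lists are not hashable in Python, so only the pairwise approach applies here.
print(unique_pairwise([[1, 2], [1, 2], [3]]))
```

Note that the pairwise version keeps the last occurrence of each element, while the hash-based version keeps the first.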

C++

This version uses unordered_set, which is part of TR1 and is likely to be included in the next version of C++. It is not yet part of the C++ standard library. It requires that its element type have a hash function.

Works with: GCC

<lang cpp>#include <tr1/unordered_set>
#include <iostream>

using namespace std;

int main() {
    typedef tr1::unordered_set<int> TyHash;
    int data[] = {1, 2, 3, 2, 3, 4};

    // Constructing the set from the array discards the duplicates.
    TyHash unique_set(data, data + 6);

    cout << "Set items:" << endl;
    for (TyHash::iterator iter = unique_set.begin(); iter != unique_set.end(); ++iter)
        cout << *iter << " ";
    cout << endl;
}</lang>

Alternative method working directly on the array:

<lang cpp>#include <iostream>
#include <iterator>
#include <algorithm>

// helper template: returns a pointer one past the last element of an array
template<typename T, int size> T* end(T (&array)[size]) { return array + size; }

int main() {
    int data[] = { 1, 2, 3, 2, 3, 4 };
    std::sort(data, end(data));                   // bring equal elements together
    int* new_end = std::unique(data, end(data));  // drop consecutive duplicates
    std::copy(data, new_end, std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
}</lang>

C#

C# 2.0

Works with: MSVS version 2005 and .Net Framework 2.0

<lang csharp>List<int> nums = new List<int>( new int[] { 1, 1, 2, 3, 4, 4 } );
List<int> unique = new List<int>();
foreach( int i in nums )
  if( !unique.Contains( i ) )
    unique.Add( i );
</lang>

C# 3.0

Works with: MSVS version 2008 and .Net Framework 3.5

<lang csharp>// Distinct() is an extension method from System.Linq
var unique = (new int[] { 1, 1, 2, 3, 4, 4 }).Distinct();</lang>

Common Lisp

To remove duplicates non-destructively:

<lang lisp>(remove-duplicates '(1 3 2 9 1 2 3 8 8 1 0 2))
> (9 3 8 1 0 2)</lang>

Or, to remove duplicates in-place:

<lang lisp>(delete-duplicates (list 1 3 2 9 1 2 3 8 8 1 0 2))
> (9 3 8 1 0 2)</lang>

D

<lang d>void main() {
    int[] data = [1, 2, 3, 2, 3, 4];
    int[int] hash;
    // the keys of the associative array are the unique elements
    foreach(el; data)
        hash[el] = 0;
}</lang>

E

[1,2,3,2,3,4].asSet().getElements()

Erlang

List = [1, 2, 3, 2, 2, 4, 5, 5, 4, 6, 6, 5].
Set = sets:from_list(List).

Factor

 USING: sets ;
 V{ 1 2 1 3 2 4 5 } prune .

 V{ 1 2 3 4 5 }

Forth

Forth has no built-in hashtable facility, so the easiest way to achieve this goal is to take the "uniq" program as an example.

The word uniq, if given a sorted array of cells, will remove the duplicate entries and return the new length of the array. For simplicity, uniq has been written to process cells (which are to Forth what "int" is to C), but could easily be modified to handle a variety of data types through deferred procedures, etc.

The input data is assumed to be sorted.

\ Increments a2 until it no longer points to the same value as a1
\ a3 is the address beyond the data a2 is traversing.
: skip-dups ( a1 a2 a3 -- a1 a2+n )
    dup rot ?do
      over @ i @ <> if drop i leave then
    cell +loop ;

\ Compress an array of cells by removing adjacent duplicates
\ Returns the new count
: uniq ( a n -- n2 )
   over >r             \ Original addr to return stack
   cells over + >r     \ "to" addr now on return stack, available as r@
   dup begin           ( write read )
      dup r@ <
   while
      2dup @ swap !    \ copy one cell
      cell+ r@ skip-dups
      cell 0 d+        \ increment write ptr only
   repeat  r> 2drop  r> - cell / ;

Here is another implementation of "uniq" that uses a popular extension providing words for parameters and local variables. It is structurally the same as the above implementation, but uses less overt stack manipulation.

: uniqv { a n \ r e -- n }
    a n cells + to e
    a dup to r
    \ the write address lives on the stack
    begin
      r e <
    while
      r @ over !
      r cell+ e skip-dups to r
      cell+
    repeat
    a - cell / ;

To test this code, you can execute:

create test 1 , 2 , 3 , 2 , 6 , 4 , 5 , 3 , 6 ,
here test - cell / constant ntest
: .test ( n -- ) 0 ?do test i cells + ? loop ; 

test ntest 2dup cell-sort uniq .test

output

1 2 3 4 5 6 ok
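For readers unfamiliar with Forth, the in-place compaction that uniq performs can be sketched in Python as below. This is an illustrative translation under the same assumption that the input is already sorted; it keeps a write position and a read position, but it is not a line-by-line port of the Forth words above.

```python
def uniq(cells):
    """Compact a sorted list in place; return the count of unique elements."""
    if not cells:
        return 0
    write = 0                      # index of the last unique element written
    for read in range(1, len(cells)):
        if cells[read] != cells[write]:
            write += 1
            cells[write] = cells[read]
    return write + 1


test = [1, 2, 3, 2, 6, 4, 5, 3, 6]
test.sort()                        # uniq only removes adjacent duplicates
n = uniq(test)
print(test[:n])                    # prints [1, 2, 3, 4, 5, 6]
```

As in the Forth version, elements beyond the returned count are left over from the original data and should be ignored.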

Haskell

<lang haskell>import qualified Data.List as List

values = [1,2,3,2,3,4]
unique = List.nub values</lang>

IDL

 ; uniq returns the indices of the unique elements, so index the array with them
 result = array[uniq( array, sort( array ) )]

J

   ] a=: 4 5 ?@$ 13  NB. 4 by 5 matrix of numbers chosen from 0 to 12
4 3 2 8 0
1 9 5 1 7
6 3 9 9 4
2 1 5 3 2

   , a     NB. sequence of the same elements
4 3 2 8 0 1 9 5 1 7 6 3 9 9 4 2 1 5 3 2
   ~. , a  NB. unique elements
4 3 2 8 0 1 9 5 7 6

The verb ~. removes duplicate items from any array (numeric, character, or other; vector, matrix, rank-n array). For example:

   ~. 'chthonic eleemosynary paronomasiac'
chtoni elmsyarp

Java

Works with: Java version 1.5

<lang java5>import java.util.Set;
import java.util.HashSet;
import java.util.Arrays;

Object[] data = {1, 2, 3, "a", "b", "c", 2, 3, 4, "b", "c", "d"};
Set<Object> uniqueSet = new HashSet<Object>(Arrays.asList(data));
Object[] unique = uniqueSet.toArray();</lang>

Logo

Works with: UCB Logo

show remdup [1 2 3 a b c 2 3 4 b c d]   ; [1 a 2 3 4 b c d]

MAXScript

uniques = #(1, 2, 3, "a", "b", "c", 2, 3, 4, "b", "c", "d")
for i in uniques.count to 1 by -1 do
(
    id = findItem uniques uniques[i]
    if (id != i) do deleteItem uniques i
)

Nial

uniques := [1, 2, 3, 'a', 'b', 'c', 2, 3, 4, 'b', 'c', 'd']
cull uniques
=+-+-+-+-+-+-+-+-+
=|1|2|3|a|b|c|4|d|
=+-+-+-+-+-+-+-+-+

Using strand form

cull 1 1 2 2 3 3
=1 2 3

Objective-C

<lang objc> NSArray *items = [NSArray arrayWithObjects:@"A", @"B", @"C", @"B", @"A", nil];

NSSet *unique = [NSSet setWithArray:items]; </lang>

OCaml

<lang ocaml>let uniq lst =
  let unique_set = Hashtbl.create (List.length lst) in
    List.iter (fun x -> Hashtbl.replace unique_set x ()) lst;
    Hashtbl.fold (fun x () xs -> x :: xs) unique_set []

let _ =
  uniq [1;2;3;2;3;4]
</lang>

Perl

Library: List::MoreUtils

<lang perl>use List::MoreUtils qw(uniq);

my @uniq = uniq qw(1 2 3 a b c 2 3 4 b c d);</lang>

Without modules:
<lang perl>my %seen;
my @uniq = grep {!$seen{$_}++} qw(1 2 3 a b c 2 3 4 b c d);</lang>

PHP

<lang php>$list = array(1, 2, 3, 'a', 'b', 'c', 2, 3, 4, 'b', 'c', 'd');
$unique_list = array_unique($list);</lang>

Pop11

;;; Initial array
lvars ar = {1 2 3 2 3 4};
;;; Create a hash table
lvars ht= newmapping([], 50, 0, true);
;;; Put all array elements as keys into the hash table
lvars i;
for i from 1 to length(ar) do
   1 -> ht(ar(i))
endfor;

;;; Collect keys into a list
lvars ls = [];
appdata(ht, procedure(x); cons(front(x), ls) -> ls; endprocedure);

Prolog

<lang prolog> uniq(Data,Uniques) :- sort(Data,Uniques). </lang>

Example usage:
<lang prolog>?- uniq([1, 2, 3, 2, 3, 4],Xs).
Xs = [1, 2, 3, 4]</lang>

Python

<lang python>items = [1, 2, 3, 'a', 'b', 'c', 2, 3, 4, 'b', 'c', 'd']
unique = list(set(items))</lang>

See also http://www.peterbe.com/plog/uniqifiers-benchmark and http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560
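A set does not promise any particular order for the resulting elements. If the original order of first occurrences matters, one possible variation (not part of the solution above, and assuming Python 3.7 or later, where dictionaries keep insertion order) is to go through dictionary keys instead:

```python
items = [1, 2, 3, 'a', 'b', 'c', 2, 3, 4, 'b', 'c', 'd']

# dict keys are unique and, in Python 3.7+, keep insertion order
unique_in_order = list(dict.fromkeys(items))
print(unique_in_order)   # prints [1, 2, 3, 'a', 'b', 'c', 4, 'd']
```

Like the set-based version, this requires the elements to be hashable.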

Raven

[ 1 2 3 'a' 'b' 'c' 2 3 4 'b' 'c' 'd' ] as items
items copy unique print

list (8 items)
 0 => 1
 1 => 2
 2 => 3
 3 => "a"
 4 => "b"
 5 => "c"
 6 => 4
 7 => "d"

Ruby

<lang ruby>ary = [1,1,2,1,'redundant',[1,2,3],[1,2,3],'redundant']
uniq_ary = ary.uniq
# => [1, 2, "redundant", [1, 2, 3]]
</lang>

Scala

<lang scala>val list = List(1,2,3,4,2,3,4,99)
val l2 = list.removeDuplicates
// l2: scala.List[scala.Int] = List(1,2,3,4,99)</lang>

Scheme

<lang scheme>(define (remove-duplicates l)
  (do ((a '() (if (member (car l) a) a (cons (car l) a)))
       (l l (cdr l)))
    ((null? l) (reverse a))))

(remove-duplicates (list 1 2 1 3 2 4 5))</lang>

Some implementations provide remove-duplicates in their standard library.

Smalltalk

<lang smalltalk>|a|
"Example of creating a collection"
a := #( 1 1 2 'hello' 'world' #symbol #another 2 'hello' #symbol ).
a asSet printNl.</lang>

Output:

Set (1 2 #symbol 'world' #another 'hello' )

Tcl

The concept of an "array" in Tcl is strictly associative, and since there cannot be duplicate keys, there cannot be a redundant element in an array. What is called an "array" in many other languages is probably better represented by the "list" in Tcl (as in Lisp). Note that lsort also sorts the list while removing duplicates.

<lang tcl> set result [lsort -unique $listname] </lang>

UnixPipes

Assuming a sequence is represented by lines in a file.
<lang bash>bash$ # original list
bash$ printf '6\n2\n3\n6\n4\n2\n'
6
2
3
6
4
2
bash$ # made uniq
bash$ printf '6\n2\n3\n6\n4\n2\n'|sort -n|uniq
2
3
4
6
bash$</lang>

or

<lang bash>bash$ # original list
bash$ printf '6\n2\n3\n6\n4\n2\n'
6
2
3
6
4
2
bash$ # made uniq
bash$ printf '6\n2\n3\n6\n4\n2\n'|sort -u
2
3
4
6
bash$</lang>
