Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The hash table implementation can be much faster. Dyalog APL's Unique (∪) computes exactly the same function using a dedicated hash table.

        ∪ 'AABBADCAB'
  ABDC
        ≢∪ a←{⍵[?⍨≢⍵]} {⍵,⍵[(?⍴)⍨≢⍵]} 5e5?2e9  ⍝ 1e6 ints with 50% unique
  500000
        cmpx '∪a'  ⍝ Time taken by ∪a in seconds
  1.8E¯2
So 18ns per element on a million elements (at 3.5GHz). Characteristics of the hash table we use include:

- Hashing with the Murmur3 mixing function (at the top of [1]), not the full hash function

- Open addressing with ints stored directly in the hash table

- Sentinel value for empty slots chosen to be outside the argument's range

- [edit] Preallocated fixed-sized table. It should be resizable when the argument is large enough; I will fix that some day.

We know the range of the argument since we check it in order to possibly apply small-range code, which can use an ordinary lookup table rather than a hash table. In 18.0, which features a largely rewritten Unique implementation (not much faster here), I applied some careful logic in order to use the first entry of the argument as the sentinel rather than trying to find a value not in the hash table.

[1] http://zimbry.blogspot.com/2011/09/better-bit-mixing-improvi...



But Dyalog's sorting actually beats its own Unique!

        cmpx '{(1,2≠/⍵)/⍵} {⍵[⍋⍵]} a'
  1.3E¯2
        ({(1,2≠/⍵)/⍵} {⍵[⍋⍵]} a) ≡ ({⍵[⍋⍵]} ∪a)
  1
The function {⍵[⍋⍵]} (index the argument by its own grade) sorts a vector ascending. {(1,2≠/⍵)/⍵} gets unique elements from a sorted numeric vector by taking all those elements which are first or unequal to their predecessor: 2≠/⍵ tests for inequality on all pairs of adjacent elements, and / uses a boolean vector on the left to filter elements on the right.

The second line tests that this unique-sort code gives the same result as sorting the result of Unique.

We use a radix sort for vectors of 4-byte or larger numbers. Unfortunately we can't use sorting to implement Unique in this way because Unique has to preserve the ordering in the original argument. However, it could be used to implement the sort-Unique or Unique-sort combination.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: