Implement kMV algorithm, exercise, and tests
noelwelsh committed Oct 22, 2024
1 parent 155b93b commit 794865f
Showing 4 changed files with 233 additions and 5 deletions.
89 changes: 89 additions & 0 deletions code/src/main/scala/kmv/KMinimumValues.scala
@@ -0,0 +1,89 @@
/*
* Copyright 2024 Creative Scala
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package kmv

/** Implement the k-Minimum Values sketch. This skeleton leaves many design
  * decisions up to you.
  */
final class KMinimumValues(k: Int) {
  // The k minimum values, stored in a mutable array
  private val values = Array.ofDim[Double](k)

  // Values will be initialized to contain all zeros, which will be less than
  // most reasonable input. Hence we need to track how many elements in values
  // have been initialized with real data.
  private var used = 0

  /** Add the given element to this KMinimumValues sketch.
    *
    * In implementing this method you can choose to imperatively update
    * internal state, which might give you a more efficient implementation, or
    * a pure implementation that does not mutate state and is easier to reason
    * about.
    */
  def add(element: Double): KMinimumValues = {
    import java.util.Arrays

    // A non-negative index indicates the element is in the array.
    //
    // A negative index indicates the element is not in the array, and encodes
    // the insertion point as -(insertion point) - 1.
    //
    // Only search in the elements of values that have been used
    val idx = Arrays.binarySearch(values, 0, used, element)

    // Element is already in the array
    if idx >= 0 then this
    else {
      if used < values.size then used = used + 1

      val insertionPoint = -idx - 1
      // Element is larger than any existing value
      if insertionPoint >= values.size then this
      else {
        // Shift all the larger values out of the way and insert element
        System.arraycopy(
          values,
          insertionPoint,
          values,
          insertionPoint + 1,
          values.size - insertionPoint - 1
        )
        values(insertionPoint) = element
        this
      }
    }
  }

  /** Get the estimated distinct values from this KMinimumValues sketch.
    *
    * The text describes this estimate as using the average distance between
    * regions. An equivalent estimate can be made using the kth value, which is
    * k times the average distance from 0, and hence estimates k / (n + 1),
    * where n is the number of distinct values. If we call this value length,
    * we can estimate the distinct values from length as
    *
    * distinct values = (k / length) - 1
    *
    * This requires less computation.
    */
  def distinctValues: Long =
    // If we have seen fewer than k values we can return the exact number of
    // distinct values
    if used < values.size then used.toLong
    else Math.round(k.toDouble / values.last - 1.0)

}
49 changes: 49 additions & 0 deletions code/src/test/scala/kmv/KMinimumValuesSuite.scala
@@ -0,0 +1,49 @@
/*
* Copyright 2024 Creative Scala
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package kmv

import munit.FunSuite

// The k-Minimum Values algorithm is deterministic, so we can test it with
// carefully chosen test data.
class KMinimumValuesSuite extends FunSuite {
  test("kMV correctly estimates distinct values from a single point") {
    val kmv = KMinimumValues(1).add(0.5)

    assertEquals(kmv.distinctValues, 1L)
  }

  test("kMV correctly estimates distinct values from equally spaced points") {
    val kmv = KMinimumValues(4).add(0.2).add(0.4).add(0.6).add(0.8)

    assertEquals(kmv.distinctValues, 4L)
  }

  test("kMV keeps only the minimum values") {
    val kmv = KMinimumValues(4)
      .add(0.9)
      .add(0.9)
      .add(0.2)
      .add(0.4)
      .add(0.9)
      .add(0.6)
      .add(0.9)
      .add(0.8)

    assertEquals(kmv.distinctValues, 4L)
  }
}
89 changes: 85 additions & 4 deletions docs/src/pages/kmv.md
@@ -66,17 +66,98 @@ In Scala we can use [scala.util.hashing.MurmurHash3][murmur3].

## Implementing k-Minimum Values

We're going to implement a system that estimates distinct values using k-Minimum Values.
This means implementing the core algorithm as well as the support code around it that feeds it data.
This latter part is where FS2 will come in.


### The Algorithm

Your first mission is to implement the k-Minimum Values algorithm.
At this point we're not worrying about connecting it to FS2.
There is a code skeleton in `code/src/main/scala/kmv/KMinimumValues.scala`.
There are also simple tests in `code/src/test/scala/kmv/KMinimumValuesSuite.scala`.

@:solution
My solution uses a mutable array. I felt like challenging myself to muck around with array indices and other concepts that I don't use much in my day-to-day programming. A solution using an immutable data structure would be a lot simpler to write.

```scala
final class KMinimumValues(k: Int) {
  // The k minimum values, stored in a mutable array
  private val values = Array.ofDim[Double](k)

  // Values will be initialized to contain all zeros, which will be less than
  // most reasonable input. Hence we need to track how many elements in values
  // have been initialized with real data.
  private var used = 0

  def add(element: Double): KMinimumValues = {
    import java.util.Arrays

    // A non-negative index indicates the element is in the array.
    //
    // A negative index indicates the element is not in the array, and encodes
    // the insertion point as -(insertion point) - 1.
    //
    // Only search in the elements of values that have been used
    val idx = Arrays.binarySearch(values, 0, used, element)

    // Element is already in the array
    if idx >= 0 then this
    else {
      if used < values.size then used = used + 1

      val insertionPoint = -idx - 1
      // Element is larger than any existing value
      if insertionPoint >= values.size then this
      else {
        // Shift all the larger values out of the way and insert element
        System.arraycopy(
          values,
          insertionPoint,
          values,
          insertionPoint + 1,
          values.size - insertionPoint - 1
        )
        values(insertionPoint) = element
        this
      }
    }
  }

  def distinctValues: Long =
    // If we have seen fewer than k values we can return the exact number of
    // distinct values
    if used < values.size then used.toLong
    else Math.round(k.toDouble / values.last - 1.0)
}
```
@:@
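
For comparison, here is a minimal sketch of what an immutable variant might look like. It is not part of the commit: the name `ImmutableKmv` and the choice of `SortedSet` with `Ordering.Double.TotalOrdering` are assumptions made for illustration. It keeps at most `k` values in a sorted set and uses the same estimate as the solution above.

```scala
import scala.collection.immutable.SortedSet

// Hypothetical immutable variant; the name and representation are assumptions.
final case class ImmutableKmv(k: Int, values: SortedSet[Double]) {
  // Returns a new sketch; the receiver is never modified
  def add(element: Double): ImmutableKmv = {
    // Adding an element already in the set leaves the set unchanged
    val updated = values + element
    // If we now hold more than k values, drop the largest one
    if updated.size > k then copy(values = updated.init)
    else copy(values = updated)
  }

  def distinctValues: Long =
    // With fewer than k values the count is exact
    if values.size < k then values.size.toLong
    else Math.round(k.toDouble / values.last - 1.0)
}
object ImmutableKmv {
  // Pass the ordering explicitly to avoid the ambiguous default for Double
  def empty(k: Int): ImmutableKmv =
    ImmutableKmv(k, SortedSet.empty[Double](using Ordering.Double.TotalOrdering))
}
```

A sketch like this can be built with a fold, for example `List(0.2, 0.4, 0.6, 0.8).foldLeft(ImmutableKmv.empty(4))(_.add(_))`.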


### Building a Data Pipeline

We're now going to build the pipeline that will feed the k-Minimum Values algorithm.
This will have the following stages:

- reading text from storage;
- segmenting the text into words; and
- hashing the words into `Double` values between 0 and 1.

For all of these parts we will use FS2.
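
Leaving FS2 aside for a moment, the segmenting and hashing stages can be sketched as plain functions. This is only an illustration under stated assumptions: the helper names `words` and `hashToUnitInterval` are hypothetical, splitting on non-letter characters is a crude stand-in for real word segmentation, and scaling the 32-bit result of `MurmurHash3.stringHash` is just one way to obtain a `Double` in the range [0, 1).

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper: split a line into words on runs of non-letter
// characters. Crude: "don't" becomes "don" and "t".
def words(line: String): List[String] =
  line.split("[^\\p{L}]+").toList.filter(_.nonEmpty)

// Hypothetical helper: hash a word to a Double in [0, 1)
def hashToUnitInterval(word: String): Double = {
  // stringHash returns a signed 32-bit Int
  val hash = MurmurHash3.stringHash(word)
  // Shift into the unsigned range [0, 2^32) and scale by 2^32
  (hash.toLong - Int.MinValue.toLong).toDouble / 4294967296.0
}
```

Functions of this shape could later be applied inside a stream with operations such as `map`, which keeps the hashing logic testable on its own.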

For data we will use two sources:

1. The 1934 version of Webster's Dictionary, now in the public domain. This file has one word per line, and every word is unique, so it gives us an easy way to test our algorithm.

2. The complete works of William Shakespeare. This is much bigger than the dictionary, contains duplicates, and requires more processing, and so is a more realistic test.

## Hashing Data

#### Reading and Processing Text

## Pipes
#### Hashing Data

#### Pipes

## Reading and Processing Text


## References
11 changes: 10 additions & 1 deletion examples/js/src/main/scala/kmv/KMinimumValues.scala
@@ -53,7 +53,7 @@ final class KMV(values: Array[Double]) {
}

def cardinality: Long =
KMV.estimateCardinality(averageDistance)
KMV.estimateCardinality(values.size, values.last)
}
object KMV {
def arithmeticMean(elements: IArray[Double]): Double = {
@@ -74,6 +74,15 @@ object KMV {
loop(0, 0.0)
}

/** Estimate the cardinality from the distance from 0 to the kth element */
def estimateCardinality(k: Int, length: Double): Long = {
// length estimates k / (n + 1)
val estimatedCardinality = Math.round(k.toDouble / length) - 1

estimatedCardinality
}

/** Estimate the cardinality from the average length of regions */
def estimateCardinality(mean: Double): Long = {
val estimatedCardinality = Math.round(1.0 / mean) - 1
