fix broken anchor links warned about #330

Draft · wants to merge 3 commits into master
7 changes: 7 additions & 0 deletions .gitignore
@@ -2,6 +2,13 @@
build/
node_modules/
.docusaurus/
.yarn/**
.yarnrc.yml
yarn-debug.log*
yarn-error.log*
.pnp.*
.pnpm-debug.log*
lerna-debug.log*

########
## Linux
24 changes: 12 additions & 12 deletions docs/manual/cellediting.md
@@ -57,7 +57,7 @@ You can also convert cells into null values or empty strings. This can be useful

## Fill down and blank down {#fill-down-and-blank-down}

Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#row-types-rows-vs-records) - that is, multiple rows associated with one specific entity.
Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#rows-vs-records) - that is, multiple rows associated with one specific entity.

If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will associate rows to each other based on the remaining values in the first column.
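
To make the two operations concrete, here is a minimal Python sketch of what fill down and blank down do to a single column of cell values. This is an illustration of the concept only, not OpenRefine's implementation; the function names and the sample values are invented for the example.

```python
def fill_down(column):
    """Copy the last non-blank value into each blank cell below it."""
    filled, last = [], None
    for value in column:
        if value not in (None, ""):
            last = value
        filled.append(last)
    return filled

def blank_down(column):
    """Blank out consecutive repeats so only the first value of each run remains."""
    blanked, last = [], None
    for value in column:
        if value == last:
            blanked.append(None)
        else:
            blanked.append(value)
            last = value
    return blanked

# After blanking down, "Show as records" can group rows by the remaining values:
print(blank_down(["Austen", "Austen", "Woolf", "Woolf", "Woolf"]))
# ['Austen', None, 'Woolf', None, None]
```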

@@ -119,13 +119,13 @@ The clustering pop-up window offers you two categories of clustering methods: 6
* [Beider-Morse](#baider-morse)
- [Nearest Neighbor](#nearest-neighbor)
* [Levenshtein](#levenshtein-distance)
* [PPM](#ppm)
* [PPM (Prediction by Partial Matching)](#ppm)

#### Key Collision {#key-collision}

**Key collisions** are very fast and can process millions of cells in seconds:

**<a name="fingerprinting">Fingerprinting</a>**
##### Fingerprinting {#fingerprinting}

Fingerprinting is the least likely to produce false positives, so it’s a good place to start. It does the same kind of data cleaning behind the scenes that you might think to do manually:

@@ -138,7 +138,7 @@ Fingerprinting is the least likely to produce false positives, so it’s a good

For an in-depth understanding of fingerprinting, check this [document](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint)
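
As an illustration of those cleaning steps, here is a minimal Python sketch of a fingerprint key function. It approximates the behaviour described above (trim, lowercase, strip punctuation, ASCII-normalize, sort and deduplicate tokens); it is not OpenRefine's Java implementation, and the sample values are invented.

```python
import re
import unicodedata

def fingerprint(value: str) -> str:
    """Build a fingerprint key: normalize, strip punctuation, sort unique tokens."""
    value = value.strip().lower()
    # Normalize accented characters to their closest ASCII equivalents.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation and other non-word, non-space characters.
    value = re.sub(r"[^\w\s]", "", value)
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

# Both variants collapse to the key "cruise tom", so they land in one cluster:
assert fingerprint("Tom Cruise") == fingerprint("  Cruise, Tom ")
```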

**<a name="n-gram">N-gram Fingerprinting</a>**
##### N-gram Fingerprinting {#n-gram}

N-gram fingerprinting allows you to set the _n_ value to whatever number you’d like and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a fingerprint.

@@ -148,23 +148,23 @@ This can help match cells that have typos, or incorrect spaces (such as matching

For an in-depth understanding of N-gram fingerprinting, check this [document](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#n-gram-fingerprint)
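
A minimal Python sketch of the same idea, assuming the cleaning steps described above (lowercase, strip punctuation and whitespace) before the n-grams are taken. This is an illustration only, not OpenRefine's implementation.

```python
import re

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Build an n-gram fingerprint: clean, take character n-grams, sort, dedupe, join."""
    cleaned = re.sub(r"[^a-z0-9]", "", value.lower())
    grams = sorted({cleaned[i:i + n] for i in range(len(cleaned) - n + 1)})
    return "".join(grams)

# "New York" and "newyork" share the 2-gram fingerprint "ewneorrkwyyo", so they cluster:
assert ngram_fingerprint("New York") == ngram_fingerprint("newyork")
```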

**<a name="phonetic-clustering">Phonetic Clustering</a>**
##### Phonetic Clustering {#phonetic-clustering}

The next four methods are phonetic algorithms: they identify letters that sound the same when pronounced out loud, and assess text values based on that (such as knowing that a word with an “S” might be a mistype of a word with a “Z”). They are great for spotting mistakes made by not knowing the spelling of a word or name after hearing it spoken aloud.

**<a name="metaphone3-fingerprinting">Metaphone3 Fingerprinting</a>**
##### Metaphone3 Fingerprinting {#metaphone3-fingerprinting}

Metaphone3 fingerprinting is an English-language phonetic algorithm. For example, “Reuben Gevorkiantz” and “Ruben Gevorkyants” share the same phonetic fingerprint in English.

**<a name="cologne-fingerprinting">Cologne Fingerprinting</a>**
##### Cologne Fingerprinting {#cologne-fingerprinting}

Cologne fingerprinting is another phonetic algorithm, but for German pronunciation.

**<a name="daitch-mokotoff">Daitch-Mokotoff</a>**
##### Daitch-Mokotoff {#daitch-mokotoff}

Daitch-Mokotoff is a phonetic algorithm for Slavic and Yiddish words, especially names.

**<a name="baider-morse">Baider-Morse</a>**
##### Beider-Morse {#baider-morse}

Beider-Morse is a version of Daitch-Mokotoff that is slightly more strict.

@@ -174,21 +174,21 @@ For an in-depth understanding of phonetics, check this [document](https://github

#### Nearest Neighbor {#nearest-neighbor}

**Nearest Neighbor** clustering methods are slower than key collision methods.
**Nearest Neighbor** clustering methods are slower than previously described [key collision](#key-collision) methods.

They allow the user to set a radius - a threshold for matching or not matching. OpenRefine uses a “blocking” method first, which sorts values based on whether they have a certain amount of similarity (the default is “6” for a six-character string of identical characters) and then runs the nearest-neighbor operations on those sorted groups.

We recommend setting the block number to at least 3, and then increasing it if you need to be more strict (for example, if every value with “river” is being matched, you should increase it to 6 or more).

**Note** that bigger block values will take much longer to process, while smaller blocks may miss matches. Increasing the radius will make the matches more lax, as bigger differences will be clustered.
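
A rough Python sketch of the blocking idea: values that share a substring of the block length become candidate pairs, and only those pairs are handed to the (slower) distance function. This illustrates the concept rather than reproducing OpenRefine's exact algorithm; the function name and sample values are invented.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(values, block_chars=6):
    """Yield candidate pairs of values that share a substring of block_chars characters."""
    blocks = defaultdict(set)
    for v in values:
        key = "".join(v.lower().split())
        for i in range(len(key) - block_chars + 1):
            blocks[key[i:i + block_chars]].add(v)
    seen = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            if (a, b) not in seen:
                seen.add((a, b))
                yield a, b  # only these pairs go on to the distance calculation

# Only the two "Mississippi" spellings share a 6-character substring, so only they are compared:
print(list(block_pairs(["Mississippi River", "Mississipi River", "Hudson"])))
```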

**<a name="levenshtein-distance">Levenshtein Distance</a>**
##### Levenshtein Distance {#levenshtein-distance}

Levenshtein distance counts the number of edits required to make one value perfectly match another. As in the key collision methods above, it will do things like change uppercase to lowercase, fix whitespace, change special characters, etc. Each character that gets changed counts as 1 “distance.” “New York” and “newyork” have an edit distance value of 3 (“N” to “n”; “Y” to “y”; remove the space).

It can do relatively advanced edits, such as understanding the distance between “M. Makeba” and “Miriam Makeba” (5), but it may create false positives if these distances are greater than other, simpler transformations (such as the one-character distance to “B. Makeba,” another person entirely).
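
A minimal Python sketch of the distance calculation itself, using the standard dynamic-programming formulation rather than OpenRefine's tuned implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Matches the example above: "N" to "n", "Y" to "y", remove the space.
assert levenshtein("New York", "newyork") == 3
```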

**<a name="ppm">PPM (Prediction by Partial Matching)</a>**
##### PPM (Prediction by Partial Matching) {#ppm}

PPM (Prediction by Partial Matching) uses compression to see whether two values are similar or different. In practice, this method is very lax even for small radius values and tends to generate many false positives, but because it operates at a sub-character level it is capable of finding substructures that are not easily identifiable by distances that work at the character level. So it should be used as a “last resort” clustering method. It is also more effective on longer strings than on shorter ones.
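
The intuition behind compression-based similarity can be shown with a normalized compression distance: if concatenating two values compresses almost as well as each value alone, they probably share a lot of substructure. The sketch below uses Python's zlib only to illustrate that idea; OpenRefine's PPM method uses a different compressor and is not reproduced here.

```python
import zlib

def compressed_size(s: str) -> int:
    return len(zlib.compress(s.encode("utf-8")))

def compression_distance(a: str, b: str) -> float:
    """Normalized compression distance: near 0 for near-duplicates, near 1 for unrelated text."""
    ca, cb, cab = compressed_size(a), compressed_size(b), compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

# Longer strings give the compressor more structure to exploit, which is why
# this family of methods works better on long values than on short ones.
print(compression_distance("Université Pierre et Marie Curie", "Universite Pierre et Marie Curie"))
print(compression_distance("Université Pierre et Marie Curie", "Mississippi State University"))
```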

1 change: 1 addition & 0 deletions docusaurus.config.js
@@ -5,6 +5,7 @@ module.exports = {
themes: ['@docusaurus/theme-mermaid'],
onBrokenLinks: 'throw',
onBrokenMarkdownLinks: 'throw',
onBrokenAnchors: 'warn',
title: 'OpenRefine',
tagline: 'A power tool for working with messy data.',
url: 'https://openrefine.org',