PMP compatibility with string data types. #73
Based on our chat on Discord, we could add a preprocessing function that converts string data to a numeric representation. There are some caveats to this approach, as you may lose some meaning. There isn't really a nice way to provide a generic approach that works for all data sets. @kavj mentioned this paper for reference:
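To make the caveat concrete, here is a minimal sketch of what such a preprocessing step might look like. The name `encode_symbols` is hypothetical and not part of this library; it simply maps each distinct character to its rank in the sorted alphabet.

```python
def encode_symbols(text):
    """Map each character to its rank in the sorted alphabet (hypothetical helper)."""
    alphabet = sorted(set(text))
    lookup = {symbol: rank for rank, symbol in enumerate(alphabet)}
    # Return floats so the result can be fed to numeric routines.
    encoded = [float(lookup[symbol]) for symbol in text]
    return encoded, lookup


encoded, lookup = encode_symbols("GATTACA")
print(lookup)   # mapping from character to numeric code
print(encoded)  # numeric series with the same length as the input
```

Note that the encoding invents an ordering: with this mapping "A" ends up numerically closer to "C" than to "T", which has no meaning for categorical symbols. That is the kind of distortion a generic conversion can introduce.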
I'd like to take a swing at this. @tylerwmarrs, I agree that conversion to numerical values may not be the best approach. I've already done quite a bit of work using SAX and Random Projection, which handle string and character data types with no issue. As stated in the Discord channel, distance is measured by Hamming distance (the number of positions at which two equal-length strings differ, i.e. the number of substitutions needed to make them identical). This metric would have to be the default when working with string data types. Beyond that, what I envision is a PMP that functions the same way as the numerical one, but with the gradient represented by the Hamming distance. This differs from the white paper "https://www.cs.ucr.edu/~eamonn/PAN_SKIMP%20%28Matrix%20Profile%20XX%29.pdf" in that you no longer need an identical match (see Fig 4); instead you get the Hamming distance representation.
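As a rough illustration of the idea (not code from this library), the sketch below computes a Hamming-distance profile for one window length. The function names and the choice to exclude overlapping windows as trivial matches are assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def hamming_profile(text, m):
    """For each length-m window, the smallest Hamming distance to any
    non-overlapping window (assumed exclusion zone: |i - j| < m)."""
    n = len(text) - m + 1
    windows = [text[i:i + m] for i in range(n)]
    profile = []
    for i in range(n):
        best = m  # worst case: every position differs
        for j in range(n):
            if abs(i - j) >= m:
                best = min(best, hamming(windows[i], windows[j]))
        profile.append(best)
    return profile


print(hamming_profile("abcxxabd", 3))
```

Low values mark windows that have a near-identical partner elsewhere in the string, so an inexact motif such as "abc" / "abd" still shows up instead of requiring an exact match.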
Sounds good. Once you have some initial code in place, open a pull request and mention this issue. Dealing with string data is a new concept for this library, so we will need to make sure it fits into the rest of the API accordingly.
In the white paper "Matrix Profile XX: Finding and Visualizing Time Series Motifs of All Lengths using the Matrix Profile", section 3 demonstrates the Pan Matrix Profile with string data; however, this capability is not included in the current implementation.
This shouldn't be too hard to implement. The example looks for an identical string match, but Hamming distance (or some other distance metric) could be used to measure similarity/dissimilarity between strings. Such a feature has the advantage of avoiding Random Projection when finding string motifs with noise included in the signal, while simultaneously searching all possible sequence lengths.
The alternative is to run Random Projection for every motif sequence length (N/2, where N is the length of the time series) and for all reasonable discrepancy counts.
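For discussion, a brute-force sketch of the Hamming-based variant over a range of window lengths could look like the following. The function name `pan_hamming_profile`, the division by the window length to make rows comparable, and the non-overlap exclusion rule are all assumptions of this sketch, not the paper's algorithm or this library's API.

```python
def pan_hamming_profile(text, min_m, max_m):
    """Return {m: profile}, where each profile holds, per window start, the
    smallest length-normalised Hamming distance to a non-overlapping window."""
    pmp = {}
    for m in range(min_m, max_m + 1):
        n = len(text) - m + 1
        row = []
        for i in range(n):
            dists = [
                sum(a != b for a, b in zip(text[i:i + m], text[j:j + m]))
                for j in range(n)
                if abs(i - j) >= m  # skip trivial (overlapping) matches
            ]
            # 1.0 means "no comparable window" or "nothing in common".
            row.append(min(dists) / m if dists else 1.0)
        pmp[m] = row
    return pmp


pmp = pan_hamming_profile("GATTACAxxGATTGCA", 3, 7)
for m, row in pmp.items():
    print(m, [round(d, 2) for d in row])
```

This is quadratic in the string length for every window length, so it only serves to show the shape of the output; a real implementation would need something faster.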