Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review HuffLZ methods GetNumExtraBits and GetRepeatOffset #134

Open
DanRStevens opened this issue Aug 7, 2018 · 6 comments
Open

Review HuffLZ methods GetNumExtraBits and GetRepeatOffset #134

DanRStevens opened this issue Aug 7, 2018 · 6 comments

Comments

@DanRStevens
Copy link
Member

The GetNumExtraBits and GetRepeatOffset methods contain cascaded if/else blocks. This is slightly undesirable as it embeds data into the code in a hard to read way. Perhaps it can be rewritten as a data table lookup.

Some attention should be paid to performance, as this is an inner part of the decompression algorithm. It is however a branch of the inner part, which might not execute too frequently, so performance might not be a critical concern in practice.

Closing this issue without changes is also a valid option. The code as written is correct, and does not suffer from known performance issues. The issue is merely one of code style.

@Brett208 Brett208 self-assigned this Aug 19, 2018
@Brett208
Copy link
Member

I'm spending some time refactoring HuffLZ. It may not lead to a solution to this problem, but wanted to flag my work to reduce duplicated efforts in HuffLZ.

@DanRStevens
Copy link
Member Author

I was thinking about this today. I was tempted to close it as not a priority, but it managed to grab my curiosity.

If we go with a data structure approach, we need a way to map a point to a range, or rather, values associated with a range. Important details here are the intervals are non-overlapping, and adjacent, and there is a default case that covers all possible values. That is, no point belongs to more than one range, and no point falls between or around ranges. That is, every point belongs to exactly one range.

There are various algorithms that map points to ranges, such as Interval Trees. Specifically, it mentions in the naive case, where intervals do not overlap, a binary search tree will do.

As our data is static, we could implement a binary search of a sorted table. However, I think a linear search would actually be better here. In particular, we want lookup to be fast. Normally, a binary search would have better worst case lookup time. Here however, we are dealing with a decompression algorithm, where input data is not evenly distributed. The codes themselves are variable length to account for this, in that shorter codes are assigned to more likely patterns. As such, we want lookup of shorter codes to be fastest, as they will be most common. Hence, a linear search is most suited here, since it will have a lower amortized lookup cost over the set of compression codes.

The current cascaded if statements are basically an unrolled loop doing a linear search of a data table.

In terms of the data needed, lookup is done by the range end value, though the start value is also used in calculations. There is also the numExtraBits, a shiftFactor, and an additiveFactor. One method is a simple lookup for numExtraBits. The other method is a lookup for the range.start, shiftFactor, and additiveFactor, followed by a calculation using those values.

Seems like a static array of structs, along with a method to do a linear search over the data table with a for loop would close this.

@Brett208 Brett208 removed their assignment Aug 25, 2018
@DanRStevens
Copy link
Member Author

I went down a bit of a rabbit hole about returning multiple values from functions in C++. It was educational though. I think I like the struct method, which is also quite portable between different standards versions. The article had some interesting destructuring example at the end that was introduced in C++17. Or rather, a structured bindings example.

In terms of this issue, the code to range lookup needs to find and return multiple values. That might either be the raw parameters, stored in a static const table, or it might be the result of the GetNumExtraBits and GetRepeatOffset computations now done with a single lookup. For what I think would be a proper separation of concerns, I think it'd be best to return the result of those two methods.

struct OffsetModifiers {
	unsigned int extraBitCount;
	unsigned int offsetUpperBits;
};
static OffsetModifiers GetOffsetModifiers(unsigned int offset);

As a side note, after looking at the code a bit more, I realized the "mod" value was actually the upper bits of the offset value. The code translates some variable portion of the high bits, depending on the ranges, into an expanded set of high bits. Extra low order bits need to be read in to accommodate the increased range of the longer bit strings.

In terms of implementation, a transition implementation for the new method might be:

HuffLZ::OffsetModifiers HuffLZ::GetOffsetModifiers(unsigned int offset) {
	auto extraBitCount = GetNumExtraBits(offset);
	auto offsetUpperBits = GetOffsetBitMod(offset);
	return OffsetModifiers { extraBitCount, offsetUpperBits };
}

@Brett208
Copy link
Member

I might be okay with condensing to:

HuffLZ::OffsetModifiers HuffLZ::GetOffsetModifiers(unsigned int offset) {
	return OffsetModifiers { 
                GetNumExtraBits(offset),
                GetOffsetBitMod(offset)};
}

-Brett

@DanRStevens
Copy link
Member Author

True. At any rate, it's likely to be a transitional point in the code. I think we should get rid of the separate Get functions, so the value lookup can be done once. I figure having a transitional point using the old code through a new interface allows for better tracking in case a bug is introduced. Changes will be needed in both the calling function, and in the called code. The transition point allows the two to be done separately without breaking the code in the process. That can potentially aid testing and debugging.

@DanRStevens
Copy link
Member Author

I was watching a video about how testability relates to good design. I think I agreed with it. One point was about how you should try to test your components through their public interface. In particular, avoid testing private methods. If you feel the need to test private methods, your class is probably too big, and doing too many things, and you should perhaps extract a component that will then have a testable public interface. The Google Test documentation makes a similar claim in their documentation. Black box testing is about testing components through their public interface. If you feel the need to test private methods, perhaps there is a better design.

In regards to this issue, the GetOffsetModifiers method is private, and hard to properly test by going through the public interface. This got me thinking that section of the code should probably be extracted to a separate class. It appears to be doing a static Huffman table lookup, which is actually a fairly generic thing to do, so it makes a lot of sense to extract that code to a new class. As we're dealing with a static table here, an instance of that new class could be used as a static data member of the HuffLZ class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants