
Default field size and decimal length when writing shapefiles #114

Open
karimbahgat opened this issue Aug 29, 2017 · 2 comments

Comments

@karimbahgat
Collaborator

Since the changes made after 1.2.10, several users have raised concerns about field and value types. Most recently, @klasko2 pointed out in #99 that saving a float value to an 'F' field stores it as an integer, because the default number of decimals is 0 when defining a new field. This raises a more general question for the next version of PyShp:

What should be the default field 'size' and 'decimal' for different field types?

I hope this thread can be used as a place for people to voice their concerns and share their experiences and expectations regarding shapefiles and dbf field types.

The Issue

Until now, field size (i.e. how many bytes) has always been set to 50, and decimal always to 0.
Instead, I think the case can be made that any numeric field should default to a decimal number. This leaves us with some open questions:

  1. ...what's a good default size number? Is 50 big enough to store most numbers that an average user would need, and at the same time small enough not to waste file size? For a negative decimal number, this could store a value as low as -100000000000000000000000000000000000000000000000.0, or as detailed as -0.000000000000000000000000000000000000000000000001 (provided the decimal arg is set accordingly). That might actually seem excessively high for most users, so perhaps it should be lowered to produce smaller shapefiles? What's the default in other software?
  2. ...what's a good default decimal number? Would 6 decimal places retain enough information for the average user not to feel they are losing information? This would mean floats being rounded to e.g. 0.123456. Perhaps this is too small; should it instead be 12 or 16? What's the default in other software?
  3. ...should size and decimal be the same for 'F' and 'N' fields? Float fields are decimals by definition, but Numeric fields can hold either ints or floats. One might argue that both should default to decimal numbers, since defaulting to ints would result in lost information for unsuspecting users. Users who are certain they only want to save ints can set decimal=0 manually.
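To make the size/decimal trade-off concrete, here is a minimal sketch of how a number ends up as fixed-width text in a dbf record. It does not use PyShp: `format_numeric` is a hypothetical helper, and it assumes the usual dBASE layout in which a numeric value is written as right-justified ASCII with `decimal` digits after the point.

```python
def format_numeric(value, size, decimal):
    """Render a number as a right-justified, fixed-width text field.

    Hypothetical helper for illustration only; not part of PyShp.
    """
    text = f"{value:.{decimal}f}".rjust(size)
    if len(text) > size:
        raise ValueError(f"{value!r} does not fit in {size} characters")
    return text


# With decimal=0 the fractional part is rounded away.
as_int = format_numeric(1.32, size=50, decimal=0)

# With decimal=6 the fraction survives, and a smaller size wastes fewer bytes.
as_float = format_numeric(1.32, size=12, decimal=6)

print(repr(as_int.strip()), len(as_int))      # '1' 50
print(repr(as_float.strip()), len(as_float))  # '1.320000' 12
```

Every record pays for the full field width, so a 50-character default costs 50 bytes per value regardless of how short the number is.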

For the remaining field types I think the following would be non-controversial:

  • Type 'C': size=80, decimal irrelevant. Text fields are typically longer than numeric fields, and I believe that's the default QGIS text field size. This would save text values as long as abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde.
  • Type 'L': size=1, decimal irrelevant.
  • Type 'D': size=8, decimal irrelevant.
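As a rough illustration of why these three sizes are natural, the sketch below encodes one value of each type the way dBASE-style files conventionally store them: space-padded text, a single `T`/`F` character, and an eight-character `YYYYMMDD` date. The helper names are hypothetical, and the sketch ignores character encoding (the real size counts bytes, not characters).

```python
from datetime import date


def encode_text(value, size=80):
    """Left-justify text and pad with spaces; longer text is cut off."""
    return value[:size].ljust(size)


def encode_logical(value):
    """One character: 'T' or 'F', with '?' standing in for a missing value."""
    if value is None:
        return "?"
    return "T" if value else "F"


def encode_date(value):
    """Eight characters in YYYYMMDD order."""
    return value.strftime("%Y%m%d")


text_field = encode_text("abcde" * 20)  # 100 characters in, 80 kept
flag_field = encode_logical(True)
date_field = encode_date(date(2017, 8, 29))

print(len(text_field), flag_field, date_field)  # 80 T 20170829
```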

Any and all thoughts are appreciated!

@ChrisBarker-NOAA

ChrisBarker-NOAA commented Nov 22, 2023

I have no idea if this is still active, as it was created in 2017, but it's open and it looks like it was added to a milestone in 2022, so here we go:

First, I see from the docs that 'F' and 'N' are the same. That is a pity; it would be really good to support an actual integer type. For example, we are writing shapefiles that have truly integer fields, but other software detects them as Real. Here is what OGR (GDAL), via Python, reports:

print(feature)
OGRFeature(Test-Model2023-11-18T06):1
  Time (String) = 2023-11-18T06:00:00
  LE_id (Real) = 1
  Depth (Real) = 0
  Mass (Real) = 15
  Age (Real) = 43200
  Surf_Conc (Real) = 0.00164
  Status_Cod (Real) = 2
  POINT (-89.3187777840845 28.8072723576694)
print(feature.items())
{'Time': '2023-11-18T06:00:00',
 'LE_id': 1.0,
 'Depth': 0.0,
 'Mass': 15.0,
 'Age': 43200.0,
 'Surf_Conc': 0.00164,
 'Status_Cod': 2.0}

It would be really nice if the integers could come through as integers. I can convert in the client code, but that's a limitation in the discoverability of the data.

(In the above, 'Depth': 0.0 really should be a float, whereas 'LE_id': 1.0 should really be an integer; for an ID, that really matters.)

And making 'N' and 'F' be different would help with the defaults -- an integer would be an integer :-)
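One convention a reader could use to tell the two apart is the field's declared decimal count: a numeric field with zero decimals is parsed as an integer, anything else as a float. The sketch below shows that idea only; `parse_numeric` is a hypothetical helper, not a description of what PyShp or OGR do internally.

```python
def parse_numeric(raw, decimal):
    """Convert the text of a fixed-width numeric field to int or float.

    Hypothetical helper: the declared decimal count decides the Python type.
    """
    text = raw.strip()
    if not text:
        return None  # blank field, treated here as a missing value
    if decimal == 0:
        try:
            return int(text)
        except ValueError:
            # The text has a fraction or exponent despite decimal == 0.
            return float(text)
    return float(text)


le_id = parse_numeric("         1", decimal=0)
surf_conc = parse_numeric("   0.00164", decimal=5)

print(type(le_id).__name__, le_id)          # int 1
print(type(surf_conc).__name__, surf_conc)  # float 0.00164
```

Under this convention, an ID column written with decimal=0 would come back as `1` rather than `1.0`.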

I haven't looked carefully at the DBF format yet -- the ESRI shapefile spec helpfully (not) just references the DBASE spec -- without even a link :-(

But according to Wikipedia:

"""
Supported field types are: floating point (13 character storage), integer (4 or 9 character storage), date (no time storage; 8 character storage), and text (maximum 254 character storage)
"""
If this is correct, then:

> what's a good default size number? Is 50

It doesn't sound like anything over 12 is supported anyway. But if you are correct that it's an option to go larger, maybe these values are reasonable defaults?

> ...what's a good default decimal number? Would 6 decimal places retain enough information for the average user not to feel they are losing information? This would mean floats being rounded to e.g. 0.123456.

This is a serious challenge -- there simply is no sensible default if you have to have a fixed number of decimal places, because the right number depends on the order of magnitude of the value. Is actual floating point not an option? (That is: 1.234e10 and 1.234e-10 -- same amount of precision, totally different number of places after the decimal point.) If it does have to be fixed, I think there should be no default -- it depends on what data you are trying to store, and only the person writing the data can know what's appropriate.
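The point is easy to demonstrate with plain Python string formatting. This sketch formats the two example values with six fixed decimal places and with six-digit scientific notation; it says nothing about which notations a given dbf reader accepts.

```python
values = [1.234e10, 1.234e-10]

# Fixed-point: six digits after the decimal point, whatever the magnitude.
fixed = [f"{v:.6f}" for v in values]

# Scientific: six digits after the point of the mantissa.
scientific = [f"{v:.6e}" for v in values]

for v, f_text, e_text in zip(values, fixed, scientific):
    print(f"{v!r:>14}  fixed={f_text!r}  scientific={e_text!r}")
```

With fixed decimals the small value collapses to `0.000000`, while the large one carries six meaningless zeros; the scientific form keeps the same number of significant digits for both.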

(I've got to say -- it is really bad that we are so dependent on such an ancient file format! -- but what can you do?)

> Perhaps this is too small, should it be instead 12 or 16? What's the default in other software?

Wait! Looking now at your docs, it seems it DOES support true float.

e.g:
`>>> r = shapefile.Reader('shapefiles/test/dtype')`
`assert r.record(0) == [1, 1.32, 1.3217328, -3.2302e-25, 1.3217328, ...`

In that case, a C float carries about 7 to 8 significant decimal digits and a double about 16 -- and a Python float is a C double. So 8 or 16 digits would be reasonable defaults.

For integers, signed 64-bit ints need up to 19 digits (20 characters with a minus sign), but those are really big. 32-bit ints need up to 10 digits, so that is not a bad default.
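These digit counts can be checked from Python itself. The sketch below counts the digits of the largest 32-bit and 64-bit integers and asks `sys.float_info` how many decimal digits a Python float (a C double) is guaranteed to preserve; the variable names are just for illustration.

```python
import sys

# Digits in the largest value of each integer type (sign not counted).
int32_digits = len(str(2**31 - 1))
int64_digits = len(str(2**63 - 1))
uint64_digits = len(str(2**64 - 1))

# Decimal digits a double can always represent faithfully.
double_safe_digits = sys.float_info.dig

# 17 significant digits are enough to round-trip a double through text.
x = 0.1 + 0.2
round_trips = float(f"{x:.17g}") == x

print(int32_digits, int64_digits, uint64_digits)  # 10 19 20
print(double_safe_digits, round_trips)            # 15 True
```

So "16 digits" sits between the 15 digits a double always preserves and the 17 needed to write any double back out exactly.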

If I'm totally wrong here, do you have a pointer to the spec for the DBF format as used by shapefiles? I haven't been able to find it yet.

I did find this:

http://www.manmrk.net/tutorials/database/xbase/data_types.html#DATA_TYPES
but it looks more extensive than what shapefiles support. It does limit "N" to 18 chars, though.

@ChrisBarker-NOAA

Looking a bit more, perhaps you could follow defaults similar to those of the OGR Shapefile writer:

(https://gdal.org/drivers/vector/shapefile.html)

Shapefile feature attributes are stored in an associated .dbf file, and so attributes suffer a number of limitations:

Attribute names can only be up to 10 characters long. The OGR Shapefile driver tries to generate unique field names. Successive duplicate field names, including those created by truncation to 10 characters, will be truncated to 8 characters and appended with a serial number from 1 to 99.

For example:

a → a, a → a_1, A → A_2;

abcdefghijk → abcdefghij, abcdefghijkl → abcdefgh_1

Only Integer, Integer64, Real, String and Date (not DateTime, just year/month/day) field types are supported. The various list, and binary field types cannot be created.

The field width and precision are directly used to establish storage size in the .dbf file. This means that strings longer than the field width, or numbers that don't fit into the indicated field format will suffer truncation.

Integer fields without an explicit width are treated as width 9, and extended to 10 or 11 if needed.

Integer64 fields without an explicit width are treated as width 18, and extended to 19 or 20 if needed.

Real (floating point) fields without an explicit width are treated as width 24 with 15 decimal places of precision.

String fields without an assigned width are treated as 80 characters.
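The defaults quoted above fit in a small lookup table. The sketch below is one possible reading of them, not OGR's implementation: the type names come from the quoted text, while `width_for` and the rule it uses to extend integer widths (the longest value's text length, capped at the quoted maximum) are assumptions.

```python
# (width, decimal) per field type, taken from the quoted OGR documentation.
OGR_STYLE_DEFAULTS = {
    "Integer": (9, 0),
    "Integer64": (18, 0),
    "Real": (24, 15),
    "String": (80, 0),
}

# Widest an integer field may grow to, per the quoted text.
MAX_INTEGER_WIDTH = {"Integer": 11, "Integer64": 20}


def width_for(kind, values=()):
    """Return (width, decimal), widening integer fields only when needed.

    Hypothetical helper; the widening rule is an assumption.
    """
    width, decimal = OGR_STYLE_DEFAULTS[kind]
    if kind in MAX_INTEGER_WIDTH:
        needed = max((len(str(v)) for v in values), default=0)
        width = min(max(width, needed), MAX_INTEGER_WIDTH[kind])
    return width, decimal


print(width_for("Integer", [1, 43200]))     # (9, 0)
print(width_for("Integer", [-2147483648]))  # (11, 0)
print(width_for("Real"))                    # (24, 15)
```

Compared with a flat size of 50 for every field, these per-type defaults would store a small integer column in 9 bytes per record.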
