- feat: Add a ``--no-leading-zeroes`` option to tools that support type inference.
- feat: :doc:`/scripts/csvsql` adds a ``--engine-option`` option.
- feat: :doc:`/scripts/csvsql` adds a ``--sql-delimiter`` option, to set a different delimiter than ``;`` for the ``--query``, ``--before-insert`` and ``--after-insert`` options.
- feat: :doc:`/scripts/sql2csv` adds a ``--execution-option`` option.
- feat: :doc:`/scripts/sql2csv` uses the ``stream_results=True`` execution option, by default, to not load all data into memory at once.
- fix: :doc:`/scripts/csvsql` uses a default value of 1 for the ``--min-col-len`` and ``--col-len-multiplier`` options.
- feat: :doc:`/scripts/csvsql` adds ``--min-col-len`` and ``--col-len-multiplier`` options.
- feat: :doc:`/scripts/sql2csv` adds a ``--engine-option`` option.
- feat: Add a Docker image: ``docker pull ghcr.io/wireservice/csvkit:latest``.
- feat: Add man pages to the sdist and wheel distributions.
- fix: :doc:`/scripts/csvstat` no longer errors when a column is a time delta and ``--json`` is set.
- fix: When taking arguments from ``sys.argv`` on Windows, glob patterns, user directories, and environment variables are expanded.
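
As a rough illustration of the Windows fix above, the following Python sketch expands glob patterns, user directories, and environment variables in a list of arguments, which shells on other platforms typically do before a program starts. The function name ``expand_args`` is hypothetical, and this is an assumption-laden sketch rather than csvkit's actual implementation.

```python
import glob
import os


def expand_args(args):
    """Expand ~, environment variables, and glob patterns in each argument."""
    expanded = []
    for arg in args:
        # Expand "~" and variables such as $HOME (or %USERPROFILE% on Windows).
        arg = os.path.expandvars(os.path.expanduser(arg))
        # Replace a glob pattern with its sorted matches.
        # Keep the argument unchanged if nothing matches (e.g. an option like --json).
        matches = sorted(glob.glob(arg))
        expanded.extend(matches if matches else [arg])
    return expanded
```

Arguments that match no file, such as option names, pass through untouched, so the expansion can be applied to the whole argument list.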
This is the first major release since December 27, 2016. Thank you to all :ref:`contributors<authors>`, including 44 new contributors since 1.0.0!
Want to use csvkit programmatically? Check out agate, used internally by csvkit.
BACKWARDS-INCOMPATIBLE CHANGES:
- :doc:`/scripts/csvclean` now writes its output to standard output and its errors to standard error, instead of to ``basename_out.csv`` and ``basename_err.csv`` files. Consequently:

  - The ``--dry-run`` option is removed. The ``--dry-run`` option changed error output from the CSV format used in ``basename_err.csv`` files to a prosaic format like ``Line 1: Expected 2 columns, found 3 columns``.
  - Summary information like ``No errors.``, ``42 errors logged to basename_err.csv`` and ``42 rows were joined/reduced to 24 rows after eliminating expected internal line breaks.`` is not written.

- :doc:`/scripts/csvclean` no longer reports or fixes errors by default; it errors if no checks or fixes are enabled. Opt in to the original behavior using the ``--length-mismatch`` and ``--join-short-rows`` options. See new options below.
- :doc:`/scripts/csvclean` no longer omits rows with errors from the output. Opt in to the original behavior using the ``--omit-error-rows`` option.
- :doc:`/scripts/csvclean` joins short rows using a newline by default, instead of a space. Restore the original behavior using the ``--separator " "`` option.
In brief, to restore the original behavior for :doc:`/scripts/csvclean`:

.. code-block:: bash

   csvclean --length-mismatch --omit-error-rows --join-short-rows --separator " " myfile.csv
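
To make the length-mismatch check concrete, here is a minimal Python sketch that compares each data row with the header row and collects rows and messages separately, much as csvclean now separates standard output from standard error. It uses only the standard ``csv`` module. ``check_lengths`` is a hypothetical helper, the message wording is borrowed from the former ``--dry-run`` output quoted above, and csvclean's actual implementation and error format may differ.

```python
import csv
import io


def check_lengths(text, omit_error_rows=False):
    """Return (output rows, error messages) for rows whose length differs from the header's."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    output, errors = [header], []
    # Data rows are numbered from 1 here; this numbering is an assumption of the sketch.
    for line_number, row in enumerate(data, start=1):
        if len(row) != len(header):
            errors.append(
                f"Line {line_number}: Expected {len(header)} columns, found {len(row)} columns"
            )
            if omit_error_rows:
                # Mirror the idea of --omit-error-rows: drop the offending row from the output.
                continue
        output.append(row)
    return output, errors
```

By default the sketch keeps rows with errors in the output, matching the new default described above. Passing ``omit_error_rows=True`` drops them.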
Other changes:
- feat: :doc:`/scripts/csvclean` adds the options:

  - ``--length-mismatch``, to error on data rows that are shorter or longer than the header row
  - ``--empty-columns``, to error on empty columns
  - ``--enable-all-checks``, to enable all error reporting
  - ``--omit-error-rows``, to omit data rows that contain errors, from standard output
  - ``--label LABEL``, to add a "label" column to standard error
  - ``--header-normalize-space``, to strip leading and trailing whitespace and replace sequences of whitespace characters by a single space in the header
  - ``--join-short-rows``, to merge short rows into a single row
  - ``--separator SEPARATOR``, to change the string with which to join short rows (default is newline)
  - ``--fill-short-rows``, to fill short rows with the missing cells
  - ``--fillvalue FILLVALUE``, to change the value with which to fill short rows (default is none)

- feat: The ``--quoting`` option accepts 4 (csv.QUOTE_STRINGS) and 5 (csv.QUOTE_NOTNULL) on Python 3.12.
- feat: :doc:`/scripts/csvformat`: The ``--out-quoting`` option accepts 4 (csv.QUOTE_STRINGS) and 5 (csv.QUOTE_NOTNULL) on Python 3.12.
- fix: :doc:`/scripts/csvformat`: The ``--out-quoting`` option works with 2 (csv.QUOTE_NONUMERIC). Use the ``--locale`` option to set the locale of any formatted numbers.
- fix: :doc:`/scripts/csvclean`: The ``--join-short-rows`` option no longer reports length mismatch errors that were fixed.
- feat: Add support for Zstandard files with the ``.zst`` extension, if the ``zstandard`` package is installed.
- feat: :doc:`/scripts/csvformat` adds a ``--out-asv`` (``-A``) option to use the ASCII unit separator and record separator.
- feat: :doc:`/scripts/csvsort` adds a ``--ignore-case`` (``-i``) option to perform case-independent sorting.
- feat: :doc:`/scripts/csvpy` adds the options:

  - ``--no-number-ellipsis``, to disable the ellipsis (``…``) if max precision is exceeded, for example, when using ``table.print_table()``
  - ``--sniff-limit``
  - ``--no-inference``

- feat: :doc:`/scripts/csvpy` removes the ``--linenumbers`` and ``--zero`` output options, which had no effect.
- feat: :doc:`/scripts/in2csv` adds a ``--reset-dimensions`` option to recalculate the dimensions of an XLSX file, instead of trusting the file's metadata. csvkit's dependency agate-excel 0.4.0 automatically recalculates the dimensions if the file's metadata expresses dimensions of "A1:A1" (a single cell).
- fix: :doc:`/scripts/csvlook` only reads up to ``--max-rows`` rows instead of the entire file.
- fix: :doc:`/scripts/csvpy` supports the existing input options:

  - ``--locale``
  - ``--blanks``
  - ``--null-value``
  - ``--date-format``
  - ``--datetime-format``
  - ``--skip-lines``

- fix: :doc:`/scripts/csvpy`: ``--maxfieldsize`` no longer errors when ``--dict`` is set.
- fix: :doc:`/scripts/csvstack`: ``--maxfieldsize`` no longer errors when ``--no-header-row`` isn't set.
- fix: :doc:`/scripts/in2csv`: ``--write-sheets`` no longer errors when standard input is an XLS or XLSX file.
- Update minimum agate version to 1.6.3.
- :doc:`/scripts/csvformat` adds a ``--skip-header`` (``-E``) option to not output a header row.
- :doc:`/scripts/csvlook` adds a ``--max-precision`` option to set the maximum number of decimal places to display.
- :doc:`/scripts/csvlook` adds a ``--no-number-ellipsis`` option to disable the ellipsis (``…``) if ``--max-precision`` is exceeded. (Requires agate 1.9.0 or greater.)
- :doc:`/scripts/csvstat` supports the ``--no-inference`` (``-I``), ``--locale`` (``-L``), ``--blanks``, ``--date-format`` and ``--datetime-format`` options.
- :doc:`/scripts/csvstat` reports a "Non-null values" statistic (or a ``nonnulls`` column when ``--csv`` is set).
- :doc:`/scripts/csvstat` adds a ``--non-nulls`` option to only output counts of non-null values.
- :doc:`/scripts/csvstat` reports a "Most decimal places" statistic (or a ``maxprecision`` column when ``--csv`` is set).
- :doc:`/scripts/csvstat` adds a ``--max-precision`` option to only output the most decimal places.
- :doc:`/scripts/csvstat` adds a ``--json`` option to output results as JSON text.
- :doc:`/scripts/csvstat` adds an ``--indent`` option to indent the JSON text when ``--json`` is set.
- :doc:`/scripts/in2csv` adds a ``--use-sheet-names`` option to use the sheet names as file names when ``--write-sheets`` is set.
- feat: Add a ``--null-value`` option to commands with the ``--blanks`` option, to convert additional values to NULL.
- fix: Reconfigure the encoding of standard input according to the ``--encoding`` option, which defaults to ``utf-8-sig``. Affected users no longer need to set the ``PYTHONIOENCODING`` environment variable.
- fix: Prompt the user if additional input is expected (i.e. if no input file or piped data is provided) in :doc:`/scripts/csvjoin`, :doc:`/scripts/csvsql` and :doc:`/scripts/csvstack`.
- fix: No longer errors if a NUL byte occurs in an input file.
- Add Python 3.12 support.
- fix: :doc:`/scripts/csvjoin` uses the correct columns when performing a ``--right`` join.
- Add SQLAlchemy 2 support.
- Drop Python 3.7 support (end-of-life was June 5, 2023).
- feat: :doc:`/scripts/csvstack` handles files with columns in different orders or with different names.
- feat: :doc:`/scripts/csvsql` accepts multiple ``--query`` command-line arguments.
- feat: :doc:`/scripts/csvstat` adds ``--no-grouping-separator`` and ``--decimal-format`` options.
- Add Python 3.11 support.
- Drop Python 3.6 support (end-of-life was December 23, 2021).
- Drop Python 2.7 support (end-of-life was January 1, 2020).
- fix: :doc:`/scripts/csvcut` extracts the correct columns when ``--line-numbers`` is set.
- fix: Restore Python 2.7 support in edge cases.
- feat: Use 1024 byte sniff-limit by default across csvkit. Improve csvstat performance up to 10x.
- feat: Add support for ``.xz`` (LZMA) compressed input files.
- Add Python 3.10 support.
- Drop Python 3.5 support (end-of-life was September 30, 2020).
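
The sniff limit mentioned above bounds how much of a file is inspected to guess its CSV dialect. A minimal sketch with Python's standard ``csv.Sniffer`` follows. It assumes, for simplicity, that the limit counts characters of an in-memory string; ``sniff_dialect`` and ``SNIFF_LIMIT`` are hypothetical names, not csvkit's code.

```python
import csv
import io

SNIFF_LIMIT = 1024  # assumption: counted in characters of the sample, for simplicity


def sniff_dialect(text, limit=SNIFF_LIMIT):
    """Guess the CSV dialect from at most `limit` characters of the input."""
    return csv.Sniffer().sniff(text[:limit])


# A semicolon-delimited file that is much longer than the sniff limit.
data = "id;name\n" + "".join(f"{i};row{i}\n" for i in range(1000))
dialect = sniff_dialect(data)

# The whole file is then parsed with the dialect guessed from the small sample.
rows = list(csv.reader(io.StringIO(data), dialect))
```

Only the first part of the input is examined, so the cost of dialect detection stays constant however large the file is.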
Changes:
- :doc:`/scripts/csvstat` no longer prints "Row count: " when ``--count`` is set.
- :doc:`/scripts/csvclean`, :doc:`/scripts/csvcut`, :doc:`/scripts/csvgrep` no longer error if standard input is null.
Fixes:
- :doc:`/scripts/csvformat` creates default headers when ``--no-header-row`` is set, as documented.
- :doc:`/scripts/csvstack` no longer errors when ``--no-header-row`` is combined with ``--groups`` or ``--filenames``.
Changes:
- Drop Python 3.4 support (end-of-life was March 18, 2019).
Improvements:
- Output an error message for a memory error even if ``--verbose`` is not set.
Fixes:
- Fix regression in 1.0.4, which caused numbers like ``4.5`` to be parsed as dates.
- :doc:`/scripts/in2csv` fixes error reporting if ``--names`` is used with a non-Excel file.
Changes:
- Drop Python 3.3 support (end-of-life was September 29, 2017).
Improvements:
- :doc:`/scripts/csvsql` adds a ``--chunk-size`` option to set the chunk size when batch inserting into a table.
- csvkit is tested against Python 3.7.
Fixes:
- ``--names`` works with ``--skip-lines``.
- Dates and datetimes without punctuation can be parsed with ``--date-format`` and ``--datetime-format``.
- Error messages about column indices use 1-based numbering unless ``--zero`` is set.
- :doc:`/scripts/csvcut` no longer errors on ``--delete-empty-rows`` with short rows.
- :doc:`/scripts/csvjoin` no longer errors if given a single file.
- :doc:`/scripts/csvsql` supports UPDATE commands.
- :doc:`/scripts/csvstat` no longer errors on non-finite numbers.
- :doc:`/scripts/csvstat` respects all command-line arguments when ``--count`` is set.
- :doc:`/scripts/in2csv` CSV-to-CSV conversion respects ``--linenumbers`` when buffering.
- :doc:`/scripts/in2csv` writes XLS sheets without encoding errors in Python 2.
Improvements:
- :doc:`/scripts/csvgrep` adds a ``--any-match`` (``-a``) flag to select rows where any column matches instead of all columns.
- :doc:`/scripts/csvjson` no longer emits a property if its value is null.
- :doc:`/scripts/csvjson` adds ``--type`` and ``--geometry`` options to emit non-Point GeoJSON features.
- :doc:`/scripts/csvjson` adds a ``--no-bbox`` option to disable the calculation of a bounding box.
- :doc:`/scripts/csvjson` supports ``--stream`` for newline-delimited GeoJSON.
- :doc:`/scripts/csvsql` adds a ``--unique-constraint`` option to list names of columns to include in a UNIQUE constraint.
- :doc:`/scripts/csvsql` adds ``--before-insert`` and ``--after-insert`` options to run commands before and after the INSERT command.
- :doc:`/scripts/csvpy` reports an error message if input is provided via STDIN.
- :doc:`/scripts/in2csv` adds a ``--encoding-xls`` option to specify the encoding of the input XLS file.
- :doc:`/scripts/in2csv` supports ``--no-header-row`` on XLS and XLSX files.
- Suppress agate warning about column names not specified when using ``--no-header-row``.
- Prompt the user if additional input is expected (i.e. if no input file or piped data is provided).
- Update to agate-excel 0.2.2, agate-sql 0.5.3.
Fixes:
- :doc:`/scripts/csvgrep` accepts utf-8 arguments to the ``--match`` and ``--regex`` options in Python 2.
- :doc:`/scripts/csvjson` streams input and output only if ``--snifflimit`` is ``0``.
- :doc:`/scripts/csvsql` sets a DECIMAL's precision and scale and a VARCHAR's length to avoid dialect-specific errors.
- :doc:`/scripts/csvstack` no longer opens all files at once.
- :doc:`/scripts/in2csv` respects ``--no-header-row`` when ``--no-inference`` is set.
- :doc:`/scripts/in2csv` CSV-to-CSV conversion streams input and output only if ``--snifflimit`` is ``0``.
- :doc:`/scripts/in2csv` supports GeoJSON files with: ``geometry`` set to ``null``, missing Point ``coordinates``, altitude coordinate values.
csvkit is no longer tested on PyPy.
Improvements:
- Add a ``--version`` flag.
- Add a ``--skip-lines`` option to skip initial lines (e.g. comments, copyright notices, empty rows).
- Add a ``--locale`` option to set the locale of any formatted numbers.
- Add a ``--date-format`` option to set a strptime date format string.
- Add a ``--datetime-format`` option to set a strptime datetime format string.
- Make ``--blanks`` a common argument across all tools.
- ``-I`` is the short option for ``--no-inference``.
- :doc:`/scripts/csvclean`, :doc:`/scripts/csvformat`, :doc:`/scripts/csvjson`, :doc:`/scripts/csvpy` support ``--no-header-row``.
- :doc:`/scripts/csvclean` is faster and no longer requires exponential time in the worst case.
- :doc:`/scripts/csvformat` supports ``--linenumbers`` and ``--zero`` (no-op).
- :doc:`/scripts/csvjoin` supports ``--snifflimit`` and ``--no-inference``.
- :doc:`/scripts/csvpy` supports ``--linenumbers`` (no-op) and ``--zero`` (no-op).
- :doc:`/scripts/csvsql` adds a ``--prefix`` option to add expressions like OR IGNORE or OR REPLACE following the INSERT keyword.
- :doc:`/scripts/csvsql` adds a ``--overwrite`` flag to drop any existing table with the same name before creating.
- :doc:`/scripts/csvsql` accepts a file name for the ``--query`` option.
- :doc:`/scripts/csvsql` supports ``--linenumbers`` (no-op).
- :doc:`/scripts/csvsql` adds a ``--create-if-not-exists`` flag to not abort if the table already exists.
- :doc:`/scripts/csvstat` adds a ``--freq-count`` option to set the maximum number of frequent values to display.
- :doc:`/scripts/csvstat` supports ``--linenumbers`` (no-op).
- :doc:`/scripts/in2csv` adds a ``--names`` flag to print Excel sheet names.
- :doc:`/scripts/in2csv` adds a ``--write-sheets`` option to write the named Excel sheets to files.
- :doc:`/scripts/sql2csv` adds an ``--encoding`` option to specify the encoding of the input query file.
Fixes:
- :doc:`/scripts/csvgrep` no longer ignores common arguments if ``--linenumbers`` is set.
- :doc:`/scripts/csvjson` supports Decimal.
- :doc:`/scripts/csvpy` again supports IPython.
- :doc:`/scripts/csvsql` restores support for ``--no-constraints`` and ``--db-schema``.
- :doc:`/scripts/csvstat` no longer crashes when ``--freq`` is set.
- :doc:`/scripts/in2csv` restores support for ``--no-inference`` for Excel files.
- :doc:`/scripts/in2csv` restores support for converting Excel files from standard input.
- :doc:`/scripts/in2csv` accepts utf-8 arguments to the ``--sheet`` option in Python 2.
This is a minor release which fixes several bugs reported in the 1.0.0 release earlier this week. It also significantly improves the output of :doc:`/scripts/csvstat` and adds a ``--csv`` output option to that command.
- :doc:`/scripts/csvstat` no longer crashes when a ``Number`` column has ``None`` as a frequent value. (#738)
- :doc:`/scripts/csvlook` documents that output tables are Markdown-compatible. (#734)
- :doc:`/scripts/csvstat` adds a ``--csv`` flag for tabular output. (#584)
- :doc:`/scripts/csvstat` output is easier to read. (#714)
- :doc:`/scripts/csvpy` has a better description when using the ``--agate`` flag. (#729)
- Fix a Python 2.6 bug preventing :doc:`/scripts/csvjson` from parsing utf-8 files. (#732)
- Update required version of unittest to latest. (#727)
This is the first major release of csvkit in a very long time. The entire backend has been rewritten to leverage the agate data analysis library, which was itself inspired by csvkit. The new backend provides better type detection accuracy, as well as some new features.
Because of the long and complex cycle behind this release, the list of changes should not be considered exhaustive. In particular, the output format of some tools may have changed in small ways. Any existing data pipelines using csvkit should be tested as part of the upgrade.
Much of the credit for this release goes to James McKinney, who has almost single-handedly kept the csvkit fire burning for a year. Thanks, James!
Backwards-incompatible changes:
- :doc:`/scripts/csvjoin` renames duplicate columns with integer suffixes to prevent collisions in output.
- :doc:`/scripts/csvsql` generates ``DateTime`` columns instead of ``Time`` columns.
- :doc:`/scripts/csvsql` generates ``Decimal`` columns instead of ``Integer``, ``BigInteger``, and ``Float`` columns.
- :doc:`/scripts/csvsql` no longer generates max-length constraints for text columns.
- The ``--doublequote`` long flag is gone, and the ``-b`` short flag is an alias for ``--no-doublequote``.
- When using the ``--columns`` or ``--not-columns`` options, you must not have spaces around the comma-separated values, unless the column names contain spaces.
- When sorting, null values are greater than other values instead of less than.
- ``CSVKitReader``, ``CSVKitWriter``, ``CSVKitDictReader``, and ``CSVKitDictWriter`` have been removed. Use ``agate.csv.reader``, ``agate.csv.writer``, ``agate.csv.DictReader`` and ``agate.csv.DictWriter``.
- Drop Python 2.6 support (end-of-life was October 29, 2013).
- Drop support for older versions of PyPy.
- If ``--no-header-row`` is set, the output has column names ``a``, ``b``, ``c``, etc. instead of ``column1``, ``column2``, ``column3``, etc.
- csvlook renders a simpler, Markdown-compatible table.
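
The change in null ordering listed above can be pictured with a small Python sketch: a sort key that places ``None`` after every other value. This mirrors the described behavior rather than csvkit's or agate's internal code, and ``nulls_last`` is a hypothetical name.

```python
def nulls_last(value):
    """Sort key that orders None after all non-null values."""
    # Tuples compare element by element: False (non-null) sorts before True (null).
    # The second element only matters when comparing two non-null values.
    return (value is None, 0 if value is None else value)


values = [3, None, 1, 2]
ordered = sorted(values, key=nulls_last)
```

Pipelines that relied on nulls appearing first in sorted output would see them move to the end under this ordering.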
Improvements:
- csvkit is tested against Python 3.6. (#702)
- ``import csvkit as csv`` defers to agate readers/writers.
- :doc:`/scripts/csvgrep` supports ``--no-header-row``.
- :doc:`/scripts/csvjoin` supports ``--no-header-row``.
- :doc:`/scripts/csvjson` streams input and output if the ``--stream`` and ``--no-inference`` flags are set.
- :doc:`/scripts/csvjson` supports ``--snifflimit`` and ``--no-inference``.
- :doc:`/scripts/csvlook` adds ``--max-rows``, ``--max-columns`` and ``--max-column-width`` options.
- :doc:`/scripts/csvlook` supports ``--snifflimit`` and ``--no-inference``.
- :doc:`/scripts/csvpy` supports ``--agate`` to read a CSV file into an agate table.
- ``csvsql`` supports custom SQLAlchemy dialects.
- :doc:`/scripts/csvstat` supports ``--names``.
- :doc:`/scripts/in2csv` CSV-to-CSV conversion streams input and output if the ``--no-inference`` flag is set.
- :doc:`/scripts/in2csv` CSV-to-CSV conversion uses ``agate.Table``.
- :doc:`/scripts/in2csv` GeoJSON conversion adds columns for geometry type, longitude and latitude.
- Documentation: Update tool usage, remove shell prompts, document connection string, correct typos.
Fixes:
- Fixed numerous instances of open files not being closed before utilities exit.
- Change ``-b``, ``--doublequote`` to ``--no-doublequote``, as doublequote is True by default.
- :doc:`/scripts/in2csv` DBF conversion works with Python 3.
- :doc:`/scripts/in2csv` correctly guesses format when file has an uppercase extension.
- :doc:`/scripts/in2csv` correctly interprets ``--no-inference``.
- :doc:`/scripts/in2csv` again supports nested JSON objects (fixes regression).
- :doc:`/scripts/in2csv` with ``--format geojson`` prints a JSON object instead of ``OrderedDict([(...)])``.
- :doc:`/scripts/csvclean` with standard input works on Windows.
- :doc:`/scripts/csvgrep` returns the input file's line numbers if the ``--linenumbers`` flag is set.
- :doc:`/scripts/csvgrep` can match multiline values.
- :doc:`/scripts/csvgrep` correctly operates on ragged rows.
- :doc:`/scripts/csvsql` correctly escapes ``%`` characters in SQL queries.
- :doc:`/scripts/csvsql` adds standard input only if explicitly requested.
- :doc:`/scripts/csvstack` supports stacking a single file.
- :doc:`/scripts/csvstat` always reports frequencies.
- The ``any_match`` argument of ``FilteringCSVReader`` works correctly.
- All tools handle empty files without error.
- Add Antonio Lima to AUTHORS.
- Add support for ndjson. (#329)
- Add missing docs for csvcut -C. (#227)
- Reorganize docs so TOC works better. (#339)
- Render docs locally with RTD theme.
- Fix header in "tricks" docs.
- Add install instructions to tutorial. (#331)
- Add killer examples to doc index. (#328)
- Reorganize doc index.
- Fix broken csvkit module documentation. (#327)
- Fix version of openpyxl to work around encoding issue. (#391, #288)
- Write missing sections of the tutorial. (#32)
- Remove -q arg from sql2csv (conflicts with common flag).
- Fix csvjoin in the case where rows in the left dataset do not have all columns.
- Rewrote tutorial based on LESO data. (#324)
- Don't error in csvjson if lat/lon columns are null. (#326)
- Maintain field order in output of csvjson.
- Add unit test for json in2csv. (#77)
- Maintain key order when converting JSON into CSV. (#325)
- Upgrade python-dateutil to version 2.2 (#304)
- Fix sorting of columns with null values. (#302)
- Added release documentation.
- Fill out short rows with null values. (#313)
- Fix unicode output for csvlook and csvstat. (#315)
- Add documentation for --zero. (#323)
- Fix Integrity error when inserting zero rows in database with csvsql. (#299)
- Add Michael Mior to AUTHORS. (#305)
- Add --count option to CSVStat.
- Implement csvformat.
- Fix bug causing CSVKitDictWriter to output 'utf-8' for blank fields.
- Add pnaimoli to AUTHORS.
- Fix column specification in csvstat. (#236)
- Added "Tips and Tricks" documentation. (#297, #298)
- Add Espartaco Palma to AUTHORS.
- Remove unnecessary enumerate calls. (#292)
- Deprecated DBF support for Python 3+.
- Add support for Python 3.3 and 3.4 (#239)
- Fix date handling with openpyxl > 2.0 (#285)
- Add Kristina Durivage to AUTHORS. (#243)
- Added Richard Low to AUTHORS.
- Support SQL queries "directly" on CSV files. (#276)
- Add Tasneem Raja to AUTHORS.
- Fix off-by-one error in open ended column ranges. (#238)
- Add Matt Pettis to AUTHORS.
- Add line numbers flag to csvlook (#244)
- Only install argparse for Python < 2.7. (#224)
- Add Diego Rabatone Oliveira to AUTHORS.
- Add Ryan Murphy to AUTHORS.
- Fix DBF dependency. (#270)
- Fix CHANGELOG for release.
- Fix homepage url in setup.py.
- Fix XLSX datetime normalization bug. (#223)
- Add raistlin7447 to AUTHORS.
- Merged sql2csv utility (#259).
- Add Jeroen Janssens to AUTHORS.
- Validate csvsql DB connections before parsing CSVs. (#257)
- Clarify install process for Ubuntu. (#249)
- Clarify docs for --escapechar. (#242)
- Make
import csvkit
API compatible withimport csv
. - Update Travis CI link. (#258)
- Add Sébastien Fievet to AUTHORS.
- Use case-sensitive name for SQLAlchemy (#237)
- Add Travis Swicegood to AUTHORS.
- Fix CHANGELOG for release.
- Add Chris Rosenthal to AUTHORS.
- Fix multi-file input to csvsql. (#193)
- Passing --snifflimit=0 to disable dialect sniffing. (#190)
- Add aarcro to the AUTHORS file.
- Improve performance of csvgrep. (#204)
- Add Matt Dudys to AUTHORS.
- Add support for --skipinitialspace. (#201)
- Add Joakim Lundborg to AUTHORS.
- Add --no-inference option to in2csv and csvsql. (#206)
- Add Federico Scrinzi to AUTHORS file.
- Add --no-header-row to all tools. (#189)
- Fix csvstack blowing up on empty files. (#209)
- Add Chris Rosenthal to AUTHORS file.
- Add --db-schema option to csvsql. (#216)
- Add Shane StClair to AUTHORS file.
- Add --no-inference support to csvsort. (#222)
- Implement geojson support in csvjson. (#159)
- Optimize writing of eight bit codecs. (#175)
- Created csvpy. (#44)
- Support --not-columns for excluding columns. (#137)
- Add Jan Schulz to AUTHORS file.
- Add Windows scripts. (#111, #176)
- csvjoin, csvsql and csvstack no longer hold open all files. (#178)
- Added Noah Hoffman to AUTHORS.
- Make csvlook output compatible with emacs table markup. (#174)
- Add Derek Wilson to AUTHORS.
- Add Kevin Schaul to AUTHORS.
- Add DBF support to in2csv. (#11, #160)
- Support --zero option for zero-based column indexing. (#144)
- Support mixing nulls and blanks in string columns.
- Add --blanks option to csvsql. (#149)
- Add multi-file (glob) support to csvsql. (#146)
- Add Gregory Temchenko to AUTHORS.
- Add --no-create option to csvsql. (#148)
- Add Anton Ian Sipos to AUTHORS.
- Fix broken pipe errors. (#150)
- Begin CHANGELOG (a bit late, I'll admit).