Second edition: rationale, changes, outline, and feedback #101
Comments
@jeroenjanssens it's really a great thing, thank you very much. A note about your "scrape": the only real problem for me was that it did not work on Python 3, and for this reason I built a CLI based on it (so I do not have to think about the environment). You are right, pup is faster and easier to install, but you cannot run XPath queries with it. I think that if you must use a CLI tool to query HTML pages, you need something that can run both CSS selectors and XPath queries, like your GREAT scrape.
Thank you @aborruso! For the second edition I would like to only use tools which can be installed easily through some package manager. So to address your point, I guess we could do two things:
What do you think?
Off the top of my head: I remember that one of the reviewers of the first edition of this book on goodreads.com wrote that he very much liked your introduction to GNU parallel. That was supposedly a highlight of the book. So maybe split chapter 8 into two chapters: one about parallel processing on localhost, and one about parallelization on cloud platforms.
Dear @jeroenjanssens, both are very good points. But unfortunately I'm above all an end user and not a Python or Go developer. I built the CLI version of scrape using another utility :) If there were a scrape package, it would become a tool that can be installed easily through a package manager. Once again, thank you.
Hmmm, maybe something about model deployment? Not sure how it fits into the command-line but some buzzwords to think about in Deep Learning, Optimization, RL?
Thanks for your book. I suggest you consider switching to printf instead of echo in the second edition, though. It seems to be more stable. I spent a while trying to figure out why
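The comment above is cut off before its own example, so the following is only an assumed illustration of the kind of surprise it may be referring to, not the commenter's case. It is a minimal sketch showing that echo can swallow a value that looks like an option, while printf with a `%s` format prints it verbatim. The variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: echo may interpret its argument as an option; printf does not
# when the value is passed as data to a %s format.

value='-n'   # a data value that happens to look like an echo option

# The builtin echo in bash (and dash) treats "-n" as "suppress newline"
# and prints nothing at all.
echo_out=$(echo "$value")

# printf treats the value as plain data for the %s placeholder.
printf_out=$(printf '%s\n' "$value")

printf 'echo gave:   [%s]\n' "$echo_out"
printf 'printf gave: [%s]\n' "$printf_out"
```

Keeping the format string fixed and passing the data as a separate argument is what makes printf predictable here.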
I'm here to advocate for more SQLite coverage. SQLite is a fantastic tool for command-line data science, because it gives you a full relational database without needing to run a PostgreSQL or MySQL server anywhere - each database exists as a single file on disk. My sqlite-utils tool ( While I'm here I'll plug Datasette too ( (Originally discussed on Twitter)
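To make the "single file on disk, no server" point concrete, here is a minimal sketch that uses only the standard sqlite3 command-line shell (not sqlite-utils or Datasette). It assumes sqlite3 is on the PATH. The table, column, and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: create and query a one-file relational database from the shell.
# Assumes the sqlite3 command-line shell is installed.

workdir=$(mktemp -d)
db="$workdir/demo.db"   # hypothetical database file

if command -v sqlite3 >/dev/null 2>&1; then
  # Create a table and load a few rows; the database is just this one file.
  sqlite3 "$db" <<'SQL'
CREATE TABLE tips (day TEXT, bill REAL);
INSERT INTO tips VALUES ('Sat', 20.0), ('Sat', 30.0), ('Sun', 10.0);
SQL
  # Run an aggregate query directly from the shell.
  # The default output format separates columns with "|".
  result=$(sqlite3 "$db" 'SELECT day, avg(bill) FROM tips GROUP BY day ORDER BY day;')
else
  result='sqlite3 not installed'
fi

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because the query result is plain text on standard output, it can be piped into any other command-line tool.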
Here is also a list of command-line related tools for further reading.
Thanks for sharing the link. I appreciate your hard work. Hope to learn a lot.
I'm happy to announce that I'll be writing the second edition of Data Science at the Command Line (O'Reilly, 2014). This issue explains why I think a second edition is needed, lists what changes I plan to make, and presents a tentative outline. Finally, I have a few words about the process and giving feedback.
Why a second edition?
While the command line as a technology and as a way of working is timeless, some of the tools discussed in the first edition have either: (1) been superseded by newer tools (e.g., csvkit has been replaced by xsv), (2) been abandoned by their developers (e.g., drake), or (3) been suboptimal choices (e.g., weka). Since the first edition was published in October 2014 I have learned a lot, either through my own experience or through the useful feedback from its readers. Even though the book is quite niche because it lies at the intersection of two subjects, there remains a steady interest from the data science community. I notice this from the many positive messages I receive almost every day. By updating the first edition I hope to keep the book relevant for at least another five years.
Changes with respect to the first edition
These are the general changes I currently have in mind. Please note that this is subject to change.
Replace csvkit with xsv as much as possible. xsv is a much faster alternative to working with CSV files.
Use xmlstarlet for working with XML.
Use pup instead of scrape to work with HTML. scrape is a Python tool I created myself. pup is much faster, has more features, and is easier to install.
Replace Rio with littler. Rio is a Bash script I created myself. littler is a much more stable way of using R from the command line and is easier to install.
Book outline
In the tentative outline below, 🆕 indicates added and ❌ indicates removed chapters and sections with respect to the first edition.
Drake → Make 🆕
csvstat → Using xsv stat 🆕
using Rio
10 → 11 Conclusion
Feedback
In the past five years I have received a lot of valuable feedback in the form of emails, tweets, book reviews, errata submitted to O'Reilly, GitHub issues, and even pull requests. I love this. It has only made the book better.
O'Reilly has graciously given me permission to make the source of the second edition available on GitHub and an HTML version available on https://www.datascienceatthecommandline.com under a Creative Commons Attribution-NoDerivatives 4.0 International License from the start. That's fantastic because this way, I'll be able to receive feedback during the entire journey, which will make the book even better.
And feedback is, as always, very much appreciated. This can be anything ranging from a typo to a command-line tool or trick that might be of interest to others. If you have any ideas, suggestions, questions, criticism, or compliments, then I would love to hear from you. You may reply to this particular issue, create a new issue, tweet me at @jeroenhjanssens, or email me; use whichever medium you prefer.
Thank you.
Best wishes,
Jeroen