English | 2014 | ISBN: 978-1-491-94785-2 | 212 Pages | PDF | 10 MB
This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data. To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
- Obtain data from websites, APIs, databases, and spreadsheets
- Perform scrub operations on plain text, CSV, HTML/XML, and JSON
- Explore data, compute descriptive statistics, and create visualizations
- Manage your data science workflow using Drake
- Create reusable tools from one-liners and existing Python or R code
- Parallelize and distribute data-intensive pipelines using GNU Parallel
- Model data with dimensionality reduction, clustering, regression, and classification algorithms
If your data requires more functionality than is offered by (a combination of) these command-line tools, you can use csvsql, a command-line tool that allows you to perform SQL queries directly on CSV files. And remember, if after reading this chapter you still need more flexibility, you’re free to use R, Python, or whatever programming language you prefer.
The command-line tools will be introduced on a need-to-use basis. You’ll notice that sometimes the same command-line tool can perform multiple operations and, vice versa, that multiple command-line tools can perform the same operation. This chapter is structured more like a cookbook, with the focus on the problems (or recipes) rather than on the command-line tools themselves.
In this chapter, you’ll learn how to:
- Convert data from one format to another
- Apply SQL queries to CSV
- Filter lines
- Extract and replace values
- Split, merge, and extract columns
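To give a taste of these operations before we treat them in detail, here is a quick sketch using classic tools on a throwaway file (the file name and contents are made up for illustration):

```shell
# Create some example colon-delimited plain-text data.
printf 'alice:30:amsterdam\nbob:25:berlin\ncarol:35:cairo\n' > /tmp/people.txt

# Filter lines: keep only lines containing "berlin".
grep 'berlin' /tmp/people.txt

# Extract a column: the second colon-delimited field.
cut -d: -f2 /tmp/people.txt

# Replace values: change the delimiter to a comma.
sed 's/:/,/g' /tmp/people.txt

# Extract several columns at once: fields 1 and 3.
cut -d: -f1,3 /tmp/people.txt
```

Each command reads plain text and writes plain text, which is exactly what makes these tools so easy to chain together with pipes.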
Common Scrub Operations for Plain Text
In this section we describe common scrubbing operations for plain text. Formally, plain text refers to a sequence of human-readable characters and, optionally, some specific types of control characters (e.g., a tab or a newline; for more information, see: www.linfo.org/plain_text.html). Examples include: ebooks, emails, logfiles, and source code.
For the purpose of this book, we assume that the plain text contains some data, and that it has no clear tabular structure (like the CSV format) or nested structure (like the JSON and HTML/XML formats). We discuss those formats later in this chapter. Although these operations can also be applied to CSV, JSON, and HTML/XML formats, keep in mind that the tools treat the data as plain text.
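For example, a few typical plain-text scrub steps (lowercasing, dropping blank lines, and deduplicating) can be chained into a single pipeline; the file name and contents here are made up for illustration:

```shell
# Create a small example plain-text file with inconsistent casing,
# a blank line, and a duplicate entry.
printf 'Foo\n\nBAR\nfoo\n' > /tmp/log.txt

# Lowercase everything, remove blank lines, then sort and deduplicate.
tr '[:upper:]' '[:lower:]' < /tmp/log.txt | grep -v '^$' | sort -u
```

Note that none of these tools know or care what the data means; they simply transform one stream of characters into another.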
Working with HTML/XML and JSON
As we saw in Chapter 3, our obtained data can come in a variety of formats. The most common ones are plain text, CSV, JSON, and HTML/XML. In this section, we’re going to demonstrate a couple of command-line tools that can convert our data from one format to another. There are two reasons to convert data.
First, oftentimes the data needs to be in tabular form, just like a database table or a spreadsheet, because many visualization and machine-learning algorithms depend on it. CSV is inherently in tabular form, but JSON and HTML/XML data can have a deeply nested structure.
Second, many command-line tools, especially the classic ones such as cut and grep, operate on plain text. This is because text is regarded as a universal interface between command-line tools. Moreover, the other formats are simply younger. Each of these formats can be treated as plain text, allowing us to apply such command-line tools to the other formats as well.
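For instance, because JSON is just a stream of characters, a classic tool like grep can be pointed at it directly, even though grep knows nothing about the format (the file name and data below are made up for illustration):

```shell
# Create a small example file with one JSON object per line.
printf '{"name": "alice", "city": "amsterdam"}\n{"name": "bob", "city": "berlin"}\n' > /tmp/people.json

# grep treats the JSON as plain lines of text and matches on the raw characters.
grep '"city": "berlin"' /tmp/people.json
```

This works well for quick filtering, but for anything that depends on the nested structure of the data, a format-aware tool is the safer choice.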