The 12 best Data engineering books of all time

A data-backed answer



There are countless lists on the internet claiming to be the list of must-read data engineering books, and it seemed that all those lists recommended the same books, minus two or three odd choices.

Finding good resources for learning programming is always tricky. Everyone has their own opinion about which book is the best to learn from, and as we say in French, “Colors and tastes should not be argued about”.

However, I thought it would be interesting to trust the wisdom of the crowd and find the books that appeared most often in those “Best Data Engineering Books” lists.

If you want to jump straight to the results, take a look at the full list below. If you want to learn about the methodology, bear with me.

I simply asked Google a few queries like “Best Data engineering Books” and variations of it. I then scraped all those pages (using ScrapingBee, a web scraping API I’m working on).

I deduplicated the links and ended up with nearly 114 links. Using the page titles, I was also able to quickly discard the irrelevant ones.
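The link deduplication mentioned above is not described in detail, so the following is only a minimal sketch of one way to do it with the Python standard library. The normalization rules (lowercasing the host, dropping fragments, trailing slashes and `utm_` tracking parameters) and the function names `canonical_url` and `dedupe_links` are assumptions made for illustration, not the rules actually used for this list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Return a comparison key for a URL (illustrative rules, see lead-in)."""
    parts = urlsplit(url.strip())
    # Drop tracking parameters such as utm_source; keep the others in order.
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    # Treat "/page" and "/page/" as the same resource.
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe_links(links):
    """Keep the first occurrence of each canonical URL, preserving order."""
    seen = set()
    unique = []
    for link in links:
        key = canonical_url(link)
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique
```

Keeping the first original URL rather than the canonical key means the scraper later requests a link exactly as the search results returned it.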

I ended up with almost 104 HTML files. I opened each file in my browser, opened the Chrome inspector, and found and wrote down the CSS selector matching the book titles in the article. This took me around 1 hour, almost 30 seconds per page.

This also allowed me to discard even more irrelevant pages, and I discarded a lot. In the end, I compiled around 61 lists into this one.

Book titles were then extracted with a mix of manual extraction and web scraping.
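To make the per-page extraction step more concrete, here is a small sketch using only the Python standard library. It is a stand-in for a simple CSS selector of the form `tag.class` (for example `h3.book-title`, which is a made-up selector), not the tooling actually used for this list. The class `TitleExtractor` and the helper `extract_titles` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <tag> element carrying the given class."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag = tag
        self.cls = cls
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text fragments of the current match
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements with the same tag name so that the
            # closing tag of the outer match is identified correctly.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace left over from the markup.
                self.titles.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_titles(html, tag, cls):
    """Return the text of all elements matching the selector tag.cls."""
    parser = TitleExtractor(tag, cls)
    parser.feed(html)
    parser.close()
    return parser.titles
```

In this sketch each saved page would be paired with its own `(tag, cls)` pair, which mirrors writing one selector per page by hand.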

I ended up with a huge list of books, not usable without some post-processing.

To find the most quoted Data engineering books I needed to normalize my results.

I had to deal with all the different variations like “{title} by {author}” or “{title} - {author}”.

Or “{title}:{subtitle}” and “{title}”, or even all the ones containing an edition number.
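The variations listed above can be collapsed with a few regular expressions so that the same book is tallied under one key. The sketch below is an assumption about how such a normalization could look, not the exact cleaning applied here (which also involved manual work). The names `normalize_title` and `count_recommendations` are hypothetical, and a title that itself contains " by " or a colon would be cut too early.

```python
import re
from collections import Counter

# Matches suffixes such as ", 2nd Edition" or " (First Edition)".
EDITION = re.compile(
    r"[\s,(\[]*\b(?:\d+(?:st|nd|rd|th)|first|second|third|fourth|fifth)"
    r"\s+edition\b[\s)\]]*",
    re.IGNORECASE,
)
# Matches the separator in "{title} by {author}" and "{title} - {author}".
AUTHOR = re.compile(r"\s+(?:by|-)\s+", re.IGNORECASE)


def normalize_title(raw):
    """Reduce a scraped title to a lowercase key without author or subtitle."""
    text = EDITION.sub(" ", raw)              # drop the edition number
    text = text.split(":", 1)[0]              # drop the subtitle
    text = AUTHOR.split(text, maxsplit=1)[0]  # drop the author
    return " ".join(text.split()).casefold()


def count_recommendations(lists):
    """Count in how many lists each normalized title appears."""
    counts = Counter()
    for titles in lists:
        # A set ensures a book is counted once per list, even if repeated.
        counts.update({normalize_title(t) for t in titles})
    return counts
```

Calling `count_recommendations(lists).most_common(12)` on the scraped lists would then give a ranking of the kind shown further down.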

After quite a bit of manual cleaning, my list looked like this:

From there it was easy to compute the most recommended books. You can find all the data used to process this list on this repo. Now let’s take a look at the list:

I've also recently used some data from different booksellers in order not to forget important books and to give more weight to books with incredible reviews.



Database Reliability Engineering: Designing and Operating Resilient Database Systems

Laine Campbell & Charity Majors
The infrastructure-as-code revolution in IT is also affecting database administration. With this practical book, developers, system administrators, and junior to mid-level DBAs will learn how the modern practice of site reliability engineering applies to the craft of database architecture and operations.

Authors Laine Campbell and Charity Majors provide a framework for professionals looking to join the ranks of today’s database reliability engineers (DBRE). You’ll begin by exploring core operational concepts that DBREs need to master.

Then you’ll examine a wide range of database persistence options, including how to implement key technologies to provide resilient, scalable, and performant data storage and retrieval. With a firm foundation in database reliability engineering, you’ll be ready to dive into the architecture and operations of any modern database.

Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists

Alice Zheng & Amanda Casari
Feature engineering is a crucial step in the machine-learning pipeline, yet this topic is rarely examined on its own. With this practical book, you’ll learn techniques for extracting and transforming features—the numeric representations of raw data—into formats for machine-learning models.

Each chapter guides you through a single data problem, such as how to represent text or image data. Together, these examples illustrate the main principles of feature engineering.

Rather than simply teach these principles, authors Alice Zheng and Amanda Casari focus on practical application with exercises throughout the book. The closing chapter brings everything together by tackling a real-world, structured dataset with several feature-engineering techniques.

Python packages including NumPy, Pandas, Scikit-learn, and Matplotlib are used in code examples. You’ll examine:
- Feature engineering for numeric data: filtering, binning, scaling, log transforms, and power transforms
- Natural text techniques: bag-of-words, n-grams, and phrase detection
- Frequency-based filtering and feature scaling for eliminating uninformative features
- Encoding techniques of categorical variables, including feature hashing and bin-counting
- Model-based feature engineering with principal component analysis
- The concept of model stacking, using k-means as a featurization technique
- Image feature extraction with manual and deep-learning techniques

Build a Career in Data Science

Emily Robinson & Jacqueline Nolis
You are going to need more than technical knowledge to succeed as a data scientist. Build a Career in Data Science teaches you what school leaves out, from how to land your first job to the lifecycle of a data science project, and even how to become a manager.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

Table of Contents:

PART 1 - GETTING STARTED WITH DATA SCIENCE
1. What is data science?
2. Data science companies
3. Getting the skills
4. Building a portfolio

PART 2 - FINDING YOUR DATA SCIENCE JOB
5. The search: Identifying the right job for you
6. The application: Résumés and cover letters
7. The interview: What to expect and how to handle it
8. The offer: Knowing what to accept

PART 3 - SETTLING INTO DATA SCIENCE
9. The first months on the job
10. Making an effective analysis
11. Deploying a model into production
12. Working with stakeholders

PART 4 - GROWING IN YOUR DATA SCIENCE ROLE
13. When your data science project fails
14. Joining the data science community
15. Leaving your job gracefully
16. Moving up the ladder

Machine Learning Engineering

Andriy Burkov
From the author of a world bestseller published in eleven languages, The Hundred-Page Machine Learning Book, this new book by Andriy Burkov is the most complete applied AI book out there. It is filled with best practices and design patterns of building reliable machine learning solutions that scale.

Andriy Burkov has a Ph.D. in AI and is the leader of a machine learning team at Gartner.

This book is based on Andriy's own 15 years of experience in solving problems with AI as well as on the published experience of the industry leaders. Here's what Cassie Kozyrkov, Chief Decision Scientist at Google, says about the book in the Foreword: "You're looking at one of the few true Applied Machine Learning books out there.

That's right, you found one! A real applied needle in the haystack of research-oriented stuff. Excellent job, dear reader... unless what you were actually looking for is a book to help you learn the skills to design general-purpose algorithms, in which case I hope the author won't be too upset with me for telling you to flee now and go pick up pretty much any other machine learning book. This one is different." [...] "So, what's in [...] the book? The machine learning equivalent of a bumper guide to innovating in recipes to make food at scale.

Since you haven't read the book yet, I'll put it in culinary terms: you'll need to figure out what's worth cooking / what the objectives are (decision-making and product management), understand the suppliers and the customers (domain expertise and business acumen), how to process ingredients at scale (data engineering and analysis), how to try many different ingredient-appliance combinations quickly to generate potential recipes (prototype phase ML engineering), how to check that the quality of the recipe is good enough to serve (statistics), how to turn a potential recipe into millions of dishes served efficiently (production phase ML engineering), and how to ensure that your dishes stay top-notch even if the delivery truck brings you a ton of potatoes instead of the rice you ordered (reliability engineering). This book is one of the few to offer perspectives on each step of the end-to-end process." [...] "One of my favorite things about this book is how fully it embraces the most important thing you need to know about machine learning: mistakes are possible... and sometimes they hurt. As my colleagues in site reliability engineering love to say, "Hope is not a strategy." Hoping that there will be no mistakes is the worst approach you can take.

This book does so much better. It promptly shatters any false sense of security you were tempted to have about building an AI system that is more "intelligent" than you are.

(Um, no. Just no.) Then it diligently takes you through a survey of all kinds of things that can go wrong in practice and how to prevent/detect/handle them.

This book does a great job of outlining the importance of monitoring, how to approach model maintenance, what to do when things go wrong, how to think about fallback strategies for the kinds of mistakes you can't anticipate, how to deal with adversaries who try to exploit your system, and how to manage the expectations of your human users (there's also a section on what to do when your, er, users are machines). These are hugely important topics in practical machine learning, but they're so often neglected in other books.

Not here." "If you intend to use machine learning to solve business problems at scale, I'm delighted you got your hands on this book. Enjoy!"

Data Science For Dummies (For Dummies (Computers))

Lillian Pierson & Jake Porway
Discover how data science can help you gain in-depth insight into your business - the easy way! Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles. Data Science For Dummies is the perfect starting point for IT professionals and students who want a quick primer on all areas of the expansive data science space.

With a focus on business cases, the book explores topics in big data, data science, and data engineering, and how these three areas are combined to produce tremendous value. If you want to pick up the skills you need to begin a new career or initiate a new project, reading this book will help you understand which technologies, programming languages, and mathematical methods to focus on.

While this book serves as a wildly fantastic guide through the broad, sometimes intimidating field of big data and data science, it is not an instruction manual for hands-on implementation. Here’s what to expect:
- Provides a background in big data and data engineering before moving on to data science and how it's applied to generate value
- Includes coverage of big data frameworks like Hadoop, MapReduce, Spark, MPP platforms, and NoSQL
- Explains machine learning and many of its algorithms as well as artificial intelligence and the evolution of the Internet of Things
- Details data visualization techniques that can be used to showcase, summarize, and communicate the data insights you generate

It's a big, big data world out there. Let Data Science For Dummies help you harness its power and gain a competitive edge for your organization.

Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems)

Jiawei Han & Micheline Kamber & Jian Pei

Python for Finance: Mastering Data-Driven Finance

Yves Hilpisch
The financial industry has recently adopted Python at a tremendous rate, with some of the largest investment banks and hedge funds using it to build core trading and risk management systems. Updated for Python 3, the second edition of this hands-on book helps you get started with the language, guiding developers and quantitative analysts through Python libraries and tools for building financial applications and interactive financial analytics.

Using practical examples throughout the book, author Yves Hilpisch also shows you how to develop a full-fledged framework for Monte Carlo simulation-based derivatives and risk analytics, based on a large, realistic case study. Much of the book uses interactive IPython Notebooks.

Database Internals: A Deep Dive into How Distributed Data Systems Work

Alex Petrov
When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it’s often difficult to understand what each one offers and how they differ.

With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals. Throughout the book, you’ll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases.

These resources are listed at the end of parts one and two. You’ll discover that the most significant distinctions among many modern databases reside in subsystems that determine how storage is organized and how data is distributed.

Python Data Science Handbook: Essential Tools for Working with Data

Jake VanderPlas
For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all—IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools.

Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL

Walter Shields
"THE BEST SQL BOOK FOR BEGINNERS IN 2021 - HANDS DOWN!" *INCLUDES FREE ACCESS TO A SAMPLE DATABASE, SQL BROWSER APP, COMPREHENSION QUIZZES & SEVERAL OTHER DIGITAL RESOURCES!* *| #1 NEW RELEASE & #1 BEST SELLER |* Not sure how to prepare for the data-driven future? This book shows you EXACTLY what you need to know to successfully use the SQL programming language to enhance your career! Are you a developer who wants to expand your mastery to database management? Then you NEED this book. Buy now and start reading today! Are you a project manager who needs to better understand your development team’s needs? A decision maker who needs to make deeper data-driven analysis? Everything you need to know is included in these pages! The ubiquity of big data means that now more than ever there is a burning need to warehouse, access, and understand the contents of massive databases quickly and efficiently.

That’s where SQL comes in. SQL is the workhorse programming language that forms the backbone of modern data management and interpretation.

Any database management professional will tell you that despite trendy data management languages that come and go, SQL remains the most widely used and most reliable to date, with no signs of stopping. In this comprehensive guide, experienced mentor and SQL expert Walter Shields draws on his considerable knowledge to make the topic of relational database management accessible, easy to understand, and highly actionable.

SQL QuickStart Guide is ideal for those seeking to increase their job prospects and enhance their careers, for developers looking to expand their programming capabilities, or for anyone who wants to take advantage of our inevitably data-driven future, even with no prior coding experience!

SQL QuickStart Guide Is For:
- Professionals looking to augment their job skills in preparation for a data-driven future
- Job seekers who want to pad their skills and resume for a durable employability edge
- Beginners with zero prior experience
- Managers, decision makers, and business owners looking to manage data-driven business insights
- Developers looking to expand their mastery beyond the full stack
- Anyone who wants to be better prepared for our data-driven future!

In SQL QuickStart Guide You'll Discover:
- The basic structure of databases: what they are, how they work, and how to successfully navigate them
- How to use SQL to retrieve and understand data no matter the scale of a database (aided by numerous images and examples)
- The most important SQL queries, along with how and when to use them for best effect
- Professional applications of SQL and how to “sell” your new SQL skills to your employer, along with other career-enhancing considerations

Lifetime access to SQL resources: Each book comes with free lifetime access to tons of exclusive online resources to help you master SQL, such as workbooks, cheat sheets and reference guides.

Giving back: ClydeBank Media proudly supports the non-profit AdoptAClassroom whose mission is to advance equity in K-12 education by supplementing dwindling school funding for vital classroom materials and resources.

Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

Wes McKinney
Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.6, the second edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively.

You’ll learn the latest versions of pandas, NumPy, IPython, and Jupyter in the process. Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python.

It’s ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Martin Kleppmann
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability.

In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data.

Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.


I hope that you liked this list. Please do not hesitate to check out the other ones I've published.
