Data Science Industry in Plain English



This is a concise booklet on the data science industry. It explores large companies like Google as well as startups like DataBricks, delving into their technical contributions and strategic decisions. It includes charts and a few illustrations of technical frameworks like MapReduce.


From the Introduction

The recent evolution in data science has been led by some key individuals and organizations. However, with the flood of information surrounding the field, it can be difficult to grasp the key trends and to tell which technologies and initiatives are interesting. One way to make sense of the trends is to look at these companies, in chronological order where it makes sense, and at the relevant contributions each has made. Sometimes the contributions are purely technical, like when Google published the MapReduce paper. Other times the significance is strategic, like when Cloudera built an industry out of an open-source technology. Coca-Cola was able to make a multi-billion-dollar industry from a secret recipe for Coke, but Cloudera made an industry out of a technology whose recipe is out there for anybody to see and use.

Who this booklet is for

If you’re an aspiring data scientist, this booklet will tell you about the companies and technologies you should know about, and highlight a few potential interview questions.
If you’re a hiring manager, you may find things to look for in resumes of data scientists or where to target your sourcing efforts.
If you’re an investor or an analyst, you may better understand the underlying technologies and the reasoning behind some of the strategic decisions of these companies.
If you’re a concerned parent of a child who just came home from college and said, “Look, Ma! I got an offer as a data scientist from <insert name of a company that sounds like a joke>.com!” this booklet will give you some sense of what your kid is getting into.
If you’re a practicing data scientist, a software engineer, or anyone else familiar with the tech industry and these technologies, you’ll notice that this booklet makes some simplifications in favor of brevity and clarity, but it may provide you with some entertainment nonetheless.

About the author

John Brooks is a data scientist who has worked for large companies such as IBM and for small startups in Silicon Valley. His title has changed over the years, from systems engineer to consultant to data scientist, but in short he’s had a love-hate relationship with zeros and ones for the past 10 years. He lives in California with his family.

Table of Contents

  • Introduction
  • Google
  • Cloudera
  • AWS
  • GitHub
  • RStudio
  • DataBricks
  • RedisLabs
  • Kaggle
  • Summary

Charts and Illustrations

  • Illustration: MapReduce - distributing information
  • Illustration: MapReduce - distributed count
  • Chart: Market share of Hadoop installations
  • Table: AWS billing statement for Twitter streaming API analytics
  • Chart: Popular programming languages for data science as mentioned on Twitter
  • Table: Sampling of popular GitHub repositories for data science
  • Chart: DataBricks product stack
  • Screenshot: Kaggle leaderboard

Delivery Methods and Transaction Details

The file will be delivered in PDF format. We will email you a download link that stays active for 24 hours, so you can purchase on any platform and download the file later on your computer. All transactions are handled by Stripe over a secure HTTPS connection.