Exploring California Public School District Data

California publishes a wealth of data about its public school district system, much of it through the California Department of Education. I explored this data, focusing on wages and API (Academic Performance Index) scores.

Here are the top ten school districts in California by total wages paid to school district employees. Los Angeles Unified tops the list, paying $4.8B in total to its employees.

The box plot illustrates the distribution of wages: the box represents the interquartile range (25th to 75th percentile), the line in the middle is the median, and the whiskers and dots show the extremes of the distribution. For example, the maximum pay in Los Angeles Unified is close to $500k. Ignoring those extremes, the range seems reasonable, with the bulk of wages falling under $100k and the median around $50k.
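As an illustration of those box-plot statistics (with made-up numbers, not the actual district data), the quartiles, median, and outlier fence can be computed with NumPy:

```python
import numpy as np

# Hypothetical annual wages for one district (illustrative only).
wages = np.array([28_000, 42_000, 50_000, 55_000, 61_000, 75_000, 98_000, 480_000])

q1, median, q3 = np.percentile(wages, [25, 50, 75])
iqr = q3 - q1  # the "box" in the box plot

# Points beyond 1.5 * IQR above the box are drawn as outlier dots.
upper_fence = q3 + 1.5 * iqr
outliers = wages[wages > upper_fence]
print(median, upper_fence, outliers)
```

Here the one extreme salary falls outside the fence and would show up as a dot, while the box itself summarizes the bulk of the distribution.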

Do these wages make a difference? I wanted to see the correlation between wages paid, normalized by student body (total wages paid / total number of students), and the API score of the district. The following scatter plot with a regression line shows that there is essentially no correlation between wages and API scores.
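A minimal sketch of that normalization and correlation check, using pandas with hypothetical district numbers and column names (the real CDE files have their own schema):

```python
import pandas as pd

# Hypothetical district-level data; illustrative only, not the actual CDE data.
df = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "total_wages": [4.8e9, 1.2e9, 6.0e8, 3.0e8],
    "enrollment": [640_000, 130_000, 75_000, 30_000],
    "api": [720, 790, 760, 810],
})

# Normalize wages by student body, then check the Pearson correlation.
df["wages_per_student"] = df["total_wages"] / df["enrollment"]
r = df["wages_per_student"].corr(df["api"])
print(r)
```

A value of r near zero is what "no correlation" means in the scatter plot above.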

There is one variable I found that has a very strong correlation with API score, and that is the education level of the parents, on a scale from 1 (did not finish high school) to 5 (graduate-level education). So, would-be parents out there, definitely start sharpening your pencils and get that advanced degree, because your education is the best predictor of your child's education.
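The same kind of check for the parent-education variable can be sketched as follows, again with illustrative numbers rather than the real data; np.polyfit also gives the slope of the regression line drawn through the scatter plot:

```python
import numpy as np

# Parent education level (1 = did not finish high school ... 5 = graduate degree)
# and district API score. Values are illustrative, not the actual CDE data.
parent_ed = np.array([1.5, 2.0, 2.8, 3.5, 4.2, 4.9])
api = np.array([650, 700, 730, 780, 830, 880])

r = np.corrcoef(parent_ed, api)[0, 1]             # Pearson correlation
slope, intercept = np.polyfit(parent_ed, api, 1)  # least-squares regression line
print(r, slope)
```

A correlation near 1 with a clearly positive slope is the pattern described above.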

The code for this analysis can be found on our GitHub page. Also, if you liked this post you should follow us on Twitter by clicking on the bird below.

Book Review: Programming in Scala (Artima)

Written by Martin Odersky, the creator of the language, along with Lex Spoon and Bill Venners, Programming in Scala, 2nd Edition is a comprehensive guide to the language. It is a must-read for every Scala programmer, because it goes beyond the technicalities of the language and into its design philosophy and heart. One of the big themes throughout the book is that functional programming and object-oriented programming are not opposites. In fact, the two can live in tandem to create a truly scalable language, which is where the name Scala comes from.

Scala can be written like Java: it runs on the JVM and can do all the things that Java can do. The authors do a good job of convincing the reader to write Scala like Scala, and to adopt a functional style of structuring algorithms. For example, a simple program to print its arguments to stdout in an imperative style would be:

var i = 0
while (i < args.length) {
    println(args(i))
    i += 1
}

But in a more functional style, the code would be much more concise and readable:

args.foreach(arg => println(arg))

The example code in the book is well thought out, demonstrating the power of the language by using Scala's features such as functional data structures, traits, pattern matching, and case classes. The book applies these techniques to problem spaces like constructing a representation of rational numbers and the n-queens problem.

With more than 800 pages of content, the book is very detailed, covering a wide range of topics such as testing, annotations and concurrent programming using the Actor model.

The most recent edition is the 2nd edition published in 2011, updated for Scala 2.8, but the contents of this book are still very relevant and should be on the desk of every Scala developer.

Hands on with TensorFlow

Google released a new open source machine learning library called TensorFlow. I was excited to try it out, not only because it was released by Google, but also because it is used in production at Google for various products like Smart Reply and translation. I quickly went through the tutorials and documentation, and here are my initial thoughts.

TensorFlow framework

As with any good library, building a model with TensorFlow feels like building something out of Lego blocks. The building blocks are straightforward, and the properties and functions map well to how a scientist thinks about machine learning. In TensorFlow, you represent computations as graphs: the graph is composed of operations, each of which takes zero or more tensors as input and produces zero or more tensors as output. The library provides facilities to build this graph from the bottom up, supporting many of the deep learning techniques used by machine learning researchers.

Going through the first tutorial, on MNIST digit recognition, the code to represent a softmax regression model looks like this:

y = tf.nn.softmax(tf.matmul(x,W) + b)

This model will be able to predict a digit between 0 and 9, given a handwritten image of that digit. You can refer to the tutorial itself for the meanings of the variables. Note that at this point the equation is an abstraction; there is no real data populated in it yet. It is just a representation of the model.
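To make the equation concrete outside of TensorFlow, here is a sketch in plain NumPy of what that one line computes, using the shapes from the tutorial (784 flattened pixels in, 10 digit classes out); this is my own stand-in, not the library's implementation:

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random((2, 784))   # a batch of two flattened 28x28 "images"
W = np.zeros((784, 10))    # weights, initialized to zero as in the tutorial
b = np.zeros(10)           # biases

y = softmax(x @ W + b)     # each row of y is a probability distribution over digits
print(y.shape, y.sum(axis=1))
```

With zero weights every digit gets equal probability, which is exactly the untrained state the tutorial starts from.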

After writing the model, in order to train it, you write a cost function and optimize the model by minimizing the loss with respect to the actual labeled truth. Lastly, you run a session to feed the data through the graph you created. Altogether, the code to build and test this model is less than 30 lines of Python.
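What the training session does can also be sketched by hand; the following is a hand-rolled NumPy stand-in for the tutorial's cross-entropy loss and gradient-descent step (toy random data, illustrative learning rate), not TensorFlow's actual machinery:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random((32, 784))             # a toy batch standing in for MNIST images
labels = rng.integers(0, 10, size=32)
y_true = np.eye(10)[labels]           # one-hot encoded ground truth

W = np.zeros((784, 10))
b = np.zeros(10)
lr = 0.01                             # learning rate (illustrative)

for _ in range(200):
    y = softmax(x @ W + b)
    loss = -np.mean(np.sum(y_true * np.log(y), axis=1))  # cross-entropy cost
    grad = (y - y_true) / len(x)      # gradient of the loss w.r.t. the logits
    W -= lr * (x.T @ grad)            # gradient-descent update
    b -= lr * grad.sum(axis=0)

print(loss)
```

The loss starts at ln(10) (uniform guessing over ten digits) and decreases as the weights are updated; in TensorFlow, the optimizer and session do this loop for you.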

From going through this tutorial, I would say the library is very well thought out, with the right tools in place, such as graph visualization. I definitely need to explore further the capabilities of TensorBoard, the graph visualization tool that comes with TensorFlow.

Why Python?

I am a bit puzzled that Python is the language of choice for the initial interface of TensorFlow. By going with Python, the library has to account for type information in a way that feels awkward in the language, like this: x = tf.placeholder("float", [None, 784]). Go seems like a much better fit, given its emphasis on performance and type safety. Plus, Go was created at Google; what better synergy is there?

What this means for researchers

With its well-thought-out structure and ease of use, this looks like a library that researchers will certainly be interested in. However, it is by no means a silver bullet for all of your machine learning headaches. For example, I mentioned earlier that the softmax regression code is less than 30 lines, but there is a huge caveat: Google kindly prepared a separate class, which is much bigger (approx. 200 LOC), that pulls down data from Yann LeCun's homepage and prepares it for training. Using lines of code as a proxy, this means that about 80% of the work is still in data preparation and 20% is actual modeling. Also, one of the selling points of the library is the ability to leverage GPUs and train large models in production environments. Unless one of the big cloud providers like AWS starts offering a template for this, it is still hard for individual researchers to get access to the infrastructure required to pull off these massive models.

What this means for Google

This library is a great move for Google in that:

  1. It positions Google as an innovator in the increasingly competitive field of deep learning
  2. As more researchers utilize the library in their research, Google will become a popular destination for the leading minds in this field
  3. Open sourcing it will lead to more contributions from the public

If you liked this post, follow us on Twitter for updates on future blog posts. Also, check out our new eBooklet, Data Science Industry in Plain English.


Backyard Fox is live.

In this blog we will:

  • Discuss topics related to data science and education
  • Bring data science practices to everyday topics for new insights
  • Launch products such as eBooks, apps, etc.

For our first post, I'd like to introduce our logo. The mixture of green and brown represents the grass and trees in the backyard, and the tail of a playful yet clever fox peeks out from behind a tree. The imagery is meant to feel warm and inviting, like a calm spring day.

Thanks for reading this blog and welcome to Backyard Fox.