I’ve recently been getting pretty far into the weeds about what the future of data programming is going to look like. I use pandas and dplyr in Python and R respectively. But I’m starting to see the shape of something interesting coming down the pike. I’ve been working on a project that involves scatterplot visualizations at massive scale–up to 1 billion points sent to the browser. In doing this, a few things have become clear:
- Computers have gotten much, much faster in the last couple decades
- Our languages for data analysis have failed to keep up.
- Things will be weird, but also maybe good?
I tweeted about it once, after I had experimented with binary, serialized alternatives to JSON.
As webgpu and new binary serialization formats--like Arrow--come of age, it's going to be harder and harder to stomach geojson's slowness. More and more of R and python will become js or wasm wrappers. Just like in the 2000s they were wrappers around Java. It'll be very weird.— Benjamin Schmidt (@benmschmidt) December 23, 2020
I’m writing about Python and R because they’re completely dominant in the space of data programming. (By data programming, I mean basically ‘data science’; not being a scientist, I have trouble using that term to describe what I do.) Some dinosaurs in economics still use Stata, and some wizards use Julia, but if you want to work with data, that’s basically it. The big problem with the languages we use to work with data is that they run largely on CPUs, and often predominantly on a single core. This has always been an issue in terms of speed; when I first switched to Python around 2011, I furiously searched for ways around the GIL (global interpreter lock) that keeps the language from using multiple cores even on threads. Things have gotten a little better on some fronts–in general, it seems like at least linear algebra routines can make use of a computer’s full resources.
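To make the single-core complaint concrete, here is a minimal sketch (mine, not from any particular library): the interpreted loop stays pinned to one core by the GIL, while the same arrays handed to numpy get worked on by a multithreaded BLAS.

```python
# A sketch, not a benchmark suite: compare an interpreted loop (one core)
# with a BLAS-backed numpy call (typically all cores).
import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.time()
c = a @ b  # dispatched to BLAS; usually saturates every core
print("numpy matmul:", time.time() - t0)

t0 = time.time()
total = 0.0
for i in range(n):          # pure-Python loop: one core, orders of magnitude slower
    for j in range(n):
        total += a[i, j] * b[j, i]
print("python loop over the same arrays:", time.time() - t0)
```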
JS/HTML is the low-level language for UI in Python and R.
I’ve been relieved to be able to use Altair instead of matplotlib for
visualizing pandas dataframes; and I don’t think twice about dropping
ggplotly into lessons about ggplot for students who start wondering about
tooltips on mouseover.
ggplot and matplotlib are still kings of the roost for publication-ready plots, but after becoming accustomed to interactive, responsive charts on the web, we are coming to expect exploratory charts to do the same thing; just as select menus and buttons from HTML fill this role in notebook interfaces, JS charting libraries do the same for charts.
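A minimal sketch of what that division of labor looks like in practice; the dataframe and its column names are invented for illustration, but the Altair calls are the standard ones, and everything after .encode() is really a Vega-Lite (i.e., Javascript) spec rendered by the browser.

```python
# Altair hands the chart spec to Javascript, so mouseover tooltips come for free.
import altair as alt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 1, 6], "label": ["a", "b", "c"]})

chart = alt.Chart(df).mark_point().encode(
    x="x",
    y="y",
    tooltip=["label"],  # no matplotlib event callbacks required
)
chart  # renders interactively in a notebook
```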
The GPU-laptop interface is an open question
Let me be clear–something I’ll say in this section is certainly wrong. I’m not a full expert in what I’m about to say, and I don’t know who is! There are some analogies to web cartography, where I’ve learned a lot from Vladimir Agafonkin. Many of the tools I’m thinking about here I learned about from conversations with Doug Duhaime and David McClure. But the field is unstable enough that I think others may stumble in the same direction I have.
This whole period, GPUs have also been displacing CPUs for computation. The R/Python interfaces to these are tricky. Numba kind of works; I’ve fiddled with gnumpy from time to time; and I’ve never intentionally used a GPU in R, although it’s possible I did without knowing it. The path of least resistance to GPU computation in Python and R is often to use Tensorflow or Torch even for purposes that don’t really need a neural network library–so I find myself, for example, training UMAP models using the neural-network implementation rather than the CPU one, even though I’d prefer the latter.
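Here is roughly what I mean by leaning on a neural-network library as a generic GPU array library: a hedged sketch that assumes a CUDA-capable machine, with nothing resembling an actual neural network in it.

```python
# Torch as a plain GPU array library: pairwise distances and a k-means-style
# assignment step, all on the device. (Assumes a CUDA GPU; falls back to CPU.)
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

points = torch.rand(1_000_000, 2, device=device)   # a million 2-D points
centers = torch.rand(50, 2, device=device)          # 50 candidate centers

dists = torch.cdist(points, centers)     # (1_000_000, 50) distance matrix
nearest = dists.argmin(dim=1)            # nearest center per point
counts = torch.bincount(nearest, minlength=50)

print(counts.cpu().numpy())              # pull the tiny result back to the CPU
```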
Most of these rely on CUDA to access GPUs. (When I said I don’t know what I’m talking about–this is the core of it.) If you want to do programming on these platforms, you increasingly boot up a cloud server and run heavy-duty models there. CUDA configuration is a pain, and the odds are decent your home machine doesn’t have a CUDA-capable GPU anyway. If you want to run everything in the cloud, this is fine–Google just gives away TPUs for free. But for a group-by/apply/summarize on a few million rows, this is overkill; and while cloud compute is pretty cheap compared to your home laptop, cloud storage is crazy expensive. Digital Ocean charges me like a hundred dollars a year just to keep up the database backing RateMyProfessor; for the work I do on several terabytes of data from the HathiTrust, I’d be lost without a university cluster and the 12TB hard drive on my desk at home.
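The kind of operation I have in mind is nothing exotic; something like this pandas sketch, with column names invented for illustration:

```python
# Bread-and-butter data work: group-by / apply / summarize on a few million rows.
# Today this runs on one CPU core, even though it is embarrassingly parallel.
import numpy as np
import pandas as pd

n = 3_000_000
df = pd.DataFrame({
    "state": np.random.choice(["MA", "NY", "TX"], size=n),
    "value": np.random.rand(n),
})

summary = df.groupby("state")["value"].agg(["mean", "count"])
print(summary)
```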
But I want these operations to run faster.
Javascript itself has also gotten better: successive ECMAScript iterations have made it in many cases relatively easy to program with. Constructs like for ... of all work much like you’d expect (unlike the days when I spent a couple hours hunting out a rarely occurring bug in one data visualization that turned out to occur when I was making visualizations of wordcounts that included the word constructor somewhere in the vocabulary); and many syntactic features like classes, array destructuring, and arrow functions are far more pleasant than their Python equivalents. (Full disclosure–even after a decade in the language, I still find Python’s whitespace syntax gimmicky and at heart just don’t like the language. But that’s a post for another day.)
In online cartography, protobuf-based vector tiles do something similar in deck.gl. The overhead of JSON-based formats for working with cartographic data is hard to stomach once you’ve seen how fast, and how much more compressed, binary data can be.
WebGL is hell on rollerskates
In working with WebGL, I’ve seen just how fast it can be. For things like array smoothing, counting points to apply complicated numeric filters, and group-by sums, it’s possible to start applying most of the elements of the relational algebra to data frames in a fully parallelized form.
But I’ve held back from doing so in any but the most ad-hoc situations because WebGL is also terrible for data computing. I would never tell anyone to learn it right now unless they absolutely needed to. Attribute buffers can only be floats, so you need to convert all integer types before posting. In many situations data may be downsized to half-precision floats, and double-precision floating point is so difficult that there are entire rickety structures built to support it at great cost. Support for texture types varies across devices (Apple ones seem to pose special problems), so people I’ve learned from like Ricky Reusser go to great lengths to support various fallbacks. And things that are essential for data programming, like indexed lookup into lists or for loops across a passed array, are nearly impossible. I’ve found writing complex shaders in WebGL fun, but doing so always involves abusing the intentions of the system.
WebGPU and wasm might change all that
But the last two pieces of the puzzle are lurking on the horizon. The first is WebAssembly: a few projects that are churning along in Rust hold the promise of making in-browser computation even faster. (If I were going to go all-in on a new programming language for a few months right now, it would probably be Rust; in writing WebGL programs I increasingly find myself doing the equivalent of writing my own garbage collectors, but as a high-level guy I never learned enough C to really know the basic concepts.)
Back in the 2000s, the Python and R ecosystems were littered with packages that relied on the Java Virtual Machine in various ways. In the 2010s, it felt to me like they shifted to underlying C/C++ dependencies. But given how much work has gone into Javascript engines, I expect packages to lean on the Javascript Virtual Machine more and more. When I want to use some of D3’s spherical projections in R, that’s how I call them; and Jeroen Ooms’s V8 package (for running the JSVM, or whatever we call it) is approaching the same level of downloads as the rJava package. If WebAssembly starts becoming a realistic way to run pre-compiled Rust and C++ binaries on any system… that’s interesting.
The last domino is a little off, but could be titanically important. WebGL is slowly dying, but the big tech companies have all gotten together to create WebGPU as the next-generation standard for talking to GPUs from the browser. It builds on top of the existing GPU interfaces for specific devices (Apple, etc.) like Vulkan and Metal, about which I have rigorously resisted learning anything.
WebGPU will replace WebGL for fast in-browser graphics. But the capability to do heavy-duty computation in WebGL is so tantalizing that some lunatics have already begun to do it. The stuff that goes on in Reusser’s work is amazing; check out this notebook about “multiscale Turing patterns” that creates gorgeous images halfway between organic blobs and nineteenth-century endplates.
I haven’t read the draft WebGPU spec carefully, but it will certainly allow a more robust way to handle things. There is already at least one linear algebra library (i.e., BLAS) for WebGPU out there. I can only imagine that support for more data types will make many simple group-by-filter-apply functions plausible entirely in GPU-land on any computer that can browse the web.
When I started in R back in 2004, I spent hours tinkering with a SQL backend for what seemed at the time like an enormous dataset: millions of rows giving decades of data about student majors by race, college, gender, and ethnicity. I’d start a Windows desktop cranking out charts before I left the office at night, and come back to work the next morning to folders of images. Now, it’s feasible to send an only-slightly-condensed summary of 2.5 million rows for in-browser work, and the whole dataset could easily fit in GPU memory. In general, the distinction between generally available GPU memory (say, 0.5–4GB) and RAM (2–16GB) is not so massive that we won’t be sending lots of data there. Data analysis and shaping is generally extremely parallelizable.
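The arithmetic is mine (the column count is a made-up assumption), but it shows why that gap isn’t scary:

```python
# Back-of-the-envelope: 2.5M rows of a handful of float32 columns is tens of
# megabytes -- comfortably inside even 0.5GB of GPU memory.
rows, cols, bytes_per_float32 = 2_500_000, 6, 4
size_mb = rows * cols * bytes_per_float32 / 1e6
print(f"{size_mb} MB")  # 60.0 MB
```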
JS and WebGPU will stick together
Once these things start getting fast, the insane overhead of parsing CSV and JSON, and the loss of strict type definitions that comes with them, will be far more onerous. Something–I’d bet on Parquet, but there are other possibilities involving Arrow, HDF5, ORC, protobuf, or something else–will emerge as a more standard binary interchange format.
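As a sketch of what a typed binary interchange buys you over CSV: the file name and columns here are invented, but the pyarrow calls are the ordinary ones.

```python
# Write a dataframe to Parquet via Arrow: the schema travels with the data,
# so nothing gets re-guessed from strings on the way back in.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"word": ["a", "b"], "count": [10, 3], "year": [1910, 2010]})

table = pa.Table.from_pandas(df)            # explicit column types
pq.write_table(table, "counts.parquet")     # columnar, compressed on disk

round_tripped = pq.read_table("counts.parquet").to_pandas()
print(round_tripped.dtypes)                 # int64 stays int64; no CSV parsing
```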
Why bother with R and Python?
So–this is the theory–the data programming ecosystems in R and Python are going to come to rely on all of this. Just as they already wrap Altair and HTML click elements, you’ll start finding more and more that the package/module that seems to just work, and quickly, the one the 19-year-olds gravitate towards, runs on the JSVM. There will be strange Stack Overflow questions in which people realize that they have an updated version of V8 installed which needs to be downgraded for some particular package. There will be Python programs that work everywhere but mysteriously fail on some low-priced laptops using a Chinese startup’s GPU. And there will be things that almost entirely avoid the GPU because they’re so damned complicated to implement that the Rust ninjas never do the full port, and which–compared to the speed we see from everything else–come to be unbearable bottlenecks. (From what I’ve seen, Unicode regular expressions and non-spherical map projections seem to be likely candidates here.)
But I’ve already started sharing elementary data exercises for classes
using observablehq, which provides a far more coherent
approach to notebook programming than Jupyter or RStudio. (If you haven’t
tried it–among many, many other things, it parses the dependency relations between cells in a notebook and avoids the incessant state errors that infect expert and–especially–novice programming in Jupyter or RStudio.) And if you want to work with data
rather than write code, it is almost as refreshing as the moment in computer history it
tries to recapitulate, the shift from storing business data in COBOL to
running it in spreadsheets. The tweet above that forms the germ of this
rant has just a single, solitary like on it; but it’s from Mike Bostock, the creator
of D3 and co-founder of Observable, and that alone is part of the reason I
bothered to write this whole thing up. The Apache Arrow platform I keep rhapsodizing about
is led by Wes McKinney, the creator of pandas, who views it as the germ of a faster, better pandas, from a position initially sponsored by RStudio and subsequently with funding from Nvidia. Speculative as this all is, it’s also–aside from the massive neural-network gravitational pull of the tensorflow/torch solar systems–where the tools that became hegemonic in the last decade are naturally drifting.
I wish more data analysts, not just the insiders, saw this coming, or were excited that it is.