Partition all the datas!

It’s been a while, but I wanted to put up a quick post around some new functionality in the Tables.jl package; as of the v1.1 release, a new function, Tables.partitions, has been introduced into the Tables.jl interfaces. Tables.partitions adds a “batch processing” layer to the already established Tables.rows/Tables.columns interfaces, allowing input source tables to signal that they have natural “batches” or partitions. If a source overloads Tables.partitions, each iterated partition should return a valid “table”, aka an object that satisfies the rows or columns interfaces.
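As a minimal sketch, a source with natural batches might hook in like this (Chunked is a made-up type for illustration):

using Tables

# a made-up source type whose data naturally comes in batches;
# here each chunk is a NamedTuple of vectors, itself a valid columns-table
struct Chunked{T}
    chunks::Vector{T}
end

Tables.partitions(x::Chunked) = x.chunks

c = Chunked([(a = [1, 2], b = [3.0, 4.0]), (a = [5, 6], b = [7.0, 8.0])])
for part in Tables.partitions(c)
    # each part satisfies the Tables.columns interface
end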

Why partitions? There are certain workflows and algorithms that work best on “batches” of data, by employing the popular map-reduce methodology, or, in the case of extremely large datasets, simply processing such large tables chunk by chunk.

What are some concrete uses already out there? Here are a few examples:

  • In the Arrow.jl package, data is naturally stored in “record batches”. The package provides the Arrow.Stream type to allow iterating record batches from arrow data. The Arrow.write function also allows writing input table partitions out as separate record batches, using separate threads to write each batch.
  • In the CSV.jl package, the CSV.Chunks type is provided for processing extremely large files in “chunks”, where each iteration is a CSV.File object. Because CSV.File is a valid Tables.jl table, CSV.Chunks naturally satisfies the Tables.partitions interface by returning itself. CSV.write also now supports the partition=true keyword argument, which allows writing out multiple files at once using multiple threads, with input partitions written out to separate files.

There have also been efforts to provide several convenience functions to make working with partitions easier. For example:

  • Tables.partitioner: basically a lazy version of map, where each element returned must satisfy the Tables.jl source interface. This allows for easy “partitioning” of multiple tables, where each table is a partition. For example, Tables.partitioner(CSV.File, array_of_csv_filenames) will return an object where Tables.partitions applies CSV.File to each element of the array_of_csv_filenames array. This allows treating a bunch of csv files as a single “long” table, as long as each csv file has the same schema (see the sketch after this list).
  • TableOperations.joinpartitions: can take a partitioned input table and return a single “table” that satisfies the Tables.columns interface; it does this by taking the columns of each partition and “appending” them together using the ChainedVector array type from the SentinelArrays.jl package.
  • TableOperations.makepartitions: takes an input Tables.jl-compatible table source, along with an integer number of rows, and returns an object that implements Tables.partitions by returning the specified number of rows in each partition.
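Here’s a quick sketch of the partitioner pattern (file names are made up; each file is assumed to share the same schema):

using CSV, Tables

files = ["data_part1.csv", "data_part2.csv", "data_part3.csv"]

# lazily apply CSV.File to each file name; nothing is parsed until iterated
parts = Tables.partitioner(CSV.File, files)

# each partition is a CSV.File, itself a valid Tables.jl table
for tbl in Tables.partitions(parts)
    # process each chunk here
end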

Anyway, it’s exciting to hear about people already adopting this functionality and enjoying the benefits of partitioning data seamlessly across packages. As always, feel free to ping me in the JuliaLang #data channel, or hit me up on twitter.


Data Structure Design in CSV.jl

I’m currently working through a refactor of CSV.jl internals and once again am faced with some tricky questions surrounding which data structures to use. In csv reading, like many things, there are trade-offs between performance/efficiency and convenience: we want to read delimited files as fast as possible, but we also want the most natural, standard data structures. Some of the complexity in these decisions comes from the csv spec (or lack thereof!) itself, but to fully understand that, let’s talk through the primary task of csv reading.

The aim for CSV.jl is to take plain text data–columns separated by a delimiter, and rows separated by newlines–parse text values into native typed values, and output the data in columnar data structures, like a native Julia Vector. Unfortunately, csv files do not contain any metadata to indicate how many rows or columns there are, what the column types will be, or whether columns contain null values. This lack of metadata means we have to use tricks like guessing the # of rows in a file, “detecting” data types, and being flexible in case we detect one type and need to “promote” to a different data type later. The complexities are compounded when considering multi-threaded parsing because you either have to play the synchronization dance between threads, or give each thread a chunk of the file and “merge” their results afterwards (the current CSV.jl approach).

Given these processing considerations, here are some of the data structure questions I’m currently pondering:

Vector{Union{T, Missing}} vs. SentinelVector{T}: Base Julia includes an optimization that allows Arrays with isbits Union elements to be stored inline, which means they can be extremely efficient when working with, for example, Union{Float64, Missing}. I’ve been prototyping a new SentinelArrays.jl package that allows wrapping any AbstractArray and specifying a special “sentinel” value of the array’s element type that should be returned as a different “special value”, like missing. For Float64, for example, we can use a non-standard NaN bit pattern as a sentinel for missing, and this works quite well: we’re basically working with a plain Vector{Float64} in terms of storage, but get the advantage of representing Union{Float64, Missing} through our sentinel.
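To make the idea concrete, here’s a toy version of the sentinel trick (a sketch only; the real SentinelArrays.jl is more general and more careful):

struct ToySentinelVector <: AbstractVector{Union{Float64, Missing}}
    data::Vector{Float64}
    sentinel::Float64  # a non-standard NaN bit pattern reserved for missing
end

Base.size(v::ToySentinelVector) = size(v.data)

# compare bit patterns, since NaN == NaN is false
Base.getindex(v::ToySentinelVector, i::Int) =
    reinterpret(UInt64, v.data[i]) == reinterpret(UInt64, v.sentinel) ? missing : v.data[i]

“Unwrapping” when no sentinels were ever stored is then just returning v.data directly.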

The pros of Vector{Union{T, Missing}} are that it’s Base Julia: built-in, standard, and much of the data ecosystem has become familiar with the type and integrated it well. The cons are that it takes up a little extra space (1 extra byte per element, signaling whether that element holds a Float64 or a missing), and that there currently isn’t a way to “truncate” the array, like convert(Vector{Float64}, A::Vector{Union{Float64, Missing}}), without copying data. This is desirable because while parsing, we need to assume Union{T, Missing} in case missing values are encountered, but we might finish parsing and know that there were in fact none, in which case we’d like to just return a Vector{Float64}. With a SentinelVector, this is trivial: we can just “unwrap” the underlying array and return that, given no sentinels were used while parsing. The disadvantage of SentinelArrays is just that they’re non-standard, though the Julia ecosystem has evolved pretty well to rely on the general AbstractArray interface instead of needing specific array types.

The 2nd question is: how to return/represent String columns? String columns are a bit of an odd duck with respect to parsing because you’re not really “parsing” anything, just noting the start/end positions of each cell in the csv file. And indeed, one representation of a String column is just that: a custom array type that holds on to the original file byte buffer, where each “element” is just the byte position and length of a cell. Upon indexing, the String can be fully materialized, but otherwise this lazily-materializing structure provides a lot of efficiency. The disadvantage of this approach is needing to hold on to the original file buffer, which has caused confusion for users in the past when they try to modify/delete the file after parsing and get errors saying the file is still in use. Another disadvantage is trying to support the full AbstractArray interface with this lazy structure, particularly with regards to mutating operations (push!, append!, etc.). The WeakRefStrings.jl package provides a structure that can mostly be used for this kind of task, but there are a few internal mismatches with how the positions/lengths are represented. The alternative of just materializing a full Vector{String} has the advantage of being extremely standard and easy to work with, but it’s expensive: each string cell in the file must be copied/allocated, even if it never ends up getting used.
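Here’s a toy sketch of the lazy representation (WeakRefStrings.jl and CSV.jl differ in the details):

struct LazyStringVector <: AbstractVector{String}
    buf::Vector{UInt8}                # the original file byte buffer
    poslens::Vector{Tuple{Int, Int}}  # (byte position, length) for each cell
end

Base.size(v::LazyStringVector) = size(v.poslens)

function Base.getindex(v::LazyStringVector, i::Int)
    pos, len = v.poslens[i]
    return String(v.buf[pos:pos+len-1])  # materialize the String only on access
end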

The 3rd data structure question involves multi-threaded parsing: how to best return/represent columns when multiple threads are involved? As noted earlier, CSV.jl currently chunks up large files and lets each thread process a chunk separately: detecting types, guessing rows, the full parsing task. After each thread has finished, a “merge” operation is performed where columns are promoted and recoded to ensure each has a consistent type. CSV.jl currently defines its own CSV.Column2 type (admittedly not the greatest name 😜) that “chains” each thread’s column together into a single “column”. This seems to work pretty well in practice, with iteration still being extremely efficient, but there’s a gotcha if you try to do

for i = 1:length(column)
    x = column[i]
    # do stuff with x
end

That is, linear getindex operations are slower (O(log N) per lookup) because each index has to determine which underlying chained array i belongs to.
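Iteration, on the other hand, remains efficient; a pattern like this avoids the per-index chain lookup entirely:

for x in column
    # do stuff with x; no per-element chain search needed
end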
The most obvious alternative solution is to just vcat all the thread columns together, but my worry there is that we’d essentially double the required memory for parsing, if only for the brief operation of appending columns together.

Anyway, these are some of the questions I’ve been sleeping on for the past few days; I don’t have firm commitments to choose one solution over the other, but with the other internal refactoring, I thought it would be good to think through the ideals and see if we can improve things somehow. My ulterior motive in writing this all up in a blog post is to hopefully elicit some discussion/ideas around ways we can accomplish some of the things I’ve discussed here. As always, feel free to ping me on Twitter or the #data julialang slack channel to chat.

Tables.jl 1.0 Release

Since its inception on a couple of whiteboards at JuliaCon 2018 in London, the Tables.jl package has grown to become a foundational set of interfaces for data packages in Julia, even accumulating 74 direct dependents in the General registry as of this writing (Feb 2020)!

So why the 1.0 release? An official 1.0 release can signal stability and maturity, particularly in the case of an “interface package” like Tables.jl. With very little change in API since the package started, we figured it was time to send that signal out to the broader community: hey! this package can be pretty useful and we’re not going to break it (intentionally)! It also gives the opportunity to polish up a few APIs, iron out a few wrinkles, and cleanup a lot of documentation.

For those who aren’t familiar, the Tables.jl package is about providing powerful interfaces for “table types” in all their shapes, sizes, and formats, to talk with each other seamlessly, and most importantly, without direct knowledge of each other! At the highest level, it provides the Tables.rows and Tables.columns functions, which provide the two most common access patterns for table data everywhere (row-by-row, or by entire columns). Table types become valid “sources” by implementing access to their data via Tables.rows or Tables.columns (or both!), and “sinks” can then operate on any source by calling Tables.rows or Tables.columns and processing the resulting data appropriately. A key feature for sinks is the “orientation agnostic” nature of Tables.rows and Tables.columns; i.e. sinks don’t need to worry whether the source is row- or column-oriented by nature, since performant, generic fallbacks are provided both ways. That is, calling Tables.rows on a source that only defined Tables.columns falls back to a generic lazy row iteration over the input columns. And vice versa, calling Tables.columns on a source that only defined Tables.rows will process the input rows to build up columns. This works because it’s work the sink would have had to do anyway; if the sink really needs columns to do its work, then it would have had to turn rows into columns anyway, so Tables.jl provides that with the best, community-sourced implementation possible.
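For example, here’s a minimal sketch of hooking a custom column-oriented type into the interface (MyTable and its fields are made up for illustration):

using Tables

struct MyTable
    names::Vector{Symbol}
    cols::Vector{AbstractVector}
end

Tables.istable(::Type{MyTable}) = true
Tables.columnaccess(::Type{MyTable}) = true
Tables.columns(t::MyTable) = t  # MyTable itself satisfies the columns interface
Tables.columnnames(t::MyTable) = t.names
Tables.getcolumn(t::MyTable, i::Int) = t.cols[i]
Tables.getcolumn(t::MyTable, nm::Symbol) = t.cols[findfirst(==(nm), t.names)]

With just this, calling Tables.rows on a MyTable works too, via the generic fallback described above.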

Another core concept, now clarified in the 1.0 release, is the interface for accessing data on an individual row, as well as on a set of columns. It turns out that viewing a “row” as an ordered set of named column values, and “columns” as an ordered set of named columns, lends itself naturally to a common interface, simplified implementations, and the ability to provide powerful default functionality. Check out the new docs, particularly about Tables.AbstractRow and Tables.AbstractColumns, to learn more. You can also check out the Discourse announcement, which is geared a little bit more towards 1.0 release notes and an upgrade guide.

A recent popular blog post highlighted features of Julia that lend themselves to such widespread composability between packages, and Tables.jl is a powerful example of this. Through just a few simple APIs, it allows a DataFrame to be serialized to disk in the binary feather format, read back, converted to JSON to be sent over the web, and loaded into a mysql database. All without any of those packages knowing about each other, and without needing a single intermediary table type to convert between. Applications and processing tasks can feel free to leverage any number of in-memory representations, disk formats, and databases to handle table manipulations.

Here’s to more integrations, more composability, and more table types in the future!

BYO-Closures For Performance

Some may be familiar with the idea of closures, which, in short, are local functions that capture state from their enclosing context. Closures are obviously supported in Julia, where they often just look like anonymous functions; in those docs, it mentions that “Functions in Julia are first-class objects“, which means you can think about them as being defined in the language itself. Indeed, we can take the example of the exponent method for IEEEFloat types in Base, which is defined (slightly abbreviated) as:

function exponent(x::T) where T<:IEEEFloat
    xs = reinterpret(Unsigned, x) & ~sign_mask(T)
    k = Int(xs >> significand_bits(T))
    if k == 0 # x is subnormal
        m = leading_zeros(xs) - exponent_bits(T)
        k = 1 - m
    end
    return k - exponent_bias(T)
end

And think of this method definition being “lowered” to:

struct exponentFunction <: Function
end

function (f::exponentFunction)(x::T) where T<:IEEEFloat
    xs = reinterpret(Unsigned, x) & ~sign_mask(T)
    k = Int(xs >> significand_bits(T))
    if k == 0 # x is subnormal
        m = leading_zeros(xs) - exponent_bits(T)
        k = 1 - m
    end
    return k - exponent_bias(T)
end

const exponent = exponentFunction()

So here we’re defining a struct exponentFunction, which is a subtype of Function, the abstract type that all function types inherit from (you can check this yourself by querying supertype(typeof(Base.exponent))). Then we’re defining a method with some unusual syntax to make instances of exponentFunction callable, like exponentFunction()(3.14), which is accomplished with the syntax function (f::exponentFunction)(x::T). Finally, we declare our const exponent to just be an instance of exponentFunction, which is often known as a “functor”. (Functors are covered in more detail in the Julia manual).

Ok, so why start off a blog post going over a bunch of stuff in the JuliaLang docs manual? Well, in a recent refactoring, I ran into a decently well-known performance issue with closures, which suggested a few different solutions, none of which quite fit my use-case. Now, I have to admit to not fully understanding the fundamental language issue causing the performance hit here; what I do understand from my own code refactorings/use is that when you try to use a variable that gets captured as closure state after the closure definition/use, it ends up creating a Core.Box object to put the variable’s value into (which massively affects performance because the variable’s inferred type is essentially Any, which then relies on runtime/dynamic dispatch at every use).

Part of my aforementioned refactoring involved moving some common code into higher-order functions that applied functor arguments to each field of a struct, for example, which meant my code was now relying on closures passed to the higher-order functions. Luckily, I tend to use my favorite JuliaLang feature, its code-inspection tools (see @code_typed, for example), to just see what core functions are getting inferred/lowered to, and noticed a bunch of red flags in the form of Core.Box for key variables. Digging a little further, it became clear that I was a victim of issue #15276 and would need to figure out a solution. One solution suggested in the issue thread was to use let blocks, declaring closure-captured state variables to make explicit which variables will be captured. In my case, however, I needed the variables to be updated within the closure and then needed those updated values afterwards to pass along (my parsing functions passed current parsing state down into the closures and then needed to pass it along to parse the next object).
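For reference, here’s a minimal sketch of that let-block workaround (a toy function, not my actual parsing code):

function makeclosure(x)
    y = x + 1
    # explicitly capturing y in a let block avoids the Core.Box;
    # but the closure captures its own binding of y, so updates made
    # inside the closure aren't visible out here afterwards
    f = let y = y
        z -> z + y
    end
    return f
end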

So my solution? Well, a truly distinctive feature of the Julia language is how much of the language is written in itself, and how many major constructs are true, first-class citizens of the language. So I decided to write my own closure object!

mutable struct StructClosure{T, KW}
    buf::T
    pos::Int
    len::Int
    kw::KW
end

@inline function (f::StructClosure)(i, nm, TT)
    pos_i, x_i = readvalue(f.buf, f.pos, f.len, TT; f.kw...)
    f.pos = pos_i
    return x_i
end

So similar to our exponent example before, we define a StructClosure functor object, which this time has a few fields representing the closure-captured state variables we need access to inside our actual function code. Also note that we made our functor a mutable struct because, in our function, we actually want to update our position variable after we’ve read a value (f.pos = pos_i).

We end up using our home-grown closure like:

@inline function read(::Struct, buf, pos, len, b, ::Type{T}; kw...) where {T}
    if b != UInt8('{')
        error = ExpectedOpeningObjectChar
        @goto invalid
    end
    pos += 1
    @eof
    b = getbyte(buf, pos)
    @wh
    if b == UInt8('}')
        pos += 1
        return pos, T()
    elseif b != UInt8('"')
        error = ExpectedOpeningQuoteChar
        @goto invalid
    end
    pos += 1
    @eof
    c = StructClosure(buf, pos, len, kw)
    x = StructTypes.construct(c, T)
    return c.pos, x

@label invalid
    invalid(error, buf, pos, T)
end

So we first create an instance of our closure, c = StructClosure(buf, pos, len, kw), and then pass it to the higher-order function like x = StructTypes.construct(c, T). Finally, you’ll notice how we return our closure’s state at the end with return c.pos, x. How’s the performance? Back in line with our fully-unrolled, pre-higher-order-function code. Ultimately, this felt like a pretty simple, even clever solution to clean up my code and use some common, well-tested higher-order functions to do some fancier code unrolling.

As always, hit me up on twitter with any comments or questions and I’m happy to discuss further.

Everyone’s Favorite Blogpost: CSV Benchmarks

We all know ’em, we all hate ’em, let’s get a good benchmarking blogpost up in here. Most people who groggily glance at their phone at 5:30 AM just roll over and go back to sleep. For some of us, you also open up the email just to see if there’s anything interesting really quick. And for the very small minority out there, we have a “performance issue” opened on one of our precious darling open-source libraries and THAT’S IT! No more sleep until this internet stranger can be proven wrong! (For the record, the issue in question is here, and xiaodaigh isn’t an internet stranger, but a great buddy I got to meet at JuliaCon 2019 in Baltimore this year, who is doing some really cool work on grouping performance in Julia; but you know, “internet stranger” is a lot funnier.)

Some of you may have heard/seen that multithreaded csv parsing support recently landed in CSV.jl, my aforementioned precious darling. So naturally, it’s a good time to round up some benchmarks and show how competitive we are in the csv parsing landscape. I apologize for the lack of fancy graphics and pretty charts, but I’m more interested in the numbers; pretty chart PRs welcome to CSV.jl!

Hey! Look at that! CSV.jl is basically on par with R’s fread!

Now, I’ll add my personal opinion on these kinds of benchmark comparisons: always take them with a grain of salt. Benchmarking tends to rely on contrived data that can sometimes be biased one way or another; it can be system dependent in a bunch of ways; benchmarks are only accurate for a given amount of time while packages continue to develop/decay; caveat, caveat, caveat. BUT, they also tend to be directionally accurate, and that’s what I’m most pleased with here, particularly now that fread‘s one remaining advantage over CSV.jl, multithreading support, has been addressed. (For a full feature comparison between CSV.jl, fread, and pandas, see the CSV.jl 0.5 release announcement.) Is CSV.jl perfect? Far from it; we need to improve memory usage a bit on really large files (and multithreading), and it looks like float performance lags fread a bit. But overall, things are looking pretty good for csv reading in Julia these days.

The files + benchmark script can be found here; the single-typed files and mixed.csv are derived from this benchmark site. The benchmark numbers shown above were run on my system: 2019 MacBook Pro, 2.4 GHz Intel Core i9, 32 GB 2400 MHz DDR4 RAM. The benchmarks were run using CSV.jl#master branch, as the last few things get ironed out before a release which will include multithreading support for Julia versions 1.3+.

As always, hit me up on Twitter to chat or follow my #JuliaLang tweets.

A Tour of the Data Ecosystem in Julia

Julia 1.0 was released at JuliaCon 2018 and it’s been a quick year for the package ecosystem to build upon the first long-term stable release. In a lot of ways, pre-1.0 for packages involved a lot of experimentation; a lot of trying out various ideas, shotgun-style and seeing what sticks, in addition to trying to keep up with the evolving core language. One of my favorite things about Julia is the great efforts that have been made to collaborate, coordinate, and modularize not just the base language and standard libraries, but the entire package ecosystem. Julia was born in the age of GitHub, Discourse, and Slack, which has led to an exceptional amount of public communication from and with core developers, targeted efforts to automatically update package code and test the impact of core changes to the ecosystem, and a proliferation of user-group meetups and domain-specific GitHub organizations. I don’t feel dishonest in saying I think Julia is the most collaborative programming language that exists today.

With all this context, one might be wondering: so what is the current status of working with data in Julia? How do various packages like CSV.jl, DataFrames.jl, JuliaDB.jl, and Query.jl play together? And so I present, “A Tour of the Data Ecosystem in Julia”. Let’s begin…

Data I/O: How do I get data in and out of Julia?

Text Files
I once heard Jeff Bezanson say tongue-in-cheek, “It doesn’t really matter what fancy features you put in a programming language, all people really want to do is read csv files. That’s it, just csv file reading.” Julia has grown a lot since the earliest days of dlmread, with several excellent packages for reading csv and other delimited files, each growing out of a unique idea for approaching this old problem or addressing a specific integration need.

CSV.jl
As blog author, you’ll have to forgive the shameless plug for my own packages: I started CSV.jl in 2015 to try and make a strong effort to match the csv parsing functionality provided by other popular languages (in particular, pandas in Python and fread in R). Since then, it’s accumulated some ~400 commits from over 25 collaborators, and recently hit issue #464. It also released a significant upgrade with the recent 0.5 release, bringing performance in line with the fastest parsers in other languages, while providing some powerful and unique features not available elsewhere, including: “perfect” column typing without needing to restart parsing at all, auto delimiter detection, automatic handling of invalid rows, and several options/layers for lazily parsing files, all with unparalleled performance (a full feature comparison with pandas and R can be found in the latest discourse announcement post). Part of CSV.jl’s speed comes from the hand-tuned type parsers made available in the Parsers.jl package, which includes extremely performant parsing for ints, floats, bools, and dates/datetimes, all in pure Julia.

Tables.jl
We’re going to take a quick detour from I/O packages here to mention a key building block in the I/O story. The Tables.jl package was born during the 2018 JuliaCon hack-a-thon. It was a collaboration between several data-related package authors to come up with essentially two access patterns for “table”-like data: “rows” and “columns”. The core idea is simple, yet powerful: any table-like data format that can implement one of the access patterns (via row iteration, or column access) can “automatically” integrate with any downstream package that uses one of the access patterns, all without needing to take bi-directional dependencies. Even for a relatively new package, the power of this interface can be seen in the integrations already available:

  • In-memory datastructures
    • DataFrames.jl
    • TypedTables.jl
    • IndexedTables.jl/JuliaDB.jl
    • FunctionalTables.jl
  • Data Formatting/Processing Packages
    • DataKnots.jl
    • FreqTables.jl
    • Mustache.jl
    • FormattedTables.jl
    • PrettyTables.jl
    • TableView.jl
    • BrowseTables.jl
  • Data File Format Packages
    • CSV.jl
    • Feather.jl
    • StataDTAFiles.jl
    • Taro.jl
    • XLSX.jl
  • Database Packages
    • ODBC.jl
    • SQLite.jl
    • MySQL.jl
    • LibPQ.jl
    • JDBC.jl
  • Statistics Packages
    • StatsModels.jl
    • MLJ.jl
    • GLM.jl
  • Plotting Packages
    • StatsMakie.jl
    • StatsPlots.jl
    • TableWidgets.jl

Ok, back to data I/O. We also have the wonderful TextParse.jl, which started as the data ingestion engine for JuliaDB.jl. While blazing fast, it doesn’t have quite the maturity or breadth of functionality compared to CSV.jl, like this long-standing issue of incorrect float parsing. The CSVFiles.jl package also exists to provide FileIO.jl integration, meaning you can just do load("data.csv") and FileIO.jl automatically recognizes the .csv extension and knows how to load the data. (A full feature comparison between CSV.jl and CSVFiles.jl can be found here). Another interesting approach to csv reading popped up recently in the form of TableReader.jl from the ever-talented Kenta Sato, using finite state machines. A pretty decent collection of benchmarks can be found here.

In addition to the excellent packages for handling text-based, delimited files, there are also some great packages for managing/handling data formats in general.

  • RData.jl: for reading .rda files from R into Julia
  • DBFTables.jl: for reading .dbf files into Julia
  • StatFiles.jl: for reading SAS, SPSS, and Stata files into Julia
  • StataDTAFiles.jl: for reading and writing Stata files
  • DataDeps.jl: for declaring data dependencies and managing reproducible setups for working with data in Julia
  • ExcelFiles.jl: for reading excel files into Julia and integration with FileIO.jl

Another popular option for data storage is binary-based formats, including feather, apache arrow, parquet, orc, avro, and BSON. Julia has pretty good coverage of these formats, including native Julia implementations in Feather.jl (and FileIO.jl integration in FeatherFiles.jl), Arrow.jl, and Parquet.jl (again, with FileIO.jl integration with ParquetFiles.jl). There’s also the BSON.jl package for binary JSON support. Beginning support for avro has been started in Avro.jl, and I’m personally interested in diving into the ORC format.

Database Support
As mentioned above in Tables.jl integrations, there are also great packages in place to support extracting data from databases, including:

  • ODBC.jl: provides generic support for any database that provides a compatible ODBC driver. Supports parameterized queries for inserting data, as well as extracting data from a query, and with Tables.jl support, “exporting” the results to any Tables.jl-compatible sink. It does require setting up an ODBC administrator tool on OSX and linux (windows has builtin support), and then proper setup/installation of the specific database’s ODBC driver, but once set up, things generally work pretty seamlessly.
  • JDBC.jl: the JDBC.jl package also provides access for databases that support JDBC access. It does so via JavaCall.jl, which requires an active JDK to work, which can also be tricky to set up, but database vendors tend to have pretty good support for JDBC drivers.
  • MySQL.jl/LibPQ.jl: specific packages for mysql/postgres databases, respectively. Both provide integration with Tables.jl and aim to provide support for database-specific types and functionality (beyond the more generic interfaces of ODBC/JDBC). They can be easier to set up since there’s no “middle man” interface library; you just interact with the database libraries directly.
  • SQLite.jl: a library providing integration with the excellent sqlite database; supports in-memory databases as well as file-based. Supports parameterized queries and Tables.jl integration for loading data and extracting query results (a short sketch follows this list).
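As a quick example of that last one, here’s a sketch assuming a recent version of SQLite.jl (table and column names are made up):

using SQLite, Tables

db = SQLite.DB()  # in-memory database

# load any Tables.jl-compatible source; a NamedTuple of vectors is a valid table
SQLite.load!((a = [1, 2, 3], b = ["x", "y", "z"]), db, "mytable")

# parameterized query; the result satisfies the Tables.jl interface
results = SQLite.DBInterface.execute(db, "SELECT * FROM mytable WHERE a > ?", (1,))
cols = Tables.columntable(results)  # or materialize into any other sink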

Data Processing: what can I do with my data once it’s in Julia?

Once you identify the right I/O package for reading your data, you then have a choice to make with regards to what you need/want to do with it. Clean it? Reshape it? Filter/calculate/group/sort it? Just browse around for a little while? Do advanced statistics, model training, or other machine learning techniques? Here we take a tour of the current landscape of packages that provide various types of functionality with regards to data processing, including table-like structures, query functionality, and statistics/machine learning.

Table Structures
Yes, in Julia we can have an entire section where we talk about table structures, plural. Sometimes when I mention to people that there are several dataframe-like packages in Julia, they are initially confused: why does Julia need more than one table package? Why doesn’t everyone just work on a single package to focus on quality over quantity? My most common response goes something like this: for a project like Pandas, or data.table, or data.frame, most of the actual code is written in what language? Python? Or R? It’s actually C/C++. The common belief is that because R/Python are dynamic, “high-level” languages, they can’t be fast, but that’s no worry, you just write the parts that need to be fast in C. But therein lies a core issue: for languages like Python and R, where the user-to-developer ratio is so high, having core pieces of functionality written in a lower-level language like C makes the code much less accessible to someone who wants to contribute, let alone just take a peek under the hood to see how things work. It automatically splits the code into a lower-level “black box” component and a higher-level “wrapper” component. I believe it stunts innovation due to such a high “barrier to entry” for taking a new approach to a table structure. I might be a budding R or Python user, getting into developing things, but shoot, if I want to make a meaningful contribution to Pandas or data.table, all of a sudden I’m wading into deep make/cmake/build issues, compiler versions, and manual memory management that I’ve never dealt with before. The other alternative is to write something in pure R/Python, which will surely be doomed to performance issues, regardless of how novel my interfaces or APIs might be.
Julia, however, is a famously declared solution to this “two-language problem”. No longer do budding developers need to fear plain for-loops, or vectorize every operation, or rely on C/C++ for the “core stuff”. You can just write plain Julia, down in the trenches, and up at the highest-level user APIs. Julia, all the way down.
I firmly believe this has led to greater innovation in Julia for experimenting with unique table structures, as well as making packages more accessible to those hoping to contribute.
Ok, enough soap-boxing, let’s talk packages.

DataFrames.jl
The DataFrames.jl package is one of the very oldest packages in the Julia ecosystem. This tenure and naming proximity with its cousins in Pandas and R have also made it one of the most popular packages for those hoping to give Julia a try. The package has evolved quite a bit since its early days, and is rapidly approaching its own 1.0 release (expected around JuliaCon 2019). Development over the last year or two has focused on core performance, safety of APIs, and overall consistency with Base APIs. The amount of thought, effort, discussion, and documentation by numerous collaborators makes it the most mature “table” package in my opinion. So what’s unique about DataFrames? I’ll try to give what I deem to be notable highlights of how DataFrames.jl approaches representing tabular data:

  • A DataFrame stores columns internally as a Vector{AbstractVector}; but wait, you might ask, isn’t that type unstable (since we’re essentially lumping all columns, regardless of individual column type, together as AbstractVector)? Yes! And on purpose! Experienced Julia developers are quick to point out that sometimes code can get “overly typed”, leading to “compilation overdrive”, where the compiler has to generate very specialized code for every operation, with compiled code reuse rare. A DataFrame can represent any number of columns, with any combination of column types, so it’s a natural scenario where you may want to “hide” type information from the compiler by slapping on the lowest common denominator abstract type as the type label (AbstractVector in this case). This design decision has certainly been extensively discussed, but remains as-is, being a more compiler-friendly option than other “strongly typed table” types (the function-barrier sketch after this list shows the pattern in miniature).
  • DataFrames.jl includes specialized subtypes for representing SubDataFrames and GroupedDataFrames, as opposed to returning full DataFrames; these “lazy” structures are mostly used in intermediate operations and are a useful way to avoid unnecessary data copying/movement
  • Core manipulation operations included in the package itself include grouping, joining, and indexing; in particular, supporting column indexing via regex matches, functional selecting, Not indexing (inverted indices), and flexible interfaces similar to Base arrays in terms of filtering/selecting specific indices for rows or columns.
  • A lot of work in recent years has also gone into simplifying and moving code out of the DataFrames.jl package, to focus on the core types and functionality and reduce the dependency burden, DataFrames.jl being a common dependency for packages around the ecosystem; this has notably included the creation of the DataFramesMeta.jl package to support other common query/filter/manipulation operations
  • DataFrames.jl supports the Tables.jl interface, which means any I/O package also supporting it can automatically convert its table format into an in-memory DataFrame, and convert back to the format for output
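A quick sketch of that first design point (toy code, not DataFrames.jl internals): concrete column types are hidden behind AbstractVector, and a “function barrier” recovers specialized performance per column:

# column types deliberately hidden from the container's type
cols = AbstractVector[[1, 2, 3], [1.0, 2.0, 3.0]]

# function barrier: dispatch recovers the concrete column type,
# so specialized code is compiled once per column type
colsum(col::AbstractVector) = sum(col)

colsum(cols[1])  # specialized for Vector{Int}
colsum(cols[2])  # specialized for Vector{Float64}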

IndexedTables.jl / JuliaDB.jl

JuliaDB.jl splashed onto the Julia scene in early 2017, touting a new approach to query/table operations that utilized type stability and Julia’s built-in parallelism to provide “out of core” functionality like a database. JuliaDB.jl’s core table type actually lives in the IndexedTables.jl package, which in turn uses the clever StructArrays.jl package to turn NamedTuple rows into an efficient struct-of-arrays structure more suitable for columnar analytics. JuliaDB.jl itself then, with the use of Dagger.jl, adds the “parallel” layer on top of IndexedTables.jl. The benefits of “type stability” come from a table being essentially encoded as Table{Col1T, Col2T, Col3T, ...}, where the full type of each column is encoded in the top-level table type. This allows operations like selecting columns, filtering, and aggregating to be extremely efficient due to the compiler knowing the exact types it’s dealing with (as opposed to having to do “runtime” checks as in the DataFrames.jl case). Now, as discussed in the DataFrames.jl section, this doesn’t come without a cost; indeed, there’s a long-standing issue for dealing with the compilation cost for tables with a large number of columns. But in the case of a manageable number of columns, the ability to scale tables up beyond a single machine’s memory limits is powerful functionality. JuliaDB.jl is also sponsored by JuliaComputing, which gives it a nice stamp of support and stability. While it may lack some of the maturity of DataFrames.jl’s long-discussed APIs, I’m excited by the unique, “Julian” approach JuliaDB.jl provides in the big data analytics space. Go check out the docs here and give it a spin. An additional package providing experimental manipulation functions for JuliaDB is JuliaDBMeta.jl, which mostly mirrors the aforementioned DataFramesMeta.jl, but for JuliaDB tables.

TypedTables.jl
The TypedTables.jl package is one that stayed in “experimentation” mode for a while until recently being declared “ready” by its primary author, the venerable Andy Ferris. Andy has long been known in the Julia community for his eye for API consistency and his ability to strike that elusive balance between theoretical ideals and practical use. In TypedTables.jl, the “fully type stable” approach is taken, similar to IndexedTables.jl/JuliaDB.jl, with a Table defined literally as <: AbstractVector{T} where {T <: NamedTuple}, that is, a collection of “rows” or NamedTuples. While TypedTables.jl includes the @Select and @Compute macros for simple manipulations, some of the more interesting promise comes from the TypedTables.jl “supporting cast” packages:

  • AcceleratedArrays.jl: a tidy package to turn any array into an “indexed” array (in the database sense), to provide optimized “search” functions: findall, findfirst, filter, unique, group, join, etc. An AcceleratedArray can thus be used in a TypedTable to provide powerful indexing behavior for an entire table (though note that it works on any AbstractArray, which means these indexed columns could even be used in a DataFrame).
  • SplitApplyCombine.jl: this package provides a powerful set of functions to perform common split, apply, and combine operations on generic collections like mapmany, group, product, groupreduce, and innerjoin. The aim is to provide the basic building blocks of relational algebra functions that work on any kind of collection, obviously including a TypedTable as a specialized type of AbstractVector of NamedTuples.

CSV.File
One more honorable mention for table structures (and another shameless plug) actually comes from the CSV.jl package. While CSV.read materializes a file as a DataFrame, a CSV.File, which supports all the same keyword arguments as CSV.read, can be used to transfer data to any other Tables.jl sink, or used directly as a table itself. It supports getproperty for column view access and iterates a CSV.Row type (which acts like a NamedTuple); a quick sketch follows. I think as packages continue to evolve, we’ll see more and more cases of customized structures like this, which allow for certain efficiencies or specialized views into raw data formats. And with sufficiently general interfaces like Tables.jl and SplitApplyCombine.jl, users won’t need to worry as much about conforming to a single table structure for everything, but can focus on understanding more generic interfaces, using data structures optimized for specific use-cases, data formats, and workflows.
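The sketch (the file name is made up, and assumes a column named a):

using CSV

f = CSV.File("data.csv")  # accepts the same keyword arguments as CSV.read

col = f.a         # getproperty access to column :a
for row in f      # iterates CSV.Row, which acts like a NamedTuple
    @show row.a
end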

Query.jl
Another package (set of packages really) that must be discussed in the Julia data ecosystem is Query.jl. Pioneered by David Anthoff, Query.jl provides a LINQ implementation for Julia. Query.jl and its sister package QueryOperators.jl provide a custom “query dsl” via a set of macros that allow convenient “query context” syntax for common manipulation tasks: selection and projection, filtering, grouping, and joining. These “query verbs” are also able to operate on any iterator, in true LINQ fashion, which makes the processing functions extremely versatile. While Query.jl/QueryOperators.jl currently hold the sole implementations of the query verbs, the grander scheme of having a custom dsl is the ability to represent entire queries in an AST (abstract syntax tree), which could then allow custom implementations that “execute” a structured query. This is most immediately useful when one considers being able to use a single “query dsl” to operate on both DataFrames and database tables, having a query translated into a vendor-specific SQL syntax. While not currently fully fleshed out, the ambitious undertaking is exciting to track.

DataValues.jl
One note for users on the use of Query.jl is the current reliance on the DataValues.jl package to represent missing data. What that means is that query functions in Query.jl/QueryOperators.jl aren’t integrated with the Base-builtin representation of missing, but rely on the DataValues.jl package, which defines a DataValue{T} wrapper type that holds whether a value is missing or not (along with what “type” of missing value it is). While the history of missing data in Julia is long and storied, Andy Ferris wisely noted that no missing value representation is perfect. missing was included in Base largely due to the ease of working with a single sentinel value, and compiler support for code generation involving Union{T, Missing}. Inherent in the use of Union{T, Missing}, however, is a current compiler complexity involving inference of Union values that are stored in a parametric struct field. This affects the @map macro in Query.jl with NamedTuple inputs to NamedTuple outputs; hence, Query.jl relies on the use of DataValue{T} to more conveniently pass type information through projections. There are also active efforts to explore ways the core language can avoid the need for an explicit wrapper type while still propagating the type information soundly through projections.

Additional Data Efforts

Other efforts integrating with the data ecosystem include (but are certainly not limited to):

  • StatsModels.jl: for specifying (in familiar “formula” notation), fitting, and evaluating statistical models; can operate on any Tables.jl-compatible source
  • GLM.jl for working specifically with generalized linear models
  • LightQuery.jl: another budding approach to type-stable querying capabilities
  • Flux.jl for an extremely extensible approach to machine learning modelling, GPU integration, and pure Julia, all the way down
  • Distributions.jl: for sampling, moments, and density/mass functions for a wide variety of distributions
  • StatsMakie.jl for GPU-enabled statistical plotting goodness
  • MLJ.jl: another new holistic approach to representing a variety of machine learning models sponsored by the Alan Turing Institute
  • ScikitLearn.jl for Julia access to the scikit-learn APIs for machine learning, including pure Julia implementations and integration with python models via PyCall.jl
  • OnlineStats.jl: a mature, fully featured statistical package for “online” statistical algorithms, including well-documented source code for published algorithms
  • TextAnalysis.jl: providing algorithms, statistical support, and feature engineering for text analysis
  • Dagger.jl: a dask-like Julia framework for distributed, parallel computation graphs
  • MultivariateStats.jl: stats, but for multiple variables!
  • RCall.jl: package that allows integrating with the R statistical language; transfer objects between languages, call R functions/libraries, etc.
  • StatsPlots.jl: another strong statistical plotting package

Future of Data in Julia

So now JuliaCon 2019 is upon us and we have to wonder: what’s next for working with data in Julia? As I’ve tried to illustrate in this long showcase of Julia packages, the data ecosystem has come a long way since the official 1.0 release of the language itself. Support for the most common data formats is about as mature as in any other language, but there’s always room to improve. The in-memory processing space is an exciting one to watch in Julia due to the number of approaches being fleshed out, with varying degrees of maturity. DataFrames.jl is solid and should be a main utility knife for any Julia programmer, but one of the wonders of Julia, as mentioned above, is the ease of developing high-level, performant solutions in the language itself, so it’s exciting to see alternative approaches that can offer trade-offs or additional features that may enable better workflows depending on the environment. But given all that, here’s a shortlist of things swimming in my head around the future of the data ecosystem in Julia:

  • Ensure the performance and usability of Union{T, Missing} to represent missing data in Julia; currently, 95% of uses and workflows work amazingly well, but we always want to track down corner cases and do everything we can to improve the compiler, core language, or APIs to improve
  • Data format support: the job is never done here, but on my mind are an officially blessed (and integrated) implementation of apache arrow, write support for Parquet.jl, and a Julia package for supporting the ORC data format
  • Working towards a common API package/definitions for common table processing tasks; while exploratory efforts are always encouraged, it could be immensely convenient to users of various table types if there were a common set of operations that “just worked”. While Query.jl currently provides the best solution for this, it has other integration issues in the ecosystem; so working to resolve those or define a new common “table operations” type package
  • Relatedly, defining a full “structured query graph” model could be one way to provide a lower-level shared representation of querying tasks. This could catalyze “frontend” efforts (custom DSLs like Query.jl’s, or new dplyr-like verbs, or even a plain SQL parsing package) to “lower” to this common representation, while allowing similar types of backend innovation in *how* these structured query graphs are executed (with parallel support, out-of-core, etc.). I’ve recently been studying an old effort to do something like this for inspiration

For those who made it this far, kudos! As always, follow and ping me on twitter to chat data in Julia. And if you’ll be at JuliaCon 2019 in Baltimore, hit me up in the official Julia slack to meet up and chat.

Generated constant propagation; hold on to your britches!

In a recent JuliaLang Discourse post, OP was looking for a simple way to convert their Vector{CustomType} to a DataFrame, treating each CustomType element as a “row”, with the fields of CustomType as fields in the row. The post piqued my interest, given my work on the Tables.jl interface over the last year. Tables.jl is geared specifically towards creating generic access patterns to “table” like sources, specifically providing Tables.rows and Tables.columns as two ways to access table data from any source. Tables.columns returns a “property-accessible object” of iterators, while Tables.rows returns the dual: an iterator of “property-accessible objects” (property-accessible object here is defined as any object that supports both propertynames(obj) and getproperty(obj, prop), which includes all custom structs by default).

I decided to respond by showing how simple it can be to “hook in” to the Tables.jl world of interop; my code looked like:

using DataFrames, Random, Tables

struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

Tables.istable(::Type{Vector{Mine}}) = true
Tables.rowaccess(::Type{Vector{Mine}}) = true
Tables.rows(x::Vector{Mine}) = x
Tables.schema(x::Vector{Mine}) = Tables.Schema((:a, :b), Tuple{Int, Float64})

Random.seed!(999)
v = [Mine(rand(1:10), rand(), rand(2,2)) for i ∈ 1:10^6];
df = DataFrame(v)

But when doing a quick benchmark of the conversion from Vector{Mine} to DataFrame, it was 3x-5x slower than other methods already proposed. Huh? I started digging. One thing I noticed was that by ***not*** defining Tables.schema, the code was a bit faster! That led me to assume something was up in our schema-based row-to-column code.

Part of the challenge with this functionality is wrestling with providing a fast, type-stable method for building up strongly-typed columns for a set of rows with an arbitrary number of columns and types. Just iterating over a row in a for-loop, accessing each field programmatically, can be disastrous: because each extracted field value in the hot for-loop could be any number of types, the compiler can’t do much more than create a Core.Box to put arbitrary values in, then pull them out again for inserting into the vector we’re building up. Yuck, yuck, yuck; if you see Core.Box objects in your @code_typed output, or values inferred as Any, it’s an indication the compiler just couldn’t figure out anything more specific, and in a hot for-loop called over and over and over again, any hope of performance is toast.

Enter meta-programming: Tables.jl tries to judiciously employ @generated functions to help deal with the issue here. Specifically, when doing this rows-to-columns conversion with a known schema, Tables.eachcolumn can take the column names/types as input and generate a custom function with the aforementioned “hot for-loop” unrolled. That is, instead of:

for rownumber = 1:number_of_rows
    row = rows[rownumber]
    for i = 1:ncols
        column[i][rownumber] = getfield(row, i)
    end
end

For a specific table input’s schema, the generated code looks like:

for rownumber = 1:number_of_rows
    row = rows[rownumber]
    column[1][rownumber] = getfield(row, 1)
    column[2][rownumber] = getfield(row, 2)
    # etc.
end

What makes this code ***really*** powerful is the compiler’s ability to do “constant propagation”, which means that, while compiling, when it encounters getfield(row, 1), it can realize: “hey, I know that row is a Mine type, and I see that they’re calling getfield to get field 1, and I happen to know that field 1 of Mine is an Int64, so I’ll throw that information into dataflow analysis so things can get further inlined/optimized”. It’s really the difference between the scenario I described before, where the compiler has to treat the naive for-loop values as Any, vs. recognizing that it can figure out the exact types to expect, which makes the operation we’re talking about here boil down to a very simple “load”, then “store”, without needing to box/unbox or rely on runtime dynamic dispatch.

Back to the original issue: why was this code slower than the unknown-schema case? Well, it turns out that our generated code was actually trying to be clever by calculating a run-length encoding of consecutive types in a schema, then generating mini type-stable for-loops; i.e. the generated code looked like:

for rownumber = 1:number_of_rows
    row = rows[rownumber]
    # if columns 1 & 2 are Int64
    for col = 1:2
        column[col][rownumber] = getfield(row, col)
    end
    # column 3 is a Float64
    for col = 3:3
        column[col][rownumber] = getfield(row, col)
    end
    # columns 4 & 5 are Int64 again
    for col = 4:5
        column[col][rownumber] = getfield(row, col)
    end
    # etc.
end

While this is clever, and can generate less code when schema types tend to be “sticky” (i.e. lots of the same type consecutively), the problem is in the getfield(row, col) call. Even though each getfield call ***should*** be type-stable, as calculated from our run-length encoding, the compiler unfortunately can’t figure that out inside these mini for-loops. This means no super-inlining/optimizations, so things end up being slower.

So what’s the conclusion here? What can be done? The answer was pretty simple: instead of trying to be extra clever with run-length encodings, if the schema is small enough, just generate the getfield calls directly for each column. The current heuristic is that for schemas with fewer than 100 columns, the code will be generated directly, and for larger schemas, we’ll try the run-length encoding approach (since it’s still more efficient than the initial naive approach).
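To illustrate the direct-generation approach, here’s a minimal sketch (a toy, not the actual Tables.eachcolumn implementation):

# generate one getfield call per field, with literal field indices,
# so constant propagation can infer each field's concrete type
@generated function setrow!(cols, row::T, rownumber::Int) where {T}
    exprs = [:(cols[$i][rownumber] = getfield(row, $i)) for i = 1:fieldcount(T)]
    return quote
        $(exprs...)
        return nothing
    end
end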

Hope you enjoyed a little dive under the hood of Tables.jl and some of the challenges the data ecosystem faces in the powerful type system of Julia. Ping me on twitter with any thoughts or questions.

Cheers.

Tricksy Tuple Types…

Some of you may be aware of my obsession with JSON libraries, and it’s true, there’s something about simple data formats that sends my brain into endless brainstorming of ways to optimize reading, writing, and object-mapping in the Julia language. JSON3.jl is my latest attempt at a couple of new ideas for JSON <=> Julia fun. The package is almost ready for a public release, and I promise I’ll talk through some of the fun ideas going on there, but today, just wanted to point out a tricky performance issue that took a bit of sleuthing to track down.

Here’s the scenario: we have a string of JSON like {"a": 1}, super simple right? In the standard Julia JSON.jl library, you just call JSON.parse(str) and get back a Dict{String, Any}. In JSON3.jl, we have a similar “plain parse” option which looks like JSON3.read(str), which returns a custom JSON3.Object type that I can talk about in another post in more detail. Another option in JSON3.jl is to do JSON3.read(str, Dict{String, Any}); i.e. we can specify the type we’d like to parse from any string of JSON. While doing some quick benchmarking to make sure things looked reasonable, I noticed JSON3.jl was about 2x slower than both JSON.parse and JSON3.read(str, Dict{String, Int}). Hmmm, what’s going on here??

I first turned to profiling, and used the wonderful StatProfilerHTML.jl package to inspect my profiling results. That’s when I noticed around ~40% of the time was spent on a seemingly simple line of code: a return statement with a simple ifelse call.

Hmmmm……a return statement with a simple ifelse call? Seems fishy. Luckily, there’s a fun little project called Cthulhu.jl, which allows debugger-like “stepping” functionality with Julia’s unparalleled code inspection tools (@code_lowered, @code_typed, @code_llvm, etc.). As I “descended into madness” to take a look at the @code_typed of this line of code, I found this:

%1865 = (JSON3.ifelse)(%1864, %1857, %1851)::Union{Float64, Int64}
%1866 = (Core.tuple)(%1853, %1865)::Tuple{Int64,Union{Float64, Int64}}

Ruh-roh Shaggy…….the issue here is this Tuple{Int64,Union{Float64,Int64}} return type. It’s not concrete, and it leads to worse type inference in later code that tries to access this tuple’s second element. It’s also undesirable because we know the value should be either an Int64 or a Float64, so ideally we could structure things so that code generation just does a single branch and generates nice clean code the rest of the way down.
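A sketch of the shape of the change (variable names are made up; the actual JSON3.jl line differs):

# before: the second tuple element infers as Union{Int64, Float64}
#     return pos, ifelse(isint, intval, floatval)
# after: an explicit branch lets each return path infer a concrete tuple type
if isint
    return pos, intval    # inferred as Tuple{Int64, Int64}
else
    return pos, floatval  # inferred as Tuple{Int64, Float64}
end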

Let’s take another cthulic descent and check out the generated code:

%1863 = (%1857 === %1862)::Bool
│ │ @ float.jl:484 within `==' @ float.jl:482
│ │┌ @ bool.jl:40 within `&'
│ ││ %1864 = (Base.and_int)(%1861, %1863)::Bool
│ └└
└──── goto #691 if not %1864
@ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:330 within `read' @ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:99
690 ─ %1866 = (Core.tuple)(%1853, %1857)::Tuple{Int64,Int64}
└──── goto #693
@ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:330 within `read' @ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:101
691 ─ %1868 = (Core.tuple)(%1853, %1851)::Tuple{Int64,Float64}
└──── goto #693

Ah, much better! Though there are a few more steps, we can now see we’re getting what we’re after: our return type will be Tuple{Int64,Int64} or Tuple{Int64,Float64} instead of Tuple{Int64,Union{Int64,Float64}}. And the final performance results? Faster than JSON.jl!

Thanks for reading and I’ll try to get things polished up in JSON3.jl soon so you can take it for a spin.

Feel free to follow me on twitter, ask questions, or discuss this post there 🙂

Cheers.