support pluggable/arbitrary (de)serialization formats/targets (CSV, JSON, YAML, etc.) #34
One way to interpret/implement Option 3 is to say all
FWIW Beacon's internal systems basically implement option 3 already to enable storing different row extensions in the same table; for that we use 3 columns:
If we were to add this as part of the specification, I think I'd only want to include the qualified string, since that contains all necessary information (and additionally the parent information, which
I added TOML serialization support to an internal Beacon package. I just had one top-level key, `schema = "myschema@1"`:

```toml
schema = "myschema@1"

[[row]]
col1 = "hi"
col2 = 5

[[row]]
col1 = "bye"
col2 = 6
```

Seems to work well enough as a simple plain-text format, assuming you're OK with the very limited type support. Having top-level keys is nice to support the metadata. I could imagine something similar working for JSON as well, and I guess YAML. Not sure about CSV; there I think the first-row thing makes more sense, since skipping the first row is something many CSV readers already know how to do (I haven't seen that option in JSON or TOML readers, on the other hand).
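The TOML layout described above can be round-tripped with Julia's `TOML` stdlib. This is a hedged sketch, not Legolas API: the `schema` top-level key and `row` array-of-tables are just the conventions from this comment.

```julia
# Sketch (not Legolas API): round-trip the "schema key + [[row]] tables"
# TOML layout from the comment above, using only the TOML stdlib.
using TOML

rows = [(col1 = "hi", col2 = 5), (col1 = "bye", col2 = 6)]

# serialize: one top-level `schema` key plus one [[row]] table per row
doc = Dict("schema" => "myschema@1",
           "row" => [Dict(string(k) => v for (k, v) in pairs(r)) for r in rows])
io = IOBuffer()
TOML.print(io, doc)
toml_str = String(take!(io))

# deserialize: recover the schema string and the row values
parsed = TOML.parse(toml_str)
parsed["schema"]          # "myschema@1"
parsed["row"][1]["col1"]  # "hi"
```

Note the "very limited type support" caveat applies here: anything that isn't a TOML-native type (string, integer, float, boolean, datetime, array, table) would need an explicit conversion step.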
Some thoughts while working on #54
I haven't actually thought this through for more than about a minute, but would it be possible to upstream the foundation of Legolas, i.e. versioned schemas for Tables.jl-compliant rows, into Tables.jl? There is also now a function
Ha, funny you say that - I kinda see Legolas as "incubating" at Beacon for now, but my fuzzy hope is that it eventually can be donated to JuliaData under a more generic/discoverable name (EDIT: I kinda like

The reason not to do that immediately, IMO, is that there are still a few important open issues (like this one) that might (might not) entail breaking changes, and I don't see much reason to go through the churn of moving it / promoting it in the wider community before we figure some of those out and battle-test our solutions at Beacon.

I hadn't really considered moving this stuff into Tables.jl itself, but would be very happy if it evolves to the state where folks would find that worthwhile - I think Julia's package ecosystem would benefit from a bit more centralization lol. But my default assumption is that folks would prefer a separate package, at least to start. Maybe this is a good opportunity for the middle ground of a subpackage, though (e.g. Arrow.jl vs. ArrowTypes.jl).
xref JuliaLang/julia#47695, which could be useful here as well for obviating the need for special separate LegolasJSON / LegolasArrow packages.

Another thought, brought on by some internal Beacon stuff: often, a given (de)serialization process is informed both by the target serialization format (e.g. Arrow, JSON) and by relevant application-specific semantics (e.g. "sure, we expect you to serialize this as a string to JSON, but when you communicate w/ service

However, there might be some circumstances where this simple approach is problematic. For example, if you need to convert a given schema version's required field to an intermediate type that has the necessary overloaded serialization behaviors w.r.t. your target format (e.g. your own custom

This implies that Legolas should expose some functional pre/post-(de)serialization hooks that are orthogonal to the choice of (de)serialization format. Note that if we supported such hooks + pluggable target formats, we'd definitely be in a regime where some diagrams describing (de)serialization flow would be useful.

This also somewhat relates to another idea: different Legolas-aware systems may implement/support different subsets of semantics (e.g. schema version registration, extension, etc.), which may cause these different systems to favor different strategies for (de)serializing Legolas metadata. Regardless, we probably should just standardize a uniform "lowest common denominator" strategy for each supported target format (as we've done for Arrow), but we do have another possible option of more loosely defining a mere target interface instead ("a Legolas-aware system must define a strategy for emitting/retrieving
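To make the "hooks orthogonal to format" idea concrete, here's a minimal sketch. None of these names (`preserialize`, `postdeserialize`, `MyServiceContext`) exist in Legolas; they're placeholders showing how an application-specific transformation could run before/after whatever format-specific machinery is in play.

```julia
# Hypothetical hook pair (not Legolas API): default no-op hooks that an
# application can overload, independent of the target serialization format.
preserialize(context, row) = row      # runs before format-specific writing
postdeserialize(context, row) = row   # runs after format-specific reading

# A hypothetical application context that wants a field stringified before
# it hits any string-based format (the "serialize this as a string" case):
struct MyServiceContext end
preserialize(::MyServiceContext, row) = merge(row, (; id = string(row.id)))

row = (; id = 42, x = 1.0)
preserialize(MyServiceContext(), row)  # (id = "42", x = 1.0)
preserialize(nothing, row)             # unchanged
```

The point of dispatching on a `context` argument rather than on the format is exactly the orthogonality mentioned above: the same hook applies whether the row is ultimately headed to Arrow, JSON, or TOML.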
From the FAQ:
Legolas rows are valid Tables.jl AbstractRows, so can already be fed into any package that (de)serializes Tables.jl tables.
The only "Legolas-specific" bit one might want to be captured is the schema metadata (e.g. the fully qualified schema string), in order to enable consumers to load tables into the appropriate schema. Arrow has a convenient custom table-level metadata mechanism we use for this currently; note that regardless of what we support as a result of this issue, we could still specialize on Arrow and continue to use its custom table-level metadata instead of some worse generic fallback.
You can imagine defining a little hook in Legolas' API that enables authors to define how to (de)serialize schema metadata to any given file format.
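One possible shape for such a hook, sketched under assumptions: `embed_schema`/`extract_schema` and `MagicHeader` are invented names (nothing like them exists in Legolas), and the toy strategy shown is the newline-terminated header idea discussed in this thread.

```julia
# Hypothetical hook (not Legolas API): a generic function pair that a format
# integration could overload to say how schema metadata is written/read.
abstract type MetadataStrategy end
struct MagicHeader <: MetadataStrategy end  # schema string as a header line

embed_schema(::MagicHeader, io::IO, schema::AbstractString) =
    println(io, "#legolas-schema: ", schema)

extract_schema(::MagicHeader, io::IO) =
    chopprefix(readline(io), "#legolas-schema: ")

io = IOBuffer()
embed_schema(MagicHeader(), io, "myschema@1")
seekstart(io)
extract_schema(MagicHeader(), io)  # "myschema@1"
```

A format that has real metadata support (like Arrow's table-level metadata) would overload these to use that mechanism instead of touching the content stream at all.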
However, it'd be nice to not ever really need to build up a library of "supported formats" and instead have a generic approach. There are three ways I can imagine doing this that are fully agnostic, and one additional way that is fully agnostic to the format but still specialized on the storage system:
1. Simply write the Legolas metadata as a newline-terminated magic string/header before the table content, regardless of format. This is a little bit annoying for Legolas-unaware consumers of Legolas-written data (e.g. an arbitrary JSON reader might break), but at least it'd be pretty discoverable.
2. Always encode the Legolas metadata as the first row in the table; e.g. `(a=[1,2,3], b=[5,6,7])` would be written out like `(a=["schema@1",1,2,3], b=[nothing,5,6,7])`. This solves the portability problem of the previous approach, but would mean that we couldn't guarantee homogeneous column eltypes for the whole table for Legolas-unaware consumers that don't know to "skip" the first row.
3. Just add a whole column to the table that contains the schema string duplicated for each row. This solves the previous two problems, but is a bit wasteful - though maybe not too bad, given that the bloat is constant w.r.t. row size and is easily compressible.
4. If your target storage system supports a notion of object metadata separate from object content (e.g. S3), you can just always store the Legolas metadata there. IMO this approach is the most convenient one if you only care about specific storage systems.
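The "schema as first row" option can be sketched with plain column-based tables; a minimal illustration, assuming nothing about Legolas internals (`encode_first_row`/`decode_first_row` are invented names, and a real implementation would go through Tables.jl):

```julia
# Toy sketch of the first-row approach: prepend the schema string to the
# first column (and `nothing` to the others) on write, then strip that row
# and recover the schema on read. Note the cost mentioned above: columns
# lose their homogeneous eltypes (they become Any vectors here).
function encode_first_row(schema::String, cols::NamedTuple)
    names = keys(cols)
    vals = ntuple(length(names)) do i
        vcat(i == 1 ? Any[schema] : Any[nothing], cols[i])
    end
    return NamedTuple{names}(vals)
end

function decode_first_row(cols::NamedTuple)
    schema = first(first(values(cols)))
    stripped = map(v -> v[2:end], cols)  # "skip" the first row
    return schema, stripped
end

encoded = encode_first_row("schema@1", (a = [1, 2, 3], b = [5, 6, 7]))
# encoded.a == Any["schema@1", 1, 2, 3]; encoded.b == Any[nothing, 5, 6, 7]
schema, table = decode_first_row(encoded)
```

This mirrors the tradeoff described above: a Legolas-unaware CSV reader can still consume the output by skipping one row, but a reader expecting `Int` columns would choke on the mixed eltypes.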
(ref also #1, though that's kind of orthogonal)