ak.from_iter equivalent to convert from various Julia types into AwkwardArray. #10

Closed
Tracked by #5
jpivarski opened this issue Aug 11, 2023 · 10 comments · Fixed by #19

@jpivarski
Member

No description provided.

@Moelf
Member

Moelf commented Aug 11, 2023

at least for the fallback implementation, you can rely on keys() being implemented for each item in the "iterator"

julia> a = (x = 1, y=2)
(x = 1, y = 2)

julia> keys(a)
(:x, :y)

julia> b = Dict("x" => 1, "y"=>2)
Dict{String, Int64} with 2 entries:
  "x" => 1
  "y" => 2

julia> keys(b)
KeySet for a Dict{String, Int64} with 2 entries. Keys:
  "x"
  "y"

most likely you will get an iterator of dict-like or named-tuple like objects
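
A minimal sketch of that fallback, assuming every item supports `keys` and `getindex`. The helper name `columns_from_rows` is hypothetical and is not part of AwkwardArray.jl; it only shows that NamedTuple rows and Dict rows can go through the same code path:

```julia
# Hypothetical helper (not an AwkwardArray.jl function): gather an iterator of
# dict-like or NamedTuple-like rows into per-key columns, relying only on
# keys() and getindex being defined for each row.
function columns_from_rows(rows)
    columns = Dict{Symbol,Vector{Any}}()
    for row in rows
        for k in keys(row)
            # Normalize String and Symbol keys to Symbol; create the column on first use.
            column = get!(() -> Any[], columns, Symbol(k))
            push!(column, row[k])
        end
    end
    return columns
end

rows = [(x = 1, y = 2), Dict("x" => 3, "y" => 4)]
cols = columns_from_rows(rows)
```

Here `cols[:x]` collects the `x` values from both kinds of rows. A real implementation would also need to handle missing keys and infer element types, which this sketch ignores.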

@jpivarski
Member Author

That's good to know.

I've also been wondering about structs: they're somewhat like Python's dataclass and Scala's case classes in their regularity, and since reflection is available, maybe I could get the field names and types, and thus ingest arrays of these classes. What I'm thinking of probably has to be a macro operating on types, whereas keys operates on values.

I'm planning on writing symmetric from_awkward, to_awkward today. These would be converting between heap-based Julia Vectors and structs (maybe Dicts, but Dicts ought to go to "sorted_map" in Awkward, not records) and Awkward Arrays. I know that StructArray also does an array-of-structs to struct-of-arrays conversion, but this is for the Awkward format and more data types.

@Moelf
Member

Moelf commented Aug 11, 2023

> maybe I could get the field names and types,

nah, no macro needed:

julia> a = 1 + 3im;

julia> dump(a)
Complex{Int64}
  re: Int64 1
  im: Int64 3

julia> fieldnames(typeof(a))
(:re, :im)

julia> fieldtypes(typeof(a))
(Int64, Int64)

however, you might want to use propertynames instead. Say you have a dataframe: df.col1 is lowered to getproperty(df, :col1) rather than getfield(df, :col1), because you may not want to store every user-facing "access" as a real field.

In general, for simple structs (this is default if you don't specialize anything), properties and fields are the same thing; and for complex structs, properties are probably the user-facing one as far as data is concerned anyway.

julia> propertynames(a)
(:re, :im)

julia> df = DataFrame(x=[1,2,3], y=[4,5,6]);

julia> fieldnames(typeof(df))
(:columns, :colindex, :metadata, :colmetadata, :allnotemetadata)

julia> propertynames(df)
2-element Vector{Symbol}:
 :x
 :y
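
As a rough sketch of how that could be used to ingest a vector of structs without a macro: the `Point` type and the `struct_columns` helper below are hypothetical names made up for this illustration, and the code assumes a non-empty vector whose items all share the same properties.

```julia
# Hypothetical record type, used only for this illustration.
struct Point
    x::Float64
    y::Int
end

# Hypothetical helper: convert a non-empty vector of structs into a NamedTuple
# of column vectors, going through properties rather than fields.
function struct_columns(items::AbstractVector)
    # propertynames may return a Tuple or a Vector of Symbols, so normalize to a Tuple.
    props = Tuple(propertynames(first(items)))
    columns = Tuple([getproperty(item, p) for item in items] for p in props)
    return NamedTuple{props}(columns)
end

points = [Point(1.5, 1), Point(2.5, 2)]
cols = struct_columns(points)
```

The result has one vector per property (`cols.x` and `cols.y`), which is the struct-of-arrays layout a record-like array would need.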

For any table-like back-and-forth translation, we simply need to implement a few interfaces from Tables.jl, and then we will be able to interoperate with almost all of the Julia table ecosystem.

@jpivarski
Member Author

Tables.jl looks like the perfect conversion target, as long as it's possible to put Tables inside of Tables, which I can experiment with. That would nicely narrow the scope from "all Julia objects" to "dataset-like Julia objects."

@Moelf
Member

Moelf commented Aug 11, 2023

> as long as it's possible to put Tables inside of Tables

I guess? I think because "a column of the table" can be anything <:AbstractVector, you can put a table-like object there as long as length, getindex, etc. make sense. It's just that conceptually we should probably have only one top-level type; that reduces confusion for both people and the type system (imagine implementing conversion by recursive dispatch or something).

@jpivarski
Member Author

I think that would be possible, but it would be complicated to set up and should maybe be a stretch goal. I'll go back to the idea of converting to and from generic data, which doesn't have "row ###, column ###" concepts built-in. (It's possible to slice columns through nested lists, so I think we would always be able to give an answer for a requested row and column number—but later.)

@Moelf
Member

Moelf commented Aug 11, 2023

I think RecordArray is unambiguously a ColumnTable, right? You simply return each column when asked. (No copying is needed.)

Other Tables-compatible implementations won't complain at all about what column you give them, as long as it satisfies those very basic properties.

@Moelf
Member

Moelf commented Aug 11, 2023

but yeah, don't worry about it for now, I can give it a try later if you want!

@jpivarski
Member Author

The biggest mismatch to the Table data model is that an Awkward Array without any RecordArrays (at any level of nesting) has no columns. When interacting with Arrow, I solved that problem by making the column named "" (empty string) the single column for these sorts of arrays. I tried some variations on

julia> Symbol("")
Symbol("")

julia> :("")
""

and wasn't convinced I was getting empty-string symbols. (If Symbol has a [a-zA-Z_][a-zA-Z0-9_]* naming constraint, then that's not an option.)

But the reason that I think Table's model of accessing everything by (row, column) coordinates (e.g. random access column on an AbstractRow from the iterator) would work is that column access can "pass through" row access. That is, __getitem__ by integer (row) is partly commutative with __getitem__ by string (column).

For some records inside of lists of lists:

>>> array = ak.Array([[[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], []], [], [[{"x": 3.3, "y": [1, 2, 3]}]]])
>>> array.show(type=True)
type: 3 * var * var * {
    x: float64,
    y: var * int64
}
[[[{x: 1.1, y: [1]}, {x: 2.2, y: [1, 2]}], []],
 [],
 [[{x: 3.3, y: [1, 2, 3]}]]]

You don't have to explicitly dig through the lists of lists to pick out one column:

>>> array["x"].show(type=True)
type: 3 * var * var * float64
[[[1.1, 2.2], []],
 [],
 [[3.3]]]

>>> array["y"].show(type=True)
type: 3 * var * var * var * int64
[[[[1], [1, 2]], []],
 [],
 [[[1, 2, 3]]]]

So if nested lists are row-oriented Tables within Tables, all with the same set of columns, asking for column x at an outer Table can forward the request inward until it finds a nested Table that has x, and selects that, as Awkward does in Python.
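
The forwarding idea can be sketched in Julia with plain nested Vectors of NamedTuples standing in for Awkward's list and record nodes. The function name `project` is hypothetical, and this is only an illustration of the recursion, not how AwkwardArray.jl implements it:

```julia
# Hypothetical helper: select a column from arbitrarily nested vectors of records.
# A record answers the request directly...
project(record::NamedTuple, name::Symbol) = getproperty(record, name)
# ...while a list forwards the request inward to each of its items.
project(list::AbstractVector, name::Symbol) = [project(item, name) for item in list]

# The same data as the Python example above.
nested = [
    [[(x = 1.1, y = [1]), (x = 2.2, y = [1, 2])], []],
    [],
    [[(x = 3.3, y = [1, 2, 3])]],
]

xs = project(nested, :x)
ys = project(nested, :y)
```

The list structure is preserved around the selected values, so `xs` has the same nesting as `array["x"]` in the Python session and `ys` has one extra level of nesting, like `array["y"]`.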

@Moelf
Member

Moelf commented Aug 11, 2023

that's indeed an empty symbol:

julia> Symbol() === Symbol("")
true
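
For what it's worth, an empty symbol also seems usable as a column name in a plain NamedTuple, which is roughly what the Arrow-style "single column named empty string" convention would need. This is only a small check in base Julia; the variable names are made up for the illustration, and whether every Tables.jl consumer accepts such a name is not something this shows:

```julia
# The empty symbol is an ordinary Symbol value.
empty_name = Symbol("")

# Build a one-column NamedTuple whose only column is named by the empty symbol.
single_column = NamedTuple{(empty_name,)}(([1.1, 2.2, 3.3],))

# The column can be fetched back by that name.
values = getproperty(single_column, empty_name)
```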

yea, I see what you mean, indeed that's too "flexible" (i.e. a "let's do something instead of erroring" philosophy) for a regular Tables.jl table.

@jpivarski jpivarski self-assigned this Aug 23, 2023