first draft for a potential JOSS paper #496

Saransh-cpp · 2024-08-16T12:54:04Z

Description

The draft PDF can be downloaded from - https://github.com/scikit-hep/vector/actions/runs/10420364323

JOSS paper format - https://joss.readthedocs.io/en/latest/paper.html
JOSS submission guidelines - https://joss.readthedocs.io/en/latest/submitting.html

Checklist

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't any other open Pull Requests for the required change?
Does your submission pass pre-commit? ($ pre-commit run --all-files or $ nox -s lint)
Does your submission pass tests? ($ pytest or $ nox -s tests)
Does the documentation build with your changes? ($ cd docs; make clean; make html or $ nox -s docs)
Does your submission pass the doctests? ($ pytest --doctest-plus src/vector/ or $ nox -s doctests)

Before Merging

Summarize the commit messages into a brief review of the Pull request.

codecov · 2024-08-16T13:05:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.85%. Comparing base (13a6370) to head (c66629a).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #496   +/-   ##
=======================================
  Coverage   86.85%   86.85%           
=======================================
  Files          96       96           
  Lines       11919    11919           
=======================================
  Hits        10352    10352           
  Misses       1567     1567

jpivarski

Looking good!

Are you sure you want to write the paper in the Vector repo itself, rather than as a separate repo just for the paper? How is it normally done?

jpivarski · 2024-08-16T15:04:18Z

paper/paper.md

+
+# Summary
+
+Vector algebra is a crucial component of data analysis pipelines in high energy


"Vector algebra" could mean different things to different people. For me, "algebra" makes me think "abstract algebra," so I'd be looking for addition-like and multiplication-like operators with some properties like associativity or distributivity and some sense of closure. That would lead me to think it's a linear algebra library, like BLAS. That's not what you mean here!

Depending on the reader's background, "vector" can mean

a 2D, 3D, or 4D physical space vector, like what the Vector library is actually about,

an N×1 or 1×N matrix without physical interpretation, as in ordinary linear algebra,

an N×1 or 1×N vector or covector ("1-form") of geometric algebra, which could live in a non-Euclidean metric (as our 4D Lorentz vectors already do),

a member of an infinite-dimensional space, like a Hilbert space, such as a quantum state represented by a bra <x| or ket |x>,

the input or output of a machine learning model, consisting of a fixed number of features or predictions,

the direction that an airplane is flying,

a collection-type data structure that isn't quite an array because it has variable length, like a C++ std::vector,

a collection-type data structure that isn't quite an array because it's immutable, like a Lisp vector or a vector in other functional libraries,

an organism, object, or environmental current that carries disease from one population to another,

a plasmid or virus that carries genetic material into a host cell in genetic engineering,

graphic primitives that use precisely positioned elements with infinite resolution, as in SVG or PDF file formats, rather than rasterized images like PNG or JPG,

the direction and magnitude of a literary or cultural trend, a philosophical argument or line of thought, a narrative in fiction, or an architectural design.

So you'll need to narrow in quickly and let the reader know that this is about 2D and 3D Euclidean vectors and 4D Lorentz vectors that can be used as physical quantities, such as position, momentum, and forces. Instead of "algebra," maybe use a phrase like "common operations" or "mathematical manipulations"?
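To make the Euclidean versus Lorentz distinction concrete, here is a minimal pure-Python sketch. The class names `Vec3` and `Lorentz4` are hypothetical and are not Vector's API; the sketch assumes the (+, -, -, -) metric signature and only illustrates why a Lorentz vector's "length" (its invariant mass) differs from a Euclidean magnitude.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec3:
    """3D Euclidean vector (hypothetical class, not Vector's API)."""

    x: float
    y: float
    z: float

    def mag(self) -> float:
        # Ordinary Euclidean length.
        return math.sqrt(self.x**2 + self.y**2 + self.z**2)


@dataclass(frozen=True)
class Lorentz4:
    """4D Lorentz vector, assuming metric signature (+, -, -, -)."""

    t: float
    x: float
    y: float
    z: float

    def __add__(self, other: "Lorentz4") -> "Lorentz4":
        # Component-wise addition, as for any vector.
        return Lorentz4(
            self.t + other.t,
            self.x + other.x,
            self.y + other.y,
            self.z + other.z,
        )

    def mass(self) -> float:
        # Invariant mass: t^2 minus the squared spatial magnitude.
        # The sign is preserved so spacelike vectors give a negative value.
        m2 = self.t**2 - (self.x**2 + self.y**2 + self.z**2)
        return math.copysign(math.sqrt(abs(m2)), m2)


# Two massless, back-to-back momentum vectors (t plays the role of energy).
p1 = Lorentz4(t=5.0, x=3.0, y=4.0, z=0.0)
p2 = Lorentz4(t=5.0, x=-3.0, y=-4.0, z=0.0)
total = p1 + p2
```

Each of `p1` and `p2` has zero invariant mass, while their sum does not. That behavior comes from the non-Euclidean metric and has no counterpart in `Vec3.mag()`.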

jpivarski · 2024-08-16T15:06:44Z

paper/paper.md

+physics, enabling physicists to transform raw data into meaningful results that
+can be visualized. Given that high energy physics data is not uniform, the
+vector algebra frameworks or libraries are expected to work readily on
+non-uniform or jagged data, allowing users to perform operations on an entire


A definition of "jagged" will be needed. Ever since I started using this word, I've found that "ragged" is more common. A potentially confusing thing is that sometimes a "vector" is a collection-type data structure, and what we have here is a collection that contains (non-collection) vectors.
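As a possible starting point for that definition, the sketch below shows a ragged collection in plain Python: a list of events, each holding a different number of (non-collection) 2D momentum vectors. The data and the helper name `transverse_momentum` are made up for illustration; a library like Awkward Array would store this columnarly rather than as nested lists.

```python
import math

# Each inner list is one event. Events hold different numbers of particles,
# so the structure is ragged and does not fit a rectangular array.
events = [
    [(3.0, 4.0), (6.0, 8.0)],  # event 0: two particles, each a (px, py) pair
    [],                        # event 1: no particles
    [(5.0, 12.0)],             # event 2: one particle
]


def transverse_momentum(events):
    """Compute sqrt(px^2 + py^2) per particle, keeping the ragged shape."""
    return [[math.hypot(px, py) for px, py in event] for event in events]


pt = transverse_momentum(events)
counts = [len(event) for event in pt]  # particles per event: [2, 0, 1]
```

The point of the example is that the operation applies to every vector while the per-event lengths stay as they were, which is what "working on ragged data" means here.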

jpivarski · 2024-08-16T15:09:04Z

paper/paper.md

+scientific or engineering application. The library houses 3+2 numerical
+backends for experimental physicists and 1 symbolic backend for theoretical


I'm not sure what "3+2 numerical backends" means. It could be worthwhile to present all of the numerical backends and their purposes in a bulleted list. A strong point is the diversity of backend types, from scalars (builtin), to collection types (NumPy and Awkward), to symbolic (SymPy).

jpivarski · 2024-08-16T15:12:32Z

paper/paper.md

+Vector has become the de facto library for vector algebra in Python based high
+energy physics data analysis pipelines. The library has been installed over


de facto library

That's too strong: many high energy physics data analyses use ROOT's new-style LorentzVectors or TLorentzVector. The latter has been deprecated for decades, but people still use it.

I see that you specify "Python", but there's PyROOT, so I'm not sure that the Vector uses outnumber the PyROOT-TLorentzVector uses.

It's enough to say that it's widely used, and you quote some numbers below.

jpivarski · 2024-08-16T15:29:55Z

paper/paper.md

+
+Vector has become the de facto library for vector algebra in Python based high
+energy physics data analysis pipelines. The library has been installed over
+2 million times and 314 GitHub repositories use it as a dependency at the time


Download count is a notoriously misleading metric of how often software is used. ("Notorious" because even though its problems are known, people still use it. It's hard to do anything better.) In particular,

* continuous testing frameworks and some highly parallel workloads will pip install vector or conda install vector as a first step, which inflates the numbers,
* it doesn't capture the difference between
  * users who download it once, never update versions, but use it every day and
  * (non-)users who update it daily with all the rest of the software on their computer, but never use it,
* it unintentionally captures the difference between
  * periods in which you release patches frequently (and users who live at head download every one of them),
  * periods in which there are no bugs so you don't release new versions at all (but users are still using it, all the same).

If you want to get quantitative in this paper, you might want to consider following the method of https://github.com/jpivarski-talks/2023-08-14-awkward-stats-update to

use GitHub's dependency graph to get a list of all the repos that might be using Vector,

git clone them all,

search them for "import vector": egrep -ral "(import\b.*\bvector|vector\b.*\bimport)" * --include="*.py" --include="*.ipynb",

use git log --format=%cd "$z" on all the matching files to find out when they were last touched,

make a plot like this:

This would quantify the number of direct users of Vector, the people who know that they are using it, as opposed to the people who get it through another library. Pretty soon, this won't be a good metric anymore because of indirect users through Coffea, but you could do a Vector + Coffea-vector plot then.
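The search step in the method above can also be sketched in Python, for cases where egrep is not available. This is an approximation under stated assumptions: the function name `find_vector_importers` is hypothetical, the regular expression is copied from the egrep command, and undecodable bytes are ignored as a rough stand-in for egrep's `-a` flag. It does not cover the cloning or `git log` steps.

```python
import re
from pathlib import Path

# Same pattern as the egrep command: "import" and "vector" on one line,
# in either order, each delimited by a word boundary on the inner side.
IMPORT_VECTOR = re.compile(r"(import\b.*\bvector|vector\b.*\bimport)")

# Only Python sources and notebooks, mirroring the two --include globs.
SUFFIXES = {".py", ".ipynb"}


def find_vector_importers(root):
    """Return sorted paths under `root` whose text matches the pattern."""
    matches = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SUFFIXES:
            # Ignore undecodable bytes so odd files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            # "." does not match newlines, so this is a per-line match.
            if IMPORT_VECTOR.search(text):
                matches.append(path)
    return sorted(matches)
```

Like the original pattern, this also matches lines such as `import vectorize`, so the resulting list is an upper bound on direct users and would still need a manual check.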

Another benefit of an analysis like this is that you can find out how users are using your interface—which functions they use most, and in what ways. I talked about that in this presentation and I did a similar analysis for Numba.

Thanks for the resources here! I went through them and also through several other JOSS papers. I did not see statistics in any of them, so I think I will just remove the statistics entirely.

henryiii · 2024-08-26T15:58:33Z

paper/paper.md

+Vector is currently the only Lorentz vector library providing a Pythonic
+interface but a C++ (through Awkward Array [@Pivarski:2018]) computational
+backend. Vector integrates seamlessly with the existing high energy physics


I'm not sure about this statement. NumPy is a compiled backend (though C instead of C++). And PyROOT is Python backed by C++. I think I'd rework it a bit to state what it is, and not focus on the "only".

henryiii · 2024-08-26T16:11:28Z

paper/paper.md

+
+Vector has become the de facto library for vector algebra in Python based high
+energy physics data analysis pipelines. The library has been installed over
+2 million times and 314 GitHub repositories use it as a dependency at the time


I'm not sure "inflate the numbers" is correct - The number is "downloads", not "number of users". It isn't a measure of number of users due to the issues listed above, but it does measure interest in/usefulness of the package in some form. A CI job doing something with vector still means someone is doing something with vector.

Also, on the flip side, CI jobs may use caches (uv and pixi both have fairly popular actions that cache by default), so that might hide "downloads" by not actually downloading from PyPI.

Other analyses are very useful, but having the download count is still useful as well and it wouldn't replace it, it's just a different metric.

Saransh-cpp · 2024-09-06T18:42:11Z

Thanks for the reviews, @jpivarski and @henryiii! I have made corrections, but could you please review it again whenever you get the time?

> Are you sure you want to write the paper in the Vector repo itself, rather than as a separate repo just for the paper? How is it normally done?

The JOSS submission guidelines say -

> Your paper (paper.md and BibTeX files, plus any figures) must be hosted in a Git-based repository together with your software.
>
> The paper may be in a short-lived branch which is never merged with the default, although if you do this, make sure this branch is created from the default so that it also includes the source code of your submission.

I would prefer not to merge this in and to use this PR only to review the content. Once everything is reviewed, I will submit the paper from the short-lived branch and delete the branch once the paper is published.

Saransh-cpp force-pushed the paper branch 3 times, most recently from bad61ad to c66629a (August 16, 2024 13:00)

Saransh-cpp requested review from jpivarski and henryiii August 16, 2024 13:11

jpivarski approved these changes Aug 16, 2024

henryiii reviewed Aug 26, 2024

Saransh-cpp force-pushed the paper branch 2 times, most recently from 7785117 to 459dcd7 (September 6, 2024 18:39)

Saransh-cpp requested review from henryiii and jpivarski September 6, 2024 18:42

Saransh-cpp commented Sep 10, 2024

Saransh-cpp added 6 commits September 29, 2024 11:55

first draft for JOSS paper

e6cd4d0

pre-commit

85f412a

reviews

27ee104

* better definition of vector algebra
* don't use only/de-facto - mention PyROOT, fix language
* expand on the backends
* jagged -> ragged + a definition for ragged

Mention LorentzVectors

0d2cffa

remove stats

5026b0a

Update title

47b4385

Saransh-cpp force-pushed the paper branch from 86ed1aa to 47b4385 (September 29, 2024 09:55)
