Opening a GATE XML document results in the wrong document name #62

Open
greenwoodma opened this issue Oct 9, 2018 · 12 comments
Comments

@greenwoodma
Contributor

If you rename a document to something sensible, save the document as GATE XML and then reload it, the sensible name disappears and you get the name generated from the filename again.

@ianroberts
Member

This has always been the case. It's the difference between using a datastore (which conceptually saves a document and later lets you reload that same document) and using GATE XML (where you save a representation of a document and later create a new document from that saved representation).

No other document format changes the document name as part of the parse process. Indeed, by the time the document format kicks in midway through DocumentImpl.init, it has no way of knowing whether the current name of the document is a generated one (which you could argue is reasonable to override) or one that was explicitly passed to createResource (which I'd say should be left alone).
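
To illustrate the point about generated versus explicit names, here is a minimal sketch in Python. It is a toy model, not GATE's actual classes: `ToyDocument`, `create_document` and the filename-plus-hex-suffix naming scheme are all assumptions made for illustration. It only shows that once the name is a plain string on the document, later code cannot tell how it got there.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

# Stand-in for a unique-suffix generator (an assumption, not GATE's code).
_counter = itertools.count(1)


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document; it holds only a name."""
    name: str


def create_document(filename: str, name: Optional[str] = None) -> ToyDocument:
    """Use the explicit name if one is given, else generate one from the filename."""
    if name is None:
        # Assumed naming scheme: filename plus a unique hex suffix.
        name = f"{filename}_{next(_counter):05X}"
    return ToyDocument(name=name)


generated = create_document("doc1.xml")
explicit = create_document("doc1.xml", name="doc1.xml_00001")

# Both names are plain strings; nothing records which one was generated.
print(generated.name)
print(explicit.name)
```

A parser that wanted to override only generated names would need extra state, for example a flag set at creation time, which this toy model deliberately lacks.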

@greenwoodma
Contributor Author

Hmmmmm, I'm all conflicted now as to what it should do. My problem with the current approach is that we always claim that GATE XML is a lossless representation of the document, but it turns out that when we save in GATE XML we throw away the document name (so it isn't actually available when we reload anyway).

I'm tempted to say that we should be saving the name and restoring it when using GATE XML so that we are being truly lossless, but I can see that might open up a whole other can of worms.
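
As a rough sketch of what saving and restoring the name could look like, the Python below round-trips a toy document through XML with and without a stored name. The element names, the `ToyDocument` class and both functions are hypothetical and do not reflect the real GATE XML schema or API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document with a name, text and features."""
    name: str
    text: str
    features: dict = field(default_factory=dict)


def to_xml(doc: ToyDocument, save_name: bool) -> str:
    """Serialise the document; the name is only written if save_name is True."""
    root = ET.Element("ToyDocument")
    if save_name:
        root.set("name", doc.name)
    feats = ET.SubElement(root, "Features")
    for key, value in doc.features.items():
        ET.SubElement(feats, "Feature", key=key).text = str(value)
    ET.SubElement(root, "Text").text = doc.text
    return ET.tostring(root, encoding="unicode")


def from_xml(xml: str, filename: str) -> ToyDocument:
    """Rebuild a document, falling back to the filename if no name was stored."""
    root = ET.fromstring(xml)
    features = {f.get("key"): f.text for f in root.iter("Feature")}
    name = root.get("name", filename)
    text = root.findtext("Text", default="")
    return ToyDocument(name=name, text=text, features=features)


doc = ToyDocument("A sensible name", "Some text", {"mimeType": "text/plain"})
lossy = from_xml(to_xml(doc, save_name=False), "doc1.xml")
kept = from_xml(to_xml(doc, save_name=True), "doc1.xml")

print(lossy.name)  # doc1.xml
print(kept.name)   # A sensible name
```

In this toy version the fallback to the filename is what reproduces the reported behaviour; storing the name makes the round trip return an equal document.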

@ianroberts
Member

I see where you're coming from, and GATE XML is a lot closer to being lossless now than it used to be, but there are still some things it can't represent. Object graphs are one example: two different annotations whose features refer to the same Java object will each end up pointing to a separate instance of that object when reloaded.
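
The object graph limitation can be shown with a small analogy in Python, using JSON in place of GATE XML and dictionaries in place of Java objects and annotations. The data layout is invented for illustration; the only claim is the general one that serialising by value loses shared identity.

```python
import json

# One object referenced from the features of two different annotations.
shared = {"id": 42}
annotations = [
    {"type": "Person", "features": {"entity": shared}},
    {"type": "Mention", "features": {"entity": shared}},
]

# Before saving, both features point at the very same object.
before_same = (
    annotations[0]["features"]["entity"] is annotations[1]["features"]["entity"]
)

# Serialise by value, as a text format typically does, then reload.
reloaded = json.loads(json.dumps(annotations))
a, b = (ann["features"]["entity"] for ann in reloaded)

after_equal = a == b  # the values survive the round trip
after_same = a is b   # the shared identity does not

print(before_same, after_equal, after_same)  # True True False
```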

@johann-petrak
Contributor

I do not understand what this is about: the name of a document is meta-information, not content, so it should not be part of what is saved as the document content in the first place. By default the name inside GATE reflects the name on the file system, but that is only a convention: other ways of reading in a representation of a document could set it differently, and it can be changed after the document has been loaded.
So would lossless not just refer to the content of the document, rather than the meta-information GATE keeps about the document?

@greenwoodma
Contributor Author

Maybe, but in theory so are the document features, and we save those (including things like source URL and mimetype, which I think will also be wrong). I spotted this because I changed the names from the filenames to something sensible and that information was lost. If the name isn't saved then we shouldn't claim the GATE XML format is lossless, as I think a lot of people would assume that includes its name. If we aren't saving the name (except in datastores, which see less and less use as we use larger and larger corpora), what's really the point in allowing people to change the name at all?

@johann-petrak
Contributor

I disagree: lossless compression of a file saves and restores the content of the file but does not influence your choice of how to name the uncompressed file. That is a feature, not a bug.
Even so, the file could still contain a field holding meta-information such as the original name, which would then get restored as part of the content.

I think the question about changing the name inside GATE is important: in most GUIs the name of a loaded file always reflects the file and is immutable (and also does NOT contain some auto-generated stuff at the end). But I think one could argue that allowing the name to be changed inside GATE can be useful in situations where one wants to organize things better.

What I think is relevant here is how the GATE GUI should map not just the file name to a document when loading, but also the document name to a file name when saving -- most editors do this properly but GATE does not. So if you rename the document, then save it, the convention from other editors would be to offer the new name as the default when saving.
I really think that following that convention, which is used in practically all other frameworks and editors, would be the best way to avoid confusion.
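
A minimal sketch of that convention, in Python: derive the default file name offered in a save dialog from the current document name. The function name, the set of replaced characters and the `.xml` default are assumptions for illustration, not anything the GATE GUI currently does.

```python
import re


def default_save_filename(document_name: str, extension: str = ".xml") -> str:
    """Suggest a file name for saving, based on the document's current name."""
    # Replace runs of whitespace and characters that are commonly unsafe
    # in file names with a single underscore.
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", document_name.strip()).strip("_")
    if not safe:
        # Nothing usable left, so fall back to a generic name.
        safe = "document"
    if not safe.lower().endswith(extension):
        safe += extension
    return safe


print(default_save_filename("Wikipedia document Barack Obama"))
print(default_save_filename("doc1.xml"))
```

A name that already looks like a file name is left alone, so a document that was never renamed would still be offered its original file name.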

@greenwoodma
Contributor Author

That assumes that you view a GATE document as having a one-to-one relationship with a file, and I would claim that in many cases that isn't true; tweets, for example. Yes, when you reload a GATE XML document there is a one-to-one relationship, but that is not always the case.

As I said I'm now really not sure what it should do. All I know is that we have an API that allows us to set the name of a Document object and a method of saving that object which we claim is lossless but which doesn't store the complete set of information about the object. To me that seems an odd situation to be in.

@johann-petrak
Contributor

Hmm, I think I understand this a bit better now, and I think I am starting to agree with you.

We really have two completely different kinds of names here:

  • there is an "extrinsic" name that is something outside the document pointing to the document, e.g. a file name, a URL, or the name the GATE GUI uses to make a document accessible to the user. This is a name "associated with" the document.
  • there is an "intrinsic" name that is a property of the document, and as you say it should be preserved when serialising and deserialising the document. This is a name that is part of the document.

So if we already stored the intrinsic name, we could have a document stored as doc1.xml that has the intrinsic name doc2.xml. Or one with the extrinsic name doc1.xml that has the intrinsic name "Wikipedia document Barack Obama".

But getting rid of either of the two can be extremely confusing: ignoring extrinsic and loading "doc1.xml" and getting doc2.xml would be terrible. But as you said setting the intrinsic name to "Wikipedia document Barack Obama" before saving and then getting back doc1.xml because we ignore intrinsic is also confusing.

So I think we are really mixing up two entirely different kinds of names here, aren't we? (I was only talking about the extrinsic name previously because this is the only thing other editors care about.)

The whole matter gets even more complicated in GATE since sometimes those random hex strings get appended to the name making the extrinsic and intrinsic names different even when they could easily be identical.

I think we have a number of options for dealing with the fact that there really are two names: e.g. always force the intrinsic one to mirror the extrinsic one (if there is an extrinsic one), have the GUI use and show both (e.g. one as a tooltip), always ignore one or the other, etc.
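
These options can be sketched as alternative policies for choosing what to display. The Python below is only an illustration: `NamePolicy`, `display_name` and the "intrinsic (extrinsic)" formatting are hypothetical and do not describe existing GATE behaviour.

```python
from enum import Enum
from typing import Optional


class NamePolicy(Enum):
    MIRROR_EXTRINSIC = "mirror_extrinsic"  # extrinsic name wins if there is one
    PREFER_INTRINSIC = "prefer_intrinsic"  # stored name wins if there is one
    SHOW_BOTH = "show_both"                # show both when they differ


def display_name(
    extrinsic: Optional[str], intrinsic: Optional[str], policy: NamePolicy
) -> str:
    """Pick the name to show for a document under the given policy."""
    if extrinsic is None and intrinsic is None:
        raise ValueError("a document needs at least one name")
    if policy is NamePolicy.MIRROR_EXTRINSIC:
        return extrinsic if extrinsic is not None else intrinsic
    if policy is NamePolicy.PREFER_INTRINSIC:
        return intrinsic if intrinsic is not None else extrinsic
    # SHOW_BOTH: only combine the names when both exist and differ.
    if extrinsic is None or intrinsic is None or extrinsic == intrinsic:
        return extrinsic if extrinsic is not None else intrinsic
    return f"{intrinsic} ({extrinsic})"


for policy in NamePolicy:
    print(policy.name, "->",
          display_name("doc1.xml", "Wikipedia document Barack Obama", policy))
```

A document with no file behind it, such as one created from a tweet, simply has no extrinsic name, and every policy falls back to the intrinsic one.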

Not sure how confusing the one I probably like best personally (have the GUI show and use both names) would be for the average user.

I think exactly the same problem really exists with any Resource that can be serialised: e.g. a pipeline has some intrinsic name and can be saved to a file with a completely different name (I always disliked this and tried to manually make them match to avoid confusion).

@greenwoodma
Contributor Author

Having thought about this a little further, I think what annoys me the most is the inconsistency. The name of a GATE resource is set through a param in NameBearer which means it's part of any resource. If you save a datastore or an application then the name is persisted and set correctly when reloaded. The names of PRs are also persisted correctly when saved as part of a pipeline. It's only LRs where this doesn't happen.

I'm aware that not every LR can hold the name through a save/load cycle, but when it could I think we should support that if for no other reason than consistency.

I can understand that the approach may lead to weird things happening where the name in GATE looks like a filename but doesn't reflect the name of the file on disc but that doesn't really bother me. My view is that we should actually avoid making the names of resources look like filenames by default as it's misleading, especially as in many cases they get out of sync as soon as you do a save/load cycle anyway. You also already have the sourceUrl document feature that shows you where the document was originally created from; although again this is weird with GATE XML as it shows the original URL and not the URL from which it was loaded on this occasion (should they differ), plus the mime type document feature is nearly always wrong on GATE XML documents as well.

@greenwoodma
Contributor Author

Thinking even further, we already break the link between a document object and the file from which it came by not having a "save" option, only "save as". Even if that option does offer the original filename, semantically it's not the same as simply saving back to where the document came from.

@greenwoodma
Contributor Author

And of course, even more confusingly, documents in a corpus which are stored as part of a pipeline are reloaded with the correct name if it has been changed.

@johann-petrak
Contributor

For what it is worth, the Python gatenlp package and the formats implemented by the Format_Bdoc plugin now treat the name of the document as a property of the document (and not as meta-information about the document), so the name does get saved and restored, no matter what file name is used for storing or loading.
Basically I now think that all name bearers should have their names saved, as indeed already happens in most cases anyway. Documents really are the odd one out here.


3 participants