Transcription restarts corner cases difficult to handle with combined transcription+translation #117

Open
angrave opened this issue Oct 5, 2020 · 0 comments
angrave commented Oct 5, 2020

A transcription update will fail (by design) if there are multiple Transcription entities for the same language and video.
Some comments:

Most of this invalid data is 2020 data, but we should address the whole dataset and then implement a constraint, together with adding a new column, e.g. "source" or "kind", to allow multiple transcripts per video.
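As a rough sketch of the cleanup step, the snippet below finds every (video, language) pair that has more than one Transcription entity. The tuple layout and the names `transcription_id`, `video_id` and `language` are assumptions for illustration, not the actual schema.

```python
from collections import defaultdict


def find_duplicate_transcriptions(rows):
    """Return {(video_id, language): [transcription ids]} for pairs with 2+ entries.

    `rows` is an iterable of (transcription_id, video_id, language) tuples.
    These field names are hypothetical, not the real column names.
    """
    groups = defaultdict(list)
    for transcription_id, video_id, language in rows:
        groups[(video_id, language)].append(transcription_id)
    # Keep only the pairs that would violate a one-transcription-per-language rule.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample data: video v1 has two en-US transcriptions.
rows = [
    ("t1", "v1", "en-US"),
    ("t2", "v1", "en-US"),
    ("t3", "v1", "es"),
    ("t4", "v2", "en-US"),
]
duplicates = find_duplicate_transcriptions(rows)
print(duplicates)
```

Once a "source" or "kind" column exists, the grouping key would presumably grow to include it, so that the constraint still permits several transcripts per video.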

The new implementation looks for the min and max across all languages, i.e. find the last caption for each language, determine the earliest of these, and trim the audio from there. The max time is then used to ensure captions are only added once we reach an unprocessed time for that particular language.
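A minimal sketch of that restart logic, under one reading of the description: the restart point is the earliest of the per-language last caption times, and each language's own last caption time acts as the threshold below which new captions are skipped. The function names, the dict layout and the use of seconds are assumptions, not the project's code.

```python
def plan_restart(last_caption_end):
    """Compute where to restart and the per-language thresholds.

    `last_caption_end` maps language -> end time (seconds) of the last stored
    caption for that language. Hypothetical structure for illustration only.
    """
    if not last_caption_end:
        return 0.0, {}
    # The language that lags the most decides where the audio is trimmed.
    restart_from = min(last_caption_end.values())
    # Each language keeps its own "already processed up to" time.
    return restart_from, dict(last_caption_end)


def should_store(language, caption_begin, thresholds):
    """Only store a caption once we reach unprocessed time for that language."""
    return caption_begin >= thresholds.get(language, 0.0)


# Made-up example: the Spanish translation stopped at 420 s, the others at 600 s.
restart_from, thresholds = plan_restart({"en-US": 600.0, "es": 420.0, "zh-Hans": 600.0})
print(restart_from)                                # 420.0
print(should_store("es", 425.0, thresholds))       # True: new for Spanish
print(should_store("en-US", 425.0, thresholds))    # False: English already has this
```

With these inputs the audio between 420 s and 600 s is transcribed again, but only the lagging language gains new captions from it.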

This is useful because sometimes one particular language is lagging, e.g. it stopped when one translation never arrived.

However, some videos have large portions of time where there is no transcription (event=NOMATCH). We don't want to transcribe that audio again when we do a restart, but simply recording the lastsuccesstime is insufficient.
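To illustrate the gap, here is a small worked example under the assumption that lastsuccesstime only advances when a caption is produced, so NOMATCH regions do not move it. The event tuples and the "MATCH"/"NOMATCH" strings are made up for the example and do not reflect the actual event format.

```python
def restart_points(events):
    """Return (last_success_time, last_processed_time) in seconds.

    `events` is a list of (begin, end, kind) tuples where kind is "MATCH" or
    "NOMATCH". This layout is a hypothetical stand-in for the real events.
    """
    last_success = max((end for _, end, kind in events if kind == "MATCH"), default=0.0)
    last_processed = max((end for _, end, _ in events), default=0.0)
    return last_success, last_processed


# Captions for the first 5 minutes, then 25 minutes of NOMATCH before a failure.
events = [
    (0.0, 120.0, "MATCH"),
    (120.0, 300.0, "MATCH"),
    (300.0, 1800.0, "NOMATCH"),
]
last_success, last_processed = restart_points(events)
# Audio that would be sent (and paid for) again if we restart from last_success.
repeated_seconds = last_processed - last_success
print(last_success, last_processed, repeated_seconds)
```

In this made-up case, restarting from the last successful caption would resend 1500 seconds of audio that had already been processed without producing any captions.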

Also, if we add a new translation, it would start from the beginning (and use the uncorrected transcriptions).

This suggests a future design should separate the transcription from the translation. That would save some credits, since we would not pay for NOMATCH regions twice if we have to restart the task. (This would also allow translations of artificially inserted captions, e.g. [silence].)

The worst case is an hour-long silence that fails halfway through. The restart would start from the beginning. Fortunately, "ServiceTimeouts" do not seem to occur if there are no transcriptions to translate.

@angrave angrave self-assigned this Oct 6, 2020