Sequence motif analysis #176

Closed
4 tasks done
douweschulte opened this issue Jun 22, 2022 · 2 comments
Labels
A-html-report Area: Related to the HTML output report C-enhancement Category: New feature or request

Comments

@douweschulte
Member

douweschulte commented Jun 22, 2022

Some recent polyclonal datasets contain multiple sequences on a single template, so we need some way to find the motifs in the reads. This means that the variants have to be tracked to see which ones correlate. For ideas see: https://meme-suite.org/meme/.

A naive algorithm to create such results would be to go over all reads and combine the ones that fit together (using fuzzy matching based on the alignment) into patches of sequence. All patches supported by at least 2% of all reads (or some other cutoff) can then be presented at the right location. This would allow the user to see the bigger picture of the alignment with the number of sequencing mistakes drastically reduced.

Example reads

TEMPLATESEQUENCE
TEMPLETE
   PLATESEQ
  METING
SOME

Should compress to the following

TEMPLATESEQUENCE
TEMPLATESEQ
SOMETING
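
A minimal Python sketch of this naive patching idea, applied to the example reads above. Everything in it is an assumption made for illustration and not the project's implementation: the names `fits`, `add_read`, `consensus`, and `compress`, the fuzzy-matching rule (at most 25% mismatches over an overlap of at least 2 positions), and breaking ties in favour of the template residue. Reads are given as `(offset, sequence)` pairs, with the offset taken from the alignment.

```python
from collections import Counter

def fits(patch, offset, read, max_mismatch=0.25, min_overlap=2):
    """Fuzzy match: compare the read with the residues already seen in the patch."""
    overlap = [(offset + i, aa) for i, aa in enumerate(read)
               if offset + i in patch["columns"]]
    if len(overlap) < min_overlap:
        return False
    mismatches = sum(1 for pos, aa in overlap if aa not in patch["columns"][pos])
    return mismatches / len(overlap) <= max_mismatch

def add_read(patch, offset, read):
    """Count every residue of the read at its position on the template."""
    for i, aa in enumerate(read):
        patch["columns"].setdefault(offset + i, Counter())[aa] += 1
    patch["reads"] += 1

def consensus(patch, template):
    """Most common residue per position; ties go to the template residue if possible."""
    residues = []
    for pos in sorted(patch["columns"]):
        counts = patch["columns"][pos]
        top = max(counts.values())
        tied = sorted(aa for aa, n in counts.items() if n == top)
        template_aa = template[pos] if 0 <= pos < len(template) else None
        residues.append(template_aa if template_aa in tied else tied[0])
    return "".join(residues)

def compress(template, placed_reads, min_fraction=0.02):
    """Greedily merge reads into patches and keep patches with enough support."""
    patches = []
    for offset, read in placed_reads:
        for patch in patches:
            if fits(patch, offset, read):
                add_read(patch, offset, read)
                break
        else:  # no existing patch fits, so this read starts a new one
            patch = {"columns": {}, "reads": 0}
            add_read(patch, offset, read)
            patches.append(patch)
    cutoff = min_fraction * len(placed_reads)
    return [(min(p["columns"]), consensus(p, template), p["reads"])
            for p in patches if p["reads"] >= cutoff]

template = "TEMPLATESEQUENCE"
placed_reads = [(0, "TEMPLETE"), (3, "PLATESEQ"), (2, "METING"), (0, "SOME")]
for start, sequence, support in compress(template, placed_reads):
    print(" " * start + sequence, support)
# TEMPLATESEQ 2
# SOMETING 2
```

The greedy merge depends on read order and on the chosen thresholds, so a real implementation would likely need a more careful matching step based on the alignment scores.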

The main missing parts right now are:

  • User control over the threshold for ambiguity
  • Annotation somewhere of where the ambiguous nodes are located
  • Analysis over multiple nodes
  • Detailed information on the supporting reads (similar to Tree branch node detail page #144)
@douweschulte douweschulte added C-enhancement Category: New feature or request A-html-report Area: Related to the HTML output report labels Jun 22, 2022
@douweschulte
Member Author

douweschulte commented Oct 4, 2022

After some more discussion the goal has been rephrased as: how to correlate ambiguous positions in the final sequence for each template. This could end up looking like the graph shown below. The top line shows where in the sequence the ambiguous nodes are located, and the graph beneath it indicates with arrows how good the support is for a link between two ambiguous positions.

.................1......2.......................3.........4..........

 flowchart LR;
  A1-.->A2;
  A1==>B2;
  B1-->A2;
  A2==>A3;
  A2-.->B3;
  A3-->A4;
  A3-.->B4;
  B2-->B3;
  B3==>B4;

This could be hard to fully complete if EnforceUnique is turned on, but work like #146, #190, and #191 could also help in these cases.

The intent is to run this algorithm after the whole alignment has been done, as all positions for all reads are then known. The ambiguous positions should be identified by the code (first possibility <75% score?) and connections between positions should be found in the placed reads.
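
A rough Python sketch of these two steps, under stated assumptions: the "score" of a possibility is approximated by the fraction of placed reads showing that residue (the real scoring may differ), reads are `(offset, sequence)` pairs, and the function names `column_counts`, `ambiguous_positions`, and `link_support` are hypothetical.

```python
from collections import Counter, defaultdict

def column_counts(placed_reads):
    """Count the residues observed at every template position."""
    columns = defaultdict(Counter)
    for offset, read in placed_reads:
        for i, aa in enumerate(read):
            columns[offset + i][aa] += 1
    return columns

def ambiguous_positions(columns, threshold=0.75):
    """Positions where the most common residue stays below the threshold."""
    return [pos for pos in sorted(columns)
            if max(columns[pos].values()) / sum(columns[pos].values()) < threshold]

def link_support(placed_reads, positions):
    """Count residue pairs seen together in reads covering two adjacent ambiguous positions."""
    links = defaultdict(Counter)
    for first, second in zip(positions, positions[1:]):
        for offset, read in placed_reads:
            i, j = first - offset, second - offset
            if 0 <= i and j < len(read):  # the read covers both positions
                links[(first, second)][(read[i], read[j])] += 1
    return links

# Made-up reads, all placed at offset 0 for simplicity
example_reads = [(0, "ACDEF"), (0, "ACDEF"), (0, "AGDKF"), (0, "AGDKF"), (0, "ACDKF")]
ambiguous = ambiguous_positions(column_counts(example_reads))
links = link_support(example_reads, ambiguous)
print(ambiguous)                # [1, 3]
print(dict(links[(1, 3)]))      # {('C', 'E'): 2, ('G', 'K'): 2, ('C', 'K'): 1}
```

The counts per residue pair correspond to the arrow strengths in the graph: a pair seen in many reads would be drawn as a thick arrow, a rare pair as a dotted one.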

douweschulte added a commit that referenced this issue Oct 10, 2022
douweschulte added a commit that referenced this issue Oct 11, 2022
Added ambiguous positions annotation in sequence consensus, added ambiguity threshold to batchfile, added range warnings to batchfiles.
@douweschulte
Member Author

For analysis over multiple ambiguous nodes the following idea came to mind: the user can select a single position, which removes all traces except the ones coming from that position. The higher order traces from this position will be shown as well. To give the user some feedback on which nodes have a good level of higher order information, there should be a bar showing the sum of all higher order traces for each position, or something similar.
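
One possible reading of this idea as a Python sketch. It assumes that a "trace" is the path a single read takes through the ambiguous positions it covers, and that a trace is "higher order" when it spans at least three ambiguous positions. Both definitions, and the names `read_paths`, `traces_from`, and `higher_order_support`, are assumptions for illustration only.

```python
from collections import Counter

def read_paths(placed_reads, ambiguous):
    """For every read, list the (position, residue) pairs at the ambiguous positions it covers."""
    return [tuple((pos, read[pos - offset]) for pos in ambiguous
                  if 0 <= pos - offset < len(read))
            for offset, read in placed_reads]

def traces_from(paths, selected):
    """Keep only the traces that pass through the selected position."""
    traces = Counter()
    for path in paths:
        if len(path) >= 2 and any(pos == selected for pos, _ in path):
            traces[path] += 1
    return traces

def higher_order_support(paths, ambiguous, min_order=3):
    """Per position, the number of reads whose trace spans at least min_order ambiguous positions."""
    support = Counter({pos: 0 for pos in ambiguous})
    for path in paths:
        if len(path) >= min_order:
            for pos, _ in path:
                support[pos] += 1
    return support

# Made-up data: ambiguous positions 1, 3 and 5 on some template
ambiguous = [1, 3, 5]
example_reads = [(0, "ACDEFGH"), (0, "ACDEFGH"), (0, "AGDK"), (2, "DEFG")]
paths = read_paths(example_reads, ambiguous)
selected_traces = traces_from(paths, selected=1)
bar = higher_order_support(paths, ambiguous)
print(dict(bar))  # {1: 2, 3: 2, 5: 2}
```

Here `bar` would feed the per-position bar, and `selected_traces` holds what remains visible after the user selects position 1.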
