
Author Topic: How to determine if a result is an integer? [SOLVED]  (Read 18956 times)

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #15 on: August 31, 2021, 02:38:45 pm »
@cdesim
BornAgain did not say the sequences can overlap or include each other:
*snip*
I need to find a substring here that 1) begins with the triplet ATG, ends with the triplet TAG, and is a multiple of 3.
*snip*
He also didn't say whether he is looking for the shortest or the longest possible match.
I suspect there are more domain-specific rules here that I'm not familiar with, so I can't say either.
As long as only the rules above were given, winni's solution should be OK.
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #16 on: August 31, 2021, 03:23:21 pm »
@cdesim
You're right, I sincerely apologize for my previous post. Apparently I didn't read the entire thread.

Edit: Same for my Reply #17 - it shall not be considered for the same reason.

But, we know that when comparing intervals, we can have the following cases:
* one is entirely outside the other
* they overlap
* they are nested

While the first case is clear, what about the other two cases? Are they possible in these sequences?
When they overlap, can we consider their union as a separate group?
When they are nested, which one shall be considered, the inner or the outer?
« Last Edit: August 31, 2021, 03:27:04 pm by y.ivanov »
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #17 on: August 31, 2021, 03:42:15 pm »
Yes, each class is independent. Each class will produce only one response (0 or more characters)

This is by design. You cannot mix the mod 1 with the mod 2, they may as well be on different planets. Amirite?
Sorry, I didn't get you. Classes? As of your use of 'class' in the:
Actually I asked those questions and we're interested in the longest match.

Here is another proposal:

Create a class with 2 lists, one containing the positions of ATG and the other of TAG.

We'll need 3 instances of this class, 1 for numbers where mod = 2, another for mod = 1 and finally another where mod = 0.

*snip*
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

Zvoni

  • Hero Member
  • *****
  • Posts: 2327
Re: How to determine if a result is an integer?
« Reply #18 on: August 31, 2021, 04:03:04 pm »
Chipping in.
From an algorithm's POV I'd use at least one (more likely two) loop(s) with PosEx, using the sample from #20: CTGCTAATGGTATGAGGACTTGGTAG

Outer Loop: Use PosEx to find the next occurrence of "ATG", starting from the last found "ATG" position (on the first run LastATG would be 1)
If ATGFound Then Inner Loop: Use PosEx (or maybe RegEx) to look for "TAG", starting from the position of the found "ATG"
If TAGFound Then check if Length(ResultString) mod 3 = 0 and if CurrentLength > Length of LastResult
If yes, save the result. If no, continue to look for "TAG", using the last "TAG" position as the starting point
End Inner Loop
End Outer Loop
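
The two loops outlined above can be sketched roughly as follows. This is only an illustration: the function name LongestCandidate is made up, and it checks nothing beyond the three stated conditions (starts with ATG, ends with TAG, length divisible by 3). It does not stop at an earlier in-frame TAG.

```pascal
program LongestAtgTag;
{$mode objfpc}{$H+}

uses
  StrUtils; // PosEx

// Hypothetical helper: returns the longest substring that starts with ATG,
// ends with TAG and has a length divisible by 3; '' if there is none.
function LongestCandidate(const S: string): string;
var
  P, Q, Len: SizeInt;
begin
  Result := '';
  P := PosEx('ATG', S, 1);
  while P > 0 do                      // outer loop: every ATG
  begin
    Q := PosEx('TAG', S, P + 3);
    while Q > 0 do                    // inner loop: every TAG after this ATG
    begin
      Len := Q + 3 - P;               // length including both triplets
      if (Len mod 3 = 0) and (Len > Length(Result)) then
        Result := Copy(S, P, Len);
      Q := PosEx('TAG', S, Q + 1);    // continue after the last TAG found
    end;
    P := PosEx('ATG', S, P + 1);      // continue after the last ATG found
  end;
end;

begin
  WriteLn(LongestCandidate('CTGCTAATGGTATGAGGACTTGGTAG'));
end.
```

Both searches restart one character after the previous match, so every ATG/TAG pair is examined. That is quadratic in the worst case, which should be acceptable for short sequences.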


That said: I've no idea if i just talked crap :-)
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Bart

  • Hero Member
  • *****
  • Posts: 5288
    • Bart en Mariska's Webstek
Re: How to determine if a result is an integer?
« Reply #19 on: August 31, 2021, 04:24:14 pm »
Thank you, @cdesim. Yes, I apologize for not being very clear. Let me explain. I am working with DNA sequences. A DNA sequence can look like this:

ACTGCTAATGATTTGGACTTGGTAGCGTTACCTG

I need to find a substring here that 1) begins with the triplet ATG, ends with the triplet TAG, and is a multiple of 3.

Normally there would be a sequence that starts the DNA codon sequence (like AUG, see the related forum post about decoding codons).
Once you know the start, you can just iterate 3 chars at a time to find ATG and just count triplets from there until you hit TAG.

Bart

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #20 on: August 31, 2021, 06:38:49 pm »
Now, for the question "What if TAG immediately follows ATG?" That's an excellent question. It would constitute a trivial case and such a substring would be rejected. In reality, one can expect a number of substrings that would satisfy all three conditions.
As an absolute rookie in that matter, I have the following question:
Is it possible to have the following sequence in a single reading frame (+1, +2, or +3)?
Code:
ATG ... ATG ... TAG

i.e. must every ATG be strictly followed by a TAG in a single reading frame?

I am already aware that two sequences can overlap into multiple reading frames, e.g. MT-ATP6, MT-ATP8 genes.

We usually choose the longest one, as the sequence with the highest probability of coding for a protein.
What do you mean by "longest one"? If every ATG is supposed to be strictly followed by a TAG (in a single frame), then shouldn't they be considered separate entities for processing?
Otherwise, if ATG can be followed by another ATG (in a single frame), which one is considered the "longest one"?
Code:
ATG(1) ... ATG(2) ... TAG(3) ... TAG(4)

* from (1) to (4), i.e. TAG(3) ends ATG(2)
* longer from (1) to (3), and (2) to (4), i.e. TAG(3) ends ATG(1) and TAG(4) ends ATG(2)

As I said, I'm a rookie in genetics, so forgive me for the stupid questions.  :-[
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: How to determine if a result is an integer?
« Reply #21 on: August 31, 2021, 07:32:49 pm »
Hi!

"If my grandma had wheels she was a bus" said former german chancellor Willy Brandt.

We should wait until BornAgain tells us exactly the parameters and conditions.
Especially the question of  y.ivanov

Winni

BornAgain

  • New Member
  • *
  • Posts: 34
Re: How to determine if a result is an integer?
« Reply #22 on: August 31, 2021, 09:09:59 pm »
My goodness! Needless to say, I am very happy that this has triggered so much interest among the experts. Alas, also needless to say, I am completely lost :-).

First, thanks @Kays for the codon example (although that addresses a different problem). By the way, a triplet of DNA molecules (called nucleotides) that codes for another type of molecule called amino acid is called a codon (for anyone who is interested). So, when strings of nucleotide triplets code for amino acids, this is called an ORF (or open reading frame). Note that I am simplifying this here and ignoring other details that are not relevant here.

The first triplet (codon) of an ORF (for our purposes) has to be ATG. This denotes the start of an ORF. And then there are a bunch of codons until a termination codon is reached (and there are three options here: TAG, TGA, and TAA; I had mentioned only one of them).

Next, in an actual ORF, any other ATGs found among the codons after the initial ATG and before the termination codon are just like any other codon and do not perform the role of the start codon. Also, the moment a termination codon is encountered that is in the same frame as the start codon, the ORF terminates.

So in the example given by @y.ivanov, ATG....ATG....TAG....TAG

the ORF is likely from the first ATG to the first TAG. I say "likely" because proteins are complex molecules and so one usually assumes that the longer ORF is more likely to code for a protein than a shorter one.
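
A minimal sketch of the rules above, assuming plain uppercase A/C/G/T input on a single strand. The names IsStop and ListOrfs are made up for illustration. Each of the three frames is walked codon by codon: the first ATG opens an ORF, later in-frame ATGs are ignored, and the first in-frame TAG, TGA, or TAA closes it.

```pascal
program OrfScan;
{$mode objfpc}{$H+}

// Hypothetical helper: True if the three letters at index I form a stop codon.
function IsStop(const S: string; I: Integer): Boolean;
var
  C: string;
begin
  C := Copy(S, I, 3);
  Result := (C = 'TAG') or (C = 'TGA') or (C = 'TAA');
end;

// Hypothetical helper: prints every ORF on the given strand, frame by frame.
procedure ListOrfs(const S: string);
var
  Frame, I, OrfStart: Integer;
begin
  for Frame := 1 to 3 do
  begin
    OrfStart := 0;                       // 0 = no open ORF in this frame
    I := Frame;
    while I + 2 <= Length(S) do
    begin
      if (OrfStart = 0) and (Copy(S, I, 3) = 'ATG') then
        OrfStart := I                    // first ATG opens the ORF
      else if (OrfStart > 0) and IsStop(S, I) then
      begin
        // first in-frame stop codon closes the ORF (stop codon included)
        WriteLn('Frame +', Frame, ': ', Copy(S, OrfStart, I + 3 - OrfStart));
        OrfStart := 0;                   // look for the next ATG in this frame
      end;
      Inc(I, 3);                         // stay in frame: one codon at a time
    end;
  end;
end;

begin
  ListOrfs('ACTGCTAATGATTTGGACTTGGTAGCGTTACCTG');
end.
```

Stepping by three within a fixed frame makes the "multiple of 3" condition hold automatically, so no separate mod check is needed.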

There were a couple other questions about overlapping and nested ORFs. I hope my answers here help answer these questions as well. Thank you all so much for your interest.

Let me know if there are other questions.


« Last Edit: August 31, 2021, 09:19:32 pm by BornAgain »

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: How to determine if a result is an integer?
« Reply #23 on: August 31, 2021, 11:27:57 pm »
Hi!

Read your text twice.
My example from #8 fits your needs, but I enhanced it a bit:

It now shows the string that hits the requirements and the length measured in triplets including start and end.

So 2 triplets mean a string of length zero.

Code: Pascal
// Needs StrUtils in the unit's uses clause (for PosEx).
procedure TForm1.Button2Click(Sender: TObject);
const
  data: string = 'ACTGCTAATGATTTGGACTTGGTAGCGTTACCTG';
  TripStart = 'ATG';
  TripEnd = 'TAG';
var
  p, q, offset, delta: Integer;
  msg, DNAhit: string;
begin
  offset := 1;
  repeat
    p := PosEx(TripStart, data, offset);
    if p > 0 then
    begin
      offset := p + 3;
      q := PosEx(TripEnd, data, offset);
      if q > 0 then
      begin
        delta := (q - p) mod 3;
        if delta = 0 then msg := 'Hit' else msg := 'No hit';
        msg := msg + LineEnding + TripStart + ' at ' + IntToStr(p) +
               LineEnding + TripEnd + ' at ' + IntToStr(q) + LineEnding;
        if delta = 0 then
        begin
          DNAhit := Copy(data, p, q + 3 - p);
          msg := msg + DNAhit + LineEnding + 'Length Triplets: ' +
                 IntToStr(Length(DNAhit) div 3);
        end;
        ShowMessage(msg);
      end;
    end;
  until (p = 0) or (q = 0);
  ShowMessage('No more hits');
end;

Last time faced with DNA was in school. Long time ago.
But always learning ....

Winni

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #24 on: September 01, 2021, 12:34:51 am »
@winni
Not quite right, IMHO. For data : string = 'ATGAATGTGGTAGTTTAGT'; the output is:
Code:
No hit
ATG at 1
TAG at 11

Hit
ATG at 5
TAG at 11
ATGTGGTAG
Length Triplets: 3
No more hits

One hit. But in my understanding:
Code: Pascal
data : string = 'ATGAATGTGGTAGTTTAGT';
// should be considered as
'ATG AAT GTG GTA GTT TAG T' // hit, frame +1, length 6
'A TGA ATG TGG TAG TTT AGT' // hit, frame +2, length 3

2 hits in different frames.

It is not the best example, but should be good enough to illustrate overlapping in multiple reading frames.

Also, ATGTAG gives a hit.

"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

BornAgain

  • New Member
  • *
  • Posts: 34
Re: How to determine if a result is an integer?
« Reply #25 on: September 01, 2021, 04:08:54 am »
You're right, @y.ivanov. Two hits in two frames. All three frames need to be looked into. Actually, there is yet another interesting twist, which is that DNA is double stranded (I have shown only one strand here) and the longest ORF may be on the other strand, and so that needs to be checked out too (all three frames). There is a method to figuring out what the other strand is, but I have already coded that part. Just thought I'd mention it since there is interest in understanding the biology here.
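
For readers following along, the other strand is normally read as the reverse complement of the one shown (A pairs with T, C with G, read in the opposite direction). Below is a minimal sketch under that assumption. The function name ReverseComplement is made up here and is not the code mentioned in the post above.

```pascal
program RevComp;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the reverse complement of an uppercase
// DNA string. Anything other than A/C/G/T is mapped to 'N'.
function ReverseComplement(const S: string): string;
var
  I, N: Integer;
begin
  Result := '';
  N := Length(S);
  SetLength(Result, N);
  for I := 1 to N do
    case S[I] of
      'A': Result[N - I + 1] := 'T';
      'T': Result[N - I + 1] := 'A';
      'C': Result[N - I + 1] := 'G';
      'G': Result[N - I + 1] := 'C';
    else
      Result[N - I + 1] := 'N';   // unknown base
    end;
end;

begin
  WriteLn(ReverseComplement('ATGC')); // prints GCAT
end.
```

The same three-frame search can then be run on the returned string to cover the second strand.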

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #26 on: September 01, 2021, 01:59:08 pm »
Hi there!

It is the most KISSed code I was able to produce. Hope it is correct:
Code: Pascal
program project1;
{$mode objfpc}{$H+} // 32-bit Integer and long strings

const
  Data: String = 'ACTGCTAATGATTTGGACTTGGTAGCGTTACCTG';
  //Data: String = 'AATGAATGTGGTAGTTTAGTTATGTAG';
  CodonStart: String = 'ATG';
  CodonEnd: String = 'TAG';
  NoStart = MaxInt;

procedure FindFrames(AData: String);
var
  I, DataLen, FN: Integer;
  Start: array[0..2] of Integer;
  Starts: Integer;
begin
  DataLen := Length(AData);
  Start[0] := NoStart;
  Start[1] := NoStart;
  Start[2] := NoStart;
  Starts := 0;

  I := 1;
  while (I < DataLen - 1) do
  begin
    FN := I mod 3; // Frame number

    // For start codons
    if (AData[I] = CodonStart[1]) then
    begin
      // All 3 letters match?
      if (AData[I + 1] = CodonStart[2]) and (AData[I + 2] = CodonStart[3]) then
      begin
        // No start codon for the frame?
        if (Start[FN] = NoStart) then
        begin
          Start[FN] := I; // Save start
          Inc(Starts); // Increment Starts count
        end;
        Inc(I, 3); // Skip codon
      end
      else
        Inc(I); // Skip one
    end

    // For end codons
    else if (Starts > 0) and (AData[I] = CodonEnd[1]) then
    begin
      // All 3 letters match?
      if (AData[I + 1] = CodonEnd[2]) and (AData[I + 2] = CodonEnd[3]) then
      begin
        Inc(I, 3); // Skip codon

        // Start codon for the frame?
        if (Start[FN] <> NoStart) then
        begin

          // Show frame --------------------------------------
          if (I - Start[FN] < 3 * 3) then
            WriteLn('Empty frame.')
          else
          begin
            WriteLn('Frame +', (2 + FN) mod 3 + 1, // 0->3, 1->1, 2->2
              ', Start: ', Start[FN],
              ', End: ', I,
              ', Seq: ', Copy(AData, Start[FN], I - Start[FN]));
          end;
          //---------------------------------------------------

          // Reset start for this frame
          // Q: Shall we reset on empty frame?
          Start[FN] := NoStart;
        end;
      end
      else
        Inc(I); // Skip one
    end

    else
      Inc(I); // Skip one

  end;
end;

begin
  FindFrames(Data);
end.

"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

BornAgain

  • New Member
  • *
  • Posts: 34
Re: How to determine if a result is an integer?
« Reply #27 on: September 02, 2021, 06:48:20 am »
@y.ivanov, that looks FAR more efficient than the code I am still trying to complete. I am at a stage where it outputs multiple ORFs, but I am trying to sort them (in a TStringList) to find the longest. Surprisingly, there seems to be no easy way of sorting a TStringList based on string length.


bytebites

  • Hero Member
  • *****
  • Posts: 639
Re: How to determine if a result is an integer?
« Reply #28 on: September 02, 2021, 08:45:26 am »
Easy way is

Code: Pascal
function longestfirst(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := Length(List[Index2]) - Length(List[Index1]);
end;

astringlist.CustomSort(@longestfirst);
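
For completeness, here is how that comparison function might be used in a small standalone program. The list contents and the variable name Orfs are invented for the example.

```pascal
program SortByLength;
{$mode objfpc}{$H+}

uses
  Classes; // TStringList

// Comparison callback: a negative result puts Index1 first,
// so longer strings end up at the front of the list.
function LongestFirst(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := Length(List[Index2]) - Length(List[Index1]);
end;

var
  Orfs: TStringList;
begin
  Orfs := TStringList.Create;
  try
    // Made-up sample entries of different lengths
    Orfs.Add('ATGTGGTAG');
    Orfs.Add('ATGAATGTGGTAGTTTAG');
    Orfs.Add('ATGAAATTTTAG');
    Orfs.CustomSort(@LongestFirst);
    WriteLn(Orfs[0]); // the longest entry is now first
  finally
    Orfs.Free;
  end;
end.
```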

alpine

  • Hero Member
  • *****
  • Posts: 1060
Re: How to determine if a result is an integer?
« Reply #29 on: September 02, 2021, 06:08:50 pm »
@BornAgain
If you need just the longest sequence there is no need to put them all in TStringList and sort, you can just keep track which is the longest:
Code: Pascal
program project1;
{$mode objfpc}{$H+} // needed for "out" parameters; also gives long strings

const
  Data: String = 'ACTGCTAATGATTTGGACTTGGTAGCGTTACCTG';
  //Data: String = 'AATGAATGTGGTAGTTTAGTTATGTAG';
  CodonStart: String = 'ATG';
  CodonEnd: String = 'TAG';
  NoStart = MaxInt;

procedure FindFrames(AData: String; out AStart, ALen: Integer);
var
  I, DataLen, FN: Integer;
  Start: array[0..2] of Integer;
  Starts: Integer;
  LLen, FLen: Integer;
begin
  DataLen := Length(AData);
  Start[0] := NoStart;
  Start[1] := NoStart;
  Start[2] := NoStart;
  Starts := 0;

  LLen := 0;
  AStart := 0;
  ALen := 0;
  I := Pos(CodonStart, AData);
  if I > 0 then while (I < DataLen - 1) do
  begin
    FN := I mod 3; // Frame number

    // For start codons
    if (AData[I] = CodonStart[1]) then
    begin
      // All 3 letters match?
      if (AData[I + 1] = CodonStart[2]) and (AData[I + 2] = CodonStart[3]) then
      begin
        // No start codon for the frame?
        if (Start[FN] = NoStart) then
        begin
          Start[FN] := I; // Save start
          Inc(Starts); // Increment Starts count
        end;
        Inc(I, 3); // Skip codon
      end
      else
        Inc(I); // Skip one
    end

    // For end codons
    else if (Starts > 0) and (AData[I] = CodonEnd[1]) then
    begin
      // All 3 letters match?
      if (AData[I + 1] = CodonEnd[2]) and (AData[I + 2] = CodonEnd[3]) then
      begin
        Inc(I, 3); // Skip codon

        // Start codon for the frame?
        if (Start[FN] <> NoStart) then
        begin
          FLen := I - Start[FN];

          // Show frame --------------------------------------
          if (FLen < 3 * 3) then
            WriteLn('Empty frame.')
          else
          begin

            // Keeping track of the longest one
            if FLen > LLen then
            begin
              LLen := FLen;
              AStart := Start[FN];
            end;

            WriteLn('Frame +', (2 + FN) mod 3 + 1, // 0->3, 1->1, 2->2
              ', Start: ', Start[FN],
              ', End: ', I,
              ', Seq: ', Copy(AData, Start[FN], FLen));

          end;
          //---------------------------------------------------

          // Reset start for this frame
          // Q: Shall we reset on empty frame?
          Start[FN] := NoStart;
        end;
      end
      else
        Inc(I); // Skip one
    end

    else
      Inc(I); // Skip one

  end;

  // Return length of the longest
  ALen := LLen;
end;

var
  BestStart, BestLen: Integer;

begin
  // The procedure now takes two out arguments, so pass variables for them
  FindFrames(Data, BestStart, BestLen);
  if BestLen > 0 then
    WriteLn('Longest: Start: ', BestStart, ', Length: ', BestLen,
      ', Seq: ', Copy(Data, BestStart, BestLen))
  else
    WriteLn('No sequence found.');
end.
I have added two output arguments to the procedure: AStart, ALen, where the start index and the length of the longest sequence will be returned.

I also made a small modification before the scan: instead of assigning I := 1, it now uses I := Pos(CodonStart, AData), assuming that jumping to the first start codon will be a little faster than scanning sequentially from the beginning. Not sure, though.

Further enhancements:
* Logic can be included for an early exit in case we need only the longest sequence and there is no chance of finding a longer one in the unscanned portion
* A Boyer-Moore-like pattern scan, but then two scan counters should be used because of the shift irregularities
 
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

 
