
Author Topic: Compare two text lines and highlight difference  (Read 5844 times)

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #15 on: February 09, 2023, 08:50:54 am »
...
Thank you, I'd like to try it, but where can I find this LGenerics/lgSeqUtils?
...

It lives here.

Thank you!

I installed the trunk version, because I read the readme and I'd like to look at the JSON implementation.
The declaration of the variable "I" is missing from the sample, but I added it within 5 seconds.

Well, this is a simplified solution, because it works only on whole words.
This means that if only one letter of a word differs, the whole word goes into the SourceChanges and TargetChanges lists, and I can't see exactly which letter changed.
But that is not a big problem; my priority is finding a simple solution.

Next problem: if a word is repeated, I don't know which occurrence changed, so I can't color the changes. For example:

Code: Pascal
    s1 := 'sun day sun';
    s2 := 'sunx day sun';

result:

Deleted from s1(i.e. not present in s2):
sun

Inserted into s2(i.e. not present in s1):
sunx

So thank you for this library and the sample, but it isn't a usable solution for me.
As far as I can see, TextDiff (new version, see this topic: https://forum.lazarus.freepascal.org/index.php/topic,62219.msg470413.html#msg470413) has much better options, but UTF8 does not seem to be supported.

Thank you again!

Roland57

  • Sr. Member
  • ****
  • Posts: 421
    • msegui.net
Re: Compare two text lines and highlight difference
« Reply #16 on: February 09, 2023, 09:08:24 am »
@totya

Not sure that it will match your needs, but there is also this project: https://github.com/DomingoGP/lazIdeDiffCompareFiles

And, since we are on this topic, I would like to mention diffoscope that I discovered recently.
« Last Edit: February 09, 2023, 09:11:26 am by Roland57 »

avk

  • Hero Member
  • *****
  • Posts: 752
Re: Compare two text lines and highlight difference
« Reply #17 on: February 09, 2023, 09:45:51 am »
@totya, the Boolean vector SourceChanges corresponds to the elements of the source sequence and contains True at the positions whose elements are not included in the target sequence. That is, if SourceChanges[2] is True, the source sequence element with index 2 is not in the target sequence.
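
To illustrate how position-based Boolean vectors deal with repeated words, here is a minimal, self-contained sketch. It does not use LGenerics: MarkChanges and TBoolVector are hypothetical names, and the plain LCS table is only an assumption about one way to compute such flags, not a description of how the library does it.

```pascal
program WordDiffSketch;
{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

type
  TBoolVector = array of Boolean;

{ Flags the words of A and B that are not part of a longest common
  subsequence (LCS). Hypothetical helper, not the LGenerics API. }
procedure MarkChanges(A, B: TStrings; out AChanged, BChanged: TBoolVector);
var
  L: array of array of Integer;
  i, j: Integer;
begin
  // L[i, j] = LCS length of A[i..] and B[j..]; SetLength zero-fills it.
  SetLength(L, A.Count + 1, B.Count + 1);
  for i := A.Count - 1 downto 0 do
    for j := B.Count - 1 downto 0 do
      if A[i] = B[j] then
        L[i, j] := L[i + 1, j + 1] + 1
      else if L[i + 1, j] >= L[i, j + 1] then
        L[i, j] := L[i + 1, j]
      else
        L[i, j] := L[i, j + 1];

  // Start with everything flagged as changed.
  SetLength(AChanged, A.Count);
  SetLength(BChanged, B.Count);
  for i := 0 to A.Count - 1 do AChanged[i] := True;
  for j := 0 to B.Count - 1 do BChanged[j] := True;

  // Walk the table and clear the flag of every matched pair.
  i := 0;
  j := 0;
  while (i < A.Count) and (j < B.Count) do
    if A[i] = B[j] then
    begin
      AChanged[i] := False;
      BChanged[j] := False;
      Inc(i);
      Inc(j);
    end
    else if L[i + 1, j] >= L[i, j + 1] then
      Inc(i)
    else
      Inc(j);
end;

var
  W1, W2: TStringList;
  C1, C2: TBoolVector;
  i: Integer;
begin
  W1 := TStringList.Create;
  W2 := TStringList.Create;
  try
    // Split both lines into words at single spaces.
    W1.Delimiter := ' ';
    W1.StrictDelimiter := True;
    W1.DelimitedText := 'sun day sun';
    W2.Delimiter := ' ';
    W2.StrictDelimiter := True;
    W2.DelimitedText := 'sunx day sun';

    MarkChanges(W1, W2, C1, C2);
    for i := 0 to W1.Count - 1 do
      WriteLn('s1[', i, '] = ', W1[i], '  changed: ', C1[i]);
    for i := 0 to W2.Count - 1 do
      WriteLn('s2[', i, '] = ', W2[i], '  changed: ', C2[i]);
  finally
    W1.Free;
    W2.Free;
  end;
end.
```

For 'sun day sun' versus 'sunx day sun' only index 0 is flagged in each vector, so the first word can be colored while the second 'sun' at index 2 stays untouched.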

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #18 on: February 09, 2023, 09:51:19 am »
@totya

Not sure that it will match your needs, but there is also this project: https://github.com/DomingoGP/lazIdeDiffCompareFiles

Thanks, this is based on diff.pas (diff2.pas), but nothing important has changed in that code (I need UTF8 support).

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #19 on: February 09, 2023, 12:36:02 pm »
@totya, the Boolean vector SourceChanges corresponds to the elements of the source sequence and contains True at the positions whose elements are not included in the target sequence. That is, if SourceChanges[2] is True, the source sequence element with index 2 is not in the target sequence.

Thanks for this information!

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #20 on: February 09, 2023, 12:56:20 pm »
... looks like this feature
https://github.com/rickard67/TextDiff

This is the best solution, as far as I can see.

I wrote that it isn't UTF8 compatible; for example, characters such as "őóú" are lost in the comparison.

But I have been thinking. TDiff is a Delphi unit, with {$mode delphi}. Delphi uses 2-byte coded chars (UTF-16). But {$mode delphi} does not work perfectly, so if I modify, for example:

char->widechar
string->WideString

then the comparison works.
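
Why the wider types help can be seen by counting elements. The following is a minimal sketch, not part of TDiff: it uses the RTL function UTF8Decode, and the test string is written as explicit UTF-8 bytes so that the result does not depend on the encoding of the source file.

```pascal
program Utf8ToUtf16Sketch;
{$mode objfpc}{$H+}

var
  Raw: RawByteString;
  Wide: UnicodeString;
begin
  // 'őóú' as UTF-8 bytes: U+0151 = C5 91, U+00F3 = C3 B3, U+00FA = C3 BA.
  Raw := #$C5#$91#$C3#$B3#$C3#$BA;

  // UTF8Decode converts the UTF-8 bytes to a UTF-16 UnicodeString.
  Wide := UTF8Decode(Raw);

  WriteLn('UTF-8 code units (bytes): ', Length(Raw));   // 6
  WriteLn('UTF-16 code units:        ', Length(Wide));  // 3
end.
```

With a byte-oriented Char each of these letters occupies two elements, while with WideChar each one is a single element, which is probably why the element-wise comparison behaves better after the change.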


Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: Compare two text lines and highlight difference
« Reply #21 on: February 09, 2023, 01:11:25 pm »
Delphi uses 2-byte coded chars (UTF-16).
Wrong! UTF16 has between 2 and 4 bytes.
Quote
But {$mode delphi} does not work perfectly, so if I modify, for example:

char->widechar
string->WideString
Because you use the wrong mode: you should have used {$mode delphiunicode}

Also note that LCS - what you need for a diff - is a bytewise comparison, not a character-based comparison, and for that reason the latest TDiff is known to work with UTF8 too.
« Last Edit: February 09, 2023, 01:20:43 pm by Thaddy »

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #22 on: February 09, 2023, 02:27:05 pm »
Delphi uses 2-byte coded chars (UTF-16).
Wrong! UTF16 has between 2 and 4 bytes.

I already know that; (older) Delphi uses UTF-16 and never uses UTF-32.

But {$mode delphi} does not work perfectly, so if I modify, for example:
char->widechar
string->WideString
Because you use the wrong mode: you should have used {$mode delphiunicode}

This is not my fault; this isn't my package. But thanks for the info!

Let me see...
I swapped $mode delphi for $mode delphiunicode in two places (tdiff and unit1).
Well, it seems to work badly!
input1: Change the text here & then compareöü
input2: Change the text here & then compareőü
The result: the last character is missing from the comparison.
But this isn't the fault of $mode delphiunicode, because I get the same bad result with my modified code too.
But anyway, thanks for the {$mode delphiunicode} info. (I think a compiler warning is missing: $mode delphi -> warning, deprecated!)

Also note that LCS - what you need for a diff - is a bytewise comparison, not a character-based comparison, and for that reason the latest TDiff is known to work with UTF8 too.

UTF-16 and UTF-32 are fixed-size encodings, so I think it doesn't matter whether the comparison is bytewise or character-based.

---> Okay, thanks for the info, but where can I find the latest TDiff? <---

« Last Edit: February 09, 2023, 03:36:41 pm by totya »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9792
  • Debugger - SynEdit - and more
    • wiki
Re: Compare two text lines and highlight difference
« Reply #23 on: February 09, 2023, 02:40:31 pm »
UTF-16 and UTF-32 is fixed size code,

No, UTF-16 is not fixed size.

UTF-16 has a CodeUnit size of 2 bytes. (UTF-8 has 1 byte, and UTF-32 has 4).

In UTF-16: A Unicode codepoint can be represented by 1 or 2 CodeUnits (2 or 4 bytes).

A "character" can be either a single codepoint, or a combination of several codepoints. (That applies to Unicode itself, so that is the case for UTF-8, UTF-16 and UTF-32 and any other transfer encoding)

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #24 on: February 09, 2023, 03:14:00 pm »
UTF-16 and UTF-32 is fixed size code,

No UTF-16 is not fixed size.
UTF-16 has a CodeUnit size of 2 byte. (UTF-8 has 1 byte, and UTF-32 has 4).
In UTF-16: A Unicode codepoint can be represented by 1 or 2 CodeUnits (2 or 4 bytes).
A "character" can be either a single codepoint, or a combination of several codepoints. (That applies to Unicode itself, so that is the case for UTF-8, UTF-16 and UTF-32 and any other transfer encoding)

I know it differently. For example, I hate UTF8 because it is NOT only 1 byte long; it is variable length (1-4 bytes), so it is very complicated to handle, although I see that many functions are available (e.g. LazUTF8: UTF8Pos, UTF8Copy, etc.).

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9792
  • Debugger - SynEdit - and more
    • wiki
Re: Compare two text lines and highlight difference
« Reply #25 on: February 09, 2023, 03:27:13 pm »
UTF-16 and UTF-32 is fixed size code,

No UTF-16 is not fixed size.
UTF-16 has a CodeUnit size of 2 byte. (UTF-8 has 1 byte, and UTF-32 has 4).
In UTF-16: A Unicode codepoint can be represented by 1 or 2 CodeUnits (2 or 4 bytes).
A "character" can be either a single codepoint, or a combination of several codepoints. (That applies to Unicode itself, so that is the case for UTF-8, UTF-16 and UTF-32 and any other transfer encoding)

I know otherwise,
Then you know wrong.

Quote
for example I hate UTF8 because this NOT only 1 byte length, this is variable length (1-4 byte) so very complicated to handle it, but I see many function available (ex.: LazUTF8: UT8Pos, UT8copy etc).

And in UTF-16 (unlike UCS-2) you get 2 or 4 bytes.

UTF-16 has surrogate pairs, and a pair takes 4 bytes.

For example, the following emoticons use 4 bytes in UTF-16: https://www.compart.com/en/unicode/block/U+1F600
Click one of them to see its UTF-16 encoding.
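
The 4 bytes can also be computed by hand. Below is a minimal sketch of the standard UTF-16 surrogate calculation for the first emoticon of that block, U+1F600; ToSurrogates is a hypothetical helper name, and the code assumes a valid code point in the range U+10000..U+10FFFF.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

{ Splits a code point in the range U+10000..U+10FFFF into its UTF-16
  surrogate pair. Hypothetical helper; no range checking is done. }
procedure ToSurrogates(CodePoint: Cardinal; out HighSur, LowSur: Word);
var
  V: Cardinal;
begin
  V := CodePoint - $10000;                // 20 significant bits remain
  HighSur := Word($D800 + (V shr 10));    // upper 10 bits
  LowSur  := Word($DC00 + (V and $3FF));  // lower 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogates($1F600, H, L);
  // Two 16-bit code units, i.e. 4 bytes in UTF-16.
  WriteLn(IntToHex(H, 4), ' ', IntToHex(L, 4));  // D83D DE00
end.
```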


totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #26 on: February 09, 2023, 03:35:16 pm »
And in UTF-16 (unlike UCS-2) you got  2 or 4 bytes.
UTF-16 has surrogates. And they are 4 bytes.
For example the following emoticons use 4 bytes in UTF-16 https://www.compart.com/en/unicode/block/U+1F600
Click then, see the UTF-16 encoding.

Thanks, but then you already know it wrong too: as you can see here, UTF-8 is not only 1 byte... in this example it is 4 bytes long.

Okay, UTF-16 is 2-4 bytes long. Peace. :)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9792
  • Debugger - SynEdit - and more
    • wiki
Re: Compare two text lines and highlight difference
« Reply #27 on: February 09, 2023, 03:35:38 pm »
And in addition to my last post (UTF-16 having 2 or 4 bytes) there is more.

This n-bytes size refers to how a single "Codepoint" is encoded in UTF-n.

However, in Unicode itself (even affecting UTF-32) a character can have more than 1 codepoint.

Example: "ä"
https://www.compart.com/unicode/U+00E4
Can be represented either as  U+00E4  (1 codepoint)
Or as  U+0061 followed by U+0308  (2 codepoints)

But both are the same letter. (There are letters that can be represented in more than 2 forms.)

There are also letters that have no composed (1 single codepoint) form, but are always several codepoints.
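
This matters for a diff, because a plain element-wise comparison sees the two forms as different text. A minimal sketch of that effect, using only UnicodeString from the RTL (no normalization is performed here):

```pascal
program ComposedSketch;
{$mode objfpc}{$H+}

var
  Composed, Decomposed: UnicodeString;
begin
  // "ä" as one code point: U+00E4.
  Composed := WideChar($00E4);

  // "ä" as two code points: U+0061 (a) followed by U+0308 (combining diaeresis).
  Decomposed := WideChar($0061);
  Decomposed := Decomposed + WideChar($0308);

  WriteLn('composed length:   ', Length(Composed));        // 1
  WriteLn('decomposed length: ', Length(Decomposed));      // 2
  WriteLn('equal code units:  ', Composed = Decomposed);   // FALSE
end.
```

To treat both forms as the same letter, the strings would first have to be brought into one normalization form; that step is outside this sketch.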


Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9792
  • Debugger - SynEdit - and more
    • wiki
Re: Compare two text lines and highlight difference
« Reply #28 on: February 09, 2023, 03:40:22 pm »
Thanks, because you know wrong already, and u see here, UTF-8 is not only 1 byte... in this example it's 4 byte length.

You must have misread me.
I assume you refer to:
UTF-16 has a CodeUnit size of 2 byte. (UTF-8 has 1 byte, and UTF-32 has 4).

I did say: In UTF-8 a "CodeUnit" is 1 byte.

And that is true. A Codepoint is then represented by 1 to 4 CodeUnits.
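
The relation between code points and code units can be written down as two small functions. This is only a sketch under the assumption that the input is a valid code point (U+0000..U+10FFFF, not a surrogate); Utf8CodeUnits and Utf16CodeUnits are hypothetical names, not RTL or LazUTF8 functions.

```pascal
program CodeUnitCountSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

{ Number of 1-byte code units needed for a code point in UTF-8. }
function Utf8CodeUnits(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $7F then
    Result := 1
  else if CodePoint <= $7FF then
    Result := 2
  else if CodePoint <= $FFFF then
    Result := 3
  else
    Result := 4;
end;

{ Number of 2-byte code units needed for a code point in UTF-16. }
function Utf16CodeUnits(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $FFFF then
    Result := 1
  else
    Result := 2;  // surrogate pair
end;

procedure Show(CodePoint: Cardinal);
begin
  WriteLn('U+', IntToHex(Int64(CodePoint), 4),
          ': UTF-8 code units = ', Utf8CodeUnits(CodePoint),
          ', UTF-16 code units = ', Utf16CodeUnits(CodePoint));
end;

begin
  Show($0061);   // a: 1 and 1
  Show($0151);   // ő: 2 and 1
  Show($20AC);   // euro sign: 3 and 1
  Show($1F600);  // emoticon: 4 and 2
end.
```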


totya

  • Hero Member
  • *****
  • Posts: 720
Re: Compare two text lines and highlight difference
« Reply #29 on: February 09, 2023, 03:51:47 pm »
You must have misread me.

Okay, code points and code units are different things.

But back to the topic... Thaddy said the new TDiff is UTF-8 capable (and I hope it works better; in the last example it works badly). Do you have any idea where I can find it?

 
