[SOLVED] A fast way of comparing hash values line by line from two stringlists?

Gizmo

Hero Member
Posts: 831

[SOLVED] A fast way of comparing hash values line by line from two stringlists?

« on: July 19, 2014, 12:42:42 pm »

Coding an additional feature to my QuickHash program - compare directories. So, take the files in DirA, hash them, and compare the hashes against the hashes of all the files in DirB. Where there are differences, I want to highlight them.

I'm OK with FindAllFiles and Search records etc and will probably store the results in memory in a StringList, or two StringLists at least. I have the hash code using the Freepascal MD5 and SHA1 units.

What I want to ask is: what's the quickest way to compare two lists of hash values, other than traversing two stringlists and doing "if HashFileA <> HashFileB" line by line? Am I to assume that StringHashMap, by Juha, is the best (http://wiki.lazarus.freepascal.org/StringHashMap)? If so, could anyone give me some pointers to get me started with it? As far as I can see from the demo, it is a fast way of searching for strings, so I assume I'd run it or integrate it with my stringlists of hashes?

« Last Edit: July 24, 2014, 03:49:06 pm by Gizmo »


Lazarus 2.2.6 FPC 3.2.2- Linux Mint 21 LTS, Windows 10 64 and Mac OSX 14.0
Useful Pages to remember :

http://wiki.freepascal.org/Cross_compiling#From_Linux_x64_to_Linux_i386
https://wiki.freepascal.org/macOS_Big_Sur_changes_for_developers#ARM64.2FAArch64.2FApple_Silicon_Support

engkin

Hero Member
Posts: 3112

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #1 on: July 19, 2014, 06:32:03 pm »

For speed, I suggest that you compare the digests themselves, not their hexadecimal string representations, for two reasons. Take a SHA1 digest, for instance:

Code:

TSHA1Digest = array[0..19] of Byte;

First, it is 20 bytes, while its string counterpart is 40 characters/bytes. Second, strings are compared character by character (byte by byte), which does not benefit from the CPU data bus width (4/8 bytes).

The same applies to MD5 digests:

Code:

TMDDigest = array[0..15] of Byte;

Only two comparisons are needed here on 64-bit systems.
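As a minimal sketch of comparing digests directly, the program below hashes three stand-in strings and compares the raw 20-byte results. It assumes the `sha1` unit from FPC's hash package, with its `SHA1String` and `SHA1Match` routines, and `CompareMem` from `SysUtils`; the input strings are made up for illustration, and in the real program the digests would come from `SHA1File`.

```pascal
program DigestCompareSketch;

{$mode objfpc}{$H+}

uses
  SysUtils, sha1;

var
  DigestA, DigestB, DigestC: TSHA1Digest;
begin
  // Stand-in inputs (assumption); real digests would come from SHA1File.
  DigestA := SHA1String('same content');
  DigestB := SHA1String('same content');
  DigestC := SHA1String('different content');

  // SHA1Match compares the raw 20-byte digests; no hex string is built.
  WriteLn('A matches B: ', SHA1Match(DigestA, DigestB));
  WriteLn('A matches C: ', SHA1Match(DigestA, DigestC));

  // CompareMem performs the same check on any fixed-size block of bytes.
  WriteLn('CompareMem A, B: ',
    CompareMem(@DigestA, @DigestB, SizeOf(TSHA1Digest)));
end.
```

The hex string is then only needed for display, for example with `SHA1Print` when a mismatch is reported to the user.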


Gizmo

Hero Member
Posts: 831

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #2 on: July 22, 2014, 10:02:06 pm »

Thanks Engkin, but for the sake of simplicity, for now at least, I'd like to compare the hashes themselves and I've cobbled together some basic syntax to start with. I will look at more advanced methods once I have got this working. I am struggling though.

In summary, the user chooses directory A (DirA) and directory B (DirB) to compare one against the other.

FindAllFiles is then run for both dirs, the results of which go into two separate stringlists, FileListA and FileListB, and they include full filename and path.

I then iterate FileListA and FileListB, calling a hash function for each file found in the list, and store the hashes in two additional stringlists called HashListA and HashListB. See screenshot 'DirAAndDirB', which shows these two hash lists for two directories where one dir has one extra file that the other does not have.

I then have a third pair of stringlists with the results of both directories combined. So FilesAndHashListA and FilesAndHashListB contain the full filename and hash, separated with a comma, for lookup further on: to match the hash against the filename but without having to iterate entire stringlists full of both.

All lists are sorted.

Now, obviously the full path recorded in each list will be different because you're comparing one folder against a different one, so it's only the content based on hash I am comparing. But where there are differences I need it to say "find the list where one hash appears in one list but not the other and then lookup the corresponding filename and path for that hash or hashes"

I've nearly done it with the code below, except it reports the wrong file as the missing one.

Code:

procedure TMainForm.btnCompareClick(Sender: TObject);
var
  DirA, DirB, FilePath, FileName, FullPathAndName, FileHashA, FileHashB,
    HashOfListA, HashOfListB, Mismatch : string;
  TotalFilesDirA, TotalFilesDirB,       // Stringlists just for the file names
    HashListA, HashListB,               // Stringlists just for the hashes of each file in each directory
    FileAndHashListA, FileAndHashListB, // Stringlists for the combined lists of both hashes with filenames
    MisMatchList
    : TStringList;
  i, index : integer;

begin
  i := 0;
  index := 0;
  DirA := lblDirAName.Caption;
  DirB := lblDirBName.Caption;

  try
    // First, list and hash the files in DirA
    TotalFilesDirA := TStringList.Create;
    TotalFilesDirA.Sorted := true;
    TotalFilesDirA := FindAllFiles(DirA, '*', True);
    TotalFilesDirA.Sort;

    HashListA := TStringList.Create;
    FileAndHashListA := TStringList.Create;
    HashListA.Sorted := true;
    FileAndHashListA.Sorted := true;

    for i := 0 to TotalFilesDirA.Count -1 do
      begin
        FilePath := ExtractFilePath(TotalFilesDirA.Strings[i]);
        FileName := ExtractFileName(TotalFilesDirA.Strings[i]);
        FullPathAndName := FilePath + FileName;
        FileHashA := CalcTheHashFile(FullPathAndName);
        HashListA.Add(FileHashA);
        FileAndHashListA.Add(FullPathAndName + ',' + FileHashA);
      end;
    HashListA.Sort;
    memDirAList.Text:= HashListA.Text;
    lblTotalFileCountNumberA.Caption := IntToStr(TotalFilesDirA.Count);

    // Then, list and hash the files in DirB
    TotalFilesDirB := TStringList.Create;
    TotalFilesDirB.Sorted := true;
    TotalFilesDirB := FindAllFiles(DirB, '*', True);
    TotalFilesDirB.Sort;

    HashListB := TStringList.Create;
    FileAndHashListB := TStringList.Create;
    HashListB.Sorted := true;
    FileAndHashListB.Sorted := true;

    for i := 0 to TotalFilesDirB.Count -1 do
      begin
        FilePath := ExtractFilePath(TotalFilesDirB.Strings[i]);
        FileName := ExtractFileName(TotalFilesDirB.Strings[i]);
        FullPathAndName := FilePath + FileName;
        FileHashB := CalcTheHashFile(FullPathAndName);
        HashListB.Add(FileHashB);
        FileAndHashListB.Add(FullPathAndName + ',' + FileHashB);
      end;
    HashListB.Sort;
    FileAndHashListB.Sort;

    memDirBList.Text:= HashListB.Text;
    lblTotalFileCountNumberB.Caption := IntToStr(TotalFilesDirB.Count);

    // Now work out where the differences are.
    // Start by establishing if the dirs are identical : same no of files + same hashes = matching dirs
    lblFileCountDiffB.Caption := IntToStr(TotalFilesDirB.Count - TotalFilesDirA.Count);

    { If there is no difference between file count, then if all the files are
      actually the same files, the hash lists themselves will be identical if there
      were no errors or no file mistmatches.
      So instead of comparing each hash line by line, just hash the list and see if they match
      However, we don't know whether DirA or DirB is the one that might have most files in,
      so we do a count of each subtracted by the other
    }
    if ((TotalFilesDirB.Count - TotalFilesDirA.Count) = 0) or ((TotalFilesDirA.Count - TotalFilesDirB.Count) = 0) then
      begin
      HashOfListA := SHA1Print(SHA1String(HashListA.Text));
      HashOfListB := SHA1Print(SHA1String(HashListB.Text));
      if HashOfListA = HashOfListB then
        begin
          lblHashMatchB.Caption:= 'MATCH!';
        end
      end;

    // If both matched, the previous loop will have been executed.
    // If, however, one dir has a higher count than the other, the following loop runs:

    if (TotalFilesDirB.Count < TotalFilesDirA.Count) or (TotalFilesDirB.Count > TotalFilesDirA.Count) then
      begin
        lblHashMatchB.Caption:= 'Mis-MATCH!';
        FileAndHashListA.Sort;
        FileAndHashListB.Sort;
        MismatchList := TStringList.Create;
        for i := 0 to HashListB.Count -1 do
          begin
            if not HashListB.Find(HashListA.Strings[i], index) then
              begin
              MismatchList.Add(FileAndHashListB.Strings[i] + ' not found in both directories');
              ShowMessage(MismatchList.Text); // THIS LISTS A FILE THAT IS IN BOTH LISTS AND IS IDENTICAL, INSTEAD OF THE ONE THAT IS ONLY IN DIRB AND NOT IN DIRA
              end;
          end;
      end;
  finally
    TotalFilesDirA.Free;
    TotalFilesDirB.Free;
    HashListA.Free;
    HashListB.Free;
    if assigned (MisMatchList) then MismatchList.Free;
  end;
end;

If you look at the Results.png file attached, you'll see it claims the single missing file is a file that is there in both directories. The one that is actually missing is highlighted and is half way down the right hand list. It should be that one that is identified as the missing file.

Any advice?

DirAAndDirB.png (18.82 kB, 277x340 - viewed 306 times.)

Results.png (38.07 kB, 1020x827 - viewed 356 times.)


taazz

Hero Member
Posts: 5368

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #3 on: July 22, 2014, 10:19:03 pm »

Two problems:

1) The base list must be the one with the most items, i.e.

Code:

for i := 0 to HashListB.Count -1 do

must iterate over the list with the bigger count, not just any list.

2) The list you search does not match the one in the for statement, i.e.

Code:

if not HashListB.Find(HashListA.Strings[i], index) then

must be changed to

Code:

if not HashListA.Find(HashListB.Strings[i], index) then

In your code, i is only a valid index for HashListB, so you must not use it as an index on HashListA. That is where GPFs are born.


Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

Gizmo

Hero Member
Posts: 831

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #4 on: July 23, 2014, 02:51:46 pm »

Taaz

Forgive me - I did have them in the right order before. I just tested it the other way round to see if I was going mad but forgot to revert it before pasting in the code above.

Anyway, I battled with this all of last night and still couldn't resolve it. Long story short, if there are 4 entries in ListB and 3 in ListA, the program correctly identifies the hash value that is missing from ListA. However, when it looks up the corresponding filename and path from the third list to which the hash relates, it lists the wrong one. So it is something to do with my list ordering but I can't for the life of me see it. So I spent some time in my lunch break quickly preparing this demo project. It is a Lazarus 1.2.2 project with FPC 2.6.4 and contains two directories (DirA, DirB) that contain demo files (DirA contains 3, and DirB contains 4).

Can someone please see where I'm going wrong?

Demo.zip (5.25 kB - downloaded 90 times.)


taazz

Hero Member
Posts: 5368

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #5 on: July 23, 2014, 03:09:39 pm »

Actually, from what I see in your code,

Code:

FileAndHashListB.Find(MissingHash, indexB);

should return false, which makes the indexB value a false positive. FileAndHashListB contains strings of the form <fullFileName>,<hashValue>, and you are searching for <hashValue> only. You will never find it unless you build your own search method that compares only the hash value within each FileAndHashListB string instead of the complete string.


howardpc

Hero Member
Posts: 4144

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #6 on: July 23, 2014, 03:33:11 pm »

To compare the lines in the memos change Button1Click to the following:

Code:

procedure TForm1.Button1Click(Sender: TObject);

var
  DirA, DirB, FilePath, FileName, FullPathAndName, FileHashA, FileHashB, s: string;
  TotalFilesDirA, TotalFilesDirB,       // Stringlists just for the file names
    HashListA, HashListB,               // Stringlists just for the hashes of each file in each directory
    FileAndHashListA, FileAndHashListB, // Stringlists for the combined lists of both hashes with filenames
    MisMatchList
    : TStringList;
  i: integer;

begin
  DirA := lblDirAName.Caption;    // Use DirA demo folder provided, holds 3 files
  DirB := lblDirBName.Caption;    // Use DirB demo folder provided, holds 4 files

  try
    // First, list and hash the files in DirA
    // FindAllFiles creates and returns the list itself, so no separate Create is needed
    TotalFilesDirA := FindAllFiles(DirA, '*', True);
    TotalFilesDirA.Sort;

    HashListA := TStringList.Create;
    FileAndHashListA := TStringList.Create;
    HashListA.Sorted := true;
    FileAndHashListA.Sorted := true;

    for i := 0 to TotalFilesDirA.Count -1 do
      begin
        FilePath := ExtractFilePath(TotalFilesDirA.Strings[i]);
        FileName := ExtractFileName(TotalFilesDirA.Strings[i]);
        FullPathAndName := FilePath + FileName;
        FileHashA := SHA1Print(SHA1File(FullPathAndName));
        HashListA.Add(FileHashA);
        FileAndHashListA.Add(FullPathAndName + ',' + FileHashA);
      end;
    HashListA.Sort;
    Memo1.Text:= HashListA.Text;
    lblTotalFileCountNumberA.Caption := IntToStr(TotalFilesDirA.Count);

    // Then, list and hash the files in DirB
    // FindAllFiles creates and returns the list itself, so no separate Create is needed
    TotalFilesDirB := FindAllFiles(DirB, '*', True);
    TotalFilesDirB.Sort;

    HashListB := TStringList.Create;
    FileAndHashListB := TStringList.Create;
    HashListB.Sorted := true;
    FileAndHashListB.Sorted := true;

    for i := 0 to TotalFilesDirB.Count -1 do
      begin
        FilePath := ExtractFilePath(TotalFilesDirB.Strings[i]);
        FileName := ExtractFileName(TotalFilesDirB.Strings[i]);
        FullPathAndName := FilePath + FileName;
        FileHashB := SHA1Print(SHA1File(FullPathAndName));
        HashListB.Add(FileHashB);
        FileAndHashListB.Add(FullPathAndName + ',' + FileHashB);
      end;
    HashListB.Sort;
    FileAndHashListB.Sort;

    Memo2.Text:= HashListB.Text;
    lblTotalFileCountNumberB.Caption := IntToStr(TotalFilesDirB.Count);

    // Now work out where the differences are.
    // Start by establishing if the dirs are identical : same no of files + same hashes = matching dirs
    lblFileCountDiffB.Caption := IntToStr(TotalFilesDirB.Count - TotalFilesDirA.Count);

    MismatchList := TStringList.Create;

    for i := 0 to HashListA.Count-1 do begin
      s:=HashListA[i];
      if (HashListB.IndexOf(s) < 0) then
        MisMatchList.Add(s + ' found only in Dir A');
    end;
    for i := 0 to HashListB.Count-1 do begin
      s:=HashListB[i];
      if (HashListA.IndexOf(s) < 0) then
        MisMatchList.Add(s + ' found only in Dir B');
      end;

    if (MisMatchList.Count > 0) then
      ShowMessage(MismatchList.Text)
    else ShowMessageFmt('Dir A and Dir B contain %d identical files',[HashListB.Count]);

  finally
    TotalFilesDirA.Free;
    TotalFilesDirB.Free;
    HashListA.Free;
    HashListB.Free;
    FileAndHashListA.Free;
    FileAndHashListB.Free;
    MismatchList.Free;
  end;
end;


Gizmo

Hero Member
Posts: 831

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #7 on: July 23, 2014, 03:39:44 pm »

Taaz

I thought StringList.Find looks for any string in the list even if it is surrounded with other characters to the left and right by other text. From what you're saying it sounds as though it looks through the list line by line and only returns true if the entire line matches what is searched for?

If that is the case, then I will build a Pos and PosEx routine but I, perhaps incorrectly, assumed .Find found the string in the list no matter where it was.

If indeed it does do that, then I can't see why it would be false if it is correctly looking for the right hash. If you're looking for hash AB1234...AE in the following stringlist :

DirA\FileA.txt,AB1234...AE
DirA\FileB.txt,BA6544...AE
DirA\FileC.txt,AE6789...AE

I would expect index to contain 1, i.e. row one of the StringList.

HowardPC

Thanks for your suggestion. I will try that out tonight when I get home. It looks to make sense, which is a good start for me! Thanks for your time. Two observations though. First, I don't think it looks up the actual filename that is missing, only the hash that represents that file. And that is my point. The code I supplied already identifies the missing hashes - it's doing the filename lookup that causes the problem, because it then finds and displays the wrong line number for the file and hash that is missing. So instead of listing row 4, for example, it lists row 1. So instead of reporting the file that is not in both directories, it reports a file that is in both directories!!

That said, your code is better in that it does a two way comparison. But the whole idea is that the user is notified WHICH FILE appears to be in one dir but not the other.

Secondly, I have to use sorted lists for this because it could be comparing hundreds of thousands or millions of files, so it has to be as fast I can make it. Stringlist.find can only be used on sorted lists and works really fast. IndexOf is for unsorted lists and will be too slow with many files.

So, I need to try and find a solution using my current code. It's 99% done. It's just one little numbering issue that seems to be wrong.
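On the Find versus IndexOf point, a small check can help. To the best of my knowledge, `TStringList.IndexOf` in the FPC RTL uses the same binary search as `Find` when `Sorted` is True, but that is worth confirming against the RTL source for the FPC version in use. The sketch below only shows that both calls return the same position on a sorted list; the 8-digit hex strings are made-up stand-ins for real hash values.

```pascal
program SortedIndexOfSketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

var
  Hashes: TStringList;
  i, FoundAt: Integer;
  Wanted: string;
begin
  Hashes := TStringList.Create;
  try
    Hashes.Sorted := True;
    // Made-up 8-digit hex strings standing in for real hash values.
    for i := 1 to 1000 do
      Hashes.Add(IntToHex(i * 7919, 8));

    Wanted := IntToHex(500 * 7919, 8);

    // Find requires a sorted list and reports the position through FoundAt.
    if not Hashes.Find(Wanted, FoundAt) then
      FoundAt := -1;

    // IndexOf works on sorted and unsorted lists alike.
    WriteLn('Same position from Find and IndexOf: ',
      Hashes.IndexOf(Wanted) = FoundAt);
    WriteLn('IndexOf for a missing value: ', Hashes.IndexOf('not-a-hash'));
  finally
    Hashes.Free;
  end;
end.
```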

« Last Edit: July 23, 2014, 04:02:10 pm by Gizmo »


taazz

Hero Member
Posts: 5368

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #8 on: July 23, 2014, 03:57:10 pm »

Quote from: Gizmo on July 23, 2014, 03:39:44 pm

Taaz

I thought StringList.Find looks for any string in the list

That's true, it searches for any string in the list.

Quote from: Gizmo on July 23, 2014, 03:39:44 pm

even if it is surrounded with other characters to the left and right by other text. From what you're saying it sounds as though it looks through the list line by line and only returns true if the entire line matches what is searched for?
If that is the case, then I will build a Pos and PosEx routine but I, perhaps incorrectly, assumed .Find found the string in the list no matter where it was.

That is your assumption, and it's wrong. What you describe is usually called "substring in string" in the documentation. You need to use the Pos/PosEx or MidStr functions to limit your comparison to the hash value.
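The difference between the two kinds of search can be sketched as follows. `IndexOfLineContaining` is a hypothetical helper written for this example, and the file names and shortened hash values are made up. Note that the substring scan is linear, so it gives up the binary-search speed that `Find` has on a sorted list.

```pascal
program FindVersusPosSketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: index of the first line containing Hash anywhere, or -1.
function IndexOfLineContaining(List: TStrings; const Hash: string): Integer;
var
  i: Integer;
begin
  Result := -1;
  for i := 0 to List.Count - 1 do
    if Pos(Hash, List[i]) > 0 then
      Exit(i);
end;

var
  FileAndHashList: TStringList;
  Index: Integer;
begin
  FileAndHashList := TStringList.Create;
  try
    FileAndHashList.Sorted := True;
    // Made-up entries in the <fullFileName>,<hashValue> layout from the thread.
    FileAndHashList.Add('DirA\FileA.txt,AB1234');
    FileAndHashList.Add('DirA\FileB.txt,BA6544');
    FileAndHashList.Add('DirA\FileC.txt,AE6789');

    // Find compares whole lines, so a bare hash value is not found.
    WriteLn('Find with bare hash: ', FileAndHashList.Find('BA6544', Index));

    // Find succeeds only when the complete line is supplied.
    WriteLn('Find with full line: ',
      FileAndHashList.Find('DirA\FileB.txt,BA6544', Index));

    // A substring scan locates the line that carries the hash.
    Index := IndexOfLineContaining(FileAndHashList, 'BA6544');
    if Index >= 0 then
      WriteLn('Substring scan found: ', FileAndHashList[Index]);
  finally
    FileAndHashList.Free;
  end;
end.
```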


howardpc

Hero Member
Posts: 4144

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #9 on: July 23, 2014, 07:29:29 pm »

The attached project shows a way to avoid filename lookup altogether by storing the hash and filename together. I think you can adapt these ideas for your needs.
Searching is done by hash, and the associated filename is stored where the found hash is located - no need for an error-prone search for it in some other data container via a complex mapping.
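The attached project is not reproduced here, so the following is only a sketch of one way to keep each hash and its filename together, not necessarily the approach used in the attachment. It stores, next to every hash in a sorted `TStringList`, the index of the matching entry in an unsorted filename list via `AddObject`. `AddEntry`, `ReportOnlyIn` and `NewSortedList` are hypothetical helpers, and the file names and hash values are made up.

```pascal
program HashWithFileSketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: a sorted list that keeps duplicate hashes
// (two files with identical content).
function NewSortedList: TStringList;
begin
  Result := TStringList.Create;
  Result.Sorted := True;
  Result.Duplicates := dupAccept;
end;

// Hypothetical helper: record FileName in the unsorted Files list and
// store its index alongside Hash, so sorting cannot separate the two.
procedure AddEntry(Hashes, Files: TStringList; const Hash, FileName: string);
begin
  Hashes.AddObject(Hash, TObject(PtrInt(Files.Add(FileName))));
end;

// Hypothetical helper: report every hash in Base that is absent from Other,
// together with the file name stored next to it.
procedure ReportOnlyIn(Base, BaseFiles, Other: TStringList;
  const DirLabel: string);
var
  i, Dummy: Integer;
begin
  for i := 0 to Base.Count - 1 do
    if not Other.Find(Base[i], Dummy) then
      WriteLn(BaseFiles[Integer(PtrInt(Base.Objects[i]))], ' ', Base[i],
        ' found only in ', DirLabel);
end;

var
  HashesA, HashesB, FilesA, FilesB: TStringList;
begin
  HashesA := NewSortedList;
  HashesB := NewSortedList;
  FilesA := TStringList.Create;
  FilesB := TStringList.Create;
  try
    // Made-up data: DirB holds one extra file.
    AddEntry(HashesA, FilesA, 'BA6544', 'DirA/two.txt');
    AddEntry(HashesA, FilesA, 'AB1234', 'DirA/one.txt');
    AddEntry(HashesA, FilesA, 'AE6789', 'DirA/three.txt');

    AddEntry(HashesB, FilesB, 'CC0000', 'DirB/extra.txt');
    AddEntry(HashesB, FilesB, 'AB1234', 'DirB/one.txt');
    AddEntry(HashesB, FilesB, 'AE6789', 'DirB/three.txt');
    AddEntry(HashesB, FilesB, 'BA6544', 'DirB/two.txt');

    ReportOnlyIn(HashesA, FilesA, HashesB, 'Dir A');
    ReportOnlyIn(HashesB, FilesB, HashesA, 'Dir B');
  finally
    HashesA.Free;
    HashesB.Free;
    FilesA.Free;
    FilesB.Free;
  end;
end.
```

Because the filename index travels with the hash when the list sorts itself, no second lookup by text is needed once `Find` has identified a missing hash.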

publishedproject.zip (3.43 kB - downloaded 137 times.)


Gizmo

Hero Member
Posts: 831

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #10 on: July 24, 2014, 01:06:17 pm »

Hi Howard

Forgive me, but, thanks to Taaz pointing the issue with SL.Find, I worked this out for myself in the end late last night using Pos, RPosex and using ':' as delimiters between filenames and hash values in the stringlists. I have pasted my revised procedure below for the benefit of others.

That said, having examined your demo project, there is much I'd like to take from that so I will probably merge the two solutions.

I'm very thankful to you for taking the time to assist, and if you wish to comment on the procedure below (ways to make it better, faster etc) please feel free.

Code:

procedure TMainForm.btnCompareClick(Sender: TObject);
var
  DirA, DirB, FilePath, FileName, FullPathAndName, FileHashA, FileHashB,
    HashOfListA, HashOfListB, Mismatch, MissingHash, s, ExtractedFileName : string;
  TotalFilesDirA, TotalFilesDirB,       // Stringlists just for the file names
    HashListA, HashListB,               // Stringlists just for the hashes of each file in each directory
    FileAndHashListA, FileAndHashListB, // Stringlists for the combined lists of both hashes with filenames
    MisMatchList
    : TStringList;
  i, indexA, indexB,  HashPosStart , FileNameAndPathPosStart, FileNameAndPathPosEnd : integer;

begin
  i := 0;
  indexA := 0;
  indexB := 0;
  HashPosStart := 0;
  FileNameAndPathPosStart := 0;
  FileNameAndPathPosEnd := 0;
  DirA := lblDirAName.Caption;
  DirB := lblDirBName.Caption;

  try
    // First, list and hash the files in DirA
    // FindAllFiles creates and returns the list itself, so no separate Create is needed
    TotalFilesDirA := FindAllFiles(DirA, '*', True);
    TotalFilesDirA.Sort;
    sgDirA.RowCount := TotalFilesDirA.Count + 1;

    HashListA := TStringList.Create;
    FileAndHashListA := TStringList.Create;
    HashListA.Sorted := true;
    FileAndHashListA.Sorted := true;

    for i := 0 to TotalFilesDirA.Count -1 do
      begin
        FilePath := ExtractFilePath(TotalFilesDirA.Strings[i]);
        FileName := ExtractFileName(TotalFilesDirA.Strings[i]);
        FullPathAndName := FilePath + FileName;
        FileHashA := CalcTheHashFile(FullPathAndName);
        HashListA.Add(FileHashA);
        FileAndHashListA.Add(FullPathAndName + ':' + FileHashA + ':');
        // Populate display grid for DirA
        sgDirA.Cells[0, i+1] := IntToStr(i+1);
        sgDirA.Cells[1, i+1] := FullPathAndName;
        sgDirA.Cells[2, i+1] := FileHashA;
        sgDirA.Row         := i;
        sgDirA.col         := 1;
      end;
    HashListA.Sort;

    lblTotalFileCountNumberA.Caption := IntToStr(TotalFilesDirA.Count);

    // Then, list and hash the files in DirB
    // FindAllFiles creates and returns the list itself, so no separate Create is needed
    TotalFilesDirB := FindAllFiles(DirB, '*', True);
    TotalFilesDirB.Sort;
    sgDirB.RowCount := TotalFilesDirB.Count + 1;

    HashListB := TStringList.Create;
    FileAndHashListB := TStringList.Create;
    HashListB.Sorted := true;
    FileAndHashListB.Sorted := true;

    for i := 0 to TotalFilesDirB.Count -1 do
      begin
        FilePath := ExtractFilePath(TotalFilesDirB.Strings[i]);
        FileName := ExtractFileName(TotalFilesDirB.Strings[i]);
        FullPathAndName := FilePath + FileName;
        FileHashB := CalcTheHashFile(FullPathAndName);
        HashListB.Add(FileHashB);
        FileAndHashListB.Add(FullPathAndName + ':' + FileHashB + ':');
        // Populate display grid for DirB
        sgDirB.Cells[0, i+1] := IntToStr(i+1);
        sgDirB.Cells[1, i+1] := FullPathAndName;
        sgDirB.Cells[2, i+1] := FileHashB;
        sgDirB.Row         := i;
        sgDirB.col         := 1;
      end;
    HashListB.Sort;
    FileAndHashListB.Sort;

    lblTotalFileCountNumberB.Caption := IntToStr(TotalFilesDirB.Count);

    // Now work out where the differences are.
    // Start by establishing if the dirs are identical : same no of files + same hashes = matching dirs
    if TotalFilesDirB.Count > TotalFilesDirA.Count then
      begin
        lblFileCountDiffB.Caption := IntToStr(TotalFilesDirB.Count - TotalFilesDirA.Count);
      end
    else if TotalFilesDirA.Count > TotalFilesDirB.Count then
      begin
        lblFileCountDiffB.Caption := IntToStr(TotalFilesDirA.Count - TotalFilesDirB.Count);
      end
    else lblFileCountDiffB.Caption := '0';

    { If there is no difference between file count, then if all the files are
      actually the same files, the hash lists themselves will be identical if there
      were no errors or no file mistmatches.
      So instead of comparing each hash line by line, just hash the list and see if they match
      However, we don't know whether DirA or DirB is the one that might have most files in,
      so we do a count of each subtracted by the other
    }
    if ((TotalFilesDirB.Count - TotalFilesDirA.Count) = 0) or ((TotalFilesDirA.Count - TotalFilesDirB.Count) = 0) then
      begin
      HashOfListA := SHA1Print(SHA1String(HashListA.Text));
      HashOfListB := SHA1Print(SHA1String(HashListB.Text));
      if HashOfListA = HashOfListB then
        begin
          lblHashMatchB.Caption:= 'MATCH!';
        end
      end;

    // If both matched, the previous loop will have been executed.
    // If, however, one dir has a higher count than the other, the following loop runs
    // Start of Mis-Match Loop:
    if (TotalFilesDirB.Count < TotalFilesDirA.Count) or (TotalFilesDirB.Count > TotalFilesDirA.Count) then
      begin
        lblHashMatchB.Caption:= 'Mis-MATCH!';
        FileAndHashListA.Sort;
        FileAndHashListB.Sort;
        try
          MismatchList := TStringList.Create;

          // Check the content of ListB against ListA

          for i := 0 to HashListB.Count -1 do
            begin
              if not HashListA.Find(HashListB.Strings[i], indexA) then
                begin
                  MissingHash := HashListB.Strings[i];
                  HashPosStart := Pos(MissingHash, FileAndHashListB.Text);
                  FileNameAndPathPosEnd := RPosEx(':', FileAndHashListB.Text, HashPosStart);
                  FileNameAndPathPosStart := RPosEx(':', FileAndHashListB.Text, FileNameAndPathPosEnd -1);
                  if (HashPosStart > 0) and (FileNameAndPathPosStart > 0) and (FileNameAndPathPosEnd > 0) then
                    begin
                      ExtractedFileName := Copy(FileAndHashListB.Text, FileNameAndPathPosStart -1, (FileNameAndPathPosEnd - FileNameAndPathPosStart) +1);
                      MisMatchList.Add(ExtractedFileName + ' ' + MissingHash + ' is NOT in both directories');
                    end;
                end;
            end;

          // Check the content of ListA against ListB

          for i := 0 to HashListA.Count -1 do
            begin
              if not HashListB.Find(HashListA.Strings[i], indexA) then
                begin
                  MissingHash := HashListA.Strings[i];
                  HashPosStart := Pos(MissingHash, FileAndHashListA.Text);
                  FileNameAndPathPosEnd := RPosEx(':', FileAndHashListA.Text, HashPosStart);
                  FileNameAndPathPosStart := RPosEx(':', FileAndHashListA.Text, FileNameAndPathPosEnd -1);
                  if (HashPosStart > 0) and (FileNameAndPathPosStart > 0) and (FileNameAndPathPosEnd > 0) then
                    begin
                      ExtractedFileName := Copy(FileAndHashListA.Text, FileNameAndPathPosStart -1, (FileNameAndPathPosEnd - FileNameAndPathPosStart) +1);
                      MisMatchList.Add(ExtractedFileName + ' ' + MissingHash + ' is NOT in both directories');
                    end;
                end;
            end;

          // This next check is probably unnecessary because the above two for loops
          // are only executed if the number of files differ anyway. If they don't differ
          // none of this if (TotalFilesDirB.Count < TotalFilesDirA.Count) or (TotalFilesDirB.Count > TotalFilesDirA.Count) then is run
          // But, just as s secondary validation, we will check. It only takes a millisecond.
          if (MisMatchList.Count > 0) then
            begin
              ShowMessage(MismatchList.Text)
            end
            else
              ShowMessageFmt('Dir A and Dir B contain %d identical files',[HashListB.Count]);
        finally // Finally for MisMatch
          if assigned (MisMatchList) then MismatchList.Free;
        end;
    end; // End of mis-match loop
  finally
    HashListA.Free;
    TotalFilesDirA.Free;
    FileAndHashListA.Free;

    TotalFilesDirB.Free;
    FileAndHashListB.Free;
    HashListB.Free;
  end;
end;

« Last Edit: July 24, 2014, 01:49:10 pm by Gizmo »


taazz

Hero Member
Posts: 5368

Re: A fast way of comparing hash values line by line from two stringlists?

« Reply #11 on: July 24, 2014, 05:23:26 pm »

Quote from: Gizmo on July 24, 2014, 01:06:17 pm

I'm very thankful to you for taking the time to assist, and if you wish to comment on the procedure above (ways to make it better, faster etc) please feel free.

Make sure that each list is sorted once and only once. Make sure that you use the HashListX with the most items as the base for your for loop, and remove the second loop: the "// Check the content of ListB against ListA" loop does exactly the same thing as the "// Check the content of ListA against ListB" loop. Just make sure that the loop is executed once, using the list with the most items as its base, e.g.

Code:

var
  vBaseList : TStringList;
  ......
begin
  ....
  vBaseList := HashListA;
  if HashListA.Count < HashListB.Count then vBaseList := HashListB;
  for i := 0 to vBaseList.Count -1 do
    begin
      ..........

That's what I have looked at for now. I didn't look at how the files are found or how the hashes are calculated, so, following the same idea of removing duplicate work, take a look and see whether anything else is duplicated and can be removed. The next step is to remove any unneeded code; for example, converting the SHA1 or MD5 digest to a string can be removed, as engkin already said.
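One way to do the comparison in a single pass that still reports differences in both directions is a merge-style walk over the two sorted lists. This is a different technique from the Find-based loop above, offered only as a sketch. `CompareSorted` is a hypothetical helper, the hash values are made up, and `AnsiCompareText` is assumed to match the order a default `TStringList` sorts in; if `CaseSensitive` or `UseLocale` is changed, the comparison function would need to change with it.

```pascal
program MergeCompareSketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: walk two sorted lists once, reporting entries that
// appear in only one of them.
procedure CompareSorted(A, B: TStringList);
var
  i, j, c: Integer;
begin
  i := 0;
  j := 0;
  while (i < A.Count) and (j < B.Count) do
  begin
    // Assumption: same ordering as the default TStringList sort.
    c := AnsiCompareText(A[i], B[j]);
    if c = 0 then
    begin
      // Present in both lists: advance both positions.
      Inc(i);
      Inc(j);
    end
    else if c < 0 then
    begin
      WriteLn(A[i], ' found only in Dir A');
      Inc(i);
    end
    else
    begin
      WriteLn(B[j], ' found only in Dir B');
      Inc(j);
    end;
  end;
  // Whatever remains in either list has no counterpart in the other.
  while i < A.Count do
  begin
    WriteLn(A[i], ' found only in Dir A');
    Inc(i);
  end;
  while j < B.Count do
  begin
    WriteLn(B[j], ' found only in Dir B');
    Inc(j);
  end;
end;

var
  HashListA, HashListB: TStringList;
begin
  HashListA := TStringList.Create;
  HashListB := TStringList.Create;
  try
    HashListA.Sorted := True;
    HashListB.Sorted := True;
    // Made-up hash values.
    HashListA.Add('55ee');
    HashListA.Add('11aa');
    HashListA.Add('22bb');
    HashListB.Add('66ff');
    HashListB.Add('11aa');
    HashListB.Add('33cc');
    HashListB.Add('55ee');

    CompareSorted(HashListA, HashListB);
  finally
    HashListA.Free;
    HashListB.Free;
  end;
end.
```

Each list is visited once, so the cost after sorting grows with the combined length of the two lists, and an entry missing from the larger list is reported just like one missing from the smaller list.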
