Author Topic: [SOLVED] Problems getting PDF document size  (Read 608 times)

arodriguez

  • Newbie
  • Posts: 3
[SOLVED] Problems getting PDF document size
« on: August 28, 2017, 11:26:45 am »
Hello everyone,
First of all, I'm sorry if this post isn't in the correct forum. Also, excuse me if this topic or a similar one has already been solved; I searched the forum but didn't find anything.

I'm trying to make a little program to print engineering drawings in PDF format. The ones with a size of A4 and A3 must go to the laser printer, and the bigger ones must go to the plotter.
My idea is to set the path where the PDFs are located, then open each one, get the size, print to the correct printer, and repeat until they're all printed.
The problem I'm facing is getting the paper size of an open PDF. I've read something about getting the dots per inch (XDPI and YDPI) and the PageWidth and PageHeight (from here https://www.freepascal.org/~michael/articles/lazprint/lazprint.pdf), but I don't know how to read these parameters from an open PDF.

By the way, a couple of things that could help:
My PDFs have only one page, so I only need to read the size of the first and only page. Also, I'm trying to open each document because I think it's the only way to read its size, but if you know how to do it without opening each one, that would be awesome.

Thank you in advance for your help!

-Adrian

Edit: Solved!
« Last Edit: August 30, 2017, 08:28:16 am by arodriguez »

fred

  • Jr. Member
  • **
  • Posts: 81
Re: Problems getting PDF document size
« Reply #1 on: August 28, 2017, 04:47:15 pm »
In the past I have used a command line tool, pdfinfo http://www.xpdfreader.com/pdfinfo-man.html, for getting the page size of a pdf.
Old style but it worked :)
Started with OmegaSoft Pascal on OS-9/68k , now Lazarus 1.8 RC4 / FPC 3.0.4rc1 on Windows 7

Mick

  • New member
  • *
  • Posts: 37
Re: Problems getting PDF document size
« Reply #2 on: August 28, 2017, 07:31:28 pm »
Each PDF file can be examined by textual parsing.

You can treat the content of the PDF file as text and search for a substring of the form "/MediaBox [A B C D]" (without the double quotes, obviously).

This will most probably be the area (rectangle, box) of the whole document. However, it can also be redefined (overridden) for individual pages, or even for other PDF objects, so you can expect more than one occurrence. But I suppose you can assume that the page area will be the largest box. A B C D are coordinates expressed in user space units, which (unless defined otherwise) are 1/72 of an inch.

Note that this is just an idea. This approach can be a challenge with some PDF files: whether the content can be parsed this way depends on how the PDF file was "constructed". PDF is quite a complex format, imho.
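
To make the units concrete, here is a minimal sketch (not part of the original posts) that converts a box given in default user space units of 1/72 inch to millimetres. The name PointsToMm and the sample values are made up for the illustration, and the sketch assumes the PDF does not redefine the unit size.

```pascal
program MediaBoxUnits;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  MmPerInch = 25.4;
  PointsPerInch = 72.0; // default user space: 1 unit = 1/72 inch

// Converts a length in default user space units (points) to millimetres.
function PointsToMm(const Points: Double): Double;
begin
  Result := Points / PointsPerInch * MmPerInch;
end;

var
  // Hypothetical sample box [A B C D] = [0 0 595.28 841.89]
  A, B, C, D: Double;
begin
  A := 0;
  B := 0;
  C := 595.28;
  D := 841.89;
  // Width is C - A and height is D - B, because the box need not start at 0 0
  WriteLn(Format('width  = %.1f mm', [PointsToMm(Abs(C - A))]));
  WriteLn(Format('height = %.1f mm', [PointsToMm(Abs(D - B))]));
end.
```

For the sample values this comes out at roughly 210 x 297 mm, which is A4.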
« Last Edit: August 28, 2017, 07:58:29 pm by Mick »

arodriguez

  • Newbie
  • Posts: 3
Re: Problems getting PDF document size
« Reply #3 on: August 29, 2017, 10:52:59 am »
Thank you for your answers.

Quote
In the past I have used a command line tool, pdfinfo http://www.xpdfreader.com/pdfinfo-man.html, for getting the page size of a pdf.
Old style but it worked :)

Sorry, I'm very new to this and I don't know how to make this work. Can you explain? Thank you!


About Mick's suggestion of searching the file for the /MediaBox entry: I searched for this and found the following code:

Code: Pascal
program MediaBoxTest;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Returns the first '/MediaBox [...]' entry found in the stream,
// or '' if there is none
function GetMediaBox(const Stream: TMemoryStream): string;
var
  What: RawByteString;
  aPtr: PChar;
  len, n: Integer;
  i: Int64;
begin
  Result := '';
  What := '/MediaBox [';
  len := Length(What);
  i := 0;
  while i + len <= Stream.Size do
  begin
    aPtr := Stream.Memory;
    Inc(aPtr, i);
    if CompareMem(aPtr, PChar(What), len) then
    begin
      // copy at most 50 bytes, without reading past the end of the file
      n := 50;
      if Stream.Size - i < n then n := Stream.Size - i;
      SetString(Result, aPtr, n);
      Result := Copy(Result, 1, Pos(']', Result));
      Exit;
    end;
    Inc(i);
  end;
end;

var
  Stream: TMemoryStream;
  s: string;
begin
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile('c:\test.pdf');
    s := GetMediaBox(Stream);
    WriteLn(s);
  finally
    Stream.Free;
  end;
end.

This seems to work fine with some PDFs I've tried. It returns a value like [0 0 596 842], which means it's an A4. But with the ones created by our CAD software (Solidworks) it doesn't return any value. Any idea why this happens?


Thank you!

EDIT: OK, I found what's wrong with the MediaBox code. In my PDFs, the MediaBox entry has no space between "MediaBox" and "[". If I delete this space from the search string ('/MediaBox[' instead of '/MediaBox ['), it returns the MediaBox values of my PDFs. I'm going to work with this for now, but any help is very welcome. Thank you!
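
Since the whitespace between the key and the bracket evidently varies between PDF producers, one option is to search for the key alone and then skip any whitespace before the "[". The following is only a sketch under the same assumptions as the code above (the entry is stored as plain text in the file, and the first occurrence is the one wanted). FindMediaBox is a name made up here, and Data is assumed to hold the raw file contents.

```pascal
program MediaBoxSearch;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Returns the text between '[' and ']' of the first /MediaBox entry in Data,
// or '' if none is found. Whitespace between the key and '[' is optional.
function FindMediaBox(const Data: string): string;
const
  Key = '/MediaBox';
var
  p, q: Integer;
begin
  Result := '';
  p := Pos(Key, Data);
  if p = 0 then Exit;
  Inc(p, Length(Key));
  // skip spaces, tabs and line breaks after the key
  while (p <= Length(Data)) and (Data[p] in [' ', #9, #10, #13]) do
    Inc(p);
  if (p > Length(Data)) or (Data[p] <> '[') then Exit;
  // look for the closing bracket
  q := p + 1;
  while (q <= Length(Data)) and (Data[q] <> ']') do
    Inc(q);
  if q > Length(Data) then Exit; // no closing bracket
  Result := Copy(Data, p + 1, q - p - 1);
end;

begin
  // Both spellings are accepted
  WriteLn(FindMediaBox('<< /Type /Page /MediaBox [0 0 596 842] >>'));
  WriteLn(FindMediaBox('<</MediaBox[0.0 0.0 841.89 1190.55]>>'));
end.
```

Returning only the numbers between the brackets also avoids having to strip the "/MediaBox[" prefix and the trailing "]" later.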
« Last Edit: August 29, 2017, 11:38:38 am by arodriguez »

fred

  • Jr. Member
  • **
  • Posts: 81
Re: Problems getting PDF document size
« Reply #4 on: August 29, 2017, 12:20:46 pm »
Quote
Sorry, I'm very new to this and I don't know how make this work. Can you explain? Thank you!
There are always better ways, like the one above, but you can try this:


Code: Pascal
// needs the units process, SysUtils and Forms (for Application) in the uses clause
type
  TPdfSize = record
    width, height: Double;
    rotation: Integer;
  end;

function GetPdfPageSize(const filename: string): TPdfSize;
var
  outputstring: string;
  s: string;
  ind: Integer;
  fs: TFormatSettings;

  // returns the next number (digits and '.') in s, searching from position ind
  function ExtractNumber(const s: string; var ind: Integer): string;
  begin
    Result := '';
    while (ind <= Length(s)) and not (s[ind] in ['0'..'9']) do Inc(ind);
    while (ind <= Length(s)) and (s[ind] in ['0'..'9', '.']) do begin
      Result := Result + s[ind];
      Inc(ind);
    end;
  end;

begin
  // defined result in case pdfinfo cannot be started
  Result.width := 0;
  Result.height := 0;
  Result.rotation := 0;

  if process.RunCommand(Application.Location + 'pdfinfo.exe', [filename], outputstring, [poNoConsole])
  then begin
    // just for debugging
    Form1.Memo1.Lines.Append(outputstring);

    // isolate the line containing the page size
    s := Copy(outputstring, Pos('Page size:', outputstring));
    s := Copy(s, 1, Pos(LineEnding, s));

    // just for debugging
    Form1.Memo1.Lines.Append(s);

    // the pdfinfo output uses '.' as decimal separator, so parse with a
    // fixed '.' instead of the separator of the system locale
    fs := DefaultFormatSettings;
    fs.DecimalSeparator := '.';

    // extract the size numbers
    ind := 1;
    Result.width  := StrToFloat(ExtractNumber(s, ind), fs);
    Result.height := StrToFloat(ExtractNumber(s, ind), fs);

    // extract the rotation number, you could use it to swap width and height when rotation is 90 or 270 degrees
    ind := Pos('rotated', s);
    if ind > 0 then
      Result.rotation := StrToInt(ExtractNumber(s, ind));

    // just for debugging
    Form1.Memo1.Lines.Append(LineEnding);
    Form1.Memo1.Lines.Append(FloatToStr(Result.width));
    Form1.Memo1.Lines.Append(FloatToStr(Result.height));
    Form1.Memo1.Lines.Append(IntToStr(Result.rotation));
  end;
end;

arodriguez

  • Newbie
  • Posts: 3
Re: Problems getting PDF document size
« Reply #5 on: August 30, 2017, 08:24:35 am »
Hi Fred,
Thank you for this, but I'm having a problem with pdfinfo.exe: it doesn't open. I think it's because our antivirus software is blocking it. But this code will help me a lot in the future, thanks.

Finally I got the PDF size of my files searching for the /MediaBox line inside the document, and my little program is ready and working fine.
Here's the code I used, in case it could be useful for someone in the future:

Code: Pascal
{ Function to read the 'MediaBox' line of a PDF file }
function GetMediaBox(const Stream: TMemoryStream): string;
var
  What: string;
  aPtr: PChar;
  len, n: Integer;
  i: Int64;
begin
  Result := '';
  What := '/MediaBox[';
  len := Length(What);
  i := 0;
  while i + len <= Stream.Size do
  begin
    aPtr := Stream.Memory;
    Inc(aPtr, i);
    if CompareMem(aPtr, PChar(What), len) then
    begin
      // copy at most 50 bytes, without reading past the end of the file
      n := 50;
      if Stream.Size - i < n then n := Stream.Size - i;
      SetString(Result, aPtr, n);
      Result := Copy(Result, 1, Pos(']', Result));
      Exit;
    end;
    Inc(i);
  end;
end;

With this function, I'm reading something like this: /MediaBox[0.0 0.0 841.89 1190.55] .
With the following code I extract the 3rd and 4th values, check whether the page is bigger than an A3 or not, and then print to the correct printer.
(Previously, I specify the folder where it's going to search with the "Directorio" variable).

Code: Pascal
{ Needs Math (Max, Min) and Windows (ShellExecute) in the uses clause }

{ Global var }
var
  Stream: TMemoryStream;
  s: string; {whole MediaBox entry // Example: /MediaBox[0.0 0.0 841.89 1190.55]}

  sl: TStringList;
  ancho, alto: string; {3rd and 4th values of MediaBox, as text}
  anchoN, altoN: real; {3rd and 4th values of MediaBox, as numbers}

  Directorio: string;
  Extension: string;

{ 'Print' button action: reads the size of the doc
   and prints to the correct printer }
procedure TForm1.Button1Click(Sender: TObject);
var
  searchResult: TSearchRec;
  DirExt: string;
  DirImp: string;
  fs: TFormatSettings;
begin
  Directorio := Edit1.Text;
  DirExt := Directorio + '\*.pdf';

  // PDF numbers use '.' as decimal separator, whatever the system locale is
  fs := DefaultFormatSettings;
  fs.DecimalSeparator := '.';

  // Search for *.pdf inside the specified folder
  if FindFirst(DirExt, faAnyFile, searchResult) = 0 then
  begin
    repeat
      DirImp := Directorio + '\' + searchResult.Name;

      // Calls the MediaBox function
      Stream := TMemoryStream.Create;
      Stream.LoadFromFile(DirImp);
      s := GetMediaBox(Stream);
      Stream.Free;

      // Getting the width and height of the MediaBox
      // (note: sl[2] raises an exception if no MediaBox was found)
      sl := TStringList.Create;
      sl.Delimiter := ' ';
      sl.DelimitedText := s;

      ancho := sl[2];
      alto := sl[3];
      Delete(alto, Length(alto), 1); // remove the trailing ']'

      anchoN := StrToFloat(ancho, fs);
      altoN := StrToFloat(alto, fs);

      sl.Free;

      // If the page is bigger than A3 (about 842 x 1191 points), print to
      // the plotter; if not, print to the laser printer. Comparing the long
      // and the short side makes the test independent of page orientation.
      if (Max(anchoN, altoN) > 1200) and (Min(anchoN, altoN) > 850) then
        ShellExecute(Handle, 'printto', PChar(DirImp), 'cma_plotter', nil, SW_HIDE)
      else
        ShellExecute(Handle, 'printto', PChar(DirImp), 'cma_laser', nil, SW_HIDE);

    until FindNext(searchResult) <> 0;

    // Must free up resources used by these successful finds
    SysUtils.FindClose(searchResult);
  end
  else
    ShowMessage('Cannot find PDF files in the specified folder.');
end;
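
One thing to watch in the parsing step: by default StrToFloat uses the decimal separator of the system locale, while PDF files write numbers with a dot. A possible sketch of locale-independent parsing follows. ParseMediaBox is a hypothetical helper, not part of the program above, and it expects only the four numbers between the brackets.

```pascal
program MediaBoxParse;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Parses the text between the brackets of a MediaBox entry, for example
// '0.0 0.0 841.89 1190.55', and returns the page width and height in points.
// Returns False if the text does not contain four valid numbers.
function ParseMediaBox(const Box: string; out W, H: Double): Boolean;
var
  Parts: TStringList;
  FS: TFormatSettings;
  V: array[0..3] of Double;
  i: Integer;
begin
  Result := False;
  W := 0;
  H := 0;
  FS := DefaultFormatSettings;
  FS.DecimalSeparator := '.';  // PDF numbers use a dot
  FS.ThousandSeparator := ','; // avoid a clash with the decimal separator
  Parts := TStringList.Create;
  try
    Parts.Delimiter := ' ';
    Parts.DelimitedText := Trim(Box);
    if Parts.Count <> 4 then Exit;
    for i := 0 to 3 do
      if not TryStrToFloat(Parts[i], V[i], FS) then Exit;
    // the box need not start at 0 0, so subtract the first corner
    W := Abs(V[2] - V[0]);
    H := Abs(V[3] - V[1]);
    Result := True;
  finally
    Parts.Free;
  end;
end;

var
  W, H: Double;
begin
  if ParseMediaBox('0.0 0.0 841.89 1190.55', W, H) then
    WriteLn('long side > 1200: ', (W > 1200) or (H > 1200))
  else
    WriteLn('could not parse MediaBox');
end.
```

Returning a Boolean instead of raising an exception lets the caller decide what to do with a file whose size cannot be read, for example skip it rather than send it to the wrong printer.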

I also added a button that lists all the PDF files inside the folder beforehand, and a checkbox to include or skip files in subfolders:

Code: Pascal
procedure TForm1.Button2Click(Sender: TObject);
var
  List: TStringList;
begin
  { List PDFs in folder }
  Directorio := Edit1.Text;
  Extension := '*.pdf';

  ListBox1.Clear;

  // FindAllFiles (unit FileUtil) creates and returns the list itself,
  // so there is no need to call TStringList.Create first
  List := FindAllFiles(Directorio, Extension, CheckBox1.Checked {search in subdirectories});
  try
    ListBox1.Items.AddStrings(List);
    ShowMessage('Found ' + IntToStr(List.Count) + ' files to print.');
  finally
    List.Free;
  end;
end;

This program is working fine for me, but I'm sure I've done things that could be done better and "cleaner", so feel free to give your opinion if you want. It will help me learn and improve my coding.
Thank you!!

Btw, I'm marking this thread as "solved".

fred

  • Jr. Member
  • **
  • Posts: 81
Re: [SOLVED] Problems getting PDF document size
« Reply #6 on: August 30, 2017, 09:21:17 am »
I'm glad that you have solved it, some last remarks:

I checked with virustotal.com and pdfinfo.exe is clean, so AV should not be the reason.
You can check and start it by hand in a console window.

I don't think you need to Create and Free your stream and stringlist objects every time inside the loop; LoadFromFile and DelimitedText will clear the previous data.
Something like:
Code: Pascal
  // create objects
  try
    // do what you want
  finally
    // free objects
  end;
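
Filled in for the string list, the skeleton above might look like the following. It is only a sketch: the MediaBox strings are hard-coded sample values instead of text read from files, and WidthFields is a name made up for the example.

```pascal
program ReuseObjects;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Returns the third field of every box string, joined with ';'.
// The string list is created once and reused for every item.
function WidthFields(const Boxes: array of string): string;
var
  sl: TStringList;
  i: Integer;
begin
  Result := '';
  sl := TStringList.Create;          // create once, before the loop
  try
    sl.Delimiter := ' ';
    for i := Low(Boxes) to High(Boxes) do
    begin
      sl.DelimitedText := Boxes[i];  // replaces the previous contents
      if sl.Count >= 3 then
      begin
        if Result <> '' then Result := Result + ';';
        Result := Result + sl[2];
      end;
    end;
  finally
    sl.Free;                         // free once, even after an exception
  end;
end;

begin
  WriteLn(WidthFields(['0 0 596 842', '0.0 0.0 841.89 1190.55']));
end.
```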

You can use "FindAllFiles(ListBox1.Items, Directorio, Extension)"; SearchSubDirs defaults to True. That saves you a lot of lines ;)

There are many ways to solve something in software, but I think the point is to make it work reliably, learn from it and have fun with it :)