Recent

Author Topic: 7,970,312 As it turns out.  (Read 1571 times)

jamie

  • Hero Member
  • *****
  • Posts: 1123
Re: 7,970,312 As it turns out.
« Reply #15 on: January 05, 2019, 02:59:19 pm »
That file needs to be indexed..

Read in the file line by line, generate a list of strings with the names given as a STRING index in a separate
file..
 
  When you do this, at the same time you create a fixed record size that contains the numerical imformation
converted down to binary live so to compact it. Name fields can be a index to the string table so that repeating
string sets are not duplicated in the file..

 When all said and done, the main file can be index and any textual references made in the record can index that
from the string file.

 This may  still not fit in conventional memory but you could virtualize it by writing a wrapper class that uses a
disk file instead but looks and acts like memory.

Handoko

  • Hero Member
  • *****
  • Posts: 2766
  • My goal: build my own game engine using Lazarus
Re: 7,970,312 As it turns out.
« Reply #16 on: January 05, 2019, 03:41:35 pm »
I admit database programming is my weakness. :'(

But I tested the program memory usage:
- Before running the program: 839 MB
- After the program started: 846 MB
- Highest usage when generating data: 886 MB
- Highest usage when searching data: 903 MB

The maximum memory usage is 903-839, which is 64 MB for 10,000,000 records with the file size of 200+ MB, I think it is safe. My computer has several GBs of memory. :D

Generating the huge data takes about 1 minutes on my Core 2 Quad Linux 64-bit computer. If I do not reduce the program responsiveness, it may take about 3x longer. Searching the text "hi" takes about 20 seconds.

Seriously, can you please take the source code and improve it by adding the index feature? I really want to learn how to do it.

I do not have any formal programming education. Twenty years ago I ever wrote my own database library, it used a lot of additional files for indexing. No matter how hard I optimized it, it's still run very slow.

I know no need to reinvent the wheel, but for education purpose I think it is fun reinventing the things already invented.

Image source: https://www.flickr.com/photos/dianneyee/8129911408
« Last Edit: January 05, 2019, 04:05:14 pm by Handoko »

JLWest

  • Sr. Member
  • ****
  • Posts: 340
Re: 7,970,312 As it turns out.
« Reply #17 on: January 05, 2019, 04:44:45 pm »
I will confess I didn't test it. I don't think I have a dog of a machine. So I'll test.

Why a Memo1 and not a listbox?
JLWEST 
 FPC 3.0.4, Lazarus IDE v1.8.2 Windows 10 Pro
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig 1.5 Terabyte SSD
3 Terabyte conventional disk space.

Handoko

  • Hero Member
  • *****
  • Posts: 2766
  • My goal: build my own game engine using Lazarus
Re: 7,970,312 As it turns out.
« Reply #18 on: January 05, 2019, 05:53:57 pm »
Why a Memo1 and not a listbox?

Nothing special. It's just my habit to use TMemo for showing test result. I've tried to modify the code to use TListBox instead, it worked without any issue too.

You said you failed to load 7 millions+ items into a list of string, aren't you?
I think I found the culprit of your issue.

Below is the code I used for a test:

Code: Pascal  [Select]
  1. procedure TForm1.Button1Click(Sender: TObject);
  2. var
  3.   S: string;
  4.   i: Integer;
  5. begin
  6.   for i := 1 to 5000000 do
  7.   begin
  8.     S := i.ToString;
  9.     Memo1.Append(S);
  10.     ListBox1.AddItem(S, nil);
  11.     if (i mod 100) <> 0 then Continue;
  12.     Application.ProcessMessages;
  13.   end;
  14.   ShowMessage('Done');
  15. end;

Only on 5 millions items, the Memo1 and ListBox1 already ate about 2 GB of my computer memory and it took more than 5 minutes.

Why it's so stressful compare to the previous 10 millions items massive harddisk operation?

My previous code loads only necessary data into memory. Although it was massive disk operation, the OS and harddisk caching system optimize the processes. It took only 20 seconds to read all the 10 millions records and load some necessary items into memory.

The 5 millions TMemo and TListBox string appending is extremely stressing the hardware.

Because string is expensive >:D

So I believe that is why you failed. You tried to load all the 7 millions data into a list of string (or TStringList or TListBox). AFAIK, String and TStringList are okay for handling data that are very long but not huge in the count. Because they have to deal with memory allocation, fragmentation, etc.
Correct me if I'm wrong.

You still can load all the data into memory and run smoothly. But you should write your own component, which is more efficient both in memory usage and performance. Maybe you can use BlockRead and TFPList for this purpose. But it will be very hard, I prefer the "load only the needed data" approach.
« Last Edit: January 05, 2019, 06:19:44 pm by Handoko »

Windsurfer

  • Sr. Member
  • ****
  • Posts: 344
    • Windsurfer
Re: 7,970,312 As it turns out.
« Reply #19 on: January 05, 2019, 07:12:30 pm »
Just a small comment. When using a simple editor to look at data, it will probably stop if it meets a <Ctrl>F which indicated End of File in older Windows Notebook. (I have no idea what Windows 10 apps do.)

lucamar

  • Hero Member
  • *****
  • Posts: 734
Re: 7,970,312 As it turns out.
« Reply #20 on: January 05, 2019, 08:17:21 pm »
Just a small comment. When using a simple editor to look at data, it will probably stop if it meets a <Ctrl>F which indicated End of File in older Windows Notebook. (I have no idea what Windows 10 apps do.)

Not Ctrl+F (#$06) but Ctrl+Z (#$1A), the good old CP/M EOF marker. Most modern applications simply ignore it or can be configured to ignore it, since they can get the size in bytes from the OS.
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 1.8.4/FPC 3.0.4 on:
(K)Ubuntu 12..16, Windows XP SP3 (Home/Prof.) and various DOS incarnations.

engkin

  • Hero Member
  • *****
  • Posts: 2194
Re: 7,970,312 As it turns out.
« Reply #21 on: January 05, 2019, 08:31:01 pm »
@Handoko

Items and Lines in TListBox and TMemo are initially TStringList, then they get replaced with a different classes that interact with the OS. Meaning the data is doubled. A copy in the Items/Lines and the other is sent to the OS (one item at a time for TListBox, for instance).

The typical method is to use owner-drawn arrangement.

JLWest

  • Sr. Member
  • ****
  • Posts: 340
Re: 7,970,312 As it turns out.
« Reply #22 on: January 05, 2019, 10:31:30 pm »
We are above my level of understanding here.

Notepad++ will load the file. If I try to do almost anything with the file it will crash to the desktop.

The last record is just 99.

I can load it and page down a couple of thousand records before it blows.

@jamie

I think you are talking about a Hash file which would be a separate file built that looks such:

Airport Record
Name  Position
-----------------
KLAX   6528
KLAS   7124

1    460 1 0 LELL Sabadell
17   1791 0 0 TN04 [H] Bristol Regional Medical Center
1   5669 1 0 KBJC Rocky Mountain Metropolitan Airport

So  I would need to read the file and when I get to a 1 Airport, 17, Helipad or 16 [Seaplane base] build a Hash record. I can do that.

@Handoko

I don't entirely understand.

"Maybe you can use BlockRead and TFPList for this purpose. But it will be very hard, I prefer the "load only the needed data" approach."

The BlockRead and TFPList is over my head right now. I prefer the "load only the needed data" approach."

 There in lies the problem. I need all 7,970,312 records and for them to stay intact. The only changes will be to a few thousand 1301 records with added data.
JLWEST 
 FPC 3.0.4, Lazarus IDE v1.8.2 Windows 10 Pro
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig 1.5 Terabyte SSD
3 Terabyte conventional disk space.

jamie

  • Hero Member
  • *****
  • Posts: 1123
Re: 7,970,312 As it turns out.
« Reply #23 on: January 06, 2019, 12:24:00 am »
I wasn't talking about a hash file just a simple native file for example

"Rocky Mountain Metropolitan Airport"

Lets say that string appears many times in the list, the idea is to write that once in a stringlist
and recreate that line entry formatted to accept the index number of the string list so that later you can
refer to its full name.

 The numerical entries could be converted to a binary type, one that is the largest that is needed per field
 for example, lets say one field never exceeds 255, you could use a byte there, and others that may never
exceed 65535, you could use a word there and so on.

 create a record with those formatted numbers in it "PACKED" and you get the most  out of your storage and
be able to index it fast.


engkin

  • Hero Member
  • *****
  • Posts: 2194
Re: 7,970,312 As it turns out.
« Reply #24 on: January 06, 2019, 12:33:53 am »
@JLWest, any sample file?

JLWest

  • Sr. Member
  • ****
  • Posts: 340
Re: 7,970,312 As it turns out.
« Reply #25 on: January 06, 2019, 01:23:10 am »
1    460 1 0 LELL Sabadell
17   1791 0 0 TN04 [H] Bristol Regional Medical Center
1   5669 1 0 KBJC Rocky Mountain Metropolitan Airport

These will only appear in the 7,970,312  records 1 time. They are unique records and denote the start of an airport, helipad or seaplane base. There is only one KBJC Rocky Mountain Metropolitan Airport in the world and ICAO Airports standards will only allow 1.

You can in fact build an airport in your backyard and call it KBJC but no airport authority will recognize it, put it on an aeronautical chart or list it.

So I think a Hash file for quick access to an airport is the way to go.The airport name is 5 fields in on all of the records. 

@ engkin

Nothing yet. I can only work on the computer about 1 1/2 at a time before everything becomes a blur. Then I have to rest my eyes. I had a lens replaced Dec 17th and can't get glasses until the Jan 18. I'm not complaining but I'm just a little slow on getting stuff done.I need to write a quick and dirty and dump KLAX. But it may be to big to post. In that case I'll have to dummy one up.
JLWEST 
 FPC 3.0.4, Lazarus IDE v1.8.2 Windows 10 Pro
 Intel i7 770K CPU 4.2GHz 32702MB Ram
GeForce GTX 1080 Graphics - 8 Gig 1.5 Terabyte SSD
3 Terabyte conventional disk space.

jamie

  • Hero Member
  • *****
  • Posts: 1123
Re: 7,970,312 As it turns out.
« Reply #26 on: January 06, 2019, 01:48:24 am »
You still can index the file even if that is the case, but also you could check for duplicates as you go..

Hashing the names I suppose will compress it down but you still can write the names out to a
sequencel file and for each one you write you'll have the file index that you can use in the record so that
later on you can directly index the name of the port. You will need to add a  marker on the end of each string
so when you read it back you'll know when to stop or use a led byte for length.

engkin

  • Hero Member
  • *****
  • Posts: 2194
Re: 7,970,312 As it turns out.
« Reply #27 on: January 06, 2019, 02:04:48 am »
@JLWest, it is OK. Take your time.

josh

  • Hero Member
  • *****
  • Posts: 700
Re: 7,970,312 As it turns out.
« Reply #28 on: January 06, 2019, 03:13:21 am »
Hi

Are you referring to the data from here
http://ourairports.com/data/
If so I had a quick look at the data, and the data has links to the other data; so with a bit of work you should be able to create your own data files.

ie the airports.csv id field is used in the airportfrequency file in the airport_ref field.

I also sorted the sheets, and some of the airports have more than one frequency; sp you would need to cater for that.

If you know how you wish the user to access the data and what you wish to retrieve from it; then i suspect members could give you a more exact idea.



Development Installation Lazarus 1.3, FPC 2.7.1,Windows 7/8 32/64, OSX, *nix

Test Environment Lazarus & FPC Trunk on Windows and OSX (Cocoa Mainly on OSX). Testing also Crosscompile windows to OSX.. 
Any posts made from 2015 will be based on Lazarus Trunk.

lucamar

  • Hero Member
  • *****
  • Posts: 734
Re: 7,970,312 As it turns out.
« Reply #29 on: January 06, 2019, 03:43:00 am »
I can only work on the computer about 1 1/2 at a time before everything becomes a blur. Then I have to rest my eyes. I had a lens replaced Dec 17th and can't get glasses until the Jan 18. I'm not complaining but I'm just a little slow on getting stuff done.I need to write a quick and dirty and dump KLAX. But it may be to big to post. In that case I'll have to dummy one up.

I'm preparing a little demo for you but it won't be ready till tomorrow [afternoon..night]. Don't loose faith! :)

ETA: Huh! Three-kings day ... more towards (late) night then: my nieces come first.
« Last Edit: January 06, 2019, 03:46:00 am by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 1.8.4/FPC 3.0.4 on:
(K)Ubuntu 12..16, Windows XP SP3 (Home/Prof.) and various DOS incarnations.