Recent

Author Topic: I need a suggestion…  (Read 2060 times)

Espectr0

  • Full Member
  • ***
  • Posts: 218
I need a suggestion…
« on: May 04, 2022, 01:16:19 pm »
Hola gente,

 I have a query I would like you to guide me and suggest a little.

 I have a database in xml format, in which I have blocks of phrases in English and their corresponding ones in Spanish for example, let's say a dictionary type.
 What should I do to work with her?

 -use xml parser and fetch the phrase whenever I need it directly from the xml?

 -read the entire xml and save everything in a string structure?

 -work with sql?

 They can be large files of more than 100,000 lines.

 Thank you, regards

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: I need a suggestion…
« Reply #1 on: May 04, 2022, 01:29:04 pm »
Put everything into an SQL database, like SQLite. Searching a text file is very slow.

Alternatively: read the interesting parts, put them in a format like "name=value", sort and write that to a file (with a TStringList) and read it back into separate TStringLists, like one for each starting char.
« Last Edit: May 04, 2022, 01:32:44 pm by SymbolicFrank »

Espectr0

  • Full Member
  • ***
  • Posts: 218
Re: I need a suggestion…
« Reply #2 on: May 04, 2022, 04:01:31 pm »
so the best option is to convert it to a sql database and read from there directly?
And if I want to do word-style sugestiona that check what you write and suggest, is it fast?

Handoko

  • Hero Member
  • *****
  • Posts: 5158
  • My goal: build my own game engine using Lazarus
Re: I need a suggestion…
« Reply #3 on: May 04, 2022, 04:23:21 pm »
If performance is important, you should use a database. They are optimized for performance and use the system's memory efficiently.

If the data is not too large, you may load them into memory and do the searching in the memory. You can use TStringList, but it is not very memory efficient for working on large amount of data. Read more:
https://forum.lazarus.freepascal.org/index.php/topic,58531.msg436020.html#msg436020

Even, if memory is not an issue. I doubt we can write data searching/sorting code that is as fast as those using database.
« Last Edit: May 04, 2022, 04:25:40 pm by Handoko »

Edson

  • Hero Member
  • *****
  • Posts: 1302
Re: I need a suggestion…
« Reply #4 on: May 04, 2022, 04:26:21 pm »
How to store, depend of what you want to do with that information.

If you want to implement a word suggestion system, a database like SQLite would be a good option. It's fast, can implement advanced search criteria and can be easily updated, but needs a external library.

If you need something without dependencies, you can store the dictionary in memory using some list structure. In that case you would need to implement a search method according to your requirements. Search in memory can be very fast if you implement it correctly.
Lazarus 2.2.6 - FPC 3.2.2 - x86_64-win64 on Windows 10

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: I need a suggestion…
« Reply #5 on: May 04, 2022, 04:54:08 pm »
I did something like this in the past, to get addresses from postal codes. Split into 100 different, sorted lists, 2 GB total. It took 2 minutest to start and read the file into memory. Total memory usage: 4GB. After that, it was blindingly fast. Select the right list, read the record in the middle, if it is too large, read the one midst between the first and the middle one, etc.

A database (on a fast SSD) can do just as well, it depends on the size of the index and the complexity of the queries. If the index doesn't fit in memory, it slows down dramatically.

Handoko

  • Hero Member
  • *****
  • Posts: 5158
  • My goal: build my own game engine using Lazarus
Re: I need a suggestion…
« Reply #6 on: May 04, 2022, 06:45:12 pm »
I did something like this in the past ...

Interesting.

I personally will avoid using any database because I am not good in database programming. For your case - finding/sorting huge amount of data - I think most people will consider using a database first. Good to know your solution works but I wonder why you chose not doing it using a database. Can you please shed some light?

MarkMLl

  • Hero Member
  • *****
  • Posts: 6692
Re: I need a suggestion…
« Reply #7 on: May 04, 2022, 07:20:33 pm »
I personally will avoid using any database because I am not good in database programming. For your case - finding/sorting huge amount of data - I think most people will consider using a database first. Good to know your solution works but I wonder why you chose not doing it using a database. Can you please shed some light?

In the pre-database era, I handled repetitive work on large (for the day) files full of columnar text by (a) generating a single level-1 index file comprising 4-byte entries giving the offset of every line relative to the start and (b) generating multiple level-2 index files comprising the offsets in the level-1 index sorted according to specified columns in the original text file. I've got a vague recollection of tools built on top of those which allowed files to be merged etc... in any event it was a simple solution which turned out to be flexible and efficient.

I'm not necessarily arguing against using a database now that they are easily available. However for a comparatively simple read-only job something like that might be an alternative, particularly if storage inflation is considered to be a problem (e.g. PostgreSQL files are of the order of 10x the size of the original raw data).

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Espectr0

  • Full Member
  • ***
  • Posts: 218
Re: I need a suggestion…
« Reply #8 on: May 04, 2022, 09:10:30 pm »
so the first option would be to read the xml and fill a specialized list of:

 
Code: Pascal  [Select][+][-]
  1. type TdbItem = packed record
  2.    Original, Translated, Notes: String;
  3.  

 But if the list is 100000 or more TdbItem, does it consume a lot of memory?

 So I should create a sql database?

 Thank you

Handoko

  • Hero Member
  • *****
  • Posts: 5158
  • My goal: build my own game engine using Lazarus
Re: I need a suggestion…
« Reply #9 on: May 05, 2022, 04:16:45 am »
I have an own-made TListBox-like component that will do real-time filter based on user's input. I tested it on a case of 500000 (500k) items, which on average each item was 12 characters length. For optimizing memory usage and performance, the component uses 3 lists:
- A TStringList for keeping the original data
- A dynamic array of LongInt for filtered items
- An additional dynamic array of LongInt for temporary usage

Running the code on Linux, my task manager only showed memory increasing about 0,1 GB.

Nowadays computers usually have at least 4GB memory. So I believe you should be okay to use TStringList for handling 100k records of 3 strings. Except those strings are very long or you're trying to maintain too many lists internally.

So I should create a sql database?

Your case seems to be doable either with or without SQL.

If you have plenty of time and you're not good in SQL, do it using SQL. Use this opportunity to improve your SQL skill. If your time is limited and you're not good in SQL then don't do it using SQL. If you want advanced search features like mentioned by Edson, use SQL.

@MarkMLl
Good reasons.
« Last Edit: May 05, 2022, 05:18:07 am by Handoko »

Zvoni

  • Hero Member
  • *****
  • Posts: 2330
Re: I need a suggestion…
« Reply #10 on: May 05, 2022, 08:27:34 am »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

Espectr0

  • Full Member
  • ***
  • Posts: 218
Re: I need a suggestion…
« Reply #11 on: May 05, 2022, 01:05:56 pm »
thanks everyone for the suggestions,
another thing that would make me more inclined to sql, exists this function in sqllite :

Quote

SQL string similarity function
The %SIMILARITY function returns a numeric value indicating the similarity between each value of field and the text specified in document. The fractional values returned range from 0 (no similarity at all) to 1 (identical). The returned value is of type NUMERIC with a precision of 19 and a scale of 18.


Nuevamente gracias.

Zvoni

  • Hero Member
  • *****
  • Posts: 2330
Re: I need a suggestion…
« Reply #12 on: May 05, 2022, 01:29:27 pm »
thanks everyone for the suggestions,
another thing that would make me more inclined to sql, exists this function in sqllite :

Quote

SQL string similarity function
The %SIMILARITY function returns a numeric value indicating the similarity between each value of field and the text specified in document. The fractional values returned range from 0 (no similarity at all) to 1 (identical). The returned value is of type NUMERIC with a precision of 19 and a scale of 18.


Nuevamente gracias.
AFAIK, No!

The only thing coming near your requirement would be the SpellFix-Extension, because it has Distance and Score
https://www.sqlite.org/spellfix1.html
Quote
......and is not a part of any standard SQLite build. It is a loadable extension.
« Last Edit: May 05, 2022, 01:33:28 pm by Zvoni »
One System to rule them all, One Code to find them,
One IDE to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
Code is like a joke: If you have to explain it, it's bad

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: I need a suggestion…
« Reply #13 on: May 06, 2022, 09:09:58 am »
I did something like this in the past ...

Interesting.

I personally will avoid using any database because I am not good in database programming. For your case - finding/sorting huge amount of data - I think most people will consider using a database first. Good to know your solution works but I wonder why you chose not doing it using a database. Can you please shed some light?

The input was a very large XML document, so I had to process it anyway, and the output was a set of key:value pairs, handed out by a HTTP server. Normally, you then install a database and webserver on the server and build a service with some scripting language. But after processing, I had the output in memory in a TStringList anyway, so why bother? Open a port and send it directly.

Only, the TStringList exceeded 4GB in size, which created problems, and searching for the key was quite slow. So, I made a lightweight list with a separate index and I split it into 10 separate lists (all keys started with a number). That was much better, but it still had a noticeable delay between typing the URL in a browser and receiving the response. With a 100 separate lists (the first two chars), the processing of the initial XML was much faster and the interval between requesting and receiving was down to the network latency.

Often, if you need a webserver and database, you have to ask the sysadmin, which complicates matters. It is much easier just to put an executable and text file on the server and make sure that executable always runs.

 

TinyPortal © 2005-2018