-
Notifications
You must be signed in to change notification settings - Fork 0
Input optimizations
Ilya Kashnitskiy edited this page Jan 29, 2020
·
5 revisions
- read big file
- file consist a lot of short lines:
- list of nodes (name, coordinates), one per line, unknown amount, random order
- list of connects (name to name), one per line, unknown amount, random order
- valid all input data
Legend:
L
- number of lines,N
- number of nodes,C
- number of connects (undirect)
- we should get the next line of the file very quickly -
L
times. - we should valid every line very quickly -
N
times for nodes +C
times for connects - efficient data storage:
- fast access
- optimal memory using
- We can't use standart
getline()
from libc. - Our own
ft_get_next_line()
(same asgetline()
) supports simultaneous reading from several FD. For this, them create list of buffers (1 for every fd, with allocated memory), and each time we call them, they looking in this list buffer for particular fd, and work with it (again - with allocated memory). - But all we need is a quick read of a single fd.
- And, obviously, we should be able to quickly extract the next line from the buffer.
So, special for lemin I wrote a simple but fast version of get_next_line()
(and put into my libft).
Take a look at ft_fast_input_gnl.h ft_fast_input_gnl.c
1. For storage data while reading I use my t_vector
- dynamic array from libft:
- access any element -
O(1)
- add new element to the end -
O(1)
- number of memory allocations -
log(n) - log(INIT_SIZE)
- wheren
- count of storaged data.
2. Node storage:
- Coordinates of nodes are necessary only for visualization. Directly in the program, we only need to check them for validity. However, I don’t see any reason to store them, therefore, after checking for validity (and wright them back, of course (but more about that in output optimizations), I will omit them.
- Despite the fact that the gnl returns a finished line, it allocates a new piece of memory for each new line, and I really (I mean REALLY) do not want to work with an array of pointers in the future, each of which points to its own, potentially EVEN not cached piece of memory. This can slow me down to 20 times, compared with sequential reading. In addition, we will probably work with short names, and, for example, a name of 5 characters will occupy at least 2 bytes (or even more - it depends on the particular machine). And do not forget about the need to get rid of the coordinates!
- So, this is my decision:
- create
t_vector chars
- stored item -char
- create
t_vector names
- stored item -size_t/(char *)
- get next
line
- check on valid (name + coords; no need to
strsplit()
... ; just detect count of words (must be 3); check 2-nd and 3-th word on valid format foratoi()
) - pick out node name (for example from
i
toj
symbols inline
) - add into
names
new item with value =chars.curlen
- add into
chars
j-i + 1
items - full name +'\0'
- go to get next line and repeat
- create
by the way - in fact, on each iteration wi will call 1
malloc()
forline
and thenfree()
it. simple for system's memory management => fast!
- after read all nodes, iterate over
names
and turn indexes (size_t) to direct pointers (char *) (they always have same size, but we can store pointer while reading because we work with dynamic struct and it may realocate memory.) - Here are the benefits of this approach:
-
names
- usual array of pointers. each pointer pointed to 'C'-string. convenient for work - access to each name - O(1)
- all these strings are next to each other in the
chars
- ideal for caching - each line takes exactly as much space as it needs and not a single byte more
-
we can use index of node in
names
as ID!!! from here on, for indicate to particular node we will use itself index innames
. This is extremely convenient!
-
3. Connect storage:
- create
t_vector connects
- stored item -{int;int}
(structure with two numbers) - We get 2 names. For every, we need match it with particular node. Best way - use nodes ID. So our task - the fastest search for the location of an element
name
in an arraynames
. - Best case - binary search.
- So, when I read all nodes, I am just sort
names
using usualqsort()
andstrcmp()
. And for match every node in connections by simplest binary search. Nothing difficult and very quickly.
Why not binary tree or hesh-table?
- more difficult memory organisation and not so good for cache.
- it have no sense - such structures use, when we need fast search while push and pop data. in our case, we read all data, then work with it. simple array and qsort + binaryfind MUCH faster!
By the way, we also need to verify that all names are unique. After sorting the names
it became easy! Just iter over all vector and compare i
and i+1
elements!
So, short overview:
- add new node -
O(1)
- qsort -
N * log (N)
- validation -
O(N)
- add new connect -
2 * log(N)