Skip to content

Input optimizations

Ilya Kashnitskiy edited this page Jan 29, 2020 · 5 revisions

Task:

  • read big file
  • file consist a lot of short lines:
    • list of nodes (name, coordinates), one per line, unknown amount, random order
    • list of connects (name to name), one per line, unknown amount, random order
  • valid all input data

Points for optimization:

Legend: L - number of lines, N - number of nodes, C - number of connects (undirect)

  • we should get the next line of the file very quickly - L times.
  • we should valid every line very quickly - N times for nodes + C times for connects
  • efficient data storage:
    • fast access
    • optimal memory using

Optimizations

GNL

  • We can't use standart getline() from libc.
  • Our own ft_get_next_line() (same as getline()) supports simultaneous reading from several FD. For this, them create list of buffers (1 for every fd, with allocated memory), and each time we call them, they looking in this list buffer for particular fd, and work with it (again - with allocated memory).
  • But all we need is a quick read of a single fd.
  • And, obviously, we should be able to quickly extract the next line from the buffer.

So, special for lemin I wrote a simple but fast version of get_next_line() (and put into my libft).

Take a look at ft_fast_input_gnl.h ft_fast_input_gnl.c

Data valid and storage

1. For storage data while reading I use my t_vector - dynamic array from libft:

  • access any element - O(1)
  • add new element to the end - O(1)
  • number of memory allocations - log(n) - log(INIT_SIZE) - where n - count of storaged data.

2. Node storage:

  • Coordinates of nodes are necessary only for visualization. Directly in the program, we only need to check them for validity. However, I don’t see any reason to store them, therefore, after checking for validity (and wright them back, of course (but more about that in output optimizations), I will omit them.
  • Despite the fact that the gnl returns a finished line, it allocates a new piece of memory for each new line, and I really (I mean REALLY) do not want to work with an array of pointers in the future, each of which points to its own, potentially EVEN not cached piece of memory. This can slow me down to 20 times, compared with sequential reading. In addition, we will probably work with short names, and, for example, a name of 5 characters will occupy at least 2 bytes (or even more - it depends on the particular machine). And do not forget about the need to get rid of the coordinates!
  • So, this is my decision:
    • create t_vector chars - stored item - char
    • create t_vector names - stored item - size_t/(char *)
    • get next line
    • check on valid (name + coords; no need to strsplit() ... ; just detect count of words (must be 3); check 2-nd and 3-th word on valid format for atoi())
    • pick out node name (for example from i to j symbols in line)
    • add into names new item with value = chars.curlen
    • add into chars j-i + 1 items - full name + '\0'
    • go to get next line and repeat

by the way - in fact, on each iteration wi will call 1 malloc() for line and then free() it. simple for system's memory management => fast!

  • after read all nodes, iterate over names and turn indexes (size_t) to direct pointers (char *) (they always have same size, but we can store pointer while reading because we work with dynamic struct and it may realocate memory.)
  • Here are the benefits of this approach:
    • names - usual array of pointers. each pointer pointed to 'C'-string. convenient for work
    • access to each name - O(1)
    • all these strings are next to each other in the chars - ideal for caching
    • each line takes exactly as much space as it needs and not a single byte more
    • we can use index of node in names as ID!!! from here on, for indicate to particular node we will use itself index in names. This is extremely convenient!

3. Connect storage:

  • create t_vector connects - stored item - {int;int} (structure with two numbers)
  • We get 2 names. For every, we need match it with particular node. Best way - use nodes ID. So our task - the fastest search for the location of an element name in an array names.
  • Best case - binary search.
  • So, when I read all nodes, I am just sort names using usual qsort() and strcmp(). And for match every node in connections by simplest binary search. Nothing difficult and very quickly.

Why not binary tree or hesh-table?

  1. more difficult memory organisation and not so good for cache.
  2. it have no sense - such structures use, when we need fast search while push and pop data. in our case, we read all data, then work with it. simple array and qsort + binaryfind MUCH faster!

By the way, we also need to verify that all names are unique. After sorting the names it became easy! Just iter over all vector and compare i and i+1 elements!

So, short overview:

  • add new node - O(1)
  • qsort - N * log (N)
  • validation - O(N)
  • add new connect - 2 * log(N)