EDA
Authors: Ágnes Salánki, Attila Klenik
This short syllabus summarizes the basics of a typical data analysis workflow and provides a very high-level overview of: some useful Python packages (e.g., Pandas, SKLearn, Plotly, Bamboolib); Jupyter Notebook for creating self-contained data analysis reports; and the basics of linear regression.
It is strongly recommended that you follow up on the referenced tutorials and guides at the end of this syllabus (especially if you are not familiar with a technology).
Your solution has to meet the following criteria (any deviation must be justified):
- The solution must be implemented and documented in a single Jupyter Notebook file.
- Each task must start with its description taken from this syllabus (so the notebook is self-contained).
- Each plot must be preceded with a short description, containing:
  - The question the plot answers (or the statement the plot "proves").
  - How you acquired the input data for the plot (if it's not trivial, like simply a column of the original data frame).
  - You don't need to explain the code line-by-line, just explain what it does in plain English (e.g., "We grouped the data by processing node type and took the average of the task durations for each node type").
- Each plot must contain (on the figure) the following minimal information:
  - Axis names (including the unit of the axis, and possibly a log-scale indicator).
  - Meaningful legend item names, if multiple data sets are plotted in the same figure.
- Each plot must be followed by a conclusion in 1-3 sentences, interpreting (meaningfully) what we see on the plot, and how that answers the question.
- Each exercise must be at least started, if not completed. If you skip an exercise completely, then your entire lab solution will be rejected.
You must measure the performance of your implementation and "upload" the results in a well-formed .csv file to the BenchmarkDataPool repository. Just before the lab, a collector script will create one `MERGED.csv` data set from every team's file. This will be your input data to analyze.
Without your uploaded benchmark file, you will not be allowed to start the lab.
You must "upload" your CSV through a pull request:
- Fork the BenchmarkDataPool repository.
- Clone the forked repository.
- Place the `<team_name>.csv` file into the `csvs` directory in the repository root.
- Add, commit, and push the changes to your fork.
- Open a pull request on GitHub, and wait for your changes to be merged.
The deadline to open the pull request is 2021.10.24. 12:00 UTC+1 (Sunday noon)!
- You must add/commit/push your solution (a single Jupyter Notebook file) in your own repository, in the `docs` directory: `docs/7-eda.ipynb`
- You must tag your last commit (containing your solution) as `lab7`.
The input files (parts from two Jane Austen novels with varying sizes) are in the `input/texts` directory of the `BenchmarkDataPool` repository.
You must run a comparison on word granularity with a shingle size of 2 for each Sense (Sense and Sensibility) and Pride (Pride and Prejudice) text pair, meaning 36 (6x6) comparisons at the end (`Pride1.txt` and `Sense1.txt`, `Pride1.txt` and `Sense2.txt`, etc.). Please run each comparison independently from the others (i.e., one after another, sequentially) to ensure the computation times are as precise as possible. The processed logs of all 36 runs must go into the single `<team_name>.csv` file.
Use the Java Generics-based solution for comparison!
The format of the output is theoretically the same as what you used for logging, plus the name of your team and a measurement ID as extra columns. So your CSV has to contain 6 columns, separated by commas (`,`), in this order: `type`, `name`, `input_id`, `time`, `team`, `meas_id`.
- `type` is either `QUEUE`, `START`, or `STOP`.
- `name` is the unique name of the processing node in your workflow, adhering to one of the following formats: `Tokenize[1-2]`, `Collect[1-2]`, `ComputeScalar[1-3]`, `ComputeCosine`.
- `input_id` is the name of the document the node is working on. It must be `Sense[1-6]` or `Pride[1-6]` for single-input nodes, and `Pride[1-6]_Sense[1-6]` for multi-input nodes. You must follow the above naming convention for the document pairs (for comparison reasons across multiple teams)!
- `time` is the output of your `System.nanoTime()` call.
- `team` is the name of your team (same for every row in your CSV).
- `meas_id` is the name of the input document pair on which you perform the current measurement: `Pride[1-6]_Sense[1-6]`. Enumerate the 36 combinations in this format (`PrideX_SenseY`), so it's consistent across teams!
Every row in your CSV must be unique if you properly set the above attributes!
Note that you have 8 processing nodes in your workflow: 2 `Tokenize`, 2 `Collect Shingles`, 3 `Compute Scalar Product`, and 1 `Compute Cosine Similarity`. Each node produces 3 CSV entries (queue, start, stop), so you will have 24 CSV entries per document pair. In total, you must have 864 CSV entries for the 36 measurements, plus the header row.
Sense6 and Pride6 contain on the order of 10,000 words. If your solution is not fast enough to produce the results for each pair of documents, please leave the time column empty for the missing runs, but keep the data in every other column.
It is possible that your logging format implementation differs from the above format. In this case, don't change the implementation, but perform some basic log transformation (like find & replace) on your resulting log file.
If you have any questions about the input format, feel free to reach out to Attila Klenik.
A script will validate your CSV file. If you feel uncertain about your submission, you should test it with this script until it returns without an error. Example CSV files can be found in this directory.
If this script fails on your submission then you will not be allowed to start the laboratory.
Requirements to run the script:
- Python 3.X
- The following Python packages (installed through your favorite package manager):
- glob2
- pandas
- numpy
Once you have placed your CSV in the `csvs` directory, run the following command from the root of the `BenchmarkDataPool` repository:
python ./scripts/validate.py ./csvs/
You should see something like the following output (i.e., without any error):
Processing file: ./csvs/team-1.csv
Processing file: ./csvs/team-2.csv
The primary coding language of this laboratory is Python. In order to spend the laboratory time on meaningful analysis tasks, obtaining a basic understanding of Python syntax is highly recommended before the laboratory session. If you have never used the language, please consider gaining some experience. Many good getting-started tutorials are available; an excellent one is this by w3schools.
The gathered data will be stored in Pandas data frames; Pandas is a great library for table-like data manipulation. A short (10-minute) introduction to Pandas is available here. The Pandas library provides elegant (i.e., short, readable) and fast solutions for table manipulation. If you need to write code that iterates through rows, then there's probably a better solution.
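For instance, a minimal sketch of the "group and aggregate" pattern (with hypothetical column names mimicking the lab's logs), instead of a hand-written loop:

```python
import pandas as pd

# Tiny frame with made-up values, mimicking per-task durations.
df = pd.DataFrame({
    'name': ['Tokenize1', 'Tokenize1', 'Collect1', 'Collect1'],
    'duration_ms': [12.5, 14.1, 3.2, 2.9],
})

# One Pandas call instead of iterating over rows:
mean_durations = df.groupby('name')['duration_ms'].mean()
print(mean_durations)
```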
Visualization is a key component in exploratory data analysis. Any type of visualization is allowed in the lab; however, the recommended library is Plotly. Plotly provides an easy-to-use API to produce various types of interactive plots. You can start exploring the different plots on this page. The left sidebar points to many other, more complicated plot types.
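As a minimal sketch (using the sample data set bundled with Plotly Express, not the lab's data):

```python
import plotly.express as px

# Sample data set shipped with Plotly, used here only for illustration.
tips = px.data.tips()

# An interactive histogram of a numeric column, colored by a categorical one.
fig = px.histogram(tips, x='total_bill', color='sex',
                   labels={'total_bill': 'Total bill [USD]'})
fig.show()
```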
The Pandas and Plotly libraries can be a little intimidating at first. Luckily, the Bamboolib library provides an intuitive GUI for Pandas data frames, allowing point-and-click transformations and visualizations. Moreover, the performed actions are also exported as code snippets, so there is no hidden magic. Bamboolib is the recommended approach for newcomers.
Install it with the following code snippet in your first notebook cell:
!pip install --upgrade bamboolib --user
!python -m bamboolib install_nbextensions
For documenting the tasks, students have to create a Jupyter Notebook, which allows mixing (Python) code and (Markdown) documentation (similarly to R Markdown). An introduction to Jupyter Notebook can be found here.
Two lightweight alternatives are available to work with notebooks:
- Start your own local server using Docker.
- Use Google Colab through a browser (requires a Google account, and does not support Bamboolib).
The easiest way to get started is to use the appropriate Docker image to start a local Jupyter Notebook server, and simply connect to it through your favorite browser. The advantage of using such prebuilt environments is that they usually contain a LOT of preinstalled libraries/packages, so we don't have to manage these on our own machine.
Execute the following command from your `docs` directory:
docker run -i -p 8888:8888 -v "$(pwd)":/home/jovyan/notebooks jupyter/scipy-notebook
Note that after running this command, the current working directory (`docs`) will be attached to the running container as the `/home/jovyan/notebooks` directory, so the notebooks saved in that directory will be automatically available on the host. After executing the command, something similar should appear on the output:
Executing the command: jupyter notebook
[I 11:50:17.179 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 11:50:17.540 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 11:50:17.590 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 11:50:17.590 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 11:50:17.602 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 11:50:17.603 NotebookApp] The Jupyter Notebook is running at:
[I 11:50:17.603 NotebookApp] http://(2b34f9d720ea or 127.0.0.1):8888/?token=00989fb56061981545b2caee52a3ea13faf05da78146d5ca
[I 11:50:17.603 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 11:50:17.603 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://(2b34f9d720ea or 127.0.0.1):8888/?token=00989fb56061981545b2caee52a3ea13faf05da78146d5ca
Accordingly, the server can be reached at http://127.0.0.1:8888/?token=00989fb56061981545b2caee52a3ea13faf05da78146d5ca
Another way to launch the server is to create a `docker-compose.yaml` file in the `docs` directory with the following content:
version: '3.0'
services:
jupyter:
image: jupyter/scipy-notebook
ports:
- 8888:8888
volumes:
- ./:/home/jovyan/notebooks
Then simply execute the following command from the `docs` directory:
docker-compose up
Google Colab is an easy way to work with a notebook in the usual collaborative style, like Google Docs. For our purposes, it works identically to a local Jupyter Notebook server (except for the Bamboolib support).
If you use Colab, don't forget to download the notebook at the end, and add it to your own repository!
The goal of data analysis projects is to extract data-based information about an observed system or phenomenon. In an ideal case, at the end of the project, we have a deeper insight into how our system works (what is happening?) and have some hypotheses about the cause-effect relationships (why is it happening this way?).
In our case, the observed system is the implemented workflow; our goal is to analyze its performance, find the possible bottlenecks, and document suggestions for code improvement (regarding both data structures and algorithms).
The analysis usually consists of two steps: exploratory and confirmatory phases.
This phase consists of activities related to the exploration of the system. It answers the most basic questions about the data, e.g., what the marginal distribution of each individual variable looks like, and what the basic relationships between variables are. Since one- and two-dimensional visualization techniques are excellent (i.e., intuitive, fast, reliable) tools for extracting this information, exploratory analysis in practice means plotting and inspecting many graphical plots.
The most frequently used one- and two-dimensional visualization techniques are:
- Histograms and boxplots for one-dimensional numerical variables;
- Barcharts for one-dimensional categorical variables;
- Scatterplots for visualization of two numerical variables;
- Mosaic plots for visualization of two categorical variables;
- Colored/grouped one-dimensional plots (either histograms or boxplots) for visualization of the relationship between a categorical and a numerical variable.
At the end of the EDA phase, we already have some ideas, so-called hypotheses, about the basic phenomena in the system, e.g., that the distribution of the processing time is a mixture of two Gaussians (bimodal). However, to prove (or publish) these, ad-hoc ideas are not enough; we need statistically significant results. This is the main task of the CDA phase. The primary tools here are statistical tests (z-test, chi-square test, etc.).
This laboratory focuses on the exploratory phase, thus, concepts of statistical testing are not covered here.
Data representation can be long or wide.
- Long data frames have as few columns as possible, minimally three: the ID of the object, a feature name (what we have measured, observed, etc.), and the value itself.
- Wide data frames have one ID column and the features appear in separate columns.
Running example: let's run some of the measurements for this laboratory.
The implemented logging mechanism produces data in the long format:
type name input_id time team meas_id
---------------------------------------------------------------------------------------------
0 QUEUE Tokenize1 Pride6 2297449445928422.0 Team-1 Pride6_Sense2
1 START Tokenize1 Pride6 2297449447188090.0 Team-1 Pride6_Sense2
2 QUEUE Tokenize2 Sense2 2297449447927074.0 Team-1 Pride6_Sense2
3 START Tokenize2 Sense2 2297449449799404.0 Team-1 Pride6_Sense2
4 STOP Tokenize2 Sense2 2297449485702116.0 Team-1 Pride6_Sense2
5 STOP Tokenize1 Pride6 2297449605556119.0 Team-1 Pride6_Sense2
Here, our complex object ID is the `(team, meas_id, name, input_id)` tuple (`input_id` could be omitted, since `meas_id` and `name` determine it); the feature name is in column `type`; and the corresponding observed value is the `time`.
The equivalent data in wide format:
team meas_id name input_id QUEUE START STOP
----------------------------------------------------------------------------------------------------------------------------------
0 Team-1 Pride6_Sense2 Tokenize1 Pride6 2297449445928422.0 2297449447188090.0 2297449605556119.0
1 Team-1 Pride6_Sense2 Tokenize2 Sense2 2297449447927074.0 2297449449799404.0 2297449485702116.0
The wide format "promotes" every event of a task to a dedicated column.
The long format is easy to produce (in the form of logs, for example), resulting in small, flexible, and fixed-size data schemas. However, it is difficult to work with, since the data corresponding to a given task is scattered among multiple rows. Wide formats arrange the data around the objects of interest, allowing easy derivation of further attributes (e.g., `processing_time = stop - start`; `queueing_time = start - queue`).
In conclusion: there is no "ideal" data format; it should be adapted to the characteristics of the task and the variables to be emphasized.
Fortunately, Pandas data frames provide a so-called `pivot_table` operation to transform the long data format into the wide format.
For our case, pivoting needs three variable categories:
- index variables: the variables in the long data that will make up the unique key of each row in the wide data (i.e., a single task/node observation of a given team for a given measurement)
- column variable: the variable in the long data, whose values will be converted to columns in the wide data
- value variable: the variable in the long data, whose values will be the values of the newly created columns.
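A sketch of this transformation, assuming `long_df` holds the log data in the long format shown above (loading it is shown later in this syllabus):

```python
# Pivot the long frame into a wide one: each (team, meas_id, name, input_id)
# key becomes a single row, and the QUEUE/START/STOP events become columns.
wide_df = long_df.pivot_table(
    index=['team', 'meas_id', 'name', 'input_id'],
    columns='type',
    values='time',
).reset_index()
```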
Regression is one of the most frequently used machine learning methods; the goal is to find a function that estimates a single numeric target variable we want to predict, based on given input variables, so that the estimate is as good as possible. Assuming that a function family (linear, exponential, etc.) has already been chosen, the main task of regression is to choose the parameters of the function (that identify a single member of the family) so that the function fits the data (estimates/predicts the target variable) as well as possible (minimizing some measure of error).
Note: choosing a function family is not in the scope of regression; it has to be performed another way. In this example, it is going to be a human task, performed via visualization.
Then, what does regression do? It computes:
- the best parametrization minimizing estimation error,
- the residuals for the given data points, and
- the goodness of fit -- a numeric value indicating how well the function works.
The example above is simple enough to give an idea of what we can expect while using the `fit` function of SKLearn. Below is a brief summary of the basic definitions of linear regression.
Assuming a linear relationship between two numeric variables `X` and `Y`, we are searching for the best `(a, b)` parametrization of the line `Y = aX + b`. `X` and `Y` are variables with paired `(x_i, y_i)` data points. After we have computed the coefficients `a` and `b`, we can calculate an estimated `f(x_i)` value for each `x_i` data point.
- Residual -- the difference between the estimated `f(x_i)` and the actual `y_i` value for a single `x_i` data point.
- LSE -- the Least Squares Error, computed as the sum of squared residuals: `LSE = sum((y_i - f(x_i))^2)`.
- R-square value -- a good indicator of how strong the established relationship is: it determines how much of the variance of the target variable can be explained by the variance of the input variables. In the case of linear regression, it is simply the square of the (Pearson) correlation coefficient between Y and the vector containing the predicted values.
Acceptance rules of thumb: the linear model is considered good if
- the R-square value is high enough and
- the distribution of residuals is Gaussian.
A nice overview of using either the SKLearn or Statsmodels library for linear regression problems can be found here.
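For illustration, a minimal, self-contained sketch (on synthetic data, not the lab's measurements) of fitting a line with SKLearn and extracting the three outputs listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a noisy linear relationship y = 3x + 2 + noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)  # SKLearn expects a 2D input matrix
y = 3.0 * x.ravel() + 2.0 + rng.normal(0.0, 1.0, size=100)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

print('a =', model.coef_[0], ', b =', model.intercept_)  # the fitted (a, b) parametrization
print('R-square =', r2_score(y, y_pred))                 # goodness of fit
residuals = y - y_pred                                   # residual for each data point
```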
Throughout the lab, you must work with the `MERGED.csv` file that contains the results of every team.
Pandas can load the CSV directly from GitHub (but you must link to the raw file content), so you always have access to the newest version:
import pandas as pd
long_df = pd.read_csv('https://raw.githubusercontent.com/ftsrg-csi/BenchmarkDataPool/master/MERGED.csv', low_memory=False)
- Transform your long-format input CSV into a wide-format data frame containing only the queueing and processing times of the nodes, in `ms`, and not the output of the `System.nanoTime()` call. (1p)
- Analyze your own processing times first! How does the processing time of the individual task types (e.g., Tokenize, Collect, ...) scale according to the input sizes (note that the size of the input texts increases exponentially)? (2p)
- How volatile are the processing times of the tasks when run with the same inputs in different measurement rounds? (Note that you will run the Tokenize process on Sense1 6 times, for example) (2p)
- How do the executed steps of the workflow contribute to the total execution time? How does the input size influence it? Can you guess the performance bottleneck of your solution based on these data? (4p)
- Compare your execution times with those of other teams. (1p)
- Extra task: fit a linear model that tries to predict the execution time of the workflow (target variable) based on the size of its input (input variable). Fit the model without the measurements having either Sense6 or Pride6 (i.e., big) inputs. Check how your model predicts the execution time for the big inputs. (2p)
- General remark for every exercise: use the "pythonic" and pandas ways to perform your data transformations. You probably don't need a for-loop to iterate over the DF, pandas has built-in functions for most scenarios. This will make your code more efficient and more readable! 👀
- For exercise 1: Since you will work with the wide-format DF throughout the seminar, this step is crucial. If you suspect that your solution might be wrong, ask for help ❗ ❗
- For exercise 2:
  - Since your processing nodes have unique IDs, you need to derive the type of a node from its name (which should be easy if you followed the suggested logging format); see the sketch after this list.
  - The input text sizes are not included in your log by default. But you can process the files easily (defining some measure of size), assemble a DF from them, and merge it with your original data. 📈
- For exercise 4: There are probably overlaps between the parallelly executed processing nodes, so be careful about defining the "total" execution time and identifying the bottleneck. 🤔
- For exercise 5: Comparing the performance of different solutions needs a target metric that will serve as the basis of comparison. It is up to you to select this metric. Different metrics could result in different winners 👑 🥇
- For exercise 6: You will need to derive an end-to-end execution time for a measurement, which is scattered across multiple rows/tasks. Group-by might help ℹ️
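As referenced in the hints for exercise 2, one possible sketch for deriving the node type (assuming the suggested `Tokenize1`-style names and the wide frame from exercise 1):

```python
# Strip the trailing digits from the node name to get its type,
# e.g., 'Tokenize1' -> 'Tokenize', 'ComputeScalar3' -> 'ComputeScalar'.
wide_df['node_type'] = wide_df['name'].str.replace(r'\d+$', '', regex=True)
```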