Data exploration and visualization using R

Scripting in R and Jupyter

This repository helps you learn how to create a dataset from end to end and perform some data exploration and visualization on it.

Prerequisites: the story you want to tell

To implement data visualization in R, you should have some interest in the data you use daily, either at work or at university. Before I created this gist, I considered how my data visualization could be of interest to Hadoop professionals on social networks, since I ultimately share my gists with my Twitter and LinkedIn followers. I therefore decided to find available data related to this technology, build a dataset from it, read the dataset using R, perform some analysis and cleaning operations on it, and create a chart that tells a story about the dataset.

How to: make a data exploration and visualization using R

The following steps will help you visualize "the number of nodes in a Hadoop cluster used by major tech companies" (the story that I want to tell). To implement what I did, you may wish to proceed as follows:

follow the steps below to understand the whole process from end to end
or load program.rda in RStudio or in your favorite workbench to check the output

Steps

Running the tests

I am using Ubuntu (18.04 bionic).

Check in your shell that R, which RStudio relies on, is correctly installed:

Check R version

$ R --version

Check Jupyter Notebook version

$ jupyter --version
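The same checks can also be made from inside an R session. The snippet below is a minimal sketch, assuming only base R; it reports the R version and looks for a `jupyter` executable on the PATH. The variable names `r_version` and `jupyter_path` are illustrative and do not come from the repository.

```r
# Version string of the R session that is currently running.
r_version <- R.version.string
cat("R:", r_version, "\n")

# Sys.which() returns the full path of an executable found on the PATH,
# or an empty string when it is not found.
jupyter_path <- Sys.which("jupyter")
if (nzchar(jupyter_path)) {
  cat("Jupyter found at:", jupyter_path, "\n")
} else {
  cat("Jupyter was not found on the PATH (it is optional).\n")
}
```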

You need RStudio installed on your PC to properly use this gist.
Jupyter Notebook is optional. It is another way to run R scripts.
You can also use Jupyter Notebook on a remote site to perform the same operations you would perform in RStudio:

use https://labs.cognitiveclass.ai (create a free account, then click on "JupyterLab" in the Build Analytics section)
use https://dataplatform.ibm.com (recommended for IBM Coders)

Built With

Notepadqq - a text editor for Linux/Unix
RStudio - an integrated development environment for R, a statistical computing environment
ggvis - a package for interactive graphics, including histograms
ggplot2 - a widely used package for plotting in R

This dataset was created using Notepadqq.
The data are organized by company name, number of nodes, country name, server type, server version and position in the table.
Save the dataset as a .csv file and read it using RStudio before you invoke visualization functions.
The data come from various sites. Some of them are listed in Tips.md.
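Reading a .csv file in R could look like the sketch below. It is only an illustration: the rows are made-up placeholders (not the real data), the column names follow the variables named in the Exercise section, and the file name `dataset_example.csv` and the object name `hadoop` are hypothetical.

```r
# Placeholder rows for illustration only: these are NOT the real figures.
sample_rows <- data.frame(
  Id      = 1:3,
  Company = c("CompanyA", "CompanyB", "CompanyC"),
  Nodes   = c(100, 800, 2000),
  Country = c("CountryX", "CountryY", "CountryZ"),
  Server  = c("ServerP", "ServerQ", "ServerR"),
  stringsAsFactors = FALSE
)

# Write the placeholder rows to a temporary .csv so the example is self-contained.
csv_path <- file.path(tempdir(), "dataset_example.csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# Read the .csv back into a data frame, keeping text columns as character.
hadoop <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types and the first rows.
str(hadoop)
head(hadoop)
```

With your own file, replace `csv_path` with the path to the .csv you saved.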

Versioning

I used no versioning system for this gist. Its repository status is flagged as concept because it is intended as a demo or proof of concept (POC).

Author

Isaac Arnault

Licence

All public gists https://gist.github.com/aiPhD
Copyright 2018, Isaac Arnault
MIT License, http://www.opensource.org/licenses/mit-license.php

Exercise

As an IT or Big Data Project Manager, you are asked by the Information System Manager to use a dataset to prepare a presentation on the management of Hadoop clusters around the world. For your presentation, you have decided to include some metrics on the number of nodes used by top Internet companies and to locate, by Internet Protocol address, the servers on which the nodes run. Since some data are available in the public domain (on the Internet), you have decided to use them. This exercise is only one part of the full set of steps you would carry out for your presentation (business understanding, analytic approach, data requirements, data collection, data analysis, data preparation, and modeling). Completing this exercise can be seen as a prerequisite for data analysis in the enterprise.

Create your dataset by using data from this Slideshare

Consider the following range of data while extracting them from the above link: dataset = {2, 21}

Name the variables of your dataset Id, Company, Nodes, Country, Server

Go to Tips.md to find sources where you can find Server name and Country

Assign to each Id a Company, number of Nodes, Country and Server Name

Read your dataset using RStudio or Jupyter

Use Jupyter to perform some exploration of your dataset
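A first exploration pass, in Jupyter or in RStudio, might look like the sketch below. It assumes base R only and uses made-up placeholder rows instead of the real dataset; the object names `hadoop`, `hadoop_sorted` and `nodes_by_country` are illustrative.

```r
# Placeholder rows for illustration only: these are NOT the real figures.
hadoop <- data.frame(
  Id      = 1:5,
  Company = c("CompanyA", "CompanyB", "CompanyC", "CompanyD", "CompanyE"),
  Nodes   = c(100, 800, 2000, 1200, 450),
  Country = c("CountryX", "CountryY", "CountryX", "CountryZ", "CountryY"),
  Server  = c("ServerP", "ServerQ", "ServerP", "ServerR", "ServerQ"),
  stringsAsFactors = FALSE
)

# Structure of the data frame: column names, types and a preview of values.
str(hadoop)

# Five-number summary and mean of the node counts.
summary(hadoop$Nodes)

# How many companies appear per country.
table(hadoop$Country)

# Rows ordered from the largest to the smallest cluster.
hadoop_sorted <- hadoop[order(hadoop$Nodes, decreasing = TRUE), ]
print(hadoop_sorted)

# Total number of nodes per country.
nodes_by_country <- aggregate(Nodes ~ Country, data = hadoop, FUN = sum)
print(nodes_by_country)
```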

Use RStudio to perform some visualisation of your dataset:

Install and load the ggvis and ggplot2 packages from CRAN

Use geom_dotplot function for plotting. Sort the graph by Company per Nodes.
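One possible way to build such a plot is sketched below. It is not the repository's solution: it assumes ggplot2 is already installed, uses made-up placeholder rows, and interprets "sort by Company per Nodes" as ordering the companies by their node count with `reorder()`. The titles and the object name `p` are illustrative.

```r
# Placeholder rows for illustration only: these are NOT the real figures.
hadoop <- data.frame(
  Company = c("CompanyA", "CompanyB", "CompanyC", "CompanyD", "CompanyE"),
  Nodes   = c(100, 800, 2000, 1200, 450),
  stringsAsFactors = FALSE
)

# Turn Company into a factor whose levels follow increasing node count,
# so the axis is sorted by Nodes rather than alphabetically.
hadoop$Company <- reorder(hadoop$Company, hadoop$Nodes)

# Only draw the plot when ggplot2 is available in this R installation.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(hadoop, aes(x = Company, y = Nodes)) +
    # Bin along the y axis (node count) and centre the dots on each company.
    geom_dotplot(binaxis = "y", stackdir = "center") +
    labs(x = "Company", y = "Number of nodes",
         title = "Hadoop cluster size per company (placeholder data)") +
    # Tilt the company labels so that long names stay readable.
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}
```

ggplot2 reports the default bin width in a message; pass `binwidth` to `geom_dotplot()` to set it explicitly.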

Question: How many companies use between 500 and 1500 nodes ({500, 1500})? Name the companies while visualizing the graph.
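The count asked for in the question can be cross-checked against the graph with a simple filter. The sketch below uses base R and made-up placeholder rows, and it assumes that both bounds (500 and 1500) are inclusive, which the exercise does not state. The object names `in_range` and `n_in_range` are illustrative.

```r
# Placeholder rows for illustration only: these are NOT the real figures.
hadoop <- data.frame(
  Company = c("CompanyA", "CompanyB", "CompanyC", "CompanyD", "CompanyE"),
  Nodes   = c(100, 800, 2000, 1200, 450),
  stringsAsFactors = FALSE
)

# Keep the rows whose node count lies in the range (bounds assumed inclusive).
in_range <- hadoop[hadoop$Nodes >= 500 & hadoop$Nodes <= 1500, ]

# Number of companies in the range, and their names.
n_in_range <- nrow(in_range)
cat("Companies in range:", n_in_range, "\n")
print(in_range$Company)
```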

