Merge pull request #26 from UMM-CSci-3601-S17/UpdateExcelParserDocs
Update excel parser docs
leonidscott authored May 4, 2017
2 parents db44309 + 41f15cc commit ab57177
Showing 4 changed files with 40 additions and 21 deletions.
22 changes: 16 additions & 6 deletions Documentation/ExcelFileRequirements.md
@@ -5,20 +5,30 @@ One of our priorities is to provide a modular future proof system for inputing a
Our code takes in a simple Excel file with a `.xlsx` file ending.

## What our code needs from the Excel File
We allow any spreadsheet provided it contains these properties.

![ExampleSpreadSheet](Graphics/SpreadSheetRequirements.png)

### Key Row
Our system does not search for specific keys (categories); instead, it adds every category and its values to our database. The red box in the figure above marks a region of the spreadsheet we refer to as the *key rows*. These are rows *2* through *4* in the spreadsheet. Any text in these rows will be interpreted as a category; a column may have text in just one of the key rows or in all three. For our code to parse your file correctly, all of the categories must appear, in some form, in these rows. Our system does not, however, limit how many categories there can be.

### First Row
None of the information from the first row will be read into our system.

### First Column
One of the few assumptions we make about the format of the spreadsheet is that the first column has a value for every row in the spreadsheet. Beyond that, the actual contents of the first column do not affect how the file will be parsed.

### Ignoring Rows
The Accession List's primary use is keeping track of flowers throughout the WCROC staff, and there may be times when a flower exists in the Accession List that you don't want on the website. For this reason, we will ignore a flower if either of two conditions holds:
1. If the garden location is blank, the flower doesn't belong to a particular bed, so we will ignore it.
2. You may use a column in your Excel spreadsheet named `Not Included`. If there is a row you would like our system to ignore, put an **x** in the `Not Included` column of that row.

The following picture shows an example. The `""` indicates an empty cell, and any section in red is being ignored.

![ignored row](Graphics/ignoreRow.png)

### Text Styling
The tools we use to read the Excel file do not give our code information about text styling. For this reason, we cannot change how our system parses a file based on text styling.

**Beyond that, the spreadsheet is yours!!!**
37 changes: 22 additions & 15 deletions Documentation/ExcelParser.md
@@ -1,19 +1,19 @@
# ExcelParser.java
`ExcelParser` takes a flexible approach to parsing `.xlsx` files: it does not assume a fixed column layout. The customer can submit an `.xlsx` file in the format they prefer, and can insert, delete, and rearrange its content with a great degree of freedom while still getting a reliably populated database. Here is a page explaining what `ExcelParser` needs from an `.xlsx` file to populate the database.

This Java class is responsible for loading the contents of an Excel file (`.xlsx`)
into our MongoDB database. This documentation walks through how and why we implemented this class the way we did.

### Setup:
In order to use this parser, you will need to set up **Apache POI** in your project. To do this, edit your server-level `build.gradle` file. First, add this line to your `dependencies` block.

```gradle
compile 'org.apache.poi:poi-ooxml:3.15'
```
You will then need to refresh Gradle. In **IntelliJ IDEA**, open the *Gradle window* and press the blue refresh button in the top left, and **BAM YOU WIN**.

Our constructor takes a boolean, `test`.
This boolean switches the Excel file to our test spreadsheet and populates the database from it, so that our `ExcelParser` tests know what the outputs should be.

## Step 1: Extracting data from the xlsx document into a 2D Array
In our main method, the first thing we do is call `extractFromXLSX()`.
@@ -34,7 +34,7 @@ Because most of our 2D array is null at this point, we horizontally collapse the
We could have collapsed both vertically and horizontally at the same time, but to keep the code easy to read and write, we opted to do each of these steps individually. There are two steps involved in this process: locating the column at which to collapse the array, and actually collapsing the array.

### Locating the collapse point: `collapseHorizontally()`
In the example `xlsx` file below, there are three rows that are grayed out. We designate these three rows (rows 1 through 3) as *key rows*. When collapsing horizontally, we start at row one in the rightmost column of our 2D array. We check whether any of the three key-row cells in that column is not null. If all three are null, we shift one column to the left and repeat. We keep doing this until we reach a cell that is not null.
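
The right-to-left scan described above can be sketched roughly as follows. This is only an illustration, not the project's actual `collapseHorizontally()`: the class name, the method name, and the choice to return a column *count* (rather than an index) are assumptions, and it presumes a rectangular array with at least four rows.

```java
final class HorizontalCollapseSketch {
    // Assumed layout: the key rows sit at array indices 1 through 3.
    static final int FIRST_KEY_ROW = 1;
    static final int LAST_KEY_ROW = 3;

    /**
     * Returns how many columns to keep: one past the rightmost column that
     * has a non-null cell in any key row, or 0 if the key rows are empty.
     */
    static int findHorizontalCollapsePoint(String[][] cells) {
        int width = cells[0].length;
        for (int col = width - 1; col >= 0; col--) {
            for (int row = FIRST_KEY_ROW; row <= LAST_KEY_ROW; row++) {
                if (cells[row][col] != null) {
                    return col + 1;
                }
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[][] cells = {
            {"Title", null, null, null},
            {"#", "Common", null, null},
            {null, "Name", null, null},
            {null, null, null, null},
            {"1", "Rose", null, null},
        };
        System.out.println(findHorizontalCollapsePoint(cells)); // prints 2
    }
}
```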

### Collapsing the array: `trimArrayHorizontally()`
This method starts where `collapseHorizontally()` leaves off. Because there is no built-in method to trim a 2D array, we built one! It simply makes a new 2D array of the size specified by `collapseHorizontally()`, copies the old array into the new one, and returns it.
@@ -44,25 +44,25 @@ This method starts where `collapseHorizontally()` leaves off. Because there is n
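
A minimal sketch of such a trim, assuming the new width is passed in as a parameter (the class name, method name, and signature here are hypothetical, not the real `trimArrayHorizontally()`):

```java
final class TrimHorizontallySketch {
    /**
     * Copies the first newWidth cells of every row into a fresh array.
     * Assumes every row has at least newWidth cells.
     */
    static String[][] trimHorizontally(String[][] cells, int newWidth) {
        String[][] trimmed = new String[cells.length][newWidth];
        for (int row = 0; row < cells.length; row++) {
            System.arraycopy(cells[row], 0, trimmed[row], 0, newWidth);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String[][] cells = {
            {"#", "Common Name", null, null},
            {"1", "Rose", null, null},
        };
        String[][] trimmed = trimHorizontally(cells, 2);
        System.out.println(trimmed[0].length); // prints 2
    }
}
```

For a single row, `java.util.Arrays.copyOf(row, newWidth)` does the same job; the explicit loop is used here only to mirror the description above.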

## Step 3: Vertically Collapse the Array
Vertically collapsing the array is much easier than collapsing it horizontally. We still use two steps to do this process.
### Locating the collapse point: `collapseVertically()`
Our assumption for finding the vertical collapse point is that `column 0` is consistently populated for every row we care about. To find the collapse point, we simply iterate on `column 0` from the bottom of the array until we find a non-null cell. We select this index as the collapse point.
### Collapsing the array: `trimArrayVertically()`
Once we know our collapse point we use `trimArrayVertically()` in a similar fashion to `trimArrayHorizontally()`.
We make a new 2D array as tall as `collapseVertically()` specifies and copy the old elements into it.

![VerticalCollapse](Graphics/VerticalCorrected.png)
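
Both vertical steps might look something like the sketch below. The names and signatures are assumptions for illustration, and the collapse point is returned as a row *count* (last populated index plus one) so it can be used directly as the new height.

```java
final class VerticalCollapseSketch {
    /** Scans column 0 from the bottom and returns the number of rows to keep. */
    static int findVerticalCollapsePoint(String[][] cells) {
        for (int row = cells.length - 1; row >= 0; row--) {
            if (cells[row][0] != null) {
                return row + 1;
            }
        }
        return 0;
    }

    /** Copies the first newHeight rows into a new, shorter array. */
    static String[][] trimVertically(String[][] cells, int newHeight) {
        String[][] trimmed = new String[newHeight][];
        for (int row = 0; row < newHeight; row++) {
            trimmed[row] = cells[row].clone();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String[][] cells = {
            {"#", "Common Name"},
            {"1", "Rose"},
            {"2", null},
            {null, null},
            {null, null},
        };
        int height = findVerticalCollapsePoint(cells);
        System.out.println(height);                               // prints 3
        System.out.println(trimVertically(cells, height).length); // prints 3
    }
}
```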


## Step 4: Replace Nulls with Empty Strings: `replaceNulls()`
The method simply iterates through our 2D array and replaces all nulls with empty strings.
This prevents any null pointer exceptions in the future.

![ReplaceNulls](Graphics/ReplaceNulls.png)
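
As a rough sketch of this step (the class name and the choice to modify the array in place are assumptions, not necessarily how `replaceNulls()` is written):

```java
final class ReplaceNullsSketch {
    /** Replaces every null cell with an empty string, in place. */
    static void replaceNulls(String[][] cells) {
        for (String[] row : cells) {
            for (int col = 0; col < row.length; col++) {
                if (row[col] == null) {
                    row[col] = "";
                }
            }
        }
    }

    public static void main(String[] args) {
        String[][] cells = {
            {"1", null},
            {null, "Rose"},
        };
        replaceNulls(cells);
        System.out.println(cells[0][1].isEmpty()); // prints true
    }
}
```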

## Step 5: Using *Key Rows* to generate Keys for Mongo Collections
To do this, we use the `getKeys()` method. This method accomplishes two things:
1. It dynamically makes a `String[]` of keys.
2. It filters the keys to match terms defined by the standards committee (e.g., `#` becomes `id`) and to avoid breaking things.

In order to make a `String[]` of keys, we iterate, column by column, through our key rows. For every column, we concatenate all the strings from each cell. In the table below, the key for that column would be `Common Name`; in the following table, the key would be `HB=Hang BasketC=ContainerW=Wall`.

@@ -77,10 +77,17 @@ In order to make a `String[]` of keys, we iterate, column by column, through our
|W=Wall |

`HB=Hang BasketC=ContainerW=Wall` is not a great key for users or programmers.
There are some points in our project where passing this around can break things. For this reason we filter the keys.
We remove spaces and equals signs. This is also a good opportunity to make our keys match what is specified by the standards committee. We change keys like `#` to `id`, and `Common Name` to `commonName`.
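
The concatenate-then-filter idea can be sketched as below. This is a simplified stand-in for `getKeys()`, not a copy of it: the helper names are hypothetical, only two of the renames are shown, and it assumes the key rows sit at indices 1 through 3 and that nulls have already been replaced with empty strings.

```java
final class KeySketch {
    /** Builds one key per column by concatenating the key-row cells. */
    static String[] buildKeys(String[][] cells) {
        int width = cells[0].length;
        String[] keys = new String[width];
        for (int col = 0; col < width; col++) {
            StringBuilder raw = new StringBuilder();
            for (int row = 1; row <= 3; row++) {
                raw.append(cells[row][col]);
            }
            keys[col] = normalize(raw.toString());
        }
        return keys;
    }

    /** Applies the standard renames, then strips spaces and equals signs. */
    static String normalize(String raw) {
        if (raw.equals("#")) return "id";
        if (raw.equals("Common Name")) return "commonName";
        return raw.replace(" ", "").replace("=", "");
    }

    public static void main(String[] args) {
        String[][] cells = {
            {"Title", "", ""},
            {"#", "", "HB=Hang Basket"},
            {"", "Common Name", "C=Container"},
            {"", "", "W=Wall"},
        };
        // prints id, commonName, HBHangBasketCContainerWWall
        System.out.println(String.join(", ", buildKeys(cells)));
    }
}
```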

## Step 6: Populating the Database: `populateDatabase()`
This method starts at `row 4` (the first row after the key rows). It works by moving from left to right across each row, and at every cell, adding that cell's value as the value of a hashmap with its corresponding key as the key. The hashmap will be added into a document that gets added directly into the database. After this, the method moves to the next row, and repeats until it is at the bottom of the array.

It is also at this stage that we ignore certain rows that the customer would like ignored. On each row we check whether:
* The `gardenLocation` value is empty
* There is an `x` in a column called *not included*

If either of these is the case, such as in the following picture, we will not include that row in the database.

![not included](Graphics/ignoreRow.png)
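
To make the row loop and the two ignore checks concrete, here is a sketch that builds plain maps instead of writing to MongoDB. Everything in it is illustrative: the class and method names are hypothetical, the database write is left out, and treating `x` case-insensitively is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PopulateSketch {
    // First row after the key rows, as described above.
    static final int FIRST_DATA_ROW = 4;

    /** Turns each data row into a key/value map, skipping ignored rows. */
    static List<Map<String, String>> buildDocuments(String[][] cells, String[] keys) {
        List<Map<String, String>> documents = new ArrayList<>();
        for (int row = FIRST_DATA_ROW; row < cells.length; row++) {
            Map<String, String> doc = new LinkedHashMap<>();
            for (int col = 0; col < keys.length; col++) {
                doc.put(keys[col], cells[row][col]);
            }
            if (!shouldIgnore(doc)) {
                documents.add(doc);
            }
        }
        return documents;
    }

    /** A row is ignored if it has no garden location or is marked with an x. */
    static boolean shouldIgnore(Map<String, String> doc) {
        boolean noLocation = doc.getOrDefault("gardenLocation", "").isEmpty();
        // Assumption: accept "x" or "X", ignoring surrounding whitespace.
        boolean excluded = doc.getOrDefault("notIncluded", "").trim().equalsIgnoreCase("x");
        return noLocation || excluded;
    }

    public static void main(String[] args) {
        String[] keys = {"id", "gardenLocation", "notIncluded"};
        String[][] cells = {
            {"Title", "", ""},
            {"#", "Garden Location", "Not Included"},
            {"", "", ""},
            {"", "", ""},
            {"1", "5", ""},   // kept
            {"2", "", ""},    // ignored: blank garden location
            {"3", "7", "x"},  // ignored: marked as not included
        };
        System.out.println(buildDocuments(cells, keys).size()); // prints 1
    }
}
```

Note that in this sketch a spreadsheet with no `gardenLocation` column at all would have every row ignored, since the missing value is treated as blank.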

**And that! Is how you turn any XLSX document into a populated database!**
Binary file added Documentation/Graphics/ignoreRow.png
@@ -179,6 +179,8 @@ public static String[] getKeys(String[][] cellValues){
if(keys[i].equals("Cultivar")) keys[i] = "cultivar";
if(keys[i].equals("Source")) keys[i] = "source";
if(keys[i].equals("Garden Location")) keys[i] = "gardenLocation";
if(keys[i].equals("Not Included")) keys[i] = "notIncluded";
if(keys[i].equals("not included")) keys[i] = "notIncluded";
if(keys[i].contains(" ")) keys[i] = keys[i].replace(" ","");
if(keys[i].contains("=")) keys[i] = keys[i].replace("=", "");
//if(keys[i].contains((UTF16.valueOf(0x00AE)))) keys[i].replaceAll(UTF16.valueOf(0x00AE), "");