index.html

<!doctype html>
<html lang="en">

  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Files and Command-Line Arguments in Python</title>

    <meta name="description" content="
    Lecture on files and command-line arguments in Python for ME 599: Software Development for Engineering Research
    ">
    <meta name="author" content="Kyle Niemeyer">

    <meta name="apple-mobile-web-app-capable" content="yes" />
    <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <link rel="stylesheet" href="dist/reset.css">
    <link rel="stylesheet" href="dist/reveal.css">
    <link rel="stylesheet" href="dist/theme/simple.css" id="theme">

    <!-- For syntax highlighting -->
    <link rel="stylesheet" href="plugin/highlight/monokai.css">

    <!-- If the query includes 'print-pdf', include the PDF print sheet -->
    <script>
        if( window.location.search.match( /print-pdf/gi ) ) {
            var link = document.createElement( 'link' );
            link.rel = 'stylesheet';
            link.type = 'text/css';
            link.href = 'css/print/pdf.css';
            document.getElementsByTagName( 'head' )[0].appendChild( link );
        }
    </script>

		<!--[if lt IE 9]>
		<script src="lib/js/html5shiv.js"></script>
		<![endif]-->
	</head>

	<body>

		<div class="reveal">

			<!-- Any section element inside of this container is displayed as a slide -->
			<div class="slides">
                <section>
                    <h2>Files and Command-Line Inputs in Python</h2>
                    <h3>Software Development for Engineering Research</h3>
                    <br/>
                    <h3>Kyle Niemeyer. 20 January 2022</h3>
                    <h3>ME 599, Corvallis, OR</h3>
                </section>
                <section data-markdown>
                    <textarea data-template>
                        ## Today

                        - Working with files in Python, including HDF5
                        - Building nice command-line interfaces using `argparse`
                    </textarea>
                </section>
                <section>
                    <section>
                        <h3>Situations for working with files:</h3>
                        <br>
                        <ul>
                          <li class="fragment">
                              Collaborator emails you raw data, want to look at the results
                          </li>
                          <li class="fragment">
                              You want to email a collaborator data from Python
                          </li>
                          <li class="fragment">
                              You need to use external code that takes input or data file (potentially 100s/1000s of times);
                              want to automate generation of input files from data
                          </li>
                          <li class="fragment">
                              An external program writes out one or more results files, and you want to read and perform analysis
                          </li>
                          <li class="fragment">
                              You want to keep an intermediate calculation for debugging or validation
                          </li>
                        </ul>
                    </section>

                    <section>
                        <h3>Saving or loading data: file handle object</h3>
                        <br>
                        <pre><code data-trim class="python fragment">
                            f = open('data.txt')
                        </code></pre>
                        <p class="fragment">This <strong>does</strong>:
                        <ol class="fragment">
                            <li>Make sure <code>data.txt</code> exists</li>
                            <li>Create new handle to this file</li>
                            <li>Set cursor position <code>pos</code> to start of the file, <code>pos = 0</code></li>
                        </ol>
                        </p>
                        <p class="fragment"> This <strong>does not</strong> read into memory any of the file,
                            write anything to the file, or close the file.
                        </p>
                    </section>
                    <section>
                        <h3>File handle methods</h3>
                        <br>
                        <ul style="font-size:30px">
                            <li>
                                <code>f.read(n=-1)</code>: Reads in n bytes; if n=-1 or not present, read entire rest of file.
                            </li>
                            <li>
                                <code>f.readline()</code>: Read next full line, return string with newline character.
                            </li>
                            <li>
                                <code>f.readlines()</code>: Reads entire rest of file, returns list of strings (with newlines).
                            </li>
                            <li>
                                <code>f.seek(pos)</code>: Move file cursor to specified position.
                            </li>
                            <li>
                                <code>f.tell()</code>: Return current position in file.
                            </li>
                            <li>
                                <code>f.write(s)</code>: Insert string s at current position.
                            </li>
                            <li>
                                <code>f.flush()</code>: Perform all pending write operations.
                            </li>
                            <li>
                                <code>f.close()</code>: Close file (no more reading or writing).
                            </li>
                        </ul>
                    </section>
                    <section>
                        <h3>Example: read in matrix</h3>
                        <br>
                        <p><code>matrix.txt</code> contents:
                        <pre><code data-trim class="text">
                            1,4,15,9
                            0,11,7,3
                            2,8,12,13
                            14,5,10,6
                        </code></pre>
                        </p>
                        <pre><code data-trim class="python fragment">
                            f = open('matrix.txt')
                            matrix = []
                            for line in f.readlines():
                                row = [int(x) for x in line.split(',')]
                                matrix.append(row)
                            f.close()
                        </code></pre>
                        <p class="fragment">Make sure to close the file!</p>
                    </section>
                    <section>
                        <h3>How to do this better?</h3>
                        <br>
                        <p>Use context manager & <code>with</code> to automatically close:</p>
                        <pre><code data-trim class="python fragment">
                            import numpy as np
                            with open('matrix.txt', 'r') as f:
                                lines = f.readlines()

                            matrix = np.array([[int(x) for x in line.split(',')]
                                              for line in lines]
                                              )
                        </code></pre>
                        <p class="fragment"> ... actually, this example can be even easier:</p>
                        <pre><code data-trim class="python fragment">
                            import numpy as np
                            matrix = np.genfromtxt('matrix.txt', delimiter=',')
                        </code></pre>
                    </section>
                    <section>
                        <h3>File modes</h3>
                        <br>
                        <p>When opening a file, by default it opens in read-only mode. Other file modes:</p>
                        <ul style="font-size:30px">
                            <li>
                                <code>'r'</code>: Read-only, no writing. Starts at <code>pos = 0</code>.
                            </li>
                            <li>
                                <code>'w'</code>: Write. Creates file if it doesn't exist (if it does, contents are deleted).
                                Starts at <code>pos = 0</code>.
                            </li>
                            <li>
                                <code>'a'</code>: Append. Opens file for writing but does not delete contents;
                                creates file if it doesn't exist. Starting <code>pos</code> is end of file.
                            </li>
                            <li>
                                <code>'+'</code>: Update. Opens for reading and writing, does not delete current contents.
                                Starts at <code>pos = 0</code>.
                            </li>
                        </ul>
                    </section>
                    <section>
                        <h3>More complicated example: read in matrix and add</h3>
                        <br>
                        <pre><code data-trim class="python fragment">
                            with open('matrix.txt', 'r+') as f:
                                orig = f.read()
                                f.seek(0)
                                f.write('0,0,0,0\n')
                                f.write(orig)
                                f.write('\n1,1,1,1')
                        </code></pre>
                    </section>
                </section>

                <section>
                    <section>
                        <h3>Quick intro to HDF5</h3>
                        <br>
                        <p class="fragment">
                            Basic idea: better to store structured, numerical data in binary formats over
                            plain-text ASCII files. Why? Smaller.
                        </p>
                        <pre><code data-trim class="text fragment">
                            # small ints        # medium ints
                            42   (4 bytes)      123456   (4 bytes)
                            '42' (2 bytes)      '123456' (6 bytes)

                            # near-int floats   # e-notation floats
                            12.34   (8 bytes)   42.424242E+42   (8 bytes)
                            '12.34' (5 bytes)   '42.424242E+42' (13 bytes)
                        </code></pre>
                        <p class="fragment">
                            Also, faster I/O, since binary files save in native format.
                        </p>
                    </section>
                    <section>
                        <h3>HDF5: Hierarchical Data Format</h3>
                        <br>
                        <p class="fragment">
                            HDF5 files store data in binary format.
                        </p>
                        <p class="fragment">
                            HDF5 provides database features like storing many datasets,
                            user-defined metadata, optimized I/O, and ability to query contents.
                        </p>
                        <p class="fragment">
                            HDF5 is a filesystem in a file: provides nested tree structure for datasets.
                        </p>
                    </section>
                    <section>
                        <h3>Using PyTables for HDF5</h3>
                        <br>
                        <p><code>import tables as tb</code></p>
                        <p class="fragment">
                            PyTables provides five basic dataset classes:
                            <ul class="fragment">
                                <li><code>Array</code>: files of the filesystem</li>
                                <li><code>CArray</code>: chunked arrays</li>
                                <li><code>EArray</code>: extendable arrays</li>
                                <li><code>VLArray</code>: variable-length arrays</li>
                                <li><code>Table</code>: structured arrays</li>
                            </ul>
                        </p>
                    </section>
                    <section>
                        <p>Constructs need to be composed of atomic types:</p>
                        <ul class="fragment">
                            <li><code>bool</code>: true or false type</li>
                            <li><code>int</code>: signed integer types</li>
                            <li><code>uint</code>: unsigned integer types</li>
                            <li><code>float</code>: floating-point types</li>
                            <li><code>complex</code>: complex floating-point types</li>
                            <li><code>string</code>: fixed-length raw string type</li>
                        </ul>
                        <p class="fragment">Also: Groups, links, and hidden nodes</p>
                    </section>

                    <section>
                        <h3>Getting started with HDF5 files</h3>
                        <br>
                        <pre><code data-trim class="python fragment">
                            import tables as tb
                            h5file = tb.open_file('/path/to/file', 'a')
                            ...
                        </code></pre>
                        <p class="fragment">... or, better yet:</p>
                        <pre><code data-trim class="python fragment">
                            import tables as tb
                            with tb.open_file('/path/to/file', 'a') as h5file:
                                ...
                        </code></pre>
                    </section>
                    <section>
                        <h3>HDF5 file modes</h3>
                        <br>
                        <ul style="font-size:30px">
                            <li>
                                <code>'r'</code>: Read-only; no data can be modified.
                            </li>
                            <li>
                                <code>'w'</code>: Write; create a new file (delete existing file with that name).
                            </li>
                            <li>
                                <code>'a'</code>: Append; open existing file for reading or writing, or create new file.
                            </li>
                            <li>
                                <code>'r+'</code>: Similar to a, but file must exist.
                            </li>
                        </ul>
                    </section>
                    <section>
                        <p class="fragment">
                            HDF5 files: all nodes stem from root node: <code>/</code> or <code>h5file.root</code>
                        </p>
                        <p class="fragment">
                            <em>Natural naming</em>: can access subnodes as attributes of "parent" nodes, like <code>h5file.root.a_group.some_data</code>
                        </p>
                        <p class="fragment">
                            (all relevant nodes in the tree must have names that are valid Python variable names)
                        </p>
                    </section>
                    <section>
                        <h3>Creating nodes, arrays, and tables</h3>
                        <br>
                        <p class="fragment">
                            Create and access a group on the root node:
                            <pre><code data-trim class="python fragment">
                                h5file.create_group(h5file.root, 'a_group', "My Group")
                                h5file.root.a_group
                            </code></pre>
                        </p>
                        <p class="fragment">
                            Datasets: arrays and tables, each with a corresponding <code>create</code> method:
                        </p>
                        <ul>
                            <li class="fragment">
                                Arrays are fixed size, must be created with data.
                            </li>
                            <li class="fragment">
                                Tables have a set data type but are variable length, so we can append to them after creation.
                            </li>
                        </ul>
                    </section>
                    <section>
                        <h3>Creating nodes, arrays, and tables</h3>
                        <br>
                        <pre><code data-trim class="python fragment" style="font-size:20px">
                            # integer array
                            h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])

                            # tables need descriptions
                            dt = np.dtype([('id', int), ('name', 'S10')])
                            knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
                            h5file.create_table(h5file.root, 'knights', dt)
                            h5file.root.knights.append(knights)
                        </code></pre>
                        <p class="fragment">Hierarchy at this point:</p>
                        <pre><code data-trim class="text fragment" style="font-size:20px">
                            /
                            |-- a_group/
                            |   |-- arther_count
                            |
                            |-- knights
                        </code></pre>
                    </section>
                    <section>
                        <h3>Arrays & tables preserve original flavor</h3>
                        <br>
                        <pre><code data-trim class="python fragment" style="font-size:15px">
                            h5file.root.a_group.arthur_count[:]
                            type(h5file.root.a_group.arthur_count[:])
                            type(h5file.root.a_group.arthur_count)
                        </code></pre>

                        <pre><code data-trim class="python fragment" style="font-size:15px">
                            h5file.root.knights[1] # pull out second row

                            h5file.root.knights[:1] # slice the first row

                            mask = (h5file.root.knights.cols.id[:] < 28)
                            h5file.root.knights[mask] # create mask and apply to table

                            h5file.root.knights[([1, 0],)] #Fancy index the second and first rows, in that order
                        </code></pre>
                        <p class="fragment">
                            Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically
                        </p>
                    </section>

                    <section>
                        <h3>Can create rows and append to tables</h3>
                        <br>
                        <div style="font-size:25px">
                        <pre><code data-trim class="python fragment stretch">
                            table_def = {'time': tables.Float64Col(pos=0),
                                         'temperature': tables.Float64Col(pos=1),
                                         'pressure': tables.Float64Col(pos=2),
                                         }
                            with tables.open_file(filename, mode='w', title="Table title"
                                                  ) as h5file:
                                table = h5file.create_table(where=h5file.root,
                                                            name='simulation',
                                                            description=table_def
                                                            )
                                timestep = table.row
                                # Save initial conditions
                                timestep['time'] = sim.time
                                timestep['temperature'] = sim.temperature
                                timestep['pressure'] = sim.pressure
                                timestep.append()

                                # Main time integration loop
                                while sim.time < time_end:
                                    sim.step() # perform a single integration step

                                    # Save new timestep information
                                    timestep['time'] = sim.time
                                    timestep['temperature'] = sim.temperature
                                    timestep['pressure'] = sim.pressure

                                    # Add ``timestep`` to table
                                    timestep.append()
                                table.flush()
                        </code></pre>
                        </div>
                    </section>

                    <section>
                        <h3>Accessing this table</h3>
                        <br>
                        <pre><code data-trim class="python fragment stretch">
                            with tables.open_file(filename, 'r') as h5file:
                                # Load Table with Group name simulation
                                table = h5file.root.simulation

                                time = table.col('time')
                                pressure = table.col('pressure')
                                temperature = table.col('temperature')
                        </code></pre>
                        <p class="fragment">
                            These are NumPy arrays!
                        </p>
                    </section>

                    <section>
                        <h3>Hierarchy Layout</h3>
                        <br>
                        <pre><code data-trim class="python fragment">
                            # particles:  id, kind,       velocity
                            particles = [(42, 'electron', 72.0),
                                         (43, 'proton', 0.1),
                                         (44, 'electron', 76.8),
                                         (45, 'neutron', 0.39),
                                         (46, 'neutron', 0.72),
                                         (47, 'neutron', 0.55),
                                         (48, 'proton', 0.18),
                                         (49, 'neutron', 0.23),
                                         ...
                                         ]
                        </code></pre>
                        <p class="fragment">
                            If we know we want to look at neutral and charged particles separately,
                            then why search through all the particles all the time?
                        </p>
                    </section>
                    <section>
                        <pre><code data-trim class="python fragment">
                            neutral = [(45, 'neutron',  0.39),
                                       (46, 'neutron',  0.72),
                                       (47, 'neutron',  0.55),
                                       (49, 'neutron',  0.23),
                                       ...
                                       ]

                            charged = [(42, 'electron', 72.0),
                                       (43, 'proton', 0.1),
                                       (44, 'electron', 76.8),
                                       (48, 'proton', 0.18),
                                       ...
                                       ]
                        </code></pre>
                        <p class="fragment">
                            But now <code>kind</code> is redundant in the <code>neutral</code> table.
                            Let's delete this column and rely on the structure of the tables together
                            to dictate which table refers to what.
                        </p>
                    </section>
                    <section>
                        <pre><code data-trim class="python fragment">
                            neutral = [(45, 0.39),
                                       (46, 0.72),
                                       (47, 0.55),
                                       (49, 0.23),
                                       ...
                                       ]

                            charged = [(42, 'electron', 72.0),
                                       (43, 'proton', 0.1),
                                       (44, 'electron', 76.8),
                                       (48, 'proton', 0.18),
                                       ...
                                       ]
                        </code></pre>
                        <p class="fragment">
                            We are embedding information directly into the semantics of the hierarchy.
                        </p>
                    </section>
                    <section>
                        <p class="fragment">
                            We could add another layer that distinguishes the particles based on detector:
                        </p>
                        <pre><code data-trim class="text fragment">
                            /
                            |-- detector1/
                            |   |-- neutral
                            |   |-- charged
                            |
                            |-- detector2/
                            |   |-- neutral
                            |   |-- charged
                        </code></pre>
                        <p class="fragment">
                            Data should be broken up like this to improve access time speeds.
                            This is more efficient because there are: fewer rows to search, fewer rows
                            to pull from disk, and fewer columns in description (decreases size of rows)
                        </p>
                    </section>

                    <section>
                        <h3>More complicated topics: chunking, in-core and out-of-core operations, querying, compression</h3>
                    </section>
                </section>

                <section>
                    <section>
                        <h3>Creating command-line interfaces for Python programs with argparse</h3>
                        <br>
                        <p class="fragment">
                            Rather than only calling your package from another Python code, you can (fairly easily)
                            create a command-line interface.
                        </p>
                        <pre><code data-trim class="bash fragment">
                            myprogram --help

                            myprogram --input inputfile.txt --output output.h5
                        </code></pre>
                        <p class="fragment">
                            You can/should also use these to change inputs or settings to your programs without modifying code.
                        </p>
                    </section>
                    <section>
                        <h3>Simple example</h3>
                        <pre><code data-trim class="python fragment stretch">
                            # contents of example.py
                            # import the necessary packages
                            import sys
                            import argparse

                            # construct the argument parse and parse the arguments
                            parser = argparse.ArgumentParser(
                                description='This is a simple command-line program.'
                                )
                            parser.add_argument('-n', '--name', required=True,
                                                help='name of the user'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            # display a friendly message to the user
                            print("Hi there {}, it's nice to meet you!".format(args.name))
                        </code></pre>
                        <pre><code data-trim class="bash fragment">
                            $ python example.py --help
                            $ python example.py --name Kyle
                        </code></pre>
                    </section>
                    <section>
                        <h3>Optional arguments</h3>
                        <br>
                        <pre><code data-trim class="python fragment">
                            ...
                            parser.add_argument('-c', '--count', action='store_true',
                                                default=False,
                                                help='Count number of characters in name'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            print("Hi there {}, it's nice to meet you!".format(args.name))
                            if args.count:
                                print("Name length: {}".format(len(args.name)))
                        </code></pre>
                        <pre><code data-trim class="bash fragment">
                            $ python example.py --help
                            $ python example.py --name Kyle -c
                        </code></pre>
                    </section>
                    <section>
                        <h3>Optional arguments</h3>
                        <br>
                        <pre><code data-trim class="python fragment">
                            ...
                            parser.add_argument('-a', '--age',
                                                default=25,
                                                type=int,
                                                help='Age of person'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            print("Hi there {}, it's nice to meet you!".format(args.name))
                            if args.count:
                                print("Name length: {}".format(len(args.name)))
                            print("Age: {}".format(args.age))
                        </code></pre>
                        <pre><code data-trim class="bash fragment">
                            $ python example.py --help
                            $ python example.py --name Kyle --age 31
                        </code></pre>
                    </section>
                    <section>
                        <h3>More complex input</h3>
                        <br>
                        <p class="fragment">
                            For larger/more complex input, may be better to use an input file.
                        </p>
                        <p class="fragment">
                            YAML is a good option for this: use pyyaml (<code>import yaml</code>)
                        </p>
                    </section>
                    <section>
                        <p class="fragment"><code>input.yaml</code>:</p>
                        <pre><code data-trim class="yaml fragment">
                            case: square cavity
                            length: 1.0
                            resolution: 0.01
                            velocities:
                              - 1.0
                              - 10.0
                              - 100.0
                        </code></pre>
                        <pre><code data-trim class="python fragment">
                            import yaml
                            with open('input.yaml', 'r') as f:
                                inputs = yaml.safe_load(f)
                        </code></pre>
                        <p class="fragment">Returns a dict</p>
                    </section>
                </section>
                <section>
                    <h3>Takehome messages</h3>
                    <br>
                    <p class="fragment">
                        If you need to store numerical data, use HDF5 instead of plaintext.
                    </p>
                    <p class="fragment">
                        Consider implementing a command-line interface for your program,
                        possibly in combination with more complex YAML configuration/input files.
                    </p>
                </section>

			</div>

		</div>

		<script src="dist/reveal.js"></script>
		<script src="plugin/notes/notes.js"></script>
		<script src="plugin/markdown/markdown.js"></script>
		<script src="plugin/highlight/highlight.js"></script>
		<script>
			// More info about initialization & config:
			// - https://revealjs.com/initialization/
			// - https://revealjs.com/config/
			Reveal.initialize({
				hash: true,

				// Learn about plugins: https://revealjs.com/plugins/
				plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
			});
		</script>

	</body>
</html>