SQL Aggregators & Command Line Interfaces

Video

Motivation

Often, you want to get reliable, well-founded conclusions with a minimum of effort.

Today's tools are separate, but both help serve that common goal. First, we will talk about SQL Aggregate Functions - ways to run statistical commands from within SQL.

Then, we are going to talk about Python command-line arguments and how to invoke Python scripts/modules globally.

SQL Aggregate Functions

SQL - and by extension - SQLite define a variety of aggregate functions:

Function Meaning
count(*) Number of rows
count(X) Number of non-null rows in column(s) X
group_concat(X,delimiter) String conversion+concatenation, default delimiter ,
max(X) Maximum Value
min(X) Minimum Value
sum(X) Sum
total(X) Naive Floating Point Summation
avg(X) Mean

Note that there are not medians or modes, but you are able to construct them out of the functions that do exist.

Examples

Say we had some data. We'll be using python sqlite for this.

We can use numpy.random to simulate some data:

import sqlite3
import pandas
import numpy
gen = numpy.random.default_rng()

data = numpy.column_stack([gen.integers(1,5,100),
                    gen.integers(0,100,100)])                
df = pandas.DataFrame(data,columns=["type","value"])

with sqlite3.connect("statisticalexample.sqlite") as con:
    df.to_sql("ExampleData",con,if_exists="replace",index=False)

Let's play around with it a bit.

We can select the data out, for use in Python:

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("SELECT type,value FROM ExampleData")
    for row in cur:
        print(row)

Or, if we are just interested in the maximum value -

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("SELECT max(value) FROM ExampleData")
    for row in cur:
        print(row)

We can even do arithmetic on columns, and use the generated columns:

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("SELECT max(value)-min(value)+1 FROM ExampleData")
    for row in cur:
        print(row)

Group By

The GROUP BY keyword allows us to create tables of aggregates, instead of single aggregate values. This is very useful for getting summary data out of join tables!

Examples

With GROUP BY, we can find out how many values there are by type:

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("""
        SELECT type, count(value)
        FROM ExampleData
        GROUP BY type
        ORDER BY value
        """)
    for row in cur:
        print(row)

Or with the DISTINCT keyword, how many of them are repeats:

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("""
        SELECT type, count(value)-count(DISTINCT value)
        FROM ExampleData
        GROUP BY type
        """)
    for row in cur:
        print(row)

ORDER BY and LIMIT

Let's say you want more than the max - perhaps you want the top 5.

For that, there are the ORDER BY and LIMIT keywords - often used together.

ORDER BY sorts by columns - ascending or descending.

LIMIT - like the bash command head - cuts off the first n columns.

Examples

With ORDER BY, we can now get more than just the max, and do operations on multiple rows:

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("""SELECT type, value
      FROM ExampleData
      ORDER BY value DESC,type ASC
      LIMIT 5""")
    for row in cur:
        print(row)

Order By comes in handy when preparing tables for use in group_concat():

with sqlite3.connect("statisticalexample.sqlite") as con:
    cur = con.cursor()
    cur.execute("""
      SELECT type,group_concat(value) FROM
      (
        SELECT type,value
        FROM ExampleData
        ORDER BY value ASC
      )
      GROUP BY type
      ORDER BY type DESC
      """)
    for row in cur:
        print(row)

Python Command-Line Interfaces

__main__ Execution

We have seen - when writing and running our tests - that the "main" file Python executes is denoted with the __name__ property set to __main__.

For simple files that we only want to import from, this is an excellent place to put testing code. However, there are other ways to use this block - and other places we can put tests.

Testing an Executable File

The tests are actually the easy part - with a similar syntax, from a python file (for example tests.py in your directory) you can run the command:

doctest.testfile(filename)

Or to test an imported module,

doctest.testmod(module_name)

Now we are free to use the __main__ block to do execution.

For now, let's use a simple file executable.py:

#!/usr/bin/env python3

def function_to_execute():
    return "Executed Function!"

if __name__=="__main__":
    print(function_to_execute())

Argument Parsing with sys.argv

Let's try passing some extra information in.

The arguments are stored in the list sys.argv. Update our example file:

#!/usr/bin/env python3
import sys

def function_to_execute():
    return "Executed Function!"

if __name__=="__main__":
    print(sys.argv)
    print(function_to_execute())

Now try it:

python executable.py a "b c" -d e

We can see that sys.argv is a list of all the values, separated by and escaped with quotation marks - basically, bash arguments. However, it doesn't notice flags, and checking for more than one or two is complicated.

For making more complex command-line tools, we turn to:

Advanced Argument Parsing with argparse

PEP 389 lays out the current standard module argparse for handling command-line arguments and options.

The ArgumentParser Object

argparse defines an Argument Parser which can:

  • Load your command-line arguments from sys.args
  • Organize them by position or flag
  • Construct help for your arguments

You can parse options a few ways, but an easy way is with add_argument:

if __name__=="__main__": 
    import argparse
    parser = argparse.ArgumentParser(description="Argparse Example")
    parser.add_argument('positional',
        type=str,
        help="Single positional argument")
    parser.add_argument('many_positionals',
        nargs='*',
        type=int,
        help="Positional arguments")
    parser.add_argument('--flag',
        action='store_true',
        default=False,
        dest="flagkey",
        help="A flag with no arguments")
    parser.add_argument('--optional',
        nargs=1,
        action="store",
        default=False,
        help="An optional parameter, default False")
    args = parser.parse_args()
    print(args)
    print("Parsed Input")

One of the really great benefits to this method - beyond all the different parsing orders - is the automatically-generated --help flag:

python executable.py --help

Which builds a bash-style help function!

(This can also be a good place to hide your tests - run them and exit if a --test flag is passed.)

Arguments are stored as attributes of the result of parse_args(). We can get them out with:

function_to_call(args.many_positionals)

Or unpack them to keyword arugments with the builtin vars:

function_to_call(**vars(args))

python -m Flag

Sometimes you want to create a script which can be used anywhere on your computer with ease. A popular example - which you may have seen me using in this course - is:

python -m http.server

Which executes the http.server module, running a local webserver in the current directory.

By creating your own modules, and adding them to the python path, you can create short scripts you can invoke which can do fairly complex tasks. (Scripting like this is also possible in bash and cmd, but both are surprisingly technical.)

This means that you can execute files - or module directories - from anywhere using the short module reference name, instead of the long filepath!

To get your Python path, remember, you can do:

import sys
sys.path

a good place for your custom modules is site-packages - or if they are platform-dependent, in a platform-specific subdirectory.

If you want to be really clean, you can edit your python path to include a directory such as ~/lib/python, but that is outside the scope of this course.

Directory modules and __main__.py

To define the execution behavior of a single-file module, it is identical to the execution of a local file. In fact, the implementation is the same; simply place your execution-specific code in an:

if __name__=="__main__":
    #...

block, and you will be fine and dandy.

For a multi-file architecture, however, you will need to place in your module:

module/
    __init__.py
    __main__.py
    ...

Any code you put in your __main__.py will be executed when the module is called, but not imported - essentially a whole file block like if __name__=="__main__".

Worksheet

Today's project, which spans both Friday and Today, continues; you will now create a Python frontend for a Gradebook.

Department of Mathematics, Purdue University
150 N. University Street, West Lafayette, IN 47907-2067
Phone: (765) 494-1901 - FAX: (765) 494-0548
Contact the Webmaster for technical and content concerns about this webpage.
Copyright© 2018, Purdue University, all rights reserved.
West Lafayette, IN 47907 USA, 765-494-4600
An equal access/equal opportunity university
Accessibility issues? Contact the Web Editor (webeditor@math.purdue.edu).