Exploration
While datajudge seeks to tackle the use case of expressing and evaluating tests against data, its fairly generic inner workings allow for using it in a rather explorative workflow as well.
Let's first clarify terminology by exemplifying both scenarios. A person wishing to test data might ask the question
Has the number of rows not grown too much from version 1 of the table to version 2 of the table?
whereas a person wishing to explore the data might ask the question
By how much has the number of rows grown from version 1 to version 2 of the table?
Put differently, a test typically revolves around a binary outcome while an exploration usually doesn't.
In the following we will attempt to illustrate possible usages of datajudge for exploration by looking at three simple examples.
These examples rely on some insight about how most datajudge Constraint s work under
the hood. Importantly, Constraint s typically come with
- a
retrievemethod: this method fetches relevant data from database, given aDataReference - a
get_factual_valuemethod: this is typically a wrapper aroundretrievefor the firstDataReferenceof the givenRequirement/Constraint - a
get_target_valuemethod: this is either a wrapper aroundretrievefor the secondDataReferencein the case of aBetweenRequirementor an echoing of theConstraints key reference value in the case of aWithinRequirement
Moreover, as is the case when using datajudge for testing purposes, these approaches rely on a sqlalchemy engine. The latter is the gateway to the database at hand.
Example 1: Comparing numbers of rows
Assume we have two tables in the same database called table1 and table2. Now we
would like to compare their numbers of rows. Naturally, we would like to retrieve
the respective numbers of rows before we can compare them. For this purpose we create
a BetweenTableRequirement referring to both tables and add a NRowsEquality
Constraint onto it.
import sqlalchemy as sa
from datajudge import BetweenRequirement
engine = sa.create_engine(your_connection_string)
req = BetweenRequirement.from_tables(
db_name,
schema_name,
"table1",
db_name,
schema_name,
"table2",
)
req.add_n_rows_equality_constraint()
n_rows1 = req[0].get_factual_value(engine)
n_rows2 = req[0].get_target_value(engine)
Note that here, we access the first (and only) Constraint that has been added to the
BetweenRequirement by writing req[0]. Requirements are are sequences of
Constraint s, after all.
Once the numbers of rows are retrieved, we can compare them as we wish. For instance, we
could compute the absolute and relative growth (or loss) of numbers of rows from
table1 to table2:
absolute_change = abs(n_rows2 - n_rows1)
relative_change = (absolute_change) / n_rows1 if n_rows1 != 0 else None
Importantly, many datajudge staples, such as Condition s can be used, too. We shall see
this in our next example.
Example 2: Investigating unique values
In this example we will suppose that there is a table called table consisting of
several columns. Two of its columns are supposed to be called col_int and
col_varchar. We are now interested in the unique values in these two columns combined.
Put differently, we are wondering:
Which unique pairs of values in
col_intandcol_varcharhave we encountered?
To add to the mix, we will moreover only be interested in tuples in which col_int has a
value of larger than 10.
As before, we will start off by creating a Requirement. Since we are only dealing with
a single table this time, we will create a WithinRequirement.
import sqlalchemy as sa
from datajudge import WithinRequirement, Condition
engine = sa.create_engine(your_connection_string)
req = requirements.WithinRequirement.from_table(
db_name,
schema_name,
"table",
)
condition = Condition(raw_string="col_int >= 10")
req.add_uniques_equality_constraint(
columns=["col_int", "col_varchar"],
uniques=[], # This is really just a placeholder.
condition=condition,
)
uniques = req[0].get_factual_value(engine)
If one was to investigate this uniques variable further, one could, e.g. see the
following:
([(10, 'hi10'), (11, 'hi11'), (12, 'hi12'), (13, 'hi13'), (14, 'hi14'), (15, 'hi15'), (16, 'hi16'), (17, 'hi17'), (18, 'hi18'), (19, 'hi19')], [1, 100, 12, 1, 7, 8, 1, 1, 1337, 1])
This becomes easier to parse when inspecting the underlying retrieve method of the
UniquesEquality Constraint: the first value of the tuple corresponds to the list
of unique pairs in columns col_int and col_varchar. The second value of the tuple
are the respective counts thereof.
Moreoever, one could manually customize the underlying SQL query. In order to do so, one
can use the fact that retrieve methods typically return an actual result or value
as well as the sqlalchemy selections that led to said result or value. We can use these
selections and compile them to a standard, textual SQL query:
values, selections = req[0].retrieve(engine, constraint.ref)
print(str(selections[0].compile(engine, compile_kwargs={"literal_binds": True}))
In the case from above, this would return the following query:
SELECT
anon_1.col_int,
anon_1.col_varchar,
count(*) AS count_1
FROM
(SELECT
tempdb.dbo.table.col_int AS col_int,
tempdb.dbo.table.col_varchar AS col_varchar
FROM
tempdb.dbo.table WITH (NOLOCK)
WHERE col_int >= 10) AS anon_1
GROUP BY anon_1.col_int, anon_1.col_varchar
Example 3: Comparing column structure
While we often care about value tuples of given columns, i.e. rows, it can also provide meaningful insights to compare the column structure of two tables. In particular, we might want to compare whether columns of one table are a subset or superset of another table. Moreover, for columns present in both tables, we'd like to learn about their respective types.
In order to illustrate such an example, we will again assume that there are two tables
called table1 and table2, irrespective of prior examples.
We can now create a BetweenRequirement for these two tables and use the
ColumnSubset Constraint. As before, we will rely on the get_factual_value
method to retrieve the values of interest for the first table passed to the
BetweenRequirement and the get_target_value method for the second table passed
to the BetweenRequirement.
import sqlalchemy as sa
from datajudge import BetweenRequirement
engine = sa.create_engine(your_connection_string)
req = BetweenRequirement.from_tables(
db_name,
schema_name,
"table1",
db_name,
schema_name,
"table2",
)
req.add_column_subset_constraint()
columns1 = req[0].get_factual_value(engine)
columns2 = req[0].get_target_value(engine)
print(f"Columns present in both: {set(columns1) & set(columns2)}")
print(f"Columns present in only table1: {set(columns1) - set(columns2)}")
print(f"Columns present in only table2: {set(columns2) - set(columns1)}")
This could, for instance result in the following printout:
Columns present in both: {'col_varchar', 'col_int'}
Columns present in only table1: set()
Columns present in only table2: {'col_date'}
Now, we can investigate the types of the columns present in both tables:
for column in set(columns1) & set(columns2):
req.add_column_type_constraint(column1=column, column2=column)
type1 = req[0].get_factual_value(engine)
type2 = req[0].get_target_value(engine)
print(f"Column '{column}' has type '{type1}' in table1 and type '{type2}' in table2.")
Depending on the underlying database management system and data, the output of this could for instance be:
Column 'col_varchar' has type 'varchar' in table1 and type 'varchar' in table2.
Column 'col_int' has type 'integer' in table1 and type 'integer' in table2.