User guide

Once you’ve installed Reladiff, you can run it from the command-line, or from Python.

How to use from the shell / command-line

The basic syntax for reladiff is:

# Cross-DB diff, using hashes
reladiff  DB1_URI  TABLE1_NAME  DB2_URI  TABLE2_NAME  [OPTIONS]

When both tables belong to the same database, a shorter syntax is available:

# Same-DB diff, using outer join
reladiff  DB_URI  TABLE1_NAME  TABLE2_NAME  [OPTIONS]

DB_URL is either a database URL, or the name of a database definition that is specified in a configuration file. Our database URLs conform to the same format as SQLAlchemy.

We recommend using a configuration file, with the --conf switch, to keep the command simple and manageable.

For a list of example URLs, see list of supported databases.

Note: Because URLs allow many special characters, and may collide with the syntax of your shell, it’s recommended to surround them with quotes.

Options

  • --help - Show help message and exit.

  • -k or --key-columns - Name of the primary key column. If none provided, default is ‘id’. Can be used more than once, for a compound key.

  • -t or --update-column - Name of updated_at/last_updated column

  • -c or --columns - Names of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern like in SQL. Example: -c col% -c another_col -c %foob.r%

  • -l or --limit - Maximum number of differences to find (limits maximum bandwidth and runtime)

  • -s or --stats - Print stats instead of a detailed diff

  • -d or --debug - Print debug info

  • -v or --verbose - Print extra info

  • -i or --interactive - Confirm queries, implies --debug

  • --json - Print JSONL output for machine readability

  • --skip-sort-results - Skip sorting the hashdiff output by key for better performance. Entries with the same key but different column values may not appear adjacent in the output.

  • --min-age - Considers only rows older than specified. Useful for specifying replication lag. Example: --min-age=5min ignores rows from the last 5 minutes. Valid units: d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years

  • --max-age - Considers only rows younger than specified. See --min-age.

  • -j or --threads - Number of worker threads to use per database. Default=1.

  • -w, --where - An additional ‘where’ expression to restrict the search space.

  • --allow-empty-tables - Allows diffing on empty tables. Otherwise, we raise an error.

  • --case-sensitive - Column names are treated as case-sensitive. Otherwise, reladiff corrects their case according to schema.

  • --conf, --run - Specify the run and configuration from a TOML file. (see below)

  • --bisection-threshold - Minimal size of segment to be split. Smaller segments will be downloaded and compared locally.

  • --bisection-factor - Segments per iteration. When set to 2, it performs binary search.

  • -m, --materialize - Materialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use %t in the name to place a timestamp. Example: -m test_mat_%t

  • --assume-unique-key - Skip validating the uniqueness of the key column during joindiff, which is costly in non-cloud dbs. Also, disables support for duplicate rows in hashdiff, offering a small performance gain.

  • --sample-exclusive-rows - Sample several rows that only appear in one of the tables, but not the other. Use with -s.

  • --materialize-all-rows - Materialize every row, even if they are the same, instead of just the differing rows.

  • --table-write-limit - Maximum number of rows to write when creating materialized or sample tables, per thread. Default=1000.

  • -a, --algorithm [auto|joindiff|hashdiff] - Force algorithm choice

How to use with a configuration file

Reladiff lets you load the configuration for a run from a TOML file.

Reasons to use a configuration file:

  • Convenience: Set-up the parameters for diffs that need to run often

  • Easier and more readable: You can define the database connection settings as separate config values, instead of in a single URI.

  • Gives you fine-grained control over the settings switches, without requiring any Python code.

Use --conf to specify that path to the configuration file. reladiff will load the settings from run.default, if it’s defined.

Then you can, optionally, use --run to choose to load the settings of a specific run, and override the settings run.default. (all runs extend run.default, like inheritance).

Finally, CLI switches have the final say, and will override the settings defined by the configuration file, and the current run.

Example TOML file:

# Specify the connection params to the test database.
[database.test_postgresql]
driver = "postgresql"
user = "postgres"
password = "Password1"

# Specify the default run params
[run.default]
update_column = "timestamp"
verbose = true

# Specify params for a run 'test_diff'.
[run.test_diff]
verbose = false
# Source 1 ("left")
1.database = "test_postgresql"                      # Use options from database.test_postgresql
1.table = "rating"
# Source 2 ("right")
2.database = "postgresql://postgres:Password1@/"    # Use URI like in the CLI
2.table = "rating_del1"

In this example, running reladiff --conf myconfig.toml --run test_diff will compare between rating and rating_del1. It will use the timestamp column as the update column, as specified in run.default. However, it won’t be verbose, since that flag is overwritten to false.

Running it with reladiff --conf myconfig.toml --run test_diff -v will set verbose back to true.

How to use from Python

Import the reladiff module, and use the following functions:

  • connect_to_table() to connect to a specific table in the database

  • diff_tables() to diff those tables

Example:

# Optional: Set logging to display the progress of the diff
import logging
logging.basicConfig(level=logging.INFO)

from reladiff import connect_to_table, diff_tables

table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")

sign: Literal['+' | '-']
row: tuple[str, ...]
for sign, row in diff_tables(table1, table2):
    print(sign, row)

To learn more about the different options, read the API reference or run help(diff_tables).

Tips

  • If you are only interested in whether something changed, i.e. a yes/no answer, set --limit 1. Reladiff will return as soon as it finds the first difference.

  • Ensure that you have indexes on the columns you are comparing. Preferably a compound index, if relevant. You can run with --interactive to see an EXPLAIN for the queries.

  • Setting a higher thread count may help performance significantly, depending on the database. For databases that limit concurrency per query, such as PostgreSQL/MySQL, this can improve performance dramatically.

  • A low --bisection-threshold will minimize the amount of network transfer. But if network isn’t an issue, a high --bisection-threshold will make Reladiff run a lot faster.

  • If you run into timeouts for very large tables, try increasing the --bisection-factor.

  • The fewer columns you verify, the faster Reladiff will be. If you’re only interested in additions/deletions, verifying the primary key could be enough. If you have an automatic updated column, it might be enough to capture changes, i.e. comparing all the data isn’t always necessary.