Server configuration and administration¶
datalab has 3 main configuration sources.
- The Python ServerConfig (described below), which allows for datalab-specific configuration, such as database connection info, filestore locations and remote filesystem configuration.
  - This can be provided via a JSON or YAML config file at the location given by the PYDATALAB_CONFIG_FILE environment variable, or as individual environment variables prefixed with PYDATALAB_. The available configuration variables and their default values are listed below.
- Additional server configuration provided as environment variables, such as secrets like Flask's SECRET_KEY, API keys for external services (e.g., SMTP) and OAuth client credentials (for logging in via GitHub or ORCID).
  - These can be provided as environment variables or in a .env file in the directory from which pydatalab is launched.
- Web app configuration, such as the URL of the relevant datalab API and branding (logo URLs, external homepage links).
  - These are typically provided as a .env file in the directory from which the webapp is built/served.
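As a minimal sketch of the two routes for supplying ServerConfig values, the snippet below writes a JSON config file and derives the equivalent prefixed environment variable names. It assumes that the top-level keys of the config file are the ServerConfig field names; the helper names (write_config_file, as_environment) and the example values are hypothetical and are not part of pydatalab.

```python
import json
import os
import tempfile
from pathlib import Path

# Illustrative values only; the keys are assumed to match ServerConfig field names.
settings = {
    "IDENTIFIER_PREFIX": "grey",
    "SESSION_LIFETIME": 24,
}


def write_config_file(values: dict, directory: Path) -> Path:
    """Write the settings to a JSON file and return its path (hypothetical helper)."""
    path = directory / "config.json"
    path.write_text(json.dumps(values, indent=2))
    return path


def as_environment(values: dict, prefix: str = "PYDATALAB_") -> dict:
    """Express the same settings as prefixed environment variables (hypothetical helper)."""
    return {f"{prefix}{key}": str(value) for key, value in values.items()}


# Route 1: point the server at a config file via PYDATALAB_CONFIG_FILE.
config_path = write_config_file(settings, Path(tempfile.mkdtemp()))
os.environ["PYDATALAB_CONFIG_FILE"] = str(config_path)

# Route 2: provide each setting as its own PYDATALAB_-prefixed variable.
env = as_environment(settings)
print(sorted(env))
```

Either route should be set up before the server process starts, since the configuration is read at launch.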
Mandatory settings¶
There is only one mandatory setting when creating a deployment.
This is the IDENTIFIER_PREFIX
, which shall be prepended to every entry's refcode to enable global uniqueness of datalab entries.
For now, the prefixes themselves are not checked for uniqueness across the fledgling datalab federation, but they will be in the future.
This prefix should be set to something relatively short (max 10 chars.) that describes your group or your deployment, e.g., the PI's surname, project ID or department.
This can be set either via a config file, or as an environment variable (e.g., PYDATALAB_IDENTIFIER_PREFIX='grey'
).
Be warned, if the prefix changes between server launches, all entries will have to be migrated manually to the desired prefix, or maintained at the old prefix.
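The guidance above can be sketched as a small pre-deployment check. This is not the validator that ServerConfig itself uses; check_identifier_prefix and make_refcode are hypothetical helpers that only encode the recommendations in this section (a non-empty prefix of at most 10 characters, prepended to the refcode with a colon as in grey:AAAAAA).

```python
MAX_PREFIX_LENGTH = 10  # recommended maximum length from the text above


def check_identifier_prefix(prefix: str) -> str:
    """Return the prefix unchanged if it follows the recommendations, else raise ValueError."""
    if not prefix:
        raise ValueError("IDENTIFIER_PREFIX must be set for a deployment")
    if len(prefix) > MAX_PREFIX_LENGTH:
        raise ValueError(
            f"IDENTIFIER_PREFIX {prefix!r} is longer than {MAX_PREFIX_LENGTH} characters"
        )
    return prefix


def make_refcode(prefix: str, local_id: str) -> str:
    """Prepend the deployment prefix to an entry's local identifier."""
    return f"{check_identifier_prefix(prefix)}:{local_id}"


print(make_refcode("grey", "AAAAAA"))  # grey:AAAAAA
```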
User registration & authentication¶
datalab has three supported user registration/authentication mechanisms:
- OAuth2 via GitHub accounts that are public members of appropriate GitHub organizations
- OAuth2 via ORCID
- via magic links sent to email addresses
Each is configured differently.
GitHub OAuth2¶
For GitHub, you must register a GitHub OAuth
application for your instance, providing the client ID and secret in the .env
for the API, using the variable names GITHUB_OAUTH_CLIENT_ID
and GITHUB_OAUTH_CLIENT_SECRET
.
These should be provided in a .env
file local to your app and not added to your main config file.
The authorization callback URL in the GitHub app settings should be set to <YOUR_API_URL>/login/github/authorized
.
A user's first login may direct them to this page rather than the web app, depending on their browser.
The user will then simply have to navigate back to the URL of the web app, where they should find themselves logged in.
Then, you can configure GITHUB_ORG_ALLOW_LIST
with a list of string IDs of GitHub organizations that a user must be a public member of to register an account.
If this value is set to None
, then no accounts will be able to register, and if it is set to an empty list, then no restrictions will apply.
You can find the relevant organization IDs using the GitHub API, for example at https://api.github.com/orgs/<org_name>
.
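A minimal sketch of looking up an organization ID for GITHUB_ORG_ALLOW_LIST is shown below, using only the Python standard library. The GitHub API response for an organization contains a numeric id field, which is converted to a string here because the allow list expects string IDs. The function names and the sample payload are illustrative; fetch_org_id needs network access and is therefore not called in the demonstration.

```python
import json
import urllib.request


def org_id_from_payload(payload: str) -> str:
    """Extract the organization ID from a GitHub API response body, as a string."""
    return str(json.loads(payload)["id"])


def fetch_org_id(org_name: str) -> str:
    """Query the public GitHub API for an organization's ID (requires network access)."""
    url = f"https://api.github.com/orgs/{org_name}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return org_id_from_payload(response.read().decode("utf-8"))


# Offline demonstration with a trimmed, made-up response body.
sample = '{"login": "example-org", "id": 123456}'
allow_list = [org_id_from_payload(sample)]
print(allow_list)  # ['123456']
```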
ORCID OAuth2¶
For ORCID integration, each datalab instance must currently register for the ORCID developer program and request new credentials. As such, this may be tricky to support for new instances. We are looking for ways around this in the future.
Email magic links¶
To support sign-in via email magic-links, you must currently provide additional configuration for an authorized SMTP server.
The SMTP server must be configured via the EMAIL_AUTH_SMTP_SETTINGS setting, with expected values MAIL_SERVER, MAIL_USERNAME, MAIL_PASSWORD, MAIL_DEFAULT_SENDER, MAIL_PORT and MAIL_USE_TLS, following the environment variables described in the Flask-Mail documentation.
Third-party options could include SendGrid, which can be configured to use the MAIL_USERNAME 'apikey' with an appropriate API key as the MAIL_PASSWORD, after verifying ownership of the MAIL_DEFAULT_SENDER address via DNS (see the SendGrid documentation for an example configuration).
The email addresses that are allowed to sign up can be restricted by domain/subdomain using the EMAIL_DOMAIN_ALLOW_LIST
setting.
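The effect of EMAIL_DOMAIN_ALLOW_LIST can be illustrated with the sketch below. It follows the semantics given in the Config API Reference (None allows any address, an empty list allows none) and assumes that a listed domain also matches its subdomains; the actual matching rules in pydatalab may differ, and email_allowed is a hypothetical helper rather than a pydatalab function.

```python
from typing import List, Optional


def email_allowed(email: str, allow_list: Optional[List[str]]) -> bool:
    """Decide whether an address may register, under the assumptions stated above."""
    if allow_list is None:
        # No restriction configured: any domain is accepted.
        return True
    domain = email.rsplit("@", 1)[-1].lower()
    for allowed in allow_list:
        allowed = allowed.lower()
        # Accept an exact match or a subdomain of an allowed domain.
        if domain == allowed or domain.endswith("." + allowed):
            return True
    # An empty list falls through to here, so nothing is allowed.
    return False


print(email_allowed("someone@ch.cam.ac.uk", ["cam.ac.uk"]))
```

Matching on a leading dot, rather than a bare suffix, avoids accepting unrelated domains that merely end with the same characters.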
Remote filesystems¶
This package allows you to attach files from remote filesystems to samples and other entries.
These filesystems can be configured in the config file with the REMOTE_FILESYSTEMS
option.
In practice, these options should be set in a centralised deployment.
Currently, there are two mechanisms for accessing remote files:
- You can mount the filesystem locally and provide the path in your datalab config file. For example, for Cambridge Chemistry users, you will have to (connect to the ChemNet VPN and) mount the Grey Group backup servers on your local machine, then define these folders in your config.
- Access over SSH: alternatively, you can set up passwordless ssh access to a machine (e.g., using citadel as a proxy jump), and paths on that remote machine can be configured as separate filesystems. The filesystem metadata will be synced periodically, and any files attached in datalab will be downloaded and stored locally on the pydatalab server (with the local copy being refreshed on access if it is more than 1 hour old).
General Server administration¶
Currently most administration tasks must be handled directly inside the Python API container.
Several helper routines are available as invoke
tasks in tasks.py
in the pydatalab
root folder.
You can list all available tasks by running invoke --list
in the root pydatalab
folder after installing the package with the [dev]
extras.
In the future, many admin tasks (e.g., updating user info, allowing/blocking user accounts, defining subgroups) will be accessible in the web UI.
Importing chemical inventories¶
One such invoke
task implements the ingestion of a ChemInventory chemical inventory into datalab.
It relies on the Excel export feature of ChemInventory and is achieved with invoke admin.import-cheminventory <filename>
.
If a future export is made and reimported, the old entries will be kept and updated, rather than overwritten.
datalab currently has no functionality for chemical inventory management itself; if you wish to support importing from another inventory system, please raise an issue.
Backups¶
datalab provides a way to configure and create snapshot backups of the database and filestore.
The option BACKUP_STRATEGIES
allows you to list strategies for scheduled backups, with their frequency, storage location (can be local or remote) and retention.
These backups are only performed when scheduled externally (e.g., via cron
on the hosting server), or when triggered manually using the invoke admin.create-backup
task.
The simplest way to create a backup is to run invoke admin.create-backup --output-path /tmp/backup.tar.gz
, which will create a compressed backup.
This should be run from the server or container for the API, and will make use of the config to connect to the database and file store.
This approach will not follow any retention strategy.
Alternatively, you can create a backup given the strategy name defined in the server config, using the same task:
invoke admin.create-backup --strategy-name daily-snapshots
This will apply the retention strategy and any copying to remote resources as configured.
When scheduling backups externally, it is recommended you do not use cron
inside the server Docker container.
Instead, you could schedule a job that calls, for example:
# Arguments: <container name> (api), <invoke task name> (admin.create-backup), <configured strategy name> (daily-snapshots)
docker compose exec api pipenv run invoke admin.create-backup --strategy-name daily-snapshots
Care must be taken to schedule this command to run from the correct directory.
In the future, this may be integrated directly into the datalab server using a Python-based scheduler.
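One way to keep the working directory explicit when scheduling externally is a small wrapper script that the scheduler calls instead of the raw command. The sketch below assumes the API service is named api and that tasks are run through pipenv inside the container. The -T flag (no pseudo-TTY) is added because schedulers such as cron do not provide a terminal. The function names and the example path are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_backup_command(strategy_name: str, service: str = "api") -> List[str]:
    """Assemble the docker compose invocation for a configured backup strategy."""
    return [
        "docker", "compose", "exec",
        "-T",  # do not allocate a pseudo-TTY; cron jobs have no terminal
        service,
        "pipenv", "run", "invoke", "admin.create-backup",
        "--strategy-name", strategy_name,
    ]


def run_backup(compose_dir: Path, strategy_name: str) -> None:
    """Run the backup from the directory containing the compose file."""
    # cwd pins the working directory, so the scheduler's own cwd does not matter.
    subprocess.run(build_backup_command(strategy_name), cwd=compose_dir, check=True)


# Print the command rather than running it, as no containers are assumed here.
print(" ".join(build_backup_command("daily-snapshots")))
```

A scheduler entry would then call something like run_backup(Path("/path/to/deployment"), "daily-snapshots"), where the path is a placeholder for the directory holding the compose file.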
Config API Reference¶
pydatalab.config.ServerConfig (BaseSettings)
pydantic-model
¶
A model that provides settings for deploying the API.
SECRET_KEY: str
pydantic-field
¶
The secret key to use for Flask. This value should be changed and/or loaded from an environment variable for production deployments.
MONGO_URI: str
pydantic-field
¶
The URI for the underlying MongoDB.
SESSION_LIFETIME: int
pydantic-field
¶
The lifetime of each authenticated session, in hours.
FILE_DIRECTORY: Union[str, pathlib.Path]
pydantic-field
¶
The path under which to place stored files uploaded to the server.
LOG_FILE: Union[str, pathlib.Path]
pydantic-field
¶
The path to the log file to use for the server and all associated processes (e.g., invoke tasks).
DEBUG: bool
pydantic-field
¶
Whether to enable debug-level logging in the server.
TESTING: bool
pydantic-field
¶
Whether to run the server in testing mode, i.e., without user auth.
IDENTIFIER_PREFIX: str
pydantic-field
¶
The prefix to use for identifiers in this deployment, e.g., 'grey' in grey:AAAAAA.
REFCODE_GENERATOR: Type[pydatalab.models.utils.RefCodeFactory]
pydantic-field
¶
The class to use to generate refcodes.
REMOTE_FILESYSTEMS: List[pydatalab.config.RemoteFilesystem]
pydantic-field
¶
REMOTE_CACHE_MAX_AGE: int
pydantic-field
¶
The maximum age, in minutes, of the remote filesystem cache after which it should be invalidated.
REMOTE_CACHE_MIN_AGE: int
pydantic-field
¶
The minimum age, in minutes, of the remote filesystem cache, below which the cache will not be invalidated if an update is manually requested.
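The interplay between REMOTE_CACHE_MAX_AGE and REMOTE_CACHE_MIN_AGE can be sketched as a single decision function. This is one reading of the two descriptions above, not the server's actual implementation; the function name, the treatment of the boundary values and the example ages are assumptions.

```python
def should_refresh_cache(
    age_minutes: float,
    max_age: int,
    min_age: int,
    manual_request: bool = False,
) -> bool:
    """Decide whether the remote filesystem cache should be invalidated."""
    if age_minutes > max_age:
        # Older than the maximum age: always invalidate.
        return True
    if manual_request:
        # A manual update is honoured only once the cache has reached the minimum age.
        return age_minutes >= min_age
    return False


# Hypothetical settings: maximum age 60 minutes, minimum age 1 minute.
print(should_refresh_cache(90, 60, 1))
print(should_refresh_cache(0.5, 60, 1, manual_request=True))
```

Under this reading, the minimum age acts as a rate limit on manual refreshes, while the maximum age bounds how stale the cached listing can become.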
BEHIND_REVERSE_PROXY: bool
pydantic-field
¶
Whether the Flask app is being deployed behind a reverse proxy. If True
, the reverse proxy middleware described in the Flask docs will be attached to the app.
GITHUB_ORG_ALLOW_LIST: List[str]
pydantic-field
¶
A list of GitHub organization IDs (immutable, and available from https://api.github.com/orgs/<org_name>) or organization names (which can change, so be warned), membership of which is required to register a new datalab account. Setting the value to None will allow any GitHub user to register an account.
DEPLOYMENT_METADATA: DeploymentMetadata
pydantic-field
¶
A dictionary containing metadata to serve at /info
.
EMAIL_DOMAIN_ALLOW_LIST: List[str]
pydantic-field
¶
A list of domains for which users will be able to register accounts if they have a matching email address. Setting the value to None will allow email addresses at any domain to register an account, otherwise the default [] will not allow any email addresses.
EMAIL_AUTH_SMTP_SETTINGS: SMTPSettings
pydantic-field
¶
A dictionary containing SMTP settings for sending emails for account registration.
MAX_CONTENT_LENGTH: int
pydantic-field
¶
Direct mapping to the equivalent Flask setting. In practice, limits the file size that can be uploaded. Defaults to 100 GB to avoid filling the tmp directory of a server.
Warning: this value will overwrite any other values passed to FLASK_MAX_CONTENT_LENGTH
but is included here to clarify
its importance when deploying a datalab instance.
BACKUP_STRATEGIES: dict
pydantic-field
¶
The desired backup configuration.
validate_cache_ages(values)
classmethod
¶
validate_identifier_prefix(v, values)
classmethod
¶
Make sure that the identifier prefix is set and is valid, raising clear error messages if not.
If in testing mode, then set the prefix to 'test' too. The app startup will test for this value and should also warn aggressively if it is unset.
deactivate_backup_strategies_during_testing(values)
classmethod
¶
make_missing_log_directory(v)
classmethod
¶
Make sure that the log directory exists and is writable.
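As a rough illustration of what such a check might involve, the sketch below creates the parent directory of a log file if it is missing and verifies that it is writable. It is a standalone function, not the pydantic validator itself, and ensure_log_directory is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path
from typing import Union


def ensure_log_directory(log_file: Union[str, Path]) -> Path:
    """Create the log file's parent directory if needed and check it is writable."""
    path = Path(log_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not os.access(path.parent, os.W_OK):
        raise PermissionError(f"Log directory {path.parent} is not writable")
    return path


# Demonstrate with a temporary location rather than a real deployment path.
demo = ensure_log_directory(Path(tempfile.mkdtemp()) / "logs" / "pydatalab.log")
print(demo.parent.is_dir())  # True
```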
update(self, mapping)
¶
pydatalab.config.RemoteFilesystem (BaseModel)
pydantic-model
¶
pydatalab.config.SMTPSettings (BaseModel)
pydantic-model
¶
Configuration for specifying SMTP settings for sending emails.
MAIL_SERVER: str
pydantic-field
¶
The SMTP server to use for sending emails.
MAIL_PORT: int
pydantic-field
¶
The port to use for the SMTP server.
MAIL_USERNAME: str
pydantic-field
¶
The username to use for the SMTP server.
MAIL_PASSWORD: str
pydantic-field
¶
The password to use for the SMTP server.
MAIL_USE_TLS: bool
pydantic-field
¶
Whether to use TLS for the SMTP connection.
MAIL_DEFAULT_SENDER: str
pydantic-field
¶
The email address to use as the sender for emails.