Datasets

To see all operations available on datasets:

okdata datasets -h

What is a dataset

Documentation is available on GitHub.

List all datasets

To explore datasets in Okdata you can use the following commands:

okdata datasets ls
okdata datasets ls <datasetid>
okdata datasets ls <datasetid> <versionid> <editionid>

You do not need to log in to start exploring the datasets in Okdata, but depending on the permissions set on each dataset, the lists you see may differ.

Note: For the correct, up-to-date schema definition, please see the metadata-api schema catalogue. The datasets below are for demonstration purposes.

To search for a specific dataset, use the --filter option to list only a matching subset of the available datasets:

okdata datasets ls --filter=<my-filter-string>
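
For example, to narrow the list to datasets matching a given string (statistics here is just an illustrative filter value):

okdata datasets ls --filter=statistics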

Create dataset

Enter okdata datasets create to start the dataset creation wizard. After answering a number of questions, a new dataset is created along with a selected processing pipeline, ready to receive files.

From a configuration file

Datasets can also be created from a configuration file if you need more fine-grained control (this will not set up a pipeline). This method is also suitable if you need to script the dataset creation flow.

File: dataset.json

{
    "title": "My dataset",
    "description": "My dataset description",
    "keywords": ["keyword", "for-indexing"],
    "accessRights": "public",
    "objective": "The objective for this dataset",
    "contactPoint": {
        "name": "Contact Name",
        "email": "contact.name@example.org",
        "phone": "999555111"
    },
    "publisher": "my organization"
}

Create the dataset by referencing the file:

okdata datasets create --file=dataset.json

This will create a dataset with ID my-dataset. The ID is derived from the dataset title. If another dataset already exists with the same ID, a random suffix is appended (e.g. my-dataset-4nf7). There are no restrictions on dataset naming, but it is best practice to use your organization as the first part of the dataset title. For instance, "title": "Origo developer portal statistics" will generate a dataset with ID origo-developer-portal-statistics.

Write down the ID of the dataset. This must be used when creating versions and editions.
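
To verify that the dataset exists, or to look up its metadata later, list it by ID (using my-dataset from the example above):

okdata datasets ls my-dataset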

Parent dataset

If you have several datasets that logically belong together under a parent concept or idea, group them by using the parent_id property of a dataset:

File: dataset_with_parent.json

{
    "title": "Origo statistics developer portal",
    "description": "My dataset description",
    "keywords": ["keyword", "for-indexing"],
    "accessRights": "public",
    "objective": "The objective for this dataset",
    "contactPoint": {
        "name": "Contact Name",
        "email": "contact.name@example.org",
        "phone": "999555111"
    },
    "publisher": "my organization",
    "parent_id": "origo-statistics"
}

This will logically group all statistics together, and you can set permissions on the parent_id to grant access to all child datasets.
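
The child dataset is created the same way as any other dataset, by referencing the configuration file:

okdata datasets create --file=dataset_with_parent.json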

Create version

A version named “1” is created by default for new datasets. Unless you need to create additional versions, you may safely skip the rest of this section.

File: version.json

{
  "version": "2"
}

Create a new dataset version by piping the contents of version.json:

cat version.json | okdata datasets create-version <datasetid>

Or create it by referencing the file:

okdata datasets create-version <datasetid> --file=version.json

Create edition

File: edition.json

{
    "edition": "2019-01-01T12:00:00+01:01",
    "description": "My edition description",
    "startTime": "2019-01-01",
    "endTime": "2019-12-31"
}

Create the dataset version edition by piping the contents of edition.json:

cat edition.json | okdata datasets create-edition <datasetid> <versionid>

Or create it by referencing the file:

okdata datasets create-edition <datasetid> <versionid> --file=edition.json
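
Assuming the dataset ID from the earlier example (my-dataset) and the default version 1, a concrete invocation could look like this:

okdata datasets create-edition my-dataset 1 --file=edition.json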

Upload file to edition

File: /tmp/hello_world.csv

hello, world
world, hello

Upload the file with the cp command to the <datasetid> dataset. Note the ds: prefix for the target dataset.

To upload a file to a specific version and edition:

okdata datasets cp /tmp/hello_world.csv ds:<datasetid>/<versionid>/<editionid>

If you use the special edition ID latest, the file will be uploaded to the latest edition.
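
For instance, with latest taking the place of <editionid> in the dataset URI:

okdata datasets cp /tmp/hello_world.csv ds:<datasetid>/<versionid>/latest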

If no version or edition is provided, a new edition will be created for the latest version automatically:

okdata datasets cp /tmp/hello_world.csv ds:<datasetid>

Or to upload to a new edition of a specific version:

okdata datasets cp /tmp/hello_world.csv ds:<datasetid>/<versionid>

Inspecting the upload status

After uploading a file to a dataset using the okdata datasets cp command, a trace ID is displayed which can be used to track the status of the upload process:

+-------------+----------------------+---------------+-------------+
| Dataset     | Local file           | Upload status | Trace ID    |
+-------------+----------------------+---------------+-------------+
| <datasetid> | /tmp/hello_world.csv | True          | <trace_id>  |
+-------------+----------------------+---------------+-------------+

To see the latest status of the upload, run:

okdata status <trace_id>

Or to see the complete status history of the upload process:

okdata status <trace_id> --history

Passing json to the --format option displays the status in JSON format instead, making the output more suitable for use in scripts. For instance, to continuously poll the upload status until it is finished:

######### Check status for the newly uploaded file #########
uploaded=false
echo "Checking status for uploaded file"
while ! $uploaded; do
  echo "Checking upload status..."
  upload_status=$(okdata status "$trace_id" --format=json)
  uploaded=$(echo "$upload_status" | jq -r '.done')
  sleep 5  # wait a bit between polls
done
echo "Uploaded file is processed and ready to be consumed"

Download file from dataset

The okdata datasets cp command can also be used to download data from a dataset URI:

okdata datasets cp ds:<datasetid>/<versionid>/<editionid> my/target/directory

If no version or edition is provided, the latest version and edition will be used by default (if they exist):

okdata datasets cp ds:<datasetid> my/target/directory

The target directory will be created if it doesn't already exist on the local filesystem. The CLI also supports the use of . to specify the current working directory as the output target.
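
For example, to download the latest version and edition of a dataset into the current working directory:

okdata datasets cp ds:<datasetid> .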

Dataset access

See permissions.

Boilerplate

The process of setting up a full dataset, version, and edition with a properly configured pipeline that will process your data involves a few steps. A boilerplate command is provided to create a full set of files and configurations that set everything up; all you have to do is update a few files with the correct information, and you will be up and running in no time.

Currently there are two pipelines available, csv-to-parquet and data-copy, with more on the way. If you have been given a custom pipeline from Okdata, you can still use the boilerplate functionality; simply replace the generated pipeline.json file with the one you receive from us.

  • data-copy: a pipeline that does not alter the original input data, useful for Excel files, documents, or any other file that your dataset contains

  • csv-to-parquet: generate Parquet files from CSV input files

To create a set of files, run the following command. It will create a directory called my-dataset in the current working directory:

okdata datasets boilerplate my-dataset

The boilerplate command will give an input prompt to gather all the information needed to generate a dataset with a corresponding pipeline.

When running the command (see the output of the boilerplate command), a default file will be uploaded to test the pipeline. To override this, add --file=<file>:

okdata datasets boilerplate my-dataset --file=/tmp/file_to_upload.csv

If you don't need or want to customize the files before running the supplied script, you can skip the prompt, but you then need to supply one of the available pipelines:

okdata datasets boilerplate my-dataset --pipeline=data-copy --prompt=no

The output of the command lists the files you need to update before running the supplied run.sh command.

Best practice

To keep your datasets and processing pipelines structured, it is recommended that you create a directory structure with the following layout:

- my-organization-datasets
  - my-organization-statistics
  - my-organization-insight
  - my-organization-events

and commit this to your source repository (git or other). This will also help in debugging or troubleshooting any issues that might arise, and will make it easy to deploy to production after testing.
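
A minimal sketch of putting the layout above under version control, assuming Git:

cd my-organization-datasets
git init
git add .
git commit -m "Add dataset and pipeline configurations"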

Any output from the run.sh command, or from manually executed commands, should also be piped to a log file so that the generated IDs can be looked up later and used for troubleshooting if needed.
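
For example, assuming the generated run.sh script is executable (the log file name is just a suggestion):

./run.sh 2>&1 | tee run.log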