B2D Build Process

The Build process performs the following:

  • Checks the connection to the external data source(s).
  • Checks the connection to the S3 bucket/storage, depending on what is configured.
  • Checks the connection to the existing warehouse/database in the given destination database.
  • Instantiates an EC Build Pod that is used during the process.
  • Creates the schema in the destination database, based entirely on the data model.
  • Queries the base tables for data and creates data files (in Parquet/CSV format, depending on the configuration).
  • Uploads the data files to the given bucket/storage.
  • Executes the SQL statements that insert the data into the base tables.
  • Performs custom element updates.
  • Shuts down the EC BLD Pod.
  • Activates the cube as a 'Live' setup, configured to query directly from the destination database.
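
For orientation, below is a minimal sketch, not Sisense's implementation, of how these stages might be sequenced by a build orchestrator; every stage function here is a hypothetical stub.

  from typing import Callable, List, Tuple

  def run_b2d_build(stages: List[Tuple[str, Callable[[], None]]]) -> None:
      # Run each B2D stage in order, logging its name; a real orchestrator would
      # also handle cancellation and cleanup of temporary files (see the FAQ below).
      for name, stage in stages:
          print(f"[b2d] {name}")
          stage()

  run_b2d_build([
      ("check source connection(s)",           lambda: None),
      ("check S3/bucket connection",           lambda: None),
      ("check destination warehouse/database", lambda: None),
      ("start EC Build Pod",                   lambda: None),
      ("create destination schema",            lambda: None),
      ("export base tables to Parquet/CSV",    lambda: None),
      ("upload data files to bucket/storage",  lambda: None),
      ("import files into base tables (SQL)",  lambda: None),
      ("apply custom element updates",         lambda: None),
      ("shut down EC BLD Pod",                 lambda: None),
      ("activate cube as 'Live'",              lambda: None),
  ])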

FAQ


Is there still an EC BLD Pod that is launched during the build process?

Yes, however it is a watered-down version of the regular EC BLD and does not contain a MonetDB instance. As a result, this Pod requires significantly fewer resources during build operations.

The Pod also contains and runs the build connector for the given destination database.

Can I monitor the uploading of the output data files to the given bucket/folder within S3 in the AWS console?

Yes, you can.

The data file flow during the build is as follows:

  1. The output data files are temporarily created in the Sisense storage (e.g., the farm location).
  2. The folder is created inside the configured S3 bucket.
  3. The files are uploaded to the S3.
  4. The files are deleted from the Sisense storage.
  5. The files are deleted from S3 after they finish importing into the destination database.

Note that this process is not performed when a build optimization process is in place (Short Circuit mode).
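
As an illustration of steps 1-5 (not Sisense's actual code), a boto3 sketch of the upload-then-delete lifecycle might look like the following; the bucket name, prefix, and file paths are hypothetical.

  import os
  import boto3

  BUCKET = "example-b2d-bucket"        # hypothetical bucket configured for B2D
  PREFIX = "b2d-build-1234/"           # hypothetical per-build folder (step 2)

  s3 = boto3.client("s3")

  def upload_build_files(local_files):
      # Steps 3-4: upload each temporary data file, then free the Sisense storage.
      keys = []
      for path in local_files:
          key = PREFIX + os.path.basename(path)
          s3.upload_file(path, BUCKET, key)
          keys.append(key)
          os.remove(path)
      return keys

  def delete_imported_files(keys):
      # Step 5: remove the S3 objects once the destination database confirms the import.
      s3.delete_objects(
          Bucket=BUCKET,
          Delete={"Objects": [{"Key": k} for k in keys]},
      )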

How does the data stored in the data files in S3 get imported to the destination database?

The data files (e.g., the parquet or CSV files) that have been uploaded to S3 are imported into the given table via a SQL statement.
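
The exact statement depends on the destination. As a hedged illustration for Snowflake (not Sisense's exact SQL), a Parquet import from S3 typically uses COPY INTO, for example via snowflake-connector-python; all object names, the storage integration, and the credentials below are hypothetical.

  import snowflake.connector

  # Illustrative COPY INTO for Parquet files staged in S3 (hypothetical names).
  copy_sql = """
  COPY INTO analytics.cube_sales.dim_customers
  FROM 's3://example-b2d-bucket/b2d-build-1234/dim_customers/'
  STORAGE_INTEGRATION = b2d_s3_integration
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  """

  conn = snowflake.connector.connect(
      account="example_account",
      user="example_user",
      password="example_password",
      warehouse="BUILD_WH",
      database="ANALYTICS",
  )
  try:
      conn.cursor().execute(copy_sql)
  finally:
      conn.close()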

Who deletes the files from S3?

The Sisense build process deletes the files that have been fully processed/imported into the given destination database, once a successful response is received from the SQL statement that imports them. This requires Sisense to have the permission needed to delete files from the storage. The temporary data files are also deleted in the case of a canceled or failed build.
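
A minimal sketch of that ordering, with hypothetical helper objects: the S3 objects are removed only once the import statement has succeeded, and any remaining temporary files are cleaned up even if the build fails or is canceled. Deleting the objects requires the s3:DeleteObject permission on the bucket.

  import os

  def import_then_cleanup(cursor, copy_sql, s3, bucket, keys, local_files):
      try:
          # Import the uploaded files; a successful response triggers the S3 cleanup.
          cursor.execute(copy_sql)
          s3.delete_objects(
              Bucket=bucket,
              Delete={"Objects": [{"Key": k} for k in keys]},
          )
      finally:
          # Any temporary data files still present are removed on success,
          # failure, or cancellation alike.
          for path in local_files:
              if os.path.exists(path):
                  os.remove(path)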

Does B2D increase storage needs?

No, storage utilization does not increase when using B2D, because all temporary files that are uploaded to S3 are deleted after they are processed.

Can data be sourced from, and be built into, the same instance of the data warehouse (e.g., Snowflake)?

Yes, the same destination database instance can be used both to source the build data and to build the cube. However, it is highly recommended to set up a different warehouse to store the built data.

Note that this setup performs the build using a different method, Build Optimization (or Short Circuit). In this mode, the build does not involve an exchange of data between Sisense and the destination database; data is read from the source database and inserted directly into the given schema in the destination database. Furthermore, this process does not create data files on Sisense storage or in S3, as these steps are not required.

This mode is activated when the source and destination connections are identical (the warehouse and user role are not used to determine when to activate this mode).

This speeds up the overall build process significantly and mostly utilizes the warehouse system resources to process the build.
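
As a hedged sketch of the idea (not Sisense's implementation), the decision and the resulting in-warehouse copy could look like the following; the connection fields, table names, and query are hypothetical.

  def is_same_instance(source_conn: dict, destination_conn: dict) -> bool:
      # Per the note above, only the connection identity matters; the warehouse
      # and user role fields are ignored (field names here are hypothetical).
      keys = ("account", "database")
      return all(source_conn.get(k) == destination_conn.get(k) for k in keys)

  def build_table(cursor, source_conn, destination_conn, source_query, target_table):
      if is_same_instance(source_conn, destination_conn):
          # Short Circuit: no data files on Sisense storage or S3; the data is
          # materialized directly inside the warehouse with a single statement.
          cursor.execute(f"CREATE OR REPLACE TABLE {target_table} AS {source_query}")
      else:
          # Regular B2D flow: export to Parquet/CSV, upload to the bucket, import via SQL.
          raise NotImplementedError("regular B2D flow not shown in this sketch")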

Are there any differences between Snowflake and Redshift when building the cube?

There are no differences in the build flow that builds the cube in these destinations, except for how the schema is managed.

In the case of Snowflake, in non-full build mode, the Snowflake schema clone capability is used. The data updates are performed in the cloned schema, and at the end of the process, the schemas are swapped and the old one is dropped.
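
A hedged sketch of that clone-update-swap sequence, as it could be expressed in Snowflake SQL (schema names are hypothetical and this is not Sisense's exact SQL):

  snowflake_schema_steps = [
      # Zero-copy clone of the live schema to work in.
      "CREATE SCHEMA analytics.cube_sales_build CLONE analytics.cube_sales",
      # ... incremental data updates are applied to the cloned schema here ...
      # Atomically swap the updated clone with the live schema.
      "ALTER SCHEMA analytics.cube_sales SWAP WITH analytics.cube_sales_build",
      # Drop the old schema (now sitting under the build name).
      "DROP SCHEMA analytics.cube_sales_build",
  ]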

In the case of Redshift, in non-full build mode, the same existing schema is used during the build, and temporary tables are created to allow the parallel operation to work as follows (see the sketch after this list):

  • Merge - Upsert/Accumulate - A new temporary table is created that includes the data changes/additions. It is then merged into the existing table, and the temporary table is dropped.
  • Swap - Schema changes - In certain cases, the temporary table that is created becomes the base table, and the old table is dropped.
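
For comparison, hedged sketches of the two Redshift patterns described above (table names, keys, and the IAM role are hypothetical; this is not Sisense's exact SQL):

  # Merge (upsert/accumulate): stage the changes, fold them into the existing table.
  redshift_merge_steps = [
      "CREATE TEMP TABLE fact_sales_stage (LIKE cube.fact_sales)",
      "COPY fact_sales_stage FROM 's3://example-b2d-bucket/b2d-build-1234/fact_sales/' "
      "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role' FORMAT AS PARQUET",
      "DELETE FROM cube.fact_sales USING fact_sales_stage "
      "WHERE cube.fact_sales.id = fact_sales_stage.id",
      "INSERT INTO cube.fact_sales SELECT * FROM fact_sales_stage",
      "DROP TABLE fact_sales_stage",
  ]

  # Swap (schema changes): the rebuilt table replaces the old base table.
  redshift_swap_steps = [
      "ALTER TABLE cube.fact_sales RENAME TO fact_sales_old",
      "ALTER TABLE cube.fact_sales_new RENAME TO fact_sales",
      "DROP TABLE cube.fact_sales_old",
  ]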