Configure Collection
Before you can begin collecting logs, you must tell Tailpipe what to collect. Tailpipe configuration is defined using HCL in one or more Tailpipe config (.tpc) files in the config directory (~/.tailpipe/config by default).
Tables
Ultimately, the data that Tailpipe collects ends up in tables that you can query with SQL.
Tailpipe plugins define tables for common log sources and formats. You don't need to define these tables; simply create one or more partitions for the table and begin collecting logs!
If your logs are not in a standard format or are not currently supported by a plugin, you can create custom tables to collect data from arbitrary log files and other sources.
Tailpipe creates DuckLake tables based on the data and metadata that it discovers in the workspace, along with any filter rules you specify.
When you run tailpipe query or tailpipe connect with filter arguments (--from, --to, --index, --partition), Tailpipe finds all the tables in the workspace according to the hive directory layout and filters the view of each table accordingly.
You can see what tables are available with the tailpipe table list command.
Partitions
A partition represents data gathered from a source. Partitions are defined in HCL and are required for collection.
The partition has two labels:
- The table name. The table name is significant; it must match the name of a table defined by an installed plugin or a custom table.
- A partition name. The partition name must be unique for all partitions in a given table (though different tables may use the same partition names).
The partition must also contain a source block that defines the location of the source log files as well as the connection information to interact with it.
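Putting this together, a minimal partition definition looks like the following sketch. The table name matches a plugin-defined table; the bucket name and connection are illustrative:

```hcl
# partition "<table name>" "<partition name>"
partition "aws_cloudtrail_log" "prod" {
  # the source block defines where the logs live and how to connect
  source "aws_s3_bucket" {
    connection = connection.aws.prod          # credentials / account scope
    bucket     = "my-cloudtrail-logs-bucket"  # illustrative bucket name
  }
}
```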
The data for each table partition will be stored in its own subdirectory in the hive.
At query time, Tailpipe discovers partitions in the workspace and automatically creates tables based on the partitions it finds. For instance, if you define three aws_cloudtrail_log partitions, the aws_cloudtrail_log table will include the data from all three.
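For example, these two partitions both carry the aws_cloudtrail_log table label, so querying aws_cloudtrail_log returns rows collected from both (bucket names and connections are illustrative):

```hcl
partition "aws_cloudtrail_log" "prod" {
  source "aws_s3_bucket" {
    connection = connection.aws.prod
    bucket     = "prod-cloudtrail-logs"  # illustrative
  }
}

partition "aws_cloudtrail_log" "dev" {
  source "aws_s3_bucket" {
    connection = connection.aws.dev
    bucket     = "dev-cloudtrail-logs"   # illustrative
  }
}
```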
Hive partitioning
Tailpipe uses hive partitioning to leverage automatic filter pushdown, and it is opinionated about the layout:
- The data is written to Parquet files in the workspace directory, with a prescribed directory and filename structure. Each partition is written to a separate directory.
- The tp_index is used to partition the data and defaults to "default" if not specified. You can set tp_index in your partition config to specify a column whose value should be used as the index (see the example below). Be aware that defining a tp_index does not always improve performance and may, in fact, degrade it, as it can result in many small Parquet files.
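For instance, a sketch that indexes CloudTrail data by account, assuming the table exposes a recipient_account_id column:

```hcl
partition "aws_cloudtrail_log" "prod" {
  # use this column's value as tp_index instead of the "default" index
  tp_index = "recipient_account_id"

  source "aws_s3_bucket" {
    connection = connection.aws.prod
    bucket     = "my-cloudtrail-logs-bucket"  # illustrative
  }
}
```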
The standard partitioning/hive structure enables efficient queries that only need to read subsets of the hive, filtered by index or date. Because the data is laid out into partitions, performance is optimized when the partition key appears in a where or join clause. The index provides a way to segment the data to optimize lookup performance for your specific use case. For example, you might index on account ID for AWS tables, subscription for Azure tables, or project ID for GCP tables.
Sources
A partition acquires data from a source. Often, a source will connect to a resource via a connection, which specifies the credentials and account scope.
The block label denotes the source type, such as aws_s3_bucket or file. Source types are defined in plugins, and the arguments vary by type. The Tailpipe Hub provides extended documentation and examples for plugin sources. The file source is provided by the core plugin, which is included in every Tailpipe installation.
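For instance, a sketch of a partition that collects nginx access logs from the local filesystem with the file source; the path and filename pattern are illustrative, and the exact arguments for each source type are documented on the Tailpipe Hub:

```hcl
partition "nginx_access_log" "local" {
  source "file" {
    paths       = ["/var/log/nginx"]  # illustrative directory
    file_layout = "%{DATA}.log"       # illustrative filename pattern
  }
}
```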
The source is responsible for:
- turning raw data into rows
- initiating file transfers or requests
- downloading or copying raw data
- unzipping/untarring, etc.
- incremental transfer, tracking, and retrying
Connections
Connections provide credentials and configuration options to connect to external services. Tailpipe connections are similar to connections in Steampipe and Flowpipe.
Connection types are defined in plugins. Each type creates a default connection (e.g., connection.aws.default), which can be overridden in a .tpc file. Each plugin has its own default credential resolution.
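For example, a sketch that defines a named AWS connection and overrides the default; the profile names are illustrative:

```hcl
# referenced from source blocks as connection.aws.prod
connection "aws" "prod" {
  profile = "prod-account"     # illustrative AWS CLI profile
}

# redefining "default" overrides connection.aws.default
connection "aws" "default" {
  profile = "logging-account"  # illustrative
}
```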