Collection
The tailpipe collect command runs a plugin that reads from a source and writes to the hive. Every time you run tailpipe collect, Tailpipe refreshes its views over all collected Parquet files. Those views are the tables you query with tailpipe query.
Examples:
Collect everything.
Collect all partitions in the aws_cloudtrail_log table.
Collect a specific partition.
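The three cases above might look like this on the command line (a sketch assuming an `aws_cloudtrail_log` table; the partition name `prod` is hypothetical):

```bash
# Collect all partitions in all tables
tailpipe collect

# Collect all partitions in the aws_cloudtrail_log table
tailpipe collect aws_cloudtrail_log

# Collect a single partition of that table
tailpipe collect aws_cloudtrail_log.prod
```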
See collect for more examples.
The collection process always writes to a local workspace, and does so on a per-partition basis. While you may specify multiple partitions on the command line, the partition is the unit of collection.
When a partition is collected, each of its sources resumes from where the previous collection left off. Source data is ingested, standardized, and then written to Parquet files in the standard hive.
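As a sketch, the hive is a set of Hive-partitioned directories keyed by table, partition, index, and date (the partition name `prod`, the account-ID index, and the filename here are illustrative):

```
tp_table=aws_cloudtrail_log/
└── tp_partition=prod/
    └── tp_index=123456789012/
        └── tp_date=2024-01-15/
            └── data.parquet
```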
Queries can slice the data by partition using the tp_partition field.
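For example, assuming a hypothetical partition named `prod`, a query can filter on that field:

```sql
-- Count rows in the prod partition only
select count(*)
from aws_cloudtrail_log
where tp_partition = 'prod';
```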
Initial collection
Often, the source data to be ingested is large, and a full first ingestion could take quite a long time. To improve the first-run experience, Tailpipe collects by day in reverse chronological order: it starts with the current day and moves backward. By default, the initial collection does NOT fetch all historical data; instead, it uses a 7-day lookback window. You can override that on the command line, e.g.:
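A sketch, assuming a `--from` argument that accepts an absolute start date (the partition name `prod` is hypothetical):

```bash
# Collect the prod partition from the given date forward,
# instead of the default 7-day lookback
tailpipe collect aws_cloudtrail_log.prod --from 2024-01-01
```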
Subsequent collection runs proceed chronologically, resuming from the last collection by default, so there are no time gaps in the collected data.