
Create a dataset

Datasets

Gentropy Datasets are the most basic concept in the package, used to represent the various data modalities Gentropy works with: Variant, Gene, Locus, etc.

The full list of Datasets is available in the Python API documentation.

Any instance of Dataset has two common attributes:

  • df: the Spark DataFrame that contains the data
  • schema: the definition of the data structure in Spark format

Dataset implementation - pyspark DataFrame

Datasets are implemented as classes that wrap PySpark DataFrames. This means that every Dataset has a df attribute referencing a DataFrame that conforms to a specific schema.
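As an illustration, a minimal sketch of inspecting both attributes, assuming summary_stats is an existing SummaryStatistics instance (created as in the examples below):

# `summary_stats` is assumed to be an existing SummaryStatistics instance
# Inspect the underlying PySpark DataFrame
summary_stats.df.show(5)

# Inspect the schema the data must conform to, in Spark format
print(summary_stats.schema)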

Dataset schemas - contract with the user

Each dataset specifies a contract that the data provided by the package user must meet in order to run the methods implemented in Gentropy.

The dataset contract is implemented as a PySpark DataFrame schema, exposed through the schema attribute. It defines the table field names and types, and specifies whether each field is required or optional for constructing the dataset. All dataset schemas can be found on the corresponding documentation pages under the Dataset API.
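The expected schema can also be inspected programmatically. A minimal sketch, assuming the get_schema classmethod returns the dataset's expected Spark schema as a pyspark StructType:

from gentropy.dataset.summary_statistics import SummaryStatistics

# Retrieve the expected Spark schema for the dataset
# (get_schema returning a StructType is an assumption here)
expected_schema = SummaryStatistics.get_schema()
print(expected_schema)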

Dataset initialization

In this section you'll learn the different ways to create Dataset instances.

Initializing datasets from parquet files

Each dataset has a method to read parquet files into a Dataset instance with schema validation. This is implemented in the Dataset.from_parquet method, which is shared by all datasets.

# Create a SummaryStatistics object by loading data from the specified path
from gentropy.common.session import Session
from gentropy.dataset.summary_statistics import SummaryStatistics

# Start a Gentropy session, which wraps a Spark session
session = Session()

path = "path/to/summary/stats"
summary_stats = SummaryStatistics.from_parquet(session, path)

Parquet files

Parquet is a columnar storage format that is widely used in the Spark ecosystem. It is the recommended format for storing large datasets. For more information about parquet, please visit https://parquet.apache.org/.

Reading multiple files

If you have multiple parquet files, you can pass either of the following to the from_parquet method:

  • a directory path like path/to/summary/stats - reading all parquet files from the stats directory.
  • a glob pattern like path/to/summary/stats/*.parquet - reading all files that end with .parquet from the stats directory.

The method will read all matching parquet files and return a Dataset instance.
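For example, both of the following calls are valid (a sketch reusing the session created above; the paths are placeholders):

# Read all parquet files from a directory
summary_stats = SummaryStatistics.from_parquet(session, "path/to/summary/stats")

# Read only the files matching a glob pattern
summary_stats = SummaryStatistics.from_parquet(session, "path/to/summary/stats/*.parquet")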

Initializing datasets from pyspark DataFrames

If you already have a PySpark DataFrame, it can be converted to a dataset using the default Dataset constructor. The constructor also validates the schema of the provided DataFrame against the dataset schema.
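A minimal sketch, assuming spark_df is a PySpark DataFrame that already conforms to the SummaryStatistics schema:

from gentropy.dataset.summary_statistics import SummaryStatistics

# `spark_df` is assumed to be an existing PySpark DataFrame that
# matches the SummaryStatistics schema; validation runs on construction
custom_summary_stats = SummaryStatistics(_df=spark_df)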

Initializing datasets from a data source

Alternatively, Datasets can be created using a data source harmonisation method. For example, to create a SummaryStatistics object from FinnGen's raw summary statistics, you can use the FinnGen data source.

# Create a SummaryStatistics object by harmonising raw FinnGen data
from gentropy.datasource.finngen.summary_stats import FinnGenSummaryStats

# `session` is the Gentropy session created earlier
path = "path/to/finngen/summary/stats"
summary_stats = FinnGenSummaryStats.from_source(session.spark, path)

Initializing datasets from a pandas DataFrame

If none of our data sources fit your needs, you can create a Dataset object from your own data. To do so, you need to transform your data to fit the Dataset schema.

The schema of a Dataset is defined in Spark format

The Dataset schemas can be found in the documentation of each Dataset. For example, the schema of the SummaryStatistics dataset is available on its page in the Dataset API documentation.

You can also create a Dataset from a pandas DataFrame. This is useful when your data is small enough to fit in memory.

import pyspark.pandas as ps

from gentropy.dataset.summary_statistics import SummaryStatistics


# Load your transformed data into a pandas-on-Spark DataFrame
path = "path/to/your/data"
custom_summary_stats_pandas_df = ps.read_csv(path)

# Convert it to a PySpark DataFrame and create a SummaryStatistics
# object; the schema is validated on construction
custom_summary_stats_df = custom_summary_stats_pandas_df.to_spark()
custom_summary_stats = SummaryStatistics(_df=custom_summary_stats_df)
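Schema validation happens when the object is constructed. As a sketch of what to expect, a DataFrame that does not match the schema will raise an error (the exact exception type is an assumption and may vary between Gentropy versions):

# A DataFrame missing the required fields fails validation
invalid_df = session.spark.createDataFrame([(1,)], ["not_a_real_field"])
try:
    SummaryStatistics(_df=invalid_df)
except Exception as error:  # exact exception type is an assumption
    print(f"Schema validation failed: {error}")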

What's next?

In the next section, we will explore how to apply well-established algorithms that transform and analyse genetic data within the Gentropy framework.