Create a dataset
Datasets
Gentropy Datasets are the most basic concept in the package: they represent various abstract data modalities such as Variant, Gene, or Locus. The full list of Datasets is available in the Python API documentation.
Any instance of Dataset has two common attributes:
- df: the Spark DataFrame that contains the data
- schema: the definition of the data structure in Spark format
Dataset implementation - pyspark DataFrame
Datasets are implemented as classes composed of PySpark DataFrames. This means that every Dataset has a df attribute that references a DataFrame with the dataset's specific schema.
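For example, once a dataset has been created (using the from_parquet method described below), its df attribute behaves like any other PySpark DataFrame. A minimal sketch, with an illustrative path:
# Load a dataset, then inspect its underlying PySpark DataFrame
from gentropy import SummaryStatistics
from gentropy.common.session import Session

session = Session()
summary_stats = SummaryStatistics.from_parquet(session, "path/to/summary/stats")

# df is a plain PySpark DataFrame, so the usual DataFrame API applies
summary_stats.df.show(5)
summary_stats.df.printSchema()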
Dataset schemas - contract with the user
Each dataset specifies the contract that data provided by the package user has to meet in order to run the methods implemented in Gentropy. The dataset contract is implemented as a PySpark DataFrame schema exposed under the schema attribute; it defines the table field names and types, and specifies whether a field is required or optional to construct the dataset. All dataset schemas can be found on the corresponding documentation pages under the Dataset API.
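You can also inspect the contract programmatically. The sketch below assumes the dataset class exposes its schema through a get_schema classmethod, mirroring the schema attribute available on instances:
from gentropy import SummaryStatistics

# Print each field's name, Spark type, and whether it is optional (nullable)
for field in SummaryStatistics.get_schema().fields:
    print(field.name, field.dataType.simpleString(), "optional" if field.nullable else "required")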
Dataset initialization
In this section you'll learn the different ways to create Dataset instances.
Initializing datasets from parquet files
Each dataset has a method to read parquet files into a Dataset instance with schema validation. This is implemented in the Dataset.from_parquet classmethod.
# Create a SummaryStatistics object by loading data from the specified path
from gentropy import SummaryStatistics
from gentropy.common.session import Session

# Create a Gentropy session (local Spark by default)
session = Session()

path = "path/to/summary/stats"
summary_stats = SummaryStatistics.from_parquet(session, path)
Parquet files
Parquet is a columnar storage format that is widely used in the Spark ecosystem. It is the recommended format for storing large datasets. For more information about parquet, please visit https://parquet.apache.org/.
Reading multiple files
If you have multiple parquet files, you can pass either of the following to the from_parquet method:
- a directory path like path/to/summary/stats - reading all parquet files from the stats directory.
- a glob pattern like path/to/summary/stats/*h.parquet - reading all files that end with h.parquet from the stats directory.
The method will read all the matching parquet files and return a Dataset instance.
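For example, a minimal sketch using a glob pattern (the path and pattern are illustrative):
# Read all parquet files matching the glob pattern into a single dataset
from gentropy import SummaryStatistics
from gentropy.common.session import Session

session = Session()
summary_stats = SummaryStatistics.from_parquet(session, "path/to/summary/stats/*.parquet")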
Initializing datasets from pyspark DataFrames
If you already have a PySpark DataFrame, it can be converted to a dataset using the default Dataset constructor. The constructor also validates the schema of the provided DataFrame against the dataset schema.
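As a minimal sketch (the path is illustrative, and the DataFrame must already conform to the SummaryStatistics schema):
from gentropy import SummaryStatistics
from gentropy.common.session import Session

session = Session()

# A PySpark DataFrame whose columns already match the SummaryStatistics schema
df = session.spark.read.parquet("path/to/summary/stats")

# The default constructor validates the DataFrame against the dataset schema
summary_stats = SummaryStatistics(_df=df)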
Initializing datasets from a data source
Alternatively, Datasets can be created using a data source harmonisation method. For example, to create a SummaryStatistics object from FinnGen's raw summary statistics, you can use the FinnGen data source.
# Create a SummaryStatistics object by harmonising raw FinnGen summary statistics
from gentropy.common.session import Session
from gentropy.datasource.finngen.summary_stats import FinnGenSummaryStats

session = Session()

path = "path/to/finngen/summary/stats"
summary_stats = FinnGenSummaryStats.from_source(session.spark, path)
Initializing datasets from a pandas DataFrame
If none of our data sources fits your needs, you can create a Dataset object from your own data. To do so, you need to transform your data to fit the Dataset schema.
The schema of a Dataset is defined in Spark format
The Dataset schemas can be found in the documentation of each Dataset. For example, the schema of the SummaryStatistics dataset can be found here.
You can also create a Dataset from a pandas DataFrame. This is useful when you want to create a Dataset from a small dataset that fits in memory.
import pyspark.pandas as ps

from gentropy import SummaryStatistics

# Load your transformed data into a pandas-on-Spark DataFrame
path = "path/to/your/data"
custom_summary_stats_pandas_df = ps.read_csv(path)

# Convert to a PySpark DataFrame and create a SummaryStatistics object
custom_summary_stats_df = custom_summary_stats_pandas_df.to_spark()
custom_summary_stats = SummaryStatistics(_df=custom_summary_stats_df)
What's next?
In the next section, we will explore how to apply well-established algorithms that transform and analyse genetic data within the Gentropy framework.