Create a dataset
Gentropy provides a collection of Dataset
s that encapsulate key concepts in the field of genetics. For example, to represent summary statistics, you'll use the SummaryStatistics
class. This datatype comes with a set of useful operations to disentangle the genetic architecture of a trait or disease.
The full list of Dataset
s is available in the Python API documentation.
Any instance of Dataset will have 2 common attributes
- df: the Spark DataFrame that contains the data
- schema: the definition of the data structure in Spark format
In this section you'll learn the different ways of how to create a Dataset
instances.
Creating a dataset from parquet¶
All the Dataset
s have a from_parquet
method that allows you to create any Dataset
instance from a parquet file or directory.
# Create a SummaryStatistics object by loading data from the specified path
from gentropy.dataset.summary_statistics import SummaryStatistics
path = "path/to/summary/stats"
summary_stats = SummaryStatistics.from_parquet(session, path)
Parquet files
Parquet is a columnar storage format that is widely used in the Spark ecosystem. It is the recommended format for storing large datasets. For more information about parquet, please visit https://parquet.apache.org/.
Creating a dataset from a data source¶
Alternatively, Dataset
s can be created using a data source harmonisation method. For example, to create a SummaryStatistics
object from Finngen's raw summary statistics, you can use the FinnGen
data source.
# Create a SummaryStatistics object by loading raw data from Finngen
from gentropy.datasource.finngen.summary_stats import FinnGenSummaryStats
path = "path/to/finngen/summary/stats"
summary_stats = FinnGenSummaryStats.from_source(session.spark, path)
Creating a dataset from a pandas DataFrame¶
If none of our data sources fit your needs, you can create a Dataset
object from your own data. To do so, you need to transform your data to fit the Dataset
schema.
The schema of a Dataset is defined in Spark format
The Dataset schemas can be found in the documentation of each Dataset. For example, the schema of the SummaryStatistics
dataset can be found here.
You can also create a Dataset
from a pandas DataFrame. This is useful when you want to create a Dataset
from a small dataset that fits in memory.
import pyspark.pandas as ps
from gentropy.dataset.summary_statistics import SummaryStatistics
# Load your transformed data into a pandas DataFrame
path = "path/to/your/data"
custom_summary_stats_pandas_df = pd.read_csv(path)
# Create a SummaryStatistics object specifying the data and schema
custom_summary_stats_df = custom_summary_stats_pandas_df.to_spark()
custom_summary_stats = SummaryStatistics(
_df=custom_summary_stats_df, _schema=SummaryStatistics.get_schema()
)
What's next?¶
In the next section, we will explore how to apply well-established algorithms that transform and analyse genetic data within the Gentropy framework.