
onETL




What is onETL?

Python ETL/ELT library powered by Apache Spark & other open-source tools.

Goals

  • Provide unified classes to extract data from (E) and load data to (L) various stores.
  • Provide the Spark DataFrame API for performing transformations (T) in ETL terms.
  • Provide direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines.
  • Support different read strategies for incremental and batch data fetching.
  • Provide a hooks & plugins mechanism for altering the behavior of internal classes.
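The incremental read strategy mentioned above can be illustrated conceptually. The sketch below shows the high-watermark (HWM) idea in plain Python: remember the highest key seen on the previous run and fetch only newer rows on the next one. All names here are illustrative, not onETL's actual API.

```python
# Conceptual sketch of an incremental read strategy (high-watermark based).
# Class and method names are illustrative only -- this is NOT the onETL API.

rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 3, "value": "c"},
]


class IncrementalStrategy:
    """Remembers the highest seen key (high watermark, HWM) between runs,
    so each run fetches only rows added since the previous one."""

    def __init__(self):
        self.hwm = None  # high watermark, persisted between runs

    def fetch(self, source):
        # Select only rows above the current watermark (all rows on first run)
        new_rows = [r for r in source if self.hwm is None or r["id"] > self.hwm]
        if new_rows:
            self.hwm = max(r["id"] for r in new_rows)
        return new_rows


strategy = IncrementalStrategy()
first = strategy.fetch(rows)           # first run: all 3 rows
rows.append({"id": 4, "value": "d"})   # new data arrives at the source
second = strategy.fetch(rows)          # second run: only the new row
```

In onETL the watermark bookkeeping is handled by the library itself and works against real databases; the point here is only the contract: repeated runs of the same reader pick up where the previous run stopped.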

Non-goals

  • onETL is not a Spark replacement. It just provides additional functionality that Spark does not have, and improves UX for end users.
  • onETL is not a framework: it imposes no requirements on project structure, naming, the way ETL/ELT processes are run, configuration, etc. All of that should be handled by some other tool.
  • onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All such integrations should be implemented as separate tools.
  • Batch operations only, no streaming. For streaming, prefer Apache Flink.

Requirements

  • Python 3.7 - 3.14
  • PySpark 3.2.x - 4.1.x (depending on the connector used)
  • Java 8+ (required by Spark, see below)
  • Kerberos libraries & GCC (required by the Hive, HDFS and SparkHDFS connectors)

Supported storages

Type                 Storage       Powered by
Database             Clickhouse    Apache Spark JDBC Data Source
Database             MSSQL         Apache Spark JDBC Data Source
Database             MySQL         Apache Spark JDBC Data Source
Database             Postgres      Apache Spark JDBC Data Source
Database             Oracle        Apache Spark JDBC Data Source
Database             Teradata      Apache Spark JDBC Data Source
Database             Hive          Apache Spark Hive integration
Database             Kafka         Apache Spark Kafka integration
Database             Greenplum     VMware Greenplum Spark connector
Database             MongoDB       MongoDB Spark connector
File                 HDFS          HDFS Python client
File                 S3            minio-py client
File                 SFTP          Paramiko library
File                 FTP           FTPUtil library
File                 FTPS          FTPUtil library
File                 WebDAV        WebdavClient3 library
File                 Samba         pysmb library
Files as DataFrame   SparkLocalFS  Apache Spark File Data Source
Files as DataFrame   SparkHDFS     Apache Spark File Data Source
Files as DataFrame   SparkS3       Hadoop AWS library