onETL¶
What is onETL?¶
Python ETL/ELT library powered by Apache Spark & other open-source tools.
Goals¶
- Provide unified classes to extract data from (E) & load data to (L) various stores.
- Expose the Spark DataFrame API for performing transformations (T) in terms of ETL.
- Provide direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines.
- Support different read strategies for incremental and batch data fetching.
- Provide a hooks & plugins mechanism for altering the behavior of internal classes.
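The incremental read strategy mentioned above can be illustrated with a minimal, library-agnostic sketch: remember the highest seen value of a monotonically growing column (the high-water mark, HWM) between runs, and fetch only rows above it next time. The `hwm_store` dict and `fetch_increment` helper are hypothetical names used purely for illustration; this is not the onETL API.

```python
# Conceptual sketch of an incremental read strategy.
# NOTE: `hwm_store` and `fetch_increment` are hypothetical, illustration-only
# names -- this is not the onETL API.

hwm_store = {}  # stands in for persistent high-water-mark (HWM) storage


def fetch_increment(rows, hwm_column, hwm_name):
    """Return only rows whose hwm_column exceeds the stored HWM,
    then advance the stored HWM to the new maximum."""
    last = hwm_store.get(hwm_name)
    new_rows = [r for r in rows if last is None or r[hwm_column] > last]
    if new_rows:
        hwm_store[hwm_name] = max(r[hwm_column] for r in new_rows)
    return new_rows


table = [{"id": 1}, {"id": 2}, {"id": 3}]
first = fetch_increment(table, "id", "demo.id")   # first run: all rows
table.append({"id": 4})
second = fetch_increment(table, "id", "demo.id")  # next run: only the new row
```

A real implementation would persist the HWM between process runs (onETL supports pluggable HWM stores for exactly this purpose), but the core idea is the same.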
Non-goals¶
- onETL is not a Spark replacement. It provides additional functionality that Spark does not have, and improves UX for end users.
- onETL is not a framework: it imposes no requirements on project structure, naming, the way ETL/ELT processes are run, configuration, etc. All of that should be implemented in some other tool.
- onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All such integrations should be implemented as separate tools.
- Only batch operations are supported, no streaming. For streaming, prefer Apache Flink.
Requirements¶
- Python 3.7 - 3.14
- PySpark 3.2.x - 4.1.x (depends on used connector)
- Java 8+ (required by Spark, see below)
- Kerberos libs & GCC (required by Hive, HDFS and SparkHDFS connectors)
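A typical installation from PyPI looks like the following. The extras name `spark` is an assumption based on common packaging practice; check the project's installation docs for the exact list of extras.

```shell
# Install onETL from PyPI; the "spark" extra additionally pulls in PySpark.
# (Extras name is an assumption -- consult the official docs.)
pip install "onetl[spark]"
```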
Supported storages¶
| Type | Storage | Powered by |
|---|---|---|
| Database | Clickhouse, MSSQL, MySQL, Postgres, Oracle, Teradata | Apache Spark JDBC Data Source |
| Database | Hive | Apache Spark Hive integration |
| Database | Kafka | Apache Spark Kafka integration |
| Database | Greenplum | VMware Greenplum Spark connector |
| Database | MongoDB | MongoDB Spark connector |
| File | HDFS | HDFS Python client |
| File | S3 | minio-py client |
| File | SFTP | Paramiko library |
| File | FTP, FTPS | FTPUtil library |
| File | WebDAV | WebdavClient3 library |
| File | Samba | pysmb library |
| Files as DataFrame | SparkLocalFS, SparkHDFS | Apache Spark File Data Source |
| Files as DataFrame | SparkS3 | Hadoop AWS library |