onETL¶
What is onETL?¶
Python ETL/ELT library powered by Apache Spark & other open-source tools.
Goals¶
- Provide unified classes to extract data from (E) & load data to (L) various stores.
- Expose the Spark DataFrame API for performing transformations (T) in terms of ETL.
- Provide direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines.
- Support different read strategies for incremental and batch data fetching.
- Provide a hooks & plugins mechanism for altering the behavior of internal classes.
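The incremental read strategy mentioned above can be illustrated with a minimal, library-agnostic sketch: remember the highest seen value of a monotonically growing column (the high-water mark, HWM) between runs, and fetch only rows above it next time. The `hwm_store` dict and `fetch_increment` helper are hypothetical names used purely for illustration; this is not the onETL API.

```python
# Conceptual sketch of an incremental read strategy.
# NOTE: `hwm_store` and `fetch_increment` are hypothetical, illustration-only
# names -- this is not the onETL API.

hwm_store = {}  # stands in for persistent high-water-mark (HWM) storage


def fetch_increment(rows, hwm_column, hwm_name):
    """Return only rows whose hwm_column exceeds the stored HWM,
    then advance the stored HWM to the new maximum."""
    last = hwm_store.get(hwm_name)
    new_rows = [r for r in rows if last is None or r[hwm_column] > last]
    if new_rows:
        hwm_store[hwm_name] = max(r[hwm_column] for r in new_rows)
    return new_rows


table = [{"id": 1}, {"id": 2}, {"id": 3}]
first = fetch_increment(table, "id", "demo.id")   # first run: all rows
table.append({"id": 4})
second = fetch_increment(table, "id", "demo.id")  # next run: only the new row
```

A real implementation would persist the HWM between process runs (onETL supports pluggable HWM stores for exactly this purpose), but the core idea is the same.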
Non-goals¶
- onETL is not a Spark replacement. It provides additional functionality that Spark does not have, and improves UX for end users.
- onETL is not a framework: it imposes no requirements on project structure, naming, the way ETL/ELT processes are run, configuration, etc. All of that should be implemented in some other tool.
- onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All such integrations should be implemented as separate tools.
- Only batch operations are supported, no streaming. For streaming, prefer Apache Flink.
Requirements¶
- Python 3.7 - 3.14
- PySpark 3.2.x - 4.1.x (depends on used connector)
- Java 8+ (required by Spark, see below)
- Kerberos libs & GCC (required by Hive, HDFS and SparkHDFS connectors)
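A typical installation from PyPI looks like the following. The extras name `spark` is an assumption based on common packaging practice; check the project's installation docs for the exact list of extras.

```shell
# Install onETL from PyPI; the "spark" extra additionally pulls in PySpark.
# (Extras name is an assumption -- consult the official docs.)
pip install "onetl[spark]"
```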
Supported storages¶
| Type | Storage | Powered by |
|---|---|---|
| Database | Clickhouse, MSSQL, MySQL, Postgres, Oracle, Teradata | Apache Spark JDBC Data Source |
| Database | Hive | Apache Spark Hive integration |
| Database | Kafka | Apache Spark Kafka integration |
| Database | Greenplum | VMware Greenplum Spark connector |
| Database | MongoDB | MongoDB Spark connector |
| File | HDFS | HDFS Python client |
| File | S3 | minio-py client |
| File | SFTP | Paramiko library |
| File | FTP, FTPS | FTPUtil library |
| File | WebDAV | WebdavClient3 library |
| File | Samba | pysmb library |
| Files as DataFrame | SparkLocalFS, SparkHDFS | Apache Spark File Data Source |
| Files as DataFrame | SparkS3 | Hadoop AWS library |