Prerequisites
Version Compatibility
- MongoDB server versions:
  - Officially declared: 4.0 or higher
  - Actually tested: 4.0.0, 8.2.2
- Spark versions: 3.2.x - 4.1.x
- Java versions: 8 - 22
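The ranges above can be checked up front, before any connection is attempted. The following is a minimal sketch, not part of the connector: `parse_version` and `check_compatibility` are hypothetical helper names, and the version strings are assumed to be supplied by the caller (for example, read from your cluster or environment by other means).

```python
def parse_version(text):
    """Turn '3.5.1' into (3, 5, 1); a non-numeric suffix such as '-SNAPSHOT' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_compatibility(mongodb_version, spark_version, java_major):
    """Hypothetical helper: return a list of problems, empty if all versions are in range."""
    problems = []
    # MongoDB: officially declared as 4.0 or higher
    if parse_version(mongodb_version)[:2] < (4, 0):
        problems.append(f"MongoDB {mongodb_version} is older than 4.0")
    # Spark: 3.2.x - 4.1.x (compare major.minor only)
    spark = parse_version(spark_version)[:2]
    if not ((3, 2) <= spark <= (4, 1)):
        problems.append(f"Spark {spark_version} is outside 3.2.x - 4.1.x")
    # Java: 8 - 22 (major version number)
    if not (8 <= java_major <= 22):
        problems.append(f"Java {java_major} is outside 8 - 22")
    return problems
```

An empty list means every version falls inside the declared ranges; otherwise each entry names the component that is out of range.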
Installing PySpark
To use the MongoDB connector, PySpark must be installed (or injected into sys.path)
BEFORE creating the connector instance.
See the installation instructions for more details.
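One way to verify this requirement early is to check whether the `pyspark` module can be found on `sys.path` before constructing the connector. This is a minimal sketch using only the standard library; `module_available` and `require_pyspark` are hypothetical names, not part of the connector.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


def require_pyspark():
    """Hypothetical guard: fail with a clear message if PySpark cannot be imported."""
    if not module_available("pyspark"):
        raise ImportError(
            "PySpark was not found. Install it or add it to sys.path "
            "before creating the connector instance."
        )
```

Calling `require_pyspark()` at the start of a script turns a late, confusing failure into an immediate and readable one.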
Connecting to MongoDB
Connection host
You can connect to a MongoDB host using either its DNS name or its IP address.
It is also possible to connect to a MongoDB shared cluster:
mongo = MongoDB(
    host="master.host.or.ip",
    user="user",
    password="*****",
    database="target_database",
    spark=spark,
    extra={
        # read data from a secondary cluster node, switch to primary if not available
        "readPreference": "secondaryPreferred",
    },
)
Supported readPreference values are described in the official MongoDB documentation.
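Because the value is passed as a plain string, a typo is easy to miss. The sketch below validates it against the five read preference modes defined by MongoDB (`primary`, `primaryPreferred`, `secondary`, `secondaryPreferred`, `nearest`) before building the `extra` dictionary. `build_extra` is a hypothetical helper, not part of the connector, and whether the connector performs its own validation is not assumed here.

```python
# Read preference modes defined by MongoDB; the names are case-sensitive.
READ_PREFERENCES = (
    "primary",
    "primaryPreferred",
    "secondary",
    "secondaryPreferred",
    "nearest",
)


def build_extra(read_preference):
    """Hypothetical helper: return an `extra` dict after validating the read preference."""
    if read_preference not in READ_PREFERENCES:
        allowed = ", ".join(READ_PREFERENCES)
        raise ValueError(
            f"Unknown readPreference {read_preference!r}; expected one of: {allowed}"
        )
    return {"readPreference": read_preference}
```

The returned dictionary has the same shape as the `extra` argument in the example above.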
Connection port
Connections usually use port 27017, but the port may differ between MongoDB instances.
Ask your MongoDB administrator for the correct host and port.
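Once you have the host and port, a quick TCP check can confirm that the port is reachable from the machine running Spark. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and a successful check only shows that something accepts connections on that port, not that it is MongoDB or that the credentials are valid.

```python
import socket


def can_connect(host, port=27017, timeout=3.0):
    """Hypothetical helper: return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection resolves DNS names and IP addresses alike
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name resolution failures
        return False
```

For example, `can_connect("master.host.or.ip", 27017)` would report whether the host from the example above accepts connections on the default port.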
Required grants
Ask your MongoDB cluster administrator to grant the following roles to the user used for creating the connection:
// allow writing data to specific database
db.grantRolesToUser("username", [{db: "somedb", role: "readWrite"}])
// allow reading data from specific database
db.grantRolesToUser("username", [{db: "somedb", role: "read"}])
See: