pydrill
Python Driver for Apache Drill.
Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Free software: MIT license
Documentation: https://pydrill.readthedocs.org.
Features
Python 2/3 compatibility.
Support for all REST API calls, including profiles, options, and metrics; see the docs for the full list.
Mapping of results to built-in Python types.
Compatibility with pandas DataFrames.
Drill authentication using PAM.
Installation
Version from https://pypi.python.org/pypi/pydrill:
$ pip install pydrill
Latest version from git:
$ pip install git+https://github.com/PythonicNinja/pydrill.git
Sample usage
from pydrill.client import PyDrill
from pydrill.exceptions import ImproperlyConfigured

# Connect to the Drill REST API.
drill = PyDrill(host='localhost', port=8047)

# Fail early if Drill is not answering.
if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

yelp_reviews = drill.query('''
  SELECT * FROM
  `dfs.root`.`./Users/macbookair/Downloads/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json`
  LIMIT 5
''')

# Each row can be indexed by column name.
for result in yelp_reviews:
    print("%s: %s" % (result['type'], result['date']))

# pandas dataframe
df = yelp_reviews.to_dataframe()
print(df[df['stars'] > 3])
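The liveness check and the query call in the sample can be folded into one small helper. This is a minimal sketch, not part of pydrill: `run_query` is a hypothetical name, and it relies only on the `is_active()` and `query(sql, timeout=...)` calls documented on this page. It raises a plain `RuntimeError` so that the block stays self-contained.

```python
def run_query(drill, sql, timeout=10):
    """Return all rows of `sql` as a list, failing fast if Drill is down.

    `drill` is any object exposing is_active() and query(sql, timeout=...),
    such as a PyDrill client. This helper is hypothetical, not pydrill API.
    """
    if not drill.is_active():
        raise RuntimeError('Apache Drill is not reachable; please start it first')
    # Iterating the query result yields one row at a time.
    return list(drill.query(sql, timeout=timeout))
```

With a running server, it could be called as `run_query(PyDrill(host='localhost', port=8047), 'SELECT * FROM sys.version')`.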
Supported API calls
class pydrill.client.PyDrill(host='localhost', port=8047, trasport_class=<class 'pydrill.transport.Transport'>, connection_class=<class 'pydrill.connection.requests_conn.RequestsHttpConnection'>, auth=None, **kwargs)
>>> drill = PyDrill(host='localhost', port=8047)
>>> drill.is_active()
True
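Since the constructor takes `host` and `port` as keyword arguments, connection settings can be kept out of the code. The sketch below reads them from the environment; `connection_settings`, `DRILL_HOST`, and `DRILL_PORT` are hypothetical names chosen for this example, not something pydrill defines.

```python
import os


def connection_settings(environ=None):
    """Build host/port keyword arguments for PyDrill from environment variables.

    DRILL_HOST and DRILL_PORT are hypothetical variable names chosen for this
    sketch; the fallbacks mirror the constructor defaults ('localhost', 8047).
    """
    env = os.environ if environ is None else environ
    return {
        'host': env.get('DRILL_HOST', 'localhost'),
        'port': int(env.get('DRILL_PORT', '8047')),
    }
```

The result can then be unpacked into the constructor, for example `PyDrill(**connection_settings())`.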
metrics(timeout=10)
Get the current memory metrics.
- Parameters
timeout – int
- Returns
pydrill.client.Result
options(timeout=10)
List the name, default, and data type of the system and session options.
- Parameters
timeout – int
- Returns
pydrill.client.Result
plan(sql, timeout=10)
- Parameters
sql – string
timeout – int
- Returns
pydrill.client.ResultQuery
profile(query_id, timeout=10)
Get the profile of the query that has the given query ID.
- Parameters
query_id – The UUID of the query in standard UUID format that Drill assigns to each query.
timeout – int
- Returns
pydrill.client.Result
profile_cancel(query_id, timeout=10)
Cancel the query that has the given query ID.
- Parameters
query_id – The UUID of the query in standard UUID format that Drill assigns to each query.
timeout – int
- Returns
pydrill.client.Result
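Because `query_id` is expected in standard UUID format, a malformed ID can be rejected locally before any request is sent. The following is a minimal sketch under that assumption; `cancel_query` is a hypothetical wrapper, and it uses only the `profile_cancel(query_id, timeout=...)` call documented here plus the standard-library `uuid` module.

```python
import uuid


def cancel_query(drill, query_id, timeout=10):
    """Cancel a query after checking that `query_id` parses as a UUID.

    `cancel_query` is a hypothetical wrapper around the documented
    profile_cancel(query_id, timeout=...) call.
    """
    try:
        # str(UUID(...)) gives the lowercase, hyphenated standard form.
        normalized = str(uuid.UUID(query_id))
    except (ValueError, AttributeError, TypeError):
        raise ValueError('query_id is not a valid UUID: %r' % (query_id,))
    return drill.profile_cancel(normalized, timeout=timeout)
```

Validating first means a typo in the ID surfaces as a `ValueError` on the client rather than as an error response from the server.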
profiles(timeout=10)
Get the profiles of running and completed queries.
- Parameters
timeout – int
- Returns
pydrill.client.Result
query(sql, timeout=10)
Submit a query and return results.
- Parameters
sql – string
timeout – int
- Returns
pydrill.client.ResultQuery
stats(timeout=10)
Get Drillbit information, such as port numbers.
- Parameters
timeout – int
- Returns
pydrill.client.Stats
storage(timeout=10)
Get the list of storage plugin names and configurations.
- Parameters
timeout – int
- Returns
pydrill.client.Result
storage_delete(name, timeout=10)
Delete a storage plugin configuration.
- Parameters
name – The name of the storage plugin configuration to delete.
timeout – int
- Returns
pydrill.client.Result
storage_detail(name, timeout=10)
Get the definition of the named storage plugin.
- Parameters
name – The assigned name in the storage plugin definition.
timeout – int
- Returns
pydrill.client.Result
storage_enable(name, value=True, timeout=10)
Enable or disable the named storage plugin.
- Parameters
name – The assigned name in the storage plugin definition.
value – Either True (to enable) or False (to disable).
timeout – int
- Returns
pydrill.client.Result
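Since `storage_enable` handles both directions through `value`, a plugin can be switched off for a block of work and switched back on afterwards. The context manager below is a sketch of that pattern; `plugin_disabled` is a hypothetical helper, not part of pydrill, and it assumes the plugin was enabled to begin with.

```python
from contextlib import contextmanager


@contextmanager
def plugin_disabled(drill, name, timeout=10):
    """Disable storage plugin `name` for the duration of a with-block.

    Hypothetical helper built on the documented
    storage_enable(name, value=..., timeout=...) call. It re-enables the
    plugin on exit without checking whether it was enabled beforehand.
    """
    drill.storage_enable(name, value=False, timeout=timeout)
    try:
        yield
    finally:
        # Runs even if the body raises, so the plugin is not left disabled.
        drill.storage_enable(name, value=True, timeout=timeout)
```

It would be used as `with plugin_disabled(drill, 'mongo'): ...`, where `'mongo'` stands for whatever plugin name is configured on the server.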
storage_update(name, config, timeout=10)
Create or update a storage plugin configuration.
- Parameters
name – The name of the storage plugin configuration to create or update.
config – The full plugin configuration. It overwrites any existing configuration, so it must include all required attributes and definitions.
timeout – int
- Returns
pydrill.client.Result
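Because `storage_update` overwrites the stored configuration, changing a single attribute means sending the complete configuration back with that attribute replaced. The sketch below shows one way to build such a full copy; `updated_config` is a hypothetical helper, and the `current` dictionary is illustrative data rather than output from a real server.

```python
import copy


def updated_config(current, **changes):
    """Return a full copy of `current` with top-level keys in `changes` replaced."""
    config = copy.deepcopy(current)
    config.update(changes)
    return config


# Illustrative configuration only; a real one comes from the Drill server.
current = {
    'type': 'file',
    'enabled': True,
    'connection': 'file:///',
    'workspaces': {'root': {'location': '/', 'writable': False}},
}
new_config = updated_config(current, connection='file:///data')
# new_config still carries 'type', 'enabled' and 'workspaces', so it can be
# sent whole, e.g. drill.storage_update('dfs', new_config)
```

The deep copy keeps the original dictionary untouched, which helps if the update fails and the old configuration has to be restored.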