pydrill


Python Driver for Apache Drill.

Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.

Features

  • Python 2/3 compatibility

  • Support for all REST API calls, including profiles, options, and metrics (see the docs for the full list)

  • Mapping of results to native Python types

  • Compatibility with pandas DataFrames

  • Drill authentication using PAM

Installation

Released version from PyPI (https://pypi.python.org/pypi/pydrill):

$ pip install pydrill

Latest version from git:

$ pip install git+https://github.com/PythonicNinja/pydrill.git

Sample usage

from pydrill.client import PyDrill
from pydrill.exceptions import ImproperlyConfigured

drill = PyDrill(host='localhost', port=8047)

if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

yelp_reviews = drill.query('''
  SELECT * FROM
  `dfs.root`.`./Users/macbookair/Downloads/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json`
  LIMIT 5
''')

for result in yelp_reviews:
    print("%s: %s" %(result['type'], result['date']))


# pandas dataframe

df = yelp_reviews.to_dataframe()
print(df[df['stars'] > 3])

Supported API calls

class pydrill.client.PyDrill(host='localhost', port=8047, trasport_class=<class 'pydrill.transport.Transport'>, connection_class=<class 'pydrill.connection.requests_conn.RequestsHttpConnection'>, auth=None, **kwargs)

>>> drill = PyDrill(host='localhost', port=8047)
>>> drill.is_active()
True

is_active(timeout=2)

Check whether Drill is running.

Parameters

timeout – int

Returns

boolean
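
Since is_active() returns a boolean, it can be polled while Drill is starting up. The following is a minimal sketch of such a wait loop, not something pydrill provides: wait_until_active, attempts, and delay are hypothetical names, and drill is assumed to be any object exposing the is_active(timeout=...) method described above.

```python
import time


def wait_until_active(drill, attempts=5, delay=1.0, timeout=2):
    """Poll drill.is_active() until it returns True or attempts run out.

    `drill` is assumed to be a PyDrill-like client. This helper is an
    illustration only and is not part of pydrill.
    """
    for attempt in range(attempts):
        if drill.is_active(timeout=timeout):
            return True
        # Sleep only between attempts, not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With a real client this might be called as wait_until_active(PyDrill(host='localhost', port=8047)) before the first query.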

metrics(timeout=10)

Get the current memory metrics.

Parameters

timeout – int

Returns

pydrill.client.Result

options(timeout=10)

List the name, default, and data type of the system and session options.

Parameters

timeout – int

Returns

pydrill.client.Result

perform_request(method, url, params=None, body=None)

plan(sql, timeout=10)

Parameters
  • sql – string

  • timeout – int

Returns

pydrill.client.ResultQuery

profile(query_id, timeout=10)

Get the profile of the query that has the given query ID.

Parameters
  • query_id – The UUID of the query in standard UUID format that Drill assigns to each query.

  • timeout – int

Returns

pydrill.client.Result

profile_cancel(query_id, timeout=10)

Cancel the query that has the given query ID.

Parameters
  • query_id – The UUID of the query in standard UUID format that Drill assigns to each query.

  • timeout – int

Returns

pydrill.client.Result
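
Both profile() and profile_cancel() expect query_id in standard UUID format, so a malformed ID can be rejected locally before any request is sent. The sketch below shows one way to do that with the standard library; is_standard_uuid and cancel_query are hypothetical helpers, not pydrill functions, and drill is assumed to be a PyDrill-like client.

```python
import uuid


def is_standard_uuid(query_id):
    """Return True if query_id is a UUID string in the 8-4-4-4-12 form."""
    try:
        parsed = uuid.UUID(query_id)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, a urn: prefix, and missing hyphens,
    # so compare against the canonical hyphenated form.
    return str(parsed) == query_id.lower()


def cancel_query(drill, query_id, timeout=10):
    """Validate query_id locally, then call profile_cancel on the client.

    Hypothetical wrapper for illustration; not part of pydrill.
    """
    if not is_standard_uuid(query_id):
        raise ValueError("not a standard UUID: %r" % (query_id,))
    return drill.profile_cancel(query_id, timeout=timeout)
```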

profiles(timeout=10)

Get the profiles of running and completed queries.

Parameters

timeout – int

Returns

pydrill.client.Result

query(sql, timeout=10)

Submit a query and return results.

Parameters
  • sql – string

  • timeout – int

Returns

pydrill.client.ResultQuery
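
The sample usage section iterates over the object returned by query() and reads each row by column name. Building on that, here is a small sketch that collects selected columns into tuples. fetch_columns is a hypothetical helper, not part of pydrill, and it assumes only that rows can be iterated and indexed by column name as in the sample.

```python
def fetch_columns(drill, sql, columns, timeout=10):
    """Run sql and return one tuple per row holding the requested columns.

    `drill` is assumed to be a PyDrill-like client whose query() result
    yields rows indexable by column name. Illustration only.
    """
    result = drill.query(sql, timeout=timeout)
    return [tuple(row[name] for name in columns) for row in result]
```

For the Yelp query shown earlier, fetch_columns(drill, sql, ['type', 'date']) would yield the same pairs that the sample prints.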

stats(timeout=10)

Get Drillbit information, such as port numbers.

Parameters

timeout – int

Returns

pydrill.client.Stats

storage(timeout=10)

Get the list of storage plugin names and configurations.

Parameters

timeout – int

Returns

pydrill.client.Result

storage_delete(name, timeout=10)

Delete a storage plugin configuration.

Parameters
  • name – The name of the storage plugin configuration to delete.

  • timeout – int

Returns

pydrill.client.Result

storage_detail(name, timeout=10)

Get the definition of the named storage plugin.

Parameters
  • name – The assigned name in the storage plugin definition.

  • timeout – int

Returns

pydrill.client.Result

storage_enable(name, value=True, timeout=10)

Enable or disable the named storage plugin.

Parameters
  • name – The assigned name in the storage plugin definition.

  • value – Either True (to enable) or False (to disable).

  • timeout – int

Returns

pydrill.client.Result
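
Because storage_enable() takes the target state as value, disabling a plugin temporarily and restoring it afterwards fits naturally into a context manager. The following is a sketch under assumptions: plugin_disabled is a hypothetical helper, not part of pydrill, and it assumes the plugin was enabled beforehand, since it always re-enables on exit.

```python
from contextlib import contextmanager


@contextmanager
def plugin_disabled(drill, name, timeout=10):
    """Disable storage plugin `name` for the duration of a with-block.

    `drill` is assumed to be a PyDrill-like client. Illustration only.
    """
    drill.storage_enable(name, value=False, timeout=timeout)
    try:
        yield
    finally:
        # Re-enable the plugin even if the body raised an exception.
        drill.storage_enable(name, value=True, timeout=timeout)
```

Usage would look like "with plugin_disabled(drill, 'mongo'): ..." where 'mongo' stands in for whatever plugin name is configured.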

storage_update(name, config, timeout=10)

Create or update a storage plugin configuration.

Parameters
  • name – The name of the storage plugin configuration to create or update.

  • config – Overwrites the existing configuration, if there is any, and therefore must include all required attributes and definitions.

  • timeout – int

Returns

pydrill.client.Result
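
Since config replaces the stored configuration wholesale, changing a single setting means sending the complete configuration with that one change applied. One way to prepare such a dictionary is a recursive merge, sketched below. merged_config is a hypothetical helper, not part of pydrill, and the keys in the example are illustrative values only; how the current configuration is obtained and the exact shape storage_update() expects are not specified here.

```python
import copy


def merged_config(existing, changes):
    """Return a new dict: `existing` with `changes` applied recursively.

    Neither input is modified. Illustration only; not part of pydrill.
    """
    merged = copy.deepcopy(existing)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them outright.
            merged[key] = merged_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative configuration; the real keys depend on the plugin.
current = {"type": "file", "workspaces": {"root": {"location": "/", "writable": False}}}
full = merged_config(current, {"workspaces": {"root": {"writable": True}}})
```

The resulting full dictionary keeps every untouched attribute, which is what a complete replacement requires.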

threads(timeout=10)

Get the status of threads.

Parameters

timeout – int

Returns

pydrill.client.Result