Using PL/Python in HAWQ
This section provides an overview of the HAWQ PL/Python procedural language extension.
About HAWQ PL/Python
PL/Python is embedded in your HAWQ product distribution or within your HAWQ build if you chose to enable it as a build option.
With the HAWQ PL/Python extension, you can write user-defined functions in Python that take advantage of Python features and modules, enabling you to quickly build robust HAWQ database applications.
HAWQ uses the system Python installation.
HAWQ PL/Python Limitations
- HAWQ does not support PL/Python trigger functions.
- PL/Python is available only as a HAWQ untrusted language.
Enabling and Removing PL/Python Support
To use PL/Python in HAWQ, you must either install a binary version of HAWQ that includes PL/Python or specify PL/Python as a build option when you compile HAWQ from source.
You must register the PL/Python language with a database before you can create and execute a PL/Python UDF on that database. You must be a database superuser to register and remove new languages in HAWQ databases.
On every database to which you want to install and enable PL/Python:
Connect to the database using the
psqlclient:gpadmin@hawq-node$ psql -d <dbname>Replace <dbname> with the name of the target database.
Run the following SQL command to register the PL/Python procedural language:
dbname=# CREATE LANGUAGE plpythonu;Note:
plpythonuis installed as an untrusted language; it offers no way of restricting what you can program in UDFs created with the language. Creating and executing PL/Python UDFs is permitted only by database superusers and other database users explicitlyGRANTed the permissions.
To remove support for plpythonu from a database, run the following SQL command; you must be a database superuser to remove a registered procedural language:
dbname=# DROP LANGUAGE plpythonu;
Developing Functions with PL/Python
PL/Python functions are defined using the standard SQL CREATE FUNCTION syntax.
The body of a PL/Python user-defined function is a Python script. When the function is called, its arguments are passed as elements of the array args[]. You can also pass named arguments as ordinary variables to the Python script.
PL/Python function results are returned with a return statement, or a yield statement in the case of a result-set statement.
The following PL/Python function computes and returns the maximum of two integers:
=# CREATE FUNCTION mypymax (a integer, b integer)
RETURNS integer
AS $$
if (a is None) or (b is None):
return None
if a > b:
return a
return b
$$ LANGUAGE plpythonu;
To execute the mypymax function:
=# SELECT mypymax(5, 7);
mypymax
---------
7
(1 row)
Adding the STRICT keyword to the LANGUAGE subclause instructs HAWQ to return null when any of the input arguments are null. When created as STRICT, the function itself need not perform null checks.
The following example uses an unnamed argument, the built-in Python max() function, and the STRICT keyword to create a UDF named mypymax2:
=# CREATE FUNCTION mypymax2 (a integer, integer)
RETURNS integer AS $$
return max(a, args[0])
$$ LANGUAGE plpythonu STRICT;
=# SELECT mypymax(5, 3);
mypymax2
----------
5
(1 row)
=# SELECT mypymax(5, null);
mypymax2
----------
(1 row)
Creating the Sample Data
Perform the following steps to create, and insert data into, a simple table. This table will be used in later exercises.
Create a database named
testdb:gpadmin@hawq-node$ createdb testdbCreate a table named
sales:gpadmin@hawq-node$ psql -d testdbtestdb=> CREATE TABLE sales (id int, year int, qtr int, day int, region text) DISTRIBUTED BY (id);Insert data into the table:
testdb=> INSERT INTO sales VALUES (1, 2014, 1,1, 'usa'), (2, 2002, 2,2, 'europe'), (3, 2014, 3,3, 'asia'), (4, 2014, 4,4, 'usa'), (5, 2014, 1,5, 'europe'), (6, 2014, 2,6, 'asia'), (7, 2002, 3,7, 'usa') ;
Python Modules
A Python module is a text file containing Python statements and definitions. Python modules are named, with the file name for a module following the <python-module-name>.py naming convention.
Should you need to build a Python module, ensure that the appropriate software is installed on the build system. Also be sure that you are building for the correct deployment architecture, i.e. 64-bit.
HAWQ Considerations
When installing a Python module in HAWQ, you must add the module to all segment nodes in the cluster. You must also add all Python modules to any new segment hosts when you expand your HAWQ cluster.
PL/Python supports the built-in HAWQ Python module named plpy. You can also install 3rd party Python modules.
plpy Module
The HAWQ PL/Python procedural language extension automatically imports the Python module plpy. plpy implements functions to execute SQL queries and prepare execution plans for queries. The plpy module also includes functions to manage errors and messages.
Executing and Preparing SQL Queries
Use the PL/Python plpy module plpy.execute() function to execute a SQL query. Use the plpy.prepare() function to prepare an execution plan for a query. Preparing the execution plan for a query is useful if you want to run the query from multiple Python functions.
plpy.execute()
Invoking plpy.execute() with a query string and an optional limit argument runs the query, returning the result in a Python result object. This result object:
- emulates a list or dictionary object
- returns rows that can be accessed by row number and column name; row numbering starts with 0 (zero)
- can be modified
- includes an
nrows()method that returns the number of rows returned by the query - includes a
status()method that returns theSPI_execute()return value
For example, the following Python statement when present in a PL/Python user-defined function will execute a SELECT * FROM mytable query:
rv = plpy.execute("SELECT * FROM my_table", 3)
As instructed by the limit argument 3, the plpy.execute function will return up to 3 rows from my_table. The result set is stored in the rv object.
Access specific columns in the table by name. For example, if my_table has a column named my_column:
my_col_data = rv[i]["my_column"]
You specified that the function return a maximum of 3 rows in the plpy.execute() command above. As such, the index i used to access the result value rv must specify an integer between 0 and 2, inclusive.
Example: plpy.execute()
Example: Use plpy.execute() to run a similar query on the sales table you created in an earlier section:
Define a PL/Python UDF that executes a query to return at most 5 rows from the
salestable:=# CREATE OR REPLACE FUNCTION mypytest(a integer) RETURNS text AS $$ rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5) region = rv[a-1]["region"] return region $$ LANGUAGE plpythonu;When executed, this UDF returns the
regionvalue from theididentified by the input valuea. Since row numbering of the result set starts at 0, you must access the result set with indexa - 1.Specifying the
ORDER BY idclause in theSELECTstatement ensures that subsequent invocations ofmypytestwith the same input argument will return identical result sets.Run
mypytestwith an argument identifyingid3:=# SELECT mypytest(3); mypytest ---------- asia (1 row)Recall that the row numbering starts from 0 in a Python returned result set. The valid input argument for the
mypytest2function is an integer between 0 and 4, inclusive.The query returns the
regionfrom the row withid = 3,asia.
Note: This example demonstrates some of the concepts discussed previously. It may not be the ideal way to return a specific column value.
plpy.prepare()
The function plpy.prepare() prepares the execution plan for a query. Preparing the execution plan for a query is useful if you plan to run the query from multiple Python functions.
You invoke plpy.prepare() with a query string. Also include a list of parameter types if you are using parameter references in the query. For example, the following statement in a PL/Python user-defined function returns the execution plan for a query:
plan = plpy.prepare("SELECT * FROM sales ORDER BY id WHERE
region = $1", [ "text" ])
The string text identifies the data type of the variable $1.
After preparing an execution plan, you use the function plpy.execute() to run it. For example:
rv = plpy.execute(plan, [ "usa" ])
When executed, rv will include all rows in the sales table where region = usa.
Read on for a description of how one passes data between PL/Python function calls.
Saving Execution Plans
When you prepare an execution plan using the PL/Python module, the plan is automatically saved. See the Postgres Server Programming Interface (SPI) documentation for information about execution plans.
To make effective use of saved plans across function calls, you use one of the Python persistent storage dictionaries, SD or GD.
The global dictionary SD is available to store data between function calls. This variable is private static data. The global dictionary GD is public data, and is available to all Python functions within a session. Use GD with care.
Each function gets its own execution environment in the Python interpreter, so that global data and function arguments from myfunc1 are not available to myfunc2. The exception is the data in the GD dictionary, as mentioned previously.
This example saves an execution plan to the SD dictionary and then executes the plan:
=# CREATE FUNCTION usesavedplan() RETURNS text AS $$
select1plan = plpy.prepare("SELECT region FROM sales WHERE id=1")
SD["s1plan"] = select1plan
# other function processing
# execute the saved plan
rv = plpy.execute(SD["s1plan"])
return rv[0]["region"]
$$ LANGUAGE plpythonu;
=# SELECT usesavedplan();
Example: plpy.prepare()
Example: Use plpy.prepare() and plpy.execute() to prepare and run an execution plan using the GD dictionary:
Define a PL/Python UDF to prepare and save an execution plan to the GD. Also return the name of the plan:
=# CREATE OR REPLACE FUNCTION mypy_prepplan() RETURNS text AS $$ plan = plpy.prepare("SELECT * FROM sales WHERE region = $1 ORDER BY id", [ "text" ]) GD["getregionplan"] = plan return "getregionplan" $$ LANGUAGE plpythonu;This UDF, when run, will return the name (key) of the execution plan generated from the
plpy.prepare()call.Define a PL/Python UDF to run the execution plan; this function will take the plan name and
regionname as an input:=# CREATE OR REPLACE FUNCTION mypy_execplan(planname text, regionname text) RETURNS integer AS $$ rv = plpy.execute(GD[planname], [ regionname ], 5) year = rv[0]["year"] return year $$ LANGUAGE plpythonu STRICT;This UDF executes the
plannameplan that was previously saved to the GD. You will callmypy_execplan()with theplannamereturned from theplpy.prepare()call.Execute the
mypy_prepplan()andmypy_execplan()UDFs, passingregionusa:=# SELECT mypy_execplan( mypy_prepplan(), 'usa' ); mypy_execplan --------------- 2014 (1 row)
Handling Python Errors and Messages
The plpy module implements the following message- and error-related functions, each of which takes a message string as an argument:
plpy.debug(msg)plpy.log(msg)plpy.info(msg)plpy.notice(msg)plpy.warning(msg)plpy.error(msg)plpy.fatal(msg)
plpy.error() and plpy.fatal() raise a Python exception which, if uncaught, propagates out to the calling query, possibly aborting the current transaction or subtransaction. raise plpy.ERROR(msg) and raise plpy.FATAL(msg) are equivalent to calling plpy.error() and plpy.fatal(), respectively. Use the other message functions to generate messages of different priority levels.
Messages may be reported to the client and/or written to the HAWQ server log file. The HAWQ server configuration parameters log_min_messages and client_min_messages control where messages are reported.
Example: Generating Messages
In this example, you will create a PL/Python UDF that includes some debug log messages. You will also configure your psql session to enable debug-level client logging.
Define a PL/Python UDF that executes a query that will return at most 5 rows from the
salestable. Invoke theplpy.debug()method to display some additional information:=# CREATE OR REPLACE FUNCTION mypytest_debug(a integer) RETURNS text AS $$ plpy.debug('mypytest_debug executing query: SELECT * FROM sales ORDER BY id') rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5) plpy.debug('mypytest_debug: query returned ' + str(rv.nrows()) + ' rows') region = rv[a]["region"] return region $$ LANGUAGE plpythonu;Execute the
mypytest_debug()UDF, passing the integer2as an argument:=# SELECT mypytest_debug(2); mypytest_debug ---------------- asia (1 row)Enable
DEBUG2level client logging:=# SET client_min_messages=DEBUG2;Execute the
mypytest_debug()UDF again:=# SELECT mypytest_debug(2); ... DEBUG2: mypytest_debug executing query: SELECT * FROM sales ORDER BY id ... DEBUG2: mypytest_debug: query returned 5 rows ...Debug output is very verbose. You will parse a lot of output to find the
mypytest_debugmessages. Hint: look both near the start and end of the output.Turn off client-level debug logging:
=# SET client_min_messages=NOTICE;
3rd-Party Python Modules
PL/Python supports installation and use of 3rd-party Python Modules. This section includes examples for installing the setuptools and NumPy Python modules.
Note: You must have superuser privileges to install Python modules to the system Python directories.
Example: Installing setuptools
In this example, you will manually install the Python setuptools module from the Python Package Index repository. setuptools enables you to easily download, build, install, upgrade, and uninstall Python packages.
You will first build the module from the downloaded package, installing it on a single host. You will then build and install the module on all segment nodes in your HAWQ cluster.
Download the
setuptoolsmodule package from the Python Package Index site. For example, run thiswgetcommand on a HAWQ node as thegpadminuser:$ ssh gpadmin@<hawq-node> gpadmin@hawq-node$ . /usr/local/hawq/greenplum_path.sh gpadmin@hawq-node$ mkdir plpython_pkgs gpadmin@hawq-node$ cd plpython_pkgs gpadmin@hawq-node$ export PLPYPKGDIR=`pwd` gpadmin@hawq-node$ wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-18.4.tar.gzExtract the files from the
tar.gzpackage:gpadmin@hawq-node$ tar -xzvf setuptools-18.4.tar.gzRun the Python scripts to build and install the Python package; you must have superuser privileges to install Python modules to the system Python installation:
gpadmin@hawq-node$ cd setuptools-18.4 gpadmin@hawq-node$ python setup.py build gpadmin@hawq-node$ sudo python setup.py installRun the following command to verify the module is available to Python:
gpadmin@hawq-node$ python -c "import setuptools"If no error is returned, the
setuptoolsmodule was successfully imported.The
setuptoolspackage installs theeasy_installutility. This utility enables you to install Python packages from the Python Package Index repository. For example, this command installs the Pythonpiputility from the Python Package Index site:gpadmin@hawq-node$ sudo easy_install pipCopy the
setuptoolspackage to all HAWQ nodes in your cluster. For example, this command copies thetar.gzfile from the current host to the host systems listed in the filehawq-hosts:gpadmin@hawq-node$ cd $PLPYPKGDIR gpadmin@hawq-node$ hawq scp -f hawq-hosts setuptools-18.4.tar.gz =:/home/gpadminRun the commands to build, install, and test the
setuptoolspackage you just copied to all hosts in your HAWQ cluster. For example:gpadmin@hawq-node$ hawq ssh -f hawq-hosts >>> mkdir plpython_pkgs >>> cd plpython_pkgs >>> tar -xzvf ../setuptools-18.4.tar.gz >>> cd setuptools-18.4 >>> python setup.py build >>> sudo python setup.py install >>> python -c "import setuptools" >>> exit
Example: Installing NumPy
In this example, you will build and install the Python module NumPy. NumPy is a module for scientific computing with Python. For additional information about NumPy, refer to http://www.numpy.org/.
This example assumes yum is installed on all HAWQ segment nodes and that the gpadmin user is a member of sudoers with root privileges on the nodes.
Prerequisites
Building the NumPy package requires the following software:
- OpenBLAS libraries - an open source implementation of BLAS (Basic Linear Algebra Subprograms)
- Python development packages - python-devel
- gcc compilers - gcc, gcc-gfortran, and gcc-c++
Perform the following steps to set up the OpenBLAS compilation environment on each HAWQ node:
Use
yumto install gcc compilers from system repositories. The compilers are required on all hosts where you compile OpenBLAS. For example:root@hawq-node$ yum -y install gcc gcc-gfortran gcc-c++ python-devel(Optionally required) If you cannot install the correct compiler versions with
yum, you have the option to download the gcc compilers, includinggfortran, from source and build and install them manually. Refer to Building gfortran from Source forgfortranbuild and install information.Create a symbolic link to
g++, naming itgxx:root@hawq-node$ ln -s /usr/bin/g++ /usr/bin/gxxYou may also need to create symbolic links to any libraries that have different versions available; for example, linking
libppl_c.so.4tolibppl_c.so.2.You can use the
hawq scputility to copy files to HAWQ hosts and thehawq sshutility to run commands on those hosts.
Obtaining Packages
Perform the following steps to download and distribute the OpenBLAS and NumPy source packages:
Download the OpenBLAS and NumPy source files. For example, these
wgetcommands downloadtar.gzfiles into apackagesdirectory in the current working directory:$ ssh gpadmin@<hawq-node> gpadmin@hawq-node$ wget --directory-prefix=packages http://github.com/xianyi/OpenBLAS/tarball/v0.2.8 gpadmin@hawq-node$ wget --directory-prefix=packages http://sourceforge.net/projects/numpy/files/NumPy/1.8.0/numpy-1.8.0.tar.gz/downloadDistribute the software to all nodes in your HAWQ cluster. For example, if you downloaded the software to
/home/gpadmin/packages, these commands create thepackagesdirectory on all nodes and copies the software to the nodes listed in thehawq-hostsfile:gpadmin@hawq-node$ hawq ssh -f hawq-hosts mkdir packages gpadmin@hawq-node$ hawq scp -f hawq-hosts packages/* =:/home/gpadmin/packages
Build and Install OpenBLAS Libraries
Before building and installing the NumPy module, you must first build and install the OpenBLAS libraries. This section describes how to build and install the libraries on a single HAWQ node.
Extract the OpenBLAS files from the file:
$ ssh gpadmin@<hawq-node> gpadmin@hawq-node$ cd packages gpadmin@hawq-node$ tar xzf v0.2.8 -C /home/gpadmin/packages gpadmin@hawq-node$ mv /home/gpadmin/packages/xianyi-OpenBLAS-9c51cdf /home/gpadmin/packages/OpenBLASThese commands extract the OpenBLAS tar file and simplify the unpacked directory name.
Compile OpenBLAS. You must set the
LIBRARY_PATHenvironment variable to the current$LD_LIBRARY_PATH. For example:gpadmin@hawq-node$ cd OpenBLAS gpadmin@hawq-node$ export LIBRARY_PATH=$LD_LIBRARY_PATH gpadmin@hawq-node$ make FC=gfortran USE_THREAD=0 TARGET=SANDYBRIDGEReplace the
TARGETargument with the target appropriate for your hardware. TheTargetList.txtfile identifies the list of supported OpenBLAS targets.Compiling OpenBLAS make take some time.
Install the OpenBLAS libraries in
/usr/localand then change the owner of the files togpadmin. You must haverootprivileges. For example:gpadmin@hawq-node$ sudo make PREFIX=/usr/local install gpadmin@hawq-node$ sudo ldconfig gpadmin@hawq-node$ sudo chown -R gpadmin /usr/local/libThe following libraries are installed to
/usr/local/lib, along with symbolic links:gpadmin@hawq-node$ ls -l gpadmin@hawq-node$ ... libopenblas.a -> libopenblas_sandybridge-r0.2.8.a libopenblas_sandybridge-r0.2.8.a libopenblas_sandybridge-r0.2.8.so libopenblas.so -> libopenblas_sandybridge-r0.2.8.so libopenblas.so.0 -> libopenblas_sandybridge-r0.2.8.so ...Install the OpenBLAS libraries on all nodes in your HAWQ cluster. You can use the
hawq sshutility to similarly build and install the OpenBLAS libraries on each of the nodes.Or, you may choose to copy the OpenBLAS libraries you just built to all of the HAWQ cluster nodes. For example, these
hawq sshandhawq scpcommands install prerequisite packages, and copy and install the OpenBLAS libraries on the hosts listed in thehawq-hostsfile.$ hawq ssh -f hawq-hosts -e 'sudo yum -y install gcc gcc-gfortran gcc-c++ python-devel' $ hawq ssh -f hawq-hosts -e 'ln -s /usr/bin/g++ /usr/bin/gxx' $ hawq ssh -f hawq-hosts -e sudo chown gpadmin /usr/local/lib $ hawq scp -f hawq-hosts /usr/local/lib/libopen*sandy* =:/usr/local/lib$ hawq ssh -f hawq-hosts >>> cd /usr/local/lib >>> ln -s libopenblas_sandybridge-r0.2.8.a libopenblas.a >>> ln -s libopenblas_sandybridge-r0.2.8.so libopenblas.so >>> ln -s libopenblas_sandybridge-r0.2.8.so libopenblas.so.0 >>> sudo ldconfig
Build and Install NumPy
After you have installed the OpenBLAS libraries, you can build and install NumPy module. These steps install the NumPy module on a single host. You can use the hawq ssh utility to build and install the NumPy module on multiple hosts.
Extract the NumPy module source files:
gpadmin@hawq-node$ cd /home/gpadmin/packages gpadmin@hawq-node$ tar xzf numpy-1.8.0.tar.gzUnpacking the
numpy-1.8.0.tar.gzfile creates a directory namednumpy-1.8.0in the current directory.Set up the environment for building and installing NumPy:
gpadmin@hawq-node$ export BLAS=/usr/local/lib/libopenblas.a gpadmin@hawq-node$ export LAPACK=/usr/local/lib/libopenblas.a gpadmin@hawq-node$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ gpadmin@hawq-node$ export LIBRARY_PATH=$LD_LIBRARY_PATHBuild and install NumPy. (Building the NumPy package might take some time.)
gpadmin@hawq-node$ cd numpy-1.8.0 gpadmin@hawq-node$ python setup.py build gpadmin@hawq-node$ sudo python setup.py installNote: If the NumPy module did not successfully build, the NumPy build process might need a
site.cfgfile that specifies the location of the OpenBLAS libraries. Create thesite.cfgfile in the NumPy package directory:gpadmin@hawq-node$ touch site.cfgAdd the following to the
site.cfgfile and run the NumPy build command again:[default] library_dirs = /usr/local/lib [atlas] atlas_libs = openblas library_dirs = /usr/local/lib [lapack] lapack_libs = openblas library_dirs = /usr/local/lib # added for scikit-learn [openblas] libraries = openblas library_dirs = /usr/local/lib include_dirs = /usr/local/includeVerify that the NumPy module is available for import by Python:
gpadmin@hawq-node$ cd $HOME gpadmin@hawq-node$ python -c "import numpy"If no error is returned, the NumPy module was successfully imported.
As performed in the
setuptoolsPython module installation, use thehawq sshutility to build, install, and test the NumPy module on all HAWQ nodes.The environment variables that were required to build the NumPy module are also required in the
gpadminruntime environment to run Python NumPy functions. You can use theechocommand to add the environment variables togpadmin’s.bashrcfile. For example, the followingechocommands add the environment variables to the.bashrcfile ingpadmin’s home directory:$ echo -e '\n#Needed for NumPy' >> ~/.bashrc $ echo -e 'export BLAS=/usr/local/lib/libopenblas.a' >> ~/.bashrc $ echo -e 'export LAPACK=/usr/local/lib/libopenblas.a' >> ~/.bashrc $ echo -e 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib' >> ~/.bashrc $ echo -e 'export LIBRARY_PATH=$LD_LIBRARY_PATH' >> ~/.bashrcYou can use the
hawq sshutility with theseechocommands to add the environment variables to the.bashrcfile on all nodes in your HAWQ cluster.
Testing Installed Python Modules
You can create a simple PL/Python user-defined function (UDF) to validate that a Python module is available in HAWQ. This example tests the NumPy module.
Create a PL/Python UDF that imports the NumPy module:
gpadmin@hawq_node$ psql -d testdb=# CREATE OR REPLACE FUNCTION test_importnumpy(x int) RETURNS text AS $$ try: from numpy import * return 'SUCCESS' except ImportError, e: return 'FAILURE' $$ LANGUAGE plpythonu;The function returns SUCCESS if the module is imported, and FAILURE if an import error occurs.
Create a table that loads data on each HAWQ segment instance:
=> CREATE TABLE disttbl AS (SELECT x FROM generate_series(1,50) x ) DISTRIBUTED BY (x);Depending upon the size of your HAWQ installation, you may need to generate a larger series to ensure data is distributed to all segment instances.
Run the UDF on the segment nodes where data is stored in the primary segment instances.
=# SELECT gp_segment_id, test_importnumpy(1) AS status FROM disttbl GROUP BY gp_segment_id, status ORDER BY gp_segment_id, status;The
SELECTcommand returns SUCCESS if the UDF imported the Python module on the HAWQ segment instance. FAILURE is returned if the Python module could not be imported.
Troubleshooting Python Module Import Failures
Possible causes of a Python module import failure include:
A problem accessing required libraries. For the NumPy example, HAWQ might have a problem accessing the OpenBLAS libraries or the Python libraries on a segment host.
Try: Test importing the module on the segment host. This
hawq sshcommand tests importing the NumPy module on the segment host named mdw1.gpadmin@hawq-node$ hawq ssh -h mdw1 python -c "import numpy"Environment variables may not be configured in the HAWQ environment. The Python import command may not return an error in this case.
Try: Ensure that the environment variables are properly set. For the NumPy example, ensure that the environment variables listed at the end of the section Build and Install NumPy are defined in the
.bashrcfile for thegpadminuser on the master and all segment nodes.Note: The
.bashrcfile for thegpadminuser on the HAWQ master and all segment nodes must source thegreenplum_path.shfile.HAWQ might not have been restarted after adding environment variable settings to the
.bashrcfile. Again, the Python import command may not return an error in this case.Try: Ensure that you have restarted HAWQ.
gpadmin@master$ hawq restart cluster
Using the GD Dictionary to Improve PL/Python Performance
Importing a Python module is an expensive operation that can adversely affect performance. If you are importing the same module frequently, you can use Python global variables to import the module on the first invocation and forego loading the module on subsequent imports.
The following PL/Python function uses the GD persistent storage dictionary to avoid importing the module NumPy if it has already been imported in the GD. The UDF includes a call to plpy.notice() to display a message when importing the module.
=# CREATE FUNCTION mypy_import2gd() RETURNS text AS $$
if 'numpy' not in GD:
plpy.notice('mypy_import2gd: importing module numpy')
import numpy
GD['numpy'] = numpy
return 'numpy'
$$ LANGUAGE plpythonu;
=# SELECT mypy_import2gd();
NOTICE: mypy_import2gd: importing module numpy
CONTEXT: PL/Python function "mypy_import2gd"
mypy_import2gd
----------------
numpy
(1 row)
=# SELECT mypy_import2gd();
mypy_import2gd
----------------
numpy
(1 row)
The second SELECT call does not include the NOTICE message, indicating that the module was obtained from the GD.
References
This section lists references for using PL/Python.
Technical References
For information about PL/Python in HAWQ, see the PL/Python - Python Procedural Language PostgreSQL documentation.
For information about Python Package Index (PyPI), refer to PyPI - the Python Package Index.
The following Python modules may be of interest:
SciPy library provides user-friendly and efficient numerical routines including those for numerical integration and optimization. To download the SciPy package tar file:
hawq-node$ wget http://sourceforge.net/projects/scipy/files/scipy/0.10.1/scipy-0.10.1.tar.gzNatural Language Toolkit (
nltk) is a platform for building Python programs to work with human language data.The Python
distributepackage is required fornltk. Thedistributepackage should be installed before installingntlk. To download thedistributepackage tar file:hawq-node$ wget http://pypi.python.org/packages/source/d/distribute/distribute-0.6.21.tar.gzTo download the
nltkpackage tar file:hawq-node$ wget http://pypi.python.org/packages/source/n/nltk/nltk-2.0.2.tar.gz#md5=6e714ff74c3398e88be084748df4e657
Useful Reading
For information about the Python language, see http://www.python.org/.
A set of slides that were used in a talk about how the Pivotal Data Science team uses the PyData stack in the Pivotal MPP databases and on Pivotal Cloud Foundry http://www.slideshare.net/SrivatsanRamanujam/all-thingspythonpivotal.