
openGauss database source code analysis series articles - AI DeepSQL

2021-10-14 06:28:22 Gauss Squirrel Club

Hello everyone, I am the Gauss Squirrel Club. Welcome to study with us!

The previous article introduced section 8.6, AI query time prediction. This article covers the highlights of "8.7 DeepSQL" and "8.8 Summary".

8.7 DeepSQL

The features described so far belong to the AI4DB field. There is another broad direction for combining AI with databases, namely DB4AI. In this chapter, we introduce the DB4AI capabilities of openGauss and explore a new approach to efficiently driving AI tasks through the database.

Usage scenarios

Implementing DB4AI in a database means realizing AI algorithms inside the database itself, to better support fast analysis and computation over big data. openGauss currently delivers its DB4AI capability through the DeepSQL feature: a complete set of SQL-based machine learning, data mining, and statistical algorithms, so that users can carry out machine learning with SQL statements. DeepSQL abstracts the end-to-end development workflow from data to model. Together with the underlying compute engine and automatic database optimization, it lets users with basic SQL knowledge complete the training and prediction of most machine learning models. The entire analysis pipeline runs inside the database engine, so users can analyze and process the data where it lives. There is no need to transfer data between the database and other platforms, which avoids unnecessary data movement across multiple environments and also consolidates an otherwise fragmented data development technology stack.

Existing technology

Academia and industry have already produced many results in the DB4AI direction. Many traditional commercial relational databases support DB4AI capabilities: built-in AI components adapt to the data and environment inside the database and can process the stored data directly, minimizing the cost of data movement. Many cloud databases and cloud data analysis platforms also provide DB4AI capabilities, often with Python, R, and other interfaces so that data analysts can get started quickly.
The DB4AI field also has excellent open-source software, for example the Apache top-level project MADlib. It is compatible with the PostgreSQL database, and many databases developed from the PostgreSQL code base can adapt it easily. MADlib provides statistical and machine learning methods for structured and unstructured data, and uses aggregate functions to achieve parallel computation on distributed databases. MADlib supports a wide range of machine learning and data mining algorithms, such as regression, classification, clustering, statistics, and graph algorithms, more than 70 in total; as of the current 1.17 release, MADlib also supports deep learning. MADlib uses a SQL-like syntax as its external interface and integrates AI tasks into the database by creating UDFs (user-defined functions).
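MADlib's aggregate-function approach can be sketched in a few lines of plain Python. The transition/merge/final structure below mirrors how a SQL aggregate accumulates per-partition state and combines partial states; the function names and the one-feature least-squares model are illustrative assumptions, not MADlib code.

```python
# MADlib-style training as an aggregate: a transition function folds rows
# into a small state, states from different partitions merge, and a final
# function produces the model. Shown for one-feature least squares.

def transition(state, x, y):
    # Accumulate the sufficient statistics for OLS: n, Σx, Σy, Σx², Σxy.
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def merge(a, b):
    # Combining two partial states is what enables distributed parallelism.
    return tuple(u + v for u, v in zip(a, b))

def final(state):
    # Solve the normal equations for slope and intercept.
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # y = 2x + 1
zero = (0, 0.0, 0.0, 0.0, 0.0)

s1, s2 = zero, zero            # two "parallel" partitions
for x, y in rows[:2]:
    s1 = transition(s1, x, y)
for x, y in rows[2:]:
    s2 = transition(s2, x, y)
slope, intercept = final(merge(s1, s2))
print(slope, intercept)  # → 2.0 1.0
```

Because the merge step is associative, the same rows can be split across any number of workers and still yield the identical model, which is exactly why the aggregate formulation parallelizes well.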
The DB4AI module of openGauss is compatible with open-source MADlib. It adapts and enhances the original MADlib code, and performs better than MADlib running on PostgreSQL. On top of the MADlib framework, openGauss has added other industrial-grade, commonly used algorithms, such as XGBoost, Prophet, and GBDT, as well as recommendation systems. openGauss also has native AI execution plans and execution operators; that part of the feature set will be open-sourced in subsequent releases. This chapter therefore mainly introduces how openGauss achieves compatibility with MADlib.

Key source code analysis

1. MADlib project structure

The file structure of MADlib and a description of each part are shown in Table 8-16. The MADlib code can be obtained from its official website: https://madlib.apache.org/.

Table 8-16 The main file structure of MADlib

| File structure | Subdirectory | Description |
| --- | --- | --- |
| cmake | - | CMake-related files |
| | /array_ops | Array operation module |
| | /kmeans | k-means related modules |
| | /sketch | Word-frequency statistics (sketch) modules |
| | /stemmer | Stemming-related modules |
| | /svec | Sparse vector module |
| | /svec_util | Sparse vector utility modules |
| | /utils | Other common modules |
| src/bin | - | Tool module for installation, uninstallation, deployment, etc. |
| src/bin/madpack | - | Database interaction module |
| src/dbal | - | Database abstraction layer |
| src/libstemmer | - | Stemming-related modules |
| src/madpack | - | Tool dependency files and common modules |
| src/modules | - | Core algorithm implementations: |
| | /assoc_rules | Association rule algorithms |
| | /convex | Implementation of convex algorithms |
| | /crf | Conditional random field algorithms |
| | /elastic_net | Elastic net algorithm |
| | /glm | Generalized linear models |
| | /lda | Latent Dirichlet allocation |
| | /linalg | Linear algebra operations |
| | /linear_systems | Linear systems module |
| | /prob | Probability module |
| | /recursive_partitioning | Decision trees and random forests |
| | /regress | Regression algorithms |
| | /sample | Sampling module |
| | /stats | Mathematical statistics module |
| | /utilities | Other utility modules |
| src/ports | - | Platform-related interfaces (pg, gaussdb) |
| src/ports/postgres | - | Algorithms adapted for PostgreSQL-family systems: |
| | /dbconnector | Interface connecting to the database |
| | /modules | Algorithm modules, including association rules |
| | /modules/bayes | Bayesian algorithms |
| | /modules/conjugate_gradient | Conjugate gradient method |
| | /modules/convex | Convex algorithms, including the multi-layer perceptron |
| | /modules/crf | Conditional random fields |
| | /modules/elastic_net | Elastic net |
| | /modules/gbdt | GBDT algorithm |
| | /modules/glm | Generalized linear models |
| | /modules/graph | Graph models |
| | /modules/kmeans | k-means algorithm |
| | /modules/knn | kNN algorithm |
| | /modules/lda | Latent Dirichlet allocation |
| | /modules/linalg | Linear algebra operations |
| | /modules/linear_systems | Linear systems module |
| | /modules/pca | PCA dimensionality reduction |
| | /modules/prob | Probability module |
| | /modules/recursive_partitioning | Decision trees and random forests |
| | /modules/regress | Regression algorithms |
| | /modules/sample | Sampling module |
| | /modules/stats | Mathematical statistics module |
| | /modules/summary | Summary functions for descriptive statistics |
| | /modules/svm | SVM algorithm |
| | /modules/tsa | Time series analysis |
| | /modules/validation | Cross validation |
| | /modules/xgboost_gs | XGBoost algorithm |
| src/utils | - | General utility code |
 

2. Execution process of MADlib on openGauss

By calling UDFs, models can be trained and used for prediction; the relevant results are saved in tables stored in the database. Taking the training process as an example, the overall flow of MADlib executing on openGauss is shown in Figure 8-22.

Figure 8-22 Flow of training a model with MADlib on openGauss
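The flow in Figure 8-22, where one SQL call invokes a UDF, training runs next to the data, and the fitted model lands in a table, can be imitated with any engine that supports user-defined functions. Below is a hedged sketch using Python's built-in sqlite3 as a stand-in; the table and function names (houses, model, train_mean) are invented for illustration and are not openGauss or MADlib APIs.

```python
import sqlite3

# Sketch of the "call a UDF, model lands in a table" flow from Figure
# 8-22, using SQLite's user-defined aggregates as a stand-in for
# openGauss/MADlib. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER, price REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?)",
                 [(1, 50000.0), (2, 85000.0), (3, 22500.0)])
conn.execute("CREATE TABLE model (name TEXT, value REAL)")

class MeanModel:
    # The aggregate plays the role of the training UDF: it folds over the
    # rows inside the engine and emits the fitted parameter.
    def __init__(self):
        self.vals = []
    def step(self, v):
        self.vals.append(v)
    def finalize(self):
        return sum(self.vals) / len(self.vals)

conn.create_aggregate("train_mean", 1, MeanModel)

# A single SQL statement "trains" and persists the model into a table.
conn.execute("INSERT INTO model SELECT 'mean_price', train_mean(price) FROM houses")
stored = conn.execute("SELECT value FROM model WHERE name = 'mean_price'").fetchone()[0]
print(stored)  # → 52500.0
```

The key point of the flow is that the data never leaves the engine: the "model" is computed and stored by SQL statements alone, which is the same property MADlib's UDF-based training relies on.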

Extending algorithms on the MADlib framework

The previous section showed the functions of each MADlib module; structurally, users can extend the framework with their own algorithms. The XGBoost, GBDT, and Prophet algorithms mentioned above are our extensions on top of the original code. This section takes the GBDT module as an example to introduce how to extend an algorithm on the MADlib framework.

The GBDT file structure is shown in Table 8-17.

Table 8-17 The main file structure of the GBDT algorithm

| File structure | Description |
| --- | --- |
| gbdt/gbdt.py_in | Python code |
| gbdt/gbdt.sql_in | Stored procedure code |
| gbdt/test/gbdt.sql | Test code |

The sql_in file defines the upper-layer SQL-like interface, implemented with PL/pgSQL or PL/Python.
UDFs are defined at the SQL layer. The following code implements something similar to function overloading:

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT,
    list_of_features_to_exclude TEXT,
    weights                     TEXT
)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.gbdt_train($1, $2, $3, $4, $5, $6, $7, 30::INTEGER);
$$ LANGUAGE sql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT,
    list_of_features_to_exclude TEXT
)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.gbdt_train($1, $2, $3, $4, $5, $6, NULL::TEXT);
$$ LANGUAGE sql VOLATILE;

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT
)
RETURNS VOID AS $$
    SELECT MADLIB_SCHEMA.gbdt_train($1, $2, $3, $4, $5, NULL::TEXT);
$$ LANGUAGE sql VOLATILE;

Here, the necessary information such as the input table, output table, and features must be specified by the user. The other parameters have defaults; for example, if the user does not pass a custom value for the weights parameter, the procedure runs with its default.
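In Python terms, the overload chain above behaves like the sketch below: each shorter signature fills in one default and delegates to a fuller one. The function names are hypothetical stand-ins for the SQL UDFs (the real chain ends in the 19-parameter PL/Python function).

```python
# Sketch of the SQL overload chain: each shorter "signature" supplies one
# default and delegates to the next, so the full-argument function is the
# only real entry point. Names are stand-ins for the UDFs above.

def gbdt_train_full(training_table, output_table, id_col, dep_var,
                    features, features_to_exclude, weights, num_trees):
    # Stand-in for the real trainer: just report the resolved arguments.
    return {"weights": weights, "num_trees": num_trees}

def gbdt_train_7(training_table, output_table, id_col, dep_var,
                 features, features_to_exclude, weights):
    # Mirrors SQL overload #1: defaults num_trees to 30.
    return gbdt_train_full(training_table, output_table, id_col, dep_var,
                           features, features_to_exclude, weights, 30)

def gbdt_train_6(training_table, output_table, id_col, dep_var,
                 features, features_to_exclude):
    # Mirrors SQL overload #2: defaults weights to NULL (None).
    return gbdt_train_7(training_table, output_table, id_col, dep_var,
                        features, features_to_exclude, None)

resolved = gbdt_train_6("houses", "houses_out", "id", "price", "tax,size", "")
print(resolved)  # → {'weights': None, 'num_trees': 30}
```

SQL has no default parameter values in `CREATE FUNCTION` the way Python does, which is why the sql_in file spells out one overload per omitted argument.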
At the SQL layer, the PL/Python interface is defined as follows:

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.gbdt_train(
    training_table_name         TEXT,
    output_table_name           TEXT,
    id_col_name                 TEXT,
    dependent_variable          TEXT,
    list_of_features            TEXT,
    list_of_features_to_exclude TEXT,
    weights                     TEXT,
    num_trees                   INTEGER,
    num_random_features         INTEGER,
    max_tree_depth              INTEGER,
    min_split                   INTEGER,
    min_bucket                  INTEGER,
    num_bins                    INTEGER,
    null_handling_params        TEXT,
    is_classification           BOOLEAN,
    predict_dt_prob             TEXT,
    learning_rate               DOUBLE PRECISION,
    verbose                     BOOLEAN,
    sample_ratio                DOUBLE PRECISION
)
RETURNS VOID AS $$
PythonFunction(gbdt, gbdt, gbdt_fit)
$$ LANGUAGE plpythonu VOLATILE;

The PL/pgSQL or SQL function eventually calls a PL/Python function.
"PythonFunction(gbdt, gbdt, gbdt_fit)" is a fixed usage: it is an encapsulated m4 macro that is expanded during compilation and installation.
In PythonFunction, the first parameter is the folder name, the second is the file name, and the third is the function name. The macro is replaced with the statement "from gbdt.gbdt import gbdt_fit", so make sure the file path and function name are correct.
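The effect of this expansion can be imitated with a one-line substitution. The sketch below is a simplified assumption of what happens at install time; the real m4 macro also generates the surrounding call plumbing.

```python
import re

# Simplified sketch of what the m4 PythonFunction macro does at install
# time: rewrite PythonFunction(dir, file, func) into an import of that
# function. The real macro additionally emits the call and arguments.
def expand_python_function(source: str) -> str:
    return re.sub(
        r"PythonFunction\((\w+),\s*(\w+),\s*(\w+)\)",
        r"from \1.\2 import \3",
        source,
    )

body = "PythonFunction(gbdt, gbdt, gbdt_fit)"
print(expand_python_function(body))  # → from gbdt.gbdt import gbdt_fit
```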
At the Python layer, the training function is implemented as follows:

def gbdt_fit(schema_madlib, training_table_name, output_table_name,
             id_col_name, dependent_variable, list_of_features,
             list_of_features_to_exclude, weights, num_trees,
             num_random_features, max_tree_depth, min_split, min_bucket,
             num_bins, null_handling_params, is_classification,
             predict_dt_prob=None, learning_rate=None, verbose=False,
             **kwargs):
    …
    plpy.execute("""ALTER TABLE {training_table_name}
                    DROP COLUMN IF EXISTS gradient CASCADE
                 """.format(training_table_name=training_table_name))

    create_summary_table(output_table_name, null_proxy, bins['cat_features'],
                         bins['con_features'], learning_rate,
                         is_classification, predict_dt_prob,
                         num_trees, training_table_name)

At the Python layer, the prediction function is implemented as follows:

def gbdt_predict(schema_madlib, test_table_name, model_table_name,
                 output_table_name, id_col_name, **kwargs):
    num_tree = plpy.execute("""SELECT COUNT(*) AS count
                               FROM {model_table_name}
                            """.format(**locals()))[0]['count']
    if num_tree == 0:
        plpy.error("The GBDT-method has no trees")

    elements = plpy.execute("""SELECT * FROM {model_table_name}_summary
                            """.format(**locals()))[0]
    …

The corresponding business code is defined in the py_in file, with the processing logic implemented in Python.
During installation, the sql_in and py_in files are resolved by GNU m4 into ordinary SQL and Python files. Note that the MADlib framework currently supports only Python 2, so the implementation above is also based on Python 2.
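To make the shape of gbdt_fit/gbdt_predict concrete, here is a minimal, self-contained gradient-boosting sketch in plain (Python 3) code, using squared loss and depth-1 stumps. It illustrates the technique only and is not MADlib's implementation: each round fits a small tree to the current residuals (the role of the gradient column above), and prediction sums the scaled outputs of all trees.

```python
# Minimal gradient-boosting sketch (squared loss, depth-1 regression
# stumps) showing the shape of gbdt_fit/gbdt_predict. Illustrative only.

def fit_stump(xs, residuals):
    # Find the single-feature split minimizing squared error.
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    return best[1:]

def gbdt_fit(xs, ys, num_trees=20, learning_rate=0.3):
    trees, pred = [], [0.0] * len(xs)
    for _ in range(num_trees):
        # For squared loss the negative gradient is just the residual;
        # this plays the role of the 'gradient' column mentioned above.
        residuals = [y - p for y, p in zip(ys, pred)]
        thr, lmean, rmean = fit_stump(xs, residuals)
        trees.append((thr, lmean, rmean))
        pred = [p + learning_rate * (lmean if x <= thr else rmean)
                for x, p in zip(xs, pred)]
    return trees, learning_rate

def gbdt_predict(trees, learning_rate, x):
    # Prediction sums the scaled output of every fitted tree.
    return sum(learning_rate * (l if x <= thr else r)
               for thr, l, r in trees)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 10.0, 20.0, 20.0]
trees, lr = gbdt_fit(xs, ys)
print(round(gbdt_predict(trees, lr, 1.5), 1))  # → 10.0
```

In MADlib's version the list of fitted trees lives in the output model table and the learning rate in the companion `_summary` table, which is why gbdt_predict starts by reading both.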

Example of using MADlib on openGauss

This section demonstrates the usage by classifying house prices with the support vector machine algorithm.
(1) Prepare the data set:

DROP TABLE IF EXISTS houses;
CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,  size INT, lot INT);
INSERT INTO houses VALUES
(1 ,  590 ,       2 ,    1 ,  50000 ,  770 , 22100),
(2 , 1050 ,       3 ,    2 ,  85000 , 1410 , 12000),
(3 ,   20 ,       3 ,    1 ,  22500 , 1060 ,  3500),
-- …
(12 , 1620 ,       3 ,    2 , 118600 , 1250 , 20000),
(13 , 3100 ,       3 ,    2 , 140000 , 1760 , 38000),
(14 , 2070 ,       2 ,    3 , 148000 , 1550 , 14000),
(15 ,  650 ,       3 ,  1.5 ,  65000 , 1450 , 12000);

(2) Train the model.
① Before training, configure the corresponding schema and compatibility parameters:

SET search_path="$user",public,madlib;
SET behavior_compat_options = 'bind_procedure_searchpath';

② Train with the default parameters; the classification condition is 'price < 100000':

DROP TABLE IF EXISTS houses_svm, houses_svm_summary; 
SELECT madlib.svm_classification('public.houses','public.houses_svm','price < 100000','ARRAY[1, tax, bath, size]');
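What svm_classification fits is a linear decision function over the feature vector ARRAY[1, tax, bath, size] (the leading 1 is the intercept term). The sketch below trains such a separator on a few rows of the houses data in plain Python; for simplicity it uses perceptron-style updates (the hinge-loss subgradient at margin 0) rather than MADlib's actual IGD solver.

```python
# Hedged sketch of the kind of model svm_classification fits: a linear
# decision function over [1, tax, bath, size]. Perceptron-style updates
# are used for simplicity; this is not MADlib's solver.
rows = [  # (tax, bath, size, price) rows taken from the houses table
    (590, 1.0, 770, 50000), (1050, 2.0, 1410, 85000),
    (20, 1.0, 1060, 22500), (3100, 2.0, 1760, 140000),
    (2070, 3.0, 1550, 148000), (650, 1.5, 1450, 65000),
]
X = [[1.0, t / 1000.0, b, s / 1000.0] for t, b, s, _ in rows]  # crude scaling
y = [1 if price < 100000 else -1 for *_, price in rows]        # t / f labels

w = [0.0] * 4
for _ in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
            w = [wj + yi * xj for wj, xj in zip(w, xi)]
            mistakes += 1
    if mistakes == 0:  # converged: every row is on the correct side
        break

decision = [sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]
print([d > 0 for d in decision])  # → [True, True, True, False, False, True]
```

The sign of the decision function picks the class, just as the decision_function column in the prediction output below maps to the 't'/'f' labels via dep_var_mapping.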

(3) View the model:

\x on
SELECT * FROM houses_svm;
\x off

The result is as follows:

-[ RECORD 1 ]------+-----------------------------------------------------------------
coef               | {.113989576847,-.00226133300602,-.0676303607996,.00179440841072}
loss               | .614496714256667
norm_of_gradient   | 108.171180769224
num_iterations     | 100
num_rows_processed | 15
num_rows_skipped   | 0
dep_var_mapping    | {f,t}

(4) Make predictions:

DROP TABLE IF EXISTS houses_pred; 
SELECT madlib.svm_predict('public.houses_svm','public.houses','id','public.houses_pred');

(5) View the prediction results:

SELECT *, price < 100000 AS actual FROM houses JOIN houses_pred USING (id) ORDER BY id;

The result is as follows:

id | tax  | bedroom | bath | price  | size |  lot  | prediction | decision_function | actual
----+------+---------+------+--------+------+-------+------------+-------------------+--------
  1 |  590 |       2 |    1 |  50000 |  770 | 22100 | t          |      .09386721875 | t
  2 | 1050 |       3 |    2 |  85000 | 1410 | 12000 | t          |     .134445058042 | t
 
 14 | 2070 |       2 |    3 | 148000 | 1550 | 14000 | f          |  -1.9885277913972 | f
 15 |  650 |       3 |  1.5 |  65000 | 1450 | 12000 | t          |   1.1445697772786 | t
(15 rows)

Check the number of misclassified rows:

SELECT COUNT(*) FROM houses_pred JOIN houses USING (id) WHERE houses_pred.prediction != (houses.price < 100000);

The result is as follows:

count
-------
     3
(1 row)

(6) Train with another SVM kernel:

DROP TABLE IF EXISTS houses_svm_gaussian, houses_svm_gaussian_summary, houses_svm_gaussian_random; 
SELECT madlib.svm_classification( 'public.houses','public.houses_svm_gaussian','price < 100000','ARRAY[1, tax, bath, size]','gaussian','n_components=10', '', 'init_stepsize=1, max_iter=200' );
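The 'gaussian' kernel with n_components works by mapping inputs through a small set of random features so that a linear solver can approximate the kernelized SVM. The sketch below illustrates the underlying idea with random Fourier features; it is an assumption-based illustration of the technique, not MADlib's code, and the constants are arbitrary.

```python
import math
import random

# Random Fourier features: z(x) = sqrt(2/n) * cos(W x + b) with
# W ~ N(0, 2*gamma) and b ~ U[0, 2*pi], so that z(x).z(y) approximates
# the Gaussian kernel exp(-gamma * ||x - y||^2). Illustrative only.
random.seed(42)
gamma, n_components, dim = 0.5, 500, 2
W = [[random.gauss(0.0, math.sqrt(2 * gamma)) for _ in range(dim)]
     for _ in range(n_components)]
b = [random.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]

def features(x):
    scale = math.sqrt(2.0 / n_components)
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w_row, x)) + bi)
            for w_row, bi in zip(W, b)]

def approx_kernel(x, z):
    # Inner product in feature space approximates the exact kernel value.
    return sum(fx * fz for fx, fz in zip(features(x), features(z)))

x, z = [1.0, 0.0], [0.5, 0.5]
exact = math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))
print(round(exact, 3), round(approx_kernel(x, z), 3))  # values are close
```

A larger n_components gives a better kernel approximation at the cost of more computation, which is the trade-off the n_components=10 setting above controls.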

Make predictions and view the results:

DROP TABLE IF EXISTS houses_pred_gaussian; 
SELECT madlib.svm_predict('public.houses_svm_gaussian','public.houses','id', 'public.houses_pred_gaussian');
SELECT COUNT(*) FROM houses_pred_gaussian JOIN houses USING (id) WHERE houses_pred_gaussian.prediction != (houses.price < 100000);

The result is as follows:

 count
-------
     0
(1 row)

(7) Other parameters.
In addition to specifying a different kernel method, you can also specify the number of iterations and initial parameters, such as init_stepsize, max_iter, and class_weight.

Evolution route

openGauss currently obtains machine learning capability through compatibility with the open-source Apache MADlib machine learning library. By adapting the original MADlib framework, openGauss implements a variety of additional engineering algorithm extensions.
Beyond compatibility with Apache MADlib, which grew out of the PostgreSQL ecosystem, openGauss is also developing a native DB4AI engine that supports end-to-end AI capabilities, including model management, hyperparameter optimization, native SQL-like syntax, and database-native AI operators and execution plans, with performance more than five times that of MADlib. This functionality will be open-sourced gradually in the future.

8.8 Summary

This chapter introduced the openGauss team's exploration of combining AI with databases, focusing on AI4DB features such as parameter self-tuning, index recommendation, anomaly detection, query time prediction, and slow SQL discovery, as well as the DB4AI functionality of openGauss. The combination of AI and databases goes far beyond this; the features described here are just the beginning, and there is still much to do and a long way to go for the AI capabilities of openGauss. Directions for evolution include deeper integration of AI with the optimizer; full-process AI autonomy, with fault discovery and automatic repair in all scenarios; and using AI to transform the algorithms and logic inside the database.
Although the combination of AI and databases has made considerable progress, the following challenges remain.
(1) Computing power: how should the cost of the additional AI computation be covered, and will it cause performance degradation?
(2) Algorithms: does combining AI algorithms with the database bring significant benefits? Is the extra overhead of the algorithms acceptable? Can the algorithms generalize to pervasive scenarios? What kind of algorithm best solves the practical problems?
(3) Data: how can the data required for AI model training be extracted and stored securely, and how should hot and cold data classification and loading at startup be handled?
These problems are largely matters of trade-off. We should make full use of the creative inspiration of AI while fully inheriting and carrying forward the existing theory and practice of databases; this is also the direction the openGauss team continues to explore.

Thank you for studying the highlights of "8.7 DeepSQL" and "8.8 Summary" in Chapter 8, AI technology. Next we will begin "Chapter 9: Security management source code analysis".
Stay tuned.

Copyright notice
This article was created by [Gauss Squirrel Club]. Please include a link to the original when reposting.
https://chowdera.com/2021/10/20211002145418024v.html
