Design of sliding mode controller for servo feed system based on generalized extended state observer with reinforcement learning

Scientific Reports, volume 14, Article number: 24976 (2024)

Nonlinear friction, system uncertainty, and external disturbances have a significant impact on the performance of high-precision servo feed systems. To achieve higher tracking accuracy, a sliding mode controller based on a generalized extended state observer with a double critic deep deterministic policy gradient algorithm is designed. Firstly, a flexible two mass drive model (FTMDM) is established for the two-axis differential micro-feed system (TDMS). Next, a generalized extended state observer (GESO) is designed to estimate both matched and mismatched interference, and its observation error is proven to be bounded. A sliding mode controller based on the GESO is then proposed; its stability is proven through Lyapunov theory, and the tracking error is bounded and converges to zero in finite time. The tuning of the controller parameters is simplified by using the quadratic optimal control principle. Furthermore, a double critic deep deterministic policy gradient algorithm (DCDDPG) is proposed to dynamically optimize the GESO parameters. Simulation results show that the GESO with DCDDPG reduces the observation error for step and sinusoidal signals and significantly improves the observation accuracy for nonlinear friction. Finally, experimental results show that the proposed control method achieves more accurate position tracking performance on TDMS.

Ultra-precision machining technology is an important technical guarantee in many fields such as aerospace, intelligent manufacturing, and gene editing. As a key technology and core competitiveness of ultra-precision machining, high-precision servo feed systems have been widely studied and applied. The traditional feed systems in common use are driven either directly by linear motors or by servo motors driving ball screws1,2,3. However, due to the inherent characteristics of these structures, the workbench is severely affected by nonlinear friction during low-speed feed motion and motion reversal, making it extremely difficult to achieve ultra-high positioning accuracy. Although macro-dimensional motion platforms based on intelligent materials such as piezoelectric ceramics4,5 can achieve nanoscale positioning, they are constrained by factors such as small stroke and insufficient driving force, making them difficult to apply in the field of ultra-precision machining.

In order to achieve high-precision, large-stroke micro-feed motion, the novel two-axis differential micro-feed system (TDMS) has been proposed6,7. The system synthesizes the micro-feed motion of the workbench by driving the screw and the nut separately with two servo motors. The two shafts rotate in the same direction at nearly equal speeds, both above the critical speed at which crawling occurs. This reduces the interference of nonlinear friction generated by the servo motors and the ball screw pair in low-speed motion. However, the system is still affected by linear and nonlinear friction at the workbench guide rail. Therefore, in order to further improve the position tracking accuracy and low-speed performance of TDMS, it is necessary to study high-performance motion control algorithms.

A large number of control algorithms have been proposed, such as classical PID control, proportional integral control based on feedforward compensation, adaptive control, sliding mode control, and active disturbance rejection control, all of which can achieve friction compensation and disturbance suppression. In order to suppress the nonlinear friction caused by the reversal of platform motion direction, Liu8 proposed a multi square pulse compensation method. The accurate LuGre friction model was used to calculate the waveform and analyze the waveform signals of different speeds and accelerations. The error compensation performance was improved by 95%. But this method is limited by the accuracy of friction parameter identification. Li9 achieved feedforward compensation for tracking errors by establishing a dynamic model, which basically eliminated dynamic mechanical tracking errors. However, it still requires a complex parameter identification process to obtain the stiffness and friction parameters in the model. Farrage10 constructed a new friction model that included unknown friction sources in low-speed and high-speed regions. And a friction compensator that can accurately describe actual friction behavior was designed. The combination of friction feedforward compensation and sliding mode control improved accuracy and significantly reduced energy consumption. In order to reduce the impact of inaccurate friction identification on friction compensation, some adaptive friction compensation methods have also been studied. Liang11 established a dual observer based on the LuGre friction model to handle complex friction parameter estimation problems, and combined it with adaptive sliding mode control to improve tracking accuracy. Zeng12 proposed a fixed-time convergence controller to solve the load tracking problem using the concept of multifaceted sliding mode. The influence of dead zone nonlinearity is effectively reduced by high gain nonlinearity compensator. The method is effective, reliable and practical, but it is still necessary to establish a continuously differentiable friction model and identify its parameters. An adaptive feedforward friction compensation method was proposed by Wan et al.13, which established a dynamic friction model to characterize asymmetric and nonlinear static friction phenomena. And an adaptive controller to achieve friction compensation without measuring speed and acceleration was designed, achieving lower tracking errors.

However, the above methods still require the establishment of a friction model and depend to some extent on the identified friction parameters. Changes in the environment can cause changes in the friction parameters, which will seriously weaken the friction compensation effect and lead to a decrease in the tracking accuracy of the system. Therefore, some scholars regard friction as a disturbance and suppress it to reduce the adverse effects of friction, system uncertainty, parameter changes, and other disturbances on motion, in order to achieve high-precision motion control of the feed system. Sun et al.14 proposed a second-order sliding mode disturbance observer and designed a position controller with adjustable phase to suppress periodic disturbances, which effectively reduced position tracking errors. Cheng15 proposed a compound nonlinear control method for high-performance servomechanisms, in which the unmeasured velocity and interference were estimated by a reduced-order extended state observer and disturbance compensation was realized by reference feedforward. Bao16 proposed a multi-input multi-output nonlinear disturbance observer and designed an adaptive sliding mode controller based on the flexible ball screw model, which effectively improved the tracking performance. Gao17 regarded uncertain friction as a disturbance and established a friction state observer by introducing a friction auxiliary model, effectively solving the problem of low-speed crawling. Lu18 combined a load observer, a sliding mode observer, and a robust speed controller for low-speed, high-torque permanent magnet synchronous motors, improving the transient response of the system and successfully suppressing the effects of sudden disturbance changes, demonstrating good robustness. Zeng19 formulated guaranteed cost control in an integrated design scheme to suppress unknown interference and parameter uncertainty, which significantly improves the robustness of the system. Hu20 proposed a predefined-time convergence synchronization controller, which can preset the synchronization error by adjusting a single parameter, significantly improving the tracking accuracy with good transient performance. Li21 designed an extended state observer for non-integral-chain systems, which achieved fixed-time convergence of the estimation errors for disturbances such as load changes, torque disturbances, and nonlinear friction, and showed clear advantages. These advanced servo motion control algorithms are effective in friction compensation and disturbance suppression, improving the tracking accuracy, robustness, and anti-interference ability of the system. However, they all face tedious and complex parameter tuning: control parameters are usually tuned by engineering trial and error, which is complex, time-consuming, and rarely yields optimal results.

In order to solve the above problems, artificial intelligence algorithms have been applied to the parameter tuning process. Li22 used an improved genetic algorithm to tune the parameters of a nonlinear sliding mode controller that combined a nonlinear approaching law and a state observer, achieving optimal parameter control; the system has no overshoot and strong robustness. He23 introduced reinforcement learning into the parameter optimization of servo control and established an upper-level controller based on the soft actor-critic algorithm, which continuously tuned the parameters of the lower-level PID controller based on the tracking error and system state, realizing high-precision hydraulic servo control under uncertain working conditions. Shuprajhaa24 improved the proximal policy optimization algorithm by introducing criteria such as action repetition and early stopping, and developed a data-driven parameter optimization algorithm to determine the optimal gains of the PID controller. Ding25 proposed a multi-phase focused PID adaptive tuning method, which concentrates the PID parameters output by the agent in a stable region and can maintain controller stability even with limited prior knowledge. Although these algorithms can control various types of systems, both linear and nonlinear, without system modeling and have strong universality, they lack resistance to external disturbances. Zhao26 designed a model-free optimal control scheme based on reinforcement learning; with unknown system parameters, off-policy reinforcement learning algorithms were used to approximate the solution of the regulator, achieving accurate speed tracking and good transient response. Oh27 addressed the difficulty of simultaneously tuning multiple notch filters and designed a deep deterministic policy gradient algorithm with the vector stability margin as the reward function to achieve optimal parameter tuning for multiple notch filters. Some scholars have also applied reinforcement learning algorithms directly to the design of servo motion controllers. Song28 described the speed control problem as a Markov decision process in response to various disturbances such as load torque, moment of inertia, and friction in permanent magnet synchronous motor servo systems, and designed a deep reinforcement learning controller based on the DQN algorithm to improve robustness to disturbances. However, the action output of this method is discrete and cannot minimize the speed error to the greatest extent. Wang29 designed a reinforcement learning feedback controller combined with a feedforward controller and a neural-network-based disturbance observer, achieving automatic adjustment of the control gain. However, this method requires long-term training, and it is difficult to guarantee the stability of the control system for different inputs, especially for input states that have not been learned.

This article is based on the transmission mechanism and advantages of the two-axis differential micro-feed system, and further studies high-performance servo control algorithms to achieve high-precision position tracking performance. The main contributions of this article are summarized as follows:

A generalized extended state observer is designed based on the established flexible two mass drive model of TDMS, which can observe matched interference and mismatched interference in the model. And it has been proven that the observer estimation error is bounded.

A sliding mode controller based on GESO is proposed. The system friction is regarded as a disturbance, and the friction is accurately observed and compensated by the GESO. Based on Lyapunov theory, it is proved that the control system is asymptotically stable and the error can converge in a finite time.

The deep deterministic policy gradient algorithm is improved into a double critic deep deterministic policy gradient algorithm (DCDDPG) that achieves dynamic self-tuning of the GESO parameters, avoiding the complex manual tuning process of the GESO. Based on the principle of quadratic optimal control, the adjustment of the controller parameters is simplified.

The organizational structure of the remaining parts of this article is as follows. In the second section, the structure and working principle of TDMS are introduced, and the flexible two mass drive model of TDMS is established. The design steps of the proposed sliding mode controller based on GESO are introduced in the third section. The fourth section introduces the agent construction process of DCDDPG, designs a reward function, and provides the algorithm flow for DCDDPG to optimize the GESO parameters (DCDDPG_GESO). In the fifth section, simulations and experiments are carried out.

The mechanical structure of the two-axis differential micro-feed system (TDMS) is shown in Fig. 1. Based on the traditional ball screw, an angular contact ball bearing is installed between the workbench and the nut, so that the traditional screw-rotation-driven feed system becomes a TDMS that can be driven by both the screw and the nut. The screw drive shaft is directly driven by a servo motor through a coupling to rotate the screw. The nut drive shaft is driven by the other servo motor through a synchronous belt with a transmission ratio of 1, thereby driving the nut to rotate. TDMS works with the lead screw and nut rotating simultaneously in the same direction; the linear displacement speed of the table is then determined by the speed difference between the two drive shafts. The two servo motors respectively drive the screw drive shaft and the nut drive shaft at a relatively high speed, and the synthesized speed is equal to the target speed of the table. This method can effectively keep the two drive shafts out of the low-speed creep region, lower the critical speed at which the system produces the crawling phenomenon, and obtain better low-speed performance30,31.

Mechanical structure diagram of TDMS.

On the one hand, the test bench can avoid the crawling phenomenon of the motors; on the other hand, it can reduce the precision error caused by the use of reducers and other transmission mechanisms. It has potential application prospects in precision injection medical equipment, machining and manufacturing, assembly and measurement, printing, packaging, and other occasions requiring low-speed feed.

In the design of servo system controllers, the system is usually equivalent to a rigid body model, which is actually an ideal situation. In this article, based on the two mass drive model32, a flexible two mass drive model (FTMDM) of TDMS is established. The schematic diagram of the two mass drive model is shown in Fig. 2a. The moment of inertia on the input side is \({J_i}\), the moment of inertia on the output side is \({J_o}\), and an input torque is \({T_i}\). Under the influence of the equivalent elastic coefficient and damping coefficient of the system, the final output torque of the system is \({T_o}\). The above ideas are applied to the traditional ball screw servo system. The traditional ball screw servo system refers to the feed system in which the screw is rotated by the motor and the nut follows the table for linear displacement. Figure 2b is the schematic diagram of FTMDM of the traditional ball screw servo system. And the system dynamics equation can be expressed as

where u is the torque control quantity of the servo motor; \({m_1}\) is the moment of inertia of the equivalent rolling element; \({m_2}\) is the moment of inertia of the equivalent moving element; \({f_1}\) is the equivalent disturbance at the input end; \({f_2}\) is the equivalent disturbance at the output end; \({b_1}\) is the viscous damping of the equivalent rolling element; \({b_2}\) is the viscous damping of the equivalent moving element; \({x_1}\) is the rotation displacement of the servo system; \({x_2}\) is the linear displacement of the servo system.
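The displayed dynamics equation referenced above did not survive extraction. A sketch of the standard flexible two-mass form, consistent with the symbols just defined and with the state-space matrices given later for the screw drive shaft (here \(k\) and \(C\) denote the equivalent elastic coefficient and damping coefficient of Fig. 2a), is:

\[
\begin{aligned}
{m_1}{\ddot x_1} &= u + {f_1} - {b_1}{\dot x_1} - C({\dot x_1} - {\dot x_2}) - k({x_1} - {x_2}) \\
{m_2}{\ddot x_2} &= {f_2} - {b_2}{\dot x_2} + C({\dot x_1} - {\dot x_2}) + k({x_1} - {x_2})
\end{aligned}
\]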

Two mass drive model: (a) Schematic diagram of two mass drive model, (b) FTMDM of traditional ball screw servo system.

On this basis, the FTMDM of TDMS is established, as shown in Fig. 3. The damping of the sliding guide rail at the workbench is lumped into the screw drive shaft, and the mathematical model of the TDMS is expressed as follows.

where \({b_{s1}}\) is the viscous damping of the lead screw motor; \({C_s}\) is the equivalent damping of the lead screw drive shaft, including material damping, connection damping, and structural damping; \({b_{s2}}\) is the viscous damping of the screw nut pair and the sliding guide rail; \({x_{s1}}\) is the rotation displacement of the lead screw motor; \({x_{s2}}\) is the actual linear displacement of the lead screw drive shaft; \({m_{s1}}\) is the equivalent mass of the rotating element of the lead screw drive shaft; \({m_{s2}}\) is the equivalent mass of the linear displacement element of the lead screw drive shaft; \({u_s}\) is the torque control value of the lead screw drive shaft motor; \({f_{s1}}\) and \({f_{s2}}\) are the interferences caused by the rotating and linear moving components of the screw drive shaft, including nonlinear friction, unmodeled dynamics, and parameter uncertainty; \({b_{n1}}\) is the viscous damping of the nut motor; \({C_n}\) is the equivalent damping of the nut drive shaft; \({x_{n1}}\) is the rotation displacement of the nut motor; \({x_{n2}}\) is the actual linear displacement of the nut drive shaft; \({m_{n1}}\) is the equivalent mass of the rotating components of the nut drive shaft; \({m_{n2}}\) is the equivalent mass of the linear displacement element of the nut drive shaft; \({u_n}\) is the torque control value of the nut drive shaft motor; \({f_{n1}}\) and \({f_{n2}}\) are the interferences caused by the rotating and linear moving components of the nut drive shaft; \({k_s}\) is the equivalent stiffness of the screw drive shaft; \({k_n}\) is the equivalent stiffness of the nut drive shaft; x is the actual displacement of the workbench.

Flexible two mass drive model of TDMS.

Due to the high similarity in the mathematical model and control process between the screw drive shaft and the nut drive shaft, subsequent research takes the screw drive shaft as an example. Define the state variables of the screw drive shaft system as \({{\varvec{x}}_s}={[{x_{s1}},{x_{s2}},{\dot {x}_{s1}},{\dot {x}_{s2}}]^T}\), \({{\varvec{y}}_s}={[{y_{s1}},{y_{s2}}]^T}\). The state space expression of the screw drive shaft can be obtained from Eqs. (2) and (3)

where \({\varvec{A}}=\left[ {\begin{array}{*{20}{c}} 0&0&1&0 \\ 0&0&0&1 \\ { - \frac{{{k_s}}}{{{m_{s1}}}}}&{\frac{{{k_s}}}{{{m_{s1}}}}}&{ - \frac{{{b_{s1}}+{C_s}}}{{{m_{s1}}}}}&{\frac{{{C_s}}}{{{m_{s1}}}}} \\ {\frac{{{k_s}}}{{{m_{s2}}}}}&{ - \frac{{{k_s}}}{{{m_{s2}}}}}&{\frac{{{C_s}}}{{{m_{s2}}}}}&{ - \frac{{{b_{s2}}+{C_s}}}{{{m_{s2}}}}} \end{array}} \right]\), \({{\varvec{B}}_1}={\left[ {\begin{array}{*{20}{c}} 0&0&{\frac{1}{{{m_{s1}}}}}&0 \end{array}} \right]^T}\), \({{\varvec{B}}_2}={\left[ {\begin{array}{*{20}{c}} 0&0&{\frac{1}{{{m_{s1}}}}}&0 \\ 0&0&0&{\frac{1}{{{m_{s2}}}}} \end{array}} \right]^T}\), \({\varvec{C}}=\left[ {\begin{array}{*{20}{c}} 1&0&0&0 \\ 0&1&0&0 \end{array}} \right]\), \({{\varvec{f}}_s}={\left[ {\begin{array}{*{20}{c}} {{f_{s1}}}&{{f_{s2}}} \end{array}} \right]^T}\).
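As an illustration only, the following minimal Python sketch assembles these state-space matrices numerically. The parameter values are placeholders, not the identified TDMS values from Table 4.

import numpy as np

# Placeholder parameters (illustrative only, not the identified TDMS values)
ms1, ms2 = 0.8, 1.5      # equivalent masses of the rotating / linear elements
bs1, bs2 = 0.02, 0.05    # viscous damping of the motor side / nut-and-guide side
Cs, ks = 30.0, 2.0e5     # equivalent damping and stiffness of the screw drive shaft

A = np.array([
    [0.0,      0.0,      1.0,              0.0],
    [0.0,      0.0,      0.0,              1.0],
    [-ks/ms1,  ks/ms1,  -(bs1 + Cs)/ms1,   Cs/ms1],
    [ ks/ms2, -ks/ms2,   Cs/ms2,          -(bs2 + Cs)/ms2],
])
B1 = np.array([[0.0], [0.0], [1.0/ms1], [0.0]])        # control input channel
B2 = np.array([[0.0, 0.0], [0.0, 0.0],
               [1.0/ms1, 0.0], [0.0, 1.0/ms2]])        # matched / mismatched channels
C = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])                   # measured displacements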

Similarly, the state space expression of the nut drive shaft can be obtained from Eqs. (4) and (5).

On the basis of the above mathematical model, the motion planning, SMC design, observer design and parameter optimization process are further completed. The block diagram of the overall control scheme is shown in Fig. 4.

Control scheme block diagram of TDMS.

In order to improve the tracking accuracy of the feed system, it is necessary to compensate for the friction of the system. A common method is to treat friction as a disturbance and establish an observer to observe and compensate for it. For a conventional n-order system in integral chain form, effective estimation of an unknown disturbance can be achieved through a standard-form state observer. Such a traditional extended state observer augments the disturbance as one additional state variable, yielding an (n + 1)-order observer that observes and estimates the disturbance.

However, in the flexible two mass drive model of TDMS established above (Fig. 3), two types of disturbances exist in both the screw drive shaft and the nut drive shaft. Taking the screw drive shaft as an example, the disturbances appear in Eqs. (2) and (3) respectively. The disturbance in the same channel as the system input signal is called matched interference; that is, the disturbance is in the control channel. The disturbance that is not in the same channel as the system input signal is called mismatched interference; in other words, the disturbance is not within the control channel. In practice, the matched interference includes the nonlinear friction and parameter uncertainty of the motor, while the mismatched interference includes the friction of the lead screw, guide rail, and transmission belt, and external disturbances during operation. A traditional extended state observer cannot estimate matched and mismatched interference simultaneously. Therefore, it is necessary to augment two additional state variables and establish an (n + 2)-order generalized extended state observer to effectively estimate the matched and mismatched interference in the two channels.

For the system of standard integral chain form \(\left\{ {\begin{array}{*{20}{l}} {\dot {{\varvec{x}}}={\varvec{A}}{\varvec{x}}+bu+h} \\ {y={\varvec{C}}{\varvec{x}}} \end{array}} \right.\), h is the total interference. In order to realize the observation of the interference, h will be used as an expansion state variable. And the form of ESO can be designed as \(\left\{ {\begin{array}{*{20}{l}} {{{\dot {\hat {{\varvec{x}}}}}_e}={\varvec{A}}{{\hat {{\varvec{x}}}}_e}+bu+{{\varvec{L}}_{(n+1) \times 1}}(y - \hat {y})} \\ {\hat {y}={\varvec{C}}{{\hat {{\varvec{x}}}}_e}} \end{array}} \right.\). \({{\varvec{x}}_e}\) is the expanded system state variable. Through this type of ESO, interference can be effectively managed. Similarly, for FTMDM which is not in the form of standard integral chain, GESO is improved on the basis of the mathematical expression of standard extended state observer. For the system described in Eq. 8, it can be expressed as \(\left\{ {\begin{array}{*{20}{l}} {\dot {{\varvec{x}}}={\varvec{A}}{\varvec{x}}+{\varvec{b}}u+{\varvec{D}}{\varvec{h}}} \\ {{{\varvec{y}}_{2 \times 1}}={\varvec{C}}{\varvec{x}}} \end{array}} \right.\) . Referring to the mathematical form of the standard ESO, the generalized extended state observer can be expressed as \(\left\{ {\begin{array}{*{20}{l}} {{{\dot {\hat {{\varvec{x}}}}}_e}={\varvec{A}}{{\hat {{\varvec{x}}}}_e}+{\varvec{b}}u+{{\varvec{L}}_{(n+2) \times 2}}({\varvec{y}} - \hat {{\varvec{y}}})} \\ {{{\hat {{\varvec{y}}}}_{2 \times 1}}={\varvec{C}}{{\hat {{\varvec{x}}}}_e}} \end{array}} \right.\). Similar to ESO, the GESO designed based on the accurate system model can effectively manage the expanded system state variables through the designed expressions.

Based on the above analysis, expand the matched interference \({f_{s1}}\) and mismatched interference \({f_{s2}}\) in Eqs. (2) and (3) to the state variables of the system. The expanded state variables of the system can be represented as \({{\varvec{x}}_{se}}={[{x_{s1}},{x_{s2}},{\dot {x}_{s1}},{\dot {x}_{s2}},{f_{s1}},{f_{s2}}]^T}\) . Further rewrite Eq. (7) as

where \({{\varvec{A}}_{\varvec{e}}}={\left[ {\begin{array}{*{20}{c}} {\varvec{A}}&{{{\varvec{B}}_2}} \\ 0&0 \end{array}} \right]_{6 \times 6}}\), \({{\varvec{B}}_{1{\varvec{e}}}}={\left[ {\begin{array}{*{20}{c}} {{{\varvec{B}}_1}} \\ 0 \end{array}} \right]_{6 \times 1}}\), \({\varvec{G}}={\left[ {\begin{array}{*{20}{c}} 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{array}} \right]^T}\), \({{\varvec{C}}_e}=\left[ {\begin{array}{*{20}{c}} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \end{array}} \right]\).

Design the generalized extended state observer (GESO) for TDMS described in Eq. (8) that can observe both matched and mismatched interferences, with the following form

where \({\varvec{z}}={[{z_1},{z_2},{z_3},{z_4},{z_5},{z_6}]^T}\) is the observed value of the extended state vector \({{\varvec{x}}_{se}}\), and the two perturbations can be estimated through the observations \({z_5}\) and \({z_6}\). Equation (9) can be expressed in the form of a state space equation as

where \({\varvec{L}}{\text{=}}{\left[ {\begin{array}{*{20}{c}} {{L_{11}}}&{{L_{21}}}&{{L_{31}}}&{{L_{41}}}&{{L_{51}}}&{{L_{61}}} \\ {{L_{12}}}&{{L_{22}}}&{{L_{32}}}&{{L_{42}}}&{{L_{52}}}&{{L_{62}}} \end{array}} \right]^T}\) is the gain coefficient matrix of the GESO, and \(\Delta {\varvec{y}}{\text{=}}{\left[ {\begin{array}{*{20}{c}} {{y_{s1}} - {z_1}}&{{y_{s2}} - {z_2}} \end{array}} \right]^T}\). The matrix \({\varvec{L}}\) needs to be determined by the designer through subsequent simulation and experiments.
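To make the structure of Eqs. (9) and (10) concrete, the sketch below builds the extended matrices and performs one Euler-discretized GESO update; the gain values are placeholders (in this work \({l_{52}}\) and \({l_{62}}\) are ultimately tuned by DCDDPG), and A, B1, B2 refer to the matrices assembled in the earlier sketch.

import numpy as np

# Extended matrices of the GESO: the two disturbances become extra states
Ae = np.zeros((6, 6)); Ae[:4, :4] = A; Ae[:4, 4:] = B2
B1e = np.vstack([B1, np.zeros((2, 1))])
Ce = np.zeros((2, 6)); Ce[0, 0] = 1.0; Ce[1, 1] = 1.0

# 6x2 observer gain matrix (illustrative placeholder values, not tuned gains)
L_gain = np.array([[200.0, 0.0], [0.0, 200.0],
                   [8.0e3, 0.0], [0.0, 8.0e3],
                   [4.0e4, 0.0], [0.0, 4.0e4]])

def geso_step(z, u, y, dt):
    """One Euler step of the GESO: z_dot = Ae z + B1e u + L (y - Ce z)."""
    z_dot = Ae @ z + B1e.flatten() * u + L_gain @ (y - Ce @ z)
    return z + dt * z_dot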

With a reasonably designed gain coefficient matrix, the system's state variables and extended state variables can be estimated accurately and effectively. Through Lemma 1, it is proved mathematically that the observation errors of all the extended state variables of GESO are convergent and bounded; therefore, GESO can effectively observe the two types of disturbances. The effectiveness of GESO in managing matched and mismatched interference also depends on the parameters of the observer, which are usually optimized through repeated simulation and experimental verification. On this basis, reinforcement learning is introduced to realize a dynamic optimization process for GESO, which can update the parameters in real time, accurately estimate and compensate system interference, and realize effective management of matched and mismatched interference.

In order to prove that GESO can effectively observe system state variables such as disturbances and track the trend of disturbance changes, it is necessary to further prove that the observation error of the designed observer is bounded. Define the estimation errors of the observer, such as the position error, velocity error, and disturbance error, and represent them in vector form as

where \({{\varvec{e}}_o}={\left[ {\begin{array}{*{20}{c}} {{{\varvec{e}}_x}}&{{{\varvec{e}}_f}} \end{array}} \right]^T}\), \({{\varvec{e}}_f}={\left[ {\begin{array}{*{20}{c}} {{e_{f1}}}&{{e_{f2}}} \end{array}} \right]^T}\).

Taking the derivative of Eq. (11) yields

By substituting Eqs. (8) and (10) into Eq. (12), the expression for the derivative of the error vector can be obtained as

Lemma 133: By selecting an appropriate gain coefficient matrix \({\varvec{L}}\) such that \({{\varvec{A}}_e} - {\varvec{L}}{{\varvec{C}}_e}\) is a Hurwitz matrix, GESO gradually converges and the observation error \({{\varvec{e}}_o}\) of GESO is bounded for any bounded \({\dot {{\varvec{f}}}_s}\).

Proof: Let \({{\varvec{A}}_o}={{\varvec{A}}_e} - {\varvec{L}}{{\varvec{C}}_e}\), \({{\varvec{A}}_o}\) is Hurwitz matrix, \({\varvec{d}}={\varvec{G}}{\dot {{\varvec{f}}}_s}\), from Eq. (13), it can be concluded that

The expression for further defining the Lyapunov function is as follows

Because \({\varvec{A}}_{o}^{{}}\) is a Hurwitz matrix, there exists a positive definite real symmetric matrix \({\varvec{J}}\) that satisfies the following equation

where \({\varvec{J}}\) is a special solution of Eq. (16), and \({\varvec{Q}}\) is a positive definite real symmetric matrix.

Taking the derivative of Eq. (15) and substituting Eq. (14) into it gives

By substituting Eq. (16) into Eq. (17), it can be inferred that

To obtain \({\dot {V}_1} \leqslant 0\), it is necessary to further construct Eq. (18) in the following form
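The displayed Eqs. (14)–(19) did not survive extraction; a sketch consistent with the quantities defined above (and with the norms used in the discussion that follows) is:

\[
{\dot {\varvec{e}}_o}={{\varvec{A}}_o}{{\varvec{e}}_o}+{\varvec{d}}, \qquad
{V_1}={\varvec{e}}_{o}^{T}{\varvec{J}}{{\varvec{e}}_o}, \qquad
{\varvec{A}}_{o}^{T}{\varvec{J}}+{\varvec{J}}{{\varvec{A}}_o}= - {\varvec{Q}},
\]
\[
{\dot V_1}={\varvec{e}}_{o}^{T}({\varvec{A}}_{o}^{T}{\varvec{J}}+{\varvec{J}}{{\varvec{A}}_o}){{\varvec{e}}_o}+2{\varvec{e}}_{o}^{T}{\varvec{J}}{\varvec{d}}
= - {\varvec{e}}_{o}^{T}{\varvec{Q}}{{\varvec{e}}_o}+2{\varvec{e}}_{o}^{T}{\varvec{J}}{\varvec{d}}
= - \left\| {{\varvec{e}}_{o}^{T}{{\varvec{Q}}^{\frac{1}{2}}} - {{\varvec{d}}^T}{\varvec{J}}{{\varvec{Q}}^{ - \frac{1}{2}}}} \right\|_2^2+\left\| {{{\varvec{d}}^T}{\varvec{J}}{{\varvec{Q}}^{ - \frac{1}{2}}}} \right\|_2^2 .
\]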

For Eq. (19), if \({\left\| {{\varvec{e}}_{o}^{T}{{\varvec{Q}}^{\frac{1}{2}}} - {{\varvec{d}}^T}{\varvec{J}}{{\varvec{Q}}^{ - \frac{1}{2}}}} \right\|_2}>{\left\| {{{\varvec{d}}^T}{\varvec{J}}{{\varvec{Q}}^{ - \frac{1}{2}}}} \right\|_2}\) is satisfied, that is, \({\left\| {{\varvec{e}}_{o}^{T}{{\varvec{Q}}^{\frac{1}{2}}}} \right\|_2}>2{\left\| {{{\varvec{d}}^T}{\varvec{J}}{{\varvec{Q}}^{ - \frac{1}{2}}}} \right\|_2}\), then \({\dot {V}_1}<0\). Taking \({\varvec{Q}}\) as the identity matrix, it can be obtained that if \({\left\| {{{\varvec{e}}_o}} \right\|_2}>2{\left\| {{\varvec{J}}{\varvec{d}}} \right\|_2}\), then \({\dot {V}_1}<0\). This means that \({\left\| {{{\varvec{e}}_o}} \right\|_2}\) decreases for all \({{\varvec{e}}_o}\) satisfying the above inequality. For \({{\varvec{e}}_o}\) that does not satisfy it, i.e., \({\left\| {{{\varvec{e}}_o}} \right\|_2}<2{\left\| {{\varvec{J}}{\varvec{d}}} \right\|_2}\), \({\dot {V}_1}\) may be positive and \({\left\| {{{\varvec{e}}_o}} \right\|_2}\) may increase, but only until it again exceeds this bound. Therefore, it can be concluded that \({{\varvec{e}}_o}\) is bounded.

In the servo feed system of a permanent magnet synchronous motor, the sampling frequency of the position loop and speed loop is much higher than the rate of change of the disturbance. This means that the perturbation changes very slowly within a single control period and can be regarded as a constant, so \({\dot {{\varvec{f}}}_s}=0\) and, in the current cycle, \(d=0\). Then \({\left\| {{{\varvec{e}}_o}} \right\|_2}>2{\left\| {{\varvec{J}}{\varvec{d}}} \right\|_2}=0\) for any nonzero error, so \({\dot {V}_1}<0\) and GESO is asymptotically stable.

Firstly, define the tracking error of the system as

where \({{\varvec{x}}_r}={[{x_{r1}},{x_{r2}},{\dot {x}_{r1}},{\dot {x}_{r2}}]^T}\), corresponding to the expected values of various state variables in the system.

Furthermore, the structure of the integral sliding mode surface is designed as follows

where the proportional coefficient \({\varvec{K}}=[{k_{_{1}}},{k_2},{k_3},{k_4}]\) and the integral coefficient \({\varvec{P}}=[{p_{_{1}}},{p_2},{p_3},{p_4}]\), and each element of \({\varvec{K}}\) and \({\varvec{P}}\) is a positive real number. The proportional coefficient mainly affects the speed of the system's dynamic response, while the integral coefficient mainly affects the steady-state error of the system. A reasonable and appropriate choice of the proportional and integral coefficients enables the system to respond quickly and to achieve a steady-state error that meets the expected result.
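Eqs. (21) and (22) did not survive extraction; a sketch of the integral sliding surface and its derivative, consistent with the coefficients defined above and the tracking error \({{\varvec{e}}_s}\) of Eq. (20), is:

\[
\sigma ={\varvec{K}}{{\varvec{e}}_s}+{\varvec{P}}\int_0^t {{{\varvec{e}}_s}\,d\tau }, \qquad
\dot \sigma ={\varvec{K}}{\dot {\varvec{e}}_s}+{\varvec{P}}{{\varvec{e}}_s}.
\]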

Taking the derivative of Eq. (21) yields

Substituting Eq. (20) into Eq. (22) gives

Choosing the approach law is also an important part of the design. An appropriate approach law with well-adjusted parameters can improve the quality of the phase of reaching the sliding mode surface and reduce system chattering. The constant velocity approach law is selected; its mathematical expression is as follows

Chattering has a significant effect on the performance of the controller, and the degree of impact depends on the application scenario and system characteristics of the sliding mode controller. Chattering causes the output signal of the controller to fluctuate rapidly, so the desired state cannot be accurately tracked and maintained, which affects the long-term stability and reliability of the system. When the chattering frequency is close to the natural frequency of the system, it may also cause mechanical vibration, affecting the safety and service life of the equipment. Therefore, the saturation function is used to replace the sign function in the original formula. Its expression is

This is a hard saturation function, a special form of the saturation function, where \(\Delta\) is a positive real number. Increasing \(\Delta\) slows the system's approach to steady state and increases the steady-state error, while decreasing \(\Delta\) leads to large changes in the control quantity; as \(\Delta \to 0\) the function approaches the sign function, causing significant oscillations in the control signal.
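As a minimal sketch (the displayed expression of Eq. (25) did not survive extraction), the hard saturation function described here can be written as:

def sat(sigma: float, delta: float) -> float:
    """Hard saturation: behaves like sign(sigma) outside a boundary layer of width
    delta and is linear (sigma / delta) inside it, which suppresses chattering."""
    if sigma > delta:
        return 1.0
    if sigma < -delta:
        return -1.0
    return sigma / delta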

By combining equations (23) to (25), the control quantity of the integral sliding mode controller for the servo motor of the screw drive shaft in TDMS can be designed as

where \(\eta\) determines the speed of approaching the switching surface. The reaching speed decreases as \(\eta\) decreases, but an excessively large \(\eta\) can cause system oscillations.

The system state equations and controller design forms of the screw drive shaft and nut drive shaft are similar, with only slight differences in modeling assumptions. Therefore, by proving the stability of one of the shaft controllers, it can be extended to the overall system. Taking the screw drive shaft as an example, the stability analysis and proof of the designed controller are as follows.

Theorem 1: By selecting appropriate parameters \(\eta\) and \(\Delta\), the system reaches the sliding surface in finite time and is asymptotically stable, with the sliding surface ultimately stabilizing at zero.

Proof: By substituting Eq. (26) into Eq. (23), it can be obtained that

Further define the expression of the Lyapunov function as

Taking the derivative of Eq. (28) and substituting Eq. (27) into the derived expression, it can be obtained that

To prove the asymptotic stability of the system, it is necessary to prove \(\dot {V} \leqslant 0\). Based on the expression of the saturation function \(sat(\sigma )\), the discussion can be divided into the following three situations.

(1) When \(\sigma >\Delta\), substituting \(sat(\sigma ){\text{=}}1\) into Eq. (29) gives

It is known that only by ensuring \({\varvec{K}}{{\varvec{B}}_1}\eta \sigma \geqslant {\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}\sigma\), we can obtain \(\dot {V} \leqslant 0\). Therefore, the following inequality should be satisfied

According to Lemma 1 and related proofs, it is known that the perturbation error \({{\varvec{e}}_f}\) is bounded. Because \({\varvec{K}}\) and \({{\varvec{B}}_2}\) are constant coefficient matrices, \({\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}\) is bounded. Therefore, there exists a constant \(\varepsilon\) that satisfies \(\varepsilon {\text{=}}su{p_{t>0}}(\left| {{\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}} \right|)\). Designing a suitable and appropriate \(\eta\) can satisfy \({\varvec{K}}{{\varvec{B}}_1}\eta \geqslant \varepsilon\), that is, \(\dot {V} \leqslant 0\).

(2) When \(\sigma < - \Delta\), substituting \(sat(\sigma ){\text{=}} - 1\) into Eq. (29) gives

To satisfy \(\dot {V} \leqslant 0\), simply ensure that the following inequality holds

Designing a suitable and appropriate \(\eta\) can satisfy \({\varvec{K}}{{\varvec{B}}_1}\eta \geqslant \varepsilon {\text{=}}\sup (\left| {{\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}} \right|) \geqslant - {\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}\), that is, \(\dot {V} \leqslant 0\).

(3) When \(- \Delta \leqslant \sigma \leqslant \Delta\), substituting \(sat(\sigma ){\text{=}}{\Delta ^{ - 1}}\sigma\) into Eq. (29), further divide it into the following two situations for argumentation.

① When \(- \Delta \leqslant \sigma \leqslant 0\), we obtain

To satisfy \(\dot {V} \leqslant 0\), simply ensure that the following inequality holds

It is known that \({{\varvec{e}}_f}\) is bounded and \(\Delta\) and \(\sigma\) are real numbers, so there exists \(\eta\) satisfying \({\varvec{K}}{{\varvec{B}}_1}\eta \geqslant \varepsilon \Delta {\left| \sigma \right|^{ - 1}} \geqslant - {\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}\Delta {\left| \sigma \right|^{ - 1}}\). Therefore, \(\dot {V} \leqslant 0\).

② When \(0<\sigma \leqslant \Delta\), we obtain

Because \({{\varvec{e}}_f}\) is bounded, there exists \(\eta\) satisfying \({\varvec{K}}{{\varvec{B}}_1}\eta \geqslant \varepsilon \Delta {\sigma ^{ - 1}} \geqslant {\varvec{K}}{{\varvec{B}}_2}{{\varvec{e}}_f}\Delta {\sigma ^{ - 1}}\). Therefore, \(\dot {V} \leqslant 0\).

In summary, a reasonable design of \(\eta\) can ensure \(\dot {V} \leqslant 0\). So, the sliding surface ultimately stabilizes at zero and the system asymptotically stabilizes. Theorem 1 has been proven.

Theorem 2: The tracking error is bounded and converges to zero after reaching the sliding surface in finite time.

Proof: After the system stabilizes, it remains on the sliding surface, therefore \(\sigma {\text{=}}\dot {\sigma }{\text{=}}0\). Combining Eq. (22), it can be inferred that

Solving the differential equation (37) gives

where \({\varvec{Q}}{\text{=}}{\varvec{P}} \odot {{\varvec{K}}^{ - 1}}\), \({{\varvec{K}}^{ - 1}}=[k_{1}^{{ - 1}},k_{2}^{{ - 1}},k_{3}^{{ - 1}},k_{4}^{{ - 1}}]\). And all elements in \({\varvec{Q}}\) are positive real numbers. This means that the tracking error is bounded.

Therefore, according to Eq. (38), it can be known that when \(t \to \infty\), \({{\varvec{e}}_s} \to 0\). The length of error convergence time is related to the controller parameters, and selecting appropriate parameters can ensure that the tracking error converges to zero within a finite time. Theorem 2 has been proven.

In traditional servo feed systems, the expected displacement and expected speed of the workbench are usually issued as control commands. However, for TDMS, the speed of the workbench is the difference in speed between the screw drive shaft and the nut drive shaft. This requires motion planning in advance based on the displacement and speed of the workbench. Design the expected speed and displacement of the screw drive shaft and the nut drive shaft in advance. Taking the screw drive shaft as an example, in order to calculate the control quantity given in Eq. (26), it is necessary to know the expected value \({{\varvec{x}}_r}\). The expected displacement \({x_{r2}}\) and expected velocity \({\dot {x}_{r2}}\) are known in the early motion planning. However, the expected displacement \({x_{r1}}\) and expected speed \({\dot {x}_{r1}}\) of servo motor for the screw drive shaft cannot be directly determined in motion planning. Therefore, further calculations need to be made based on relevant formulas.

According to Eq. (2), the relationship between the expected values \({x_{r1}}\) and \({x_{r2}}\) is

Performing a Laplace transform on Eq. (39) gives

An inverse Laplace transform of Eq. (40) then yields the expected displacement \({x_{r1}}\) of the screw drive shaft servo motor. When the expected displacement \({x_{r2}}\) of the table is known, the expected displacement \({x_{r1}}\) of the servo motor of the lead screw drive shaft can be obtained from Eq. (40). Substituting this value into the control quantity realizes sliding mode control for the flexible two mass drive model of TDMS.
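Eqs. (39) and (40) did not survive extraction. One plausible form, obtained from the table-side dynamics implied by the state matrix \({\varvec{A}}\) with the interference term neglected for planning purposes (an assumption made here for illustration), is:

\[
{m_{s2}}{\ddot x_{r2}}+({b_{s2}}+{C_s}){\dot x_{r2}}+{k_s}{x_{r2}}={C_s}{\dot x_{r1}}+{k_s}{x_{r1}}
\;\; \Rightarrow \;\;
{X_{r1}}(s)=\frac{{{m_{s2}}{s^2}+({b_{s2}}+{C_s})s+{k_s}}}{{{C_s}s+{k_s}}}{X_{r2}}(s).
\]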

In addition, the controller parameters that need to be tuned include the gain coefficient matrix \({\varvec{L}}\) of GESO, the proportional coefficient \({\varvec{K}}\), the integral coefficient \({\varvec{P}}\), as well as the convergence law parameters \(\eta\) and \(\Delta\). The tuning process of controller parameters is very complex and tedious. Usually, a relatively simple engineering tuning method is used to continuously debug the system parameters to determine a set of control parameters that can meet the expected effect.

However, the proposed controller has more parameters than a PID controller and they are more cumbersome to tune. Therefore, in order to simplify the complex tuning process, the integral coefficient is set to

where \({\varvec{M}}\) is the feedback gain matrix of the system, and satisfies that

The optimal feedback gain matrix \({\varvec{M}}\) is determined based on the principle of quadratic optimal control. In order to reflect the vibration behavior and tracking performance of the system, the cost function J is introduced and defined as

where \({e_{s1}} - {e_{s2}}={x_{s1}} - {x_{s2}}+{x_{r2}} - {x_{r1}}={x_{s1}} - {x_{s2}}+{c_{r1}}\), with \({c_{r1}}{\text{=}}{x_{r2}} - {x_{r1}}\) a constant, and \({e_{s3}} - {e_{s4}}={\dot {x}_{s1}} - {\dot {x}_{s2}}+{\dot {x}_{r2}} - {\dot {x}_{r1}}={\dot {x}_{s1}} - {\dot {x}_{s2}}+{c_{r2}}\), with \({c_{r2}}{\text{=}}{\dot {x}_{r2}} - {\dot {x}_{r1}}\) a constant. The terms \({\dot {x}_{s1}} - {\dot {x}_{s2}}\) and \({x_{s1}} - {x_{s2}}\) are used to reflect the axial vibration of the system, and \({q_5}\) and \({q_6}\) are the suppression weights for axial vibration. In order to effectively suppress vibration, \({q_5}\) and \({q_6}\) should take larger values. Further introducing the control variable u into the cost function gives

where the weight matrix \({\varvec{Q}}=\left[ {\begin{array}{*{20}{c}} {{q_5}+{q_1}}&{ - {q_5}}&0&0 \\ { - {q_5}}&{{q_5}+{q_2}}&0&0 \\ 0&0&{{q_6}+{q_3}}&{ - {q_6}} \\ 0&0&{ - {q_6}}&{{q_6}+{q_4}} \end{array}} \right]\) and W is the weight of the control quantity. By minimizing the cost function, the optimal feedback gain matrix can be obtained. On this basis, only the proportional coefficient needs to be adjusted, and the integral coefficient is then determined from it. This means that by designing and adjusting the proportional coefficient, the tuning of both the proportional and integral coefficients can be completed.
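As a hedged illustration of this quadratic-optimal step, the sketch below computes a feedback gain by solving the continuous-time algebraic Riccati equation with the weight matrix \({\varvec{Q}}\) assembled from \({q_1}, \ldots ,{q_6}\). The numerical weights and the use of scipy are assumptions for illustration; the paper then derives the integral coefficient from the resulting gain via Eq. (41).

import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative weights: q5, q6 taken large to penalize the axial-vibration terms
q1, q2, q3, q4, q5, q6 = 1.0, 1.0, 0.1, 0.1, 50.0, 50.0
Qw = np.array([
    [q5 + q1, -q5,      0.0,      0.0],
    [-q5,      q5 + q2, 0.0,      0.0],
    [0.0,      0.0,     q6 + q3, -q6],
    [0.0,      0.0,    -q6,       q6 + q4],
])
W = np.array([[0.01]])                      # weight of the control quantity (scalar input)

# A and B1 are the screw-drive-shaft matrices assembled in the earlier sketch
P_are = solve_continuous_are(A, B1, Qw, W)  # solves A'P + PA - PB W^-1 B'P + Qw = 0
M = np.linalg.inv(W) @ B1.T @ P_are         # optimal state-feedback gain matrix (1 x 4)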

The control process of the nut drive shaft is roughly similar to that of the screw drive shaft. Based on the above design process, the control model of the two-axis differential micro-feed system can be constructed as shown in Fig. 5.

Control model of TDMS.

The controller proposed in the third section considers friction as a disturbance and estimates it through GESO to achieve friction compensation. That is to say, the observation performance of the observer directly affects the control performance of the sliding mode controller, and a smaller estimation error helps the controller better compensate for nonlinear friction, system uncertainty, and external disturbances. However, the tuning process of the GESO gain coefficient matrix is complex and cumbersome, and the commonly used engineering tuning methods are time-consuming and make it difficult to obtain the optimal parameters. To solve these problems, a deep reinforcement learning method is introduced to achieve dynamic adjustment of the GESO parameters.

To obtain the best result, the dynamic adjustment of the control parameters is best regarded as a continuous action process. The deep deterministic policy gradient34 is commonly used for agents that output continuous actions. It is an off-policy actor-critic algorithm: the policy function based on a neural network corresponds to the actor, and the value function based on a neural network corresponds to the critic.

Deterministic policy refers to the direct output of a definite action a based on the agent’s state s, rather than the probability of the action occurring. Deterministic policy can be expressed as

The actor network can fit the actions in different states, and then output the action \({a_t}\) corresponding to the current state \({s_t}\). Through the method of policy gradient, the action distribution of the actor network can be changed, and its parameters \({\theta ^\mu }\) can be updated. The deterministic policy gradient can be expressed as

where \({\rho ^\mu }\) is the state distribution under the policy used to collect experience.

The critic network fits the value function \(Q(s,a|{\theta ^Q})\) under different states and actions. The parameters \({\theta ^Q}\) of the critic network are updated through the loss function, which is the expected value of the squared TD error. The expression is as follows

where \({\theta ^{{Q^T}}}\) and \({\theta ^{{\mu ^T}}}\) are the parameters of the target network \({Q^T}\) of the critic network and the target network \({\mu ^T}\) of the actor network, respectively, and \(\gamma \in [0,1]\) is the discount factor.
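The displayed loss function did not survive extraction; the standard form consistent with the symbols defined here is:

\[
L({\theta ^Q})=\mathbb{E}\left[ {{{\left( {{y_t} - Q({s_t},{a_t}|{\theta ^Q})} \right)}^2}} \right], \qquad
{y_t}={r_t}+\gamma {Q^T}\left( {{s_{t+1}},{\mu ^T}({s_{t+1}}|{\theta ^{{\mu ^T}}})|{\theta ^{{Q^T}}}} \right).
\]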

Because the critic network whose value function is used to compute the target is itself updated at the same time, neural network training can easily become unstable. Therefore, the soft target update method is used to update the target networks according to the following formula

where \(\tau \in (0,1]\) is the soft update factor, usually a value much smaller than 1.

In addition, to reduce the overestimation of the value function, two independent critic networks are used; they have the same structure but are trained independently of each other. When estimating the Q value, two values are generated and the smaller one is selected as the current estimate, forming the double critic deep deterministic policy gradient (DCDDPG). The adopted DCDDPG architecture is shown in Fig. 6. The buffer in the figure stores a certain amount of data \(({s_t},{a_t},{r_t},{s_{t+1}})\) sampled from the environment, and the agent randomly selects a small batch of samples that satisfy the independent and identically distributed condition from the buffer to update the networks.
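A minimal PyTorch-style sketch of the double-critic target computation and the soft target update described above (the network objects and hyperparameters are assumed to exist; this is an illustration, not the authors' implementation):

import torch
import torch.nn.functional as F

def td_target(reward, next_state, actor_t, critic1_t, critic2_t, gamma=0.99):
    """Double-critic target: use the smaller of the two target-critic estimates."""
    with torch.no_grad():
        next_action = actor_t(next_state)
        q1 = critic1_t(next_state, next_action)
        q2 = critic2_t(next_state, next_action)
        return reward + gamma * torch.min(q1, q2)

def critic_loss(critic, state, action, target):
    """Mean-squared TD error, used to update each critic independently."""
    return F.mse_loss(critic(state, action), target)

def soft_update(net, target_net, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target."""
    for p, p_t in zip(net.parameters(), target_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)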

Frame diagram of DCDDPG.

Furthermore, the state s is the information from which the agent generates the optimal control parameters of GESO. The current state \({s_t}\) includes the expected output of the screw drive shaft \({x_{r2}}\), the linear displacement observation error \({e_1}\), and the rotational displacement observation error \({e_2}\). The current action \({a_t}\) is the value of the gain coefficient matrix of GESO. However, the gain coefficient matrix contains 12 parameter values, which makes the learning process complex and time-consuming and makes it difficult to ensure that the parameter values meet the requirements of Lemma 1, resulting in non-convergence of the observation errors. In addition, during the research, engineering tuning was used to explore the influence of each parameter on the performance of the observer. It was found that most parameters mainly affect whether the error converges and the response speed of the observer, with a weak impact on the size of the observation error, whereas \({l_{52}}\) and \({l_{62}}\) have a significant regulatory effect on the size of the error. Therefore, before introducing reinforcement learning, a gain coefficient matrix that keeps the observer error bounded is first determined through engineering tuning. Then the upper and lower bounds of \({l_{52}}\) and \({l_{62}}\) within which the matrix still ensures convergence of the observation errors are determined; when \({l_{52}}\) and \({l_{62}}\) exceed this range, the observer error is no longer bounded. Finally, \({l_{52}}\) and \({l_{62}}\) are selected as the action outputs of the deep reinforcement learning agent.

The reward r is generally a function of the state and action, and is usually set as the negative absolute value of the error. With such a reward, as the error decreases the agent no longer receives a significant additional reward, so the learning process converges before the error reaches the ideal state. Therefore, for actions that significantly reduce the error, gradually increasing rewards are introduced as an incentive to guide the agent to explore the ideal states and actions faster, so that the error converges quickly toward smaller values. The reward r is defined as

where \({e_f}{\text{=}}\left| {{e_{f1}}} \right|+\left| {{e_{f2}}} \right|+\left| {{e_1}} \right|+\left| {{e_2}} \right|\); the sum of the absolute values of the observation errors is used as the variable of the reward function. In addition, \({k_{r1}}<0\) and \({c_{r3}}>{c_{r2}}>{c_{r1}}>0\). This means that the smaller the error, the greater the incentive given to the agent.
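A sketch of such a stepwise reward is given below. The threshold values are assumptions for illustration only (the extracted text does not give them); as stated above, \({k_{r1}}<0\) and \(0<{c_{r1}}<{c_{r2}}<{c_{r3}}\), with \({c_{r1}}=1\), \({c_{r2}}=5\), \({c_{r3}}=10\) used later in training.

def reward(e_f1, e_f2, e_1, e_2,
           k_r1=-1.0, c_r1=1.0, c_r2=5.0, c_r3=10.0,
           th1=0.02, th2=0.01, th3=0.005):
    """Penalize the summed observation error and add stepwise bonuses as it shrinks.
    The thresholds th1 > th2 > th3 are illustrative assumptions."""
    e_f = abs(e_f1) + abs(e_f2) + abs(e_1) + abs(e_2)
    r = k_r1 * e_f
    if e_f < th1:
        r += c_r1
    if e_f < th2:
        r += c_r2
    if e_f < th3:
        r += c_r3
    return r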

For the actor network and the critic networks, the main and target networks have the same structural form, as shown in Tables 1 and 2. The critic network takes the state and the action as inputs, each passing through a first hidden layer of size 32; the two branches are merged in a second hidden layer of size 64, and after a third hidden layer of size 64 a single value, Q, is output. All three hidden layers use rectified linear activation functions. The actor network takes only the state as input, has two hidden layers of sizes 32 and 16, and then outputs the action; its activation functions are hyperbolic tangent functions.
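One plausible reading of this architecture in PyTorch is sketched below; here the state is the three quantities \([{x_{r2}},{e_1},{e_2}]\) and the action the two gains \([{l_{52}},{l_{62}}]\), and the scaling of the actor output to the admissible gain range is an assumption.

import torch
import torch.nn as nn

class Critic(nn.Module):
    """State and action branches of size 32, merged into 64, a further 64 layer, then Q."""
    def __init__(self, state_dim=3, action_dim=2):
        super().__init__()
        self.state_fc = nn.Linear(state_dim, 32)
        self.action_fc = nn.Linear(action_dim, 32)
        self.merge_fc = nn.Linear(64, 64)
        self.hidden_fc = nn.Linear(64, 64)
        self.out_fc = nn.Linear(64, 1)
        self.relu = nn.ReLU()

    def forward(self, s, a):
        x = torch.cat([self.relu(self.state_fc(s)), self.relu(self.action_fc(a))], dim=-1)
        x = self.relu(self.merge_fc(x))
        x = self.relu(self.hidden_fc(x))
        return self.out_fc(x)

class Actor(nn.Module):
    """State -> 32 -> 16 -> action, hyperbolic tangent activations."""
    def __init__(self, state_dim=3, action_dim=2, action_scale=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32), nn.Tanh(),
            nn.Linear(32, 16), nn.Tanh(),
            nn.Linear(16, action_dim), nn.Tanh(),
        )
        self.action_scale = action_scale  # maps the tanh output to the l52 / l62 range

    def forward(self, s):
        return self.action_scale * self.net(s)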

Based on the above research content, a sliding mode controller based on a generalized extended state observer with the double critic deep deterministic policy gradient algorithm is established. The specific algorithm process is as follows

Algorithm 1 Algorithm to train the DCDDPG agent

1: Initialize two critic networks \({Q_1}\), \({Q_2}\) and actor network \(\mu\) using random parameters \({\theta ^{{Q_1}}}\), \({\theta ^{{Q_2}}}\), and \({\theta ^\mu }\).

2: Initialize target network, \({\theta ^{Q_{1}^{T}}} \leftarrow {\theta ^{{Q_1}}}\), \({\theta ^{Q_{2}^{T}}} \leftarrow {\theta ^{{Q_2}}}\), \({\theta ^{{\mu ^T}}} \leftarrow {\theta ^\mu }\).

3: Initialize the size of the experience replay buffer, noise variance, decay rate, discount factor for rewards, soft update factor, and batch size.

4: Initialize the two-axis differential micro-feed system and parameters of the control model.

5: for each episode step = 1, 2, …, \({E_{\hbox{max} }}\), do

6: Set the expected speed and displacement of the TDMS model, as well as matched interference and mismatched interference.

7: for i= 1, 2, …, \({i_{\hbox{max} }}\), do

8: Receive the current state \({s_t}\) from the environment and select action \({a_t}\) based on the actor network.

9: Call TDMS control model and apply actions to GESO of control model.

10: Observe reward \({r_t}\) and the next state \({s_{t+1}}\).

11: Storing historical experience \(({s_t},{a_t},{r_t},{s_{t+1}})\) in experience replay buffer.

12: Select a small batch of samples in the playback buffer.

13: Calculate \(Q_{1}^{T}\), \(Q_{2}^{T}\), and select the smaller value as \({Q^T}\). Calculate \(Q_{1}^{{}}\)and \(Q_{2}^{{}}\), and select the smaller value as Q.

14: Insert \(Q_{1}^{{}}\), \(Q_{2}^{{}}\) and \({Q^T}\) into the loss function L, and update the two critic network parameters \({\theta ^{{Q_1}}}\) and \({\theta ^{{Q_2}}}\) respectively.

15: Update the parameter \({\theta ^\mu }\) of the actor network with \(\frac{1}{N}\sum\limits_{i} {{\nabla _a}Q({s_i},\mu ({s_i}|{\theta ^\mu })|{\theta ^Q})} \cdot {\nabla _{{\theta ^\mu }}}\mu ({s_i}|{\theta ^\mu })\).

16: Update the target network, \({\theta ^{{Q_1}^{T}}} \leftarrow \tau {\theta ^{{Q_1}}}+(1 - \tau ){\theta ^{{Q_1}^{T}}}\),\({\theta ^{{Q_2}^{T}}} \leftarrow \tau {\theta ^{{Q_2}}}+(1 - \tau ){\theta ^{{Q_2}^{T}}}\), \({\theta ^{{\mu ^T}}} \leftarrow \tau {\theta ^\mu }+(1 - \tau ){\theta ^{{\mu ^T}}}\).

17: end for

18: end for

The DCDDPG agent designed above is trained continuously, and during the training process the input interference is known. First, the parameters of the critic networks and the actor network are initialized and the iterative update is started, and the two-axis differential micro-feed system is operated. In each control cycle, the agent outputs the action corresponding to the current state according to the actor network of the current iteration, that is, the gain coefficients of the GESO. Based on these output gain coefficients, GESO observes the interference, displacement, and velocity at the current moment; comparing them with the actual values gives the observation error. The error is then substituted into the designed reward function, and the corresponding reward value in the current state is calculated. On this basis, the parameters of the critic networks are updated according to the loss function, and the parameters of the actor network are further updated according to the designed deterministic policy gradient expression (46). In the next iteration, the above process is repeated. The critic networks evaluate the output of the actor network: the larger the value function, the more the output action is in line with expectations. The actor network is therefore updated in the direction of increasing the value function of the agent, that is, in the direction of decreasing observation error. This means that the gain coefficients output by the actor network after each update make the observation error of the GESO gradually smaller. After continuous iterative training, the value function gradually increases and converges. At this point, the gain coefficients output by the agent are the optimal coefficients that maximize the reward value, maximize the value function, and minimize the loss function. The GESO with these optimal gain coefficients minimizes the observation error, thereby realizing the reinforcement learning parameter tuning of GESO.

Firstly, the performance of the generalized extended state observer with the double critic deep deterministic policy gradient algorithm (DCDDPG_GESO) for dynamic parameter self-tuning is verified, and it is compared with the GESO tuned by the DQN35 algorithm (DQN_GESO) and with the GESO without reinforcement learning. The parameters used during the DCDDPG training process are shown in Table 3. In order to accelerate the convergence of reinforcement learning, the parameter values of the reward function are \({c_{r1}}{\text{=1}}\), \({c_{r2}}{\text{=5}}\), \({c_{r3}}{\text{=10}}\). By increasing rewards in a stepwise manner, the agent is encouraged to explore and learn, rather than receiving a reward only when the expected error is achieved, which leads to faster convergence.

The control model parameters of TDMS are shown in Table 4.

Training process of DCDDPG.

Figure 7 shows the training process of DCDDPG. In the early stages of training, due to randomly initializing the parameters of the actor network, the agent has a certain degree of exploratory ability. The amplitude of the episode reward oscillation is large. Under the stimulation of the reward function, the agent quickly learns actions that can obtain higher reward and obtains better strategies, but the exploratory ability will decrease. With further iteration and learning, the average reward gradually converges to the maximum value.

The output of the DCDDPG agent is the final optimization result of the parameters \({l_{52}}\) and \({l_{62}}\) in the gain coefficient matrix. After the average reward converges, the output of the agent is shown in Fig. 8. The agent can dynamically output the optimal result according to the current state.

The optimized output of DCDDPG.

Observation results of sinusoidal interference signals from DCDDPG_GESO.

Figure 9 shows the results of DCDDPG_GESO for matched interference and mismatched interference. The two types of input interference are sinusoidal reference signals with different amplitudes and frequencies: the amplitude of the matched interference is 1.2 N·m, and the amplitude of the mismatched interference is 0.8 N·m. It can be seen from the figure that DCDDPG_GESO can effectively observe the matched and mismatched interference simultaneously in the two channels, and the observed values are highly consistent with the reference values. Figure 10 shows the sinusoidal disturbance observation errors of the three observers. Figure 10a shows the observation errors of the matched interference. The maximum observation error of GESO is 0.022 N·m, and the maximum observation error of DQN_GESO is 0.012 N·m. Even after the system stabilized, there were still small fluctuations in the error at 3 s and 6.5 s. This is because DQN discretizes the actions into n values within a range, resulting in significant changes in the output actions and fluctuations in the errors. This can be alleviated by increasing the learning time and increasing n to further refine the action values, but that consumes more time for exploratory learning. In contrast, DCDDPG can learn continuous actions, and the maximum error of DCDDPG_GESO is only 0.007 N·m. Figure 10b shows the observation error of the mismatched interference. DCDDPG_GESO again has a lower observation error, stable within ± 0.0075 N·m, an improvement of 68.2% over the observation error of GESO. It is evident that DCDDPG_GESO achieves better observation results for both matched and mismatched interference in the two channels.

Observation error of sinusoidal interference signal: (a) Matched interference signal, (b) Mismatched interference signal.

Figure 11 shows the step-interference observation results of the three observers. Figure 11a shows the matched-interference response for a reference signal of 1.2 N·m. All observers can stably track the given signal, but their transient responses differ greatly. GESO responds quickly but overshoots considerably and takes a long time to settle. DQN_GESO achieves the same response speed and settling time as GESO with a smaller overshoot. Although DCDDPG_GESO responds slightly more slowly, its overshoot is small and it reaches steady state in a shorter time. Figure 11b shows the observation error for matched interference: the error of GESO settles at −0.018 N·m, that of DQN_GESO at 0.008 N·m, and that of DCDDPG_GESO at 0.001 N·m. Figure 11c shows the response to mismatched interference, and Fig. 11d the corresponding observation error. DCDDPG_GESO again has better transient performance and a smaller observation error, reduced by 94.4% compared with GESO.

Observation results of step signal: (a) Response to matched interference, (b) Observation error of matched interference, (c) Response to mismatched interference, (d) Observation error of mismatched interference.

The main purpose of GESO is to observe friction, which is treated as a disturbance, so that friction compensation and high-precision motion tracking control can be achieved. Friction is therefore further used as an interference signal to verify the performance of GESO. A LuGre friction model is established for TDMS7; it can accurately express complex and variable friction characteristics of the mechanical system such as pre-sliding displacement and the Stribeck effect. The friction parameters are listed in Table 5, and the general form of the model is sketched below.
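For reference, the standard LuGre model takes the form below. The parameter values here are placeholders introduced purely for illustration; the identified values used in the paper are those of Table 5 and are not reproduced.

```python
import numpy as np

# Standard LuGre friction model (parameter values are placeholders, not Table 5):
#   dz/dt = v - |v| * z / g(v),  g(v) = (Fc + (Fs - Fc) * exp(-(v/vs)**2)) / sigma0
#   F     = sigma0 * z + sigma1 * dz/dt + sigma2 * v
sigma0, sigma1, sigma2 = 1.0e5, 300.0, 0.4   # bristle stiffness, bristle damping, viscous coefficient
Fc, Fs, vs = 0.15, 0.25, 0.001               # Coulomb level, static level, Stribeck velocity

def lugre_step(z, v, dt):
    """Advance the internal bristle state z by one step and return (z_next, friction force)."""
    g = (Fc + (Fs - Fc) * np.exp(-(v / vs) ** 2)) / sigma0
    dz = v - abs(v) * z / g
    z_next = z + dz * dt
    F = sigma0 * z_next + sigma1 * dz + sigma2 * v
    return z_next, F
```

The \(g(v)\) term produces the Stribeck behavior discussed next: near zero velocity the friction first decreases and then increases as speed rises.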

Taking the screw drive shaft as an example, friction is input as an interference signal into the control model. The expected displacement of the screw drive shaft is set to \({x_s}=9\sin (0.4\pi t)\), in millimeters. Figure 12 shows the observation results. Figure 12a shows the friction-interference response. According to the Stribeck effect in the LuGre friction model, in the motion stage where the velocity is close to zero the friction first decreases and then increases as the velocity rises, owing to nonlinear friction. The magnified inset in Fig. 12a is consistent with this behavior. In the Stribeck-effect stage the three observers produce significant observation errors; once the velocity exceeds this stage, all three observers track the friction well. As shown in Fig. 12b, the observation error of GESO for nonlinear friction in the Stribeck-effect stage is relatively large, reaching 0.023 N·m. The error of DQN_GESO in this stage is significantly reduced, to about 0.011 N·m, and that of DCDDPG_GESO falls further, to only 0.004 N·m, an 82.6% reduction compared with GESO, indicating that DCDDPG_GESO observes nonlinear friction better. Beyond the Stribeck stage, DCDDPG_GESO has the smallest friction observation error of all the observers, only 1.6 × 10−5 N·m.

The observation error of DCDDPG_GESO in the Stribeck-effect stage is significantly larger than in the other stages. For TDMS, however, the low-speed motion of the workbench is synthesized from the motions of the screw drive shaft and the nut drive shaft, both of which operate in the higher-speed range and therefore avoid the Stribeck effect. In other words, during motion nonlinear friction does not arise in the two drive shafts; it exists only at the guide rail of the workbench, where it is much smaller than at the drive shafts. Combining this with the accurate observation provided by DCDDPG_GESO effectively reduces the interference of friction over the whole motion process, which is very beneficial for TDMS to achieve high-precision feed motion.

Next, the proposed sliding mode controller based on DCDDPG_GESO, hereafter abbreviated as the sliding mode controller based on a reinforcement learning observer (RLOSMC), is compared with a common PID controller with friction feedforward compensation (PID + FC)36 and an adaptive friction compensation controller (AFCC)28.

Results of friction observation: (a) Response of friction interference, (b) Observation error of friction interference.

The expected speed of the screw drive shaft is \({v_{rs}}=5+3\sin (0.4\pi t)\) and the expected speed of the nut drive shaft is \({v_{rn}}= - 5\); the expected speed of the synthesized TDMS workbench is therefore \({v_r}=3\sin (0.4\pi t)\), in mm/s. Figure 13 shows the position tracking error curves of the workbench under the three controllers. The maximum tracking error of PID + FC reaches 13 μm, the tracking error of AFCC stays within ± 5.5 μm, and the tracking error of RLOSMC is stable within ± 0.9 μm. The simulation thus provides preliminary verification that the proposed RLOSMC achieves better position tracking than the other two control methods.

Position tracking error of workbench.

To further verify the position tracking performance of the proposed controller, experiments were conducted on the TDMS test rig shown in Fig. 14. It consists mainly of the TDMS, servo drivers, a control circuit, an industrial computer, and a motion-control host computer. An incremental grating scale with a resolution of 10 nm is used for position feedback and measurement.

Parameter selection is an involved process that requires repeated tuning to achieve good control performance. Firstly, the proportional coefficients \({\varvec{K}}\) and integral coefficients \({\varvec{P}}\) are determined with the simplified tuning method proposed in Sect. 3.4. In the weighting matrix, the values of q5 and q6 are 1000 and the remaining values are 10. From Eq. (41), the integral coefficients follow once the proportional coefficients are determined. Secondly, the controller parameters are fine-tuned by engineering tuning. The final parameters are \({\varvec{K}}=[0.5,2,15,30]\), \({\varvec{P}}=[7.97 \times {10^5},7.97 \times {10^5},887,1261]\), \(\Delta =0.5\), \(\eta =20\). A minimal example of the quadratic-optimal step is sketched below.
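A minimal sketch of the quadratic-optimal step, assuming a linearized state-space model (A, B) of the drive is available: the matrices below are placeholders rather than the identified FTMDM parameters, and the mapping from the resulting Riccati gain to \({\varvec{K}}\) and \({\varvec{P}}\) follows the paper's Eq. (41), which is not reproduced here.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder linearized model (NOT the identified FTMDM matrices).
A = np.array([[0.0, 1.0],
              [0.0, -5.0]])
B = np.array([[0.0],
              [2.0]])

# Weighting matrices: heavier weights on the states whose tracking accuracy matters
# most (cf. q5 = q6 = 1000 in the text), light weights elsewhere.
Q = np.diag([1000.0, 10.0])
R = np.array([[1.0]])

# Solve the continuous-time algebraic Riccati equation and form the optimal gain,
# which serves as the starting point for the controller coefficients before fine-tuning.
P_are = solve_continuous_are(A, B, Q, R)
K_lqr = np.linalg.inv(R) @ B.T @ P_are
print("LQR gain used as the tuning starting point:", K_lqr)
```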

Schematic diagram of experimental device for TDMS.

Firstly, the stability of the TDMS workbench in a quasi-static state is verified. The quasi-static state refers to the situation in which the speeds of the screw drive shaft and the nut drive shaft are equal but their directions of motion are opposite, so the synthesized speed and displacement of the workbench are zero. During TDMS operation the two drive shafts must first be accelerated synchronously to a reference speed above the Stribeck region; the screw drive shaft then accelerates or decelerates to realize the target speed of the workbench, ensuring that the drive shafts are not affected by nonlinear friction. The stability of the workbench in the quasi-static state therefore has a significant impact on the tracking performance of TDMS.

The equivalent linear speed of the screw drive shaft is 5 mm/s, that of the nut drive shaft is −5 mm/s, and the expected speed of the synthesized workbench is 0 mm/s. The position error curve of the workbench in the quasi-static state is shown in Fig. 15. Under PID + FC control, TDMS can barely maintain a relatively static state, and the position error of the workbench fluctuates within 3 μm. Under AFCC control the position error is stable at 1.3 μm ~ 1.5 μm, a reduction of about 50%. With RLOSMC the position error in the quasi-static state is reduced further and is maintained at 0.8 μm ~ 1 μm. The workbench error under RLOSMC control is thus lower and more stable, which better meets the requirements of the quasi-static state.

Position error of workbench in quasi-static state.

Figure 16 shows the position tracking error of the workbench under the different methods. The equivalent linear velocity of the screw drive shaft is \({v_{rs}}=5+3\sin (0.4\pi t)\) and that of the nut drive shaft is \({v_{rn}}= - 5\), so the expected velocity of the synthesized workbench is \(v=3\sin (0.4\pi t)\), in mm/s. The position tracking error curves under the three controllers are shown in Fig. 16, and the numerical results are listed in Table 6. For PID + FC, the maximum tracking error of the workbench reaches 11.1 μm with an average of 6.8 μm. Feedforward friction compensation relies on accurate identification of the friction model parameters; environmental changes such as temperature, humidity, lubrication, and wear alter these parameters and degrade the compensation. AFCC improves the position tracking accuracy and has a certain adaptive ability to changes in the friction parameters, but it does not effectively compensate the nonlinear friction generated at the workbench guide rail, so the position error is largest when the workbench speed approaches zero; the error is stable within ± 7.1 μm with an average of 3.6 μm. The proposed RLOSMC shows better position tracking performance, with a position error between −4.5 μm and 2.2 μm and an average of only 2.1 μm, a 69.1% improvement in tracking accuracy over PID + FC. This method does not require precise friction parameter identification to compensate friction, and the error at speeds close to zero is similar to that at other speeds, indicating that nonlinear friction is suppressed more effectively than with the AFCC control method.

Position tracking error of workbench with different methods.

The working mode of TDMS used in the above experiment is the optimal one, but the system has several working modes that realize the motion under different speed combinations. The following experiments verify the superiority of this TDMS working mode.

Table 7 lists four speed combinations. Combination 1 is driven by the screw shaft alone and Combination 2 by the nut shaft alone; these two combinations correspond to the transmission mode of a traditional ball screw pair. Combination 3 is a dual-drive mode with a screw drive shaft speed command of \({v_{rs}}=6\sin (0.4\pi t)\) and a nut drive shaft speed command of \({v_{rn}}= - 3\sin (0.4\pi t)\). Combination 4 is a dual-drive mode with a screw drive shaft speed command of \({v_{rs}}=5+3\sin (0.4\pi t)\) and a nut drive shaft speed command of \({v_{rn}}= - 5\). Under all four speed combinations the expected speed of the workbench is \(v=3\sin (0.4\pi t)\), as the short check below illustrates for the two dual-drive combinations.
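Assuming the workbench velocity is the sum of the two equivalent linear shaft velocities (consistent with the command values quoted above), the two dual-drive combinations can be checked numerically; the snippet below simply verifies that both reproduce the same expected workbench speed of 3 sin(0.4πt) mm/s.

```python
import numpy as np

t = np.linspace(0.0, 10.0, 2001)
v_expected = 3.0 * np.sin(0.4 * np.pi * t)           # expected workbench speed, mm/s

# Combination 3: both shafts follow sinusoidal commands (both pass through speed reversal).
v3 = 6.0 * np.sin(0.4 * np.pi * t) + (-3.0 * np.sin(0.4 * np.pi * t))

# Combination 4: screw shaft offset above the Stribeck region, nut shaft at constant speed.
v4 = (5.0 + 3.0 * np.sin(0.4 * np.pi * t)) + (-5.0)

assert np.allclose(v3, v_expected) and np.allclose(v4, v_expected)
```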

Using the proposed RLOSMC, experiments are carried out under the different speed combinations. Figure 17 shows the position tracking error curves of the workbench under each combination, and Table 8 summarizes the results. The position tracking error of Combination 3 is larger than that of the other combinations, with a maximum of 7.4 μm and an average of 4.1 μm. Because the speed commands of both the screw drive shaft and the nut drive shaft are sinusoidal, both drive shafts and the workbench pass through a speed-reversal stage, where they experience greater nonlinear friction, which seriously degrades the position tracking accuracy. Combinations 1 and 2 are affected only by the nonlinear friction of one drive shaft and of the workbench, so their position tracking errors are relatively small: the maximum error of Combination 1 is 6 μm with an average of 2.8 μm, and the maximum error of Combination 2 is 5.6 μm with an average of 3.4 μm. Combination 4 fully exploits the advantages of TDMS, since neither drive shaft has a speed-reversal stage and nonlinear friction exists only at the workbench. The impact of friction on the system is further reduced, and the position tracking accuracy is significantly better than in the other combinations. This verifies that, in the working mode of speed Combination 4, TDMS has significant advantages over the traditional ball screw transmission mode.

In summary, compared with the other two common controllers, the sliding mode controller based on a generalized extended state observer with the double critic deep deterministic policy gradient algorithm proposed in this paper performs better in the motion control of the two-axis differential micro-feed system. It performs well in the quasi-static state, and the position tracking accuracy is significantly improved.

Position tracking error of workbench with the control of RLOSMC under different speed combinations.

To improve the tracking performance of TDMS, a sliding mode controller based on a generalized extended state observer with a double critic deep deterministic policy gradient algorithm is proposed in this paper. On the basis of the flexible two mass drive model of TDMS, the design of GESO is derived and the observation error of GESO is proved to be bounded. The sliding mode controller is then designed, and Lyapunov analysis verifies that the system is asymptotically stable and the error converges to zero. By combining the principle of quadratic optimal control, the tuning of the controller parameters is simplified. The DCDDPG algorithm is proposed to simplify the tuning of the GESO parameters. Simulations validate that DCDDPG_GESO has better transient response and smaller observation errors for both step and sinusoidal signals, with an 82.6% improvement in observation accuracy for nonlinear friction compared with GESO. Experiments on TDMS show that the proposed controller better maintains stability under quasi-static conditions and achieves higher tracking accuracy for sinusoidal commands. Compared with the adaptive friction compensation algorithm, it further reduces the impact of nonlinear friction on tracking accuracy, with a maximum tracking error of 4.5 μm and an average of only 2.1 μm; the position tracking accuracy is improved by 69.1% compared with the PID algorithm based on friction feedforward compensation.

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Yang, X., Lu, D., Ma, C., Zhang, J. & Zhao, W. Analysis on the multi-dimensional spectrum of the thrust force for the linear motor feed drive system in machine tools. Mech. Syst. Signal. Proc. 82, 68–79 (2017).


Kim, S. & Lee, K. Active second-order pole-zero cancellation control for speed servo systems with current sensor fault tolerance. IEEE Trans. Circuits Syst. II Express Briefs. 70, 2196–2200 (2023).


Zhang, W., Zhang, X. & Zhao, W. Influence of nonlinearity of servo system electrical characteristics on motion smoothness of precision CNC machine tools. Precis. Eng. 83, 82–101 (2023).


Zhang, L. et al. A rapid vibration reduction method for macro–micro composite precision positioning stage. IEEE Trans. Ind. Electron. 64, 401–411 (2017).


Zhang, L., Zhang, P., Jiang, B. & Yan, H. Research trends in methods for controlling macro-micro motion platforms. Nanotechnol. Precis. Eng. 6, 1–15 (2023).

Wang, Z., Feng, X., Du, F., Li, H. & Su, Z. A novel method for smooth low-speed operation of linear feed systems. Precis. Eng. 60, 215–221 (2019).


Du, F. et al. Identification and compensation of friction for a novel two-axis differential micro-feed system. Mech. Syst. Signal. Proc. 106, 453–465 (2018).


Liu, C., Tsai, M., Lin, M. & Tang, P. Novel multi-square-pulse compensation algorithm for reducing quadrant protrusion by injecting signal with optimal waveform. Mech. Mach. Theory. 150, 103875 (2020).


Li, F., Jiang, Y., Li, T. & Ehmann, K. F. Compensation of dynamic mechanical tracking errors in ball screw drives. Mechatronics. 55, 27–37 (2018).


Farrage, A. & Uchiyama, N. Improvement of motion accuracy and energy consumption of a mechanical feed drive system using a fourier series-based nonlinear friction model. Int. J. Adv. Manuf. Technol. 99, 1203–1214 (2018).


Liang, X. et al. A novel steering-by-wire system with road sense adaptive friction compensation. Mech. Syst. Signal. Proc. 169, 108741 (2022).


Zeng, T., Ren, X. & Zhang, Y. Fixed-time sliding mode control and high-gain nonlinearity compensation for dual-motor driving system. IEEE Trans. Ind. Inf. 16, 4090–4098 (2020).


Wan, M., Dai, J., Zhang, W., Xiao, Q. & Qin, X. Adaptive feed-forward friction compensation through developing an asymmetrical dynamic friction model. Mech. Mach. Theory. 170, 104691 (2022).


Sun, Y., Yang, M., Wang, B., Chen, Y. & Xu, D. Precise position control based on resonant controller and second-order sliding mode observer for PMSM-driven feed servo system. IEEE Trans. Transp. Electrif. 9, 196–209 (2023).


Cheng, G. & Yu, W. A universal digital motion controller design for servo positioning mechanisms in industrial manufacturing. Robot. Comput.-Integr. Manuf. 64, 101943 (2020).


Bao, D., Tang, W. & Dong, L. Integral sliding mode control for flexible ball screw drives with matched and mismatched uncertainties and disturbances. J. Cent. South. Univ. 24, 1992–2000 (2017).


Gao, P. et al. Active disturbance rejection control for speed control of PMSM based on auxiliary model and supervisory RBF. Appl. Sci. 12, 10880 (2022).


Lu, E., Li, W., Wang, S., Zhang, W. & Luo, C. Disturbance rejection control for PMSM using integral sliding mode based composite nonlinear feedback control with load observer. ISA Trans. 116, 203–217 (2021).


Zeng, T. et al. An integrated optimal design for guaranteed cost control of motor driving system with uncertainty. IEEE/ASME Trans. Mechatron. 24, 2606–2615 (2019).


Hu, S. et al. Adaptive predefined-time synchronization and tracking control for multimotor driving servo systems. IEEE/ASME Trans. Mechatron. 1–11 (2024).

Li, S. et al. Generalized extended state observer based control for systems with mismatched uncertainties. IEEE Trans. Industr. Electron. 59, 4792–4802 (2012).


Li, S. et al. Sliding mode active disturbance rejection control of permanent magnet synchronous motor based on improved genetic algorithm. Actuators. 12, 209 (2023).

He, J., Su, S., Wang, H., Chen, F. & Yin, B. Online PID tuning strategy for hydraulic servo control systems via SAC-based deep reinforcement learning. Machines. 11, 593 (2023).

Shuprajhaa, T., Sujit, S. K. & Srinivasan, K. Reinforcement learning based adaptive PID controller design for control of linear/nonlinear unstable processes. Appl. Soft Comput. 128, 109450 (2022).


Ding, Y., Ren, X., Zhang, X., Liu, X. & Wang, X. Multi-phase focused PID adaptive tuning with reinforcement learning. Electronics. 12, 3925 (2023).


Zhao, J., Yang, C., Gao, W. & Zhou, L. Reinforcement learning and optimal control of PMSM speed servo system. IEEE Trans. Ind. Electron. 70, 8305–8313 (2023).


Oh, T. et al. Deep RL based notch filter design method for complex industrial servo systems. Int. J. Control Autom. Syst. 18, 2983–2992 (2020).


Song, Z., Yang, J., Mei, X., Tao, T. & Xu, M. Deep reinforcement learning for permanent magnet synchronous motor speed control systems. Neural Comput. Appl. 33, 5409–5418 (2021).


Wang, Y., Shen, H., Wu, J., Yan, H. & Xu, S. Reinforcement-learning-based composite optimal control for looper hydraulic servo systems in hot strip rolling. IEEE/ASME Trans. Mechatron. 28, 2495–2504 (2023).


Yu, H., Feng, X. & Sun, Q. Kinematic analysis and simulation of a new type of differential micro-feed mechanism with friction. Sci. Prog. 103, 39952402 (2020).


Liu, Y. et al. Modeling, identification, and compensation control of friction for a novel dual-drive hydrostatic lead screw micro-feed system. Machines. 10, 914 (2022).

Kamalzadeh, A. & Erkorkmaz, K. Compensation of axial vibrations in ball screw drives. CIRP Ann. 56, 373–378 (2007).


Gao, Z. Active disturbance rejection control: a paradigm shift in feedback control system design. In Proc. 2006 American Control Conference 2399–2405 (IEEE, 2006).

Zhang, Z., Chen, J., Chen, Z. & Li, W. Asynchronous episodic deep deterministic policy gradient: toward continuous control in computationally complex environments. IEEE Trans. Cybern. 51, 604–613 (2021).

Han, Z. et al. Deep forest-based DQN for cooling water system energy saving control in HVAC. Buildings. 12, 1787 (2022).

Lu, Z., Feng, X., Su, Z., Liu, Y. & Yao, M. Friction parameters dynamic change and compensation for a novel dual-drive micro-feeding system. Actuators. 11, 236 (2022).



We are grateful for the financial support from the National Natural Science Foundation of China (Grant No. 51875325) and the Key Research and Development Plan of Shandong Province (2022CXGC010101).

School of Software, Shandong University, Jinan, 250061, China

Anning Wang

School of Mechanical Engineering, Shandong University, Jinan, 250061, China

Xianying Feng, Haiyang Liu & Ming Yao

Key Laboratory of High Efficiency and Clean Mechanical Manufacture of Ministry of Education, Shandong University, Jinan, 250061, China

Anning Wang, Xianying Feng, Haiyang Liu & Ming Yao


A.W. and X.F. wrote the main manuscript text and finished all theoretical and experimental analyses, H.L. and M.Y. gave suggestions for the design of algorithm. All authors reviewed the manuscript.

Correspondence to Xianying Feng.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


Wang, A., Feng, X., Liu, H. et al. Design of sliding mode controller for servo feed system based on generalized extended state observer with reinforcement learning. Sci Rep 14, 24976 (2024). https://doi.org/10.1038/s41598-024-75598-5


Received: 18 December 2023

Accepted: 07 October 2024

Published: 23 October 2024

DOI: https://doi.org/10.1038/s41598-024-75598-5
