
Chapter 1: Fundamental Concepts of PI and Data Utilization in Process Industries

Foundation of Digital Transformation in Process Industries

📖 Reading Time: 20-25 minutes 📊 Difficulty: Introductory 💻 Code Examples: 5

Understand the basic concepts of Process Informatics (PI) and learn about the characteristics of process industries and types of data. Through real improvement cases with data-driven approaches, experience the value of PI firsthand.

Learning Objectives

By reading this chapter, you will master the following:

  • Explain what Process Informatics (PI) is and how it differs from Materials Informatics (MI)
  • Describe the two main process types (continuous and batch) and the sources of process complexity
  • Distinguish the main categories of process data: sensor, operation condition, quality, and event data
  • Appreciate the value of data-driven process improvement through real cases and their ROI
  • Visualize process data in Python with matplotlib, seaborn, and Plotly


1.1 What is Process Informatics (PI)?

Definition of PI

Process Informatics (PI) is an academic field that uses data-driven approaches to understand, optimize, and control processes in the chemical, pharmaceutical, food, semiconductor, and other process industries.

Specifically, PI encompasses five core activities: process data collection and analysis (gathering and analyzing sensor data, operating conditions, and quality data), process modeling (constructing quality prediction models using machine learning and soft sensors), process optimization (improving yield, reducing energy consumption, and enhancing quality through optimal conditions), process control (building automatic control systems based on real-time data), and anomaly detection (early identification of abnormal states through data analysis).

Difference from Materials Informatics (MI)

PI is often confused with Materials Informatics (MI). Let's clearly understand the difference between them.

| Item | Materials Informatics (MI) | Process Informatics (PI) |
|------|----------------------------|--------------------------|
| Target | Materials themselves (composition, structure, properties) | Manufacturing processes (operating conditions, control, quality) |
| Purpose | Discovery and design of new materials | Process optimization and control |
| Data types | Physical property values, crystal structures, composition data | Time-series sensor data, operating conditions |
| Time axis | Static (intrinsic material properties) | Dynamic (changing moment by moment) |
| Main methods | Descriptors, materials databases, screening | Time-series analysis, soft sensors, process control |
| Typical challenges | Search for materials with a bandgap of 2.5 eV | Reduce energy consumption by 10% while maintaining product purity above 98% |

Important point: MI focuses on "what to make," while PI focuses on "how to make it." The two are complementary, collaborating in ways such as efficiently manufacturing new materials discovered by MI using PI.


1.2 Characteristics of Process Industries

To understand PI, you first need to know the characteristics of process industries.

Classification of Processes

Processes in process industries are broadly classified into two types:

| Aspect | Continuous Process | Batch Process |
|--------|--------------------|---------------|
| Characteristics | 24/7, 365-day operation with continuous raw material feed and product extraction | Cyclic operation: raw material input → reaction → product extraction |
| Examples | Oil refining, chemical plants (ethylene), papermaking | Pharmaceutical manufacturing, food processing, fine chemicals |
| Advantages | High productivity, stable quality | Flexible product switching, suitable for small-volume high-variety production |
| Challenges | High shutdown/restart costs, low flexibility | Batch-to-batch variability, lower productivity |

Major Fields of Process Industries

| Industry Field | Main Processes | Typical PI Applications |
|----------------|----------------|-------------------------|
| Chemical | Distillation, reaction, separation | Yield optimization, energy reduction |
| Petrochemical | Refining, cracking | Product quality prediction, equipment anomaly detection |
| Pharmaceutical | Synthesis, crystallization, drying | Batch quality control, accelerated process development |
| Food | Fermentation, sterilization, mixing | Quality consistency improvement, shelf-life prediction |
| Semiconductor | CVD, etching, cleaning | Yield improvement, real-time control |

Process Complexity

Processes in process industries are very complex due to the following characteristics:

  1. Multivariate nature
    • Dozens to hundreds of variables mutually influence each other
    • Example: In a distillation column, temperature (multiple stages), pressure, flow rate, and composition are interconnected
  2. Nonlinearity
    • The relationship between inputs and outputs is not linear
    • Example: Raising reaction temperature by 10°C increases yield by 5%, but raising it another 10°C causes side reactions, reducing yield
  3. Time delay (time lag; illustrated in the sketch after this list)
    • It takes time for the effect of operational changes to appear
    • Example: Changing the reboiler temperature in a distillation column affects the overhead product only after several minutes to tens of minutes
  4. Presence of disturbances
    • Raw material composition variations, ambient temperature changes, etc.
    • These need to be compensated for in real-time
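
The sketch below simulates characteristic 3: a first-order process with dead time responds to a step change only after the delay elapses. The gain, time constant, and dead time are illustrative assumptions, not values from any real plant.

import numpy as np
import matplotlib.pyplot as plt

# Assumed first-order-plus-dead-time (FOPDT) parameters (illustrative only)
K, tau, theta = 1.5, 8.0, 5.0   # gain, time constant (min), dead time (min)
t = np.linspace(0, 60, 601)     # 60 minutes of simulated time
step_time = 10.0                # step change applied at t = 10 min

# Analytical step response: the output starts moving only theta minutes
# after the step, then approaches K with time constant tau
y = np.where(t > step_time + theta,
             K * (1 - np.exp(-(t - step_time - theta) / tau)),
             0.0)

plt.figure(figsize=(8, 3))
plt.axvline(step_time, color='gray', linestyle='--', label='Step applied')
plt.plot(t, y, color='#11998e')
plt.xlabel('Time (min)')
plt.ylabel('Output change')
plt.title('First-Order-Plus-Dead-Time Step Response')
plt.legend()
plt.tight_layout()
plt.show()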

Conclusion: Due to this complexity, data-driven approaches (PI) are effective. Machine learning and data analysis can capture complex relationships that cannot be handled by conventional rules of thumb or simple models.


1.3 Types of Process Data

The data handled in PI falls mainly into three categories, plus auxiliary event data.

1. Sensor Data (Measurement Data)

Data that measures process states in real-time.

Major sensor types:

| Sensor Type | Measurement Target | Typical Measurement Frequency | Usage Examples |
|-------------|--------------------|-------------------------------|----------------|
| Temperature sensor | Process temperature | 1 second to 1 minute | Reaction temperature monitoring, furnace control |
| Pressure sensor | Process pressure | 1 second to 1 minute | Distillation column pressure, reactor pressure |
| Flow meter | Fluid flow rate | 1 second to 1 minute | Raw material feed rate, product extraction rate |
| Level meter | Tank liquid level | 10 seconds to 1 minute | Reactor level, storage tank |
| Concentration meter | Component concentration | 1 minute to 1 hour | Product purity, reaction progress |
| pH meter | pH value | 1 second to 1 minute | Chemical reaction control |

Sensor data is characterized by high-frequency generation (second to minute intervals), accumulation as time-series data, and often contains missing values, outliers, and noise that require preprocessing.
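
As a minimal illustration of that preprocessing, the sketch below fills short gaps by time-based interpolation and suppresses a sensor spike with a rolling median; the data, gap, and spike are synthetic assumptions.

import numpy as np
import pandas as pd

# Synthetic 1-minute temperature signal with a gap and a spike
rng = np.random.default_rng(0)
idx = pd.date_range('2025-01-01', periods=120, freq='1min')
temp = pd.Series(175 + rng.normal(0, 2, 120), index=idx)
temp.iloc[30:33] = np.nan   # simulated missing values
temp.iloc[60] = 250.0       # simulated outlier (sensor spike)

cleaned = (temp
           .interpolate(method='time', limit=5)   # fill short gaps only
           .rolling('5min').median())             # rolling median suppresses spikes

print(f"Missing values before/after: {temp.isna().sum()} / {cleaned.isna().sum()}")
print(f"Maximum before/after: {temp.max():.1f} / {cleaned.max():.1f}")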

2. Operation Condition Data (Setpoints and Control Data)

Values set by humans or DCS (Distributed Control System) to control the process.

Major operation variables include temperature setpoints (reactor and furnace settings), pressure setpoints (distillation column pressure), flow rate setpoints (raw material feed and cooling water), valve openings (0-100%), and agitation speeds (reactor agitator rotation). These variables are directly linked with sensor data for setpoint-vs-measured comparisons, and finding optimal operating conditions is a core objective of PI.
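
The setpoint-versus-measured comparison itself is simple to compute. A minimal sketch on synthetic data follows; the tag names temp_SP and temp_PV are illustrative, not standard DCS tags.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range('2025-01-01', periods=60, freq='1min')
df = pd.DataFrame({
    'temp_SP': 175.0,                          # setpoint from the DCS
    'temp_PV': 175 + rng.normal(0, 1.5, 60),   # measured (process) value
}, index=idx)

# Deviation between measured value and setpoint quantifies control performance
df['deviation'] = df['temp_PV'] - df['temp_SP']
print(f"Mean absolute deviation: {df['deviation'].abs().mean():.2f} °C")
print(f"Fraction of time within ±2 °C of SP: {(df['deviation'].abs() <= 2).mean():.0%}")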

3. Quality Data (Product Properties and Analytical Data)

Data for evaluating product quality. Often obtained through offline analysis.

Major quality indicators:

| Quality Indicator | Measurement Method | Measurement Frequency | Importance |
|-------------------|--------------------|-----------------------|------------|
| Product purity | Gas chromatography (GC) | 1 hour to 1 day | Main item of product specifications |
| Yield | Material balance calculation | Per batch | Economic indicator |
| Viscosity | Viscometer | Several hours to 1 day | Product quality |
| Color | Spectrophotometer | Several hours to 1 day | Product appearance |
| Impurity content | HPLC, GC-MS | 1 day to 1 week | Quality specifications, safety |

Quality data has low measurement frequency (hour to day intervals) and requires significant time and cost. An important PI challenge is building "soft sensors" that predict quality in real-time from sensor data, enabling continuous quality monitoring without expensive offline analysis.
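
To make the soft-sensor idea concrete, here is a minimal sketch on synthetic data: a regression model predicts purity (normally measured offline once a day) from continuously available column temperatures. The temperatures and the assumed linear relationship are illustrative, not a real calibration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 400
# Three column temperatures, continuously measured (synthetic)
tray_temps = rng.normal(loc=[85.0, 95.0, 110.0], scale=2.0, size=(n, 3))
# Assumed relationship: purity falls as the upper-tray temperatures rise
purity = (99.9 - 0.05 * (tray_temps[:, 0] - 85)
               - 0.03 * (tray_temps[:, 1] - 95)
               + rng.normal(0, 0.02, n))

X_train, X_test, y_train, y_test = train_test_split(tray_temps, purity, random_state=0)
soft_sensor = LinearRegression().fit(X_train, y_train)
print(f"R² on held-out data: {r2_score(y_test, soft_sensor.predict(X_test)):.3f}")

# Once trained, the model yields a purity estimate every minute
# instead of once a day from offline analysis
latest_temps = np.array([[85.4, 95.1, 110.3]])
print(f"Predicted purity: {soft_sensor.predict(latest_temps)[0]:.2f}%")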

4. Event Data (Auxiliary Data)

Event data records various occurrences in the process, including alarms (warnings of abnormal conditions such as temperature or pressure anomalies), operation logs (manual records by operators), equipment maintenance records (maintenance history), and batch records (start/end times and raw material information for each batch).


1.4 Case Studies of Data-Driven Process Improvement

Beyond theory, let's look at specific examples of how PI is being utilized in practice.

Case Study 1: Chemical Plant Yield Improvement (5% improvement)

Background:

At a chemical plant, product C was manufactured from raw materials A and B. The theoretical yield was 95%, but the actual yield remained at an average of 85%. Traditionally, veteran operators adjusted reaction temperature and pressure based on experience, but it was not clear why the yield varied.

PI Approach:

  1. Data collection (1 month)
    • Collected 2 years of operational data (temperature, pressure, flow rate, raw material composition)
    • Collected product quality data for the same period (yield, purity)
    • Number of data points: approximately 1 million
  2. Exploratory Data Analysis (EDA)
    • Correlation analysis between variables
    • Discovery: not only reaction temperature but also the ratio of raw material A feed rate to reactor agitation speed strongly affected yield
    • This was a previously unknown relationship
  3. Machine learning model construction (see the sketch after this list)
    • Built a yield prediction model using Random Forest
    • Prediction accuracy: R² = 0.82
    • Identified important operation variables through feature importance analysis
  4. Optimization and implementation
    • Searched for optimal operating conditions based on the model
    • Optimal conditions: reaction temperature 175°C, raw material A flow rate 2.5 m³/h, agitation speed 300 rpm
    • Conducted trial operation in actual plant
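
A minimal sketch of step 3 follows, with synthetic stand-ins for the plant data; the variable names and the assumed yield relationship (including the feed-rate-to-agitation ratio found in step 2) are illustrative.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    'reaction_temp': rng.normal(175, 5, n),      # °C
    'feed_rate_A':   rng.normal(2.5, 0.3, n),    # m³/h
    'agitation_rpm': rng.normal(300, 20, n),     # rpm
    'pressure':      rng.normal(1.5, 0.1, n),    # MPa
})
# Hypothetical ground truth: yield depends on temperature and, crucially,
# on the feed-rate-to-agitation ratio (the "hidden" relationship)
ratio = X['feed_rate_A'] / X['agitation_rpm']
y = (85 + 0.3 * (X['reaction_temp'] - 175)
        - 4000 * np.abs(ratio - 2.5 / 300)
        + rng.normal(0, 1.0, n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print(f"R² on held-out data: {r2_score(y_test, model.predict(X_test)):.2f}")
print("Feature importances:")
for name, imp in sorted(zip(X.columns, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"  {name}: {imp:.3f}")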

Results:

Trial operation under the optimized conditions raised the average yield from 85% to about 90%, a 5% improvement that translated into an annual profit increase of approximately 500 million yen (see the ROI analysis below).

Case Study 2: Energy Consumption Reduction (15% reduction)

Background:

The distillation column at a petrochemical plant used large amounts of steam to refine products. Energy costs were over 1 billion yen per year, a major business challenge.

PI Approach:

  1. Energy consumption analysis
    • Collected data on reboiler heat duty and product quality in the distillation column
    • Discovery: the column was operated at an unnecessarily high reflux ratio for long stretches of time
    • Target product purity was 99.5%, but actual operation hovered around 99.8% (quality giveaway: purifying beyond specification wastes energy)
  2. Soft sensor construction
    • Purpose: Real-time prediction of overhead product purity
    • Conventional: Purity could only be measured once a day by GC analysis
    • Built PLS model to predict purity from temperature profile
    • Prediction accuracy: R² = 0.88
  3. Optimal control (see the sketch after this list)
    • Dynamically adjusted reflux ratio using soft sensor predictions
    • Minimized energy usage while maintaining product purity at 99.5-99.6%
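
A minimal sketch of the logic behind step 3: pick the lowest reflux ratio that still meets the purity target, since lower reflux means less reboiler duty. The purity model below is an assumed monotonic stand-in for the plant's PLS soft sensor.

import numpy as np

def predicted_purity(reflux_ratio):
    """Assumed purity model: higher reflux gives higher purity, saturating."""
    return 99.9 - 1.2 * np.exp(-1.8 * (reflux_ratio - 1.0))

target = 99.5
candidates = np.linspace(1.5, 3.5, 201)                      # feasible reflux ratios
feasible = candidates[predicted_purity(candidates) >= target]
best = feasible.min()   # lowest feasible reflux ratio = least energy

print(f"Minimum reflux ratio meeting {target}% purity: {best:.2f}")
print(f"Predicted purity at that ratio: {predicted_purity(best):.2f}%")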

Results:

Dynamic reflux-ratio control cut steam consumption by 15% while holding product purity within the 99.5-99.6% band, delivering annual cost savings of approximately 150 million yen.

ROI (Return on Investment) Analysis

Let's look at the typical ROI of PI projects.

| Item | Case Study 1 (Yield improvement) | Case Study 2 (Energy reduction) |
|------|----------------------------------|---------------------------------|
| Initial investment | Data infrastructure: 20 million yen; Model development: 10 million yen | Soft sensor development: 15 million yen; Control system modification: 30 million yen |
| Annual benefit | 500 million yen/year (profit increase) | 150 million yen/year (cost reduction) |
| ROI | 1,567% (first year) | 233% (first year) |
| Payback period | Approximately 2 months | Approximately 4 months |
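
The first-year ROI figures in the table are consistent with the simple formula ROI = (annual benefit - initial investment) / initial investment, as the short check below confirms (amounts in millions of yen).

# First-year ROI = (annual benefit - initial investment) / initial investment
cases = {
    'Case Study 1 (yield improvement)': {'investment': 30, 'benefit': 500},
    'Case Study 2 (energy reduction)':  {'investment': 45, 'benefit': 150},
}
for name, c in cases.items():
    roi = (c['benefit'] - c['investment']) / c['investment'] * 100
    print(f"{name}: first-year ROI = {roi:,.0f}%")   # 1,567% and 233%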

Conclusion: When properly implemented, PI projects can achieve extremely high ROI.


1.5 Introduction to Process Data Visualization with Python

Now, let's actually visualize process data using Python. Through the following five code examples, you'll acquire fundamental PI skills.

Environment Setup

First, install the necessary libraries.

# Install required libraries
pip install pandas matplotlib seaborn numpy plotly

Code Example 1: Time-Series Sensor Data Visualization

Plot time-series temperature data from a chemical reactor.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: Plot time-series temperature data from a chemical reactor.

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data: 1 day of reactor temperature data (1-minute intervals)
np.random.seed(42)
time_points = pd.date_range('2025-01-01 00:00', periods=1440, freq='1min')
temperature = 175 + np.random.normal(0, 2, 1440) + \
              5 * np.sin(np.linspace(0, 4*np.pi, 1440))  # Simulate temperature fluctuation

# Store in DataFrame
df = pd.DataFrame({
    'timestamp': time_points,
    'temperature': temperature
})

# Visualization
plt.figure(figsize=(14, 5))
plt.plot(df['timestamp'], df['temperature'], linewidth=0.8, color='#11998e')
plt.axhline(y=175, color='red', linestyle='--', label='Target: 175°C')
plt.fill_between(df['timestamp'], 173, 177, alpha=0.2, color='green', label='Acceptable range')
plt.xlabel('Time', fontsize=12)
plt.ylabel('Temperature (°C)', fontsize=12)
plt.title('Reactor Temperature - 24 Hour Trend', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate statistics
print(f"Average temperature: {df['temperature'].mean():.2f}°C")
print(f"Standard deviation: {df['temperature'].std():.2f}°C")
print(f"Maximum temperature: {df['temperature'].max():.2f}°C")
print(f"Minimum temperature: {df['temperature'].min():.2f}°C")

Output example:

Average temperature: 174.98°C
Standard deviation: 3.45°C
Maximum temperature: 183.12°C
Minimum temperature: 166.54°C

Explanation: In this example, we plot time-series data of reactor temperature. By displaying the target temperature (175°C) and acceptable range (173-177°C), we can visually confirm the process stability.

Code Example 2: Simultaneous Plotting of Multiple Sensors

Display distillation column temperature, pressure, and flow rate simultaneously.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0

"""
Example: Display distillation column temperature, pressure, and flow rate simultaneously.

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
time_points = pd.date_range('2025-01-01 00:00', periods=1440, freq='1min')

df = pd.DataFrame({
    'timestamp': time_points,
    'temperature': 85 + np.random.normal(0, 1.5, 1440),
    'pressure': 1.2 + np.random.normal(0, 0.05, 1440),
    'flow_rate': 50 + np.random.normal(0, 3, 1440)
})

# Display in 3 subplots
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Temperature
axes[0].plot(df['timestamp'], df['temperature'], color='#11998e', linewidth=0.8)
axes[0].set_ylabel('Temperature (°C)', fontsize=11)
axes[0].set_title('Distillation Column - Multi-Sensor Data', fontsize=14, fontweight='bold')
axes[0].grid(alpha=0.3)

# Pressure
axes[1].plot(df['timestamp'], df['pressure'], color='#f59e0b', linewidth=0.8)
axes[1].set_ylabel('Pressure (MPa)', fontsize=11)
axes[1].grid(alpha=0.3)

# Flow rate
axes[2].plot(df['timestamp'], df['flow_rate'], color='#7b2cbf', linewidth=0.8)
axes[2].set_ylabel('Flow Rate (m³/h)', fontsize=11)
axes[2].set_xlabel('Time', fontsize=12)
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation between variables
print("Correlation coefficients between variables:")
print(df[['temperature', 'pressure', 'flow_rate']].corr())

Explanation: By displaying multiple sensor data on the same time axis, you can visually grasp relationships between variables and anomaly patterns. Correlation coefficients also confirm quantitative relationships between variables.

Code Example 3: Correlation Matrix Heatmap

Grasp correlation relationships among multiple variables at a glance.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0

"""
Example: Grasp correlation relationships among multiple variables at a glance.

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data: 8 variables in distillation column
np.random.seed(42)
n = 500

df = pd.DataFrame({
    'Feed_Temp': np.random.normal(60, 5, n),
    'Reflux_Ratio': np.random.uniform(1.5, 3.5, n),
    'Reboiler_Duty': np.random.normal(1500, 200, n),
    'Top_Temp': np.random.normal(85, 3, n),
    'Bottom_Temp': np.random.normal(155, 5, n),
    'Pressure': np.random.normal(1.2, 0.1, n),
    'Purity': np.random.uniform(95, 99.5, n),
    'Yield': np.random.uniform(85, 95, n)
})

# Adjust data based on correlations (create realistic correlations)
df['Top_Temp'] = df['Top_Temp'] + 0.3 * df['Reflux_Ratio']
df['Purity'] = df['Purity'] + 0.5 * df['Reflux_Ratio'] - 0.2 * df['Top_Temp']
df['Yield'] = df['Yield'] + 0.3 * df['Reboiler_Duty'] / 100

# Calculate correlation matrix
corr = df.corr()

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Distillation Column Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Display strongly correlated pairs
print("\nStrongly correlated variable pairs (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"{corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i, j]:.3f}")

Explanation: The correlation matrix heatmap allows you to visually grasp which variables are strongly related to each other. This is useful for preprocessing and feature selection in modeling.

Code Example 4: Scatter Plot Matrix (Pair Plot)

Observe relationships between variables in detail with scatter plots.

# Requirements:
# - Python 3.9+
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - seaborn>=0.12.0

"""
Example: Observe relationships between variables in detail with scatter plots.

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
n = 300

df = pd.DataFrame({
    'Temperature': np.random.normal(175, 5, n),
    'Pressure': np.random.normal(1.5, 0.2, n),
    'Flow_Rate': np.random.normal(50, 5, n),
    'Yield': np.random.uniform(80, 95, n)
})

# Add relationships (realistic correlations)
df['Yield'] = df['Yield'] + 0.5 * (df['Temperature'] - 175) + 2 * (df['Pressure'] - 1.5)

# Create pair plot
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.6, 'color': '#11998e'},
             diag_kws={'color': '#11998e'})
plt.suptitle('Pairplot - Process Variables vs Yield', y=1.01, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Explanation: The pair plot allows you to check scatter plots of all variable pairs and the distribution of each variable at once. It's effective for detecting nonlinear relationships and outliers.

Code Example 5: Interactive Visualization (Plotly)

Create interactive graphs with zoom and hover display capabilities in a web browser.

# Requirements:
# - Python 3.9+
# - numpy>=1.24.0, <2.0.0
# - pandas>=2.0.0, <2.2.0
# - plotly>=5.14.0

"""
Example: Create interactive graphs with zoom and hover display capabilities.

Purpose: Demonstrate data visualization techniques
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: None
"""

import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np

# Generate sample data
np.random.seed(42)
time_points = pd.date_range('2025-01-01', periods=1440, freq='1min')

df = pd.DataFrame({
    'timestamp': time_points,
    'temperature': 175 + np.random.normal(0, 2, 1440),
    'pressure': 1.5 + np.random.normal(0, 0.1, 1440),
    'yield': 90 + np.random.normal(0, 3, 1440)
})

# Create subplots
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Reactor Temperature', 'Reactor Pressure', 'Yield'),
    vertical_spacing=0.12
)

# Temperature
fig.add_trace(
    go.Scatter(x=df['timestamp'], y=df['temperature'],
               mode='lines', name='Temperature',
               line=dict(color='#11998e', width=1.5)),
    row=1, col=1
)

# Pressure
fig.add_trace(
    go.Scatter(x=df['timestamp'], y=df['pressure'],
               mode='lines', name='Pressure',
               line=dict(color='#f59e0b', width=1.5)),
    row=2, col=1
)

# Yield
fig.add_trace(
    go.Scatter(x=df['timestamp'], y=df['yield'],
               mode='lines', name='Yield',
               line=dict(color='#7b2cbf', width=1.5)),
    row=3, col=1
)

# Layout settings
fig.update_xaxes(title_text="Time", row=3, col=1)
fig.update_yaxes(title_text="Temp (°C)", row=1, col=1)
fig.update_yaxes(title_text="Pressure (MPa)", row=2, col=1)
fig.update_yaxes(title_text="Yield (%)", row=3, col=1)

fig.update_layout(
    title_text="Interactive Process Data Visualization",
    height=900,
    showlegend=False
)

# Display graph (for Jupyter Notebook)
fig.show()

# Save as HTML file
fig.write_html("process_data_interactive.html")
print("Interactive graph saved to 'process_data_interactive.html'.")

Explanation: Using Plotly, you can create graphs with interactive features like zoom, pan, and hover display. This is very convenient when exploring large amounts of data.


1.6 Chapter Summary

What We Learned

1. Definition of Process Informatics (PI) — PI enables process understanding, optimization, and control through data-driven approaches. While MI focuses on "what to make," PI focuses on "how to make it."

2. Characteristics of process industries — Processes are classified as continuous or batch, with inherent complexity arising from multivariate interactions, nonlinear relationships, and time delays.

3. Three main categories of process data — Sensor data (temperature, pressure, flow rate at high frequency), operation condition data (setpoints and control parameters), and quality data (purity, yield at low frequency via offline analysis).

4. Real examples of data-driven improvement — A 5% yield improvement delivered a 500 million yen annual profit increase, and a 15% energy reduction achieved 150 million yen in annual cost savings. Well-executed PI projects can deliver very high ROI (hundreds to thousands of percent) with short payback periods (2-4 months).

5. Data visualization with Python — We covered time-series plots, multi-variable displays, correlation matrices, pair plots, and interactive visualization with Plotly.

Important Points

PI is a practical technology with immediate effectiveness, typically paying back its investment within months. Data-driven approaches are particularly effective given the inherent complexity of industrial processes. Data visualization is the foundational first step of any PI project, and soft sensors (real-time quality prediction) are one of PI's core enabling technologies.

To the Next Chapter

In Chapter 2, we will explore process data preprocessing and visualization in detail, covering time-series data handling (resampling and rolling statistics), practical methods for missing value handling and outlier detection, data scaling and normalization techniques, advanced visualization approaches, and strategies for addressing challenges specific to process data.

