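This article was last updated on 2024-10-17. It was published more than 90 days ago, so its content may be out of date; refer to the official documentation for the latest information.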

RUST Language Virtual Machine Learning digest

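While studying this project I have also been translating the original English articles into Chinese, and I am sharing the result here: LanguageVM Docs
I hope it reaches, and helps, whoever needs it.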

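The project I built along the way is also on GitHub ---> LRVM
It uses the latest version of nom for language parsing, which resolves the problem that the old nom syntax used in the original articles no longer works.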

00

Progress

Notes

Opcode table

#[derive(Debug, PartialEq)]
pub enum Opcode {
    LOAD,    // 0
    ADD,     // 1
    SUB,     // 2
    MUL,     // 3
    DIV,     // 4
    HLT,     // 5
    JMP,     // 6
    JMPF,    // 7
    JMPB,    // 8
    EQ,      // 9
    NEQ,     // 10
    GTE,     // 11
    LTE,     // 12
    LT,      // 13
    GT,      // 14
    JMPE,    // 15
    NOP,     // 16
    ALOC,    // 17
    INC,     // 18
    DEC,     // 19
    DJMPE,   // 20
    IGL,     // _
    PRTS,    // 21
    LOADF64, // 22
    ADDF64,  // 23
    SUBF64,  // 24
    MULF64,  // 25
    DIVF64,  // 26
    EQF64,   // 27
    NEQF64,  // 28
    GTF64,   // 29
    GTEF64,  // 30
    LTF64,   // 31
    LTEF64,  // 32
    SHL,     // 33
    SHR,     // 34
    AND,     // 35
    OR,      // 36
    XOR,     // 37
    NOT,     // 38
    LUI,     // 39
    CLOOP,   // 40
    LOOP,    // 41
    LOADM,   // 42
    SETM,    // 43
    PUSH,    // 44
    POP,     // 45
    CALL,    // 46
    RET,     // 47
}

impl From<u8> for Opcode {
    fn from(value: u8) -> Self {
        match value {
            0 => Opcode::LOAD,
            1 => Opcode::ADD,
            2 => Opcode::SUB,
            3 => Opcode::MUL,
            4 => Opcode::DIV,
            5 => Opcode::HLT,
            6 => Opcode::JMP,
            7 => Opcode::JMPF,
            8 => Opcode::JMPB,
            9 => Opcode::EQ,
            10 => Opcode::NEQ,
            11 => Opcode::GTE,
            12 => Opcode::LTE,
            13 => Opcode::LT,
            14 => Opcode::GT,
            15 => Opcode::JMPE,
            16 => Opcode::NOP,
            17 => Opcode::ALOC,
            18 => Opcode::INC,
            19 => Opcode::DEC,
            20 => Opcode::DJMPE,
            21 => Opcode::PRTS,
            22 => Opcode::LOADF64,
            23 => Opcode::ADDF64,
            24 => Opcode::SUBF64,
            25 => Opcode::MULF64,
            26 => Opcode::DIVF64,
            27 => Opcode::EQF64,
            28 => Opcode::NEQF64,
            29 => Opcode::GTF64,
            30 => Opcode::GTEF64,
            31 => Opcode::LTF64,
            32 => Opcode::LTEF64,
            33 => Opcode::SHL,
            34 => Opcode::SHR,
            35 => Opcode::AND,
            36 => Opcode::OR,
            37 => Opcode::XOR,
            38 => Opcode::NOT,
            39 => Opcode::LUI,
            40 => Opcode::CLOOP,
            41 => Opcode::LOOP,
            42 => Opcode::LOADM,
            43 => Opcode::SETM,
            44 => Opcode::PUSH,
            45 => Opcode::POP,
            46 => Opcode::CALL,
            47 => Opcode::RET,
            _ => Opcode::IGL,
        }
    }
}

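pub struct VM {
    /// Array that simulates the hardware registers
    registers: [i32; 32],
    /// Program counter that tracks which byte is being executed
    pc: usize,
    /// Bytecode of the program being run
    program: Vec<u8>,
    /// Remainder of the last division (modulo) operation
    remainder: usize,
    /// Result of the last comparison operation
    equal_flag: bool,
}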

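registers: [i32; 32] is the register file: an array of 32 32-bit integers that holds the actual values.
- To read or write a temporary value, address a register by its index.
- Example: in LOAD $1 #10, the 1 in $1 is the register index and the 10 in #10 is the value, i.e. registers[1] = 10
- Example: in ADD $1 $2 $3, ADD is the operator and $1 $2 $3 are the operands (register indices), i.e. registers[3] = registers[1] + registers[2]
- Example: in JMP $1, JMP is the operator and $1 is the operand (a register index), i.e. pc = registers[1]
pc: usize is the program counter; it records the index in the program array of the byte currently being executed.
program: Vec<u8> is the program; it stores the bytecode.
- Example: after looking up the opcode table, [0, 1, 1, 244, 1, 2, 0, 3] reads as: LOAD $1 #500 (the 16-bit operand is (1 << 8) + 244); ADD $2 $0 $3;
- As code: registers[1] = ((1u8 as u16) << 8) + 244 = 500; registers[3] = registers[2] + registers[0];
remainder: usize stores the remainder produced by some arithmetic opcodes.
- Example: DIV $1 $2 $3 means registers[3] = registers[1] / registers[2]; VM.remainder = registers[1] % registers[2];
equal_flag: bool records whether two values are equal.
- Example: EQ $1 $2 means VM.equal_flag = registers[1] == registers[2]; However, to stay consistent with the three-operand MIPS-style form EQ $1 $2 $3, this design implements EQ as VM.equal_flag = registers[1] == registers[2]; VM.next_8_bits(); where next_8_bits() skips the one unused byte.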
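
The field descriptions above can be tied together with a small execution loop. What follows is a minimal sketch, not the LRVM source: MiniVm, next_16_bits, next_value and run are names assumed for illustration, only five opcodes are handled, every instruction is assumed to be padded to 4 bytes, and an out-of-range register index or a division by zero simply panics.

```rust
// Opcode numbers copied from the opcode table above.
const LOAD: u8 = 0;
const ADD: u8 = 1;
const DIV: u8 = 4;
const HLT: u8 = 5;
const EQ: u8 = 9;

/// Illustrative VM with the same fields as the VM struct above.
pub struct MiniVm {
    pub registers: [i32; 32],
    pub pc: usize,
    pub program: Vec<u8>,
    pub remainder: usize,
    pub equal_flag: bool,
}

impl MiniVm {
    pub fn new(program: Vec<u8>) -> Self {
        MiniVm { registers: [0; 32], pc: 0, program, remainder: 0, equal_flag: false }
    }

    /// Reads the byte at pc and advances pc by one.
    fn next_8_bits(&mut self) -> u8 {
        let byte = self.program[self.pc];
        self.pc += 1;
        byte
    }

    /// Reads two bytes as a big-endian u16: (high << 8) | low.
    fn next_16_bits(&mut self) -> u16 {
        let high = self.next_8_bits() as u16;
        let low = self.next_8_bits() as u16;
        (high << 8) | low
    }

    /// Reads a register index from the program and returns that register's value.
    fn next_value(&mut self) -> i32 {
        let index = self.next_8_bits() as usize;
        self.registers[index]
    }

    /// Executes 4-byte instructions until HLT, an unknown opcode, or the end of the program.
    pub fn run(&mut self) {
        while self.pc + 4 <= self.program.len() {
            match self.next_8_bits() {
                LOAD => {
                    let target = self.next_8_bits() as usize;
                    self.registers[target] = self.next_16_bits() as i32;
                }
                ADD => {
                    let (a, b) = (self.next_value(), self.next_value());
                    let target = self.next_8_bits() as usize;
                    self.registers[target] = a + b;
                }
                DIV => {
                    let (a, b) = (self.next_value(), self.next_value());
                    let target = self.next_8_bits() as usize;
                    self.registers[target] = a / b;
                    self.remainder = (a % b) as usize;
                }
                EQ => {
                    let (a, b) = (self.next_value(), self.next_value());
                    self.equal_flag = a == b;
                    self.next_8_bits(); // skip the unused padding byte
                }
                HLT => return,
                _ => return, // opcodes outside this sketch stop execution
            }
        }
    }
}
```

Running the program [0, 0, 1, 244, 0, 1, 0, 7, 4, 0, 1, 2, 9, 0, 0, 0, 5, 0, 0, 0] (LOAD $0 #500; LOAD $1 #7; DIV $0 $1 $2; EQ $0 $0; HLT) leaves 71 in registers[2], 3 in remainder, and equal_flag set to true.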

01

Progress

build_vm_part_06.md
build_vm_part_07.md
build_vm_part_08.md
build_vm_part_09.md

Notes

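Faster compiling and linking: LLVM + LLD, which shortens build times and speeds up the edit-compile loop
- Installation: on Linux:
  - Ubuntu: sudo apt-get install lld clang
  - Arch: sudo pacman -S lld clang
- Usage: create a .cargo/config.toml file in the project directory with the following content:
  [target.x86_64-unknown-linux-gnu]
  rustflags = ["-C", "linker=clang", "-C", "link-arg=-fuse-ld=lld"]
cargo clippy: static analysis (linting) that points out improvements, so the code ends up cleaner and more idiomatic
- Installation: rustup component add clippy
- Usage: cargo clippy or cargo clippy -- -D warnings
cargo fmt: formats the code so the layout is consistent and easier to read
- Installation: rustup component add rustfmt
- Usage: cargo fmt or cargo fmt -- --check
cargo watch: recompiles and reruns the code automatically on changes, for fast iteration and debugging.
- Installation: cargo install cargo-watch
- Usage: cargo watch -x check or cargo watch -x run
  - cargo watch -x check -x test -x bench -x run: check, then test, then benchmark, then run. The chain starts only when a file changes, and each command runs only if the previous one succeeded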

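Project content study

Big-endian / little-endian byte-order rules: the term "endian" comes from Jonathan Swift's Gulliver's Travels. The book divides people into two camps by how they crack an egg: those who open it at the big (round) end are Big-Endians, and those who open it at the little (pointed) end are Little-Endians. The civil war in Lilliput started over exactly this question. In the computer industry, big-endian versus little-endian nearly started a war of its own. There, endianness means the order in which the bytes of a value are laid out in memory. Big-endian storage matches the way people normally write numbers, while little-endian storage is convenient for the machine to process.
1. Big-endian: the high-order byte of a word is stored at the low address, and the low-order byte at the high address.
2. Little-endian: the high-order byte of a word is stored at the high address, and the low-order byte at the low address.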

addr	big-endian	little-endian
0x0000	0x12	0xcd
0x0001	0x34	0xab
0x0002	0xab	0x34
0x0003	0xcd	0x12
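
The table corresponds to the 32-bit value 0x1234abcd. As a small sketch, the standard library's to_be_bytes, to_le_bytes and from_be_bytes reproduce both columns; the function names byte_orders and operand_from_be are made up for this illustration.

```rust
/// Returns the bytes of `value` as they would sit in memory from the lowest
/// address upward: first in big-endian order, then in little-endian order.
pub fn byte_orders(value: u32) -> ([u8; 4], [u8; 4]) {
    (value.to_be_bytes(), value.to_le_bytes())
}

/// Rebuilds a 16-bit operand from two big-endian bytes, i.e. (high << 8) + low.
pub fn operand_from_be(high: u8, low: u8) -> u16 {
    u16::from_be_bytes([high, low])
}
```

byte_orders(0x1234abcd) returns ([0x12, 0x34, 0xab, 0xcd], [0xcd, 0xab, 0x34, 0x12]), matching the two columns of the table, and operand_from_be(1, 244) returns 500, the same big-endian reading used for the LOAD operand in the bytecode example earlier.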

What the main nom functions do and how to use them: alt, map, map_res, terminated, tuple, context, multispace0/multispace1, line_ending, alpha0/alpha1, digit0/digit1, eof, preceded, tag, IResult...

Nom book for freshers


                                   ┌─► Ok(
                                   │      what the parser didn't touch,
                                   │      what matched the regex
                                   │   )
             ┌─────────┐           │
 my input───►│my parser├──►either──┤
             └─────────┘           └─► Err(...)

IResult<I, O> type.

The Ok variant has a tuple of (remaining_input: I, output: O);

whereas the Err variant stores an error.
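
To make the (remaining input, output) shape concrete without pulling in the crate, here is a hand-written sketch that only mimics it with the standard library. ParseResult, literal and register are names invented for this illustration and are not part of nom; the String error type is likewise a stand-in for nom's error types.

```rust
/// Same shape as the Ok variant described above: Ok((remaining_input, output)).
/// This alias and its String error are stand-ins, not nom's real definitions.
pub type ParseResult<'a, O> = Result<(&'a str, O), String>;

/// Hand-written counterpart of a tag-style parser: match a fixed prefix.
pub fn literal<'a>(expected: &str, input: &'a str) -> ParseResult<'a, &'a str> {
    match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, &input[..expected.len()])),
        None => Err(format!("expected {:?}", expected)),
    }
}

/// Parses a register operand such as "$12" into its index.
pub fn register(input: &str) -> ParseResult<'_, u8> {
    // Consume the "$" marker and keep whatever follows it.
    let (rest, _) = literal("$", input)?;
    // Take the leading run of ASCII digits.
    let end = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let (digits, rest) = rest.split_at(end);
    let index = digits.parse::<u8>().map_err(|e| e.to_string())?;
    Ok((rest, index))
}
```

literal("hello", "hello world") returns Ok((" world", "hello")), the same pairing the tag row in the table below shows, and register("$12 #3") returns Ok((" #3", 12)). Each parser hands the untouched input to the next one, which is what lets combinators chain.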

List of parsers and combinators

Note: this list is meant to provide a nicer way to find a nom parser than reading through the documentation on docs.rs. Function combinators are organized in module so they are a bit easier to find.

Links present in this document will nearly always point to complete version of the parser. Most of the parsers also have a streaming version.

Basic elements

Those are used to recognize the lowest level elements of your grammar, like, "here is a dot", or "here is a big endian integer".

combinator	usage	input	output	comment
char	`char('a')`	`"abc"`	`Ok(("bc", 'a'))`	Matches one character (works with non ASCII chars too)
is_a	`is_a("ab")`	`"abbac"`	`Ok(("c", "abba"))`	Matches a sequence of any of the characters passed as arguments
is_not	`is_not("cd")`	`"ababc"`	`Ok(("c", "abab"))`	Matches a sequence of none of the characters passed as arguments
one_of	`one_of("abc")`	`"abc"`	`Ok(("bc", 'a'))`	Matches one of the provided characters (works with non ASCII characters too)
none_of	`none_of("abc")`	`"xyab"`	`Ok(("yab", 'x'))`	Matches anything but the provided characters
tag	`tag("hello")`	`"hello world"`	`Ok((" world", "hello"))`	Recognizes a specific suite of characters or bytes
tag_no_case	`tag_no_case("hello")`	`"HeLLo World"`	`Ok((" World", "HeLLo"))`	Case insensitive comparison. Note that case insensitive comparison is not well defined for unicode, and that you might have bad surprises
take	`take(4)`	`"hello"`	`Ok(("o", "hell"))`	Takes a specific number of bytes or characters
take_while	`take_while(is_alphabetic)`	`"abc123"`	`Ok(("123", "abc"))`	Returns the longest list of bytes for which the provided function returns true. `take_while1` does the same, but must return at least one character, while `take_while_m_n` must return between m and n
take_till	`take_till(is_alphabetic)`	`"123abc"`	`Ok(("abc", "123"))`	Returns the longest list of bytes or characters until the provided function returns true. `take_till1` does the same, but must return at least one character. This is the reverse behaviour from `take_while`: `take_till(f)` is equivalent to `take_while(|c| !f(c))`
take_until	`take_until("world")`	`"Hello world"`	`Ok(("world", "Hello "))`	Returns the longest list of bytes or characters until the provided tag is found. `take_until1` does the same, but must return at least one character

Choice combinators

combinator	usage	input	output	comment
alt	`alt((tag("ab"), tag("cd")))`	`"cdef"`	`Ok(("ef", "cd"))`	Try a list of parsers and return the result of the first successful one
permutation	`permutation((tag("ab"), tag("cd"), tag("12")))`	`"cd12abc"`	`Ok(("c", ("ab", "cd", "12")))`	Succeeds when all its child parsers have succeeded, whatever the order

Sequence combinators

combinator	usage	input	output	comment
delimited	`delimited(char('('), take(2), char(')'))`	`"(ab)cd"`	`Ok(("cd", "ab"))`	Matches an object from the first parser and discards it, then gets an object from the second parser, and finally matches an object from the third parser and discards it.
preceded	`preceded(tag("ab"), tag("XY"))`	`"abXYZ"`	`Ok(("Z", "XY"))`	Matches an object from the first parser and discards it, then gets an object from the second parser.
terminated	`terminated(tag("ab"), tag("XY"))`	`"abXYZ"`	`Ok(("Z", "ab"))`	Gets an object from the first parser, then matches an object from the second parser and discards it.
pair	`pair(tag("ab"), tag("XY"))`	`"abXYZ"`	`Ok(("Z", ("ab", "XY")))`	Gets an object from the first parser, then gets another object from the second parser.
separated_pair	`separated_pair(tag("hello"), char(','), tag("world"))`	`"hello,world!"`	`Ok(("!", ("hello", "world")))`	Gets an object from the first parser, then matches an object from the sep_parser and discards it, then gets another object from the second parser.
tuple	`tuple((tag("ab"), tag("XY"), take(1)))`	`"abXYZ!"`	`Ok(("!", ("ab", "XY", "Z")))`	Chains parsers and assemble the sub results in a tuple. You can use as many child parsers as you can put elements in a tuple

Applying a parser multiple times

combinator	usage	input	output	comment
count	`count(take(2), 3)`	`"abcdefgh"`	`Ok(("gh", vec!["ab", "cd", "ef"]))`	Applies the child parser a specified number of times
many0	`many0(tag("ab"))`	`"abababc"`	`Ok(("c", vec!["ab", "ab", "ab"]))`	Applies the parser 0 or more times and returns the list of results in a Vec. `many1` does the same operation but must return at least one element
many0_count	`many0_count(tag("ab"))`	`"abababc"`	`Ok(("c", 3))`	Applies the parser 0 or more times and returns how often it was applicable. `many1_count` does the same operation but the parser must apply at least once
many_m_n	`many_m_n(1, 3, tag("ab"))`	`"ababc"`	`Ok(("c", vec!["ab", "ab"]))`	Applies the parser between m and n times (n included) and returns the list of results in a Vec
many_till	`many_till(tag( "ab" ), tag( "ef" ))`	`"ababefg"`	`Ok(("g", (vec!["ab", "ab"], "ef")))`	Applies the first parser until the second applies. Returns a tuple containing the list of results from the first in a Vec and the result of the second
separated_list0	`separated_list0(tag(","), tag("ab"))`	`"ab,ab,ab."`	`Ok((".", vec!["ab", "ab", "ab"]))`	`separated_list1` works like `separated_list0` but must return at least one element
fold_many0	`fold_many0(be_u8, || 0, |acc, item| acc + item)`	`[1, 2, 3]`	`Ok(([], 6))`	Applies the parser 0 or more times and folds the list of return values. The `fold_many1` version must apply the child parser at least one time
fold_many_m_n	`fold_many_m_n(1, 2, be_u8, || 0, |acc, item| acc + item)`	`[1, 2, 3]`	`Ok(([3], 3))`	Applies the parser between m and n times (n included) and folds the list of return values
length_count	`length_count(number, tag("ab"))`	`"2ababab"`	`Ok(("ab", vec!["ab", "ab"]))`	Gets a number from the first parser, then applies the second parser that many times

Integers

Parsing integers from binary formats can be done in two ways: With parser functions, or combinators with configurable endianness.

The following parsers can be found in the number section on docs.rs.

configurable endianness: i16, i32, i64, u16, u32, u64 are combinators that take as argument a nom::number::Endianness, like this: i16(endianness). If the parameter is nom::number::Endianness::Big, parse a big endian i16 integer, otherwise a little endian i16 integer.
fixed endianness: The functions are prefixed by be_ for big endian numbers, and by le_ for little endian numbers, and the suffix is the type they parse to. As an example, be_u32 parses a big endian unsigned integer stored in 32 bits.
- be_f32, be_f64: Big endian floating point numbers
- le_f32, le_f64: Little endian floating point numbers
- be_i8, be_i16, be_i24, be_i32, be_i64, be_i128: Big endian signed integers
- be_u8, be_u16, be_u24, be_u32, be_u64, be_u128: Big endian unsigned integers
- le_i8, le_i16, le_i24, le_i32, le_i64, le_i128: Little endian signed integers
- le_u8, le_u16, le_u24, le_u32, le_u64, le_u128: Little endian unsigned integers

Streaming related

eof: Returns its input if it is at the end of input data
complete: Replaces an Incomplete returned by the child parser with an Error

Modifiers

Parser::and: method to create a parser by applying the supplied parser to the rest of the input after applying self, returning their results as a tuple (like sequence::tuple but only takes one parser)
Parser::and_then: method to create a parser from applying another parser to the output of self
map_parser: function variant of Parser::and_then
Parser::map: method to map a function on the output of self
map: function variant of Parser::map
Parser::flat_map: method to create a parser which will map a parser returning function (such as take or something which returns a parser) on the output of self, then apply that parser over the rest of the input. That is, this method accepts a parser-returning function which consumes the output of self, the resulting parser gets applied to the rest of the input
flat_map: function variant of Parser::flat_map
cond: Conditional combinator. Wraps another parser and calls it if the condition is met
map_opt: Maps a function returning an Option on the output of a parser
map_res: Maps a function returning a Result on the output of a parser
into: Converts the child parser's result to another type
not: Returns a result only if the embedded parser returns Error or Incomplete. Does not consume the input
opt: Make the underlying parser optional
cut: Transform recoverable error into unrecoverable failure (commitment to current branch)
peek: Returns a result without consuming the input
recognize: If the child parser was successful, return the consumed input as the produced value
consumed: If the child parser was successful, return a tuple of the consumed input and the produced output.
verify: Returns the result of the child parser if it satisfies a verification function
value: Returns a provided value if the child parser was successful
all_consuming: Returns the result of the child parser only if it consumed all the input

Error management and debugging

dbg_dmp: Prints a message and the input if the parser fails

Text parsing

escaped: Matches a byte string with escaped characters
escaped_transform: Matches a byte string with escaped characters, and returns a new string with the escaped characters replaced

Binary format parsing

length_data: Gets a number from the first parser, then takes a subslice of the input of that size, and returns that subslice
length_value: Gets a number from the first parser, takes a subslice of the input of that size, then applies the second parser on that subslice. If the second parser returns Incomplete, length_value will return an error

Bit stream parsing

bits: Transforms the current input type (byte slice &[u8]) to a bit stream on which bit specific parsers and more general combinators can be applied
bytes: Transforms its bits stream input back into a byte slice for the underlying parser

Remaining combinators

success: Returns a value without consuming any input, always succeeds
fail: Inversion of success. Always fails.

Character test functions

Use these functions with a combinator like take_while:

is_alphabetic: Tests if byte is ASCII alphabetic: [A-Za-z]
is_alphanumeric: Tests if byte is ASCII alphanumeric: [A-Za-z0-9]
is_digit: Tests if byte is ASCII digit: [0-9]
is_hex_digit: Tests if byte is ASCII hex digit: [0-9A-Fa-f]
is_oct_digit: Tests if byte is ASCII octal digit: [0-7]
is_bin_digit: Tests if byte is ASCII binary digit: [0-1]
is_space: Tests if byte is ASCII space or tab: [ \t]
is_newline: Tests if byte is ASCII newline: [\n]

Alternatively there are ready to use functions:

alpha0: Recognizes zero or more lowercase and uppercase alphabetic characters: [a-zA-Z]. alpha1 does the same but returns at least one character
alphanumeric0: Recognizes zero or more numerical and alphabetic characters: [0-9a-zA-Z]. alphanumeric1 does the same but returns at least one character
anychar: Matches one byte as a character
crlf: Recognizes the string \r\n
digit0: Recognizes zero or more numerical characters: [0-9]. digit1 does the same but returns at least one character
double: Recognizes floating point number in a byte string and returns a f64
float: Recognizes floating point number in a byte string and returns a f32
hex_digit0: Recognizes zero or more hexadecimal numerical characters: [0-9A-Fa-f]. hex_digit1 does the same but returns at least one character
hex_u32: Recognizes a hex-encoded integer
line_ending: Recognizes an end of line (both \n and \r\n)
multispace0: Recognizes zero or more spaces, tabs, carriage returns and line feeds. multispace1 does the same but returns at least one character
newline: Matches a newline character \n
not_line_ending: Recognizes a string of any char except \r or \n
oct_digit0: Recognizes zero or more octal characters: [0-7]. oct_digit1 does the same but returns at least one character
bin_digit0: Recognizes zero or more binary characters: [0-1]. bin_digit1 does the same but returns at least one character
rest: Return the remaining input
rest_len: Return the length of the remaining input
space0: Recognizes zero or more spaces and tabs. space1 does the same but returns at least one character
tab: Matches a tab character \t

02

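Progress

Notes

How label jumps work: a two-pass assembler
Assembly happens in two phases, so the source has to be traversed twice. That calls for a data structure like this (Assembler):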

// Enum: first phase / second phase
#[derive(Debug, PartialEq, Clone)]
pub enum AssemblerPhase {
    First,
    Second,
}


#[derive(Debug)]
pub struct Assembler {
    pub phase: AssemblerPhase, // Current phase: first or second
    pub symbols: SymbolTable   // Symbol table: every label and its position
}

impl Assembler {
    pub fn new() -> Assembler {
        Assembler {
            phase: AssemblerPhase::First,
            symbols: SymbolTable::new()
        }
    }
}

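1. First phase: parse the code
Given the source text, parse it, find every label: declaration and record it, together with the position that later @label usages should resolve to.
1. Every label: declaration has to be stored, so a lookup table is needed (a hash map would do; the code below keeps a plain Vec), holding each label's name and position.
2. That leads to a SymbolTable, which stores every label: and its position.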

#[derive(Debug, PartialEq, Clone)]
pub struct SymbolTable {
    // All symbols (labels)
    pub symbols: Vec<Symbol>,
}

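3. With SymbolTable in place, the next question is what a Symbol looks like: it needs a name, it needs a position (the line), and it also carries a symbol type so that other kinds of symbol can be added later.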

#[derive(Debug, PartialEq, Clone)]
pub enum SymbolType {
    Label,
}

#[derive(Debug, PartialEq, Clone)]
pub struct Symbol {
    name: String,
    symbol_type: SymbolType,
    offset: Option<u32>,
}

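4. The name and the type are there, so why use offset instead of a line number? Because the second pass has to emit jumps, and with the offset the jump target can be computed directly as a base position plus the offset, with no source line numbers involved.
5. In this way the first pass builds the SymbolTable, and the second pass uses it to resolve jumps.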

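2. Second phase: build the VM bytecode
1. It simply calls the to_bytes method on each AssemblerInstruction
2. All the bytes are appended to a Vec<u8>, which ends up holding the fully assembled bytecode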
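
The two phases can be sketched end to end with the standard library alone. This is an illustration under stated assumptions, not the LRVM assembler: it replaces the nom parser with plain string splitting, first_pass, second_pass and split_label are made-up names, the Symbol here is simplified (no symbol type, a plain u32 offset), every instruction is assumed to occupy 4 bytes, only HLT, INC and DJMPE are encoded, and a label usage is assumed to be written out as a 2-byte big-endian offset.

```rust
#[derive(Debug, PartialEq, Clone)]
pub struct Symbol {
    pub name: String,
    pub offset: u32,
}

/// Width of one encoded instruction in bytes (an assumption of this sketch).
const INSTRUCTION_WIDTH: usize = 4;

/// Splits "name: INSTR ..." into (Some("name"), "INSTR ..."); other lines pass through.
fn split_label(line: &str) -> (Option<&str>, &str) {
    match line.split_once(':') {
        Some((label, rest)) => (Some(label.trim()), rest.trim()),
        None => (None, line.trim()),
    }
}

/// First pass: record the byte offset of every label declaration.
pub fn first_pass(lines: &[&str]) -> Vec<Symbol> {
    let mut symbols = Vec::new();
    for (index, line) in lines.iter().enumerate() {
        if let (Some(name), _) = split_label(line) {
            symbols.push(Symbol {
                name: name.to_string(),
                offset: (index * INSTRUCTION_WIDTH) as u32,
            });
        }
    }
    symbols
}

/// Second pass: encode each instruction, resolving @label operands via the symbols.
pub fn second_pass(lines: &[&str], symbols: &[Symbol]) -> Result<Vec<u8>, String> {
    let mut bytes = Vec::new();
    for line in lines {
        let (_, instruction) = split_label(line);
        let mut parts = instruction.split_whitespace();
        // Opcode numbers come from the opcode table earlier in these notes.
        let opcode = match parts.next() {
            Some("HLT") => 5u8,
            Some("INC") => 18,
            Some("DJMPE") => 20,
            other => return Err(format!("unsupported instruction: {:?}", other)),
        };
        bytes.push(opcode);
        for operand in parts {
            if let Some(name) = operand.strip_prefix('@') {
                let symbol = symbols
                    .iter()
                    .find(|s| s.name == name)
                    .ok_or_else(|| format!("unknown label: {}", name))?;
                bytes.extend_from_slice(&(symbol.offset as u16).to_be_bytes());
            } else if let Some(register) = operand.strip_prefix('$') {
                bytes.push(register.parse::<u8>().map_err(|e| e.to_string())?);
            } else {
                return Err(format!("unsupported operand: {}", operand));
            }
        }
        // Pad every instruction to the fixed width.
        while bytes.len() % INSTRUCTION_WIDTH != 0 {
            bytes.push(0);
        }
    }
    Ok(bytes)
}
```

The reason for two passes shows up with a forward reference: for the lines "DJMPE @end" and "end: HLT", the jump is encoded before its label has been seen, yet the first pass has already recorded end at offset 4, so the second pass emits [20, 0, 4, 0, 5, 0, 0, 0].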

03

Progress

build_vm_part_16.md

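Notes

Clap is a Rust library for command-line argument parsing; it offers a simple way to parse arguments and generate help output.

Derive mode: declare the command-line arguments by annotating a struct with derive macros; the builder API can still be mixed in for more complex combinations.
Builder mode: construct the command and its arguments programmatically at runtime; derive-defined pieces can likewise be combined with it.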

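clap FAQ

Good clap tutorials:

Derive/Builder: an excellent tutorial: 深入探索 Rust 的 clap 库：命令行解析的艺术 (A deep dive into Rust's clap library: the art of command-line parsing)
Derive/Builder: Rust 命令行库 Clap 快速入门教程 (A quick-start tutorial for the Rust command-line library Clap)
Derive/Builder: Writing a CLI Tool in Rust with Clap
Builder: Rust: Take your CLI to the Next Level with Clap

I was a bit tired today, so I wrote very little code and mostly read material to understand the Clap features I will need. The clap part will be added to lrvm later.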

04

Progress

build_vm_part_17.md
build_vm_part_18.md

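Notes

Input/output streams in unit tests

While studying today I ran into a real question: in a test, how do I make sure the output is exactly the character or string I expect?
I remember that when I first learned Rust, the Rust Bible (Rust 圣经) said that implementing output through the Write trait makes it much easier to test.

In a real project, though, calls such as println! and print! sit in the middle of a function, and the text they print is not part of the function's return value.
For example: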

fn main() {
    // <snip> 
    if let Some(name) = abc()
    {
        println!("{}",name);
    }
    // <snip>
}

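In that case there is no direct way to keep hold of the name string and assert on it.
Intercepting the stdin and stdout streams is probably the right approach. I do not know how to do that yet; I will look for a solution later and add an intermediate layer.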

-----------------------------      ------------------------------------------      -----------------------------
|                           |      |           |              |             |      |                           |
|     stdin (user input)    | ---> |   input   |   program    |   output    | ---> |      stdout (output)      |
|                           |      | intercept |  processing  |  intercept  |      |                           |
-----------------------------      ------------------------------------------      -----------------------------

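VM architecture and reflections

The difference between Opcode and Directive
1. An Opcode identifies an operation that the virtual machine actually performs, for example MOV, ADD, SUB, MUL, DIV. It is the operation code of a concrete machine instruction and can be executed directly.
2. A Directive is a pseudo-instruction that tells the assembler how to process the source code, for example EQU, ORG, DB, DW. It is never executed; it is handled during assembly.
  The rest of this section looks at the difference in detail. The two terms overlap in some contexts, but in most cases they mean clearly different things.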

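Opcode (operation code)

An opcode is the code that identifies a specific operation in machine language. It is part of the CPU's instruction set and denotes one concrete machine instruction. In assembly language an opcode is usually written as a mnemonic, for example MOV, ADD, JMP.

Characteristics

Maps directly to a hardware operation: an opcode corresponds to a physical operation of the processor, such as moving data, doing arithmetic, or jumping.
Executable: an opcode is an instruction that can be executed directly.
Written as a mnemonic: in assembly language an opcode is usually written as a mnemonic, which makes it easier to read and write.
Operands: an opcode usually takes one or more operands, which specify the objects or addresses it acts on.

Example

Consider a simple virtual machine with the following opcodes:

LOAD: load a value from memory into a register.
STORE: store the value of a register into memory.
ADD: add the values of two registers and store the result in another register.

In assembly language, opcodes like these might be written as: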

LOAD $1 #0
ADD $1 $2 $3
SUB $2 $3 $1

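Directive (pseudo-instruction)

A directive is not a real machine instruction; it is a command that controls the assembler's behaviour during assembly. Directives mainly supply meta-information: defining variables, allocating memory, including external files, and so on.

Characteristics

Independent of the hardware: a directive is handled at assembly time and does not map to any particular hardware operation.
Not executable: a directive is never executed; it changes what the assembler does.
Written as a mnemonic: directives usually have mnemonics too, but those mnemonics are not executed.
Controls the assembly process: directives steer assembly itself, for example file inclusion or macro definition.

Example

Consider a simple virtual machine with the following directives:

EQU: define a constant.
ORG: set the assembly start address.
DB: allocate memory and initialise bytes.
DW: allocate memory and initialise a word.

In assembly language, these directives might be written as:

EQU START_ADDR 0x1000  ; define the constant START_ADDR
ORG $START_ADDR        ; set the assembly start address to START_ADDR
DB 'Hello, World!'     ; allocate memory and initialise it with the string "Hello, World!"
DW 0x1234              ; allocate memory and initialise it with the 16-bit value 0x1234

In short, directives provide high-level control and description, while opcodes carry out the concrete work those descriptions set up.

Directives are processed at assembly time into the corresponding data and symbol definitions, while opcodes are translated into the machine code that is finally executed.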
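
The split can be illustrated with a toy line assembler. Everything in this sketch is assumed for illustration: Emitted and assemble_line are invented names, only the DB and DW directives and the HLT and NOP opcodes are recognised, DW is written out big-endian, instructions are padded to 4 bytes, and trailing ; comments are not handled.

```rust
/// What the assembler produces for one source line.
#[derive(Debug, PartialEq)]
pub enum Emitted {
    /// An opcode line becomes executable bytecode.
    Code(Vec<u8>),
    /// A data directive only reserves and initialises bytes; nothing is executed.
    Data(Vec<u8>),
}

/// Assembles one line, treating directives and opcodes differently.
pub fn assemble_line(line: &str) -> Result<Emitted, String> {
    let line = line.trim();
    // Split the mnemonic from the rest of the line (the operand, if any).
    let (mnemonic, operand) = match line.split_once(' ') {
        Some((mnemonic, rest)) => (mnemonic, rest.trim()),
        None => (line, ""),
    };
    match mnemonic {
        // Directives: handled entirely at assembly time.
        "DB" => Ok(Emitted::Data(
            operand.trim_matches('\'').as_bytes().to_vec(),
        )),
        "DW" => {
            let digits = operand.trim_start_matches("0x");
            let word = u16::from_str_radix(digits, 16).map_err(|e| e.to_string())?;
            Ok(Emitted::Data(word.to_be_bytes().to_vec()))
        }
        // Opcodes: translated to the numbers from the opcode table, padded to 4 bytes.
        "HLT" => Ok(Emitted::Code(vec![5, 0, 0, 0])),
        "NOP" => Ok(Emitted::Code(vec![16, 0, 0, 0])),
        other => Err(format!("unsupported mnemonic: {}", other)),
    }
}
```

assemble_line("DW 0x1234") yields Emitted::Data(vec![0x12, 0x34]), bytes the VM will only ever read, while assemble_line("HLT") yields Emitted::Code(vec![5, 0, 0, 0]), bytes the VM will execute.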