Requirements

  1. Recognize every token in a C source program and output each one in token form
  2. Recognize and skip comments in the source program
  3. Count the number of statement lines, the number of tokens of each kind, and the total number of characters, and output the statistics
  4. Detect lexical errors in the source program and report where each occurs
  5. Recover appropriately from errors so that lexical analysis can continue; a single scan of the source program should detect and report all of its lexical errors

Design

1. Recognize every token in a C source program and output each one in token form

  • Enumerate the token types, read the file character by character, and check whether the input matches one of the token patterns
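The loop described above can be sketched in miniature. The names below (TokType, Tok, scan) are illustrative, not the ones used in the final program, and keywords, comments, and literals are left to the full implementation:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Illustrative token kinds; the real lexer's enum (kTokenType)
// also distinguishes keywords and string literals.
enum class TokType { Identifier, Number, Operator };

struct Tok {
    TokType type;
    std::string value;
};

// Read a fragment character by character and classify each run.
std::vector<Tok> scan(const std::string& src) {
    std::vector<Tok> out;
    size_t i = 0;
    while (i < src.size()) {
        unsigned char c = src[i];
        if (std::isspace(c)) { ++i; continue; }  // skip whitespace
        if (std::isalpha(c) || c == '_') {       // identifier: letter/'_' then alnum/'_'
            std::string word;
            while (i < src.size() && (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                word += src[i++];
            out.push_back({TokType::Identifier, word});
        } else if (std::isdigit(c)) {            // integer literal
            std::string num;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                num += src[i++];
            out.push_back({TokType::Number, num});
        } else {                                 // everything else: a one-character operator
            out.push_back({TokType::Operator, std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}
```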

2. Recognize and skip comments in the source program

  • A comment-recognition program written earlier can serve as a reference
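The comment-skipping logic can be sketched as a stand-alone helper (stripComments is an illustrative name; the real analyzer works on a file stream and keeps counters instead of building an output string):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Remove // and /* */ comments from a source fragment.
// Sketch only: string literals containing "//" are not considered here.
std::string stripComments(const std::string& src) {
    std::string out;
    std::istringstream in(src);
    char c;
    while (in.get(c)) {
        if (c == '/' && in.peek() == '/') {        // line comment: skip to newline
            while (in.get(c) && c != '\n') {}
            out += '\n';                           // keep the line break for line counting
        } else if (c == '/' && in.peek() == '*') { // block comment: skip until "*/"
            in.get();                              // consume the '*'
            char prev = '\0';
            while (in.get(c)) {
                if (prev == '*' && c == '/') break;
                prev = c;
            }
        } else {
            out += c;
        }
    }
    return out;
}
```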

3. Count the number of statement lines, the number of tokens of each kind, and the total number of characters, and output the statistics

  • Update the counters during token matching itself
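Updating all counters in the same pass might look like this minimal sketch (Stats and countStats are illustrative names; the real program keeps line_count, word_count, and char_count as members of the analyzer):

```cpp
#include <cassert>
#include <cctype>
#include <string>

struct Stats { int lines = 1; int chars = 0; int words = 0; };

// One pass over the text updates every counter at once,
// so no second scan of the file is needed.
Stats countStats(const std::string& src) {
    Stats s;
    bool in_word = false;
    for (char c : src) {
        s.chars++;
        if (c == '\n') s.lines++;
        bool part = std::isalnum(static_cast<unsigned char>(c)) || c == '_';
        if (part && !in_word) s.words++;  // a new word starts here
        in_word = part;
    }
    return s;
}
```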

4. Detect lexical errors in the source program and report where each occurs

  • When matching fails during the scan, record an error

5. Recover appropriately from errors so that lexical analysis can continue; a single scan should detect and report all lexical errors

  • No concrete idea yet
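One common approach here is panic-mode recovery: record the error, skip ahead past the offending input, and resume scanning, so a single pass reports every error instead of stopping at the first one. A sketch under that assumption (findBadChars is an illustrative name, and only non-ASCII bytes stand in for "bad input"):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Panic-mode sketch: collect the position of every run of invalid
// characters in one scan, skipping ahead so analysis can continue.
std::vector<size_t> findBadChars(const std::string& src) {
    std::vector<size_t> errors;
    size_t i = 0;
    while (i < src.size()) {
        if (static_cast<unsigned char>(src[i]) > 127) { // invalid (non-ASCII) byte
            errors.push_back(i);                        // report the run once
            while (i < src.size() && static_cast<unsigned char>(src[i]) > 127)
                ++i;                                    // skip the whole run, then resume
        } else {
            ++i;
        }
    }
    return errors;
}
```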

Implementation

Problems encountered

When handling numeric literals, scientific notation could not be recognized; it may need special handling

Solved
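The special-case handling can be sketched as a stand-alone validity check (isValidNumber is an illustrative name; this sketch also accepts a signed exponent, which the final lexer may or may not do):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Accept 123, 1.5, 1e9, 2.5E-3; reject 1., 1e, 1.2.3, 1e2e3.
// Rule: '.' at most once and before the exponent; 'e'/'E' at most
// once and followed by an (optionally signed) digit.
bool isValidNumber(const std::string& s) {
    bool seen_point = false, seen_exp = false;
    if (s.empty() || !std::isdigit(static_cast<unsigned char>(s[0]))) return false;
    for (size_t i = 1; i < s.size(); ++i) {
        char c = s[i];
        if (std::isdigit(static_cast<unsigned char>(c))) continue;
        if (c == '.' && !seen_point && !seen_exp) {
            // a digit must follow the point
            if (i + 1 >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i + 1]))) return false;
            seen_point = true;
        } else if ((c == 'e' || c == 'E') && !seen_exp) {
            size_t j = i + 1;
            if (j < s.size() && (s[j] == '+' || s[j] == '-')) ++j; // optional sign
            if (j >= s.size() || !std::isdigit(static_cast<unsigned char>(s[j]))) return false;
            seen_exp = true;
            i = j; // the first exponent digit is already verified
        } else {
            return false;
        }
    }
    return true;
}
```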

When handling operators, multi-character operators could not be recognized

Use file.peek() to look ahead, combine the two characters into a string, and match it against a preset operator table
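The peek-and-combine trick can be sketched as follows (readOperators and the small table are illustrative; the real program uses its double_operator set):

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split operator characters into tokens, preferring two-character
// operators ("==", "<=", ...) over single-character ones.
std::vector<std::string> readOperators(const std::string& src) {
    static const std::set<std::string> two_char = {"==", "!=", ">=", "<=", "&&", "||", "+="};
    std::vector<std::string> out;
    std::istringstream in(src);
    char c;
    while (in.get(c)) {
        std::string pair = std::string(1, c) + static_cast<char>(in.peek());
        if (two_char.count(pair)) {
            out.push_back(pair);
            in.get();  // peek does not consume; take the second character now
        } else {
            out.push_back(std::string(1, c));
        }
    }
    return out;
}
```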

For the output to line up with the source file, all tokens sharing the same token.line value must be printed on the same output line

I expected this to be troublesome: if the token.line values in tokens were out of order, printing all tokens with the same token.line on one line in a single pass would be hard, and I could not think of a good way to do it. Then I realized that tokens are appended in line order during scanning, so printing them in storage order just works.

The interplay of escape characters, string-literal reading, and bracket matching produced logic problems that troubled me

Solved
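The rule that untangles the three concerns: inside a quoted literal, a backslash always consumes the character after it, so an escaped quote can never close the literal, and brackets inside the literal are never counted. A sketch of that rule (readQuoted is an illustrative name):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Read the body of a quoted literal from the stream, honoring backslash
// escapes: '\\' always swallows the next character, so an escaped quote
// (\" or \') never terminates the literal.
std::string readQuoted(std::istream& in, char quote) {
    std::string body;
    char c;
    while (in.get(c) && c != quote) {
        body += c;
        if (c == '\\' && in.get(c))  // escape: keep the next char unconditionally
            body += c;
    }
    return body;
}
```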

Related to file.peek() mentioned above: in some branches the character still has to be consumed with file.get(), or pushed back into the stream with file.unget(), which caused some confusion

Solved
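The stream semantics behind the confusion, in one place: peek inspects the next character without consuming it, get consumes it, and unget pushes the most recently consumed character back. A minimal demonstration (traceStream is an illustrative name):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Trace the effect of peek/get/unget on the stream "abc".
std::string traceStream() {
    std::istringstream in("abc");
    std::string trace;
    trace += static_cast<char>(in.peek()); // 'a' -- still in the stream
    trace += static_cast<char>(in.get());  // 'a' -- now consumed
    trace += static_cast<char>(in.get());  // 'b'
    in.unget();                            // push 'b' back
    trace += static_cast<char>(in.get());  // 'b' again
    trace += static_cast<char>(in.get());  // 'c'
    return trace;                          // "aabbc"
}
```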

Some thoughts

Because it is written in C++, the work kept drifting toward building a compiler for C++, and the thought of all the complicated features still unimplemented was exhausting; then I suddenly remembered the target is actually a compiler for C, which was an instant relief

Notes on the supplementary requirements

Documentation

  • The runtime/development environment can be configured in CMakeLists; the Debug build differs from Release only in whether the tokens are additionally printed in source-program layout; version 1.0.0
  • The (incomplete) description of design, implementation, and testing is this markdown document; the output file is at ~/build/stdout.txt, and screenshots are below
  • Error handling covers only a few simple errors, documented in the code comments; error recovery simply skips the offending input
    Source program
    token
    Summary

Source code

~/LAC.cc is the source program. It takes command-line arguments; run it with -f filename
LAC.cc

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <cctype>
#include <set>
#include <cassert>

enum class kTokenType { // token type enumeration
KEYWORD,
IDENTIFIER,
NUM_LITERAL,
OPERATOR,
STRING
};

enum class kErrorType {
UNKNOWN,
INVALID_CHARACTER, // 无效的字符
INVALID_NUM, //错误数字
PARENDISMATCH, // 括号不配对
UNCLOSED_COMMENT, // 不完整的多行注释
UNCLOSED_STRING // 不完整的字符串字面量
};

struct kToken { // token record
kTokenType type;
std::string value;
int line;
int column;
};

struct kError {
kErrorType type;
std::string message;
int line;
int column;
};


class CLexicalAnalyzer { // 词法分析器
private:
std::vector<kToken> tokens;
std::vector<kError> errors;
std::unordered_map<std::string, kTokenType> keyword_map;
std::set<char> compute_operator = { '+', '-', '*', '/', '%', '=', '&', '|', '!', '<', '>' };
std::set<char> front_closure_operator = { '(', '{', '[' };
std::set<char> back_closure_operator = { ')', '}', ']' };
std::set<std::string> double_operator = { "==", "!=", ">=", "<=", "&&", "||", "+=", "-=", "*=", "/=", "%=" };
const char transfer_operator = '\\';
const char end_operator = ';';
const char header_operator = '#';
int line_count;
int word_count;
int char_count;
int error_count;

void addToken(kTokenType type, const std::string& value, int line, int column);
void reportError(kErrorType type, std::string message, int line, int column);
public:
CLexicalAnalyzer() : line_count(1), word_count(0), char_count(0), error_count(0) {

const std::string keywords[] = {"auto", "break", "case", "char", "const", "continue", "default", "do", "double", "else", // initialize the keyword map
"enum", "extern", "float", "for", "goto", "if", "inline", "int", "long", "register", "restrict", "return", "short", "signed",
"sizeof", "static", "struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while", "_Alignas", "_Alignof",
"_Atomic", "_Bool", "_Complex", "_Generic", "_Imaginary", "_Noreturn", "_Static_assert", "_Thread_local"
};

for (const auto& kw : keywords) {
keyword_map[kw] = kTokenType::KEYWORD;
}
}
void analyze(const std::string& filename);
void printStatistics() const;
void printTokens() const;
void printErrors() const;

#ifndef NDEBUG
void printContext() const {
if (tokens.empty()) return; // avoid calling front() on an empty token list
int current_line = tokens.front().line;

std::cout << tokens[0].line << ": ";
for (const auto& token : tokens) {
if (token.line != current_line) {
current_line = token.line;
std::cout << "\n" << token.line << ": " << token.value << " ";
continue;
}

std::cout << token.value << " "; // tokens from the same source line stay on one output line
}
std::cout << std::endl;
}
#endif
};

inline void CLexicalAnalyzer::addToken(kTokenType type, const std::string& value, int line, int column) {
tokens.push_back({type, value, line, column});
word_count++;
}

inline void CLexicalAnalyzer::reportError(kErrorType type, std::string message, int line, int column) {
errors.push_back({type, message, line, column});
error_count++;
}

void CLexicalAnalyzer::analyze(const std::string& filename) {
std::ifstream file(filename);

if (!file.is_open()) {
std::cerr << "Can't open file: " << filename << "\n";
exit(-1);
}

std::string current_token;
char c;
int current_line = 1;
int current_column = 0;
bool in_comment = false;
bool in_error = false;
bool string_unclosure = false;
bool single_dismatch = false;
bool double_dismatch = false;
bool reading = false;
int big_paren_dismatch = 0;
int mid_paren_dismatch = 0;
int small_paren_dismatch = 0;
char start_reading;

std::cout << filename << ": \n";
while (file.get(c)) {

char_count++;
current_column++;

if (in_comment) { // inside a block comment: count newlines, watch only for "*/"
if (c == '\n') {
line_count++;
current_line++;
current_column = 0;
continue;
}
if (!(c == '*' && file.peek() == '/')) continue; // fall through only at the closing "*/"
}
if (reading) {
file.unget();

while(file.get(c)) {
if (c == start_reading) {
break;
} else {
current_token += c;

if (c == transfer_operator) {
current_column++;
char_count++;
file.get(c);
current_token += c;
current_column++;
char_count++;
} else {
current_column++;
char_count++;
}

}
}
addToken(kTokenType::STRING, current_token, current_line, current_column - current_token.length() + 1);
current_token.clear();
}
if (static_cast<unsigned char>(c) > 127) { // non-ASCII character
if (!in_error) { // report a run of invalid characters only once
in_error = true;
reportError(kErrorType::INVALID_CHARACTER, "can't recognize the char", current_line, current_column);
}
continue;
}

if (std::isspace(c)) { // skip spaces and newlines
if (c == '\n') {
line_count++;
current_line++;
current_column = 0;
}
in_error = false;
continue;
} else if (c == '/' && !in_comment) {
char next_char = file.peek();
if (next_char == '/') { // line comment: skip to end of line
file.get();
while (file.get(c)) {
if (c == '\n') break;
}
line_count++;
current_line++;
current_column = 0;
continue;
} else if (next_char == '*') { // block comment
in_comment = true;
file.get(); // peek does not remove the character from the stream, so consume it with get
continue;
}
} else if (in_comment && c == '*' && file.peek() == '/') { // end of a block comment
file.get();
in_comment = false;
continue;
} else if ((std::isalpha(c) || c == '_') && !in_error) { // keywords and identifiers; an identifier must start with a letter or '_'
current_token = c;

while (file.get(c) && (std::isalnum(c) || c == '_')) {
current_token += c;
current_column++;
char_count++;
}

file.unget(); // put a non-identifier character back for the following logic to handle
kTokenType token_type = keyword_map.count(current_token) > 0 ? kTokenType::KEYWORD : kTokenType::IDENTIFIER;
addToken(token_type, current_token, current_line, current_column - current_token.length() + 1);
current_token.clear();
} else if (std::isdigit(c) && !in_error) { // numeric literal
bool scientific = false;
bool point = false;

current_token = c;

while (file.get(c) && (std::isdigit(c) || c == '.' || c == 'e' || c == 'E')) {
if (c == '.') { // decimal point
if (!point) {
char next_char = file.peek();

if(std::isdigit(next_char)) {
current_token += c;
char_count++;
current_column++;
point = true;
} else { // reject '.' followed by 'e'/'E' or another non-digit
reportError(kErrorType::INVALID_NUM, "please check if you write a correct value.\n", current_line, current_column);
}
} else { // reject a second '.'
reportError(kErrorType::INVALID_NUM, "please check if you write a correct value.\n", current_line, current_column);
}
} else if (c == 'e' || c == 'E') { // scientific notation
if (!scientific) {
char next_char = file.peek();

if(std::isdigit(next_char)) {
current_token += c;
char_count++;
current_column++;
scientific = true;
} else { // reject 'e'/'E' followed by a letter or other non-digit
reportError(kErrorType::INVALID_NUM, "please check if you write a correct value.\n", current_line, current_column);
}
} else {
reportError(kErrorType::INVALID_NUM, "please check if you write a correct value.\n", current_line, current_column);
}
} else {
current_token += c;
char_count++;
current_column++;
}
}

file.unget();
addToken(kTokenType::NUM_LITERAL, current_token, current_line, current_column - current_token.length() + 1);
current_token.clear();
} else if (c == header_operator || c == end_operator) { // '#' and ';'
addToken(kTokenType::OPERATOR, std::string(1, c), current_line, current_column);
} else if (compute_operator.find(c) != compute_operator.end()) {
char next_char = file.peek();
std::string peek_string = std::string(1, c) + next_char;

if (double_operator.find(peek_string) != double_operator.end()) {
addToken(kTokenType::OPERATOR, peek_string, current_line, current_column);
file.get(); // on a two-character match, consume the peeked character; peek does not remove it
}
else addToken(kTokenType::OPERATOR, std::string(1, c), current_line, current_column);
} else if (front_closure_operator.find(c) != front_closure_operator.end()) { // bracket matching
switch(c) {
case '{': big_paren_dismatch++;break;
case '[': mid_paren_dismatch++;break;
case '(': small_paren_dismatch++;break;
}
addToken(kTokenType::OPERATOR, std::string(1, c), current_line, current_column);
} else if (back_closure_operator.find(c) != back_closure_operator.end()) {
switch(c) {
case '}': big_paren_dismatch--;break;
case ']': mid_paren_dismatch--;break;
case ')': small_paren_dismatch--;break;
}
addToken(kTokenType::OPERATOR, std::string(1, c), current_line, current_column);
} else if (c == '\'' || c == '"') {
if(!reading) {
reading = true;
start_reading = c;
} else {
reading = false;
}

if (c == '\'') single_dismatch = !single_dismatch; // toggle between open and closed
if (c == '\"') double_dismatch = !double_dismatch;
addToken(kTokenType::OPERATOR, std::string(1, c), current_line, current_column);
}
}
if (in_comment) {
reportError(kErrorType::UNCLOSED_COMMENT, "unterminated block comment.\n", current_line, current_column);
}
if (big_paren_dismatch != 0) reportError(kErrorType::PARENDISMATCH, "unmatched '{' or '}'", -1, -1);
if (mid_paren_dismatch != 0) reportError(kErrorType::PARENDISMATCH, "unmatched '[' or ']'", -1, -1);
if (small_paren_dismatch != 0) reportError(kErrorType::PARENDISMATCH, "unmatched '(' or ')'", -1, -1);
if (single_dismatch) reportError(kErrorType::UNCLOSED_STRING, "unmatched \'", -1, -1);
if (double_dismatch) reportError(kErrorType::UNCLOSED_STRING, "unmatched \"", -1, -1);

file.close();
}

void CLexicalAnalyzer::printStatistics() const {
std::cout << "Lines: " << line_count << ", Words: " << word_count << ", Characters: " << char_count << "\n";
}

void CLexicalAnalyzer::printTokens() const {
for (const auto& token : tokens) {
std::cout << "Token: " << token.value << ", Type: ";
switch (token.type) {
case kTokenType::KEYWORD: std::cout << "Keyword"; break;
case kTokenType::IDENTIFIER: std::cout << "Identifier"; break;
case kTokenType::NUM_LITERAL: std::cout << "Numeric Literal"; break;
case kTokenType::OPERATOR: std::cout << "Operator"; break;
case kTokenType::STRING: std::cout << "String"; break;
}
std::cout << ", Line: " << token.line << ", Column: " << token.column << "\n";
}
}

void CLexicalAnalyzer::printErrors() const {
for (const auto& error : errors) {
std::cout << "Error on line " << error.line << ", column " << error.column << ": ";
switch (error.type) {
case kErrorType::INVALID_CHARACTER: std::cout << error.message; break;
case kErrorType::UNCLOSED_COMMENT: std::cout << error.message; break;
case kErrorType::INVALID_NUM: std::cout << error.message; break;
case kErrorType::PARENDISMATCH: std::cout << error.message; break;
case kErrorType::UNCLOSED_STRING: std::cout << error.message; break;
}
std::cout << "\n";
}
}

int main(int argc, char* argv[]) {
std::string value;
std::string arg;
CLexicalAnalyzer analyzer;

if (argc != 3) {
std::cerr << "usage: LAC -f filename\n";
exit(-1);
}
arg = argv[1];
if (arg == "-f") {
value = argv[2];
} else { // refuse unknown options instead of silently opening an empty filename
std::cerr << "unknown option: " << arg << "\n";
exit(-1);
}

analyzer.analyze(value);

#ifndef NDEBUG
analyzer.printContext();
#endif

analyzer.printTokens();
analyzer.printStatistics();
analyzer.printErrors();
return 0;
}

Test cases

LAC.cc itself serves as the positive test: at nearly 400 lines, it should cover a broad range of valid input. For the error cases I wrote two small programs, error1.txt and error2.txt; their output is in ~/build/stderror.txt