数据库学习

Week1

four main types of actions involve databases

Defining a database: It includes the data types, structures, and constraints of the data have to store in the database. The database descriptive information is also stored by the DBMS in the form of a database catalog or dictionary; it is called meta-data.
Constructing the database: It is the process of data storing on some storage medium that is maintained by the DBMS.
Manipulating a database: It includes functions such as retrieve the database by using query, updating the database to reflect changes in the system, and generate reports from the data.
Sharing a database: It allows multiple users and programs to access the database simultaneously.

four main characteristics of the database approach. （四个数据库方法的主要特征）

self-describing nature of a database system（自描述性质）
insulation between programs and data（程序和数据间的隔离）
data abstraction（数据抽象）
support of multiple views of the data. （支持数据的多个视图）

The differences between controlled and uncontrolled redundancy（受控冗余和不受控冗余）

Redundancy is the state of being not or no longer needed or useful. Uncontrolled redundancy in storing the same data or information many times in the database leads to several problems including duplication of effort, wastage of storage space and inconsistent data. Controlled redundancy is a technique to use redundant fields in a database. This speed up the database access and also improves the performance of queries. Usually, the DBMS ensures the allocation of the data in the records. It should have the capability to control this redundancy in order to prohibit inconsistencies among the files.

Week2

differences between a database schema and a database state. （数据库模式和数据库状态的区别）

A database schema represents the overall design of the database. In contrast, the database state represents the current state of data in the database. The database schema is changed occasionally. The database state is changed frequently. Initially when defining a database, only thedatabase schema is specified. The database state is the empty state when the database is defined.

answer:

A database schema represents the overall design of the database. In contrast, the database state represents the current state of data in the database.
The database schema is changed occasionally. In contrast, the database state is changed frequently.
Initially when defining a database, only the database schema is specified. The database state is the empty state when the database is defined.

differences between logical data independence and physical data independence. （逻辑数据独立性和物理数据独立性区别）

Logical data independence is mainly concerned about the structure or the changing data definition. In contrast, physical data independence is mainly concerned about how the data is stored on the system. Any changes made at the logical level require a change at the application level. In contrast, any changes made at the physical level need not be made at the application level. The conceptual schema is the primary concern in achieving logical data independence. In contrast, the internal schema is the primary concern in achieving physical data independence. Achieving physical data independence is much easier as compared to logical data independence. In contrast, achieving logical data independence is more difficult as compared to physical data independence.

Why logical data independence is harder to achieve compared to physical data independence?

Logical data independence is harder to achieve as the application programs are heavily dependent on the logical format of the data they access. Hence, a change at the conceptual level might require changing the entire program application. When it comes to physical data independence, a change in the location of the database or modifying the file organisation or use of new storage devices will not require change at the higher logical levels.

Answer:

Logical data independence is harder to achieve as the application programs are heavily dependent on the logical format of the data they access
Hence, a change at the conceptual level might require changing the entire program application.
When it comes to physical data independence, a change in the location of the database or modifying the file organisation or use of new storage devices will not require change at the higher logical levels.

procedural and non-procedural data manipulation languages.（程序性数据操作语言和非程序性操作语言）

In terms of the mode of operation of a procedural DML, a programmer needs to provide a sequential set of instructions regarding what the system needs to perform and how it needs to perform them In contrast, a programmer only needs to provide the information regarding what the system need to d and not how to do it when using a non-procedural DML.
In terms of language driving, a procedural DML is a command-driven kind of programming language. In contrast, a non-procedural DML is a function-kind of programming language.
In terms of the approach, a procedural DML is mainly aimed at solving a particular set of available problems. In contrast, a non-procedural DML is a domain-specific type of language.
In terms of dependency, a procedural DML mainly depends on the execution order. In contrast, a non procedural DML mainly depends on the parameter’s values from the given function.
In terms of writing of language, a procedural DML is written in the form of a set of the required instructions. In contrast, a non-procedural DML is written in the form of natural instructions of a language.

Week3

two cases where use of a NULL value would be appropriate.

Setting a NULL value is appropriate when the actual value is unknown, or when a value would not be meaningful.

the differences among an entity, an entity type, and an entity set. （实体，实体类型，实体集合）

An entity is a thing in the real world with independent existence. An entity type is a category of a particular entity. An entity set is the set of all entities of a particular entity type.

differences between an attribute and a value set (属性和值集的区别)

Attributes describe the instances in the row of a database. For example, attributes in an invoice can be price, invoice number and date. A value set specifies the set of values that may be assigned to that attribute for each individual entity.

two conditions where an attribute of a binary relationship type can be migrated to become an attribute of one of the participating entity types.

The attributes of 1:1 or 1:N relationship types can be migrated to one of the participating entity types. In case of 1:1 cardinality, attributes can be moved to either of entity types in binary relationship. In case of 1:N cardinality, attributes can be migrated only to N side of relationship. In both 1:1 and I:N relationship types, the decision as to where a relationship attribute should be placed – as a relationship type attribute or as an attribute of a participating entity type – is determined subjectively by the schema designer. For relationship types with cardinality M:N, attributes cannot be migrated to become attributes of one of participating entity types.

two examples of recursive relationship types.（递归关系类型）

The first example of a recursive relationship type is the supervisor employee relationship. An employee can supervise multiple employees. Hence, it is a recursive relationship of entity EMPLOYEE with itself. This is a one-to-many recursive relationship as one employee supervises many employees. The second example of a recursive relationship type is the class leader and students. A student can be a class leader and handle other students but a person who is working as a class leader is itself a student of the class. Hence, a class leader has a recursive relationship of entity STUDENT with itself.

Week16

SQL

SQL is a very large and powerful language, but every type of SQL statement falls withinone of three main categories (or sub-languages):

• Data Definition Language (DDL) for creating a DB
CREATE , DROP , ALTER

• Data Control Language (DCL) for administering a DB
GRANT , DENY , USE

• Data Manipulation Language (DML) to access a DB
SELECT , INSERT , UPDATE , DELETE

Brieflfly explain what an ontology and a database schema are and discuss one similarity between the two.（本体论和数据库模式的区别）

An ontology is a model that clarifies and specifies a set of meanings in a formal language. Those meanings reflect the ontologist’s understanding of the target subject matter, regarding the kinds of things there are and how those things are related to each other.
A database schema defines the structure of a database in a formal language. There are three kinds of database schemas: conceptual, logical and physical.
The fundamental similarity between an ontology and a database schema is that, at the conceptual level, both consist of set of type definitions expressed in a formal notation.

Discuss insertion, deletion, and modification anomalies. Why are they considered bad?（插入，删除，修改异常为什么认为是不好的）

Anomalies are considered to be bad in the sense that they create dirty data that would be incomplete and inconsistent.
Moreover, improper insertion, deletion, or update operations will violate the integrity properties.
Thus, the entire database would be inconsistent.

Discuss the problem of spurious tuples and how we may prevent it.（伪元组）

A spurious tuple is mainly a record in a database that gets created while two tables are joined badly.
In a database, spurious tuples are formed while two tables are joined on attributes which are neither primary nor foreign keys. To prevent spurious tuples, avoid joining relations that consist of matching attributes that are not primary or foreign key combinations as joining on such attributes may generate spurious tuples.

Why should NULLs in a relation be avoided as much as possible?

NULLs should be avoided to avoid complexity in SELECT and UPDATE queries.
Furthermore, columns which have constraints like primary or foreign key constraints cannot contain a NULL value.

What is meant by granting a privilege? What is meant by revoking a privilege?（给予和撤销权限）

Granting and revoking privileges is a task that you would perform when you want to allow or disallow users of the database to be able to reference data within the database as part of a security practice.
The GRANT privilege statement grants privileges on the database as a whole or on individual tables, views, sequences or procedures.
It controls access to database objects, roles and DBMS resources. The REVOKE statement revokes privileges. It removes database privileges or role access granted to the specified users, groups, role or PUBLIC.
You cannot revoke privileges granted by other users.

What is the difference between discretionary and mandatory access control?（自由权限和强制权限）

In mandatory access control, the system and not the users, determines which subjects can access specific data objects.
Discretionary access control is different from mandatory access control because the owner of the data object specifies which subjects can access the data object.

What is meant by row-level access control?（行级权限控制）

Row level access control refers to the practice of controlling access to data in a database by row, so that users are only able to access the data they are authorized for.
This contrasts with database-level or table-level access control, which controls access to entire databases or tables, respectively.

Discuss the atomicity, durability, isolation, and consistency preservation properties of a database transaction.（原子性，持久性，隔离性，一致性）

Atomicity is a property that ensures that a database follows the all or nothing rule. In other words, the database considers all transaction operations as one whole unit or atom.
Consistency is a property ensuring that only valid data following all rules and constraints is written in the database. When a transaction results in invalid data, the database reverts to its previous state, which abides by all customary rules and constraints.
Isolation is a property that guarantees the individuality of each transaction, and prevents them from being affected from other transactions. It ensures that transactions are securely and independently processed at the same time without interference, but it does not ensure the order of transactions.
Durability is a property that enforces completed transactions, guaranteeing that once each one of them has been committed, it will remain in the system even in case of subsequent failures. These failures include transaction failures and catastrophic failures.

What is a serial schedule? What is a serialisable schedule? Why is a serial schedule considered correct? Why is a serialisable schedule considered correct?（串行调度和可串行调度）

A serial schedule is a type of schedule where one transaction is executed completely before starting another transaction. A serial schedule always gives the correct result. Consider two transactions T1 and T2 which perform some operations. If it has no interleaving of operations, then there are two possible outcomes: either execute all of T1 operations, which were followed by T2 operations.
A serialisable schedule is a non-schedule serial that can be converted to its equivalent serial schedule. In other words, if a non-serial schedule and a serial schedule result in the same, then the non-serial schedule is called a serialisable schedule. A serialisable schedule is accepted as correct because the database is not influenced by the concurrent execution of the transactions.（数据库不受并发事务的影响）

Discuss the actions taken by the read_item, modify_item and write_item operations on a database.

The read item operation reads a data item from storage to main memory. The modify item operation changes the value of item in the main memory. The write item operation writes the modified value from main memory to storage.

Define the violations caused by each of the following: (a) dirty read, and (b) phantoms.（脏读取和幻象）

A dirty read occurs if one transaction reads data that has been modified by another transaction. [1 mark] This results in a violation of transaction isolation, if the transaction that modified the data is rolled back. [1 mark] A phantom read occurs when a transaction retrieves a set of rows twice [1 mark] and new rows are inserted or removed from that set by another transaction that is committed in between. [1 mark]

Describe two differences between primary and secondary storage. （一次存储和二次存储）

Primary storage is the computer’s main memory and stores data temporarily. Secondary storage is external memory and saves data permanently. Data stored in primary storage can be directly accessed by the CPU, which cannot be accessed in secondary storage.

Explain two techniques for allocating file blocks on a disk in detail.（磁盘分配文件块）

contiguous file allocation and linked file allocation. （连续文件分配和链接文件分配）
In contiguous file allocation, the block is allocated in such a manner that all the allocated blocks in the hard disk are adjacent. [1 mark] Assuming a file needs n number of blocks in the disk and the file begins with a block at position x, the next blocks to be assigned to it will be x + 1, x + 2, x + 3, …, x + n–1 so that they are in a contiguous manner.
In linked file allocation, the file which we store on the hard disk is stored in a scattered manner according to the space available on the hard disk. [1 mark] To remember the blocks that belong to the same file, the linked file allocation technique uses pointers to point to the next block of the same file. [1 mark] Therefore, along with the entry of each file each block also stores the pointer to the next block.

Explain how double buffering improves block access time. （双缓冲提高访问时间）

With double buffers, when the CPU is processing the current block of data in buffer 1 [1 mark], it can also retrieve the next block into buffer 2 at the same time. When processing of buffer 1 is done, the CPU can then move on to the next block in buffer 2 immediately without waiting.

What is label security? How does an administrator enforce it?（标签安全，管理员如何管理）

Rows of data are labelled to indicate the level and nature of their sensitivity. A label on a row of data specifies the sensitivity of the information in the row and explicitly defines the criteria that must be met for a user to access that row.
Create the label security policy container.（标签安全策略容器）
Create data labels for the label security policy. （创建数据标签）
Authorise users for the label security policy.（授权用户）
Grant privileges to users and trusted stored program units.（给予权限）

What is flow control as a security measure? What types of flow control exists?（流量控制类型，作用）

Flow controls, most commonly, use a concept of security classes, where transmission of data is blocked if the receiver has a security level lower than the sender. There are two types of flow control:
Explicit flows, consequences of assignments.（显示流量控制）
Implicit flows, generated by conditions.（隐式流量控制）

Discuss three types of arrays in PHP.（php中的数组）

An array in PHP is actually an ordered map. A map is a type that associates values to keys. This type is optimised for several difference users; it can be treated as an array, a list, a hash table, a dictionary, etc. In PHP, the three types of arrays are: (a) indexed arrays, (b) associative arrays, and (c) multidimensional arrays. （索引数组，关联数组，多维数组）
An indexed array is an array with a numeric key. It stores each array element with a numeric index.
An associative array is an array where each key has its own specific value. The keys assigned to values can be arbitrary and user defined strings.
A multidimensional array is an array containing one or more arrays within itself. Each element in the array can also be an array and each element in the sub-array can be an array or further contain array within itself.

What are PHP auto-global-variables?（自动全局变量）

PHP auto-global-variables are built-in variables that are always available in all scopes. Some predefined variables in PHP are super globals, which means that they are always accessible, regardless of scope, and you can access them from any function, class or file without having to do anything special.

What is the CAP theorem? Which of the three properties (consistency, availability, partition tolerance) are most important in NoSQL systems?（一致性，可用性和分区容错性）

In theoretical computer science, the CAP theorem states that any distributed data store can provide only two of the following three guarantees: consistency, availability, and partition tolerance. Consistency guarantees that every read receives the most recent write or an error. Availability guarantees that every request receives a non-error response, without the guarantee that it contains the most recent write. Partition tolerance guarantees that the system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.
In a networked NoSQL system, partition tolerance is a must. Network partitions and dropped or delayed messages are a fact in these networked shared-data systems and it must be handled appropriately. [1 mark] Consequently, system designers must choose between consistency and availability. A NoSQL system that prioritises availability over consistency can respond with possibly stale data. [1 mark] In contrast, a NoSQL system that prioritises consistency over availability can respond with the latest updated data. The system can be distributed across multiple servers and is designed to operate reliably even in the presence of network partitions. [1 mark] So, the partition tolerance and consistency properties are most important in NoSQL systems that require immediate consistency.

What are the similarities and differences between consistency in CAP versus consistency in ACID

What are the differences between structured, semi-structured and unstructured data?（结构化数据和非结果化数据）

Structured data can be displayed in rows, columns and relational databases. Unstructured data cannot be displayed in rows, columns and relational databases.
Structured data is comprised of numbers, dates and strings. Unstructured data is comprised of images, audio, video, word processing files, emails, spreadsheets, etc.
In contrast to structured data, which requires less storage, unstructured data requires more storage. Also, it is more difficult to manage and protect unstructured data with legacy solutions.
Semi-structured data is information that doesn’t consist of structured data but still has some structure to it. It includes key-value stores and graph databases.

What are the differences between the use of tags in XML versus HTML?（html和xml的标签区别）

XML provides namespaces support while HTML does not provide namespaces support. XML namespaces provide a method to avoid element name conflicts.
In contrast to HTML tags, which are used for displaying data, XML tags are used for describing data not displaying it.

What is the difference between XML schema and XML DTD? （xml模式和文档类型描述符）

DTD stands for Document Type Definition. [1 mark] A DTD defines the structure and the legal elements and attributes of an XML document. [1 mark] DTD is the predecessor of XML schema. [1 mark] While DTD provides the basic structure and grammar for defining a XML document, in addition to that XML schema provides methods to define constraints on the data contained in the document. [2 marks] Therefore, XML schema is considered to be richer and more powerful than DTD.

What are the differences between row-level and statement-level active trigger?（行级触发器和语句触发器）

Row level triggers execute once for each and every row in the transaction. In contrast, statement level triggers execute only once for each single transaction.
Row level triggers are specifically used for data handling purposes. In contrast, statement level trigger are used for enforcing all additional security on the transactions performed on the table.
For example, if 200 rows are to be inserted into a table, the row level trigger would execute 200 times. In contrast, if 200 rows are to be inserted into a table, a statement level trigger would only execute once.

How does clustering differ from classification?（聚类和分类）

Classification is a supervised machine learning approach where a specific label is provided to the machine learning model to classify new observations. In contrast to classification, clustering is an unsupervised machine learning approach where no specific labels are available in the dataset and grouping is performed on similarities basis.

What are the difficulties of mining association rules from large databases?（从大型数据库挖掘关联规则）

Mining of association rules in large databases is a challenging task. An Apriori algorithm is widely used to find out the frequent itemsets from the database. [1 mark] However, it will be inefficient on large databases because it will require more I/O load. Another disadvantage of association rule mining is that they can produce boring rules [1 mark], have many discovered rules [1 mark], and perform poorly. [1 mark]