library(tidyverse)
# result <- read.csv(file = 'total_elements_mindat.csv')
result <- read.csv(file = 'mineral.csv')
Describing the dataset
Jiyin Zhang
February 7, 2023
This dataset is derived from the element co-occurrence counts in the Mindat.org database. The source data were retrieved via the Mindat API and stored in JSON format. After preprocessing and cleaning, the data were curated and stored in CSV format. The dataset is available in the GitHub repository as mineral.csv.
The dataset was retrieved via the Mindat API as a JSON file. In the preprocessing step, the element information was extracted into a new JSON file in which some of the hierarchical structure was flattened, so that it could be exported with Python’s to_csv function. The exported CSV file can then be read directly with R’s read.csv function.
I’m going to use the built-in read.csv function to import the CSV file.
The glimpse function from the tidyverse is a nice way to summarize the data frame:
Rows: 5,883
Columns: 17
$ id <int> 1, 2, 3, 4, 9, 10, 13, 14, 18, 19, 21, 23, 27, 31, 32,…
$ name <chr> "Abelsonite", "Abenakiite-(Ce)", "Abernathyite", "Abhu…
$ elements <chr> "-Ni-N-C-H-", "-Ce-Na-Si-O-P-C-S-", "-As-O-K-H-U-", "-…
$ sigelements <chr> "-Ni-N-C-H-", "-Ce-Na-Si-O-P-C-S-", "-As-O-K-H-U-", "-…
$ yeardiscovery <chr> "1975", "", "1956", "1983", "1990", "1855", "1974", "1…
$ hmin <dbl> 2.0, 4.0, 2.5, 2.0, 6.5, 2.0, 1.0, 2.5, 5.0, 3.5, 3.5,…
$ hmax <dbl> 3.0, 5.0, 3.0, 2.0, 6.5, 2.5, 1.5, 2.5, 6.0, 3.5, 3.5,…
$ hardtype <int> 0, 0, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, …
$ specificgravity <chr> "1.4", "3.21", "", "4.29", "", "7.24 (calc) 7.2-7.4(m…
$ strunz10ed1 <int> 10, 9, 8, 3, 9, 2, 10, 2, 9, 3, 8, 8, 6, 9, 9, 9, 9, 8…
$ strunz10ed2 <chr> "C", "C", "E", "D", "A", "B", "A", "C", "D", "C", "B",…
$ strunz10ed3 <chr> "A", "K", "B", "A", "G", "A", "A", "C", "E", "C", "B",…
$ strunz10ed4 <chr> "20", "10", "15", "30", "05", "35", "20", "05", "10", …
$ dana8ed1 <chr> "50", "61", "40", "10", "7", "2", "50", "0", "0", "11"…
$ dana8ed2 <chr> "4", "4", "2a", "5", "5", "4", "4", "0", "0", "6", "6"…
$ dana8ed3 <chr> "9", "1", "9", "9", "1", "1", "7", "0", "0", "17", "6"…
$ dana8ed4 <chr> "1", "1", "1", "1", "4", "1", "1", "0", "0", "1", "3",…
The dataset is stored as one large flat table: the rows are 5,883 mineral species from the OpenMindat data server, and the columns represent the corresponding attributes.
[1] "id" "name" "elements" "sigelements"
[5] "yeardiscovery" "hmin" "hmax" "hardtype"
[9] "specificgravity" "strunz10ed1" "strunz10ed2" "strunz10ed3"
[13] "strunz10ed4" "dana8ed1" "dana8ed2" "dana8ed3"
[17] "dana8ed4"
The attributes are recorded in a two-dimensional format, so the data frame rows look similar to the output of the glimpse function. The ‘id’ field is in strictly ascending order but is not contiguous. The ‘id’ of each row is assigned by the website managers or data providers, so it does not correspond to any standard identifier. The ‘name’ field gives the IMA-approved mineral species name. The ‘elements’ and ‘sigelements’ fields list the elements in the mineral’s chemical formula; ‘sigelements’ contains only the significant elements, a subset of ‘elements’. For compatibility with the CSV format, the elements in these fields are separated by hyphens (-).
id name elements sigelements yeardiscovery hmin
1 1 Abelsonite -Ni-N-C-H- -Ni-N-C-H- 1975 2.0
2 2 Abenakiite-(Ce) -Ce-Na-Si-O-P-C-S- -Ce-Na-Si-O-P-C-S- 4.0
3 3 Abernathyite -As-O-K-H-U- -As-O-K-H-U- 1956 2.5
4 4 Abhurite -Cl-Sn-O-H- -Cl-Sn-O-H- 1983 2.0
5 9 Abswurmbachite -Cu-Mn-Si-O- -Cu-Mn-Si-O- 1990 6.5
6 10 Acanthite -Ag-S- -Ag-S- 1855 2.0
hmax hardtype specificgravity strunz10ed1 strunz10ed2 strunz10ed3
1 3.0 0 1.4 10 C A
2 5.0 0 3.21 9 C K
3 3.0 0 8 E B
4 2.0 3 4.29 3 D A
5 6.5 0 9 A G
6 2.5 3 7.24 (calc) 7.2-7.4(meas) 2 B A
strunz10ed4 dana8ed1 dana8ed2 dana8ed3 dana8ed4
1 20 50 4 9 1
2 10 61 4 1 1
3 15 40 2a 9 1
4 30 10 5 9 1
5 05 7 5 1 4
6 35 2 4 1 1
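The hyphen-delimited ‘elements’ strings shown above can be turned back into a vector of element symbols with base R alone. The following is a minimal sketch, not part of the original analysis; the helper names split_elements and contains_element are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: split one '-Ni-N-C-H-' style string into symbols.
split_elements <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]
  # The leading hyphen produces an empty first piece, so drop empty strings.
  parts[nzchar(parts)]
}

# Hypothetical helper: test for an element by matching it together with its
# surrounding hyphens, so that "N" does not match inside "Ni".
contains_element <- function(x, symbol) {
  grepl(paste0("-", symbol, "-"), x, fixed = TRUE)
}

abelsonite <- "-Ni-N-C-H-"   # value taken from the first row shown above
symbols <- split_elements(abelsonite)
print(symbols)
print(contains_element("-Ni-C-", "N"))
```

One convenient consequence of having a hyphen on both sides of every symbol is that a fixed-string search for "-N-" cannot accidentally match "Ni" or "Na".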
Elements <- c('H', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Cs', 'Ba', 'La', 'Ce', 'Nd', 'Sm', 'Gd', 'Dy', 'Er', 'Yb', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Th', 'U')
library("tidyverse")
df <- data.frame(Elements)
df1 <- df %>%
add_column(hmin = NA, hmin_mineral = NA, hmax = NA, hmax_mineral = NA, hmean = NA)
# for (row in 1:nrow(result)) {
# mineral_name <- result[row, "name"]
# elements <- str_extract_all(
# result[row, "elements"], regex("(?<=-)[A-Z]+[a-z]*(?=-)")
# )
#
# hmin <- result[row, "hmin"]
# for (element in elements){
# comparing_hmin <- df1[df1$Elements == element, 'hmin']
# if (is.na(comparing_hmin)){
# df1[df1$Elements == element, 'hmin'] <- hmin
# }
# }
# # hmax <- result[row, "hmax"]
# }
print(df1)
Elements hmin hmin_mineral hmax hmax_mineral hmean
1 H NA NA NA NA NA
2 Li NA NA NA NA NA
3 Be NA NA NA NA NA
4 B NA NA NA NA NA
5 C NA NA NA NA NA
6 N NA NA NA NA NA
7 O NA NA NA NA NA
8 F NA NA NA NA NA
9 Na NA NA NA NA NA
10 Mg NA NA NA NA NA
11 Al NA NA NA NA NA
12 Si NA NA NA NA NA
13 P NA NA NA NA NA
14 S NA NA NA NA NA
15 Cl NA NA NA NA NA
16 K NA NA NA NA NA
17 Ca NA NA NA NA NA
18 Sc NA NA NA NA NA
19 Ti NA NA NA NA NA
20 V NA NA NA NA NA
21 Cr NA NA NA NA NA
22 Mn NA NA NA NA NA
23 Fe NA NA NA NA NA
24 Co NA NA NA NA NA
25 Ni NA NA NA NA NA
26 Cu NA NA NA NA NA
27 Zn NA NA NA NA NA
28 Ga NA NA NA NA NA
29 Ge NA NA NA NA NA
30 As NA NA NA NA NA
31 Se NA NA NA NA NA
32 Br NA NA NA NA NA
33 Rb NA NA NA NA NA
34 Sr NA NA NA NA NA
35 Y NA NA NA NA NA
36 Zr NA NA NA NA NA
37 Nb NA NA NA NA NA
38 Mo NA NA NA NA NA
39 Ru NA NA NA NA NA
40 Rh NA NA NA NA NA
41 Pd NA NA NA NA NA
42 Ag NA NA NA NA NA
43 Cd NA NA NA NA NA
44 In NA NA NA NA NA
45 Sn NA NA NA NA NA
46 Sb NA NA NA NA NA
47 Te NA NA NA NA NA
48 I NA NA NA NA NA
49 Cs NA NA NA NA NA
50 Ba NA NA NA NA NA
51 La NA NA NA NA NA
52 Ce NA NA NA NA NA
53 Nd NA NA NA NA NA
54 Sm NA NA NA NA NA
55 Gd NA NA NA NA NA
56 Dy NA NA NA NA NA
57 Er NA NA NA NA NA
58 Yb NA NA NA NA NA
59 Hf NA NA NA NA NA
60 Ta NA NA NA NA NA
61 W NA NA NA NA NA
62 Re NA NA NA NA NA
63 Os NA NA NA NA NA
64 Ir NA NA NA NA NA
65 Pt NA NA NA NA NA
66 Au NA NA NA NA NA
67 Hg NA NA NA NA NA
68 Tl NA NA NA NA NA
69 Pb NA NA NA NA NA
70 Bi NA NA NA NA NA
71 Th NA NA NA NA NA
72 U NA NA NA NA NA
[1] NA
[1] NA
[1] "hello"
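The commented-out loop above appears intended to fill df1 with per-element hardness values. As a hedged sketch of one way to get there without a row-by-row loop, the code below uses a grouped summary. It runs on a small tibble named toy (a hypothetical stand-in for result) built from the six rows printed earlier. The meaning of each output column is an assumption: hmin and hmax are taken as the lowest and highest hardness among minerals containing the element, the *_mineral columns name the first mineral reaching that value, and hmean is the mean of each mineral's (hmin + hmax) / 2.

```r
library(tidyverse)

# Six rows copied from the head() output; a stand-in for `result`.
toy <- tibble(
  name = c("Abelsonite", "Abenakiite-(Ce)", "Abernathyite",
           "Abhurite", "Abswurmbachite", "Acanthite"),
  elements = c("-Ni-N-C-H-", "-Ce-Na-Si-O-P-C-S-", "-As-O-K-H-U-",
               "-Cl-Sn-O-H-", "-Cu-Mn-Si-O-", "-Ag-S-"),
  hmin = c(2.0, 4.0, 2.5, 2.0, 6.5, 2.0),
  hmax = c(3.0, 5.0, 3.0, 2.0, 6.5, 2.5)
)

per_element <- toy %>%
  # One list entry of symbols per mineral, then one row per (mineral, element).
  mutate(element = str_extract_all(elements, "(?<=-)[A-Z]+[a-z]*(?=-)")) %>%
  unnest(element) %>%
  group_by(element) %>%
  summarise(
    # The *_mineral and hmean columns are computed first, because later
    # expressions in summarise() see the already-summarised hmin and hmax.
    hmin_mineral = name[which.min(hmin)],
    hmax_mineral = name[which.max(hmax)],
    hmean = mean((hmin + hmax) / 2),   # assumed definition of hmean
    hmin = min(hmin),
    hmax = max(hmax),
    .groups = "drop"
  )

print(per_element)
```

Applying the same pipeline to result instead of toy would cover the full dataset, provided hmin and hmax contain no missing values; that has not been checked here.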
elements <- str_extract_all(
result[1, "elements"], regex("(?<=-)[A-Z]+[a-z]*(?=-)")
)
print(elements)
[[1]]
[1] "Ni" "N" "C" "H"
# for (element in elements){
# comparing_name <- df1[df1$Elements == element, 'Elements']
# #v comparing_hmin <- df1[df1$Elements == element, 'hmin']
# #print(element, sep = '\n')
# print(comparing_name, sep = '\n')
# #cat(comparing_hmin, sep = '\n')
# }
print(class(elements))
[1] "list"
[[1]]
[1] "Ni" "N" "C" "H"
[1] "hello"
[1] "Ni" "N" "C" "H"
[1] "hello"
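The class(elements) output above shows why a plain for loop over elements does not visit one symbol at a time: str_extract_all returns a list with one entry per input string, and that single entry is the whole character vector. A minimal sketch of unwrapping it first, using the same pattern as above (the variable name symbols is introduced here for illustration):

```r
library(stringr)

elements <- str_extract_all(
  "-Ni-N-C-H-", regex("(?<=-)[A-Z]+[a-z]*(?=-)")
)

# One list entry per input string, so this list has length 1.
print(length(elements))

# Take the first (and only) entry to get the character vector of symbols.
symbols <- elements[[1]]

# Now the loop body runs once per element symbol.
for (symbol in symbols) {
  print(symbol)
}
```

With this change, a comparison such as df1$Elements == symbol receives a single string instead of a four-element vector.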
# comparing_hmin <- df1[df1$Elements == 'H', 'hmin']
# print(comparing_hmin)
# print(class(comparing_hmin))
# if (is.na(comparing_hmin)){
# print('test')
# }
# x <- c('-Ce-Na-Si-O-P-C-S-')
# y <- str_extract_all(x, regex("(?<=-)[A-Z]+[a-z]*(?=-)"))
# for (item in y){
# cat(item, sep="\n")
# }
#
#
# primes_list <- list(2, 3, 5, 7, 11, 13)
# # loop version 1
# for (p in primes_list) {
# print(p)
# }