Assignment 2: Your Data Ver2.0

Describing the dataset

Author

Jiyin Zhang

Published

February 7, 2023

MY DATASET

This dataset is derived from the element co-occurrence counts in the Mindat.org database. The source data were retrieved via the Mindat API and stored in JSON format. After preprocessing and cleaning, the data were curated and stored in CSV format. The dataset is available in the GitHub repository under the name mineral.csv.

Data Collection

The dataset was retrieved via the Mindat API as a JSON file. In the preprocessing step, the element information was extracted into a new JSON file, in which some of the hierarchical structure was flattened so that the data could be exported with Python’s to_csv function. The exported CSV file can then be read directly with R’s read.csv function.

IMPORTING THE DATA

I’m going to use the built-in read.csv function to import the CSV file.

Code
library(tidyverse)
# result <- read.csv(file = 'total_elements_mindat.csv')
result <- read.csv(file = 'mineral.csv')

The glimpse function from the tidyverse is a nice way to summarize the data frame:

Code
glimpse(result)
Rows: 5,883
Columns: 17
$ id              <int> 1, 2, 3, 4, 9, 10, 13, 14, 18, 19, 21, 23, 27, 31, 32,…
$ name            <chr> "Abelsonite", "Abenakiite-(Ce)", "Abernathyite", "Abhu…
$ elements        <chr> "-Ni-N-C-H-", "-Ce-Na-Si-O-P-C-S-", "-As-O-K-H-U-", "-…
$ sigelements     <chr> "-Ni-N-C-H-", "-Ce-Na-Si-O-P-C-S-", "-As-O-K-H-U-", "-…
$ yeardiscovery   <chr> "1975", "", "1956", "1983", "1990", "1855", "1974", "1…
$ hmin            <dbl> 2.0, 4.0, 2.5, 2.0, 6.5, 2.0, 1.0, 2.5, 5.0, 3.5, 3.5,…
$ hmax            <dbl> 3.0, 5.0, 3.0, 2.0, 6.5, 2.5, 1.5, 2.5, 6.0, 3.5, 3.5,…
$ hardtype        <int> 0, 0, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0, …
$ specificgravity <chr> "1.4", "3.21", "", "4.29", "", "7.24 (calc)  7.2-7.4(m…
$ strunz10ed1     <int> 10, 9, 8, 3, 9, 2, 10, 2, 9, 3, 8, 8, 6, 9, 9, 9, 9, 8…
$ strunz10ed2     <chr> "C", "C", "E", "D", "A", "B", "A", "C", "D", "C", "B",…
$ strunz10ed3     <chr> "A", "K", "B", "A", "G", "A", "A", "C", "E", "C", "B",…
$ strunz10ed4     <chr> "20", "10", "15", "30", "05", "35", "20", "05", "10", …
$ dana8ed1        <chr> "50", "61", "40", "10", "7", "2", "50", "0", "0", "11"…
$ dana8ed2        <chr> "4", "4", "2a", "5", "5", "4", "4", "0", "0", "6", "6"…
$ dana8ed3        <chr> "9", "1", "9", "9", "1", "1", "7", "0", "0", "17", "6"…
$ dana8ed4        <chr> "1", "1", "1", "1", "4", "1", "1", "0", "0", "1", "3",…

DESCRIBE THE DATA

Data Set Type

The dataset is stored as a single flat table: the items (rows) are 5,883 mineral species from the OpenMindat data server, and the columns represent their attributes.

Mineral Species Attributes
c('id', 'name', 'elements', 'sigelements', 'yeardiscovery', 'hmin', 'hmax', 'hardtype', 'specificgravity', 'strunz10ed1', 'strunz10ed2', 'strunz10ed3', 'strunz10ed4', 'dana8ed1', 'dana8ed2', 'dana8ed3', 'dana8ed4')
 [1] "id"              "name"            "elements"        "sigelements"    
 [5] "yeardiscovery"   "hmin"            "hmax"            "hardtype"       
 [9] "specificgravity" "strunz10ed1"     "strunz10ed2"     "strunz10ed3"    
[13] "strunz10ed4"     "dana8ed1"        "dana8ed2"        "dana8ed3"       
[17] "dana8ed4"       

Attribute Types

The attributes are recorded in a two-dimensional table, so the data frame rows look similar to the output of the glimpse function. The ‘id’ field is in strictly ascending order but is not contiguous. The ‘id’ of each row is assigned by the website managers or data providers, so it does not correspond to any standard identifier. The ‘name’ field gives the IMA-approved mineral species name. The ‘elements’ and ‘sigelements’ fields list the elements in the mineral’s chemical formula; ‘sigelements’ contains only the significant elements, a subset of ‘elements’. For compatibility with the CSV format, the elements in these fields are separated by hyphens (-).
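The hyphen-delimited format can be unpacked with base R alone. The following is a minimal sketch, not part of the original analysis: `split_elements` is a hypothetical helper name, and the two input strings are copied from the first row of the data shown in this document.

```r
# Hypothetical helper: split a hyphen-delimited element string such as
# "-Ni-N-C-H-" into a character vector of element symbols.
split_elements <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]
  parts[nzchar(parts)]  # drop the empty string produced by the leading hyphen
}

all_elements <- split_elements("-Ni-N-C-H-")  # 'elements' of Abelsonite
sig_elements <- split_elements("-Ni-N-C-H-")  # 'sigelements' of Abelsonite

print(all_elements)

# 'sigelements' is described as a subset of 'elements'; check it for this row.
print(all(sig_elements %in% all_elements))
```

Running the same subset check over every row would be one way to confirm that the description of ‘sigelements’ holds for the whole table.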

Code
head(result)
  id            name           elements        sigelements yeardiscovery hmin
1  1      Abelsonite         -Ni-N-C-H-         -Ni-N-C-H-          1975  2.0
2  2 Abenakiite-(Ce) -Ce-Na-Si-O-P-C-S- -Ce-Na-Si-O-P-C-S-                4.0
3  3    Abernathyite       -As-O-K-H-U-       -As-O-K-H-U-          1956  2.5
4  4        Abhurite        -Cl-Sn-O-H-        -Cl-Sn-O-H-          1983  2.0
5  9  Abswurmbachite       -Cu-Mn-Si-O-       -Cu-Mn-Si-O-          1990  6.5
6 10       Acanthite             -Ag-S-             -Ag-S-          1855  2.0
  hmax hardtype            specificgravity strunz10ed1 strunz10ed2 strunz10ed3
1  3.0        0                        1.4          10           C           A
2  5.0        0                       3.21           9           C           K
3  3.0        0                                      8           E           B
4  2.0        3                       4.29           3           D           A
5  6.5        0                                      9           A           G
6  2.5        3 7.24 (calc)  7.2-7.4(meas)           2           B           A
  strunz10ed4 dana8ed1 dana8ed2 dana8ed3 dana8ed4
1          20       50        4        9        1
2          10       61        4        1        1
3          15       40       2a        9        1
4          30       10        5        9        1
5          05        7        5        1        4
6          35        2        4        1        1

Visualization

The correlation between elements and hardness

Elements <- c('H', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Cs', 'Ba', 'La', 'Ce', 'Nd', 'Sm', 'Gd', 'Dy', 'Er', 'Yb', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Th', 'U')
library("tidyverse")

df <- data.frame(Elements)
df1 <- df %>%
  add_column(hmin = NA, hmin_mineral = NA, hmax = NA, hmax_mineral = NA, hmean = NA)

# for (row in 1:nrow(result)) {
#     mineral_name <- result[row, "name"]
#     elements <- str_extract_all(
#       result[row, "elements"], regex("(?<=-)[A-Z]+[a-z]*(?=-)")
#       )
# 
#     hmin <- result[row, "hmin"]
#     for (element in elements){
#         comparing_hmin <- df1[df1$Elements == element, 'hmin']
#         if (is.na(comparing_hmin)){
#           df1[df1$Elements == element, 'hmin'] <- hmin
#         }
#     }
#     # hmax <- result[row, "hmax"]
# }

print(df1)
   Elements hmin hmin_mineral hmax hmax_mineral hmean
1         H   NA           NA   NA           NA    NA
2        Li   NA           NA   NA           NA    NA
3        Be   NA           NA   NA           NA    NA
4         B   NA           NA   NA           NA    NA
5         C   NA           NA   NA           NA    NA
6         N   NA           NA   NA           NA    NA
7         O   NA           NA   NA           NA    NA
8         F   NA           NA   NA           NA    NA
9        Na   NA           NA   NA           NA    NA
10       Mg   NA           NA   NA           NA    NA
11       Al   NA           NA   NA           NA    NA
12       Si   NA           NA   NA           NA    NA
13        P   NA           NA   NA           NA    NA
14        S   NA           NA   NA           NA    NA
15       Cl   NA           NA   NA           NA    NA
16        K   NA           NA   NA           NA    NA
17       Ca   NA           NA   NA           NA    NA
18       Sc   NA           NA   NA           NA    NA
19       Ti   NA           NA   NA           NA    NA
20        V   NA           NA   NA           NA    NA
21       Cr   NA           NA   NA           NA    NA
22       Mn   NA           NA   NA           NA    NA
23       Fe   NA           NA   NA           NA    NA
24       Co   NA           NA   NA           NA    NA
25       Ni   NA           NA   NA           NA    NA
26       Cu   NA           NA   NA           NA    NA
27       Zn   NA           NA   NA           NA    NA
28       Ga   NA           NA   NA           NA    NA
29       Ge   NA           NA   NA           NA    NA
30       As   NA           NA   NA           NA    NA
31       Se   NA           NA   NA           NA    NA
32       Br   NA           NA   NA           NA    NA
33       Rb   NA           NA   NA           NA    NA
34       Sr   NA           NA   NA           NA    NA
35        Y   NA           NA   NA           NA    NA
36       Zr   NA           NA   NA           NA    NA
37       Nb   NA           NA   NA           NA    NA
38       Mo   NA           NA   NA           NA    NA
39       Ru   NA           NA   NA           NA    NA
40       Rh   NA           NA   NA           NA    NA
41       Pd   NA           NA   NA           NA    NA
42       Ag   NA           NA   NA           NA    NA
43       Cd   NA           NA   NA           NA    NA
44       In   NA           NA   NA           NA    NA
45       Sn   NA           NA   NA           NA    NA
46       Sb   NA           NA   NA           NA    NA
47       Te   NA           NA   NA           NA    NA
48        I   NA           NA   NA           NA    NA
49       Cs   NA           NA   NA           NA    NA
50       Ba   NA           NA   NA           NA    NA
51       La   NA           NA   NA           NA    NA
52       Ce   NA           NA   NA           NA    NA
53       Nd   NA           NA   NA           NA    NA
54       Sm   NA           NA   NA           NA    NA
55       Gd   NA           NA   NA           NA    NA
56       Dy   NA           NA   NA           NA    NA
57       Er   NA           NA   NA           NA    NA
58       Yb   NA           NA   NA           NA    NA
59       Hf   NA           NA   NA           NA    NA
60       Ta   NA           NA   NA           NA    NA
61        W   NA           NA   NA           NA    NA
62       Re   NA           NA   NA           NA    NA
63       Os   NA           NA   NA           NA    NA
64       Ir   NA           NA   NA           NA    NA
65       Pt   NA           NA   NA           NA    NA
66       Au   NA           NA   NA           NA    NA
67       Hg   NA           NA   NA           NA    NA
68       Tl   NA           NA   NA           NA    NA
69       Pb   NA           NA   NA           NA    NA
70       Bi   NA           NA   NA           NA    NA
71       Th   NA           NA   NA           NA    NA
72        U   NA           NA   NA           NA    NA
print(df1[df1$Elements == 'H', 'hmin'])
[1] NA
comparing_hmin <- df1[df1$Elements == 'H', 'hmin']
print(comparing_hmin)
[1] NA
if (is.na(comparing_hmin)){
  # df1[df1$Elements == element, 'hmin'] <- hmin
  print('hello')
}
[1] "hello"
elements <- str_extract_all(
      result[1, "elements"], regex("(?<=-)[A-Z]+[a-z]*(?=-)")
      )
print(elements)
[[1]]
[1] "Ni" "N"  "C"  "H" 
# for (element in elements){
#   comparing_name <- df1[df1$Elements == element, 'Elements']
#   #v comparing_hmin <- df1[df1$Elements == element, 'hmin']
#   #print(element, sep = '\n')
#   print(comparing_name, sep = '\n')
#   #cat(comparing_hmin, sep = '\n')
# }


print(class(elements))
[1] "list"
for (i in 1:length(elements)){
  print(elements[i])
  print('hello')
  }
[[1]]
[1] "Ni" "N"  "C"  "H" 

[1] "hello"
for (element in elements){
  print(element)
  print('hello')
 }
[1] "Ni" "N"  "C"  "H" 
[1] "hello"
# comparing_hmin <- df1[df1$Elements == 'H', 'hmin']
# print(comparing_hmin)
# print(class(comparing_hmin))
# if (is.na(comparing_hmin)){
#   print('test')
# }


# x <- c('-Ce-Na-Si-O-P-C-S-')
# y <- str_extract_all(x, regex("(?<=-)[A-Z]+[a-z]*(?=-)"))
# for (item in y){
#   cat(item, sep="\n")
# }
#   
# 
# primes_list <- list(2, 3, 5, 7, 11, 13)

# # loop version 1
# for (p in primes_list) {
#   print(p)
# }
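The loop output above shows why the commented-out version did not fill in df1: str_extract_all returns a list, so `for (element in elements)` visits the whole character vector once instead of one symbol at a time. Below is a minimal sketch of one way the per-element table could be filled, using base R only. It is an illustration under stated assumptions, not the finished analysis: the four toy rows are copied from head(result), `summarise_element` is a hypothetical helper, and hmean is assumed to mean the average of each mineral's (hmin + hmax) / 2 midpoint.

```r
# Toy rows copied from head(result); the real analysis would use `result`.
minerals <- data.frame(
  name     = c("Abelsonite", "Abernathyite", "Abswurmbachite", "Acanthite"),
  elements = c("-Ni-N-C-H-", "-As-O-K-H-U-", "-Cu-Mn-Si-O-", "-Ag-S-"),
  hmin     = c(2.0, 2.5, 6.5, 2.0),
  hmax     = c(3.0, 3.0, 6.5, 2.5),
  stringsAsFactors = FALSE
)

# Split every element string, then build one row per (mineral, element) pair.
symbols <- lapply(strsplit(minerals$elements, "-", fixed = TRUE),
                  function(p) p[nzchar(p)])
long <- data.frame(
  element = unlist(symbols),
  name    = rep(minerals$name, lengths(symbols)),
  hmin    = rep(minerals$hmin, lengths(symbols)),
  hmax    = rep(minerals$hmax, lengths(symbols)),
  stringsAsFactors = FALSE
)

# Drop pairs with missing hardness so which.min/which.max always find a row.
long <- long[!is.na(long$hmin) & !is.na(long$hmax), ]

# Hypothetical helper: softest and hardest host mineral for one element.
summarise_element <- function(d) {
  lo <- which.min(d$hmin)
  hi <- which.max(d$hmax)
  data.frame(
    Elements     = d$element[1],
    hmin         = d$hmin[lo],
    hmin_mineral = d$name[lo],
    hmax         = d$hmax[hi],
    hmax_mineral = d$name[hi],
    hmean        = mean((d$hmin + d$hmax) / 2),  # assumed definition of hmean
    stringsAsFactors = FALSE
  )
}

element_hardness <- do.call(
  rbind, lapply(split(long, long$element), summarise_element)
)
rownames(element_hardness) <- NULL
print(element_hardness)
```

Reshaping to one row per mineral-element pair avoids updating df1 cell by cell inside a nested loop, and the resulting columns match the ones declared for df1.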

Question

Note
I have no idea why the rendered output of this .qmd file failed to adopt the HTML style.

The problem is that the .qmd file should not contain the hash symbol (#).

Another issue I fixed concerns the row count. The JSON file originally contains 5,883 items, which should convert to 5,883 CSV rows, but the conversion produced over 5,900 rows. The cause was newline characters (\n) embedded in some attribute values, which ended up as extra rows in the exported CSV file. I fixed this by simply removing all the \n characters from the JSON file.
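The same kind of cleanup can also be sketched on the R side. The example below is only an illustration: the fix described above was applied to the JSON file itself, and the `records` data frame and its `note` column are hypothetical stand-ins for an attribute that contains a line break.

```r
# Hypothetical records; the second value has a line break inside it.
records <- data.frame(
  id   = c(1L, 2L),
  note = c("single line", "first line\nsecond line"),
  stringsAsFactors = FALSE
)

# Replace any run of CR/LF characters in the character columns with one space.
is_text <- vapply(records, is.character, logical(1))
records[is_text] <- lapply(
  records[is_text],
  function(col) gsub("[\r\n]+", " ", col)
)

# Round-trip through CSV and confirm that the row count is preserved.
path <- tempfile(fileext = ".csv")
write.csv(records, path, row.names = FALSE)
stopifnot(length(readLines(path)) == nrow(records) + 1)  # +1 for the header
stopifnot(nrow(read.csv(path)) == nrow(records))
```

Comparing the number of physical lines in the file with the expected item count, as in the last two checks, is a quick way to catch this problem before the file is imported.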