\[FScore_i = \frac{ \sum_{k=1}^s \sum_{l=1}^{t_k} \mathbb{I}(a_i \in RED_{k,l}) } { \sum_{k=1}^s \mathbb{I}(a_i \in SUB_k) }\]
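FScore_i is the fraction of reducts, over all s subtables SUB_k and their t_k reducts RED_{k,l}, that contain attribute a_i, normalised by the number of subtables into which a_i was drawn at all. A minimal Java sketch of that counting, assuming the reducts and subtable attribute sets are available as BitSets (the method and parameter names are hypothetical):

private double[] computeFScores(List<BitSet> subTables,
                                List<List<BitSet>> reducts,
                                int numAttributes) {
    double[] fScore = new double[numAttributes];
    for (int i = 0; i < numAttributes; i++) {
        int inReducts = 0;   // sum_k sum_l I(a_i in RED_{k,l})
        int inSubTables = 0; // sum_k I(a_i in SUB_k)
        for (int k = 0; k < subTables.size(); k++) {
            // a_i not drawn into SUB_k, so it cannot occur in any RED_{k,l} either
            if (!subTables.get(k).get(i)) continue;
            inSubTables++;
            for (BitSet red : reducts.get(k)) {
                if (red.get(i)) inReducts++;
            }
        }
        fScore[i] = (inSubTables == 0) ? 0.0 : (double) inReducts / inSubTables;
    }
    return fScore;
}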
\[wAcc = \frac{ 1 }{ c } \sum_{i=1}^c \frac{ n_{ii} }{ n_{i1} + \dots + n_{ic} } \]
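wAcc is the balanced (class-weighted) accuracy: the mean of per-class recalls taken from the c × c confusion matrix n. For example, with confusion-matrix rows (90, 10) and (20, 30), wAcc = (90/100 + 30/50)/2 = 0.75, while the plain accuracy would be 120/150 = 0.8. A minimal Java sketch (helper name is hypothetical):

/** Balanced accuracy: mean of per-class recalls from a c x c confusion matrix. */
private double weightedAccuracy(long[][] n) {
    double sum = 0.0;
    for (int i = 0; i < n.length; i++) {
        long rowTotal = 0;                       // n_{i1} + ... + n_{ic}
        for (long nij : n[i]) rowTotal += nij;
        if (rowTotal > 0) sum += (double) n[i][i] / rowTotal;
    }
    return sum / n.length;
}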
\[ RI_{g_k} = \sum_{\tau=1}^{st} wAcc^u \sum_{n_{g_k}(\tau)} IG(n_{g_k}(\tau)) \left( \frac{ \textrm{no. in } n_{g_k}(\tau) }{ \textrm{no. in } \tau } \right)^v \]
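RI_{g_k} aggregates, over all s·t constructed trees τ, the tree's wAcc raised to the power u times, for every node n_{g_k}(τ) that splits on attribute g_k, the information gain IG weighted by the fraction of objects reaching that node raised to the power v (an MCFS-style relative importance). A minimal Java sketch, assuming per-node statistics have already been extracted from the trees; the Node/Tree containers and all names below are hypothetical:

public class RelativeImportance {
    /** One tree node that splits on attribute 'attr'. */
    static class Node {
        int attr;          // g_k used in the split
        double ig;         // IG(n_{g_k}(tau))
        int objectsInNode; // "no. in n_{g_k}(tau)"
        Node(int attr, double ig, int objectsInNode) {
            this.attr = attr; this.ig = ig; this.objectsInNode = objectsInNode;
        }
    }

    /** One of the s*t trees: its weighted accuracy and its split nodes. */
    static class Tree {
        double wAcc;        // weighted accuracy of the tree
        int objectsInTree;  // "no. in tau"
        List<Node> nodes;
        Tree(double wAcc, int objectsInTree, List<Node> nodes) {
            this.wAcc = wAcc; this.objectsInTree = objectsInTree; this.nodes = nodes;
        }
    }

    static double[] relativeImportance(List<Tree> trees, int numAttributes,
                                       double u, double v) {
        double[] ri = new double[numAttributes];
        for (Tree tau : trees) {
            double treeWeight = Math.pow(tau.wAcc, u);   // wAcc^u
            for (Node node : tau.nodes) {
                double frac = (double) node.objectsInNode / tau.objectsInTree;
                ri[node.attr] += treeWeight * node.ig * Math.pow(frac, v);
            }
        }
        return ri;
    }
}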
\[ N = \underset{ n \in \{1, \dots, d \} }{ \textrm{argmin} } \left( \left( 1 - \frac{\sum_{k=1}^n Score_k}{\sum_{k=1}^d Score_k} \right)^2 + \left( \frac{n}{d} \right)^2 \right) \]
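N cuts the ranked attribute list by minimising the squared score mass left out plus the squared fraction of attributes kept, i.e. the point on the cumulative-score curve closest to "all of the score with none of the attributes". A minimal Java sketch, assuming Score_1 ≥ … ≥ Score_d are already sorted in decreasing order (method name is hypothetical):

/** Returns the prefix length N that minimises (1 - prefixMass/totalMass)^2 + (n/d)^2. */
private int selectCutoff(double[] sortedScores) {
    int d = sortedScores.length;
    double total = 0.0;
    for (double s : sortedScores) total += s;

    int best = 1;
    double bestValue = Double.MAX_VALUE;
    double prefix = 0.0;
    for (int n = 1; n <= d; n++) {
        prefix += sortedScores[n - 1];              // sum_{k=1}^{n} Score_k
        double lostMass = 1.0 - prefix / total;     // score mass not covered by the prefix
        double kept = (double) n / d;               // fraction of attributes kept
        double value = lostMass * lostMass + kept * kept;
        if (value < bestValue) { bestValue = value; best = n; }
    }
    return best;
}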
... a year ago
Computing the RR on a single machine took 9.2x longer than with MapReduce
!!!
184 min vs. 20 min
10 nodes × (4× Intel Xeon E3-1220 @ 3.10 GHz, 32 GB RAM, 1 TB HDD)
Leukemia: 22,227 attr x 190 obj
5,000 subtables, 150 attributes each
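A minimal sketch of drawing such random attribute subtables, assuming each subtable's attributes are sampled uniformly without replacement (the method name and sampling scheme are assumptions, not taken from the source):

/** Draws 'count' random attribute subsets of 'size' attributes each. */
private List<BitSet> drawSubTables(int numAttributes, int count, int size, Random rnd) {
    List<Integer> all = new ArrayList<Integer>(numAttributes);
    for (int i = 0; i < numAttributes; i++) all.add(i);

    List<BitSet> subTables = new ArrayList<BitSet>(count);
    for (int k = 0; k < count; k++) {
        Collections.shuffle(all, rnd);               // random permutation of attribute ids
        BitSet sub = new BitSet(numAttributes);
        for (int j = 0; j < size; j++) sub.set(all.get(j));
        subTables.add(sub);
    }
    return subTables;
}
// e.g. drawSubTables(22227, 5000, 150, new Random(1)) for the Leukemia setting above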
private Collection<BitSet> getAllCountedReducts(Vector<BitSet> discern_attrs) {
    // Reducts found so far; the HashSet removes duplicates.
    HashSet<BitSet> reducts = new HashSet<BitSet>();
    // Per-attribute occurrence counts over the discernibility sets.
    int[] startCount = getCountingAttributes(discern_attrs);
    // Attribute indices grouped so that the maximal counts come first.
    LinkedList<Integer[]> attributesList =
        FindFirstAllMaximums(startCount);
    (...)
    return reducts;
}
Spark | MapRed | Standalone |
2 min 24 sec | 15 min 45 sec | 8 min 3 sec |
     | Ref     | Reducts      | Trees        |
Acc  | 81.58%  | 82.63%       | 86.31%       |
Attr | 22,277  | 855          | 167          |
Time | 148 sec | 2 min 24 sec | 20 min 5 sec |
1GB = 512 attr x 1,048,576 obj
Generating it took 48 sec on Spark
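A minimal sketch of how a table of this size (512 attributes × 1,048,576 objects) could be produced with Spark's Java API; the random binary values, partition count, and output path are assumptions, not taken from the source:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GenerateTable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("generate-table");
        JavaSparkContext sc = new JavaSparkContext(conf);

        int numAttributes = 512;
        int numObjects = 1_048_576;

        // One element per object; Spark distributes row generation across the cluster.
        List<Integer> objectIds = new ArrayList<>(numObjects);
        for (int i = 0; i < numObjects; i++) objectIds.add(i);

        sc.parallelize(objectIds, 100)
          .map(id -> {
              Random rnd = new Random(id);        // seed by object id for reproducibility
              StringBuilder row = new StringBuilder();
              for (int a = 0; a < numAttributes; a++) {
                  if (a > 0) row.append(',');
                  row.append(rnd.nextInt(2));     // assumed random binary attribute value
              }
              return row.toString();
          })
          .saveAsTextFile("hdfs:///tmp/synthetic-table");  // hypothetical output path

        sc.stop();
    }
}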