Class TextFileDataSet

  • All Implemented Interfaces:
    java.lang.Iterable<java.util.List<Attribute>>, DataSet

    public class TextFileDataSet
    extends java.lang.Object
    implements DataSet
    • Constructor Summary

      Constructors 
      Constructor Description
      TextFileDataSet​(java.io.File dataSource, int output)  
    • Method Summary

      Modifier and Type Method Description
      void addRecords​(java.util.Collection<? extends java.util.List<Attribute>> list)  
      Attribute allTheSame()
      Returns the most common output attribute if the rest of the attributes are exactly the same over the whole data set.
      boolean allTheSameOutput()  
      void close()
      Closes the underlying data source if possible.
      java.util.HashMap<Attribute,java.lang.Integer> getFrequencies​(int lo, int hi, int fieldIndex)
      Gets a map between the different values of the attribute at the fieldIndex and their respective frequencies.
      int getItemsCount()  
      MetaData getMetaData()  
      int getOutputIndex()  
      DataSet getSubset​(int lo, int hi)
      Create a slice of this data set as a new data set from [lo, hi).
      java.util.Iterator<java.util.List<Attribute>> iterator()  
      java.lang.Iterable<java.util.List<Attribute>> sortOver​(int fieldIndex)
      Sort the data set over the fieldIndex as primary key and the output index to break any ties.
      java.lang.Iterable<java.util.List<Attribute>> sortOver​(int lo, int hi, int fieldIndex)
      Sort the data set over the fieldIndex as primary key and the output index to break any ties, and limit the elements to [lo, hi).
      DataSet[] splitKeepingRelation​(double proportion)
      Splits this data set into two new data sets, preserving the proportion between the output classes.
      java.lang.String toString()  
      • Methods inherited from interface java.lang.Iterable

        forEach, spliterator
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • TextFileDataSet

        public TextFileDataSet​(java.io.File dataSource,
                               int output)
    • Method Detail

      • getSubset

        public DataSet getSubset​(int lo,
                                 int hi)
        Description copied from interface: DataSet
        Create a slice of this data set as a new data set from [lo, hi).
        Specified by:
        getSubset in interface DataSet
        Parameters:
        lo - The lower bound (inclusive) of the data set to be returned
        hi - The upper bound (exclusive) of the data set to be returned.
        Returns:
        A new data set that is a copy of this one over [lo, hi).
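
The half-open range [lo, hi) can be illustrated with a minimal sketch over a plain java.util.List. SliceSketch and slice are hypothetical names, not part of this class, and the sketch only assumes that the returned slice is an independent copy of the selected records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the documented API.
final class SliceSketch {
    // Returns an independent copy of records in [lo, hi): lo inclusive, hi exclusive.
    static <T> List<T> slice(List<T> records, int lo, int hi) {
        if (lo < 0 || hi > records.size() || lo > hi) {
            throw new IndexOutOfBoundsException("invalid range [" + lo + ", " + hi + ")");
        }
        // subList is only a view, so wrap it in a new list to obtain a copy.
        return new ArrayList<>(records.subList(lo, hi));
    }
}
```

With this convention a slice holds hi - lo records, and slice(records, n, n) is empty.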
      • getOutputIndex

        public int getOutputIndex()
        Specified by:
        getOutputIndex in interface DataSet
        Returns:
        The index of the field used as output.
      • getItemsCount

        public int getItemsCount()
        Specified by:
        getItemsCount in interface DataSet
        Returns:
        The total count of elements in this data set.
      • getMetaData

        public MetaData getMetaData()
        Specified by:
        getMetaData in interface DataSet
        Returns:
        A MetaData object containing information about the attributes on the data set.
      • sortOver

        public java.lang.Iterable<java.util.List<Attribute>> sortOver​(int fieldIndex)
        Description copied from interface: DataSet
        Sort the data set over the fieldIndex as primary key and the output index to break any ties. This index is remembered for future internal references.
        Specified by:
        sortOver in interface DataSet
        Parameters:
        fieldIndex - The field to sort over
        Returns:
        An iterable representation of this data set sorted over the field index.
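
The ordering described above (primary key at fieldIndex, ties broken by the output index) can be sketched with a chained Comparator. This is a minimal sketch, not this class's implementation: SortSketch is a hypothetical name, and String stands in for Attribute, whose comparison rules are not documented here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; String stands in for Attribute.
final class SortSketch {
    // Returns a sorted copy: primary key is the value at fieldIndex,
    // ties are broken by the value at outputIndex.
    static List<List<String>> sortOver(List<List<String>> records, int fieldIndex, int outputIndex) {
        Comparator<List<String>> byField = Comparator.comparing(r -> r.get(fieldIndex));
        Comparator<List<String>> order = byField.thenComparing(r -> r.get(outputIndex));
        List<List<String>> sorted = new ArrayList<>(records);
        sorted.sort(order);
        return sorted;
    }
}
```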
      • sortOver

        public java.lang.Iterable<java.util.List<Attribute>> sortOver​(int lo,
                                                                      int hi,
                                                                      int fieldIndex)
        Description copied from interface: DataSet
        Sort the data set over the fieldIndex as primary key and the output index to break any ties, and limit the elements to [lo, hi). This index is remembered for future internal references.
        Specified by:
        sortOver in interface DataSet
        Parameters:
        lo - The lower bound (inclusive) of the data set to be returned
        hi - The upper bound (exclusive) of the data set to be returned.
        fieldIndex - The field to sort over
        Returns:
        An iterable representation of this data set sorted over the field index from [lo, hi).
      • splitKeepingRelation

        public DataSet[] splitKeepingRelation​(double proportion)
        Description copied from interface: DataSet
        Splits this data set into two new data sets, preserving the proportion between the output classes. The first data set contains the given proportion of each class of the original data set. For instance, if the data set has 100 elements distributed between 2 classes in a 60/40 split, the first set will contain (60*proportion + 40*proportion) elements and the second data set will contain the rest. This method is useful for generating training and test sets from one large data set.
        Specified by:
        splitKeepingRelation in interface DataSet
        Parameters:
        proportion - the proportion of elements of each class to keep in the first data set.
        Returns:
        an array of 2 positions holding the two data sets described above.
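
The class-preserving split described above can be sketched as follows. This is a minimal sketch under stated assumptions, not this class's implementation: StratifiedSplitSketch is a hypothetical name, String stands in for Attribute, proportion is taken as a fraction in [0, 1], and the per-class count is rounded with Math.round (the real rounding rule is not documented here).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; String stands in for Attribute.
final class StratifiedSplitSketch {
    // Returns two lists: the first holds about `proportion` of each output class,
    // the second holds the remaining records.
    static List<List<List<String>>> split(List<List<String>> records, int outputIndex, double proportion) {
        if (proportion < 0.0 || proportion > 1.0) {
            throw new IllegalArgumentException("proportion must be in [0, 1]");
        }
        // Group records by their output class, keeping first-seen class order.
        Map<String, List<List<String>>> byClass = new LinkedHashMap<>();
        for (List<String> record : records) {
            byClass.computeIfAbsent(record.get(outputIndex), k -> new ArrayList<>()).add(record);
        }
        List<List<String>> first = new ArrayList<>();
        List<List<String>> second = new ArrayList<>();
        for (List<List<String>> group : byClass.values()) {
            // Assumption: round to the nearest whole record per class.
            int keep = (int) Math.round(group.size() * proportion);
            first.addAll(group.subList(0, keep));
            second.addAll(group.subList(keep, group.size()));
        }
        return List.of(first, second);
    }
}
```

For the 60/40 example with proportion 0.5, the first list holds 30 + 20 = 50 records and the second holds the other 50, so both keep the 60/40 class ratio. A real training/test split would usually shuffle each class first; the sketch skips that step to stay deterministic.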
      • addRecords

        public final void addRecords​(java.util.Collection<? extends java.util.List<Attribute>> list)
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object
      • iterator

        public java.util.Iterator<java.util.List<Attribute>> iterator()
        Specified by:
        iterator in interface java.lang.Iterable<java.util.List<Attribute>>
      • allTheSameOutput

        public boolean allTheSameOutput()
        Specified by:
        allTheSameOutput in interface DataSet
        Returns:
        True if all records have the same class (the value at the output index).
      • allTheSame

        public Attribute allTheSame()
        Description copied from interface: DataSet
        Returns the most common output attribute if the rest of the attributes are exactly the same over the whole data set. If even a single record differs from the rest in a single attribute, this method returns null.
        Specified by:
        allTheSame in interface DataSet
        Returns:
        The most common output attribute, or null if any record differs from the rest.
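
The check described above can be sketched over plain lists. This is a minimal sketch, not this class's implementation: AllTheSameSketch is a hypothetical name, String stands in for Attribute, all records are assumed to have the same length, and returning null for an empty data set is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; String stands in for Attribute.
final class AllTheSameSketch {
    // Returns the most common output value if every non-output attribute is
    // identical across all records; otherwise returns null.
    static String allTheSame(List<List<String>> records, int outputIndex) {
        if (records.isEmpty()) {
            return null; // assumption: nothing to report for an empty data set
        }
        List<String> reference = records.get(0);
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> record : records) {
            for (int i = 0; i < reference.size(); i++) {
                if (i != outputIndex && !reference.get(i).equals(record.get(i))) {
                    return null; // one differing attribute is enough
                }
            }
            counts.merge(record.get(outputIndex), 1, Integer::sum);
        }
        // Pick the output value with the highest count; ties are resolved arbitrarily.
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```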
      • getFrequencies

        public java.util.HashMap<Attribute,java.lang.Integer> getFrequencies​(int lo,
                                                                             int hi,
                                                                             int fieldIndex)
        Description copied from interface: DataSet
        Gets a map between the different values of the attribute at the fieldIndex and their respective frequencies. It limits the count space to [lo, hi).
        Specified by:
        getFrequencies in interface DataSet
        Parameters:
        lo - The lower bound (inclusive) of the range to count.
        hi - The upper bound (exclusive) of the range to count.
        fieldIndex - The field to count
        Returns:
        a map with the different values and their respective frequencies.
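
Counting value frequencies over [lo, hi) can be sketched with Map.merge. This is a minimal sketch, not this class's implementation: FrequencySketch is a hypothetical name, String stands in for Attribute, and the range is assumed to index records in their stored order.

```java
import java.util.HashMap;
import java.util.List;

// Hypothetical helper for illustration only; String stands in for Attribute.
final class FrequencySketch {
    // Counts how often each value at fieldIndex occurs among records[lo, hi).
    static HashMap<String, Integer> frequencies(List<List<String>> records, int lo, int hi, int fieldIndex) {
        HashMap<String, Integer> counts = new HashMap<>();
        for (int i = lo; i < hi; i++) {
            // merge inserts 1 for a new value or adds 1 to the existing count.
            counts.merge(records.get(i).get(fieldIndex), 1, Integer::sum);
        }
        return counts;
    }
}
```

The counts in the returned map always sum to hi - lo.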
      • close

        public void close()
        Description copied from interface: DataSet
        Closes the underlying data source if possible.
        Specified by:
        close in interface DataSet