String utilities
Utilities for working with strings
Encoding
encode_sequence(sequence, encoding_map, …) | Extract the amino acid properties of the given sequence
encode_all_sequences(sequences, …) | Extract the amino acid feature vectors for each peptide sequence
Length manipulation
pad_sequence(seq, max_seq_len, pad_value, align) | Pad seq to max_seq_len with pad_value, based on the align strategy
pad_trim_sequences(seq_vec, pad_value, …) | Pad and/or trim a list of sequences to a common length
trim_sequence(seq, maxlen, align) | Trim seq to at most maxlen characters using the align strategy
Other operations
simple_fill(text, width) | Split text into equal-sized chunks of length width
split(s, delimiters, maxsplit) | Split s on any of the given delimiters
Human-readable data type helpers
bytes2human(n, format) | Convert n bytes to a human-readable format
human2bytes(s) | Convert a human-readable byte string to an integer
try_parse_float(s) | Convert s to a float, if possible
try_parse_int(s) | Convert s to an integer, if possible
str2bool(s) | Convert s to a boolean value, if possible
Definitions
pyllars.string_utils.bytes2human(n: int, format: str = '%(value)i%(symbol)s') → str
Convert n bytes to a human-readable format
This code is adapted from: http://goo.gl/zeJZl
Parameters: - n (int) – The number of bytes
- format (string) – The format string
Returns: human_str – A human-readable version of the number of bytes
Return type: string
Examples
>>> bytes2human(10000)
'9K'
>>> bytes2human(100001221)
'95M'
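The conversion can be sketched in a few lines. The block below is a stand-alone approximation, not the pyllars source: the name bytes2human_sketch and the symbol table are assumptions, chosen so that the sketch reproduces the two documented examples with binary (1024-based) units.

```python
def bytes2human_sketch(n: int, fmt: str = "%(value)i%(symbol)s") -> str:
    """Illustrative only; the symbol table is an assumption."""
    symbols = ("B", "K", "M", "G", "T", "P", "E", "Z", "Y")
    # Each symbol is worth 1024 times the previous one: B=1, K=2**10, M=2**20, ...
    factors = {symbol: 1 << (10 * i) for i, symbol in enumerate(symbols)}

    # Try the largest unit first and stop at the first one that fits.
    for symbol in reversed(symbols[1:]):
        if n >= factors[symbol]:
            value = n / factors[symbol]
            # "%i" truncates the float, so 9.77 is rendered as "9".
            return fmt % {"value": value, "symbol": symbol}

    # Fewer than 1024 bytes: report plain bytes.
    return fmt % {"value": n, "symbol": symbols[0]}


print(bytes2human_sketch(10000))      # 9K
print(bytes2human_sketch(100001221))  # 95M
```

Because the default format uses %i, values are truncated rather than rounded; a format such as '%(value).1f%(symbol)s' would keep one decimal place.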
pyllars.string_utils.encode_all_sequences(sequences: Iterable[str], encoding_map: Mapping[str, numpy.ndarray], maxlen: Optional[int] = None, align: str = 'start', pad_value: str = 'J', same_length: bool = False, flatten: bool = False, return_as_numpy: bool = True, swap_axes: bool = False, progress_bar: bool = True) → Union[numpy.ndarray, List]
Extract the amino acid feature vectors for each peptide sequence
See get_peptide_aa_features for more details.
Parameters: - sequences (typing.Iterable[str]) – The sequences
- encoding_map (typing.Mapping[str, numpy.ndarray]) – The features for each character
- maxlen (typing.Optional[int]) –
- align (str) –
- pad_value (str) –
- same_length (bool) –
- flatten (bool) – Whether to (attempt to) convert the features of each peptide into a single long vector (True) or leave as a (presumably) 2d position-feature vector.
- return_as_numpy (bool) – Whether to return as a 2d or 3d numpy array (True) or a list containing 1d or 2d numpy arrays. (The dimensionality depends upon flatten.)
- swap_axes (bool) – If the values are returned as a numpy tensor, swap axes 1 and 2. N.B. This flag is only compatible with return_as_numpy=True and flatten=False.
- progress_bar (bool) – Whether to show a progress bar for collecting the features.
Returns: all_encoded_peptides – The resulting features. See the flatten and return_as_numpy parameters for the expected output.
Return type: typing.Union[numpy.ndarray, typing.List]
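To make the output shapes concrete, here is a rough sketch of how flatten and swap_axes might affect the result. It is not the pyllars implementation: encode_all_sketch and toy_map are hypothetical names, padding is done at the end with str.ljust (an assumed reading of align='start'), and the remaining parameters are ignored. Note that the encoding map needs an entry for the pad character.

```python
import numpy as np


def encode_all_sketch(sequences, encoding_map, pad_value="J",
                      flatten=False, swap_axes=False):
    """Illustrative only: pad to a common length, then look up features."""
    maxlen = max(len(seq) for seq in sequences)
    # Assumption: pad at the end so that all sequences share one length.
    padded = [seq.ljust(maxlen, pad_value) for seq in sequences]

    # Shape: (number of sequences, positions, features per character).
    encoded = np.array([[encoding_map[ch] for ch in seq] for seq in padded])

    if flatten:
        # One long feature vector per sequence.
        return encoded.reshape(len(padded), -1)
    if swap_axes:
        # Shape becomes (number of sequences, features, positions).
        encoded = np.swapaxes(encoded, 1, 2)
    return encoded


# Hypothetical two-feature encoding; "J" is the pad character.
toy_map = {"A": np.array([1.0, 0.0]), "C": np.array([0.0, 1.0]), "J": np.zeros(2)}

print(encode_all_sketch(["ACA", "C"], toy_map).shape)                  # (2, 3, 2)
print(encode_all_sketch(["ACA", "C"], toy_map, flatten=True).shape)    # (2, 6)
print(encode_all_sketch(["ACA", "C"], toy_map, swap_axes=True).shape)  # (2, 2, 3)
```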
pyllars.string_utils.encode_sequence(sequence: str, encoding_map: Mapping[str, numpy.ndarray], flatten: bool = False) → numpy.ndarray
Extract the amino acid properties of the given sequence
This function is designed with the idea of mapping from a sequence to numeric features (such as chemical properties or BLOSUM features for amino acid sequences). It may fail if other features are included in encoding_map.
Parameters: - sequence (str) – The sequence
- encoding_map (typing.Mapping[str, numpy.ndarray]) – A mapping from each character to a set of features. Presumably, the features are numpy-like arrays, though they need not be.
- flatten (bool) – Whether to flatten the encoded sequence into a single, 1d array or leave them as-is.
Returns: encoded_sequence – A 1d or 2d np.array, depending on flatten. By default (flatten=False), this is a 1d array of objects, in which the outer dimension indexes the position in the epitope. If flatten is True, then the function attempts to reshape the features into a single long feature vector. This will likely fail if the encoding_map values are not numpy-like arrays.
Return type: numpy.ndarray
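The character-to-feature lookup can be illustrated as follows. This is a sketch under assumptions, not the library code: encode_sequence_sketch and toy_map are hypothetical, and because every toy feature vector has the same length, NumPy stacks them into a 2d numeric array rather than a 1d array of objects, so the unflattened layout may differ from what the library returns.

```python
import numpy as np


def encode_sequence_sketch(sequence, encoding_map, flatten=False):
    """Illustrative only: look up one feature vector per character."""
    encoded = np.array([encoding_map[ch] for ch in sequence])
    if flatten:
        # Concatenate the per-position features into one long vector.
        encoded = encoded.reshape(-1)
    return encoded


# Hypothetical two-feature encoding for a two-letter alphabet.
toy_map = {"A": np.array([1.0, 0.0]), "C": np.array([0.0, 1.0])}

per_position = encode_sequence_sketch("ACA", toy_map)
flat = encode_sequence_sketch("ACA", toy_map, flatten=True)

print(per_position.shape)  # (3, 2)
print(flat.tolist())       # [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

A character that is missing from the map raises a KeyError in this sketch.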
pyllars.string_utils.human2bytes(s: str) → int
Convert a human-readable byte string to an integer
This code is adapted from: http://goo.gl/zeJZl
Parameters: s (string) – The human-readable byte string
Returns: num_bytes – The number of bytes
Return type: int
Examples
>>> human2bytes('1M')
1048576
>>> human2bytes('1G')
1073741824
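The reverse conversion can be sketched the same way. The block below is an approximation rather than the pyllars source: human2bytes_sketch and SYMBOLS are assumed names, and the accepted input is limited to a number followed by a single 1024-based unit symbol, which is enough to reproduce the documented examples.

```python
SYMBOLS = ("B", "K", "M", "G", "T", "P", "E", "Z", "Y")  # assumed symbol table


def human2bytes_sketch(s: str) -> int:
    """Illustrative only: parse strings such as '1M' or '10K'."""
    text = s.strip()

    # Split the leading numeric part from the trailing unit symbol.
    split_at = 0
    while split_at < len(text) and (text[split_at].isdigit() or text[split_at] == "."):
        split_at += 1
    number, symbol = text[:split_at], text[split_at:].strip().upper()

    if not number or symbol not in SYMBOLS:
        raise ValueError(f"cannot interpret {s!r} as a byte count")

    # Each symbol is worth 1024 times the previous one.
    return int(float(number) * (1 << (10 * SYMBOLS.index(symbol))))


print(human2bytes_sketch("1M"))  # 1048576
print(human2bytes_sketch("1G"))  # 1073741824
```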
pyllars.string_utils.pad_sequence(seq: str, max_seq_len: int, pad_value: str = 'J', align: str = 'end') → str
Pad seq to max_seq_len with pad_value, based on the align strategy
If seq is already of length max_seq_len or longer it will not be changed.
Parameters: - seq (str)
- max_seq_len (int)
- pad_value (str)
- align (str)
Returns: padded_seq – The padded string. If seq was already max_seq_len characters or longer, it is returned unchanged, so padded_seq may be longer than max_seq_len.
Return type: str
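A minimal sketch of the padding behavior follows. It rests on an assumption that this page does not spell out: align names the end of the result at which the original sequence sits, so align='end' pads on the left and align='start' pads on the right. The name pad_sequence_sketch and the handling of an odd amount of padding for 'center' are likewise illustrative choices.

```python
def pad_sequence_sketch(seq: str, max_seq_len: int,
                        pad_value: str = "J", align: str = "end") -> str:
    """Illustrative only; the meaning of `align` is an assumption."""
    missing = max_seq_len - len(seq)
    if missing <= 0:
        # Already long enough: returned unchanged, possibly longer than max_seq_len.
        return seq

    if align == "end":
        n_left, n_right = missing, 0
    elif align == "start":
        n_left, n_right = 0, missing
    elif align == "center":
        # Assumption: any odd leftover character goes on the right.
        n_left = missing // 2
        n_right = missing - n_left
    else:
        raise ValueError(f"unknown align strategy: {align!r}")

    return pad_value * n_left + seq + pad_value * n_right


print(pad_sequence_sketch("ACD", 5))                  # JJACD
print(pad_sequence_sketch("ACD", 5, align="start"))   # ACDJJ
print(pad_sequence_sketch("ACD", 5, align="center"))  # JACDJ
print(pad_sequence_sketch("ACDEFG", 4))               # ACDEFG
```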
pyllars.string_utils.pad_trim_sequences(seq_vec: Sequence[str], pad_value: str = 'J', maxlen: Optional[int] = None, align: str = 'start') → List[str]
Pad and/or trim a list of sequences to a common length
The procedure is as follows:
- Pad the sequence with pad_value
- Trim the sequence
Parameters: - seq_vec (typing.Sequence[str]) – List of sequences that can have various lengths
- pad_value (str) – Neutral element with which to pad the sequence. This should be a single character.
- maxlen (typing.Optional[int]) – Length of padded/trimmed sequences. If None, maxlen is set to the longest sequence length.
- align (str) – To which end to align the sequences when trimming/padding. Valid options are start, end, and center.
Returns: padded_sequences – The padded and/or trimmed sequences
Return type: typing.List[str]
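The pad-then-trim procedure can be sketched with the built-in string methods. This is an illustration under stated assumptions, not the library code: pad_trim_sketch is a hypothetical name, only the 'start' and 'end' strategies are handled, and align is assumed to name the end at which the original characters are kept.

```python
from typing import List, Optional, Sequence


def pad_trim_sketch(seq_vec: Sequence[str], pad_value: str = "J",
                    maxlen: Optional[int] = None, align: str = "start") -> List[str]:
    """Illustrative only: pad every sequence, then trim it, to `maxlen`."""
    if align not in ("start", "end"):
        raise ValueError("this sketch only handles 'start' and 'end'")
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in seq_vec)

    result = []
    for seq in seq_vec:
        if align == "start":
            padded = seq.ljust(maxlen, pad_value)  # step 1: pad on the right
            result.append(padded[:maxlen])         # step 2: keep the start
        else:
            padded = seq.rjust(maxlen, pad_value)  # step 1: pad on the left
            result.append(padded[len(padded) - maxlen:])  # step 2: keep the end
    return result


print(pad_trim_sketch(["ACDEF", "AC"], maxlen=4))               # ['ACDE', 'ACJJ']
print(pad_trim_sketch(["ACDEF", "AC"], maxlen=4, align="end"))  # ['CDEF', 'JJAC']
print(pad_trim_sketch(["ACDEF", "AC"]))                         # ['ACDEF', 'ACJJJ']
```

str.ljust and str.rjust accept only a single fill character, which matches the requirement that pad_value be one character.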
pyllars.string_utils.simple_fill(text: str, width: int = 60) → str
Split text into equal-sized chunks of length width
This is a simplified version of textwrap.fill.
The code is adapted from: http://stackoverflow.com/questions/11781261
Parameters: - text (string) – The text to split
- width (int) – The (exact) length of each line after splitting
Returns: split_str – A single string with lines of length width (except possibly the last line)
Return type: string
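The fixed-width chunking described here can be written as a single slice-and-join expression. The sketch below is illustrative (simple_fill_sketch is an assumed name), and the comparison with textwrap.fill highlights the main difference: the chunks ignore word boundaries.

```python
import textwrap


def simple_fill_sketch(text: str, width: int = 60) -> str:
    """Illustrative only: cut `text` every `width` characters."""
    chunks = (text[i:i + width] for i in range(0, len(text), width))
    return "\n".join(chunks)


print(simple_fill_sketch("ABCDEFGH", 3))
# ABC
# DEF
# GH

# textwrap.fill breaks at whitespace instead of at fixed offsets.
print(textwrap.fill("one two three", 7))
# one two
# three
```

Fixed-width chunks suit text without natural break points, such as long biological sequences.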
pyllars.string_utils.split(s: str, delimiters: Iterable[str], maxsplit: int = 0) → List[str]
Split s on any of the given delimiters
This code is adapted from: http://stackoverflow.com/questions/4998629/
Parameters: - s (string) – The string to split
- delimiters (list of strings) – The strings to use as delimiters
- maxsplit (int) – The maximum number of splits (or 0 for no limit)
Returns: splits – The split strings
Return type: list of strings
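One common way to split on several delimiters is to join the escaped delimiters into a regular expression and pass it to re.split. The sketch below takes that approach; split_sketch is an assumed name and the code is not necessarily what the library does.

```python
import re
from typing import Iterable, List


def split_sketch(s: str, delimiters: Iterable[str], maxsplit: int = 0) -> List[str]:
    """Illustrative only: split `s` on any of `delimiters`."""
    # re.escape keeps characters such as "." or "|" from acting as regex operators.
    pattern = "|".join(re.escape(d) for d in delimiters)
    # As in re.split, maxsplit=0 means "no limit".
    return re.split(pattern, s, maxsplit=maxsplit)


print(split_sketch("a,b;c", [",", ";"]))              # ['a', 'b', 'c']
print(split_sketch("a,b;c", [",", ";"], maxsplit=1))  # ['a', 'b;c']
print(split_sketch("a.b|c", [".", "|"]))              # ['a', 'b', 'c']
```

If one delimiter is a prefix of another, the alternation matches whichever comes first in the pattern, so listing longer delimiters first avoids surprises.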
pyllars.string_utils.str2bool(s: str) → bool
Convert s to a boolean value, if possible
Parameters: s (string) – A string which may represent a boolean value
Returns: bool_s – True if s is in _TRUE_STRING, and False otherwise
Return type: boolean
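The membership test can be illustrated as below. The contents of _TRUE_STRING are not listed on this page, so the set used here is purely hypothetical, as are the name str2bool_sketch and the case and whitespace normalization.

```python
# Hypothetical stand-in for _TRUE_STRING; the library's actual values may differ.
TRUE_STRINGS_SKETCH = frozenset({"true", "t", "yes", "y", "1"})


def str2bool_sketch(s: str) -> bool:
    """Illustrative only: True when `s` is a recognized 'true' string."""
    # Assumption: comparison ignores case and surrounding whitespace.
    return s.strip().lower() in TRUE_STRINGS_SKETCH


print(str2bool_sketch("Yes"))    # True
print(str2bool_sketch("no"))     # False
print(str2bool_sketch("maybe"))  # False
```

With this design, any unrecognized string maps to False rather than raising an error, so callers that need to distinguish "false" from "unparseable" have to check the input separately.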
pyllars.string_utils.trim_sequence(seq: str, maxlen: int, align: str = 'end') → str
Trim seq to at most maxlen characters using the align strategy
Parameters: - seq (str)
- maxlen (int)
- align (str)
Returns: trimmed_seq – The trimmed string. If seq was already maxlen characters or shorter, it is returned unchanged, so trimmed_seq may be shorter than maxlen.
Return type: str
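Trimming can be sketched with plain slicing. As with padding, the meaning of align is assumed here rather than taken from this page: align='end' keeps the last maxlen characters, align='start' keeps the first, and 'center' drops characters from both sides. The name trim_sequence_sketch and the treatment of an odd surplus are illustrative choices.

```python
def trim_sequence_sketch(seq: str, maxlen: int, align: str = "end") -> str:
    """Illustrative only; the meaning of `align` is an assumption."""
    extra = len(seq) - maxlen
    if extra <= 0:
        # Already short enough: returned unchanged, possibly shorter than maxlen.
        return seq

    if align == "end":
        return seq[extra:]   # keep the last `maxlen` characters
    if align == "start":
        return seq[:maxlen]  # keep the first `maxlen` characters
    if align == "center":
        # Assumption: any odd leftover character is dropped from the right.
        n_left = extra // 2
        return seq[n_left:n_left + maxlen]
    raise ValueError(f"unknown align strategy: {align!r}")


print(trim_sequence_sketch("ACDEFG", 4))                  # DEFG
print(trim_sequence_sketch("ACDEFG", 4, align="start"))   # ACDE
print(trim_sequence_sketch("ACDEFG", 4, align="center"))  # CDEF
print(trim_sequence_sketch("AC", 4))                      # AC
```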